From 72af9abf5de1e96036118486b19259a1e622bb57 Mon Sep 17 00:00:00 2001
From: JohnnyB
Date: Fri, 2 Feb 2024 16:47:23 +0000
Subject: [PATCH 1/3] Load Balancing Cluster Example

Added documentation for an LB cluster I've piloted.
---
 examples/server/README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/examples/server/README.md b/examples/server/README.md
index fe934dab1..e111face1 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -352,6 +352,13 @@ Notice that each `probs` is an array of length `n_probs`.
 
 ## More examples
 
+### Load Balancing
+The server example is mostly stateless, since the client presents the full completion/chat thread in each API call. Because the cache is the only local resource, it is easy to load balance a cluster or multiple instances of server for concurrent service. Cluster nodes may be heterogenius or homogenius, though homogenius nodes with similar specs will deliver a more consistent user experience:
+![LlamaServerCluster](https://github.com/jboero/llama.cpp/assets/7536012/1ec986f7-d409-449d-8b37-af672e509a9e)
+An example Llama server cluster of 3 heterogenius servers. Each server should use the same model, or unexpected results will occur. As the OpenCL backend currently only supports a single device per process, a host with multiple GPUs may run one server instance per GPU, but this is only recommended when the entire model fits in VRAM.
+
+Behavior will change if server is updated to support more concurrent sessions per process. Parallel `-np` concurrency does not yet behave as you might expect (see https://github.com/ggerganov/llama.cpp/issues/4216). Still, it is possible to load balance multiple server processes in a mixed environment if you want to build a shared group installation. The load balancing policy is up to the user.
+
 ### Change system prompt on runtime
 
 To use the server example to serve multiple chat-type clients while keeping the same system prompt, you can utilize the option `system_prompt` to achieve that. This only needs to be done once to establish it.

From 34432a39a83447e5e0fb54d55fac5c3bda32ae66 Mon Sep 17 00:00:00 2001
From: JohnnyB
Date: Mon, 5 Feb 2024 10:32:17 +0000
Subject: [PATCH 2/3] Fix spelling

I was hasty and made a typo/misspelling.
---
 examples/server/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index e111face1..e2213cda6 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -353,9 +353,9 @@ Notice that each `probs` is an array of length `n_probs`.
 
 ## More examples
 
 ### Load Balancing
-The server example is mostly stateless, since the client presents the full completion/chat thread in each API call. Because the cache is the only local resource, it is easy to load balance a cluster or multiple instances of server for concurrent service. Cluster nodes may be heterogenius or homogenius, though homogenius nodes with similar specs will deliver a more consistent user experience:
+The server example is mostly stateless, since the client presents the full completion/chat thread in each API call. Because the cache is the only local resource, it is easy to load balance a cluster or multiple instances of server for concurrent service. Cluster nodes may be heterogeneus or homogeneus, though homogeneus nodes with similar specs will deliver a more consistent user experience:
 ![LlamaServerCluster](https://github.com/jboero/llama.cpp/assets/7536012/1ec986f7-d409-449d-8b37-af672e509a9e)
-An example Llama server cluster of 3 heterogenius servers. Each server should use the same model, or unexpected results will occur. As the OpenCL backend currently only supports a single device per process, a host with multiple GPUs may run one server instance per GPU, but this is only recommended when the entire model fits in VRAM.
+An example Llama server cluster of 3 heterogeneus servers. Each server should use the same model, or unexpected results will occur. As the OpenCL backend currently only supports a single device per process, a host with multiple GPUs may run one server instance per GPU, but this is only recommended when the entire model fits in VRAM.
 
 Behavior will change if server is updated to support more concurrent sessions per process. Parallel `-np` concurrency does not yet behave as you might expect (see https://github.com/ggerganov/llama.cpp/issues/4216). Still, it is possible to load balance multiple server processes in a mixed environment if you want to build a shared group installation. The load balancing policy is up to the user.

From 853dbf17cd98bd8d9e21d1c12e3c845f86087157 Mon Sep 17 00:00:00 2001
From: JohnnyB
Date: Mon, 5 Feb 2024 10:35:00 +0000
Subject: [PATCH 3/3] Spelling

Third time is the charm.
---
 examples/server/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index e2213cda6..2fd5526de 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -353,9 +353,9 @@ Notice that each `probs` is an array of length `n_probs`.
 
 ## More examples
 
 ### Load Balancing
-The server example is mostly stateless, since the client presents the full completion/chat thread in each API call. Because the cache is the only local resource, it is easy to load balance a cluster or multiple instances of server for concurrent service. Cluster nodes may be heterogeneus or homogeneus, though homogeneus nodes with similar specs will deliver a more consistent user experience:
+The server example is mostly stateless, since the client presents the full completion/chat thread in each API call. Because the cache is the only local resource, it is easy to load balance a cluster or multiple instances of server for concurrent service. Cluster nodes may be heterogeneous or homogeneous, though homogeneous nodes with similar specs will deliver a more consistent user experience:
 ![LlamaServerCluster](https://github.com/jboero/llama.cpp/assets/7536012/1ec986f7-d409-449d-8b37-af672e509a9e)
-An example Llama server cluster of 3 heterogeneus servers. Each server should use the same model, or unexpected results will occur. As the OpenCL backend currently only supports a single device per process, a host with multiple GPUs may run one server instance per GPU, but this is only recommended when the entire model fits in VRAM.
+An example Llama server cluster of 3 heterogeneous servers. Each server should use the same model, or unexpected results will occur. As the OpenCL backend currently only supports a single device per process, a host with multiple GPUs may run one server instance per GPU, but this is only recommended when the entire model fits in VRAM.
 
 Behavior will change if server is updated to support more concurrent sessions per process. Parallel `-np` concurrency does not yet behave as you might expect (see https://github.com/ggerganov/llama.cpp/issues/4216). Still, it is possible to load balance multiple server processes in a mixed environment if you want to build a shared group installation. The load balancing policy is up to the user.
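For example, a host with two GPUs could run one instance of server per device on separate ports. This is only a sketch: the model path, ports, and device indices are illustrative, and the `GGML_OPENCL_DEVICE` variable assumes a CLBlast build.

```sh
# One server process per OpenCL device; paths and ports are examples only.
# GGML_OPENCL_DEVICE selects the target device in CLBlast builds.
GGML_OPENCL_DEVICE=0 ./server -m models/7B/ggml-model.gguf --host 0.0.0.0 --port 8081 -ngl 99 &
GGML_OPENCL_DEVICE=1 ./server -m models/7B/ggml-model.gguf --host 0.0.0.0 --port 8082 -ngl 99 &
```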
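Any HTTP load balancer can then front the instances. As a minimal sketch of one possible policy (assuming nginx is installed and the two instances above are listening on ports 8081 and 8082), a least-connections proxy on port 8080 might look like:

```sh
# Write a throwaway nginx config and launch it; illustrative only.
cat > /tmp/llama-lb.conf <<'EOF'
events {}
http {
  upstream llama {
    least_conn;                # route each request to the least-busy instance
    server 127.0.0.1:8081;
    server 127.0.0.1:8082;
  }
  server {
    listen 8080;
    location / {
      proxy_pass http://llama;
      proxy_buffering off;     # pass streamed tokens through immediately
      proxy_read_timeout 600s; # long generations can exceed default timeouts
    }
  }
}
EOF
nginx -c /tmp/llama-lb.conf
```

Clients then target port 8080 as if it were a single server instance.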