From 34432a39a83447e5e0fb54d55fac5c3bda32ae66 Mon Sep 17 00:00:00 2001
From: JohnnyB
Date: Mon, 5 Feb 2024 10:32:17 +0000
Subject: [PATCH] Fix spelling

I was hasty and made a typo/misspelling.
---
 examples/server/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index e111face1..e2213cda6 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -353,9 +353,9 @@ Notice that each `probs` is an array of length `n_probs`.
 ## More examples
 
 ### Load Balancing
-The server example is mostly stateless since the completion/chat thread is presented by the client in each API call. Since cache is the only local resource it becomes very easy to load balance a cluster or multiple instances of server for concurrent services. Cluster nodes may be heterogenius or homogenius, though homogenius similarly spec'ed nodes will deliver a more consistent user experience:
+The server example is mostly stateless since the completion/chat thread is presented by the client in each API call. Since cache is the only local resource it becomes very easy to load balance a cluster or multiple instances of server for concurrent services. Cluster nodes may be heterogeneous or homogeneous, though homogeneous similarly spec'ed nodes will deliver a more consistent user experience:
 ![LlamaServerCluster](https://github.com/jboero/llama.cpp/assets/7536012/1ec986f7-d409-449d-8b37-af672e509a9e)
-Example Llama server cluster of 3 heterogenius servers. Each server should use the same model or unexpected results will occur. As OpenCL currently only supports a single device, a single server may be used to support one server instance per GPU but this is only recommended when VRAM fits the entire model.
+Example Llama server cluster of 3 heterogeneous servers. Each server should use the same model or unexpected results will occur. As OpenCL currently only supports a single device, a single server may be used to support one server instance per GPU but this is only recommended when VRAM fits the entire model.
 Behavior will change if server is updated to perform more concurrent sessions per process. Parallel `-np` concurrency does not yet behave as you might think.
 https://github.com/ggerganov/llama.cpp/issues/4216
 Still it is possible to load balance multiple instances of server processes in a mixed environment if you want to build a shared group installation. Load balancing policy is up to the user.