Third time is the charm.
## More examples
### Load Balancing
The server example is mostly stateless, since the client presents the full completion/chat thread in each API call. Because the cache is the only local resource, it is easy to load balance a cluster or multiple instances of server for concurrent use. Cluster nodes may be heterogeneous or homogeneous, though homogeneous, similarly spec'ed nodes will deliver a more consistent user experience:
![LlamaServerCluster](https://github.com/jboero/llama.cpp/assets/7536012/1ec986f7-d409-449d-8b37-af672e509a9e)
Example Llama server cluster of 3 heterogeneous servers. Each server should use the same model, or unexpected results will occur. As OpenCL currently supports only a single device per process, a single host may run one server instance per GPU, but this is recommended only when the entire model fits in VRAM.
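For example, a minimal sketch of launching one instance per GPU on a CLBlast (OpenCL) build, assuming two devices; the model path and ports here are placeholders:

```sh
# Pin each process to one OpenCL device and give it its own port.
# Model path, ports, and device indices are illustrative - adjust to taste.
GGML_OPENCL_DEVICE=0 ./server -m models/7B/ggml-model.gguf --port 8080 &
GGML_OPENCL_DEVICE=1 ./server -m models/7B/ggml-model.gguf --port 8081 &
```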
Behavior will change if server is updated to handle more concurrent sessions per process; parallel `-np` concurrency does not yet behave as you might expect (see https://github.com/ggerganov/llama.cpp/issues/4216). Still, it is possible to load balance multiple server processes in a mixed environment if you want to build a shared group installation. The load balancing policy is up to the user.
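One simple policy is round-robin through a reverse proxy. A minimal sketch, assuming nginx in front of the two hypothetical instances on ports 8080 and 8081 from above:

```nginx
# Round-robin across local server instances (addresses are placeholders).
upstream llama {
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}

server {
    listen 80;
    location / {
        proxy_pass http://llama;
        # Completions may stream for a long time; raise the read timeout
        # and disable buffering so tokens reach the client as they arrive.
        proxy_read_timeout 600s;
        proxy_buffering off;
    }
}
```

Any equivalent balancer (HAProxy, a cloud load balancer, etc.) works the same way, since each request carries its full context.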