## More examples
### Load Balancing
The server example is mostly stateless: the client presents the full completion/chat thread in each API call. Since the prompt cache is the only local resource, it is easy to load balance a cluster of multiple server instances for concurrent service. Cluster nodes may be heterogeneous or homogeneous, though homogeneous (similarly spec'ed) nodes will deliver a more consistent user experience:
![LlamaServerCluster](https://github.com/jboero/llama.cpp/assets/7536012/1ec986f7-d409-449d-8b37-af672e509a9e)
Example Llama server cluster of 3 heterogeneous servers. Every node should serve the same model, or unexpected results will occur. Because OpenCL currently supports only a single device per process, a multi-GPU host can run one server instance per GPU, but this is recommended only when the entire model fits in each GPU's VRAM. A minimal launch sketch is shown below.
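
For illustration, here is a minimal sketch of launching one server instance per GPU on a single host. It assumes a CUDA build (so `CUDA_VISIBLE_DEVICES` selects the device); the binary name (`./llama-server`, `./server` in older builds), model path, and ports are placeholders to adjust for your installation.

```bash
# Hypothetical example: pin one server instance to each of two GPUs on one host.
# Assumes a CUDA build; adjust binary name, model path, and ports as needed.
CUDA_VISIBLE_DEVICES=0 ./llama-server -m models/model.gguf --host 0.0.0.0 --port 8081 &
CUDA_VISIBLE_DEVICES=1 ./llama-server -m models/model.gguf --host 0.0.0.0 --port 8082 &
```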
This behavior will change if the server is updated to handle more concurrent sessions per process. Parallel (`-np`) concurrency does not yet behave as you might expect (see https://github.com/ggerganov/llama.cpp/issues/4216). Still, it is possible to load balance multiple server processes in a mixed environment if you want to build a shared group installation. The load balancing policy is up to the user; one possible setup is sketched below.
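
As one possibility (not part of this repo), any plain HTTP reverse proxy can distribute requests across the instances. The following nginx sketch uses placeholder host addresses and ports; `least_conn` is just one reasonable policy for long-running completion requests.

```nginx
# Hypothetical nginx reverse proxy fronting three llama.cpp server instances.
upstream llama_cluster {
    least_conn;              # route to the instance with the fewest active requests
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://llama_cluster;
        proxy_read_timeout 600s;   # completions can stream for a long time
    }
}
```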
### Change system prompt at runtime
To use the server example to serve multiple chat-type clients while keeping the same system prompt, you can use the `system_prompt` option. It only needs to be sent once.
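
As a sketch, the option can be included once in a regular completion request. The field names inside the `system_prompt` object (`prompt`, `anti_prompt`, `assistant_name`) are assumptions based on the server's historical API; verify them against your server version's documentation.

```bash
# Hypothetical example: set a shared system prompt once for all connected clients.
# Field names inside system_prompt are assumptions; check your version's docs.
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "system_prompt": {
            "prompt": "You are a helpful assistant.",
            "anti_prompt": "User:",
            "assistant_name": "Assistant:"
        }
    }'
```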