server : allow using LoRA adapters per-request (#10994)
* slot.can_batch_with * lora per request * test: force disable cache prompt * move can_batch_with check * fix condition * add slow test with llama 8b * update docs * move lora change task to queue * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * lora_base * remove redundant check --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit is contained in:
parent
a45433ba20
commit
0da5d86026
8 changed files with 235 additions and 59 deletions
|
@ -452,6 +452,8 @@ These words will not be included in the completion, so make sure to add them to
|
|||
|
||||
`response_fields`: A list of response fields, for example: `"response_fields": ["content", "generation_settings/n_predict"]`. If the specified field is missing, it will simply be omitted from the response without triggering an error. Note that fields with a slash will be unnested; for example, `generation_settings/n_predict` will move the field `n_predict` from the `generation_settings` object to the root of the response and give it a new name.
|
||||
|
||||
`lora`: A list of LoRA adapters to be applied to this specific request. Each object in the list must contain `id` and `scale` fields. For example: `[{"id": 0, "scale": 0.5}, {"id": 1, "scale": 1.1}]`. If a LoRA adapter is not specified in the list, its scale will default to `0.0`. Please note that requests with different LoRA configurations will not be batched together, which may result in performance degradation.
|
||||
|
||||
**Response format**
|
||||
|
||||
- Note: In streaming mode (`stream`), only `content`, `tokens` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.
|
||||
|
@ -945,6 +947,8 @@ This endpoint returns the loaded LoRA adapters. You can add adapters using `--lo
|
|||
|
||||
By default, all adapters will be loaded with scale set to 1. To initialize all adapters scale to 0, add `--lora-init-without-apply`
|
||||
|
||||
Please note that this value will be overwritten by the `lora` field for each request.
|
||||
|
||||
If an adapter is disabled, the scale will be set to 0.
|
||||
|
||||
**Response format**
|
||||
|
@ -966,6 +970,8 @@ If an adapter is disabled, the scale will be set to 0.
|
|||
|
||||
### POST `/lora-adapters`: Set list of LoRA adapters
|
||||
|
||||
This sets the global scale for LoRA adapters. Please note that this value will be overwritten by the `lora` field for each request.
|
||||
|
||||
To disable an adapter, either remove it from the list below, or set scale to 0.
|
||||
|
||||
**Request format**
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue