llama : save and restore kv cache for single seq id (#6341)
* llama : save and restore kv cache for single seq id * remove trailing whitespace * respond error in case there's no space in the kv cache * add kv seq save restore to test case * add --slot-save-path arg to enable save restore and restrict save location * Returning 0 for some cases, instead of asserting. * cleanup error cases * rename sequence state functions * rename state get set functions * add previous function names back in with DEPRECATED notice * update doc * adjust endpoints to preferred style * fix restoring zero cell count * handle seq rm return value * unused param * keep in the size check * fix return types * add server test case for slot save restore * cleanup * add cake * cleanup style * add special * removing a whole sequence never fails * move sequence state file functionality from server to llama to match session api and add version tags * catch exceptions on save as well * error log messages * check types for stricter restore * update server doc * readme : update API changes date * strict filename validation * move include, reject bom as well * also reject empty filename * reject whitespace and trailing dot --------- Co-authored-by: Martin Evans <martindevans@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit is contained in:
parent
87fb5b4234
commit
beea6e1b16
11 changed files with 1086 additions and 31 deletions
|
@ -57,6 +57,7 @@ page cache before using this. See https://github.com/ggerganov/llama.cpp/issues/
|
|||
- `-n N, --n-predict N`: Set the maximum tokens to predict. Default: `-1`
|
||||
- `--slots-endpoint-disable`: To disable slots state monitoring endpoint. Slots state may contain user data, prompts included.
|
||||
- `--metrics`: enable prometheus `/metrics` compatible endpoint. Default: disabled
|
||||
- `--slot-save-path PATH`: Specifies the path where the state of slots (the prompt cache) can be stored. If not provided, the slot management endpoints will be disabled.
|
||||
- `--chat-template JINJA_TEMPLATE`: Set custom jinja chat template. This parameter accepts a string, not a file name. Default: template taken from model's metadata. We only support [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
|
||||
- `--log-disable`: Output logs to stdout only, not to `llama.log`. Default: enabled
|
||||
- `--log-format FORMAT`: Define the log output to FORMAT: json or text Default: `json`
|
||||
|
@ -517,6 +518,57 @@ Available metrics:
|
|||
- `llamacpp:requests_processing`: Number of requests processing.
|
||||
- `llamacpp:requests_deferred`: Number of requests deferred.
|
||||
|
||||
- **POST** `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
|
||||
|
||||
*Options:*
|
||||
|
||||
`filename`: Name of the file to save the slot's prompt cache. The file will be saved in the directory specified by the `--slot-save-path` server parameter.
|
||||
|
||||
### Result JSON
|
||||
|
||||
```json
|
||||
{
|
||||
"id_slot": 0,
|
||||
"filename": "slot_save_file.bin",
|
||||
"n_saved": 1745,
|
||||
"n_written": 14309796,
|
||||
"timings": {
|
||||
"save_ms": 49.865
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- **POST** `/slots/{id_slot}?action=restore`: Restore the prompt cache of the specified slot from a file.
|
||||
|
||||
*Options:*
|
||||
|
||||
`filename`: Name of the file to restore the slot's prompt cache from. The file should be located in the directory specified by the `--slot-save-path` server parameter.
|
||||
|
||||
### Result JSON
|
||||
|
||||
```json
|
||||
{
|
||||
"id_slot": 0,
|
||||
"filename": "slot_save_file.bin",
|
||||
"n_restored": 1745,
|
||||
"n_read": 14309796,
|
||||
"timings": {
|
||||
"restore_ms": 42.937
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- **POST** `/slots/{id_slot}?action=erase`: Erase the prompt cache of the specified slot.
|
||||
|
||||
### Result JSON
|
||||
|
||||
```json
|
||||
{
|
||||
"id_slot": 0,
|
||||
"n_erased": 1745
|
||||
}
|
||||
```
|
||||
|
||||
## More examples
|
||||
|
||||
### Change system prompt on runtime
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue