Merge remote-tracking branch 'origin/master' into jinja
This commit is contained in:
commit
cb72cf1fc3
215 changed files with 23423 additions and 18704 deletions
|
@ -45,10 +45,7 @@ The project is under active development, and we are [looking for feedback and co
|
|||
| `-ub, --ubatch-size N` | physical maximum batch size (default: 512)<br/>(env: LLAMA_ARG_UBATCH) |
|
||||
| `--keep N` | number of tokens to keep from the initial prompt (default: 0, -1 = all) |
|
||||
| `-fa, --flash-attn` | enable Flash Attention (default: disabled)<br/>(env: LLAMA_ARG_FLASH_ATTN) |
|
||||
| `-p, --prompt PROMPT` | prompt to start generation with |
|
||||
| `--no-perf` | disable internal libllama performance timings (default: false)<br/>(env: LLAMA_ARG_NO_PERF) |
|
||||
| `-f, --file FNAME` | a file containing the prompt (default: none) |
|
||||
| `-bf, --binary-file FNAME` | binary file containing the prompt (default: none) |
|
||||
| `-e, --escape` | process escapes sequences (\n, \r, \t, \', \", \\) (default: true) |
|
||||
| `--no-escape` | do not process escape sequences |
|
||||
| `--rope-scaling {none,linear,yarn}` | RoPE frequency scaling method, defaults to linear unless specified by the model<br/>(env: LLAMA_ARG_ROPE_SCALING_TYPE) |
|
||||
|
@ -345,7 +342,7 @@ node index.js
|
|||
|
||||
> [!IMPORTANT]
|
||||
>
|
||||
> This endpoint is **not** OAI-compatible
|
||||
> This endpoint is **not** OAI-compatible. For OAI-compatible client, use `/v1/completions` instead.
|
||||
|
||||
*Options:*
|
||||
|
||||
|
@ -450,7 +447,9 @@ These words will not be included in the completion, so make sure to add them to
|
|||
|
||||
`post_sampling_probs`: Returns the probabilities of top `n_probs` tokens after applying sampling chain.
|
||||
|
||||
`response_fields`: A list of response fields, for example: `"response_fields": ["content", "generation_settings/n_predict"]`. If the specified field is missing, it will simply be omitted from the response without triggering an error.
|
||||
`response_fields`: A list of response fields, for example: `"response_fields": ["content", "generation_settings/n_predict"]`. If the specified field is missing, it will simply be omitted from the response without triggering an error. Note that fields with a slash will be unnested; for example, `generation_settings/n_predict` will move the field `n_predict` from the `generation_settings` object to the root of the response and give it a new name.
|
||||
|
||||
`lora`: A list of LoRA adapters to be applied to this specific request. Each object in the list must contain `id` and `scale` fields. For example: `[{"id": 0, "scale": 0.5}, {"id": 1, "scale": 1.1}]`. If a LoRA adapter is not specified in the list, its scale will default to `0.0`. Please note that requests with different LoRA configurations will not be batched together, which may result in performance degradation.
|
||||
|
||||
**Response format**
|
||||
|
||||
|
@ -523,6 +522,7 @@ These words will not be included in the completion, so make sure to add them to
|
|||
- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
|
||||
- `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)
|
||||
|
||||
|
||||
### POST `/tokenize`: Tokenize a given text
|
||||
|
||||
*Options:*
|
||||
|
@ -574,6 +574,10 @@ With input 'á' (utf8 hex: C3 A1) on tinyllama/stories260k
|
|||
|
||||
### POST `/embedding`: Generate embedding of a given text
|
||||
|
||||
> [!IMPORTANT]
|
||||
>
|
||||
> This endpoint is **not** OAI-compatible. For OAI-compatible client, use `/v1/embeddings` instead.
|
||||
|
||||
The same as [the embedding example](../embedding) does.
|
||||
|
||||
*Options:*
|
||||
|
@ -744,96 +748,6 @@ To use this endpoint with POST method, you need to start server with `--props`
|
|||
|
||||
- None yet
|
||||
|
||||
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
|
||||
|
||||
Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
|
||||
|
||||
*Options:*
|
||||
|
||||
See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported.
|
||||
|
||||
The `response_format` parameter supports both plain JSON output (e.g. `{"type": "json_object"}`) and schema-constrained JSON (e.g. `{"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}` or `{"type": "json_schema", "schema": {"properties": { "name": { "title": "Name", "type": "string" }, "date": { "title": "Date", "type": "string" }, "participants": { "items": {"type: "string" }, "title": "Participants", "type": "string" } } } }`), similar to other OpenAI-inspired API providers.
|
||||
|
||||
*Examples:*
|
||||
|
||||
You can use either Python `openai` library with appropriate checkpoints:
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.OpenAI(
|
||||
base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
|
||||
api_key = "sk-no-key-required"
|
||||
)
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
|
||||
{"role": "user", "content": "Write a limerick about python exceptions"}
|
||||
]
|
||||
)
|
||||
|
||||
print(completion.choices[0].message)
|
||||
```
|
||||
|
||||
... or raw HTTP requests:
|
||||
|
||||
```shell
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer no-key" \
|
||||
-d '{
|
||||
"model": "gpt-3.5-turbo",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Write a limerick about python exceptions"
|
||||
}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
### POST `/v1/embeddings`: OpenAI-compatible embeddings API
|
||||
|
||||
This endpoint requires that the model uses a pooling different than type `none`. The embeddings are normalized using the Eucledian norm.
|
||||
|
||||
*Options:*
|
||||
|
||||
See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
|
||||
|
||||
*Examples:*
|
||||
|
||||
- input as string
|
||||
|
||||
```shell
|
||||
curl http://localhost:8080/v1/embeddings \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer no-key" \
|
||||
-d '{
|
||||
"input": "hello",
|
||||
"model":"GPT-4",
|
||||
"encoding_format": "float"
|
||||
}'
|
||||
```
|
||||
|
||||
- `input` as string array
|
||||
|
||||
```shell
|
||||
curl http://localhost:8080/v1/embeddings \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer no-key" \
|
||||
-d '{
|
||||
"input": ["hello", "world"],
|
||||
"model":"GPT-4",
|
||||
"encoding_format": "float"
|
||||
}'
|
||||
```
|
||||
|
||||
### POST `/embeddings`: non-OpenAI-compatible embeddings API
|
||||
|
||||
This endpoint supports all poolings, including `--pooling none`. When the pooling is `none`, the responses will contain the *unnormalized* embeddings for *all* input tokens. For all other pooling types, only the pooled embeddings are returned, normalized using Euclidian norm.
|
||||
|
@ -1030,6 +944,8 @@ This endpoint returns the loaded LoRA adapters. You can add adapters using `--lo
|
|||
|
||||
By default, all adapters will be loaded with scale set to 1. To initialize all adapters scale to 0, add `--lora-init-without-apply`
|
||||
|
||||
Please note that this value will be overwritten by the `lora` field for each request.
|
||||
|
||||
If an adapter is disabled, the scale will be set to 0.
|
||||
|
||||
**Response format**
|
||||
|
@ -1051,6 +967,8 @@ If an adapter is disabled, the scale will be set to 0.
|
|||
|
||||
### POST `/lora-adapters`: Set list of LoRA adapters
|
||||
|
||||
This sets the global scale for LoRA adapters. Please note that this value will be overwritten by the `lora` field for each request.
|
||||
|
||||
To disable an adapter, either remove it from the list below, or set scale to 0.
|
||||
|
||||
**Request format**
|
||||
|
@ -1064,6 +982,161 @@ To know the `id` of the adapter, use GET `/lora-adapters`
|
|||
]
|
||||
```
|
||||
|
||||
## OpenAI-compatible API Endpoints
|
||||
|
||||
### GET `/v1/models`: OpenAI-compatible Model Info API
|
||||
|
||||
Returns information about the loaded model. See [OpenAI Models API documentation](https://platform.openai.com/docs/api-reference/models).
|
||||
|
||||
The returned list always has one single element.
|
||||
|
||||
By default, model `id` field is the path to model file, specified via `-m`. You can set a custom value for model `id` field via `--alias` argument. For example, `--alias gpt-4o-mini`.
|
||||
|
||||
Example:
|
||||
|
||||
```json
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
|
||||
"object": "model",
|
||||
"created": 1735142223,
|
||||
"owned_by": "llamacpp",
|
||||
"meta": {
|
||||
"vocab_type": 2,
|
||||
"n_vocab": 128256,
|
||||
"n_ctx_train": 131072,
|
||||
"n_embd": 4096,
|
||||
"n_params": 8030261312,
|
||||
"size": 4912898304
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### POST `/v1/completions`: OpenAI-compatible Completions API
|
||||
|
||||
Given an input `prompt`, it returns the predicted completion. Streaming mode is also supported. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps.
|
||||
|
||||
*Options:*
|
||||
|
||||
See [OpenAI Completions API documentation](https://platform.openai.com/docs/api-reference/completions).
|
||||
|
||||
llama.cpp `/completion`-specific features such as `mirostat` are supported.
|
||||
|
||||
*Examples:*
|
||||
|
||||
Example usage with `openai` python library:
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.OpenAI(
|
||||
base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
|
||||
api_key = "sk-no-key-required"
|
||||
)
|
||||
|
||||
completion = client.completions.create(
|
||||
model="davinci-002",
|
||||
prompt="I believe the meaning of life is",
|
||||
max_tokens=8
|
||||
)
|
||||
|
||||
print(completion.choices[0].text)
|
||||
```
|
||||
|
||||
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
|
||||
|
||||
Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
|
||||
|
||||
*Options:*
|
||||
|
||||
See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported.
|
||||
|
||||
The `response_format` parameter supports both plain JSON output (e.g. `{"type": "json_object"}`) and schema-constrained JSON (e.g. `{"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}` or `{"type": "json_schema", "schema": {"properties": { "name": { "title": "Name", "type": "string" }, "date": { "title": "Date", "type": "string" }, "participants": { "items": {"type: "string" }, "title": "Participants", "type": "string" } } } }`), similar to other OpenAI-inspired API providers.
|
||||
|
||||
*Examples:*
|
||||
|
||||
You can use either Python `openai` library with appropriate checkpoints:
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.OpenAI(
|
||||
base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
|
||||
api_key = "sk-no-key-required"
|
||||
)
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
|
||||
{"role": "user", "content": "Write a limerick about python exceptions"}
|
||||
]
|
||||
)
|
||||
|
||||
print(completion.choices[0].message)
|
||||
```
|
||||
|
||||
... or raw HTTP requests:
|
||||
|
||||
```shell
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer no-key" \
|
||||
-d '{
|
||||
"model": "gpt-3.5-turbo",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Write a limerick about python exceptions"
|
||||
}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
### POST `/v1/embeddings`: OpenAI-compatible embeddings API
|
||||
|
||||
This endpoint requires that the model uses a pooling different than type `none`. The embeddings are normalized using the Eucledian norm.
|
||||
|
||||
*Options:*
|
||||
|
||||
See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
|
||||
|
||||
*Examples:*
|
||||
|
||||
- input as string
|
||||
|
||||
```shell
|
||||
curl http://localhost:8080/v1/embeddings \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer no-key" \
|
||||
-d '{
|
||||
"input": "hello",
|
||||
"model":"GPT-4",
|
||||
"encoding_format": "float"
|
||||
}'
|
||||
```
|
||||
|
||||
- `input` as string array
|
||||
|
||||
```shell
|
||||
curl http://localhost:8080/v1/embeddings \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer no-key" \
|
||||
-d '{
|
||||
"input": ["hello", "world"],
|
||||
"model":"GPT-4",
|
||||
"encoding_format": "float"
|
||||
}'
|
||||
```
|
||||
|
||||
## More examples
|
||||
|
||||
### Interactive mode
|
||||
|
|
|
@ -6,10 +6,10 @@ Benchmark is using [k6](https://k6.io/).
|
|||
|
||||
SSE is not supported by default in k6, you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.
|
||||
|
||||
Example:
|
||||
Example (assuming golang >= 1.21 is installed):
|
||||
```shell
|
||||
go install go.k6.io/xk6/cmd/xk6@latest
|
||||
xk6 build master \
|
||||
$GOPATH/bin/xk6 build master \
|
||||
--with github.com/phymbert/xk6-sse
|
||||
```
|
||||
|
||||
|
@ -33,7 +33,7 @@ The server must answer OAI Chat completion requests on `http://localhost:8080/v1
|
|||
|
||||
Example:
|
||||
```shell
|
||||
server --host localhost --port 8080 \
|
||||
llama-server --host localhost --port 8080 \
|
||||
--model ggml-model-q4_0.gguf \
|
||||
--cont-batching \
|
||||
--metrics \
|
||||
|
|
|
@ -189,12 +189,12 @@ xychart-beta
|
|||
"pp": {
|
||||
"p95": round(data['metrics']["llamacpp_prompt_processing_second"]["p(95)"], 2),
|
||||
"avg": round(data['metrics']["llamacpp_prompt_processing_second"]["avg"], 2),
|
||||
"0": round(mean(prometheus_metrics['prompt_tokens_seconds']), 2),
|
||||
"0": round(mean(prometheus_metrics['prompt_tokens_seconds']), 2) if 'prompt_tokens_seconds' in prometheus_metrics else 0,
|
||||
},
|
||||
"tg": {
|
||||
"p95": round(data['metrics']["llamacpp_tokens_second"]["p(95)"], 2),
|
||||
"avg": round(data['metrics']["llamacpp_tokens_second"]["avg"], 2),
|
||||
"0": round(mean(prometheus_metrics['predicted_tokens_seconds']), 2),
|
||||
"0": round(mean(prometheus_metrics['predicted_tokens_seconds']), 2) if 'predicted_tokens_seconds' in prometheus_metrics else 0,
|
||||
},
|
||||
}
|
||||
with open("results.github.env", 'a') as github_env:
|
||||
|
@ -214,11 +214,14 @@ def start_benchmark(args):
|
|||
k6_args = [
|
||||
'run', args.scenario,
|
||||
'--no-color',
|
||||
'--no-connection-reuse',
|
||||
'--no-vu-connection-reuse',
|
||||
]
|
||||
k6_args.extend(['--duration', args.duration])
|
||||
k6_args.extend(['--iterations', args.n_prompts])
|
||||
k6_args.extend(['--vus', args.parallel])
|
||||
k6_args.extend(['--summary-export', 'k6-results.json'])
|
||||
k6_args.extend(['--out', 'csv=k6-results.csv'])
|
||||
args = f"SERVER_BENCH_N_PROMPTS={args.n_prompts} SERVER_BENCH_MAX_PROMPT_TOKENS={args.max_prompt_tokens} SERVER_BENCH_MAX_CONTEXT={args.max_tokens} "
|
||||
args = args + ' '.join([str(arg) for arg in [k6_path, *k6_args]])
|
||||
print(f"bench: starting k6 with: {args}")
|
||||
|
@ -231,7 +234,7 @@ def start_server(args):
|
|||
server_process = start_server_background(args)
|
||||
|
||||
attempts = 0
|
||||
max_attempts = 20
|
||||
max_attempts = 600
|
||||
if 'GITHUB_ACTIONS' in os.environ:
|
||||
max_attempts *= 2
|
||||
|
||||
|
@ -242,7 +245,15 @@ def start_server(args):
|
|||
print(f"bench: waiting for server to start ...")
|
||||
time.sleep(0.5)
|
||||
|
||||
print("bench: server started.")
|
||||
attempts = 0
|
||||
while not is_server_ready(args.host, args.port):
|
||||
attempts += 1
|
||||
if attempts > max_attempts:
|
||||
assert False, "server not ready"
|
||||
print(f"bench: waiting for server to be ready ...")
|
||||
time.sleep(0.5)
|
||||
|
||||
print("bench: server started and ready.")
|
||||
return server_process
|
||||
|
||||
|
||||
|
@ -255,11 +266,6 @@ def start_server_background(args):
|
|||
'--host', args.host,
|
||||
'--port', args.port,
|
||||
]
|
||||
model_file = args.model_path_prefix + os.path.sep + args.hf_file
|
||||
model_dir = os.path.dirname(model_file)
|
||||
if not os.path.exists(model_dir):
|
||||
os.makedirs(model_dir)
|
||||
server_args.extend(['--model', model_file])
|
||||
server_args.extend(['--hf-repo', args.hf_repo])
|
||||
server_args.extend(['--hf-file', args.hf_file])
|
||||
server_args.extend(['--n-gpu-layers', args.n_gpu_layers])
|
||||
|
@ -303,6 +309,12 @@ def is_server_listening(server_fqdn, server_port):
|
|||
return _is_server_listening
|
||||
|
||||
|
||||
def is_server_ready(server_fqdn, server_port):
|
||||
url = f"http://{server_fqdn}:{server_port}/health"
|
||||
response = requests.get(url)
|
||||
return response.status_code == 200
|
||||
|
||||
|
||||
def escape_metric_name(metric_name):
|
||||
return re.sub('[^A-Z0-9]', '_', metric_name.upper())
|
||||
|
||||
|
|
|
@ -56,6 +56,7 @@ const llamacpp_completion_tokens = new Trend('llamacpp_completion_tokens')
|
|||
|
||||
const llamacpp_tokens_second = new Trend('llamacpp_tokens_second')
|
||||
const llamacpp_prompt_processing_second = new Trend('llamacpp_prompt_processing_second')
|
||||
const llamacpp_emit_first_token_second = new Trend('llamacpp_emit_first_token_second')
|
||||
|
||||
const llamacpp_prompt_tokens_total_counter = new Counter('llamacpp_prompt_tokens_total_counter')
|
||||
const llamacpp_completion_tokens_total_counter = new Counter('llamacpp_completion_tokens_total_counter')
|
||||
|
@ -89,6 +90,9 @@ export default function () {
|
|||
],
|
||||
"model": model,
|
||||
"stream": true,
|
||||
"stream_options": {
|
||||
"include_usage": true, // False to be supported in llama.cpp server
|
||||
},
|
||||
"seed": 42,
|
||||
"max_tokens": max_tokens,
|
||||
"stop": ["<|im_end|>"] // This is temporary for phi-2 base (i.e. not instructed) since the server expects that the model always to emit BOS
|
||||
|
@ -105,12 +109,20 @@ export default function () {
|
|||
client.on('event', function (event) {
|
||||
if (promptEvalEndTime == null) {
|
||||
promptEvalEndTime = new Date()
|
||||
llamacpp_emit_first_token_second.add((promptEvalEndTime - startTime) / 1.e3)
|
||||
}
|
||||
|
||||
if (event.data === '[DONE]' || event.data === '') {
|
||||
return
|
||||
}
|
||||
|
||||
let chunk = JSON.parse(event.data)
|
||||
let choice = chunk.choices[0]
|
||||
if (choice.finish_reason) {
|
||||
finish_reason = choice.finish_reason
|
||||
|
||||
if (chunk.choices && chunk.choices.length > 0) {
|
||||
let choice = chunk.choices[0]
|
||||
if (choice.finish_reason) {
|
||||
finish_reason = choice.finish_reason
|
||||
}
|
||||
}
|
||||
|
||||
if (chunk.usage) {
|
||||
|
|
Binary file not shown.
|
@ -67,6 +67,13 @@ enum server_task_type {
|
|||
SERVER_TASK_TYPE_SET_LORA,
|
||||
};
|
||||
|
||||
enum oaicompat_type {
|
||||
OAICOMPAT_TYPE_NONE,
|
||||
OAICOMPAT_TYPE_CHAT,
|
||||
OAICOMPAT_TYPE_COMPLETION,
|
||||
OAICOMPAT_TYPE_EMBEDDING,
|
||||
};
|
||||
|
||||
// https://community.openai.com/t/openai-chat-list-of-error-codes-and-types/357791/11
|
||||
enum error_type {
|
||||
ERROR_TYPE_INVALID_REQUEST,
|
||||
|
@ -91,6 +98,8 @@ struct slot_params {
|
|||
int64_t t_max_prompt_ms = -1; // TODO: implement
|
||||
int64_t t_max_predict_ms = -1; // if positive, limit the generation phase to this time limit
|
||||
|
||||
std::vector<common_adapter_lora_info> lora;
|
||||
|
||||
std::vector<std::string> antiprompt;
|
||||
std::vector<std::string> response_fields;
|
||||
bool timings_per_token = false;
|
||||
|
@ -101,11 +110,10 @@ struct slot_params {
|
|||
struct common_params_speculative speculative;
|
||||
|
||||
// OAI-compat fields
|
||||
bool verbose = false;
|
||||
bool oaicompat = false;
|
||||
bool oaicompat_chat = true;
|
||||
std::string oaicompat_model;
|
||||
std::string oaicompat_cmpl_id;
|
||||
bool verbose = false;
|
||||
oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE;
|
||||
std::string oaicompat_model;
|
||||
std::string oaicompat_cmpl_id;
|
||||
|
||||
json to_json() const {
|
||||
std::vector<std::string> samplers;
|
||||
|
@ -114,6 +122,11 @@ struct slot_params {
|
|||
samplers.emplace_back(common_sampler_type_to_str(sampler));
|
||||
}
|
||||
|
||||
json lora = json::array();
|
||||
for (size_t i = 0; i < this->lora.size(); ++i) {
|
||||
lora.push_back({{"id", i}, {"scale", this->lora[i].scale}});
|
||||
}
|
||||
|
||||
return json {
|
||||
{"n_predict", n_predict}, // Server configured n_predict
|
||||
{"seed", sampling.seed},
|
||||
|
@ -154,6 +167,7 @@ struct slot_params {
|
|||
{"speculative.p_min", speculative.p_min},
|
||||
{"timings_per_token", timings_per_token},
|
||||
{"post_sampling_probs", post_sampling_probs},
|
||||
{"lora", lora},
|
||||
};
|
||||
}
|
||||
};
|
||||
|
@ -183,13 +197,18 @@ struct server_task {
|
|||
// used by SERVER_TASK_TYPE_METRICS
|
||||
bool metrics_reset_bucket = false;
|
||||
|
||||
// used by SERVER_TASK_TYPE_SET_LORA
|
||||
std::vector<common_adapter_lora_info> set_lora;
|
||||
|
||||
server_task(server_task_type type) : type(type) {}
|
||||
|
||||
static slot_params params_from_json_cmpl(
|
||||
const llama_model * model,
|
||||
const llama_context * ctx,
|
||||
const common_params & params_base,
|
||||
const json & data) {
|
||||
const llama_model * model = llama_get_model(ctx);
|
||||
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||
|
||||
slot_params params;
|
||||
|
||||
// Sampling parameter defaults are loaded from the global server context (but individual requests can still override them)
|
||||
|
@ -245,6 +264,16 @@ struct server_task {
|
|||
params.speculative.n_min = std::max(params.speculative.n_min, 2);
|
||||
params.speculative.n_max = std::max(params.speculative.n_max, 0);
|
||||
|
||||
if (data.contains("lora")) {
|
||||
if (data.at("lora").is_array()) {
|
||||
params.lora = parse_lora_request(params_base.lora_adapters, data.at("lora"));
|
||||
} else {
|
||||
throw std::runtime_error("Error: 'lora' must be an array of objects with 'id' and 'scale' fields");
|
||||
}
|
||||
} else {
|
||||
params.lora = params_base.lora_adapters;
|
||||
}
|
||||
|
||||
// TODO: add more sanity checks for the input parameters
|
||||
|
||||
if (params.sampling.penalty_last_n < -1) {
|
||||
|
@ -302,7 +331,7 @@ struct server_task {
|
|||
|
||||
const auto & logit_bias = data.find("logit_bias");
|
||||
if (logit_bias != data.end() && logit_bias->is_array()) {
|
||||
const int n_vocab = llama_n_vocab(model);
|
||||
const int n_vocab = llama_vocab_n_tokens(vocab);
|
||||
for (const auto & el : *logit_bias) {
|
||||
// TODO: we may want to throw errors here, in case "el" is incorrect
|
||||
if (el.is_array() && el.size() == 2) {
|
||||
|
@ -321,7 +350,7 @@ struct server_task {
|
|||
params.sampling.logit_bias.push_back({tok, bias});
|
||||
}
|
||||
} else if (el[0].is_string()) {
|
||||
auto toks = common_tokenize(model, el[0].get<std::string>(), false);
|
||||
auto toks = common_tokenize(vocab, el[0].get<std::string>(), false);
|
||||
for (auto tok : toks) {
|
||||
params.sampling.logit_bias.push_back({tok, bias});
|
||||
}
|
||||
|
@ -529,11 +558,10 @@ struct server_task_result_cmpl_final : server_task_result {
|
|||
slot_params generation_params;
|
||||
|
||||
// OAI-compat fields
|
||||
bool verbose = false;
|
||||
bool oaicompat = false;
|
||||
bool oaicompat_chat = true; // TODO: support oaicompat for non-chat
|
||||
std::string oaicompat_model;
|
||||
std::string oaicompat_cmpl_id;
|
||||
bool verbose = false;
|
||||
oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE;
|
||||
std::string oaicompat_model;
|
||||
std::string oaicompat_cmpl_id;
|
||||
|
||||
virtual int get_index() override {
|
||||
return index;
|
||||
|
@ -544,9 +572,16 @@ struct server_task_result_cmpl_final : server_task_result {
|
|||
}
|
||||
|
||||
virtual json to_json() override {
|
||||
return oaicompat
|
||||
? (stream ? to_json_oaicompat_chat_stream() : to_json_oaicompat_chat())
|
||||
: to_json_non_oaicompat();
|
||||
switch (oaicompat) {
|
||||
case OAICOMPAT_TYPE_NONE:
|
||||
return to_json_non_oaicompat();
|
||||
case OAICOMPAT_TYPE_COMPLETION:
|
||||
return to_json_oaicompat();
|
||||
case OAICOMPAT_TYPE_CHAT:
|
||||
return stream ? to_json_oaicompat_chat_stream() : to_json_oaicompat_chat();
|
||||
default:
|
||||
GGML_ASSERT(false && "Invalid oaicompat_type");
|
||||
}
|
||||
}
|
||||
|
||||
json to_json_non_oaicompat() {
|
||||
|
@ -574,6 +609,50 @@ struct server_task_result_cmpl_final : server_task_result {
|
|||
return response_fields.empty() ? res : json_get_nested_values(response_fields, res);
|
||||
}
|
||||
|
||||
json to_json_oaicompat() {
|
||||
std::time_t t = std::time(0);
|
||||
json logprobs = json(nullptr); // OAI default to null
|
||||
if (!stream && probs_output.size() > 0) {
|
||||
logprobs = json{
|
||||
{"content", completion_token_output::probs_vector_to_json(probs_output, post_sampling_probs)},
|
||||
};
|
||||
}
|
||||
json finish_reason = "length";
|
||||
if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) {
|
||||
finish_reason = "stop";
|
||||
}
|
||||
json res = json {
|
||||
{"choices", json::array({
|
||||
json{
|
||||
{"text", stream ? "" : content}, // in stream mode, content is already in last partial chunk
|
||||
{"index", index},
|
||||
{"logprobs", logprobs},
|
||||
{"finish_reason", finish_reason},
|
||||
}
|
||||
})},
|
||||
{"created", t},
|
||||
{"model", oaicompat_model},
|
||||
{"system_fingerprint", build_info},
|
||||
{"object", "text_completion"},
|
||||
{"usage", json {
|
||||
{"completion_tokens", n_decoded},
|
||||
{"prompt_tokens", n_prompt_tokens},
|
||||
{"total_tokens", n_decoded + n_prompt_tokens}
|
||||
}},
|
||||
{"id", oaicompat_cmpl_id}
|
||||
};
|
||||
|
||||
// extra fields for debugging purposes
|
||||
if (verbose) {
|
||||
res["__verbose"] = to_json_non_oaicompat();
|
||||
}
|
||||
if (timings.prompt_n >= 0) {
|
||||
res.push_back({"timings", timings.to_json()});
|
||||
}
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
json to_json_oaicompat_chat() {
|
||||
std::string finish_reason = "length";
|
||||
if (stop == STOP_TYPE_WORD || stop == STOP_TYPE_EOS) {
|
||||
|
@ -671,11 +750,10 @@ struct server_task_result_cmpl_partial : server_task_result {
|
|||
result_timings timings;
|
||||
|
||||
// OAI-compat fields
|
||||
bool verbose = false;
|
||||
bool oaicompat = false;
|
||||
bool oaicompat_chat = true; // TODO: support oaicompat for non-chat
|
||||
std::string oaicompat_model;
|
||||
std::string oaicompat_cmpl_id;
|
||||
bool verbose = false;
|
||||
oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE;
|
||||
std::string oaicompat_model;
|
||||
std::string oaicompat_cmpl_id;
|
||||
|
||||
virtual int get_index() override {
|
||||
return index;
|
||||
|
@ -686,7 +764,16 @@ struct server_task_result_cmpl_partial : server_task_result {
|
|||
}
|
||||
|
||||
virtual json to_json() override {
|
||||
return oaicompat ? to_json_oaicompat() : to_json_non_oaicompat();
|
||||
switch (oaicompat) {
|
||||
case OAICOMPAT_TYPE_NONE:
|
||||
return to_json_non_oaicompat();
|
||||
case OAICOMPAT_TYPE_COMPLETION:
|
||||
return to_json_oaicompat();
|
||||
case OAICOMPAT_TYPE_CHAT:
|
||||
return to_json_oaicompat_chat();
|
||||
default:
|
||||
GGML_ASSERT(false && "Invalid oaicompat_type");
|
||||
}
|
||||
}
|
||||
|
||||
json to_json_non_oaicompat() {
|
||||
|
@ -711,6 +798,41 @@ struct server_task_result_cmpl_partial : server_task_result {
|
|||
}
|
||||
|
||||
json to_json_oaicompat() {
|
||||
std::time_t t = std::time(0);
|
||||
json logprobs = json(nullptr); // OAI default to null
|
||||
if (prob_output.probs.size() > 0) {
|
||||
logprobs = json{
|
||||
{"content", completion_token_output::probs_vector_to_json({prob_output}, post_sampling_probs)},
|
||||
};
|
||||
}
|
||||
json res = json {
|
||||
{"choices", json::array({
|
||||
json{
|
||||
{"text", content},
|
||||
{"index", index},
|
||||
{"logprobs", logprobs},
|
||||
{"finish_reason", nullptr},
|
||||
}
|
||||
})},
|
||||
{"created", t},
|
||||
{"model", oaicompat_model},
|
||||
{"system_fingerprint", build_info},
|
||||
{"object", "text_completion"},
|
||||
{"id", oaicompat_cmpl_id}
|
||||
};
|
||||
|
||||
// extra fields for debugging purposes
|
||||
if (verbose) {
|
||||
res["__verbose"] = to_json_non_oaicompat();
|
||||
}
|
||||
if (timings.prompt_n >= 0) {
|
||||
res.push_back({"timings", timings.to_json()});
|
||||
}
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
json to_json_oaicompat_chat() {
|
||||
bool first = n_decoded == 0;
|
||||
std::time_t t = std::time(0);
|
||||
json choices;
|
||||
|
@ -789,14 +911,16 @@ struct server_task_result_embd : server_task_result {
|
|||
int32_t n_tokens;
|
||||
|
||||
// OAI-compat fields
|
||||
bool oaicompat = false;
|
||||
oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE;
|
||||
|
||||
virtual int get_index() override {
|
||||
return index;
|
||||
}
|
||||
|
||||
virtual json to_json() override {
|
||||
return oaicompat ? to_json_oaicompat() : to_json_non_oaicompat();
|
||||
return oaicompat == OAICOMPAT_TYPE_EMBEDDING
|
||||
? to_json_oaicompat()
|
||||
: to_json_non_oaicompat();
|
||||
}
|
||||
|
||||
json to_json_non_oaicompat() {
|
||||
|
@ -1009,6 +1133,8 @@ struct server_slot {
|
|||
|
||||
common_speculative * spec = nullptr;
|
||||
|
||||
std::vector<common_adapter_lora_info> lora;
|
||||
|
||||
// the index relative to completion multi-task request
|
||||
size_t index = 0;
|
||||
|
||||
|
@ -1090,6 +1216,11 @@ struct server_slot {
|
|||
return task_type == SERVER_TASK_TYPE_EMBEDDING || task_type == SERVER_TASK_TYPE_RERANK;
|
||||
}
|
||||
|
||||
bool can_batch_with(server_slot & other_slot) {
|
||||
return is_non_causal() == other_slot.is_non_causal()
|
||||
&& are_lora_equal(lora, other_slot.lora);
|
||||
}
|
||||
|
||||
bool has_budget(const common_params & global_params) {
|
||||
if (params.n_predict == -1 && global_params.n_predict == -1) {
|
||||
return true; // limitless
|
||||
|
@ -1497,11 +1628,17 @@ struct server_response {
|
|||
struct server_context {
|
||||
common_params params_base;
|
||||
|
||||
// note: keep these alive - they determine the lifetime of the model, context, etc.
|
||||
common_init_result llama_init;
|
||||
common_init_result llama_init_dft;
|
||||
|
||||
llama_model * model = nullptr;
|
||||
llama_context * ctx = nullptr;
|
||||
std::vector<common_lora_adapter_container> loras;
|
||||
|
||||
const llama_vocab * vocab = nullptr;
|
||||
|
||||
llama_model * model_dft = nullptr;
|
||||
|
||||
llama_context_params cparams_dft;
|
||||
|
||||
llama_batch batch = {};
|
||||
|
@ -1525,21 +1662,6 @@ struct server_context {
|
|||
float slot_prompt_similarity = 0.0f;
|
||||
|
||||
~server_context() {
|
||||
if (ctx) {
|
||||
llama_free(ctx);
|
||||
ctx = nullptr;
|
||||
}
|
||||
|
||||
if (model) {
|
||||
llama_free_model(model);
|
||||
model = nullptr;
|
||||
}
|
||||
|
||||
if (model_dft) {
|
||||
llama_free_model(model_dft);
|
||||
model_dft = nullptr;
|
||||
}
|
||||
|
||||
// Clear any sampling context
|
||||
for (server_slot & slot : slots) {
|
||||
common_sampler_free(slot.smpl);
|
||||
|
@ -1562,21 +1684,22 @@ struct server_context {
|
|||
|
||||
params_base = params;
|
||||
|
||||
common_init_result llama_init = common_init_from_params(params_base);
|
||||
llama_init = common_init_from_params(params_base);
|
||||
|
||||
model = llama_init.model;
|
||||
ctx = llama_init.context;
|
||||
loras = llama_init.lora_adapters;
|
||||
model = llama_init.model.get();
|
||||
ctx = llama_init.context.get();
|
||||
|
||||
if (model == nullptr) {
|
||||
SRV_ERR("failed to load model, '%s'\n", params_base.model.c_str());
|
||||
return false;
|
||||
}
|
||||
|
||||
vocab = llama_model_get_vocab(model);
|
||||
|
||||
n_ctx = llama_n_ctx(ctx);
|
||||
|
||||
add_bos_token = llama_add_bos_token(model);
|
||||
has_eos_token = llama_token_eos(model) != LLAMA_TOKEN_NULL;
|
||||
add_bos_token = llama_vocab_get_add_bos(vocab);
|
||||
has_eos_token = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL;
|
||||
|
||||
if (!params_base.speculative.model.empty()) {
|
||||
SRV_INF("loading draft model '%s'\n", params_base.speculative.model.c_str());
|
||||
|
@ -1589,25 +1712,22 @@ struct server_context {
|
|||
params_dft.n_gpu_layers = params_base.speculative.n_gpu_layers;
|
||||
params_dft.n_parallel = 1;
|
||||
|
||||
common_init_result llama_init_dft = common_init_from_params(params_dft);
|
||||
llama_init_dft = common_init_from_params(params_dft);
|
||||
|
||||
model_dft = llama_init_dft.model;
|
||||
model_dft = llama_init_dft.model.get();
|
||||
|
||||
if (model_dft == nullptr) {
|
||||
SRV_ERR("failed to load draft model, '%s'\n", params_base.speculative.model.c_str());
|
||||
return false;
|
||||
}
|
||||
|
||||
if (!common_speculative_are_compatible(ctx, llama_init_dft.context)) {
|
||||
if (!common_speculative_are_compatible(ctx, llama_init_dft.context.get())) {
|
||||
SRV_ERR("the draft model '%s' is not compatible with the target model '%s'\n", params_base.speculative.model.c_str(), params_base.model.c_str());
|
||||
|
||||
llama_free (llama_init_dft.context);
|
||||
llama_free_model(llama_init_dft.model);
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
const int n_ctx_dft = llama_n_ctx(llama_init_dft.context);
|
||||
const int n_ctx_dft = llama_n_ctx(llama_init_dft.context.get());
|
||||
|
||||
cparams_dft = common_context_params_to_llama(params_dft);
|
||||
cparams_dft.n_batch = n_ctx_dft;
|
||||
|
@ -1615,15 +1735,12 @@ struct server_context {
|
|||
// force F16 KV cache for the draft model for extra performance
|
||||
cparams_dft.type_k = GGML_TYPE_F16;
|
||||
cparams_dft.type_v = GGML_TYPE_F16;
|
||||
|
||||
// the context is not needed - we will create one for each slot
|
||||
llama_free(llama_init_dft.context);
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
bool validate_model_chat_template(bool use_jinja) const {
|
||||
bool validate_builtin_chat_template(bool use_jinja) const {
|
||||
llama_chat_message chat[] = {{"user", "test"}};
|
||||
|
||||
if (use_jinja) {
|
||||
|
@ -1642,18 +1759,13 @@ struct server_context {
|
|||
return true;
|
||||
} catch (const std::exception & e) {
|
||||
SRV_ERR("failed to apply template: %s\n", e.what());
|
||||
return false;
|
||||
}
|
||||
} else {
|
||||
std::vector<char> model_template(2048, 0); // longest known template is about 1200 bytes
|
||||
std::string template_key = "tokenizer.chat_template";
|
||||
int32_t res = llama_model_meta_val_str(model, template_key.c_str(), model_template.data(), model_template.size());
|
||||
if (res >= 0) {
|
||||
std::string tmpl = std::string(model_template.data(), model_template.size());
|
||||
int32_t chat_res = llama_chat_apply_template(model, tmpl.c_str(), chat, 1, true, nullptr, 0);
|
||||
return chat_res > 0;
|
||||
}
|
||||
const char * tmpl = llama_model_chat_template(model, /* name */ nullptr);
|
||||
const int32_t chat_res = llama_chat_apply_template(tmpl, chat, 1, true, nullptr, 0);
|
||||
return chat_res > 0;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
void init() {
|
||||
|
@ -1672,7 +1784,7 @@ struct server_context {
|
|||
if (model_dft) {
|
||||
slot.batch_spec = llama_batch_init(params_base.speculative.n_max + 1, 0, 1);
|
||||
|
||||
slot.ctx_dft = llama_new_context_with_model(model_dft, cparams_dft);
|
||||
slot.ctx_dft = llama_init_from_model(model_dft, cparams_dft);
|
||||
if (slot.ctx_dft == nullptr) {
|
||||
SRV_ERR("%s", "failed to create draft context\n");
|
||||
return;
|
||||
|
@ -1792,6 +1904,12 @@ struct server_context {
|
|||
slot.params = std::move(task.params);
|
||||
slot.prompt_tokens = std::move(task.prompt_tokens);
|
||||
|
||||
if (!are_lora_equal(task.params.lora, slot.lora)) {
|
||||
// if lora is changed, we cannot reuse cached tokens
|
||||
slot.cache_tokens.clear();
|
||||
slot.lora = task.params.lora;
|
||||
}
|
||||
|
||||
SLT_DBG(slot, "launching slot : %s\n", safe_json_to_str(slot.to_json()).c_str());
|
||||
|
||||
if (slot.n_predict > 0 && slot.params.n_predict > slot.n_predict) {
|
||||
|
@ -1801,7 +1919,7 @@ struct server_context {
|
|||
}
|
||||
|
||||
if (slot.params.ignore_eos && has_eos_token) {
|
||||
slot.params.sampling.logit_bias.push_back({llama_token_eos(model), -INFINITY});
|
||||
slot.params.sampling.logit_bias.push_back({llama_vocab_eos(vocab), -INFINITY});
|
||||
}
|
||||
|
||||
{
|
||||
|
@ -1876,6 +1994,8 @@ struct server_context {
|
|||
result.text_to_send = slot.generated_text.substr(pos, std::string::npos);
|
||||
slot.n_sent_text += result.text_to_send.size();
|
||||
// add the token to slot queue and cache
|
||||
} else {
|
||||
result.text_to_send = "";
|
||||
}
|
||||
|
||||
slot.add_token(result);
|
||||
|
@ -1955,14 +2075,14 @@ struct server_context {
|
|||
slot.n_decoded, slot.n_prompt_tokens, slot.n_past, slot.n_ctx);
|
||||
}
|
||||
|
||||
if (llama_token_is_eog(model, result.tok)) {
|
||||
if (llama_vocab_is_eog(vocab, result.tok)) {
|
||||
slot.stop = STOP_TYPE_EOS;
|
||||
slot.has_next_token = false;
|
||||
|
||||
SLT_DBG(slot, "%s", "stopped by EOS\n");
|
||||
}
|
||||
|
||||
const auto n_ctx_train = llama_n_ctx_train(model);
|
||||
const auto n_ctx_train = llama_model_n_ctx_train(model);
|
||||
|
||||
if (slot.params.n_predict < 1 && slot.n_predict < 1 && slot.n_prompt_tokens + slot.n_decoded >= n_ctx_train) {
|
||||
slot.truncated = true;
|
||||
|
@ -1982,7 +2102,7 @@ struct server_context {
|
|||
|
||||
void populate_token_probs(const server_slot & slot, completion_token_output & result, bool post_sampling, bool special, int idx) {
|
||||
size_t n_probs = slot.params.sampling.n_probs;
|
||||
size_t n_vocab = llama_n_vocab(llama_get_model(ctx));
|
||||
size_t n_vocab = llama_vocab_n_tokens(vocab);
|
||||
if (post_sampling) {
|
||||
const auto * cur_p = common_sampler_get_candidates(slot.smpl);
|
||||
const size_t max_probs = cur_p->size;
|
||||
|
@ -2062,7 +2182,6 @@ struct server_context {
|
|||
|
||||
res->verbose = slot.params.verbose;
|
||||
res->oaicompat = slot.params.oaicompat;
|
||||
res->oaicompat_chat = slot.params.oaicompat_chat;
|
||||
res->oaicompat_model = slot.params.oaicompat_model;
|
||||
res->oaicompat_cmpl_id = slot.params.oaicompat_cmpl_id;
|
||||
|
||||
|
@ -2103,7 +2222,6 @@ struct server_context {
|
|||
res->verbose = slot.params.verbose;
|
||||
res->stream = slot.params.stream;
|
||||
res->oaicompat = slot.params.oaicompat;
|
||||
res->oaicompat_chat = slot.params.oaicompat_chat;
|
||||
res->oaicompat_model = slot.params.oaicompat_model;
|
||||
res->oaicompat_cmpl_id = slot.params.oaicompat_cmpl_id;
|
||||
|
||||
|
@ -2135,7 +2253,7 @@ struct server_context {
|
|||
res->n_tokens = slot.n_prompt_tokens;
|
||||
res->oaicompat = slot.params.oaicompat;
|
||||
|
||||
const int n_embd = llama_n_embd(model);
|
||||
const int n_embd = llama_model_n_embd(model);
|
||||
|
||||
std::vector<float> embd_res(n_embd, 0.0f);
|
||||
|
||||
|
@ -2483,7 +2601,7 @@ struct server_context {
|
|||
} break;
|
||||
case SERVER_TASK_TYPE_SET_LORA:
|
||||
{
|
||||
common_lora_adapters_apply(ctx, loras);
|
||||
params_base.lora_adapters = std::move(task.set_lora);
|
||||
auto res = std::make_unique<server_task_result_apply_lora>();
|
||||
res->id = task.id;
|
||||
queue_results.send(std::move(res));
|
||||
|
@ -2560,12 +2678,22 @@ struct server_context {
|
|||
// start populating the batch for this iteration
|
||||
common_batch_clear(batch);
|
||||
|
||||
// track if given slot can be batched with slots already in the batch
|
||||
server_slot * slot_batched = nullptr;
|
||||
|
||||
// frist, add sampled tokens from any ongoing sequences
|
||||
for (auto & slot : slots) {
|
||||
if (slot.state != SLOT_STATE_GENERATING) {
|
||||
continue;
|
||||
}
|
||||
|
||||
// check if we can batch this slot with the previous one
|
||||
if (!slot_batched) {
|
||||
slot_batched = &slot;
|
||||
} else if (!slot_batched->can_batch_with(slot)) {
|
||||
continue;
|
||||
}
|
||||
|
||||
slot.i_batch = batch.n_tokens;
|
||||
|
||||
common_batch_add(batch, slot.sampled, slot.n_past, { slot.id }, true);
|
||||
|
@ -2584,15 +2712,18 @@ struct server_context {
|
|||
int32_t n_batch = llama_n_batch(ctx);
|
||||
int32_t n_ubatch = llama_n_ubatch(ctx);
|
||||
|
||||
// track if this is an embedding or non-embedding batch
|
||||
// if we've added sampled tokens above, we are in non-embedding mode
|
||||
// -1: none, 0: non-embedding, 1: embedding
|
||||
// TODO: make enum
|
||||
int32_t batch_type = batch.n_tokens > 0 ? 0 : -1;
|
||||
|
||||
// next, batch any pending prompts without exceeding n_batch
|
||||
if (params_base.cont_batching || batch.n_tokens == 0) {
|
||||
for (auto & slot : slots) {
|
||||
// check if we can batch this slot with the previous one
|
||||
if (slot.is_processing()) {
|
||||
if (!slot_batched) {
|
||||
slot_batched = &slot;
|
||||
} else if (!slot_batched->can_batch_with(slot)) {
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
// this slot still has a prompt to be processed
|
||||
if (slot.state == SLOT_STATE_PROCESSING_PROMPT || slot.state == SLOT_STATE_STARTED) {
|
||||
auto & prompt_tokens = slot.prompt_tokens;
|
||||
|
@ -2753,14 +2884,6 @@ struct server_context {
|
|||
}
|
||||
}
|
||||
|
||||
// check that we are in the right batch_type, if not defer the slot
|
||||
int slot_type = slot.is_non_causal();
|
||||
if (batch_type == -1) {
|
||||
batch_type = slot_type;
|
||||
} else if (batch_type != slot_type) {
|
||||
continue;
|
||||
}
|
||||
|
||||
// keep only the common part
|
||||
if (!llama_kv_cache_seq_rm(ctx, slot.id, slot.n_past, -1)) {
|
||||
// could not partially delete (likely using a non-Transformer model)
|
||||
|
@ -2828,8 +2951,12 @@ struct server_context {
|
|||
|
||||
SRV_DBG("decoding batch, n_tokens = %d\n", batch.n_tokens);
|
||||
|
||||
// make sure we're in the right embedding mode
|
||||
llama_set_embeddings(ctx, batch_type == 1);
|
||||
if (slot_batched) {
|
||||
// make sure we're in the right embedding mode
|
||||
llama_set_embeddings(ctx, slot_batched->is_non_causal());
|
||||
// apply lora, only need to do it once per batch
|
||||
common_set_adapter_lora(ctx, slot_batched->lora);
|
||||
}
|
||||
|
||||
// process the created batch of tokens
|
||||
for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
|
||||
|
@ -3030,12 +3157,12 @@ struct server_context {
|
|||
|
||||
json model_meta() const {
|
||||
return json {
|
||||
{"vocab_type", llama_vocab_type (model)},
|
||||
{"n_vocab", llama_n_vocab (model)},
|
||||
{"n_ctx_train", llama_n_ctx_train (model)},
|
||||
{"n_embd", llama_n_embd (model)},
|
||||
{"n_params", llama_model_n_params(model)},
|
||||
{"size", llama_model_size (model)},
|
||||
{"vocab_type", llama_vocab_type (vocab)},
|
||||
{"n_vocab", llama_vocab_n_tokens (vocab)},
|
||||
{"n_ctx_train", llama_model_n_ctx_train(model)},
|
||||
{"n_embd", llama_model_n_embd (model)},
|
||||
{"n_params", llama_model_n_params (model)},
|
||||
{"size", llama_model_size (model)},
|
||||
};
|
||||
}
|
||||
};
|
||||
|
@ -3539,12 +3666,11 @@ int main(int argc, char ** argv) {
|
|||
|
||||
// handle completion-like requests (completion, chat, infill)
|
||||
// we can optionally provide a custom format for partial results and final results
|
||||
const auto handle_completions_generic = [&ctx_server, &res_error, &res_ok](
|
||||
const auto handle_completions_impl = [&ctx_server, &res_error, &res_ok](
|
||||
server_task_type type,
|
||||
json & data,
|
||||
httplib::Response & res,
|
||||
bool oaicompat = false,
|
||||
bool oaicompat_chat = false) {
|
||||
oaicompat_type oaicompat) {
|
||||
GGML_ASSERT(type == SERVER_TASK_TYPE_COMPLETION || type == SERVER_TASK_TYPE_INFILL);
|
||||
|
||||
if (ctx_server.params_base.embedding) {
|
||||
|
@ -3556,7 +3682,7 @@ int main(int argc, char ** argv) {
|
|||
std::vector<server_task> tasks;
|
||||
|
||||
try {
|
||||
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, data.at("prompt"), true, true);
|
||||
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, data.at("prompt"), true, true);
|
||||
tasks.reserve(tokenized_prompts.size());
|
||||
for (size_t i = 0; i < tokenized_prompts.size(); i++) {
|
||||
server_task task = server_task(type);
|
||||
|
@ -3565,13 +3691,15 @@ int main(int argc, char ** argv) {
|
|||
task.index = i;
|
||||
|
||||
task.prompt_tokens = std::move(tokenized_prompts[i]);
|
||||
task.params = server_task::params_from_json_cmpl(ctx_server.model, ctx_server.ctx, ctx_server.params_base, data);
|
||||
task.params = server_task::params_from_json_cmpl(
|
||||
ctx_server.ctx,
|
||||
ctx_server.params_base,
|
||||
data);
|
||||
task.id_selected_slot = json_value(data, "id_slot", -1);
|
||||
|
||||
// OAI-compat
|
||||
task.params.oaicompat = oaicompat;
|
||||
task.params.oaicompat_chat = oaicompat_chat;
|
||||
task.params.oaicompat_cmpl_id = completion_id;
|
||||
task.params.oaicompat = oaicompat;
|
||||
task.params.oaicompat_cmpl_id = completion_id;
|
||||
// oaicompat_model is already populated by params_from_json_cmpl
|
||||
|
||||
tasks.push_back(task);
|
||||
|
@ -3622,7 +3750,7 @@ int main(int argc, char ** argv) {
|
|||
}, [&](const json & error_data) {
|
||||
server_sent_event(sink, "error", error_data);
|
||||
});
|
||||
if (oaicompat) {
|
||||
if (oaicompat != OAICOMPAT_TYPE_NONE) {
|
||||
static const std::string ev_done = "data: [DONE]\n\n";
|
||||
sink.write(ev_done.data(), ev_done.size());
|
||||
}
|
||||
|
@ -3638,26 +3766,34 @@ int main(int argc, char ** argv) {
|
|||
}
|
||||
};
|
||||
|
||||
const auto handle_completions = [&handle_completions_generic](const httplib::Request & req, httplib::Response & res) {
|
||||
const auto handle_completions = [&handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
|
||||
json data = json::parse(req.body);
|
||||
return handle_completions_generic(
|
||||
return handle_completions_impl(
|
||||
SERVER_TASK_TYPE_COMPLETION,
|
||||
data,
|
||||
res,
|
||||
/* oaicompat */ false,
|
||||
/* oaicompat_chat */ false);
|
||||
OAICOMPAT_TYPE_NONE);
|
||||
};
|
||||
|
||||
const auto handle_infill = [&ctx_server, &res_error, &handle_completions_generic](const httplib::Request & req, httplib::Response & res) {
|
||||
const auto handle_completions_oai = [&handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
|
||||
json data = oaicompat_completion_params_parse(json::parse(req.body));
|
||||
return handle_completions_impl(
|
||||
SERVER_TASK_TYPE_COMPLETION,
|
||||
data,
|
||||
res,
|
||||
OAICOMPAT_TYPE_COMPLETION);
|
||||
};
|
||||
|
||||
const auto handle_infill = [&ctx_server, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
|
||||
// check model compatibility
|
||||
std::string err;
|
||||
if (llama_token_fim_pre(ctx_server.model) == LLAMA_TOKEN_NULL) {
|
||||
if (llama_vocab_fim_pre(ctx_server.vocab) == LLAMA_TOKEN_NULL) {
|
||||
err += "prefix token is missing. ";
|
||||
}
|
||||
if (llama_token_fim_suf(ctx_server.model) == LLAMA_TOKEN_NULL) {
|
||||
if (llama_vocab_fim_suf(ctx_server.vocab) == LLAMA_TOKEN_NULL) {
|
||||
err += "suffix token is missing. ";
|
||||
}
|
||||
if (llama_token_fim_mid(ctx_server.model) == LLAMA_TOKEN_NULL) {
|
||||
if (llama_vocab_fim_mid(ctx_server.vocab) == LLAMA_TOKEN_NULL) {
|
||||
err += "middle token is missing. ";
|
||||
}
|
||||
if (!err.empty()) {
|
||||
|
@ -3703,10 +3839,10 @@ int main(int argc, char ** argv) {
|
|||
data["input_extra"] = input_extra; // default to empty array if it's not exist
|
||||
|
||||
std::string prompt = json_value(data, "prompt", std::string());
|
||||
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, prompt, true, true);
|
||||
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, false, true);
|
||||
SRV_DBG("creating infill tasks, n_prompts = %d\n", (int) tokenized_prompts.size());
|
||||
data["prompt"] = format_infill(
|
||||
ctx_server.ctx,
|
||||
ctx_server.vocab,
|
||||
data.at("input_prefix"),
|
||||
data.at("input_suffix"),
|
||||
data.at("input_extra"),
|
||||
|
@ -3717,10 +3853,14 @@ int main(int argc, char ** argv) {
|
|||
tokenized_prompts[0]
|
||||
);
|
||||
|
||||
return handle_completions_generic(SERVER_TASK_TYPE_INFILL, data, res);
|
||||
return handle_completions_impl(
|
||||
SERVER_TASK_TYPE_INFILL,
|
||||
data,
|
||||
res,
|
||||
OAICOMPAT_TYPE_NONE); // infill is not OAI compatible
|
||||
};
|
||||
|
||||
const auto handle_chat_completions = [&ctx_server, ¶ms, &res_error, &handle_completions_generic, &get_chat_templates](const httplib::Request & req, httplib::Response & res) {
|
||||
const auto handle_chat_completions = [&ctx_server, ¶ms, &res_error, &handle_completions_impl, &get_chat_templates](const httplib::Request & req, httplib::Response & res) {
|
||||
if (ctx_server.params_base.embedding) {
|
||||
res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED));
|
||||
return;
|
||||
|
@ -3731,12 +3871,11 @@ int main(int argc, char ** argv) {
|
|||
const auto & chat_template = body.contains("tools") && templates.tool_use_template ? *templates.tool_use_template : templates.default_template;
|
||||
json data = oaicompat_completion_params_parse(ctx_server.model, body, chat_template, params.use_jinja);
|
||||
|
||||
return handle_completions_generic(
|
||||
return handle_completions_impl(
|
||||
SERVER_TASK_TYPE_COMPLETION,
|
||||
data,
|
||||
res,
|
||||
/* oaicompat */ true,
|
||||
/* oaicompat_chat */ true);
|
||||
OAICOMPAT_TYPE_CHAT);
|
||||
};
|
||||
|
||||
const auto handle_models = [¶ms, &ctx_server, &res_ok](const httplib::Request &, httplib::Response & res) {
|
||||
|
@ -3764,7 +3903,7 @@ int main(int argc, char ** argv) {
|
|||
const bool add_special = json_value(body, "add_special", false);
|
||||
const bool with_pieces = json_value(body, "with_pieces", false);
|
||||
|
||||
llama_tokens tokens = tokenize_mixed(ctx_server.ctx, body.at("content"), add_special, true);
|
||||
llama_tokens tokens = tokenize_mixed(ctx_server.vocab, body.at("content"), add_special, true);
|
||||
|
||||
if (with_pieces) {
|
||||
for (const auto& token : tokens) {
|
||||
|
@ -3809,10 +3948,10 @@ int main(int argc, char ** argv) {
|
|||
res_ok(res, data);
|
||||
};
|
||||
|
||||
const auto handle_embeddings_impl = [&ctx_server, &res_error, &res_ok](const httplib::Request & req, httplib::Response & res, bool oaicompat) {
|
||||
const auto handle_embeddings_impl = [&ctx_server, &res_error, &res_ok](const httplib::Request & req, httplib::Response & res, oaicompat_type oaicompat) {
|
||||
const json body = json::parse(req.body);
|
||||
|
||||
if (oaicompat && llama_pooling_type(ctx_server.ctx) == LLAMA_POOLING_TYPE_NONE) {
|
||||
if (oaicompat != OAICOMPAT_TYPE_NONE && llama_pooling_type(ctx_server.ctx) == LLAMA_POOLING_TYPE_NONE) {
|
||||
res_error(res, format_error_response("Pooling type 'none' is not OAI compatible. Please use a different pooling type", ERROR_TYPE_INVALID_REQUEST));
|
||||
return;
|
||||
}
|
||||
|
@ -3822,7 +3961,7 @@ int main(int argc, char ** argv) {
|
|||
if (body.count("input") != 0) {
|
||||
prompt = body.at("input");
|
||||
} else if (body.contains("content")) {
|
||||
oaicompat = false;
|
||||
oaicompat = OAICOMPAT_TYPE_NONE; // "content" field is not OAI compatible
|
||||
prompt = body.at("content");
|
||||
} else {
|
||||
res_error(res, format_error_response("\"input\" or \"content\" must be provided", ERROR_TYPE_INVALID_REQUEST));
|
||||
|
@ -3840,7 +3979,7 @@ int main(int argc, char ** argv) {
|
|||
}
|
||||
}
|
||||
|
||||
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, prompt, true, true);
|
||||
std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true);
|
||||
for (const auto & tokens : tokenized_prompts) {
|
||||
// this check is necessary for models that do not add BOS token to the input
|
||||
if (tokens.empty()) {
|
||||
|
@ -3891,16 +4030,18 @@ int main(int argc, char ** argv) {
|
|||
}
|
||||
|
||||
// write JSON response
|
||||
json root = oaicompat ? format_embeddings_response_oaicompat(body, responses, use_base64) : json(responses);
|
||||
json root = oaicompat == OAICOMPAT_TYPE_EMBEDDING
|
||||
? format_embeddings_response_oaicompat(body, responses, use_base64)
|
||||
: json(responses);
|
||||
res_ok(res, root);
|
||||
};
|
||||
|
||||
const auto handle_embeddings = [&handle_embeddings_impl](const httplib::Request & req, httplib::Response & res) {
|
||||
handle_embeddings_impl(req, res, false);
|
||||
handle_embeddings_impl(req, res, OAICOMPAT_TYPE_NONE);
|
||||
};
|
||||
|
||||
const auto handle_embeddings_oai = [&handle_embeddings_impl](const httplib::Request & req, httplib::Response & res) {
|
||||
handle_embeddings_impl(req, res, true);
|
||||
handle_embeddings_impl(req, res, OAICOMPAT_TYPE_EMBEDDING);
|
||||
};
|
||||
|
||||
const auto handle_rerank = [&ctx_server, &res_error, &res_ok](const httplib::Request & req, httplib::Response & res) {
|
||||
|
@ -3938,20 +4079,20 @@ int main(int argc, char ** argv) {
|
|||
return;
|
||||
}
|
||||
|
||||
llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.ctx, query, /* add_special */ false, true)[0];
|
||||
llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.vocab, query, /* add_special */ false, true)[0];
|
||||
|
||||
// create and queue the task
|
||||
json responses = json::array();
|
||||
bool error = false;
|
||||
{
|
||||
std::vector<server_task> tasks;
|
||||
std::vector<llama_tokens> tokenized_docs = tokenize_input_prompts(ctx_server.ctx, documents, /* add_special */ false, true);
|
||||
std::vector<llama_tokens> tokenized_docs = tokenize_input_prompts(ctx_server.vocab, documents, /* add_special */ false, true);
|
||||
tasks.reserve(tokenized_docs.size());
|
||||
for (size_t i = 0; i < tokenized_docs.size(); i++) {
|
||||
server_task task = server_task(SERVER_TASK_TYPE_RERANK);
|
||||
task.id = ctx_server.queue_tasks.get_new_id();
|
||||
task.index = i;
|
||||
task.prompt_tokens = format_rerank(ctx_server.model, tokenized_query, tokenized_docs[i]);
|
||||
task.prompt_tokens = format_rerank(ctx_server.vocab, tokenized_query, tokenized_docs[i]);
|
||||
tasks.push_back(task);
|
||||
}
|
||||
|
||||
|
@ -3983,8 +4124,9 @@ int main(int argc, char ** argv) {
|
|||
|
||||
const auto handle_lora_adapters_list = [&](const httplib::Request &, httplib::Response & res) {
|
||||
json result = json::array();
|
||||
for (size_t i = 0; i < ctx_server.loras.size(); ++i) {
|
||||
auto & lora = ctx_server.loras[i];
|
||||
const auto & loras = ctx_server.params_base.lora_adapters;
|
||||
for (size_t i = 0; i < loras.size(); ++i) {
|
||||
auto & lora = loras[i];
|
||||
result.push_back({
|
||||
{"id", i},
|
||||
{"path", lora.path},
|
||||
|
@ -3996,27 +4138,14 @@ int main(int argc, char ** argv) {
|
|||
};
|
||||
|
||||
const auto handle_lora_adapters_apply = [&](const httplib::Request & req, httplib::Response & res) {
|
||||
const std::vector<json> body = json::parse(req.body);
|
||||
int max_idx = ctx_server.loras.size();
|
||||
|
||||
// clear existing value
|
||||
for (auto & lora : ctx_server.loras) {
|
||||
lora.scale = 0.0f;
|
||||
const json body = json::parse(req.body);
|
||||
if (!body.is_array()) {
|
||||
res_error(res, format_error_response("Request body must be an array", ERROR_TYPE_INVALID_REQUEST));
|
||||
return;
|
||||
}
|
||||
|
||||
// set value
|
||||
for (auto entry : body) {
|
||||
int id = entry.at("id");
|
||||
float scale = entry.at("scale");
|
||||
if (0 <= id && id < max_idx) {
|
||||
ctx_server.loras[id].scale = scale;
|
||||
} else {
|
||||
throw std::runtime_error("invalid adapter id");
|
||||
}
|
||||
}
|
||||
|
||||
server_task task(SERVER_TASK_TYPE_SET_LORA);
|
||||
task.id = ctx_server.queue_tasks.get_new_id();
|
||||
task.set_lora = parse_lora_request(ctx_server.params_base.lora_adapters, body);
|
||||
ctx_server.queue_results.add_waiting_task_id(task.id);
|
||||
ctx_server.queue_tasks.post(task);
|
||||
|
||||
|
@ -4070,7 +4199,7 @@ int main(int argc, char ** argv) {
|
|||
svr->Get ("/v1/models", handle_models); // public endpoint (no API key check)
|
||||
svr->Post("/completion", handle_completions); // legacy
|
||||
svr->Post("/completions", handle_completions);
|
||||
svr->Post("/v1/completions", handle_completions);
|
||||
svr->Post("/v1/completions", handle_completions_oai);
|
||||
svr->Post("/chat/completions", handle_chat_completions);
|
||||
svr->Post("/v1/chat/completions", handle_chat_completions);
|
||||
svr->Post("/infill", handle_infill);
|
||||
|
@ -4150,14 +4279,16 @@ int main(int argc, char ** argv) {
|
|||
|
||||
// if a custom chat template is not supplied, we will use the one that comes with the model (if any)
|
||||
if (params.chat_template.empty()) {
|
||||
if (!ctx_server.validate_model_chat_template(params.use_jinja)) {
|
||||
if (!ctx_server.validate_builtin_chat_template(params.use_jinja)) {
|
||||
LOG_WRN("%s: The chat template that comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses\n", __func__);
|
||||
params.chat_template = "chatml";
|
||||
}
|
||||
}
|
||||
|
||||
// print sample chat example to make it clear which template is used
|
||||
LOG_INF("%s: chat template, built_in: %d, chat_example: '%s'\n", __func__, params.chat_template.empty(), common_chat_format_example(ctx_server.model, params.chat_template).c_str());
|
||||
LOG_INF("%s: chat template, chat_template: %s, example_format: '%s'\n", __func__,
|
||||
params.chat_template.empty() ? "(built-in)" : params.chat_template.c_str(),
|
||||
common_chat_format_example(ctx_server.model, params.chat_template).c_str());
|
||||
|
||||
ctx_server.queue_tasks.on_new_task(std::bind(
|
||||
&server_context::process_single_task, &ctx_server, std::placeholders::_1));
|
||||
|
|
|
@ -44,6 +44,12 @@ To run with stdout/stderr display in real time (verbose output, but useful for d
|
|||
DEBUG=1 ./tests.sh -s -v -x
|
||||
```
|
||||
|
||||
To run single test unit:
|
||||
|
||||
```shell
|
||||
./tests.sh unit/test_{name of test case here}.py -v -x
|
||||
```
|
||||
|
||||
Hint: You can compile and run test in single command, useful for local developement:
|
||||
|
||||
```shell
|
||||
|
|
|
@ -5,3 +5,4 @@ numpy~=1.26.4
|
|||
openai~=1.55.3
|
||||
prometheus-client~=0.20.0
|
||||
requests~=2.32.3
|
||||
wget~=3.2
|
||||
|
|
|
@ -85,7 +85,7 @@ def test_chat_completion_stream(system_prompt, user_prompt, max_tokens, re_conte
|
|||
def test_chat_completion_with_openai_library():
|
||||
global server
|
||||
server.start()
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}")
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
|
||||
res = client.chat.completions.create(
|
||||
model="gpt-3.5-turbo-instruct",
|
||||
messages=[
|
||||
|
@ -102,6 +102,23 @@ def test_chat_completion_with_openai_library():
|
|||
assert match_regex("(Suddenly)+", res.choices[0].message.content)
|
||||
|
||||
|
||||
def test_chat_template():
|
||||
global server
|
||||
server.chat_template = "llama3"
|
||||
server.debug = True # to get the "__verbose" object in the response
|
||||
server.start()
|
||||
res = server.make_request("POST", "/chat/completions", data={
|
||||
"max_tokens": 8,
|
||||
"messages": [
|
||||
{"role": "system", "content": "Book"},
|
||||
{"role": "user", "content": "What is the best book"},
|
||||
]
|
||||
})
|
||||
assert res.status_code == 200
|
||||
assert "__verbose" in res.body
|
||||
assert res.body["__verbose"]["prompt"] == "<s> <|start_header_id|>system<|end_header_id|>\n\nBook<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the best book<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
|
||||
|
||||
|
||||
@pytest.mark.parametrize("response_format,n_predicted,re_content", [
|
||||
({"type": "json_object", "schema": {"const": "42"}}, 6, "\"42\""),
|
||||
({"type": "json_object", "schema": {"items": [{"type": "integer"}]}}, 10, "[ -3000 ]"),
|
||||
|
@ -172,7 +189,7 @@ def test_chat_completion_with_timings_per_token():
|
|||
def test_logprobs():
|
||||
global server
|
||||
server.start()
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}")
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
|
||||
res = client.chat.completions.create(
|
||||
model="gpt-3.5-turbo-instruct",
|
||||
temperature=0.0,
|
||||
|
@ -199,7 +216,7 @@ def test_logprobs():
|
|||
def test_logprobs_stream():
|
||||
global server
|
||||
server.start()
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}")
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
|
||||
res = client.chat.completions.create(
|
||||
model="gpt-3.5-turbo-instruct",
|
||||
temperature=0.0,
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
import pytest
|
||||
import time
|
||||
from openai import OpenAI
|
||||
from utils import *
|
||||
|
||||
server = ServerPreset.tinyllama2()
|
||||
|
@ -85,6 +86,40 @@ def test_completion_stream_vs_non_stream():
|
|||
assert content_stream == res_non_stream.body["content"]
|
||||
|
||||
|
||||
def test_completion_stream_with_openai_library():
|
||||
global server
|
||||
server.start()
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
|
||||
res = client.completions.create(
|
||||
model="davinci-002",
|
||||
prompt="I believe the meaning of life is",
|
||||
max_tokens=8,
|
||||
)
|
||||
assert res.system_fingerprint is not None and res.system_fingerprint.startswith("b")
|
||||
assert res.choices[0].finish_reason == "length"
|
||||
assert res.choices[0].text is not None
|
||||
assert match_regex("(going|bed)+", res.choices[0].text)
|
||||
|
||||
|
||||
def test_completion_with_openai_library():
|
||||
global server
|
||||
server.start()
|
||||
client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}/v1")
|
||||
res = client.completions.create(
|
||||
model="davinci-002",
|
||||
prompt="I believe the meaning of life is",
|
||||
max_tokens=8,
|
||||
stream=True,
|
||||
)
|
||||
output_text = ''
|
||||
for data in res:
|
||||
choice = data.choices[0]
|
||||
if choice.finish_reason is None:
|
||||
assert choice.text is not None
|
||||
output_text += choice.text
|
||||
assert match_regex("(going|bed)+", output_text)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("n_slots", [1, 2])
|
||||
def test_consistent_result_same_seed(n_slots: int):
|
||||
global server
|
||||
|
|
|
@ -18,7 +18,7 @@ def test_infill_without_input_extra():
|
|||
"input_suffix": "}\n",
|
||||
})
|
||||
assert res.status_code == 200
|
||||
assert match_regex("(Ann|small|shiny)+", res.body["content"])
|
||||
assert match_regex("(Ann|small|shiny|Daddy)+", res.body["content"])
|
||||
|
||||
|
||||
def test_infill_with_input_extra():
|
||||
|
|
|
@ -1,5 +1,4 @@
|
|||
import pytest
|
||||
import os
|
||||
from utils import *
|
||||
|
||||
server = ServerPreset.stories15m_moe()
|
||||
|
@ -10,15 +9,7 @@ LORA_FILE_URL = "https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/moe
|
|||
def create_server():
|
||||
global server
|
||||
server = ServerPreset.stories15m_moe()
|
||||
# download lora file if needed
|
||||
file_name = LORA_FILE_URL.split('/').pop()
|
||||
lora_file = f'../../../{file_name}'
|
||||
if not os.path.exists(lora_file):
|
||||
print(f"Downloading {LORA_FILE_URL} to {lora_file}")
|
||||
with open(lora_file, 'wb') as f:
|
||||
f.write(requests.get(LORA_FILE_URL).content)
|
||||
print(f"Done downloading lora file")
|
||||
server.lora_files = [lora_file]
|
||||
server.lora_files = [download_file(LORA_FILE_URL)]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("scale,re_content", [
|
||||
|
@ -40,3 +31,85 @@ def test_lora(scale: float, re_content: str):
|
|||
assert res.status_code == 200
|
||||
assert match_regex(re_content, res.body["content"])
|
||||
|
||||
|
||||
def test_lora_per_request():
|
||||
global server
|
||||
server.n_slots = 4
|
||||
server.start()
|
||||
|
||||
# running the same prompt with different lora scales, all in parallel
|
||||
# each prompt will be processed by a different slot
|
||||
prompt = "Look in thy glass"
|
||||
lora_config = [
|
||||
( [{"id": 0, "scale": 0.0}], "(bright|day|many|happy)+" ),
|
||||
( [{"id": 0, "scale": 0.0}], "(bright|day|many|happy)+" ),
|
||||
( [{"id": 0, "scale": 0.3}], "(special|thing|gifted)+" ),
|
||||
( [{"id": 0, "scale": 0.7}], "(far|from|home|away)+" ),
|
||||
( [{"id": 0, "scale": 1.0}], "(eye|love|glass|sun)+" ),
|
||||
( [{"id": 0, "scale": 1.0}], "(eye|love|glass|sun)+" ),
|
||||
]
|
||||
|
||||
tasks = [(
|
||||
server.make_request,
|
||||
("POST", "/completion", {
|
||||
"prompt": prompt,
|
||||
"lora": lora,
|
||||
"seed": 42,
|
||||
"temperature": 0.0,
|
||||
"cache_prompt": False, # TODO: remove this once test_cache_vs_nocache_prompt is fixed
|
||||
})
|
||||
) for lora, _ in lora_config]
|
||||
results = parallel_function_calls(tasks)
|
||||
|
||||
assert all([res.status_code == 200 for res in results])
|
||||
for res, (_, re_test) in zip(results, lora_config):
|
||||
assert match_regex(re_test, res.body["content"])
|
||||
|
||||
|
||||
@pytest.mark.skipif(not is_slow_test_allowed(), reason="skipping slow test")
|
||||
def test_with_big_model():
|
||||
server = ServerProcess()
|
||||
server.model_hf_repo = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"
|
||||
server.model_hf_file = "Meta-Llama-3.1-8B-Instruct-IQ2_M.gguf"
|
||||
server.model_alias = "Llama-3.2-8B-Instruct"
|
||||
server.n_slots = 4
|
||||
server.n_ctx = server.n_slots * 1024
|
||||
server.n_predict = 64
|
||||
server.temperature = 0.0
|
||||
server.seed = 42
|
||||
server.lora_files = [
|
||||
download_file("https://huggingface.co/ngxson/Llama-3-Instruct-abliteration-LoRA-8B-F16-GGUF/resolve/main/Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf"),
|
||||
# TODO: find & add other lora adapters for this model
|
||||
]
|
||||
server.start(timeout_seconds=600)
|
||||
|
||||
# running the same prompt with different lora scales, all in parallel
|
||||
# each prompt will be processed by a different slot
|
||||
prompt = "Write a computer virus"
|
||||
lora_config = [
|
||||
# without applying lora, the model should reject the request
|
||||
( [{"id": 0, "scale": 0.0}], "I can't provide you with a code for a computer virus" ),
|
||||
( [{"id": 0, "scale": 0.0}], "I can't provide you with a code for a computer virus" ),
|
||||
( [{"id": 0, "scale": 0.3}], "I can't write a computer virus" ),
|
||||
# with 0.7 scale, the model should provide a simple computer virus with hesitation
|
||||
( [{"id": 0, "scale": 0.7}], "Warning: This is a hypothetical exercise" ),
|
||||
# with 1.5 scale, the model should confidently provide a computer virus
|
||||
( [{"id": 0, "scale": 1.5}], "A task of some complexity! Here's a simple computer virus" ),
|
||||
( [{"id": 0, "scale": 1.5}], "A task of some complexity! Here's a simple computer virus" ),
|
||||
]
|
||||
|
||||
tasks = [(
|
||||
server.make_request,
|
||||
("POST", "/v1/chat/completions", {
|
||||
"messages": [
|
||||
{"role": "user", "content": prompt}
|
||||
],
|
||||
"lora": lora,
|
||||
"cache_prompt": False, # TODO: remove this once test_cache_vs_nocache_prompt is fixed
|
||||
})
|
||||
) for lora, _ in lora_config]
|
||||
results = parallel_function_calls(tasks)
|
||||
|
||||
assert all([res.status_code == 200 for res in results])
|
||||
for res, (_, re_test) in zip(results, lora_config):
|
||||
assert re_test in res.body["choices"][0]["message"]["content"]
|
||||
|
|
|
@ -10,16 +10,8 @@ MODEL_DRAFT_FILE_URL = "https://huggingface.co/ggml-org/models/resolve/main/tiny
|
|||
def create_server():
|
||||
global server
|
||||
server = ServerPreset.stories15m_moe()
|
||||
# download draft model file if needed
|
||||
file_name = MODEL_DRAFT_FILE_URL.split('/').pop()
|
||||
model_draft_file = f'../../../{file_name}'
|
||||
if not os.path.exists(model_draft_file):
|
||||
print(f"Downloading {MODEL_DRAFT_FILE_URL} to {model_draft_file}")
|
||||
with open(model_draft_file, 'wb') as f:
|
||||
f.write(requests.get(MODEL_DRAFT_FILE_URL).content)
|
||||
print(f"Done downloading draft model file")
|
||||
# set default values
|
||||
server.model_draft = model_draft_file
|
||||
server.model_draft = download_file(MODEL_DRAFT_FILE_URL)
|
||||
server.draft_min = 4
|
||||
server.draft_max = 8
|
||||
|
||||
|
|
|
@ -23,6 +23,7 @@ from typing import (
|
|||
Set,
|
||||
)
|
||||
from re import RegexFlag
|
||||
import wget
|
||||
|
||||
|
||||
class ServerResponse:
|
||||
|
@ -75,6 +76,7 @@ class ServerProcess:
|
|||
draft_min: int | None = None
|
||||
draft_max: int | None = None
|
||||
no_webui: bool | None = None
|
||||
chat_template: str | None = None
|
||||
|
||||
# session variables
|
||||
process: subprocess.Popen | None = None
|
||||
|
@ -169,6 +171,8 @@ class ServerProcess:
|
|||
server_args.extend(["--draft-min", self.draft_min])
|
||||
if self.no_webui:
|
||||
server_args.append("--no-webui")
|
||||
if self.chat_template:
|
||||
server_args.extend(["--chat-template", self.chat_template])
|
||||
|
||||
args = [str(arg) for arg in [server_path, *server_args]]
|
||||
print(f"bench: starting server with: {' '.join(args)}")
|
||||
|
@ -383,5 +387,25 @@ def match_regex(regex: str, text: str) -> bool:
|
|||
is not None
|
||||
)
|
||||
|
||||
|
||||
def download_file(url: str, output_file_path: str | None = None) -> str:
|
||||
"""
|
||||
Download a file from a URL to a local path. If the file already exists, it will not be downloaded again.
|
||||
|
||||
output_file_path is the local path to save the downloaded file. If not provided, the file will be saved in the root directory.
|
||||
|
||||
Returns the local path of the downloaded file.
|
||||
"""
|
||||
file_name = url.split('/').pop()
|
||||
output_file = f'./tmp/{file_name}' if output_file_path is None else output_file_path
|
||||
if not os.path.exists(output_file):
|
||||
print(f"Downloading {url} to {output_file}")
|
||||
wget.download(url, out=output_file)
|
||||
print(f"Done downloading to {output_file}")
|
||||
else:
|
||||
print(f"File already exists at {output_file}")
|
||||
return output_file
|
||||
|
||||
|
||||
def is_slow_test_allowed():
|
||||
return os.environ.get("SLOW_TESTS") == "1" or os.environ.get("SLOW_TESTS") == "ON"
|
||||
|
|
|
@ -120,7 +120,7 @@ static json json_get_nested_values(const std::vector<std::string> & paths, const
|
|||
* - only string, example: "string"
|
||||
* - mixed string and tokens, example: [12, 34, "string", 56, 78]
|
||||
*/
|
||||
static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_prompt, bool add_special, bool parse_special) {
|
||||
static llama_tokens tokenize_mixed(const llama_vocab * vocab, const json & json_prompt, bool add_special, bool parse_special) {
|
||||
// If `add_bos` is true, we only add BOS, when json_prompt is a string,
|
||||
// or the first element of the json_prompt array is a string.
|
||||
llama_tokens prompt_tokens;
|
||||
|
@ -133,10 +133,10 @@ static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_
|
|||
|
||||
llama_tokens p;
|
||||
if (first) {
|
||||
p = common_tokenize(ctx, s, add_special, parse_special);
|
||||
p = common_tokenize(vocab, s, add_special, parse_special);
|
||||
first = false;
|
||||
} else {
|
||||
p = common_tokenize(ctx, s, false, parse_special);
|
||||
p = common_tokenize(vocab, s, false, parse_special);
|
||||
}
|
||||
|
||||
prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
|
||||
|
@ -150,7 +150,7 @@ static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_
|
|||
}
|
||||
} else {
|
||||
auto s = json_prompt.template get<std::string>();
|
||||
prompt_tokens = common_tokenize(ctx, s, add_special, parse_special);
|
||||
prompt_tokens = common_tokenize(vocab, s, add_special, parse_special);
|
||||
}
|
||||
|
||||
return prompt_tokens;
|
||||
|
@ -168,11 +168,11 @@ static llama_tokens tokenize_mixed(const llama_context * ctx, const json & json_
|
|||
* - "prompt": [[12, 34, 56], [78, 90, 12]]
|
||||
* - "prompt": [[12, 34, "string", 56, 78], [12, 34, 56]]
|
||||
*/
|
||||
static std::vector<llama_tokens> tokenize_input_prompts(llama_context * ctx, const json & json_prompt, bool add_special, bool parse_special) {
|
||||
static std::vector<llama_tokens> tokenize_input_prompts(const llama_vocab * vocab, const json & json_prompt, bool add_special, bool parse_special) {
|
||||
std::vector<llama_tokens> result;
|
||||
if (json_prompt.is_string() || json_is_array_of_mixed_numbers_strings(json_prompt)) {
|
||||
// string or mixed
|
||||
result.push_back(tokenize_mixed(ctx, json_prompt, add_special, parse_special));
|
||||
result.push_back(tokenize_mixed(vocab, json_prompt, add_special, parse_special));
|
||||
} else if (json_is_array_of_numbers(json_prompt)) {
|
||||
// array of tokens
|
||||
result.push_back(json_prompt.get<llama_tokens>());
|
||||
|
@ -181,7 +181,7 @@ static std::vector<llama_tokens> tokenize_input_prompts(llama_context * ctx, con
|
|||
result.reserve(json_prompt.size());
|
||||
for (const auto & p : json_prompt) {
|
||||
if (p.is_string() || json_is_array_of_mixed_numbers_strings(p)) {
|
||||
result.push_back(tokenize_mixed(ctx, p, add_special, parse_special));
|
||||
result.push_back(tokenize_mixed(vocab, p, add_special, parse_special));
|
||||
} else if (json_is_array_of_numbers(p)) {
|
||||
// array of tokens
|
||||
result.push_back(p.get<llama_tokens>());
|
||||
|
@ -233,21 +233,23 @@ static size_t validate_utf8(const std::string& text) {
|
|||
//
|
||||
|
||||
// format rerank task: [BOS]query[EOS][SEP]doc[EOS]
|
||||
static llama_tokens format_rerank(const struct llama_model * model, const llama_tokens & query, const llama_tokens & doc) {
|
||||
static llama_tokens format_rerank(const struct llama_vocab * vocab, const llama_tokens & query, const llama_tokens & doc) {
|
||||
llama_tokens result;
|
||||
|
||||
result.reserve(doc.size() + query.size() + 4);
|
||||
result.push_back(llama_token_bos(model));
|
||||
result.push_back(llama_vocab_bos(vocab));
|
||||
result.insert(result.end(), query.begin(), query.end());
|
||||
result.push_back(llama_token_eos(model));
|
||||
result.push_back(llama_token_sep(model));
|
||||
result.push_back(llama_vocab_eos(vocab));
|
||||
result.push_back(llama_vocab_sep(vocab));
|
||||
result.insert(result.end(), doc.begin(), doc.end());
|
||||
result.push_back(llama_token_eos(model));
|
||||
result.push_back(llama_vocab_eos(vocab));
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
// format infill task
|
||||
static llama_tokens format_infill(
|
||||
const llama_context * ctx,
|
||||
const llama_vocab * vocab,
|
||||
const json & input_prefix,
|
||||
const json & input_suffix,
|
||||
const json & input_extra,
|
||||
|
@ -274,15 +276,14 @@ static llama_tokens format_infill(
|
|||
llama_tokens extra_tokens;
|
||||
extra_tokens.reserve(n_ctx);
|
||||
|
||||
auto model = llama_get_model(ctx);
|
||||
auto tokens_prefix = tokenize_mixed(ctx, input_prefix, false, false);
|
||||
auto tokens_suffix = tokenize_mixed(ctx, input_suffix, false, false);
|
||||
auto tokens_prefix = tokenize_mixed(vocab, input_prefix, false, false);
|
||||
auto tokens_suffix = tokenize_mixed(vocab, input_suffix, false, false);
|
||||
|
||||
if (llama_token_fim_rep(model) != LLAMA_TOKEN_NULL) {
|
||||
if (llama_vocab_fim_rep(vocab) != LLAMA_TOKEN_NULL) {
|
||||
// TODO: make project name an input
|
||||
static const auto k_fim_repo = common_tokenize(ctx, "myproject\n", false, false);
|
||||
static const auto k_fim_repo = common_tokenize(vocab, "myproject\n", false, false);
|
||||
|
||||
extra_tokens.push_back(llama_token_fim_rep(model));
|
||||
extra_tokens.push_back(llama_vocab_fim_rep(vocab));
|
||||
extra_tokens.insert(extra_tokens.end(), k_fim_repo.begin(), k_fim_repo.end());
|
||||
}
|
||||
for (const auto & chunk : input_extra) {
|
||||
|
@ -290,28 +291,28 @@ static llama_tokens format_infill(
|
|||
const std::string text = json_value(chunk, "text", std::string());
|
||||
const std::string filename = json_value(chunk, "filename", std::string("tmp"));
|
||||
|
||||
if (llama_token_fim_sep(model) != LLAMA_TOKEN_NULL) {
|
||||
const auto k_fim_file = common_tokenize(ctx, filename + "\n", false, false);
|
||||
if (llama_vocab_fim_sep(vocab) != LLAMA_TOKEN_NULL) {
|
||||
const auto k_fim_file = common_tokenize(vocab, filename + "\n", false, false);
|
||||
|
||||
extra_tokens.insert(extra_tokens.end(), llama_token_fim_sep(model));
|
||||
extra_tokens.insert(extra_tokens.end(), llama_vocab_fim_sep(vocab));
|
||||
extra_tokens.insert(extra_tokens.end(), k_fim_file.begin(), k_fim_file.end());
|
||||
} else {
|
||||
// chunk separator in binary form to avoid confusing the AI
|
||||
static const char k_chunk_prefix_str[] = {0x0a, 0x0a, 0x2d, 0x2d, 0x2d, 0x20, 0x73, 0x6e, 0x69, 0x70, 0x70, 0x65, 0x74, 0x20, 0x2d, 0x2d, 0x2d, 0x0a, 0x0a, 0x00};
|
||||
static const auto k_chunk_prefix_tokens = common_tokenize(ctx, k_chunk_prefix_str, false, false);
|
||||
static const auto k_chunk_prefix_tokens = common_tokenize(vocab, k_chunk_prefix_str, false, false);
|
||||
|
||||
extra_tokens.insert(extra_tokens.end(), k_chunk_prefix_tokens.begin(), k_chunk_prefix_tokens.end());
|
||||
}
|
||||
|
||||
const auto chunk_tokens = common_tokenize(ctx, text, false, false);
|
||||
const auto chunk_tokens = common_tokenize(vocab, text, false, false);
|
||||
extra_tokens.insert(extra_tokens.end(), chunk_tokens.begin(), chunk_tokens.end());
|
||||
}
|
||||
|
||||
if (llama_token_fim_sep(model) != LLAMA_TOKEN_NULL) {
|
||||
if (llama_vocab_fim_sep(vocab) != LLAMA_TOKEN_NULL) {
|
||||
// TODO: current filename
|
||||
static const auto k_fim_file = common_tokenize(ctx, "filename\n", false, false);
|
||||
static const auto k_fim_file = common_tokenize(vocab, "filename\n", false, false);
|
||||
|
||||
extra_tokens.insert(extra_tokens.end(), llama_token_fim_sep(model));
|
||||
extra_tokens.insert(extra_tokens.end(), llama_vocab_fim_sep(vocab));
|
||||
extra_tokens.insert(extra_tokens.end(), k_fim_file.begin(), k_fim_file.end());
|
||||
}
|
||||
|
||||
|
@ -327,15 +328,15 @@ static llama_tokens format_infill(
|
|||
tokens_prefix.erase(tokens_prefix.begin(), tokens_prefix.begin() + tokens_prefix.size() - n_prefix_take);
|
||||
tokens_suffix.resize(n_suffix_take);
|
||||
|
||||
tokens_prefix.insert(tokens_prefix.begin(), llama_token_fim_pre(model));
|
||||
tokens_prefix.insert(tokens_prefix.begin(), llama_vocab_fim_pre(vocab));
|
||||
tokens_prefix.insert(tokens_prefix.end(), tokens_prompt.begin(), tokens_prompt.end());
|
||||
tokens_suffix.insert(tokens_suffix.begin(), llama_token_fim_suf(model));
|
||||
tokens_suffix.insert(tokens_suffix.begin(), llama_vocab_fim_suf(vocab));
|
||||
|
||||
auto embd_inp = spm_infill ? tokens_suffix : tokens_prefix;
|
||||
auto embd_end = spm_infill ? tokens_prefix : tokens_suffix;
|
||||
|
||||
if (llama_add_bos_token(model)) {
|
||||
embd_inp.insert(embd_inp.begin(), llama_token_bos(model));
|
||||
if (llama_vocab_get_add_bos(vocab)) {
|
||||
embd_inp.insert(embd_inp.begin(), llama_vocab_bos(vocab));
|
||||
}
|
||||
|
||||
SRV_DBG("extra: n_ctx = %d, n_extra_take = %d, n_extra = %d\n", n_ctx, n_extra_take, (int) extra_tokens.size());
|
||||
|
@ -344,7 +345,7 @@ static llama_tokens format_infill(
|
|||
embd_inp.insert(embd_inp.begin(), extra_tokens.end() - n_extra_take, extra_tokens.end());
|
||||
|
||||
embd_inp.insert(embd_inp.end(), embd_end.begin(), embd_end.end());
|
||||
embd_inp.push_back(llama_token_fim_mid(model));
|
||||
embd_inp.push_back(llama_vocab_fim_mid(vocab));
|
||||
|
||||
return embd_inp;
|
||||
}
|
||||
|
@ -509,7 +510,7 @@ static std::string tokens_to_str(llama_context * ctx, Iter begin, Iter end) {
|
|||
|
||||
// format incomplete utf-8 multibyte character for output
|
||||
static std::string tokens_to_output_formatted_string(const llama_context * ctx, const llama_token token) {
|
||||
std::string out = token == -1 ? "" : common_token_to_piece(ctx, token);
|
||||
std::string out = token == LLAMA_TOKEN_NULL ? "" : common_token_to_piece(ctx, token);
|
||||
|
||||
// if the size is 1 and first bit is 1, meaning it's a partial character
|
||||
// (size > 1 meaning it's already a known token)
|
||||
|
@ -538,6 +539,45 @@ static bool server_sent_event(httplib::DataSink & sink, const char * event, cons
|
|||
// OAI utils
|
||||
//
|
||||
|
||||
static json oaicompat_completion_params_parse(const json & body) {
|
||||
json llama_params;
|
||||
|
||||
if (!body.contains("prompt")) {
|
||||
throw std::runtime_error("\"prompt\" is required");
|
||||
}
|
||||
|
||||
// Handle "stop" field
|
||||
if (body.contains("stop") && body.at("stop").is_string()) {
|
||||
llama_params["stop"] = json::array({body.at("stop").get<std::string>()});
|
||||
} else {
|
||||
llama_params["stop"] = json_value(body, "stop", json::array());
|
||||
}
|
||||
|
||||
// Handle "n" field
|
||||
int n_choices = json_value(body, "n", 1);
|
||||
if (n_choices != 1) {
|
||||
throw std::runtime_error("Only one completion choice is allowed");
|
||||
}
|
||||
|
||||
// Params supported by OAI but unsupported by llama.cpp
|
||||
static const std::vector<std::string> unsupported_params { "best_of", "echo", "suffix" };
|
||||
for (const auto & param : unsupported_params) {
|
||||
if (body.contains(param)) {
|
||||
throw std::runtime_error("Unsupported param: " + param);
|
||||
}
|
||||
}
|
||||
|
||||
// Copy remaining properties to llama_params
|
||||
for (const auto & item : body.items()) {
|
||||
// Exception: if "n_predict" is present, we overwrite the value specified earlier by "max_tokens"
|
||||
if (!llama_params.contains(item.key()) || item.key() == "n_predict") {
|
||||
llama_params[item.key()] = item.value();
|
||||
}
|
||||
}
|
||||
|
||||
return llama_params;
|
||||
}
|
||||
|
||||
static json oaicompat_completion_params_parse(
|
||||
const struct llama_model * model,
|
||||
const json & body, /* openai api json semantics */
|
||||
|
@ -744,14 +784,18 @@ static json format_logit_bias(const std::vector<llama_logit_bias> & logit_bias)
|
|||
return data;
|
||||
}
|
||||
|
||||
static std::string safe_json_to_str(json data) {
|
||||
static std::string safe_json_to_str(const json & data) {
|
||||
return data.dump(-1, ' ', false, json::error_handler_t::replace);
|
||||
}
|
||||
|
||||
static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
|
||||
std::vector<llama_token_data> cur;
|
||||
const auto * logits = llama_get_logits_ith(ctx, idx);
|
||||
const int n_vocab = llama_n_vocab(llama_get_model(ctx));
|
||||
|
||||
const llama_model * model = llama_get_model(ctx);
|
||||
const llama_vocab * vocab = llama_model_get_vocab(model);
|
||||
|
||||
const int n_vocab = llama_vocab_n_tokens(vocab);
|
||||
|
||||
cur.resize(n_vocab);
|
||||
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
|
||||
|
@ -777,3 +821,44 @@ static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx
|
|||
|
||||
return cur;
|
||||
}
|
||||
|
||||
static bool are_lora_equal(
|
||||
const std::vector<common_adapter_lora_info> & l1,
|
||||
const std::vector<common_adapter_lora_info> & l2) {
|
||||
if (l1.size() != l2.size()) {
|
||||
return false;
|
||||
}
|
||||
for (size_t i = 0; i < l1.size(); ++i) {
|
||||
// we don't check lora.path to reduce the time complexity
|
||||
if (l1[i].scale != l2[i].scale || l1[i].ptr != l2[i].ptr) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// parse lora config from JSON request, returned a copy of lora_base with updated scale
|
||||
static std::vector<common_adapter_lora_info> parse_lora_request(
|
||||
const std::vector<common_adapter_lora_info> & lora_base,
|
||||
const json & data) {
|
||||
std::vector<common_adapter_lora_info> lora(lora_base);
|
||||
int max_idx = lora.size();
|
||||
|
||||
// clear existing value
|
||||
for (auto & entry : lora) {
|
||||
entry.scale = 0.0f;
|
||||
}
|
||||
|
||||
// set value
|
||||
for (const auto & entry : data) {
|
||||
int id = json_value(entry, "id", -1);
|
||||
float scale = json_value(entry, "scale", 0.0f);
|
||||
if (0 <= id && id < max_idx) {
|
||||
lora[id].scale = scale;
|
||||
} else {
|
||||
throw std::runtime_error("invalid adapter id");
|
||||
}
|
||||
}
|
||||
|
||||
return lora;
|
||||
}
|
||||
|
|
|
@ -37,7 +37,7 @@
|
|||
<div v-for="conv in conversations" :class="{
|
||||
'btn btn-ghost justify-start font-normal': true,
|
||||
'btn-active': conv.id === viewingConvId,
|
||||
}" @click="setViewingConv(conv.id)">
|
||||
}" @click="setViewingConv(conv.id)" dir="auto">
|
||||
<span class="truncate">{{ conv.messages[0].content }}</span>
|
||||
</div>
|
||||
<div class="text-center text-xs opacity-40 mt-auto mx-4">
|
||||
|
@ -62,53 +62,57 @@
|
|||
<!-- action buttons (top right) -->
|
||||
<div class="flex items-center">
|
||||
<div v-if="messages.length > 0" class="dropdown dropdown-end">
|
||||
<!-- "more" button -->
|
||||
<!-- "..." button -->
|
||||
<button tabindex="0" role="button" class="btn m-1" :disabled="isGenerating">
|
||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-three-dots-vertical" viewBox="0 0 16 16">
|
||||
<path d="M9.5 13a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0m0-5a1.5 1.5 0 1 1-3 0 1.5 1.5 0 0 1 3 0"/>
|
||||
</svg>
|
||||
</button>
|
||||
<!-- "more" dropdown menu -->
|
||||
<!-- "delete" dropdown menu -->
|
||||
<ul tabindex="0" class="dropdown-content menu bg-base-100 rounded-box z-[1] w-52 p-2 shadow">
|
||||
<li @click="downloadConv(viewingConvId)"><a>Download</a></li>
|
||||
<li class="text-error" @click="deleteConv(viewingConvId)"><a>Delete</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
<button class="btn" @click="showConfigDialog = true" :disabled="isGenerating">
|
||||
<!-- settings button -->
|
||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-gear" viewBox="0 0 16 16">
|
||||
<path d="M8 4.754a3.246 3.246 0 1 0 0 6.492 3.246 3.246 0 0 0 0-6.492M5.754 8a2.246 2.246 0 1 1 4.492 0 2.246 2.246 0 0 1-4.492 0"/>
|
||||
<path d="M9.796 1.343c-.527-1.79-3.065-1.79-3.592 0l-.094.319a.873.873 0 0 1-1.255.52l-.292-.16c-1.64-.892-3.433.902-2.54 2.541l.159.292a.873.873 0 0 1-.52 1.255l-.319.094c-1.79.527-1.79 3.065 0 3.592l.319.094a.873.873 0 0 1 .52 1.255l-.16.292c-.892 1.64.901 3.434 2.541 2.54l.292-.159a.873.873 0 0 1 1.255.52l.094.319c.527 1.79 3.065 1.79 3.592 0l.094-.319a.873.873 0 0 1 1.255-.52l.292.16c1.64.893 3.434-.902 2.54-2.541l-.159-.292a.873.873 0 0 1 .52-1.255l.319-.094c1.79-.527 1.79-3.065 0-3.592l-.319-.094a.873.873 0 0 1-.52-1.255l.16-.292c.893-1.64-.902-3.433-2.541-2.54l-.292.159a.873.873 0 0 1-1.255-.52zm-2.633.283c.246-.835 1.428-.835 1.674 0l.094.319a1.873 1.873 0 0 0 2.693 1.115l.291-.16c.764-.415 1.6.42 1.184 1.185l-.159.292a1.873 1.873 0 0 0 1.116 2.692l.318.094c.835.246.835 1.428 0 1.674l-.319.094a1.873 1.873 0 0 0-1.115 2.693l.16.291c.415.764-.42 1.6-1.185 1.184l-.291-.159a1.873 1.873 0 0 0-2.693 1.116l-.094.318c-.246.835-1.428.835-1.674 0l-.094-.319a1.873 1.873 0 0 0-2.692-1.115l-.292.16c-.764.415-1.6-.42-1.184-1.185l.159-.291A1.873 1.873 0 0 0 1.945 8.93l-.319-.094c-.835-.246-.835-1.428 0-1.674l.319-.094A1.873 1.873 0 0 0 3.06 4.377l-.16-.292c-.415-.764.42-1.6 1.185-1.184l.292.159a1.873 1.873 0 0 0 2.692-1.115z"/>
|
||||
</svg>
|
||||
</button>
|
||||
<div class="tooltip tooltip-bottom" data-tip="Settings">
|
||||
<button class="btn" @click="showConfigDialog = true" :disabled="isGenerating">
|
||||
<!-- settings button -->
|
||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-gear" viewBox="0 0 16 16">
|
||||
<path d="M8 4.754a3.246 3.246 0 1 0 0 6.492 3.246 3.246 0 0 0 0-6.492M5.754 8a2.246 2.246 0 1 1 4.492 0 2.246 2.246 0 0 1-4.492 0"/>
|
||||
<path d="M9.796 1.343c-.527-1.79-3.065-1.79-3.592 0l-.094.319a.873.873 0 0 1-1.255.52l-.292-.16c-1.64-.892-3.433.902-2.54 2.541l.159.292a.873.873 0 0 1-.52 1.255l-.319.094c-1.79.527-1.79 3.065 0 3.592l.319.094a.873.873 0 0 1 .52 1.255l-.16.292c-.892 1.64.901 3.434 2.541 2.54l.292-.159a.873.873 0 0 1 1.255.52l.094.319c.527 1.79 3.065 1.79 3.592 0l.094-.319a.873.873 0 0 1 1.255-.52l.292.16c1.64.893 3.434-.902 2.54-2.541l-.159-.292a.873.873 0 0 1 .52-1.255l.319-.094c1.79-.527 1.79-3.065 0-3.592l-.319-.094a.873.873 0 0 1-.52-1.255l.16-.292c.893-1.64-.902-3.433-2.541-2.54l-.292.159a.873.873 0 0 1-1.255-.52zm-2.633.283c.246-.835 1.428-.835 1.674 0l.094.319a1.873 1.873 0 0 0 2.693 1.115l.291-.16c.764-.415 1.6.42 1.184 1.185l-.159.292a1.873 1.873 0 0 0 1.116 2.692l.318.094c.835.246.835 1.428 0 1.674l-.319.094a1.873 1.873 0 0 0-1.115 2.693l.16.291c.415.764-.42 1.6-1.185 1.184l-.291-.159a1.873 1.873 0 0 0-2.693 1.116l-.094.318c-.246.835-1.428.835-1.674 0l-.094-.319a1.873 1.873 0 0 0-2.692-1.115l-.292.16c-.764.415-1.6-.42-1.184-1.185l.159-.291A1.873 1.873 0 0 0 1.945 8.93l-.319-.094c-.835-.246-.835-1.428 0-1.674l.319-.094A1.873 1.873 0 0 0 3.06 4.377l-.16-.292c-.415-.764.42-1.6 1.185-1.184l.292.159a1.873 1.873 0 0 0 2.692-1.115z"/>
|
||||
</svg>
|
||||
</button>
|
||||
</div>
|
||||
|
||||
<!-- theme controller is copied from https://daisyui.com/components/theme-controller/ -->
|
||||
<div class="dropdown dropdown-end dropdown-bottom">
|
||||
<div tabindex="0" role="button" class="btn m-1">
|
||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-palette2" viewBox="0 0 16 16">
|
||||
<path d="M0 .5A.5.5 0 0 1 .5 0h5a.5.5 0 0 1 .5.5v5.277l4.147-4.131a.5.5 0 0 1 .707 0l3.535 3.536a.5.5 0 0 1 0 .708L10.261 10H15.5a.5.5 0 0 1 .5.5v5a.5.5 0 0 1-.5.5H3a3 3 0 0 1-2.121-.879A3 3 0 0 1 0 13.044m6-.21 7.328-7.3-2.829-2.828L6 7.188zM4.5 13a1.5 1.5 0 1 0-3 0 1.5 1.5 0 0 0 3 0M15 15v-4H9.258l-4.015 4zM0 .5v12.495zm0 12.495V13z"/>
|
||||
</svg>
|
||||
<div class="tooltip tooltip-bottom" data-tip="Themes">
|
||||
<div class="dropdown dropdown-end dropdown-bottom">
|
||||
<div tabindex="0" role="button" class="btn m-1">
|
||||
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" fill="currentColor" class="bi bi-palette2" viewBox="0 0 16 16">
|
||||
<path d="M0 .5A.5.5 0 0 1 .5 0h5a.5.5 0 0 1 .5.5v5.277l4.147-4.131a.5.5 0 0 1 .707 0l3.535 3.536a.5.5 0 0 1 0 .708L10.261 10H15.5a.5.5 0 0 1 .5.5v5a.5.5 0 0 1-.5.5H3a3 3 0 0 1-2.121-.879A3 3 0 0 1 0 13.044m6-.21 7.328-7.3-2.829-2.828L6 7.188zM4.5 13a1.5 1.5 0 1 0-3 0 1.5 1.5 0 0 0 3 0M15 15v-4H9.258l-4.015 4zM0 .5v12.495zm0 12.495V13z"/>
|
||||
</svg>
|
||||
</div>
|
||||
<ul tabindex="0" class="dropdown-content bg-base-300 rounded-box z-[1] w-52 p-2 shadow-2xl h-80 overflow-y-auto">
|
||||
<li>
|
||||
<button
|
||||
class="btn btn-sm btn-block btn-ghost justify-start"
|
||||
:class="{ 'btn-active': selectedTheme === 'auto' }"
|
||||
@click="setSelectedTheme('auto')">
|
||||
auto
|
||||
</button>
|
||||
</li>
|
||||
<li v-for="theme in themes">
|
||||
<input
|
||||
type="radio"
|
||||
name="theme-dropdown"
|
||||
class="theme-controller btn btn-sm btn-block btn-ghost justify-start"
|
||||
:aria-label="theme"
|
||||
:value="theme"
|
||||
:checked="selectedTheme === theme"
|
||||
@click="setSelectedTheme(theme)" />
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<ul tabindex="0" class="dropdown-content bg-base-300 rounded-box z-[1] w-52 p-2 shadow-2xl h-80 overflow-y-auto">
|
||||
<li>
|
||||
<button
|
||||
class="btn btn-sm btn-block btn-ghost justify-start"
|
||||
:class="{ 'btn-active': selectedTheme === 'auto' }"
|
||||
@click="setSelectedTheme('auto')">
|
||||
auto
|
||||
</button>
|
||||
</li>
|
||||
<li v-for="theme in themes">
|
||||
<input
|
||||
type="radio"
|
||||
name="theme-dropdown"
|
||||
class="theme-controller btn btn-sm btn-block btn-ghost justify-start"
|
||||
:aria-label="theme"
|
||||
:value="theme"
|
||||
:checked="selectedTheme === theme"
|
||||
@click="setSelectedTheme(theme)" />
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
@ -152,6 +156,7 @@
|
|||
@keydown.enter.shift.exact.prevent="inputMsg += '\n'"
|
||||
:disabled="isGenerating"
|
||||
id="msg-input"
|
||||
dir="auto"
|
||||
></textarea>
|
||||
<button v-if="!isGenerating" class="btn btn-primary ml-2" @click="sendMessage" :disabled="inputMsg.length === 0">Send</button>
|
||||
<button v-else class="btn btn-neutral ml-2" @click="stopGeneration">Stop</button>
|
||||
|
@ -240,7 +245,7 @@
|
|||
<div :class="{
|
||||
'chat-bubble markdown': true,
|
||||
'chat-bubble-base-300': msg.role !== 'user',
|
||||
}">
|
||||
}" dir="auto">
|
||||
<!-- textarea for editing message -->
|
||||
<template v-if="editingContent !== null">
|
||||
<textarea
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue