server: tests: download model from HF, add batch size

Pierrick HYMBERT 2024-03-02 13:01:57 +01:00
parent 1780d9601d
commit 319ded7dde
9 changed files with 201 additions and 76 deletions

.github/workflows/server.yml

@@ -70,12 +70,6 @@ jobs:
       run: |
         pip install -r examples/server/tests/requirements.txt
-    - name: Download models
-      id: download_models
-      run: |
-        cd examples/server/tests
-        ../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf
     - name: Tests
       id: server_integration_test
       run: |

examples/server/tests/README.md

@@ -1,22 +1,30 @@
 # Server tests

-Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
-* [issues.feature](./features/issues.feature) Pending issues scenario
-* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
-* [security.feature](./features/security.feature) Security, CORS and API Key
-* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
+Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development)
+and [behave](https://behave.readthedocs.io/en/latest/):
+
+* [issues.feature](./features/issues.feature) Pending issues scenario
+* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
+* [security.feature](./features/security.feature) Security, CORS and API Key
+* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...

 Tests target GitHub workflows job runners with 4 vCPU.

-Requests are using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) based http client.
+Requests are
+using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html)
+based http client.

-Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. To mitigate it, you can increase values in `n_predict`, `kv_size`.
+Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail.
+To mitigate it, you can increase values in `n_predict`, `kv_size`.

 ### Install dependencies

 `pip install -r requirements.txt`

 ### Run tests

 1. Build the server
 ```shell
 cd ../../..
 mkdir build
@@ -24,24 +32,36 @@ cd build
 cmake ../
 cmake --build . --target server
 ```
-2. download required models:
-   1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
-3. Start the test: `./tests.sh`
+2. Start the test: `./tests.sh`

 It's possible to override some scenario steps values with environment variables:
-- `PORT` -> `context.server_port` to set the listening port of the server during scenario, default: `8080`
-- `LLAMA_SERVER_BIN_PATH` -> to change the server binary path, default: `../../../build/bin/server`
-- `DEBUG` -> "ON" to enable steps and server verbose mode `--verbose`
-- `SERVER_LOG_FORMAT_JSON` -> if set switch server logs to json format
+
+| variable                 | description                                                                                     |
+|--------------------------|-------------------------------------------------------------------------------------------------|
+| `PORT`                   | `context.server_port` to set the listening port of the server during scenario, default: `8080` |
+| `LLAMA_SERVER_BIN_PATH`  | to change the server binary path, default: `../../../build/bin/server`                         |
+| `DEBUG`                  | "ON" to enable steps and server verbose mode `--verbose`                                        |
+| `SERVER_LOG_FORMAT_JSON` | if set switch server logs to json format                                                        |
+| `N_GPU_LAYERS`           | number of model layers to offload to VRAM `-ngl --n-gpu-layers`                                 |

 ### Run @bug, @wip or @wrong_usage annotated scenario

 Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope.
 - `@bug` annotation aims to link a scenario with a GitHub issue.
 - `@wrong_usage` are meant to show user issue that are actually an expected behavior
 - `@wip` to focus on a scenario working in progress
+- `@slow` heavy test, disabled by default

 To run a scenario annotated with `@bug`, start:
-`DEBUG=ON ./tests.sh --no-skipped --tags bug`
+
+```shell
+DEBUG=ON ./tests.sh --no-skipped --tags bug
+```

 After changing logic in `steps.py`, ensure that `@bug` and `@wrong_usage` scenario are updated.
+
+```shell
+./tests.sh --no-skipped --tags bug,wrong_usage || echo "should failed but compile"
+```

examples/server/tests/features/parallel.feature

@@ -1,11 +1,12 @@
 @llama.cpp
+@parallel
 Feature: Parallel

   Background: Server startup
     Given a server listening on localhost:8080
-    And   a model file stories260K.gguf
-    And   a model alias tinyllama-2
+    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     And   42 as server seed
+    And   512 as batch size
     And   64 KV cache size
     And   2 slots
     And   embeddings extraction

examples/server/tests/features/security.feature

@@ -1,9 +1,10 @@
 @llama.cpp
+@security
 Feature: Security

   Background: Server startup with an api key defined
     Given a server listening on localhost:8080
-    And   a model file stories260K.gguf
+    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     And   a server api key llama.cpp
     Then  the server is starting
     Then  the server is healthy

examples/server/tests/features/server.feature

@@ -1,15 +1,17 @@
 @llama.cpp
+@server
 Feature: llama.cpp server

   Background: Server startup
     Given a server listening on localhost:8080
-    And   a model file stories260K.gguf
+    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     And   a model alias tinyllama-2
     And   42 as server seed
     # KV Cache corresponds to the total amount of tokens
     # that can be stored across all independent sequences: #4130
     # see --ctx-size and #5568
     And   32 KV cache size
+    And   512 as batch size
     And   1 slots
     And   embeddings extraction
     And   32 server max tokens to predict

@@ -85,4 +87,5 @@ Feature: llama.cpp server
   Scenario: Models available
     Given available models
     Then  1 models are supported
-    Then  model 0 is tinyllama-2
+    Then  model 0 is identified by tinyllama-2
+    Then  model 0 is trained on 128 tokens context

examples/server/tests/features/steps/steps.py

@@ -13,6 +13,7 @@ import aiohttp
 import openai
 from behave import step
 from behave.api.async_step import async_run_until_complete
+from huggingface_hub import hf_hub_download
 from prometheus_client import parser

@@ -26,17 +27,21 @@ def step_server_config(context, server_fqdn, server_port):
     context.base_url = f'http://{context.server_fqdn}:{context.server_port}'

-    context.debug = 'DEBUG' in os.environ and os.environ['DEBUG'] == 'ON'
     context.model_alias = None
+    context.n_batch = None
     context.n_ctx = None
+    context.n_gpu_layer = None
     context.n_predict = None
     context.n_server_predict = None
     context.n_slots = None
+    context.prompt_prefix = None
+    context.prompt_suffix = None
     context.server_api_key = None
     context.server_continuous_batching = False
     context.server_embeddings = False
     context.server_metrics = False
     context.server_process = None
+    context.seed = None
     context.server_seed = None
     context.user_api_key = None
@@ -45,9 +50,11 @@ def step_server_config(context, server_fqdn, server_port):
     context.prompts = []


-@step(u'a model file {model_file}')
-def step_model_file(context, model_file):
-    context.model_file = model_file
+@step(u'a model file {hf_file} from HF repo {hf_repo}')
+def step_download_hf_model(context, hf_file, hf_repo):
+    context.model_file = hf_hub_download(repo_id=hf_repo, filename=hf_file)
+    if context.debug:
+        print(f"model file: {context.model_file}\n")


 @step(u'a model alias {model_alias}')
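For context on the new step above: `hf_hub_download` resolves the requested file into the local Hugging Face cache and returns its path, so the GGUF is only fetched once and later runs reuse the cached copy. A hedged, standalone sketch using the repo and file referenced by the feature files:

```python
from huggingface_hub import hf_hub_download

# First call downloads tinyllamas/stories260K.gguf from ggml-org/models;
# subsequent calls return the cached local path immediately.
model_file = hf_hub_download(repo_id="ggml-org/models",
                             filename="tinyllamas/stories260K.gguf")
print(model_file)  # local path handed to the server via --model
```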
@@ -55,24 +62,40 @@ def step_model_alias(context, model_alias):
     context.model_alias = model_alias


-@step(u'{seed} as server seed')
+@step(u'{seed:d} as server seed')
 def step_seed(context, seed):
-    context.server_seed = int(seed)
+    context.server_seed = seed
+
+
+@step(u'{ngl:d} GPU offloaded layers')
+def step_n_gpu_layer(context, ngl):
+    if 'N_GPU_LAYERS' in os.environ:
+        new_ngl = int(os.environ['N_GPU_LAYERS'])
+        if context.debug:
+            print(f"-ngl upgraded from {ngl} to {new_ngl}")
+        ngl = new_ngl
+    context.n_gpu_layer = ngl


-@step(u'{n_ctx} KV cache size')
+@step(u'{n_ctx:d} KV cache size')
 def step_n_ctx(context, n_ctx):
-    context.n_ctx = int(n_ctx)
+    context.n_ctx = n_ctx
+
+
+@step(u'a KV cache size based on the model trained context {n_ctx_train:d}'
+      u' extended by {n_grp:d} with additional {n_keep:d} tokens')
+def step_kv_cache_size_extended(context, n_ctx_train, n_grp, n_keep):
+    context.n_ctx = n_ctx_train * n_grp + n_keep


-@step(u'{n_slots} slots')
+@step(u'{n_slots:d} slots')
 def step_n_slots(context, n_slots):
-    context.n_slots = int(n_slots)
+    context.n_slots = n_slots


-@step(u'{n_predict} server max tokens to predict')
+@step(u'{n_predict:d} server max tokens to predict')
 def step_server_n_predict(context, n_predict):
-    context.n_server_predict = int(n_predict)
+    context.n_server_predict = n_predict


 @step(u'continuous batching')
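The `:d` suffixes added above lean on behave's default parse-style step matcher, which converts the captured text to `int` before the step function runs; that is why the explicit `int(...)` casts are dropped. A small self-contained illustration (independent of the test suite):

```python
from behave import step

@step(u'{n_slots:d} slots')
def step_n_slots(context, n_slots):
    # The parse matcher already converted the capture, so n_slots is an int here.
    assert isinstance(n_slots, int)
    context.n_slots = n_slots
```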
@@ -116,11 +139,12 @@ async def step_wait_for_the_server_to_be_started(context, expecting_status):
         case 'ready' | 'idle':
             await wait_for_health_status(context, context.base_url, 200, 'ok',
+                                         timeout=10,
                                          params={'fail_on_no_slot': 0, 'include_slots': 0},
                                          slots_idle=context.n_slots,
                                          slots_processing=0,
                                          expected_slots=[{'id': slot_id, 'state': 0}
-                                                         for slot_id in range(context.n_slots)])
+                                                         for slot_id in range(context.n_slots if context.n_slots else 1)])
         case 'busy':
             await wait_for_health_status(context, context.base_url, 503,
                                          'no slot available',
@@ -128,7 +152,7 @@ async def step_wait_for_the_server_to_be_started(context, expecting_status):
                                          slots_idle=0,
                                          slots_processing=context.n_slots,
                                          expected_slots=[{'id': slot_id, 'state': 1}
-                                                         for slot_id in range(context.n_slots)])
+                                                         for slot_id in range(context.n_slots if context.n_slots else 1)])
         case _:
             assert False, "unknown status"

@@ -157,12 +181,12 @@ async def step_request_completion(context, api_error):
                                           context.base_url,
                                           debug=context.debug,
                                           n_predict=context.n_predict,
-                                          server_seed=context.server_seed,
+                                          seed=await completions_seed(context),
                                           expect_api_error=expect_api_error,
                                           user_api_key=context.user_api_key)
     context.tasks_result.append(completion)
     if context.debug:
-        print(f"Completion response: {completion}")
+        print(f"Completion response: {completion}\n")
     if expect_api_error:
         assert completion == 401, f"completion must be an 401 status code: {completion}"

@@ -192,9 +216,9 @@ def step_model(context, model):
     context.model = model


-@step(u'{max_tokens} max tokens to predict')
+@step(u'{max_tokens:d} max tokens to predict')
 def step_max_tokens(context, max_tokens):
-    context.n_predict = int(max_tokens)
+    context.n_predict = max_tokens


 @step(u'streaming is {enable_streaming}')
@@ -222,11 +246,68 @@ def step_server_api_key(context, server_api_key):
     context.server_api_key = server_api_key


+@step(u'{n_junk:d} as number of junk')
+def step_n_junk(context, n_junk):
+    context.n_junk = n_junk
+
+
+@step(u'{n_batch:d} as batch size')
+def step_n_batch(context, n_batch):
+    context.n_batch = n_batch
+
+
+@step(u'a self-extend context with a factor of {n_grp:d}')
+def step_n_grp(context, n_grp):
+    context.n_grp = n_grp
+
+
+@step(u'{seed:d} as seed')
+def step_seed(context, seed):
+    context.seed = seed
+
+
+@step(u'a prefix prompt')
+def step_prompt_prefix(context):
+    context.prompt_prefix = context.text
+
+
+@step(u'a junk suffix prompt')
+def step_prompt_junk_suffix(context):
+    context.prompt_junk_suffix = context.text
+
+
+@step(u'a suffix prompt')
+def step_prompt_suffix(context):
+    context.prompt_suffix = context.text
+
+
+@step(u'a passkey prompt template')
+def step_prompt_passkey_template(context):
+    context.prompt_passkey_template = context.text
+
+
+@step(u'a "{passkey}" passkey challenge prompt with the passkey inserted every {i_pos:d} junk')
+def step_prompt_passkey(context, passkey, i_pos):
+    prompt = ""
+    for i in range(context.n_junk):
+        if i % context.n_junk == i_pos:
+            prompt += context.prompt_passkey_template
+        prompt += context.prompt_junk_suffix
+    if context.debug:
+        print(f"ERRRRR Passkey challenge:\n```\n{prompt}\n```\n")
+    context.prompts.append(prompt)
+
+
+@step(u'The passkey is found')
+def step_passkey_found(context):
+    raise NotImplementedError(u'STEP: Then The passkey is found')
+
+
 @step(u'an OAI compatible chat completions request with {api_error} api error')
 @async_run_until_complete
 async def step_oai_chat_completions(context, api_error):
     if context.debug:
-        print(f"Submitting OAI compatible completions request...")
+        print(f"Submitting OAI compatible completions request...\n")
     expect_api_error = api_error == 'raised'
     completion = await oai_chat_completions(context.prompts.pop(),
                                             context.system_prompt,
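Several of the new steps above (`a prefix prompt`, `a junk suffix prompt`, `a passkey prompt template`) read `context.text`, which behave populates from the triple-quoted block that follows the step in a feature file. A minimal illustration of that mechanism; the feature text in the comment is made up:

```python
from behave import step

# In a .feature file the step is followed by a docstring block, e.g.:
#
#     Given a prefix prompt
#       """
#       Once upon a time
#       """
#
# behave exposes that block as context.text when the step executes.
@step(u'a prefix prompt')
def step_prompt_prefix(context):
    context.prompt_prefix = context.text
```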
@@ -241,8 +322,7 @@ async def step_oai_chat_completions(context, api_error):
                                             enable_streaming=context.enable_streaming
                                             if hasattr(context, 'enable_streaming') else None,
-                                            server_seed=context.server_seed
-                                            if hasattr(context, 'server_seed') else None,
+                                            seed=await completions_seed(context),
                                             user_api_key=context.user_api_key
                                             if hasattr(context, 'user_api_key') else None,

@@ -276,8 +356,10 @@ async def step_concurrent_completion_requests(context):
                               # prompt is inserted automatically
                               context.base_url,
                               debug=context.debug,
+                              prompt_prefix=context.prompt_prefix,
+                              prompt_suffix=context.prompt_suffix,
                               n_predict=context.n_predict if hasattr(context, 'n_predict') else None,
-                              server_seed=context.server_seed if hasattr(context, 'server_seed') else None,
+                              seed=await completions_seed(context),
                               user_api_key=context.user_api_key if hasattr(context,
                                                                            'user_api_key') else None)

@@ -297,8 +379,7 @@ async def step_oai_chat_completions(context):
                                           if hasattr(context, 'n_predict') else None,
                                           enable_streaming=context.enable_streaming
                                           if hasattr(context, 'enable_streaming') else None,
-                                          server_seed=context.server_seed
-                                          if hasattr(context, 'server_seed') else None,
+                                          seed=await completions_seed(context),
                                           user_api_key=context.user_api_key
                                           if hasattr(context, 'user_api_key') else None)

@@ -318,7 +399,9 @@ async def step_oai_chat_completions(context):
                                           if hasattr(context, 'n_predict') else None,
                                           enable_streaming=context.enable_streaming
                                           if hasattr(context, 'enable_streaming') else None,
-                                          server_seed=context.server_seed
+                                          seed=context.seed
+                                          if hasattr(context, 'seed') else
+                                          context.server_seed
                                           if hasattr(context, 'server_seed') else None,
                                           user_api_key=context.user_api_key
                                           if hasattr(context, 'user_api_key') else None)

@@ -330,11 +413,10 @@ async def step_all_prompts_are_predicted(context):
     await all_prompts_are_predicted(context)


-@step(u'all prompts are predicted with {n_predict} tokens')
+@step(u'all prompts are predicted with {n_expected_predicted:d} tokens')
 @async_run_until_complete
-async def step_all_prompts_are_predicted_with_n_tokens(context, n_predict):
-    expected_predicted_n = int(n_predict)
-    await all_prompts_are_predicted(context, expected_predicted_n)
+async def step_all_prompts_are_predicted_with_n_tokens(context, n_expected_predicted):
+    await all_prompts_are_predicted(context, n_expected_predicted)


 async def all_prompts_are_predicted(context, expected_predicted_n=None):
@@ -480,17 +562,27 @@ def step_available_models(context):
     context.models = openai.Model.list().data


-@step(u'{n_model} models are supported')
+@step(u'{n_model:d} models are supported')
 def step_supported_models(context, n_model):
     if context.debug:
         print("server models available:", context.models)
-    assert len(context.models) == int(n_model)
+    assert len(context.models) == n_model


-@step(u'model {i_model} is {model_alias}')
-def step_supported_models(context, i_model, model_alias):
-    model = context.models[int(i_model)]
-    assert model.id == model_alias, f"model id {model.id} == {model_alias}"
+@step(u'model {i_model:d} is {param} {preposition} {param_value}')
+def step_supported_models(context, i_model, param, preposition, param_value):
+    assert i_model < len(context.models)
+    model = context.models[i_model]
+
+    param_value = param_value.split(' ', 1)[0]
+    match param:
+        case 'identified':
+            value = model.id
+        case 'trained':
+            value = str(model.meta.n_ctx_train)
+        case _:
+            assert False, "param {param} not supported"
+    assert param_value == value, f"model param {param} {value} != {param_value}"


 async def concurrent_requests(context, f_completion, *args, **kwargs):
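The generic step above is what the new server.feature assertions bind to, and `param_value.split(' ', 1)[0]` keeps only the first word so trailing words of the sentence do not leak into the comparison. A sketch of how the two phrasings decompose (values taken from server.feature):

```python
# "Then model 0 is identified by tinyllama-2"
#   -> i_model=0, param='identified', preposition='by', param_value='tinyllama-2'
# "Then model 0 is trained on 128 tokens context"
#   -> i_model=0, param='trained', preposition='on', param_value='128 tokens context'
param_value = '128 tokens context'.split(' ', 1)[0]
assert param_value == '128'  # compared against str(model.meta.n_ctx_train)
```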
@@ -507,8 +599,10 @@ async def concurrent_requests(context, f_completion, *args, **kwargs):
 async def request_completion(prompt,
                              base_url,
                              debug=False,
+                             prompt_prefix=None,
+                             prompt_suffix=None,
                              n_predict=None,
-                             server_seed=None,
+                             seed=None,
                              expect_api_error=None,
                              user_api_key=None):
     if debug:

@@ -525,9 +619,11 @@ async def request_completion(prompt,
     async with aiohttp.ClientSession() as session:
         async with session.post(f'{base_url}/completion',
                                 json={
+                                    "input_prefix": prompt_prefix,
                                     "prompt": prompt,
-                                    "n_predict": int(n_predict) if n_predict is not None else -1,
-                                    "seed": server_seed if server_seed is not None else 42
+                                    "input_suffix": prompt_suffix,
+                                    "n_predict": n_predict if n_predict is not None else -1,
+                                    "seed": seed if seed is not None else 42
                                 },
                                 headers=headers) as response:
             if expect_api_error is None or not expect_api_error:
@@ -547,14 +643,14 @@ async def oai_chat_completions(user_prompt,
                               model=None,
                               n_predict=None,
                               enable_streaming=None,
-                              server_seed=None,
+                              seed=None,
                               user_api_key=None,
                               expect_api_error=None):
     if debug:
         print(f"Sending OAI Chat completions request: {user_prompt}")
     # openai client always expects an api key
     user_api_key = user_api_key if user_api_key is not None else 'nope'
-    seed = server_seed if server_seed is not None else 42
+    seed = seed if seed is not None else 42
     enable_streaming = enable_streaming if enable_streaming is not None else False
     payload = {
         "messages": [

@@ -726,7 +822,7 @@ def assert_n_tokens_predicted(completion_response, expected_predicted_n=None, re
 async def gather_tasks_results(context):
     n_tasks = len(context.concurrent_tasks)
     if context.debug:
-        print(f"Waiting for all {n_tasks} tasks results...")
+        print(f"Waiting for all {n_tasks} tasks results...\n")
     for task_no in range(n_tasks):
         context.tasks_result.append(await context.concurrent_tasks.pop())
     n_completions = len(context.tasks_result)
@@ -737,15 +833,14 @@ async def wait_for_health_status(context,
                                  base_url,
                                  expected_http_status_code,
                                  expected_health_status,
+                                 timeout=3,
                                  params=None,
                                  slots_idle=None,
                                  slots_processing=None,
                                  expected_slots=None):
     if context.debug:
-        print(f"Starting checking for health for expected_health_status={expected_health_status}")
-    timeout = 3  # seconds
-    if expected_health_status == 'ok':
-        timeout = 10  # CI slow inference
+        print(f"Starting checking for health for expected_health_status={expected_health_status}\n")
+    timeout = 3
     interval = 0.5
     counter = 0
     async with aiohttp.ClientSession() as session:

@@ -755,7 +850,7 @@ async def wait_for_health_status(context,
                 health = await health_response.json()
                 if context.debug:
                     print(f"HEALTH - response for expected health status='{expected_health_status}' on "
-                          f"'{base_url}/health'?{params} is {health}")
+                          f"'{base_url}/health'?{params} is {health}\n")
                 if (status_code == expected_http_status_code
                         and health['status'] == expected_health_status
                         and (slots_idle is None or health['slots_idle'] == slots_idle)

@@ -778,7 +873,7 @@ async def wait_for_health_status(context,
             if expected_http_status_code == 503:
                 if len(context.tasks_result) == 0:
                     print("\x1b[5;37;43mWARNING: forcing concurrent tasks,"
-                          " busy health check missed, probably too fast inference\x1b[0m")
+                          " busy health check missed, probably too fast inference\x1b[0m\n")
                     n_completions = await gather_tasks_results(context)
                     if n_completions > 0:
                         return
@@ -812,6 +907,11 @@ def assert_slots_status(slots, expected_slots):
                             f" = {expected[key]} != {slot[key]}")


+async def completions_seed(context):
+    return context.seed if hasattr(context, 'seed') and context.seed is not None \
+        else context.server_seed if hasattr(context, 'server_seed') else None
+
+
 def start_server_background(context):
     context.server_path = '../../../build/bin/server'
     if 'LLAMA_SERVER_BIN_PATH' in os.environ:
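`completions_seed` lets a scenario-level seed (`{seed:d} as seed`) take precedence over the background's server seed (`{seed:d} as server seed`), returning `None` when neither step ran; the request helpers then fall back to `42`. A condensed sketch of that precedence outside the behave context object, with an illustrative name `effective_seed`:

```python
def effective_seed(scenario_seed, server_seed):
    # Mirrors completions_seed() plus the callers' `if seed is not None else 42` fallback.
    seed = scenario_seed if scenario_seed is not None else server_seed
    return seed if seed is not None else 42

assert effective_seed(7, 42) == 7        # scenario seed wins
assert effective_seed(None, 42) == 42    # background server seed
assert effective_seed(None, None) == 42  # request default
```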
@@ -821,6 +921,10 @@ def start_server_background(context):
                    '--port', context.server_port,
                    '--model', context.model_file
                    ]
+    if context.n_batch:
+        server_args.extend(['--batch-size', context.n_batch])
+    if context.n_gpu_layer:
+        server_args.extend(['--n-gpu-layers', context.n_gpu_layer])
     if context.server_continuous_batching:
         server_args.append('--cont-batching')
     if context.server_embeddings:

@@ -841,7 +945,7 @@ def start_server_background(context):
         server_args.append('--verbose')
     if 'SERVER_LOG_FORMAT_JSON' not in os.environ:
         server_args.extend(['--log-format', "text"])
-    print(f"starting server with: {context.server_path}", *server_args)
+    print(f"starting server with: {context.server_path} {server_args}\n")
     context.server_process = subprocess.Popen(
         [str(arg) for arg in [context.server_path, *server_args]],
         close_fds=True)
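With the two new options wired in, a background such as the one in server.feature ends up launching the server with a command line roughly like the sketch below; the model path depends on where `hf_hub_download` cached the file, and only flags visible in this diff are shown:

```python
# Approximation of the argv built by start_server_background() for
# "And 512 as batch size" (model path and port are illustrative).
cmd = ['../../../build/bin/server',
       '--port', '8080',
       '--model', '/path/to/hf-cache/tinyllamas/stories260K.gguf',
       '--batch-size', '512',
       '--log-format', 'text']
```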

examples/server/tests/features/wrong_usages.feature

@@ -7,7 +7,7 @@ Feature: Wrong usage of llama.cpp server
   # or pass n_predict/max_tokens in the request.
   Scenario: Infinite loop
     Given a server listening on localhost:8080
-    And   a model file stories260K.gguf
+    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     # Uncomment below to fix the issue
     #And 64 server max tokens to predict
     Then  the server is starting

@@ -18,4 +18,5 @@ Feature: Wrong usage of llama.cpp server
     # Uncomment below to fix the issue
     #And 128 max tokens to predict
     Given concurrent completion requests
+    Then  the server is idle
     Then  all prompts are predicted

examples/server/tests/requirements.txt

@@ -1,4 +1,5 @@
 aiohttp~=3.9.3
 behave~=1.2.6
+huggingface_hub~=0.20.3
 openai~=0.25.0
 prometheus-client~=0.20.0

examples/server/tests/tests.sh

@@ -5,7 +5,7 @@ set -eu
 if [ $# -lt 1 ]
 then
   # Start @llama.cpp scenario
-  behave --summary --stop --no-capture --exclude 'issues|wrong_usages' --tags llama.cpp
+  behave --summary --stop --no-capture --exclude 'issues|wrong_usages|slow' --tags llama.cpp
 else
   behave "$@"
 fi