server: tests: download model from HF, add batch size

parent 1780d9601d
commit 319ded7dde

9 changed files with 201 additions and 76 deletions
.github/workflows/server.yml (vendored), 6 changes

@@ -70,12 +70,6 @@ jobs:
       run: |
         pip install -r examples/server/tests/requirements.txt
 
-    - name: Download models
-      id: download_models
-      run: |
-        cd examples/server/tests
-        ../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf
-
     - name: Tests
       id: server_integration_test
       run: |
examples/server/tests/README.md

@@ -1,22 +1,30 @@
 # Server tests
 
-Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
-* [issues.feature](./features/issues.feature) Pending issues scenario
-* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
-* [security.feature](./features/security.feature) Security, CORS and API Key
-* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
+Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development)
+and [behave](https://behave.readthedocs.io/en/latest/):
+
+* [issues.feature](./features/issues.feature) Pending issues scenario
+* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
+* [security.feature](./features/security.feature) Security, CORS and API Key
+* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
 
 Tests target GitHub workflows job runners with 4 vCPU.
 
-Requests are using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) based http client.
+Requests are
+using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html)
+based http client.
 
-Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. To mitigate it, you can increase values in `n_predict`, `kv_size`.
+Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail.
+To mitigate it, you can increase values in `n_predict`, `kv_size`.
 
 ### Install dependencies
 
 `pip install -r requirements.txt`
 
 ### Run tests
 
 1. Build the server
 
 ```shell
 cd ../../..
 mkdir build

@@ -24,24 +32,36 @@ cd build
 cmake ../
 cmake --build . --target server
 ```
-2. download required models:
-   1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
-3. Start the test: `./tests.sh`
+2. Start the test: `./tests.sh`
 
 It's possible to override some scenario steps values with environment variables:
-- `PORT` -> `context.server_port` to set the listening port of the server during scenario, default: `8080`
-- `LLAMA_SERVER_BIN_PATH` -> to change the server binary path, default: `../../../build/bin/server`
-- `DEBUG` -> "ON" to enable steps and server verbose mode `--verbose`
-- `SERVER_LOG_FORMAT_JSON` -> if set switch server logs to json format
+
+| variable                 | description                                                                                     |
+|--------------------------|-------------------------------------------------------------------------------------------------|
+| `PORT`                   | `context.server_port` to set the listening port of the server during scenario, default: `8080` |
+| `LLAMA_SERVER_BIN_PATH`  | to change the server binary path, default: `../../../build/bin/server`                         |
+| `DEBUG`                  | "ON" to enable steps and server verbose mode `--verbose`                                        |
+| `SERVER_LOG_FORMAT_JSON` | if set switch server logs to json format                                                        |
+| `N_GPU_LAYERS`           | number of model layers to offload to VRAM `-ngl --n-gpu-layers`                                 |
 
 ### Run @bug, @wip or @wrong_usage annotated scenario
 
 Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope.
 
 - `@bug` annotation aims to link a scenario with a GitHub issue.
 - `@wrong_usage` are meant to show user issue that are actually an expected behavior
 - `@wip` to focus on a scenario working in progress
+- `@slow` heavy test, disabled by default
 
 To run a scenario annotated with `@bug`, start:
-`DEBUG=ON ./tests.sh --no-skipped --tags bug`
+
+```shell
+DEBUG=ON ./tests.sh --no-skipped --tags bug
+```
 
 After changing logic in `steps.py`, ensure that `@bug` and `@wrong_usage` scenario are updated.
+
+```shell
+./tests.sh --no-skipped --tags bug,wrong_usage || echo "should failed but compile"
+```
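The variables in the table above can be combined on a single `tests.sh` invocation. A minimal usage sketch (the values are illustrative; `N_GPU_LAYERS` only affects scenarios that use the new "GPU offloaded layers" step):

```shell
# Illustrative override of the test-suite environment variables.
PORT=8081 \
DEBUG=ON \
N_GPU_LAYERS=33 \
LLAMA_SERVER_BIN_PATH=../../../build/bin/server \
./tests.sh
```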
examples/server/tests/features/parallel.feature

@@ -1,11 +1,12 @@
 @llama.cpp
+@parallel
 Feature: Parallel
 
   Background: Server startup
     Given a server listening on localhost:8080
-    And a model file stories260K.gguf
-    And a model alias tinyllama-2
+    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     And 42 as server seed
+    And 512 as batch size
     And 64 KV cache size
     And 2 slots
     And embeddings extraction
examples/server/tests/features/security.feature

@@ -1,9 +1,10 @@
 @llama.cpp
+@security
 Feature: Security
 
   Background: Server startup with an api key defined
     Given a server listening on localhost:8080
-    And a model file stories260K.gguf
+    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     And a server api key llama.cpp
     Then the server is starting
     Then the server is healthy
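For manual debugging of the security scenarios, the api key can be exercised with curl. This is a hedged sketch: the `--api-key` server flag and the `Authorization: Bearer` header are assumptions about the server and harness (they do not appear in this diff), while the JSON fields mirror the `/completion` body built by `request_completion` in `steps.py`:

```shell
# Hypothetical manual request against a server started with an api key of "llama.cpp".
curl -s http://localhost:8080/completion \
  -H "Authorization: Bearer llama.cpp" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "I believe the meaning of life is", "n_predict": 32, "seed": 42}'
```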
examples/server/tests/features/server.feature

@@ -1,15 +1,17 @@
 @llama.cpp
+@server
 Feature: llama.cpp server
 
   Background: Server startup
     Given a server listening on localhost:8080
-    And a model file stories260K.gguf
+    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     And a model alias tinyllama-2
     And 42 as server seed
     # KV Cache corresponds to the total amount of tokens
     # that can be stored across all independent sequences: #4130
     # see --ctx-size and #5568
     And 32 KV cache size
+    And 512 as batch size
     And 1 slots
     And embeddings extraction
     And 32 server max tokens to predict

@@ -85,4 +87,5 @@ Feature: llama.cpp server
   Scenario: Models available
     Given available models
     Then 1 models are supported
-    Then model 0 is tinyllama-2
+    Then model 0 is identified by tinyllama-2
+    Then model 0 is trained on 128 tokens context
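With the feature-level tags introduced above (`@parallel`, `@security`, `@server`), individual features can be selected through behave's tag filter. Arguments given to `tests.sh` are forwarded to behave unchanged (see the `tests.sh` hunk at the end of this diff), so a usage sketch for running only the server feature would be:

```shell
# Select scenarios tagged @server; multiple --tags options are ANDed by behave.
./tests.sh --no-skipped --tags llama.cpp --tags server
```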
examples/server/tests/features/steps/steps.py

@@ -13,6 +13,7 @@ import aiohttp
 import openai
 from behave import step
 from behave.api.async_step import async_run_until_complete
+from huggingface_hub import hf_hub_download
 from prometheus_client import parser
 
 
@@ -26,17 +27,21 @@ def step_server_config(context, server_fqdn, server_port):
 
     context.base_url = f'http://{context.server_fqdn}:{context.server_port}'
 
-    context.debug = 'DEBUG' in os.environ and os.environ['DEBUG'] == 'ON'
     context.model_alias = None
+    context.n_batch = None
     context.n_ctx = None
+    context.n_gpu_layer = None
     context.n_predict = None
     context.n_server_predict = None
     context.n_slots = None
+    context.prompt_prefix = None
+    context.prompt_suffix = None
     context.server_api_key = None
     context.server_continuous_batching = False
     context.server_embeddings = False
     context.server_metrics = False
     context.server_process = None
+    context.seed = None
     context.server_seed = None
     context.user_api_key = None
 
@@ -45,9 +50,11 @@ def step_server_config(context, server_fqdn, server_port):
     context.prompts = []
 
 
-@step(u'a model file {model_file}')
-def step_model_file(context, model_file):
-    context.model_file = model_file
+@step(u'a model file {hf_file} from HF repo {hf_repo}')
+def step_download_hf_model(context, hf_file, hf_repo):
+    context.model_file = hf_hub_download(repo_id=hf_repo, filename=hf_file)
+    if context.debug:
+        print(f"model file: {context.model_file}\n")
 
 
 @step(u'a model alias {model_alias}')
@@ -55,24 +62,40 @@ def step_model_alias(context, model_alias):
     context.model_alias = model_alias
 
 
-@step(u'{seed} as server seed')
+@step(u'{seed:d} as server seed')
 def step_seed(context, seed):
-    context.server_seed = int(seed)
+    context.server_seed = seed
 
 
-@step(u'{n_ctx} KV cache size')
+@step(u'{ngl:d} GPU offloaded layers')
+def step_n_gpu_layer(context, ngl):
+    if 'N_GPU_LAYERS' in os.environ:
+        new_ngl = int(os.environ['N_GPU_LAYERS'])
+        if context.debug:
+            print(f"-ngl upgraded from {ngl} to {new_ngl}")
+        ngl = new_ngl
+    context.n_gpu_layer = ngl
+
+
+@step(u'{n_ctx:d} KV cache size')
 def step_n_ctx(context, n_ctx):
-    context.n_ctx = int(n_ctx)
+    context.n_ctx = n_ctx
 
 
-@step(u'{n_slots} slots')
+@step(u'a KV cache size based on the model trained context {n_ctx_train:d}'
+      u' extended by {n_grp:d} with additional {n_keep:d} tokens')
+def step_kv_cache_size_extended(context, n_ctx_train, n_grp, n_keep):
+    context.n_ctx = n_ctx_train * n_grp + n_keep
+
+
+@step(u'{n_slots:d} slots')
 def step_n_slots(context, n_slots):
-    context.n_slots = int(n_slots)
+    context.n_slots = n_slots
 
 
-@step(u'{n_predict} server max tokens to predict')
+@step(u'{n_predict:d} server max tokens to predict')
 def step_server_n_predict(context, n_predict):
-    context.n_server_predict = int(n_predict)
+    context.n_server_predict = n_predict
 
 
 @step(u'continuous batching')
@@ -116,11 +139,12 @@ async def step_wait_for_the_server_to_be_started(context, expecting_status):
 
         case 'ready' | 'idle':
             await wait_for_health_status(context, context.base_url, 200, 'ok',
+                                         timeout=10,
                                          params={'fail_on_no_slot': 0, 'include_slots': 0},
                                          slots_idle=context.n_slots,
                                          slots_processing=0,
                                          expected_slots=[{'id': slot_id, 'state': 0}
-                                                         for slot_id in range(context.n_slots)])
+                                                         for slot_id in range(context.n_slots if context.n_slots else 1)])
         case 'busy':
             await wait_for_health_status(context, context.base_url, 503,
                                          'no slot available',
@@ -128,7 +152,7 @@ async def step_wait_for_the_server_to_be_started(context, expecting_status):
                                          slots_idle=0,
                                          slots_processing=context.n_slots,
                                          expected_slots=[{'id': slot_id, 'state': 1}
-                                                         for slot_id in range(context.n_slots)])
+                                                         for slot_id in range(context.n_slots if context.n_slots else 1)])
         case _:
             assert False, "unknown status"
 
@@ -157,12 +181,12 @@ async def step_request_completion(context, api_error):
                                           context.base_url,
                                           debug=context.debug,
                                           n_predict=context.n_predict,
-                                          server_seed=context.server_seed,
+                                          seed=await completions_seed(context),
                                           expect_api_error=expect_api_error,
                                           user_api_key=context.user_api_key)
     context.tasks_result.append(completion)
     if context.debug:
-        print(f"Completion response: {completion}")
+        print(f"Completion response: {completion}\n")
     if expect_api_error:
         assert completion == 401, f"completion must be an 401 status code: {completion}"
 
@@ -192,9 +216,9 @@ def step_model(context, model):
     context.model = model
 
 
-@step(u'{max_tokens} max tokens to predict')
+@step(u'{max_tokens:d} max tokens to predict')
 def step_max_tokens(context, max_tokens):
-    context.n_predict = int(max_tokens)
+    context.n_predict = max_tokens
 
 
 @step(u'streaming is {enable_streaming}')
@@ -222,11 +246,68 @@ def step_server_api_key(context, server_api_key):
     context.server_api_key = server_api_key
 
 
+@step(u'{n_junk:d} as number of junk')
+def step_n_junk(context, n_junk):
+    context.n_junk = n_junk
+
+
+@step(u'{n_batch:d} as batch size')
+def step_n_batch(context, n_batch):
+    context.n_batch = n_batch
+
+
+@step(u'a self-extend context with a factor of {n_grp:d}')
+def step_n_grp(context, n_grp):
+    context.n_grp = n_grp
+
+
+@step(u'{seed:d} as seed')
+def step_seed(context, seed):
+    context.seed = seed
+
+
+@step(u'a prefix prompt')
+def step_prompt_prefix(context):
+    context.prompt_prefix = context.text
+
+
+@step(u'a junk suffix prompt')
+def step_prompt_junk_suffix(context):
+    context.prompt_junk_suffix = context.text
+
+
+@step(u'a suffix prompt')
+def step_prompt_suffix(context):
+    context.prompt_suffix = context.text
+
+
+@step(u'a passkey prompt template')
+def step_prompt_passkey_template(context):
+    context.prompt_passkey_template = context.text
+
+
+@step(u'a "{passkey}" passkey challenge prompt with the passkey inserted every {i_pos:d} junk')
+def step_prompt_passkey(context, passkey, i_pos):
+    prompt = ""
+    for i in range(context.n_junk):
+        if i % context.n_junk == i_pos:
+            prompt += context.prompt_passkey_template
+        prompt += context.prompt_junk_suffix
+    if context.debug:
+        print(f"ERRRRR Passkey challenge:\n```\n{prompt}\n```\n")
+    context.prompts.append(prompt)
+
+
+@step(u'The passkey is found')
+def step_passkey_found(context):
+    raise NotImplementedError(u'STEP: Then The passkey is found')
+
+
 @step(u'an OAI compatible chat completions request with {api_error} api error')
 @async_run_until_complete
 async def step_oai_chat_completions(context, api_error):
     if context.debug:
-        print(f"Submitting OAI compatible completions request...")
+        print(f"Submitting OAI compatible completions request...\n")
     expect_api_error = api_error == 'raised'
     completion = await oai_chat_completions(context.prompts.pop(),
                                             context.system_prompt,
@@ -241,8 +322,7 @@ async def step_oai_chat_completions(context, api_error):
                                             enable_streaming=context.enable_streaming
                                             if hasattr(context, 'enable_streaming') else None,
-                                            server_seed=context.server_seed
-                                            if hasattr(context, 'server_seed') else None,
+                                            seed=await completions_seed(context),
                                             user_api_key=context.user_api_key
                                             if hasattr(context, 'user_api_key') else None,
@@ -276,8 +356,10 @@ async def step_concurrent_completion_requests(context):
                               # prompt is inserted automatically
                               context.base_url,
                               debug=context.debug,
+                              prompt_prefix=context.prompt_prefix,
+                              prompt_suffix=context.prompt_suffix,
                               n_predict=context.n_predict if hasattr(context, 'n_predict') else None,
-                              server_seed=context.server_seed if hasattr(context, 'server_seed') else None,
+                              seed=await completions_seed(context),
                               user_api_key=context.user_api_key if hasattr(context,
                                                                            'user_api_key') else None)
 
@@ -297,8 +379,7 @@ async def step_oai_chat_completions(context):
                               if hasattr(context, 'n_predict') else None,
                               enable_streaming=context.enable_streaming
                               if hasattr(context, 'enable_streaming') else None,
-                              server_seed=context.server_seed
-                              if hasattr(context, 'server_seed') else None,
+                              seed=await completions_seed(context),
                               user_api_key=context.user_api_key
                               if hasattr(context, 'user_api_key') else None)
 
@@ -318,7 +399,9 @@ async def step_oai_chat_completions(context):
                               if hasattr(context, 'n_predict') else None,
                               enable_streaming=context.enable_streaming
                               if hasattr(context, 'enable_streaming') else None,
-                              server_seed=context.server_seed
+                              seed=context.seed
+                              if hasattr(context, 'seed') else
+                              context.server_seed
                               if hasattr(context, 'server_seed') else None,
                               user_api_key=context.user_api_key
                               if hasattr(context, 'user_api_key') else None)
@@ -330,11 +413,10 @@ async def step_all_prompts_are_predicted(context):
     await all_prompts_are_predicted(context)
 
 
-@step(u'all prompts are predicted with {n_predict} tokens')
+@step(u'all prompts are predicted with {n_expected_predicted:d} tokens')
 @async_run_until_complete
-async def step_all_prompts_are_predicted_with_n_tokens(context, n_predict):
-    expected_predicted_n = int(n_predict)
-    await all_prompts_are_predicted(context, expected_predicted_n)
+async def step_all_prompts_are_predicted_with_n_tokens(context, n_expected_predicted):
+    await all_prompts_are_predicted(context, n_expected_predicted)
 
 
 async def all_prompts_are_predicted(context, expected_predicted_n=None):
@@ -480,17 +562,27 @@ def step_available_models(context):
     context.models = openai.Model.list().data
 
 
-@step(u'{n_model} models are supported')
+@step(u'{n_model:d} models are supported')
 def step_supported_models(context, n_model):
     if context.debug:
         print("server models available:", context.models)
-    assert len(context.models) == int(n_model)
+    assert len(context.models) == n_model
 
 
-@step(u'model {i_model} is {model_alias}')
-def step_supported_models(context, i_model, model_alias):
-    model = context.models[int(i_model)]
-    assert model.id == model_alias, f"model id {model.id} == {model_alias}"
+@step(u'model {i_model:d} is {param} {preposition} {param_value}')
+def step_supported_models(context, i_model, param, preposition, param_value):
+    assert i_model < len(context.models)
+    model = context.models[i_model]
+
+    param_value = param_value.split(' ', 1)[0]
+    match param:
+        case 'identified':
+            value = model.id
+        case 'trained':
+            value = str(model.meta.n_ctx_train)
+        case _:
+            assert False, "param {param} not supported"
+    assert param_value == value, f"model param {param} {value} != {param_value}"
 
 
 async def concurrent_requests(context, f_completion, *args, **kwargs):
@@ -507,8 +599,10 @@ async def concurrent_requests(context, f_completion, *args, **kwargs):
 async def request_completion(prompt,
                              base_url,
                              debug=False,
+                             prompt_prefix=None,
+                             prompt_suffix=None,
                              n_predict=None,
-                             server_seed=None,
+                             seed=None,
                              expect_api_error=None,
                              user_api_key=None):
     if debug:
@@ -525,9 +619,11 @@ async def request_completion(prompt,
     async with aiohttp.ClientSession() as session:
         async with session.post(f'{base_url}/completion',
                                 json={
+                                    "input_prefix": prompt_prefix,
                                     "prompt": prompt,
-                                    "n_predict": int(n_predict) if n_predict is not None else -1,
-                                    "seed": server_seed if server_seed is not None else 42
+                                    "input_suffix": prompt_suffix,
+                                    "n_predict": n_predict if n_predict is not None else -1,
+                                    "seed": seed if seed is not None else 42
                                 },
                                 headers=headers) as response:
             if expect_api_error is None or not expect_api_error:
@@ -547,14 +643,14 @@ async def oai_chat_completions(user_prompt,
                                model=None,
                                n_predict=None,
                                enable_streaming=None,
-                               server_seed=None,
+                               seed=None,
                                user_api_key=None,
                                expect_api_error=None):
     if debug:
         print(f"Sending OAI Chat completions request: {user_prompt}")
     # openai client always expects an api key
     user_api_key = user_api_key if user_api_key is not None else 'nope'
-    seed = server_seed if server_seed is not None else 42
+    seed = seed if seed is not None else 42
     enable_streaming = enable_streaming if enable_streaming is not None else False
     payload = {
         "messages": [
@@ -726,7 +822,7 @@ def assert_n_tokens_predicted(completion_response, expected_predicted_n=None, re
 async def gather_tasks_results(context):
     n_tasks = len(context.concurrent_tasks)
     if context.debug:
-        print(f"Waiting for all {n_tasks} tasks results...")
+        print(f"Waiting for all {n_tasks} tasks results...\n")
     for task_no in range(n_tasks):
         context.tasks_result.append(await context.concurrent_tasks.pop())
     n_completions = len(context.tasks_result)
@@ -737,15 +833,14 @@ async def wait_for_health_status(context,
                                  base_url,
                                  expected_http_status_code,
                                  expected_health_status,
+                                 timeout=3,
                                  params=None,
                                  slots_idle=None,
                                  slots_processing=None,
                                  expected_slots=None):
     if context.debug:
-        print(f"Starting checking for health for expected_health_status={expected_health_status}")
-    timeout = 3 # seconds
-    if expected_health_status == 'ok':
-        timeout = 10 # CI slow inference
+        print(f"Starting checking for health for expected_health_status={expected_health_status}\n")
+    timeout = 3
     interval = 0.5
     counter = 0
     async with aiohttp.ClientSession() as session:
@@ -755,7 +850,7 @@ async def wait_for_health_status(context,
                 health = await health_response.json()
                 if context.debug:
                     print(f"HEALTH - response for expected health status='{expected_health_status}' on "
-                          f"'{base_url}/health'?{params} is {health}")
+                          f"'{base_url}/health'?{params} is {health}\n")
                 if (status_code == expected_http_status_code
                         and health['status'] == expected_health_status
                         and (slots_idle is None or health['slots_idle'] == slots_idle)
@@ -778,7 +873,7 @@ async def wait_for_health_status(context,
             if expected_http_status_code == 503:
                 if len(context.tasks_result) == 0:
                     print("\x1b[5;37;43mWARNING: forcing concurrent tasks,"
-                          " busy health check missed, probably too fast inference\x1b[0m")
+                          " busy health check missed, probably too fast inference\x1b[0m\n")
                     n_completions = await gather_tasks_results(context)
                     if n_completions > 0:
                         return
@@ -812,6 +907,11 @@ def assert_slots_status(slots, expected_slots):
                              f" = {expected[key]} != {slot[key]}")
 
 
+async def completions_seed(context):
+    return context.seed if hasattr(context, 'seed') and context.seed is not None \
+        else context.server_seed if hasattr(context, 'server_seed') else None
+
+
 def start_server_background(context):
     context.server_path = '../../../build/bin/server'
     if 'LLAMA_SERVER_BIN_PATH' in os.environ:
@@ -821,6 +921,10 @@ def start_server_background(context):
                    '--port', context.server_port,
                    '--model', context.model_file
                    ]
+    if context.n_batch:
+        server_args.extend(['--batch-size', context.n_batch])
+    if context.n_gpu_layer:
+        server_args.extend(['--n-gpu-layers', context.n_gpu_layer])
     if context.server_continuous_batching:
         server_args.append('--cont-batching')
     if context.server_embeddings:
@@ -841,7 +945,7 @@ def start_server_background(context):
         server_args.append('--verbose')
     if 'SERVER_LOG_FORMAT_JSON' not in os.environ:
         server_args.extend(['--log-format', "text"])
-    print(f"starting server with: {context.server_path}", *server_args)
+    print(f"starting server with: {context.server_path} {server_args}\n")
     context.server_process = subprocess.Popen(
         [str(arg) for arg in [context.server_path, *server_args]],
         close_fds=True)
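Taken together, the new context fields mean `start_server_background` now appends `--batch-size` and `--n-gpu-layers` whenever a scenario sets them. As a sketch, the command line it assembles looks roughly like the following; the model path is wherever `hf_hub_download` cached the file, shown here with an illustrative placeholder:

```shell
# Approximate server invocation built by start_server_background (values illustrative).
../../../build/bin/server \
  --port 8080 \
  --model ~/.cache/huggingface/hub/models--ggml-org--models/snapshots/<rev>/tinyllamas/stories260K.gguf \
  --batch-size 512 \
  --n-gpu-layers 33 \
  --log-format text
```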
examples/server/tests/features/wrong_usages.feature

@@ -7,7 +7,7 @@ Feature: Wrong usage of llama.cpp server
   # or pass n_predict/max_tokens in the request.
   Scenario: Infinite loop
     Given a server listening on localhost:8080
-    And a model file stories260K.gguf
+    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
     # Uncomment below to fix the issue
     #And 64 server max tokens to predict
     Then the server is starting

@@ -18,4 +18,5 @@ Feature: Wrong usage of llama.cpp server
     # Uncomment below to fix the issue
     #And 128 max tokens to predict
     Given concurrent completion requests
+    Then the server is idle
     Then all prompts are predicted
examples/server/tests/requirements.txt

@@ -1,4 +1,5 @@
 aiohttp~=3.9.3
 behave~=1.2.6
+huggingface_hub~=0.20.3
 openai~=0.25.0
 prometheus-client~=0.20.0
examples/server/tests/tests.sh

@@ -5,7 +5,7 @@ set -eu
 if [ $# -lt 1 ]
 then
   # Start @llama.cpp scenario
-  behave --summary --stop --no-capture --exclude 'issues|wrong_usages' --tags llama.cpp
+  behave --summary --stop --no-capture --exclude 'issues|wrong_usages|slow' --tags llama.cpp
 else
   behave "$@"
 fi
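The added `slow` exclusion keeps heavy scenarios tagged `@slow` out of the default scope. Because extra arguments are forwarded to behave unchanged, they can still be run on demand, for example:

```shell
# Run only the heavy scenarios that the default scope now skips.
./tests.sh --no-skipped --tags slow
```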