server : refactor multitask handling (#9274)
* server : remove multitask from server_task
* refactor completions handler
* fix embeddings
* use res_ok everywhere
* small change for handle_slots_action
* use unordered_set everywhere
* (try) fix test
* no more "mutable" lambda
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use deque

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
parent b60074f1c2
commit 6e7d133a5f

5 changed files with 365 additions and 462 deletions
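The commit message mentions switching the deferred-task container to a deque and tracking pending tasks in an unordered_set. The actual change lives in the C++ server code, which is not part of this diff; purely as an illustration of the pattern, here is a minimal Python sketch of a FIFO deferred-task queue plus a set of waiting task ids (all names are hypothetical, not the server's real identifiers):

```python
from collections import deque

class TaskQueue:
    """Illustrative FIFO task queue: deferred tasks are appended to the back
    and processed from the front, so they run in arrival order."""

    def __init__(self):
        self.deferred = deque()   # tasks waiting for a free slot
        self.waiting_ids = set()  # ids of tasks a client is still waiting on

    def post(self, task_id, task):
        self.waiting_ids.add(task_id)
        self.deferred.append((task_id, task))

    def process_next(self):
        if not self.deferred:
            return None
        task_id, task = self.deferred.popleft()  # O(1) pop from the front
        self.waiting_ids.discard(task_id)
        return task

# usage
q = TaskQueue()
q.post(1, "completion request A")
q.post(2, "completion request B")
assert q.process_next() == "completion request A"  # FIFO order preserved
```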
@@ -52,8 +52,8 @@ Feature: Parallel
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | streaming | n_predict |
-     | disabled  | 128       |
-     | enabled   | 64        |
+     | disabled  | 200       |
+     | enabled   | 200       |

  Scenario Outline: Multi users OAI completions compatibility no v1
    Given a system prompt You are a writer.
@@ -818,7 +818,7 @@ async def concurrent_requests(context, f_completion, *args, **kwargs):
    for prompt_no in range(context.n_prompts):
        shifted_args = [context.prompts.pop(), seeds[prompt_no], *args]
        context.concurrent_tasks.append(asyncio.create_task(f_completion(*shifted_args, **kwargs)))
-       await asyncio.sleep(0.1)
+       await asyncio.sleep(0.01)


@step('the slot {slot_id:d} is saved with filename "{filename}"')
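The only change in this hunk is shortening the stagger between launched requests from 0.1 s to 0.01 s. The helper's pattern is: create one asyncio task per prompt, stagger the submissions slightly, and await the results later. A minimal standalone sketch of that pattern, with a dummy coroutine standing in for the real completion request, might look like this:

```python
import asyncio

async def fake_completion(prompt: str, seed: int) -> str:
    # stand-in for an HTTP completion request
    await asyncio.sleep(0.05)
    return f"completion for {prompt!r} (seed={seed})"

async def concurrent_requests(prompts, seeds):
    tasks = []
    for prompt, seed in zip(prompts, seeds):
        tasks.append(asyncio.create_task(fake_completion(prompt, seed)))
        await asyncio.sleep(0.01)  # small stagger between submissions
    # all tasks run concurrently; gather their results in submission order
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    print(asyncio.run(concurrent_requests(["a", "b", "c"], [1, 2, 3])))
```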
@@ -8,9 +8,12 @@ Feature: Wrong usage of llama.cpp server
  Scenario: Infinite loop
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
+   And 42 as server seed
+   And 2048 KV cache size
    # Uncomment below to fix the issue
    #And 64 server max tokens to predict
    Then the server is starting
+   Then the server is healthy
    Given a prompt:
      """
      Go to: infinite loop
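This scenario reproduces the failure mode the commented-out step would fix: without a server-side cap on tokens to predict, a prompt like this can keep generating indefinitely. A request-side cap achieves the same effect; a minimal sketch, assuming the llama.cpp server example's /completion endpoint and its n_predict field (and the documented "content" field in the response):

```python
import json
import urllib.request

# Cap generation at 64 tokens so a runaway prompt cannot loop forever.
payload = {"prompt": "Go to: infinite loop", "n_predict": 64}

req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # assumes the server returns JSON with a "content" field
    print(json.loads(resp.read())["content"])
```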