server : refactor multitask handling (#9274)

* server : remove multitask from server_task

* refactor completions handler

* fix embeddings

* use res_ok everywhere

* small change for handle_slots_action

* use unordered_set everywhere

* (try) fix test

* no more "mutable" lambda

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use deque

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Xuan Son Nguyen 2024-09-02 17:11:51 +02:00 committed by GitHub
parent b60074f1c2
commit 6e7d133a5f
5 changed files with 365 additions and 462 deletions
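The commit message bullets above ("use deque", "use unordered_set everywhere") name container choices in the server's task plumbing: a FIFO of pending tasks plus a set of task ids whose results are still awaited. The sketch below is only an illustration of that general pattern under assumed names (task, task_queue, waiting_ids, post, pop, finish are hypothetical); it is not the actual llama.cpp server code or API.

// Illustrative sketch only -- not the actual llama.cpp server implementation.
// A std::deque holds tasks in FIFO order; a std::unordered_set tracks the ids
// of tasks whose results a caller is still waiting on.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <unordered_set>
#include <utility>

struct task {                  // hypothetical task record
    int id;
    // ... request payload ...
};

struct task_queue {            // hypothetical queue type
    std::deque<task>        queue;        // tasks waiting to run, FIFO order
    std::unordered_set<int> waiting_ids;  // ids whose results are awaited
    std::mutex              mtx;
    std::condition_variable cv;

    // enqueue a task and start tracking its id
    void post(task && t) {
        std::lock_guard<std::mutex> lk(mtx);
        waiting_ids.insert(t.id);
        queue.push_back(std::move(t));
        cv.notify_one();
    }

    // block until a task is available, then take it from the front
    task pop() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return !queue.empty(); });
        task t = std::move(queue.front());
        queue.pop_front();
        return t;
    }

    // once a result has been delivered, stop tracking the id
    void finish(int id) {
        std::lock_guard<std::mutex> lk(mtx);
        waiting_ids.erase(id);
    }
};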


@@ -52,8 +52,8 @@ Feature: Parallel
     Then all prompts are predicted with <n_predict> tokens
     Examples:
       | streaming | n_predict |
-      | disabled  | 128       |
-      | enabled   | 64        |
+      | disabled  | 200       |
+      | enabled   | 200       |

   Scenario Outline: Multi users OAI completions compatibility no v1
     Given a system prompt You are a writer.


@@ -818,7 +818,7 @@ async def concurrent_requests(context, f_completion, *args, **kwargs):
     for prompt_no in range(context.n_prompts):
         shifted_args = [context.prompts.pop(), seeds[prompt_no], *args]
         context.concurrent_tasks.append(asyncio.create_task(f_completion(*shifted_args, **kwargs)))
-        await asyncio.sleep(0.1)
+        await asyncio.sleep(0.01)


@step('the slot {slot_id:d} is saved with filename "{filename}"')


@@ -8,9 +8,12 @@ Feature: Wrong usage of llama.cpp server
   Scenario: Infinite loop
     Given a server listening on localhost:8080
     And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
+    And 42 as server seed
+    And 2048 KV cache size
     # Uncomment below to fix the issue
     #And 64 server max tokens to predict
     Then the server is starting
+    Then the server is healthy
     Given a prompt:
       """
       Go to: infinite loop