server: init functional tests (#5566)

* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-24 12:28:55 +01:00
commit 525213d2f5
parent fd43d66f46
14 changed files with 1243 additions and 18 deletions
--- a/examples/server/tests/features/parallel.feature
+++ b/examples/server/tests/features/parallel.feature
@@ -0,0 +1,77 @@
+@llama.cpp
+Feature: Parallel
+
+  Background: Server startup
+    Given a server listening on localhost:8080
+    And   a model file stories260K.gguf
+    And   a model alias tinyllama-2
+    And   42 as server seed
+    And   64 KV cache size
+    And   2 slots
+    And   continuous batching
+    Then  the server is starting
+    Then  the server is healthy
+
+  Scenario Outline: Multi users completion
+    Given a prompt:
+      """
+      Write a very long story about AI.
+      """
+    And a prompt:
+      """
+      Write another very long music lyrics.
+      """
+    And <n_predict> max tokens to predict
+    Given concurrent completion requests
+    Then the server is busy
+    Then the server is idle
+    And  all slots are idle
+    Then all prompts are predicted with <n_predict> tokens
+    Examples:
+      | n_predict |
+      | 128       |
+
+  Scenario Outline: Multi users OAI completions compatibility
+    Given a system prompt You are a writer.
+    And   a model tinyllama-2
+    Given a prompt:
+      """
+      Write a very long book.
+      """
+    And a prompt:
+      """
+      Write another a poem.
+      """
+    And <n_predict> max tokens to predict
+    And streaming is <streaming>
+    Given concurrent OAI completions requests
+    Then the server is busy
+    Then the server is idle
+    Then all prompts are predicted with <n_predict> tokens
+    Examples:
+      | streaming | n_predict |
+      | disabled  | 128       |
+      | enabled   | 64        |
+
+  Scenario:  Multi users with total number of tokens to predict exceeds the KV Cache size #3969
+    Given a prompt:
+      """
+      Write a very long story about AI.
+      """
+    And a prompt:
+      """
+      Write another very long music lyrics.
+      """
+    And a prompt:
+      """
+      Write a very long poem.
+      """
+    And a prompt:
+      """
+      Write a very long joke.
+      """
+    And 128 max tokens to predict
+    Given concurrent completion requests
+    Then the server is busy
+    Then the server is idle
+    Then all prompts are predicted
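
The "concurrent completion requests" steps above imply that every queued prompt is sent at the same time and that each result is checked afterwards. The following is a minimal sketch of that pattern, not the step implementation from this commit. `fake_completion`, `run_concurrent` and `all_predicted_with` are hypothetical names, and the HTTP call is replaced by a stub so the example runs without a server.

```python
import asyncio


async def fake_completion(prompt: str, n_predict: int) -> dict:
    """Hypothetical stand-in for an HTTP request to a completion endpoint."""
    await asyncio.sleep(0.01)  # simulate network and generation latency
    # Assumption: the response reports how many tokens were generated.
    return {"prompt": prompt, "tokens_predicted": n_predict}


async def run_concurrent(prompts: list[str], n_predict: int) -> list[dict]:
    # Create one task per prompt so all requests are in flight together,
    # then wait for all of them. gather() returns results in input order.
    tasks = [asyncio.create_task(fake_completion(p, n_predict)) for p in prompts]
    return await asyncio.gather(*tasks)


def all_predicted_with(results: list[dict], n_predict: int) -> bool:
    # Mirrors "all prompts are predicted with <n_predict> tokens".
    return all(r["tokens_predicted"] == n_predict for r in results)


if __name__ == "__main__":
    prompts = [
        "Write a very long story about AI.",
        "Write another very long music lyrics.",
    ]
    results = asyncio.run(run_concurrent(prompts, 128))
    assert all_predicted_with(results, 128)
```

In a real step definition the stub would be replaced by an actual request to the server under test, and the busy/idle checks would poll the health endpoint while the tasks are pending.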