server: tests: reduce n_ctx and n_predict for parallel (//) prompts, as they are too slow in the CI.

This commit is contained in:
Pierrick HYMBERT 2024-02-23 02:38:37 +01:00
parent 5110de08e3
commit bedf37c9d1


@@ -6,7 +6,7 @@ Feature: Parallel
     And a model file stories260K.gguf
     And a model alias tinyllama-2
     And 42 as server seed
-    And 32 KV cache size
+    And 64 KV cache size
     And 2 slots
     And continuous batching
     Then the server is starting
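A minimal sketch of the arithmetic behind the hunk above, assuming (as the llama.cpp server does for parallel decoding) that the KV cache is split evenly across slots; the even-split rule is an assumption here, the numeric values come from the diff:

```python
# Values from the hunk above.
kv_cache_size = 64  # "And 64 KV cache size"
n_slots = 2         # "And 2 slots"

# Assumed even split of the KV cache across parallel slots.
ctx_per_slot = kv_cache_size // n_slots
print(ctx_per_slot)  # 32 context tokens per slot
```

With 32 tokens of context per slot, the reduced n_predict values in the later hunks (64 and 128) still drive each slot past its share of the cache without burning CI time on 512-token generations.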
@@ -29,7 +29,7 @@ Feature: Parallel
     Then all prompts are predicted with <n_predict> tokens

     Examples:
       | n_predict |
-      | 512       |
+      | 128       |

   Scenario Outline: Multi users OAI completions compatibility
     Given a system prompt You are a writer.
@ -50,15 +50,15 @@ Feature: Parallel
Then all prompts are predicted with <n_predict> tokens Then all prompts are predicted with <n_predict> tokens
Examples: Examples:
| streaming | n_predict | | streaming | n_predict |
| disabled | 512 | | disabled | 64 |
#| enabled | 512 | FIXME: phymbert: need to investigate why in aiohttp with streaming only one token is generated #| enabled | 64 | FIXME: phymbert: need to investigate why in aiohttp with streaming only one token is generated
Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size #3969 Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size #3969
Given a server listening on localhost:8080 Given a server listening on localhost:8080
And a model file stories260K.gguf And a model file stories260K.gguf
And 42 as server seed And 42 as server seed
And 2 slots And 2 slots
And 1024 KV cache size And 64 KV cache size
Then the server is starting Then the server is starting
Then the server is healthy Then the server is healthy
Given a prompt: Given a prompt:
@@ -77,7 +77,7 @@ Feature: Parallel
     """
     Write a very long joke.
     """
-    And 2048 max tokens to predict
+    And 128 max tokens to predict
     Given concurrent completion requests
     Then the server is busy
     Then the server is idle