remove feature files
parent c432a82295
commit 58cbcd2371

15 changed files with 0 additions and 2465 deletions

@@ -1,66 +0,0 @@
@llama.cpp
@ctx_shift
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a model file test-model.gguf
    And a model alias tinyllama-2
    And BOS token is 1
    And 42 as server seed
    And 256 KV cache size
    And 32 as batch size
    And 2 slots

  # the prompt is 301 tokens
  # the slot context is 256/2 = 128 tokens
  # the prompt is truncated to keep the last 109 tokens
  # 64 tokens are generated thanks to shifting the context when it gets full
  Scenario: Inference with context shift
    And 64 server max tokens to predict
    Then the server is starting
    Then the server is healthy
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And the completion is truncated
    And 109 prompt tokens are processed

  Scenario Outline: Inference without context shift
    And <n_predict> server max tokens to predict
    And disable context shifting
    Then the server is starting
    Then the server is healthy
    Given a prompt:
    """
    Hi how are you
    """
    And a completion request with no api error
    Then <n_token_output> tokens are predicted matching twind|Anna
    And the completion is <truncated> truncated
    And 8 prompt tokens are processed
    Examples:
      | n_predict | n_token_output | truncated |
      | 64        | 64             | not       |
      | -1        | 120            |           |

  Scenario: Inference without context shift (expected error: prompt too long)
    And disable context shifting
    Then the server is starting
    Then the server is healthy
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with 400 api error

@@ -1,113 +0,0 @@
@llama.cpp
@embeddings
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model url https://huggingface.co/ggml-org/models/resolve/main/bert-bge-small/ggml-model-f16.gguf
    And a model file bert-bge-small.gguf
    And a model alias bert-bge-small
    And 42 as server seed
    And 2 slots
    # the bert-bge-small model has context size of 512
    # since the generated prompts are as big as the batch size, we need to set the batch size to <= 512
    # ref: https://huggingface.co/BAAI/bge-small-en-v1.5/blob/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/config.json#L20
    And 128 as batch size
    And 128 as ubatch size
    And 512 KV cache size
    And enable embeddings endpoint
    Then the server is starting
    Then the server is healthy

  Scenario: Embedding
    When embeddings are computed for:
    """
    What is the capital of Bulgaria ?
    """
    Then embeddings are generated

  Scenario: Embedding (error: prompt too long)
    When embeddings are computed for:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And embeddings request with 500 api error

  Scenario: OAI Embeddings compatibility
    Given a model bert-bge-small
    When an OAI compatible embeddings computation request for:
    """
    What is the capital of Spain ?
    """
    Then embeddings are generated

  Scenario: OAI Embeddings compatibility with multiple inputs
    Given a model bert-bge-small
    Given a prompt:
    """
    In which country Paris is located ?
    """
    And a prompt:
    """
    Is Madrid the capital of Spain ?
    """
    When an OAI compatible embeddings computation request for multiple inputs
    Then embeddings are generated

  Scenario: Multi users embeddings
    Given a prompt:
    """
    Write a very long story about AI.
    """
    And a prompt:
    """
    Write another very long music lyrics.
    """
    And a prompt:
    """
    Write a very long poem.
    """
    And a prompt:
    """
    Write a very long joke.
    """
    Given concurrent embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated

  Scenario: Multi users OAI compatibility embeddings
    Given a prompt:
    """
    In which country Paris is located ?
    """
    And a prompt:
    """
    Is Madrid the capital of Spain ?
    """
    And a prompt:
    """
    What is the biggest US city ?
    """
    And a prompt:
    """
    What is the capital of Bulgaria ?
    """
    And a model bert-bge-small
    Given concurrent OAI embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated

  Scenario: All embeddings should be the same
    Given 10 fixed prompts
    And a model bert-bge-small
    Given concurrent OAI embedding requests
    Then all embeddings are the same

@@ -1,71 +0,0 @@
import os
import signal
import socket
import sys
import time
import traceback
from contextlib import closing
from subprocess import TimeoutExpired


def before_scenario(context, scenario):
    context.debug = 'DEBUG' in os.environ and os.environ['DEBUG'] == 'ON'
    if context.debug:
        print("DEBUG=ON")
    print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m")
    port = 8080
    if 'PORT' in os.environ:
        port = int(os.environ['PORT'])
    if is_server_listening("localhost", port):
        assert False, "Server already started"


def after_scenario(context, scenario):
    try:
        if 'server_process' not in context or context.server_process is None:
            return
        if scenario.status == "failed":
            if 'GITHUB_ACTIONS' in os.environ:
                print(f"\x1b[33;101mSCENARIO FAILED: {scenario.name} server logs:\x1b[0m\n")
                if os.path.isfile('llama.log'):
                    with closing(open('llama.log', 'r')) as f:
                        for line in f:
                            print(line)
            if not is_server_listening(context.server_fqdn, context.server_port):
                print("\x1b[33;101mERROR: Server stopped listening\x1b[0m")

        if context.server_process.poll() is not None:
            assert False, f"Server not running pid={context.server_process.pid} ..."

        server_graceful_shutdown(context)  # SIGINT

        try:
            context.server_process.wait(0.5)
        except TimeoutExpired:
            print(f"server still alive after 500ms, force-killing pid={context.server_process.pid} ...")
            context.server_process.kill()  # SIGKILL
            context.server_process.wait()

        while is_server_listening(context.server_fqdn, context.server_port):
            time.sleep(0.1)
    except Exception:
        print("ignoring error in after_scenario:")
        traceback.print_exc(file=sys.stdout)


def server_graceful_shutdown(context):
    print(f"shutting down server pid={context.server_process.pid} ...")
    if os.name == 'nt':
        interrupt = signal.CTRL_C_EVENT
    else:
        interrupt = signal.SIGINT
    context.server_process.send_signal(interrupt)


def is_server_listening(server_fqdn, server_port):
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        result = sock.connect_ex((server_fqdn, server_port))
        _is_server_listening = result == 0
        if _is_server_listening:
            print(f"server is listening on {server_fqdn}:{server_port}...")
        return _is_server_listening

@@ -1,36 +0,0 @@
@llama.cpp
@infill
Feature: llama.cpp server

  # The current model is made by adding FIM tokens to the existing stories260K
  # We may want to use a better model in the future, maybe something like SmolLM 360M

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K-infill.gguf from HF repo ggml-org/models
    And a model file test-model-infill.gguf
    And a model alias tinyllama-infill
    And 42 as server seed
    And 1024 as batch size
    And 1024 as ubatch size
    And 2048 KV cache size
    And 64 max tokens to predict
    And 0.0 temperature
    Then the server is starting
    Then the server is healthy

  Scenario: Infill without input_extra
    Given a prompt "Complete this"
    And an infill input extra none none
    And an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_"
    And an infill input suffix "}\n"
    And an infill request with no api error
    Then 64 tokens are predicted matching One|day|she|saw|big|scary|bird

  Scenario: Infill with input_extra
    Given a prompt "Complete this"
    And an infill input extra "llama.h" "LLAMA_API int32_t llama_n_threads();\n"
    And an infill input prefix "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_"
    And an infill input suffix "}\n"
    And an infill request with no api error
    Then 64 tokens are predicted matching cuts|Jimmy|mom|came|into|the|room"

@@ -1,5 +0,0 @@
# List of ongoing issues
# run with: DEBUG=ON ./tests.sh --no-skipped --tags bug
@bug
Feature: Issues
  # No confirmed issue at the moment

@@ -1,36 +0,0 @@
@llama.cpp
@lora
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model url https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/stories15M_MOE-F16.gguf
    And a model file stories15M_MOE-F16.gguf
    And a model alias stories15M_MOE
    And a lora adapter file from https://huggingface.co/ggml-org/stories15M_MOE/resolve/main/moe_shakespeare15M.gguf
    And 42 as server seed
    And 1024 as batch size
    And 1024 as ubatch size
    And 2048 KV cache size
    And 64 max tokens to predict
    And 0.0 temperature
    Then the server is starting
    Then the server is healthy

  Scenario: Completion LoRA disabled
    Given switch off lora adapter 0
    Given a prompt:
    """
    Look in thy glass
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching little|girl|three|years|old

  Scenario: Completion LoRA enabled
    Given switch on lora adapter 0
    Given a prompt:
    """
    Look in thy glass
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching eye|love|glass|sun

@@ -1,131 +0,0 @@
@llama.cpp
@parallel
Feature: Parallel

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 42 as server seed
    And 128 as batch size
    And 256 KV cache size
    And 2 slots
    And continuous batching
    Then the server is starting
    Then the server is healthy

  Scenario Outline: Multi users completion
    Given a prompt:
    """
    Write a very long story about AI.
    """
    And a prompt:
    """
    Write another very long music lyrics.
    """
    And <n_predict> max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | n_predict |
      | 128 |

  Scenario Outline: Multi users OAI completions compatibility
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
    """
    Write a very long book.
    """
    And a prompt:
    """
    Write another a poem.
    """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |

  Scenario Outline: Multi users OAI completions compatibility no v1
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
    """
    Write a very long book.
    """
    And a prompt:
    """
    Write another a poem.
    """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests no v1
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |

  Scenario Outline: Multi users with number of prompts exceeding number of slots
    Given a system prompt You are a writer.
    And a model tinyllama-2
    Given a prompt:
    """
    Write a very long book.
    """
    And a prompt:
    """
    Write another a poem.
    """
    And a prompt:
    """
    What is LLM?
    """
    And a prompt:
    """
    The sky is blue and I love it.
    """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      | enabled   | 64        |

  Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size #3969
    Given a prompt:
    """
    Write a very long story about AI.
    """
    And a prompt:
    """
    Write another very long music lyrics.
    """
    And a prompt:
    """
    Write a very long poem.
    """
    And a prompt:
    """
    Write a very long joke.
    """
    And 128 max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted

@@ -1,56 +0,0 @@
# run with: ./tests.sh --no-skipped --tags passkey
@passkey
@slow
Feature: Passkey / Self-extend with context shift

  Background: Server startup
    Given a server listening on localhost:8080

  # Generates a long text of junk and inserts a secret passkey number inside it.
  # Then we query the LLM for the secret passkey.
  # see #3856 and #4810
  Scenario Outline: Passkey
    Given a model file <hf_file> from HF repo <hf_repo>
    And <n_batch> as batch size
    And <n_junk> as number of junk
    And <n_predicted> server max tokens to predict
    And 42 as seed
    And 0.0 temperature
    And <n_ctx> KV cache size
    And 1 slots
    And <n_ga> group attention factor to extend context size through self-extend
    And <n_ga_w> group attention width to extend context size through self-extend
    # Can be override with N_GPU_LAYERS
    And <ngl> GPU offloaded layers
    Then the server is starting
    # Higher timeout because the model may need to be downloaded from the internet
    Then the server is healthy with timeout 120 seconds
    Given available models
    Then model 0 is trained on <n_ctx_train> tokens context
    Given a prefix prompt:
    """
    here is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.
    """
    And a passkey prompt template:
    """
    The pass key is <passkey> Remember it. <passkey> is the pass key.
    """
    And a junk suffix prompt:
    """
    The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.
    """
    And a suffix prompt:
    """
    What is the pass key? The pass key is
    """
    Given a "<passkey>" passkey challenge prompt with the passkey inserted every <i_pos> junk
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples:
      | hf_repo | hf_file | n_ctx_train | ngl | n_ctx | n_batch | n_ga | n_ga_w | n_junk | i_pos | passkey | n_predicted | re_content |
      | TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 8192 | 512 | 4 | 512 | 250 | 50 | 42 | 1 | 42 |
      | TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 8192 | 512 | 2 | 512 | 250 | 50 | 42 | 1 | \b((?!42)\w)+\b |
      #| TheBloke/Llama-2-7B-GGUF | llama-2-7b.Q2_K.gguf | 4096 | 3 | 16384 | 512 | 4 | 512 | 500 | 300 | 1234 | 5 | 1234 |
      #| TheBloke/Mixtral-8x7B-v0.1-GGUF | mixtral-8x7b-v0.1.Q2_K.gguf | 32768 | 2 | 16384 | 512 | 4 | 512 | 500 | 100 | 0987 | 5 | 0
      # 987 |

@@ -1,42 +0,0 @@
@llama.cpp
@rerank
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model url https://huggingface.co/ggml-org/models/resolve/main/jina-reranker-v1-tiny-en/ggml-model-f16.gguf
    And a model file jina-reranker-v1-tiny-en.gguf
    And a model alias jina-reranker-v1-tiny-en
    And 42 as server seed
    And 2 slots
    And 512 as batch size
    And 512 as ubatch size
    And 512 KV cache size
    And enable reranking endpoint
    Then the server is starting
    Then the server is healthy

  Scenario: Rerank
    Given a rerank query:
    """
    Machine learning is
    """
    And a rerank document:
    """
    A machine is a physical system that uses power to apply forces and control movement to perform an action. The term is commonly applied to artificial devices, such as those employing engines or motors, but also to natural biological macromolecules, such as molecular machines.
    """
    And a rerank document:
    """
    Learning is the process of acquiring new understanding, knowledge, behaviors, skills, values, attitudes, and preferences. The ability to learn is possessed by humans, non-human animals, and some machines; there is also evidence for some kind of learning in certain plants.
    """
    And a rerank document:
    """
    Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.
    """
    And a rerank document:
    """
    Paris, capitale de la France, est une grande ville européenne et un centre mondial de l'art, de la mode, de la gastronomie et de la culture. Son paysage urbain du XIXe siècle est traversé par de larges boulevards et la Seine.
    """
    When reranking request
    Then reranking results are returned
    Then reranking highest score is index 2 and lowest score is index 3

@@ -1,118 +0,0 @@
@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And a model file test-model-00001-of-00003.gguf
    And 128 as batch size
    And 1024 KV cache size
    And 128 max tokens to predict
    And continuous batching

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And 1.0 temperature
    Then the server is starting
    Then the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1 |
      # FIXME: unified KV cache nondeterminism
      # | 2 |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And 1.0 temperature
    Then the server is starting
    Then the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1 |
      | 2 |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And <temp> temperature
    # And 0 as draft
    Then the server is starting
    Then the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And <n_kv> KV cache size
    And 1.0 temperature
    And <n_predict> max tokens to predict
    Then the server is starting
    Then the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |

@@ -1,68 +0,0 @@
@llama.cpp
@security
Feature: Security

  Background: Server startup with an api key defined
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a server api key THIS_IS_THE_KEY
    Then the server is starting
    Then the server is healthy

  Scenario Outline: Completion with some user api key
    Given a prompt test
    And a user api key <api_key>
    And 4 max tokens to predict
    And a completion request with <api_error> api error

    Examples: Prompts
      | api_key         | api_error |
      | THIS_IS_THE_KEY | no        |
      | THIS_IS_THE_KEY | no        |
      | hackeme         | raised    |
      |                 | raised    |

  Scenario Outline: OAI Compatibility
    Given a system prompt test
    And a user prompt test
    And a model test
    And 2 max tokens to predict
    And streaming is disabled
    And a user api key <api_key>
    Given an OAI compatible chat completions request with <api_error> api error

    Examples: Prompts
      | api_key         | api_error |
      | THIS_IS_THE_KEY | no        |
      | THIS_IS_THE_KEY | no        |
      | hackme          | raised    |

  Scenario Outline: OAI Compatibility (invalid response formats)
    Given a system prompt test
    And a user prompt test
    And a response format <response_format>
    And a model test
    And 2 max tokens to predict
    And streaming is disabled
    Given an OAI compatible chat completions request with raised api error

    Examples: Prompts
      | response_format |
      | {"type": "sound"} |
      | {"type": "json_object", "schema": 123} |
      | {"type": "json_object", "schema": {"type": 123}} |
      | {"type": "json_object", "schema": {"type": "hiccup"}} |

  Scenario Outline: CORS Options
    Given a user api key THIS_IS_THE_KEY
    When an OPTIONS request is sent from <origin>
    Then CORS header <cors_header> is set to <cors_header_value>

    Examples: Headers
      | origin          | cors_header                      | cors_header_value |
      | localhost       | Access-Control-Allow-Origin      | localhost         |
      | web.mydomain.fr | Access-Control-Allow-Origin      | web.mydomain.fr   |
      | origin          | Access-Control-Allow-Credentials | true              |
      | web.mydomain.fr | Access-Control-Allow-Methods     | GET, POST         |
      | web.mydomain.fr | Access-Control-Allow-Headers     | *                 |

@@ -1,120 +0,0 @@
@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a model file test-model.gguf
    And a model alias tinyllama-2
    And BOS token is 1
    And 42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And 256 KV cache size
    And 32 as batch size
    And 2 slots
    And 64 server max tokens to predict
    And prometheus compatible metrics exposed
    Then the server is starting
    Then the server is healthy

  Scenario: Health
    Then the server is ready
    And all slots are idle

  Scenario Outline: Completion
    Given a prompt <prompt>
    And <n_predict> max tokens to predict
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And the completion is <truncated> truncated
    And <n_prompt> prompt tokens are processed
    And prometheus metrics are exposed
    And metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt | n_predict | re_content | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is | 8 | (read\|going)+ | 18 | 8 | not |
      | Write a joke about AI from a very long prompt which will not be truncated | 256 | (princesses\|everyone\|kids\|Anna\|forest)+ | 46 | 64 | not |

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And the completion is truncated
    And 109 prompt tokens are processed

  Scenario Outline: OAI Compatibility
    Given a model <model>
    And a system prompt <system_prompt>
    And a user prompt <user_prompt>
    And <max_tokens> max tokens to predict
    And streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And <n_prompt> prompt tokens are processed
    And the completion is <truncated> truncated

    Examples: Prompts
      | model | system_prompt | user_prompt | max_tokens | re_content | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2 | Book | What is the best book | 8 | (Here\|what)+ | 77 | 8 | disabled | not |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128 | (thanks\|happy\|bird\|Annabyear)+ | -1 | 64 | enabled | |

  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And a system prompt test
    And a user prompt test
    And a response format <response_format>
    And 10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format | n_predicted | re_content |
      | {"type": "json_object", "schema": {"const": "42"}} | 6 | "42" |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10 | \[ -300 \] |
      | {"type": "json_object"} | 10 | \{ " Jacky. |

  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
    And tokens do not begin with BOS

  Scenario: Tokenize w/ BOS
    Given adding special tokens
    When tokenizing:
    """
    What is the capital of Germany?
    """
    Then tokens begin with BOS
    Given first token is removed
    Then tokens can be detokenized

  Scenario: Tokenize with pieces
    When tokenizing with pieces:
    """
    What is the capital of Germany?
    媽
    """
    Then tokens are given with pieces

  Scenario: Models available
    Given available models
    Then 1 models are supported
    Then model 0 is identified by tinyllama-2
    Then model 0 is trained on 128 tokens context

@@ -1,58 +0,0 @@
@llama.cpp
@slotsave
Feature: llama.cpp server slot management

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And prompt caching is enabled
    And 2 slots
    And . as slot save path
    And 2048 KV cache size
    And 42 as server seed
    And 24 max tokens to predict
    Then the server is starting
    Then the server is healthy

  Scenario: Save and Restore Slot
    # First prompt in slot 1 should be fully processed
    Given a user prompt "What is the capital of France?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
    When the slot 1 is saved with filename "slot1.bin"
    Then the server responds with status code 200
    # Since we have cache, this should only process the last tokens
    Given a user prompt "What is the capital of Germany?"
    And a completion request with no api error
    Then 24 tokens are predicted matching (Thank|special)
    And 7 prompt tokens are processed
    # Loading the original cache into slot 0,
    # we should only be processing 1 prompt token and get the same output
    When the slot 0 is restored with filename "slot1.bin"
    Then the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And using slot id 0
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 1 prompt tokens are processed
    # For verification that slot 1 was not corrupted during slot 0 load, same thing
    Given a user prompt "What is the capital of Germany?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Thank|special)
    And 1 prompt tokens are processed

  Scenario: Erase Slot
    Given a user prompt "What is the capital of France?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
    When the slot 1 is erased
    Then the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed

File diff suppressed because it is too large
@@ -1,25 +0,0 @@
# run with: ./tests.sh --no-skipped --tags wrong_usage
@wrong_usage
Feature: Wrong usage of llama.cpp server

  #3969 The user must always set --n-predict option
  # to cap the number of tokens any completion request can generate
  # or pass n_predict/max_tokens in the request.
  Scenario: Infinite loop
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And 42 as server seed
    And 2048 KV cache size
    # Uncomment below to fix the issue
    #And 64 server max tokens to predict
    Then the server is starting
    Then the server is healthy
    Given a prompt:
    """
    Go to: infinite loop
    """
    # Uncomment below to fix the issue
    #And 128 max tokens to predict
    Given concurrent completion requests
    Then the server is idle
    Then all prompts are predicted