ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)

* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode (see the example chunk below)

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as they are not accurate

* ci: bench: fix the case when no tokens are generated

* ci: bench: switch to the 95th percentile for pp and tg, as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate
Pierrick Hymbert, 2024-04-06 05:40:47 +02:00, committed by GitHub
commit 75cd4c7729 (parent a8bd14d557)
5 changed files with 112 additions and 38 deletions


@@ -2,13 +2,15 @@
 Benchmark is using [k6](https://k6.io/).
-##### Install k6
+##### Install k6 and the SSE extension
 Follow the instructions from: https://k6.io/docs/get-started/installation/
+SSE is not supported by default in k6; you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.
-Example for Ubuntu:
+Example:
 ```shell
-snap install k6
+go install go.k6.io/xk6/cmd/xk6@latest
+xk6 build master \
+--with github.com/phymbert/xk6-sse
 ```
 #### Download a dataset
@@ -46,7 +48,7 @@ server --host localhost --port 8080 \
 For 500 chat completion requests with 8 concurrent users over a maximum of 10 minutes, run:
 ```shell
-k6 run script.js --duration 10m --iterations 500 --vus 8
+./k6 run script.js --duration 10m --iterations 500 --vus 8
 ```
 The benchmark values can be overridden with:
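Note that `xk6 build` writes the resulting binary to `./k6` in the current directory, hence `./k6 run`. The override values are passed to the script as k6 environment variables via `-e`; a minimal sketch, assuming variable names such as `SERVER_BENCH_URL` and `SERVER_BENCH_MAX_TOKENS` defined in `script.js` (check the script for the authoritative list):

```shell
# override the target server URL and the completion budget (variable names assumed)
./k6 run script.js --duration 10m --iterations 500 --vus 8 \
    -e SERVER_BENCH_URL=http://localhost:8080/v1 \
    -e SERVER_BENCH_MAX_TOKENS=512
```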
@@ -86,3 +88,33 @@ K6 metrics might be compared against [server metrics](../README.md), with:
 ```shell
 curl http://localhost:8080/metrics
 ```
+### Using the CI Python script
+The `bench.py` script performs several steps:
+- start the server
+- define suitable variables for k6
+- run the k6 script
+- extract metrics from Prometheus (see the query sketch below)
+It is meant to be used in the CI, but you can run it manually:
+```shell
+LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/server python bench.py \
+    --runner-label local \
+    --name local \
+    --branch `git rev-parse --abbrev-ref HEAD` \
+    --commit `git rev-parse HEAD` \
+    --scenario script.js \
+    --duration 5m \
+    --hf-repo ggml-org/models \
+    --hf-file phi-2/ggml-model-q4_0.gguf \
+    --model-path-prefix models \
+    --parallel 4 \
+    -ngl 33 \
+    --batch-size 2048 \
+    --ubatch-size 256 \
+    --ctx-size 4096 \
+    --n-prompts 200 \
+    --max-prompt-tokens 256 \
+    --max-tokens 256
+```
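As a sketch of the "extract metrics from Prometheus" step: assuming a Prometheus instance on its default port 9090 is scraping the server, individual series can be pulled through the Prometheus HTTP API (the metric name below is an assumption; list the real ones via the server's `/metrics` endpoint):

```shell
# query the Prometheus HTTP API for one of the scraped server metrics
# (llamacpp:prompt_tokens_seconds is an assumed name; check /metrics for the real list)
curl -s 'http://localhost:9090/api/v1/query?query=llamacpp:prompt_tokens_seconds'
```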