ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)

* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as they are not accurate

* ci: bench: fix case when there is no token generated

* ci: bench: change to the 95th percentile for pp and tg, as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate
Pierrick Hymbert 2024-04-06 05:40:47 +02:00 committed by GitHub
parent a8bd14d557
commit 75cd4c7729
GPG key ID: B5690EEEBB952194
5 changed files with 112 additions and 38 deletions

@@ -567,6 +567,15 @@ static std::vector<json> format_partial_response_oaicompat(json result, const st
        {"model", modelname},
        {"object", "chat.completion.chunk"}
    };

    // attach token usage to the final chunk, i.e. when a finish_reason is set
    if (!finish_reason.empty()) {
        int num_tokens_predicted = json_value(result, "tokens_predicted", 0);
        int num_prompt_tokens    = json_value(result, "tokens_evaluated", 0);
        ret.push_back({"usage", json {
            {"completion_tokens", num_tokens_predicted},
            {"prompt_tokens",     num_prompt_tokens},
            {"total_tokens",      num_tokens_predicted + num_prompt_tokens}
        }});
    }

    return std::vector<json>({ret});
}