top-k sort speedup
The problem was raised here: https://github.com/ggerganov/llama.cpp/discussions/5073

This patch speeds up any top-k that is smaller than the entire vocabulary. For example, a top-k of 10000 runs 29% faster on my i7 CPU: 0.93 ms/token goes down to 0.72 ms. At a top-k of 20000 the speedup is only 9%. At k >= vocab size the full sort is used as before. The new code should produce results equivalent to the normal sort.

To really solve "large top-k" I see two ways forward:
1) possibly a lower-precision sort
2) a pre-selector that dynamically reduces top-k down to the plausible k candidates and then does the partial sort (sketched after the diff below)

After all, in almost all runs a top-k of 20000 logits likely ignores the lower 19550 anyway due to temperature/top-p settings.
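For context, here is a minimal standalone sketch of the same two-step idea so the numbers are easy to reproduce outside llama.cpp. This is my illustration, not part of the commit: the `token_data` struct, the 32000 vocab size, and the random logits are all assumptions.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Illustrative stand-in for llama.cpp's id/logit candidate entries.
struct token_data {
    int   id;
    float logit;
};

int main() {
    const size_t vocab = 32000; // assumed vocab size
    const int    k     = 10000; // the top-k from the numbers above

    // Fill with random logits so the example is self-contained.
    std::vector<token_data> candidates(vocab);
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 4.0f);
    for (size_t i = 0; i < vocab; ++i) {
        candidates[i] = { (int) i, dist(rng) };
    }

    auto comp = [](const token_data & a, const token_data & b) {
        return a.logit > b.logit; // descending by logit
    };

    // Step 1: quickselect partition, O(n) on average -- the top k
    // candidates end up at the front, in arbitrary order.
    std::nth_element(candidates.begin(), candidates.begin() + k, candidates.end(), comp);
    // Step 2: fully sort only those k winners, O(k log k).
    std::sort(candidates.begin(), candidates.begin() + k, comp);

    printf("best logit: %.3f, k-th logit: %.3f\n",
           candidates[0].logit, candidates[k - 1].logit);
    return 0;
}
```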
parent 6f9939d119
commit 84aa8899fb
1 changed file with 2 additions and 1 deletion
```diff
@@ -8004,7 +8004,8 @@ void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * can
     if (k == (int) candidates->size) {
         std::sort(candidates->data, candidates->data + candidates->size, comp);
     } else {
-        std::partial_sort(candidates->data, candidates->data + k, candidates->data + candidates->size, comp);
+        std::nth_element(candidates->data, candidates->data + k, candidates->data + candidates->size, comp); // separate stack to top-k
+        std::sort(candidates->data, candidates->data + k, comp); // Sort the top-k stack
     }
     candidates->sorted = true;
 }
```
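Why this helps: `std::partial_sort` keeps a heap of k elements and costs roughly O(n log k), while `std::nth_element` is a quickselect with O(n) average cost, so the new path is roughly O(n + k log k). The gap shrinks as k approaches the vocab size, which is consistent with the 29% vs 9% figures above.

The pre-selector from idea 2 could look something like the following sketch. Everything here is hypothetical, not part of this patch: `preselect_k`, the `window` parameter, and its value are made up for illustration.

```cpp
#include <algorithm>
#include <vector>

struct token_data { int id; float logit; };

// Hypothetical pre-selector (idea 2 above): shrink k to the number of
// candidates whose logit lies within `window` of the maximum, on the
// assumption that temperature/top-p would discard the rest anyway.
// `window` is a made-up tuning knob, not something this patch defines.
static int preselect_k(const std::vector<token_data> & candidates, int k, float window = 10.0f) {
    if (candidates.empty()) {
        return 0;
    }
    float max_logit = candidates[0].logit;
    for (const auto & t : candidates) {
        max_logit = std::max(max_logit, t.logit);
    }
    int survivors = 0;
    for (const auto & t : candidates) {
        if (t.logit >= max_logit - window) {
            ++survivors;
        }
    }
    // Never return less than 1, never more than the requested k.
    return std::min(k, std::max(survivors, 1));
}
```

The reduced k would then feed the `nth_element`/`sort` pair from the diff, cutting the k log k term whenever most of the requested logits are unreachable anyway.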