From 84aa8899fba7bf08682b20ca3c68733579d3e3de Mon Sep 17 00:00:00 2001 From: John <78893154+cmp-nct@users.noreply.github.com> Date: Mon, 22 Jan 2024 20:46:13 +0100 Subject: [PATCH] top-k sort speedup The problem was raised here: https://github.com/ggerganov/llama.cpp/discussions/5073 This patch speeds up any top-k that is smaller than the entire vocab; for example, a top-k of 10000 runs 29% faster on my i7 CPU: 0.93 ms/token goes down to 0.72 ms/token. At a top-k of 20000 the speedup is only 9%. At top-k >= vocab the full sort is used. The new code should be equivalent to the normal sort. To really solve "large top-k" I see two ways forward: 1) possibly lower-precision sorting 2) a pre-selector that dynamically reduces top-k down to the k potential candidates and then does the partial sort. After all, in almost all runs a top-k of 20000 logits likely ignores the lower 19550 due to temperature/top-p settings. --- llama.cpp | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/llama.cpp b/llama.cpp index 8c906a22f..02028b900 100644 --- a/llama.cpp +++ b/llama.cpp @@ -8004,7 +8004,8 @@ void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * can if (k == (int) candidates->size) { std::sort(candidates->data, candidates->data + candidates->size, comp); } else { - std::partial_sort(candidates->data, candidates->data + k, candidates->data + candidates->size, comp); + std::nth_element(candidates->data, candidates->data + k, candidates->data + candidates->size, comp); // separate stack to top-k + std::sort(candidates->data, candidates->data + k, comp); // Sort the top-k stack } candidates->sorted = true; }