llama : support batched embeddings (#5466)

* batched embedding: pool outputs by sequence id. updated embedding example
* bring back non-causal attention
* embd : minor improvements
* llama : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-13 06:06:58 -06:00
commit 03bf161eb6
parent ad014bba97
6 changed files with 163 additions and 54 deletions
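
Note on "pool outputs by sequence id": the sketch below is an illustration only, not code from this commit. It assumes mean pooling, and the function name `mean_pool_by_seq` and its arguments are hypothetical. It shows the general idea of collapsing per-token embeddings from a mixed batch into one vector per sequence id.

```python
from collections import defaultdict


def mean_pool_by_seq(token_embeddings, seq_ids):
    """Average token embeddings per sequence id (hypothetical helper).

    token_embeddings: list of equal-length float lists, one per token.
    seq_ids: list of ints, seq_ids[i] is the sequence that token i belongs to.
    Returns a dict mapping each sequence id to its mean embedding.
    """
    if len(token_embeddings) != len(seq_ids):
        raise ValueError("need exactly one sequence id per token")

    sums = {}                  # seq id -> running element-wise sum
    counts = defaultdict(int)  # seq id -> number of tokens seen

    for vec, sid in zip(token_embeddings, seq_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[sid][i] += x
        counts[sid] += 1

    # Divide each accumulated sum by the token count of its sequence.
    return {sid: [s / counts[sid] for s in total] for sid, total in sums.items()}


# Three tokens in one batch: two belong to sequence 0, one to sequence 1.
pooled = mean_pool_by_seq([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 0, 1])
```

Keying the accumulators by sequence id means tokens from different sequences can be interleaved in a single batch and still yield one embedding per sequence.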
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@@ -40,6 +40,7 @@ class Keys:
        TENSOR_DATA_LAYOUT    = "{arch}.tensor_data_layout"
        EXPERT_COUNT          = "{arch}.expert_count"
        EXPERT_USED_COUNT     = "{arch}.expert_used_count"
+        POOLING_LAYER         = "{arch}.pooling_layer"

    class Attention:
        HEAD_COUNT        = "{arch}.attention.head_count"