command-r : add BPE pre-tokenization (#7063)

* Add BPE pre-tokenization for Command-R/R+.

* Bump transformers convert requirement.

* command-r : add individual digits regex
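
For illustration only, the "individual digits" idea can be sketched as a splitter that emits every digit as its own piece before further tokenization. This is a minimal sketch under stated assumptions, not the code from this commit: `split_individual_digits` is a hypothetical helper, it uses `std::regex`, and it handles only ASCII digits, since `std::regex` has no Unicode number class.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of llama.cpp): splits text so that every
// ASCII digit becomes its own piece, while runs of non-digit characters
// stay together as a single piece.
std::vector<std::string> split_individual_digits(const std::string & text) {
    // Alternation: one digit, or one or more non-digit characters.
    static const std::regex pattern("[0-9]|[^0-9]+");

    std::vector<std::string> pieces;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}
```

With this sketch, a number such as `2024` is split into four one-character pieces, so a later BPE merge step never sees a multi-digit run as a single unit.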

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
DAN™ 2024-05-05 01:19:30 -04:00 committed by GitHub
parent 6fbd432211
commit 889bdd7686
9 changed files with 168 additions and 1 deletions

@@ -80,6 +80,7 @@ extern "C" {
         LLAMA_VOCAB_PRE_TYPE_STARCODER = 6,
         LLAMA_VOCAB_PRE_TYPE_GPT2 = 7,
         LLAMA_VOCAB_PRE_TYPE_REFACT = 8,
+        LLAMA_VOCAB_PRE_TYPE_COMMAND_R = 9,
     };
     // note: these values should be synchronized with ggml_rope