tokenizer : BPE fixes (#7530)

* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t

This commit is contained in:

jaime-m-p

2024-06-18 18:40:52 +02:00

• committed by

GitHub

parent 91c188d6c2

commit 37bef89433

No known key found for this signature in database

GPG key ID: B5690EEEBB952194

5 changed files with 1283 additions and 1053 deletions

1652

unicode-data.cpp

View file

File diff suppressed because it is too large Load diff

Rows
Columns

tokenizer : BPE fixes (#7530)

1652 unicode-data.cpp View file

1652

unicode-data.cpp

View file