Commit graph

16 commits

Author SHA1 Message Date
Georgi Gerganov
d999cf65c5
unicode : remove redundant headers 2024-04-26 13:29:48 +03:00
Kazim Abrar Mahi
feeaf4f39c
Added needed functionality, testing remains 2024-04-26 11:43:29 +03:00
Kazim Abrar Mahi
7e308ed212
Adding unicode regex function 2024-04-26 11:43:29 +03:00
Kazim Abrar Mahi
4056dc5b1e
added and refactored unicode_regex_split and related functions 2024-04-26 11:43:28 +03:00
Kazim Abrar Mahi
1c924e4b35
Resolved issues 2024-04-26 11:43:28 +03:00
Kazim Abrar Mahi
54f93eb50b
Moved header files 2024-04-26 11:43:28 +03:00
Kazim Abrar Mahi
d2cfc2225f
Moved regex patterns to unicode.cpp and updated unicode.h 2024-04-26 11:43:28 +03:00
Jaggzh
6fbab2dbc8
merged the changes from deepseeker models to main branch 2024-04-26 11:43:08 +03:00
Jared Van Bortel
32c8486e1f
wpm : portable unicode tolower (#6305)
Also use C locale for ispunct/isspace, and split unicode-data.cpp from unicode.cpp.
2024-03-26 17:46:21 -04:00
Georgi Gerganov
83796e62bc
llama : refactor unicode stuff (#5992)
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass cpts as const ref
2024-03-11 17:47:47 +02:00
Douglas Hanley
9600d59e01
unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* don't construct new locale every time
2024-03-01 11:15:36 +02:00
Douglas Hanley
177628bfd8
llama : improve BERT tokenization (#5740)
* implement nfd for stripping accents in wpm tokenizer

* sort nfd map; reuse iterator

* use builtin tolower

* add locale include

* Simplify to_lower cases

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-28 10:51:11 +02:00
Georgi Gerganov
67fd33132f
unicode : reuse iterator (#5726) 2024-02-26 14:02:12 +02:00
Georgi Gerganov
cf45252a7c
tests : multi-thread the tokenizer tests (#5474)
* tests : multi-thread the tokenizer tests

ggml-ci

* unicode : fix data race for unidentified codepoints

ggml-ci

* unicode : minor style fixes

ggml-ci
2024-02-13 15:14:22 +02:00
bobqianic
6c5629d4d2
add #include <string> to unicode.h (#5051)
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-01-21 10:17:35 -05:00
goerch
ff5a3f0c09
Work on the BPE tokenizer (#3252)
* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full range of Unicodes

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code paths

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski 's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-10-03 09:16:26 +02:00