llama : more tokenizer fixes (#2810)

* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace
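
  A minimal sketch of the caller-side effect (tokenize_prompt is a
  hypothetical helper, not code from this commit; it assumes the
  llama_tokenize signature from llama.h of this period):

      #include <algorithm>
      #include <string>
      #include <vector>

      #include "llama.h"

      // callers now pass the raw prompt - tokenization prepends the
      // whitespace itself, so " " + text is no longer needed
      static std::vector<llama_token> tokenize_prompt(llama_context * ctx, const std::string & text) {
          // upper bound: one token per byte, plus room for BOS
          std::vector<llama_token> tokens(text.size() + 1);
          const int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), /*add_bos=*/true);
          tokens.resize(std::max(n, 0));
          return tokens;
      }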

* llama : distinguish pieces from decoded text + fix detokenization
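
  The distinction: a "piece" is the literal vocab entry of a single token,
  while the decoded text is what the pieces reconstruct once joined. A small
  sketch for inspecting pieces (dump_pieces is hypothetical;
  llama_token_to_piece is the helper used in the diffs below):

      #include <cstdio>
      #include <vector>

      #include "common.h"
      #include "llama.h"

      static void dump_pieces(llama_context * ctx, const std::vector<llama_token> & tokens) {
          for (const llama_token id : tokens) {
              // print each token id next to its raw piece
              printf("%6d -> '%s'\n", id, llama_token_to_piece(ctx, id).c_str());
          }
      }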

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++
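
  A hypothetical shape for one generated case - the Python script bakes the
  expected ids into the C++ test, which re-tokenizes the same input and
  compares (test_case and run_case are illustrative, not the actual test
  harness):

      #include <algorithm>
      #include <cstdio>
      #include <string>
      #include <vector>

      #include "llama.h"

      struct test_case {
          std::string              input;
          std::vector<llama_token> expected;
      };

      static bool run_case(llama_context * ctx, const test_case & tc) {
          std::vector<llama_token> tokens(tc.input.size() + 8);
          const int n = llama_tokenize(ctx, tc.input.c_str(), tokens.data(), (int) tokens.size(), /*add_bos=*/false);
          tokens.resize(std::max(n, 0));

          if (tokens != tc.expected) {
              fprintf(stderr, "tokenization mismatch for input: '%s'\n", tc.input.c_str());
              return false;
          }
          return true;
      }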

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp; the Unicode cases do not pass yet)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE
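
  A sketch of what the temporary split can look like (the SPM/BPE function
  split follows the commit message; the bodies below are an assumption, not
  a verbatim copy of common.cpp):

      #include <string>
      #include <vector>

      #include "common.h"
      #include "llama.h"

      // SPM: tokenization prefixed the input with whitespace, so the
      // leading space of the first piece must be dropped on the way back
      static std::string detokenize_spm(llama_context * ctx, const std::vector<llama_token> & tokens) {
          std::string text;
          for (size_t i = 0; i < tokens.size(); ++i) {
              std::string piece = llama_token_to_piece(ctx, tokens[i]);
              if (i == 0 && !piece.empty() && piece[0] == ' ') {
                  piece = piece.substr(1);
              }
              text += piece;
          }
          return text;
      }

      // BPE: pieces concatenate directly, no space fix-up needed
      static std::string detokenize_bpe(llama_context * ctx, const std::vector<llama_token> & tokens) {
          std::string text;
          for (const llama_token id : tokens) {
              text += llama_token_to_piece(ctx, id);
          }
          return text;
      }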

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
commit edd4c14817
parent 1591e2e590
Author: Georgi Gerganov
Date:   2023-08-27 14:19:19 +03:00 (committed by GitHub)
20 changed files with 671 additions and 224 deletions

@@ -63,7 +63,7 @@ int main(int argc, char ** argv) {
     fprintf(stderr, "\n\n");
 
     for (auto id : tokens_list) {
-        fprintf(stderr, "%s", llama_token_to_str(ctx, id).c_str());
+        fprintf(stderr, "%s", llama_token_to_piece(ctx, id).c_str());
     }
 
     fflush(stderr);
@@ -112,7 +112,7 @@ int main(int argc, char ** argv) {
         }
 
         // print the new token :
-        printf("%s", llama_token_to_str(ctx, new_token_id).c_str());
+        printf("%s", llama_token_to_piece(ctx, new_token_id).c_str());
         fflush(stdout);
 
         // push this new token for next evaluation