* metal : fix memory leak
* metal : fix encoders memory leak
* metal : clean up more memory resources
* metal : fix more leaks
* metal : reuse dispatch queue + autoreleasepool
* metal : reuse array for command buffers and encoders
* ggml : assert for odd number of blocks on ARM
15M tinyllama is an example
ctx->kv and ctx->infos was reallocated using not-aligned realloc, but freed with aligned free.
to fix this a GGML_ALIGNED_REALLOC was added, but there is no posix_memalign_realloc function.
so on non-windows and non-mingw32 platforms we fall back to aligned malloc, followed by copying
and freeing the old data.
* llama2.c: direct gguf output (WIP)
* Simplify vector building logic
* llama2.c gguf conversion: fix token types in converter
* llama2.c: support copying vocab from a llama gguf model file
* llama2.c: update default path for vocab model + readme
* llama2.c: use defines for gguf keys
* llama2.c: escape whitespaces w/ U+2581 in vocab converter the llama.cpp way
* llama2.c converter: cleanups + take n_ff from config
* Speedup tokenization
On current master it takes ~3.2 seconds to tokenize
Wikitext. With this change it becomes ~525 ms.
* Fixit: it was missing the piece after the last found occurence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make ggml-cuda.cu build with QK_K = 64
Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces
a meaningful result.
* k_quants tuning for Falcon-7b
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files
ggml-ci
* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode)
ggml-ci
* common : temporary separate llama_detokenize calls for SPM and BPE
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>