* Fix für #2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fixing wrong fix.
* Make console usage platform specific
Work on compiler warnings.
* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.
* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1
* Simplify logic
* Add missing change...
* Fix ugly compiler warning
* llama_tokenize should accept strings containing NUL now
* Adding huichen's test case
* add placeholder of starcoder in gguf / llama.cpp
* support convert starcoder weights to gguf
* convert MQA to MHA
* fix ffn_down name
* add LLM_ARCH_STARCODER to llama.cpp
* set head_count_kv = 1
* load starcoder weight
* add max_position_embeddings
* set n_positions to max_positioin_embeddings
* properly load all starcoder params
* fix head count kv
* fix comments
* fix vram calculation for starcoder
* store mqa directly
* add input embeddings handling
* add TBD
* working in cpu, metal buggy
* cleanup useless code
* metal : fix out-of-bounds access in soft_max kernels
* llama : make starcoder graph build more consistent with others
* refactor: cleanup comments a bit
* add other starcoder models: 3B, 7B, 15B
* support-mqa-directly
* fix: remove max_position_embeddings, use n_train_ctx
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: switch to space from tab
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : relax conditions on fast matrix multiplication kernel
* metal : revert the concurrnecy change because it was wrong
* llama : remove experimental stuff
* add freebsd to ci
* bump actions/checkout to v3
* bump cuda 12.1.0 -> 12.2.0
* bump Jimver/cuda-toolkit version
* unify and simplify "Copy and pack Cuda runtime"
* install only necessary cuda sub packages
* Keep static libs and headers with install
* Add logic to generate Config package
* Use proper build info
* Add llama as import library
* Prefix target with package name
* Add example project using CMake package
* Update README
* Update README
* Remove trailing whitespace
Enables the GPU enabled container images to be built and pushed
alongside the CPU containers.
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.
ggml_vec_mad_f32_unroll will internally loop over x and v with same y.
GGML_VEC_MAD_UNROLL is by default defined to 32.
This value is empirical optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives 23% performance boost for out_prod.
Full measurements of out-prod runtime in ms:
unroll_xv unroll_yv
1 67014.643 87826.469
2 77117.552 89077.656
4 72091.311 109121.657
8 61077.543 88678.334
16 56914.67 79514.947
24 59024.595 84350.254
28 55952.446 83368.73
32 51476.658 85177.745
36 55973.792 84659.92
40 55139.616 93844.738
48 60736.392 93330.267
64 99856.878 116994.99
Second column is when unrollying yv instead of xv