* 3-5% faster Q4_0 on Metal
* 7-25% faster Q4_1 on Metal
* Oops, forgot to delete the original Q4_1 kernel
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Initial implementation
* Remove debug print
* Restore signature of llama_init_from_gpt_params
* Free guidance context
* Make freeing of guidance_ctx conditional
* Make Classifier-Free Guidance a sampling function
* Correct typo. CFG already means context-free grammar.
* Record sampling time in llama_sample_classifier_free_guidance
* Shift all values by the max value before applying logsoftmax
* Fix styling based on review
* This allows LLAMA models that were previously incompatible with K quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g 32001 which means it's not divisible by 256 or 64. Since the problematic dimensions only apply for `tok_embeddings.weight` and `output.weight` (dimentions 4096 x n_vocab), we can simply quantize these layers to Q8_0 whereas the majority of the hidden layers are still K-quanted since they have compatible dimensions.
* Fix indentation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* As an alternative, to avoid failing on Metal due to lack of Q8_0 support, instead quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16. This results in a net gain of about 55mb for a 7B model compared to previous approach, but should minimize adverse impact to model quality.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>