Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++

## Recent API changes

- [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)
- [Changelog for `llama-server` REST API](https://github.com/ggerganov/llama.cpp/issues/9291)

## Hot topics

- **How to use [MTLResidencySet](https://developer.apple.com/documentation/metal/mtlresidencyset?language=objc) to keep the GPU memory active?** https://github.com/ggerganov/llama.cpp/pull/11427
- **VS Code extension for FIM completions:** https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Introducing GGUF-my-LoRA https://github.com/ggerganov/llama.cpp/discussions/10123
- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggerganov/llama.cpp/discussions/9669
- Hugging Face GGUF editor: [discussion](https://github.com/ggerganov/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)

---

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.

<details>
<summary>Models</summary>

#### Text-only

- [x] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [x] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [x] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [x] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [x] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [x] [BERT](https://github.com/ggerganov/llama.cpp/pull/5423)
- [x] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [x] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [x] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [x] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
- [x] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [x] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
- [x] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [x] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [PhiMoE](https://github.com/ggerganov/llama.cpp/pull/11003)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [OLMo 2](https://allenai.org/olmo)
- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)

#### Multimodal

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)

</details>

<details>
<summary>Bindings</summary>

- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- TypeScript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift: [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
- Swift: [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)

</details>

<details>
<summary>UIs</summary>

_(to have a project listed here, it should clearly state that it depends on `llama.cpp`)_

- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
- [LARS](https://github.com/abgulati/LARS) (AGPL)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [MindMac](https://mindmac.app) (proprietary)
- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
- [nat/openplayground](https://github.com/nat/openplayground) (MIT)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
- [ollama/ollama](https://github.com/ollama/ollama) (MIT)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [ramalama](https://github.com/containers/ramalama) (MIT)
- [semperai/amica](https://github.com/semperai/amica) (MIT)
- [withcatai/catai](https://github.com/withcatai/catai) (MIT)
- [nerve-sparks/iris_android](https://github.com/nerve-sparks/iris_android) ([Play Store](https://play.google.com/store/apps/details?id=com.nervesparks.irisGPT)) (Apache-2.0)

</details>

<details>
<summary>Tools</summary>

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) – review/check the GGUF file and estimate the memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)

</details>

<details>
<summary>Infrastructure</summary>

- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
- [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end-to-end LLM deployment at any scale

</details>

<details>
<summary>Games</summary>

- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.

</details>

## Supported backends

| Backend | Target devices |
| ---------------------------------- | --------------------- |
| [Metal](docs/build.md#metal-build) | Apple Silicon |
| [BLAS](docs/build.md#blas-build) | All |
| [BLIS](docs/backend/BLIS.md) | All |
| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [MUSA](docs/build.md#musa) | Moore Threads MTT GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [HIP](docs/build.md#hip) | AMD GPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |

## Building the project

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](include/llama.h).

The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server. Possible methods for obtaining the binaries:

- Clone this repository and build locally, see [how to build](docs/build.md) (a build sketch follows below)
- On macOS or Linux, install `llama.cpp` via [brew, flox or nix](docs/install.md)
- Use a Docker image, see [documentation for Docker](docs/docker.md)
- Download pre-built binaries from [releases](https://github.com/ggerganov/llama.cpp/releases)
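
For example, a local CMake build typically looks like this - a minimal sketch; backend-specific options such as `-DGGML_CUDA=ON` are described in [docs/build.md](docs/build.md):

```bash
# clone the repository and build the default (CPU) configuration
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# the resulting tools (llama-cli, llama-server, ...) are placed in build/bin
```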

## Obtaining and quantizing models

The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:

- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from Hugging Face by using this CLI argument: `-hf <user>/<model>[:quant]`
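
For example - the repository name and quant tag below are illustrative placeholders for any GGUF repo on Hugging Face:

```bash
# download (if not cached) and run a model directly from Hugging Face
# <user>/<model>:<quant> is a placeholder - substitute a real GGUF repo
llama-cli -hf some-user/some-model-GGUF:Q4_K_M
```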

After downloading a model, use the CLI tools to run it locally - see below.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:

- Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes
- Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggerganov/llama.cpp/discussions/10123)
- Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF metadata in the browser (more info: https://github.com/ggerganov/llama.cpp/discussions/9268)
- Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggerganov/llama.cpp/discussions/9669)

To learn more about model quantization, [read this documentation](examples/quantize/README.md)
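
Quantization can also be done locally with the `llama-quantize` tool; a minimal sketch (the file names are illustrative):

```bash
# re-quantize a 16-bit GGUF model down to 4-bit (Q4_K_M)
llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```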

## [`llama-cli`](examples/main)

#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.

- <details open>
  <summary>Run in conversation mode</summary>

  Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`

  ```bash
  llama-cli -m model.gguf

  # > what is 1+1?
  # Easy peasy! The answer to 1+1 is... 2!
  ```

  </details>

- <details>
  <summary>Run in conversation mode with custom chat template</summary>

  ```bash
  # use the "chatml" template (use -h to see the list of supported templates)
  llama-cli -m model.gguf -cnv --chat-template chatml

  # use a custom template
  llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
  ```

  </details>

- <details>
  <summary>Run simple text completion</summary>

  To disable conversation mode explicitly, use `-no-cnv`

  ```bash
  llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv

  # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
  ```

  </details>

- <details>
  <summary>Constrain the output with a custom grammar</summary>

  ```bash
  llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
  ```

  For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/

  </details>
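
To get a feel for the GBNF format itself, here is a minimal sketch; the grammar and the file name `yesno.gbnf` are illustrative (see [GBNF grammars](grammars/README.md) for the full syntax):

```bash
# write a tiny grammar that only allows the output "yes" or "no"
cat > yesno.gbnf <<'EOF'
root ::= "yes" | "no"
EOF

llama-cli -m model.gguf -no-cnv --grammar-file yesno.gbnf -p 'Is C++ compiled? Answer:'
```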

## [`llama-server`](examples/server)

#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.

- <details open>
  <summary>Start a local HTTP server with default configuration on port 8080</summary>

  ```bash
  llama-server -m model.gguf --port 8080

  # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  ```

  </details>
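
- <details>
  <summary>Query the server with an OpenAI-style request</summary>

  Since the server is OpenAI API compatible, the chat completion endpoint above can be exercised with plain `curl`; a minimal sketch (the message content is illustrative):

  ```bash
  # POST an OpenAI-style chat request to the local server
  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "messages": [
      {"role": "user", "content": "What is 1+1?"}
    ]
  }'
  ```

  </details>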

- <details>
  <summary>Support multiple-users and parallel decoding</summary>

  ```bash
  # up to 4 concurrent requests, each with 4096 max context
  # (-c sets the total context size, which is split across the -np parallel slots: 16384 / 4 = 4096)
  llama-server -m model.gguf -c 16384 -np 4
  ```

  </details>

- <details>
  <summary>Enable speculative decoding</summary>

  ```bash
  # the draft.gguf model should be a small variant of the target model.gguf
  llama-server -m model.gguf -md draft.gguf
  ```

  </details>

- <details>
  <summary>Serve an embedding model</summary>

  ```bash
  # use the /embedding endpoint
  llama-server -m model.gguf --embedding --pooling cls -ub 8192
  ```

  </details>
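
- <details>
  <summary>Query the embedding endpoint</summary>

  A minimal sketch of calling the `/embedding` endpoint served above, assuming the request body uses a `content` field as documented in the [server README](examples/server/README.md); the input text is illustrative:

  ```bash
  # request an embedding for a piece of text
  curl http://localhost:8080/embedding -H "Content-Type: application/json" -d '{
    "content": "Hello, world!"
  }'
  ```

  </details>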

- <details>
  <summary>Serve a reranking model</summary>

  ```bash
  # use the /reranking endpoint
  llama-server -m model.gguf --reranking
  ```

  </details>
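
- <details>
  <summary>Query the reranking endpoint</summary>

  A minimal sketch of calling the `/reranking` endpoint served above; the `query` and `documents` field names here assume the request shape documented in the [server README](examples/server/README.md), and the texts are illustrative:

  ```bash
  # rank two documents against a query
  curl http://localhost:8080/reranking -H "Content-Type: application/json" -d '{
    "query": "What is a panda?",
    "documents": ["Pandas are bears native to China.", "The Linux kernel is written in C."]
  }'
  ```

  </details>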

- <details>
  <summary>Constrain all outputs with a grammar</summary>

  ```bash
  # custom grammar
  llama-server -m model.gguf --grammar-file grammars/json.gbnf
  ```

  </details>

## [`llama-perplexity`](examples/perplexity)

#### A tool for measuring the perplexity [^1][^2] (and other quality metrics) of a model over a given text.

- <details open>
  <summary>Measure the perplexity over a text file</summary>

  ```bash
  llama-perplexity -m model.gguf -f file.txt

  # Final estimate: PPL = 5.4007 +/- 0.67339
  ```

  </details>

- <details>
  <summary>Measure KL divergence</summary>

  ```bash
  # TODO
  ```

  </details>

[^1]: [examples/perplexity/README.md](examples/perplexity/README.md)
[^2]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)

## [`llama-bench`](examples/llama-bench)

#### Benchmark the performance of the inference for various parameters.

- <details open>
  <summary>Run default benchmark</summary>

  ```bash
  llama-bench -m model.gguf

  # build: 3e0ba0e60 (4229)
  ```

  </details>

## [`llama-run`](examples/run)

#### A comprehensive example for running `llama.cpp` models. Useful for inference. Used with RamaLama [^3].

- <details>
  <summary>Run a model with a specific prompt (by default it's pulled from the Ollama registry)</summary>

  ```bash
  llama-run granite-code
  ```

  </details>

[^3]: [RamaLama](https://github.com/containers/ramalama)

## [`llama-simple`](examples/simple)

#### A minimal example for implementing apps with `llama.cpp`. Useful for developers.

- <details>
  <summary>Basic text completion</summary>

  ```bash
  llama-simple -m model.gguf

  # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
  ```

  </details>

## Contributing

- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Any help with managing issues, PRs and projects is very appreciated!
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)

## Other documentation

- [main (cli)](examples/main/README.md)
- [server](examples/server/README.md)
- [GBNF grammars](grammars/README.md)

#### Development documentation

- [How to build](docs/build.md)
- [Running on Docker](docs/docker.md)
- [Build on Android](docs/android.md)
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)

#### Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

#### References