
llama.cpp

This repo is cloned from llama.cpp at commit 74d73dc85cc2057446bf63cc37ff649ae7cebd80 and is compatible with llama-cpp-python at commit 7ecdd944624cbd49e4af0a5ce1aa402607d58dcc.

Customize quantization group size at compilation (CPU inference only)

The only difference from a standard llama.cpp build is passing the -DQK4_0 flag to cmake. QK4_0 is the q4_0 quantization group size, i.e. the number of weights that share a single scale factor (upstream llama.cpp uses 32).

cmake -B build_cpu_g128 -DQK4_0=128
cmake --build build_cpu_g128
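
As an optional sanity check, you can confirm that the value was recorded in the build directory's CMake cache (the cache entry is created by the -D flag above):

grep QK4_0 build_cpu_g128/CMakeCache.txt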

To quantize a model with the customized group size, run:

./build_cpu_g128/bin/llama-quantize <model_path.gguf> <quantization_type>
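
For example, with placeholder model paths (Q4_0 is the quantization type whose group size QK4_0 controls; the output path argument is optional):

./build_cpu_g128/bin/llama-quantize models/model-f16.gguf models/model-q4_0-g128.gguf Q4_0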

Then run the quantized model with:

./build_cpu_g128/bin/llama-cli -m <quantized_model_path.gguf>
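
For example, reusing the placeholder path from above with a short test prompt:

./build_cpu_g128/bin/llama-cli -m models/model-q4_0-g128.gguf -p "Hello" -n 64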

Note:

Make sure the model you run was quantized with the same group size that the binary was compiled with; otherwise you will get a runtime error when loading the model.
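
A rough sketch of why mismatched builds fail, assuming the standard q4_0 block layout of one 2-byte fp16 scale plus 4 bits per weight per block:

# bytes per q4_0 block = 2 + QK4_0 / 2
echo $(( 2 + 32 / 2 ))    # group size 32  -> 18 bytes per 32 weights
echo $(( 2 + 128 / 2 ))   # group size 128 -> 66 bytes per 128 weights

Because the on-disk tensor sizes differ between the two layouts, a binary compiled for one group size rejects a file quantized with the other when it loads the model.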