
llama.cpp

This repo is cloned from llama.cpp at commit 74d73dc85cc2057446bf63cc37ff649ae7cebd80 and is compatible with llama-cpp-python at commit 7ecdd944624cbd49e4af0a5ce1aa402607d58dcc.

Customize quantization group size at compilation (CPU inference only)

The only difference from a standard llama.cpp build is passing the -DQK4_0 flag to cmake. QK4_0 is the q4_0 quantization group size, i.e. the number of weights that share a single scale factor (upstream llama.cpp uses 32).

cmake -B build_cpu_g128 -DQK4_0=128
cmake --build build_cpu_g128
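
As an optional sanity check, you can confirm that the value was recorded in the build directory's CMake cache (the cache entry is created by the -D flag above):

grep QK4_0 build_cpu_g128/CMakeCache.txt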

To quantize a model with the customized group size, run:

./build_cpu_g128/bin/llama-quantize <model_path.gguf> <quantization_type>
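
For example, with placeholder model paths (Q4_0 is the quantization type whose group size QK4_0 controls; the output path argument is optional):

./build_cpu_g128/bin/llama-quantize models/model-f16.gguf models/model-q4_0-g128.gguf Q4_0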

Then run the quantized model with:

./build_cpu_g128/bin/llama-cli -m <quantized_model_path.gguf>
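
For example, reusing the placeholder path from above with a short test prompt:

./build_cpu_g128/bin/llama-cli -m models/model-q4_0-g128.gguf -p "Hello" -n 64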

Note:

Make sure the model you run was quantized with the same group size that the binary was compiled with; otherwise you will get a runtime error when loading the model.
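
A rough sketch of why mismatched builds fail, assuming the standard q4_0 block layout of one 2-byte fp16 scale plus 4 bits per weight per block:

# bytes per q4_0 block = 2 + QK4_0 / 2
echo $(( 2 + 32 / 2 ))    # group size 32  -> 18 bytes per 32 weights
echo $(( 2 + 128 / 2 ))   # group size 128 -> 66 bytes per 128 weights

Because the on-disk tensor sizes differ between the two layouts, a binary compiled for one group size rejects a file quantized with the other when it loads the model.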