llama.cpp modification to run Falcon (work in progress)

Status/Bugs:

  • Quantization works except for Q_K_ types
  • CUDA not yet functional
  • The Python conversion script is very basic (produces ggml v0)
  • On Linux, a user reports a context-memory issue during batched token ingestion with the Q5_1 7B model; it disappears with -b 1 (see the example below). Not reproduced on Windows
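
A minimal illustration of the reported workaround, assuming a Linux build exposes the same falcon_main flags as the Windows example further below (thread count, model path and prompt are placeholders):

  ./falcon_main -t 8 -m /path/to/falcon-7b/q5_1 -p "Love relates to hate like" -n 50 -b 1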

It appears that Q5 Falcon 40B inference on the CPU runs at about 2 tokens/second, matching A100 fp16 inference speed.
CPU inference examples:

 Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed  = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: version      = 40
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required  = 33513.70 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0


Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You cant
falcon_print_timings:        load time =  4420.23 ms
falcon_print_timings:      sample time =    11.34 ms /    50 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   785.42 ms /     5 tokens (  157.08 ms per token)
falcon_print_timings:        eval time = 27512.23 ms /    49 runs   (  561.47 ms per token)
falcon_print_timings:       total time = 28315.91 ms
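
For reference, the ~2 tokens/second figure above follows directly from the eval timing, and the reported kv self size is consistent with f32 K/V entries at these hyperparameters. A quick sanity check using the numbers from the log (the KV formula is the usual llama.cpp-style estimate, stated here as an assumption rather than taken from this repository's code):

  # Throughput: 561.47 ms per generated token
  ms_per_token = 561.47
  print(f"~{1000.0 / ms_per_token:.2f} tokens/s")   # ~1.78 tokens/s, i.e. roughly 2 tok/s

  # KV cache: n_layer=60, n_ctx=512, n_head_kv=8, head_dim = n_embd / n_head = 8192 / 128 = 64
  n_layer, n_ctx, n_head_kv, head_dim = 60, 512, 8, 8192 // 128
  kv_bytes = 2 * n_layer * n_ctx * n_head_kv * head_dim * 4   # K and V tensors, 4 bytes per f32 value
  print(f"{kv_bytes / (1024 * 1024):.2f} MB")       # 120.00 MB, matching "kv self size = 120.00 MB"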

Below are Falcon 7B tests: Q5_1 is working and comes with ggml v3 as a bonus (mmap support).

falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_print_timings:        load time =   952.24 ms
falcon_print_timings:      sample time =    67.91 ms /   300 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   370.94 ms /    14 tokens (   26.50 ms per token)
falcon_print_timings:        eval time = 50367.68 ms /   299 runs   (  168.45 ms per token)

Q4_1 is working as well

falcon_print_timings:        load time =   864.40 ms
falcon_print_timings:      sample time =    22.68 ms /   100 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   287.00 ms /    14 tokens (   20.50 ms per token)
falcon_print_timings:        eval time = 12233.39 ms /    99 runs   (  123.57 ms per token)
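
Applying the same ms-per-token conversion as above, the 7B runs come out to roughly 5.9 tokens/s for Q5_1 (168.45 ms/token) and 8.1 tokens/s for Q4_1 (123.57 ms/token).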

Q_K_*: not working (no segfaults anymore; it looks like an error in the QKV handling, as the output is garbage).