llama.cpp modification to run Falcon (work in progress)

Status: Quantization works except for the Q_K_ types; CUDA is not yet functional.

It appears that Q5_1 Falcon 40B inference on CPU runs at roughly 2 tokens/second, about as fast as fp16 inference on an A100.

```
Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed  = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: version      = 40
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required  = 33513.70 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0


Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You cant
falcon_print_timings:        load time =  4420.23 ms
falcon_print_timings:      sample time =    11.34 ms /    50 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   785.42 ms /     5 tokens (  157.08 ms per token)
falcon_print_timings:        eval time = 27512.23 ms /    49 runs   (  561.47 ms per token)
falcon_print_timings:       total time = 28315.91 ms
```
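As a rough sanity check on the ~2 tokens/second figure, the per-token times printed by `falcon_print_timings` above can be converted to throughput. This is a minimal sketch; the numbers are copied from the log, not re-measured:

```python
# Rough throughput check derived from the falcon_print_timings output above
# (40B Q5_1, CPU only, -ngl 0). Values are copied from the log.
eval_ms_per_token = 561.47     # generation: "561.47 ms per token"
prompt_ms_per_token = 157.08   # prompt eval: "157.08 ms per token"

print(f"generation:  {1000.0 / eval_ms_per_token:.2f} tokens/s")   # ~1.78 tokens/s
print(f"prompt eval: {1000.0 / prompt_ms_per_token:.2f} tokens/s") # ~6.37 tokens/s
```

So generation runs at about 1.8 tokens/second, consistent with the rough 2 tokens/second claim above.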

Below are Falcon 7B tests. Q5_1 is working and comes with the ggml v3 file format as a bonus (mmap support):

```
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_print_timings:        load time =   952.24 ms
falcon_print_timings:      sample time =    67.91 ms /   300 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   370.94 ms /    14 tokens (   26.50 ms per token)
falcon_print_timings:        eval time = 50367.68 ms /   299 runs   (  168.45 ms per token)
```

Q4_1 is working as well:

```
falcon_print_timings:        load time =   864.40 ms
falcon_print_timings:      sample time =    22.68 ms /   100 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   287.00 ms /    14 tokens (   20.50 ms per token)
falcon_print_timings:        eval time = 12233.39 ms /    99 runs   (  123.57 ms per token)
```
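For comparison, the same conversion applied to the two 7B runs (a rough sketch, using the per-token eval times from the timings above, not a proper benchmark):

```python
# Per-token eval times copied from the Falcon 7B timings above.
runs = {
    "7B Q5_1": 168.45,  # ms per token
    "7B Q4_1": 123.57,  # ms per token
}
for name, ms_per_token in runs.items():
    print(f"{name}: {1000.0 / ms_per_token:.2f} tokens/s")
# 7B Q5_1: 5.94 tokens/s
# 7B Q4_1: 8.09 tokens/s
```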

Q_K_*: not working (no segfaults anymore; it looks like an error in the QKV handling, as it is outputting garbage).