llama.cpp modification to run Falcon (work in progress)

TheBloke provides well-known fine-tuned variants with quantization:
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML

The official HF models are here:
https://huggingface.co/tiiuae/falcon-40b/ https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct

Conversion:

  1. use falcon_convert_demo.py to produce a GGMLv0 binary from HF - not recommended to be used directly
  2. use examples/falcon_quantize to convert these into GGMLv3 binaries of your choice, with mmap support from there on (see the sketch below)
    Important: the Falcon 7B model features tensor sizes that do not support the K-type quantizers - use the traditional quantizers (such as Q4_1 or Q5_1) for those
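
A minimal sketch of the two-step conversion. The exact arguments of falcon_convert_demo.py and falcon_quantize are assumptions here (check the script source and the tool's usage output for the real interface), and all file names are placeholders:

# step 1 (assumed usage): HF checkpoint directory -> GGMLv0 file
python falcon_convert_demo.py <hf-model-dir> <output-dir>
# step 2 (assumed usage): GGMLv0 file -> quantized GGMLv3 file of your choice; no K-quants for the 7B model
./build/bin/falcon_quantize <ggml-v0-file> <quantized-output-file> q5_1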

Status/Bugs:

  • CUDA integration branch: demo ready
  • python conversion script is very basic (produces ggml v0)
  • On Linux, a user reports a batch token ingestion context memory issue with the Q5_1 7B model; with -b 1 it's gone. Not reproduced on Windows
  • The VRAM scratch/overhead calculation on CUDA can fail - if GPU RAM fills to 100%, manually reduce --ngl until it fits

How to compile:

1) Recommended with cmake (change the LLAMA_CUBLAS flag below to 0 to disable CUDA requirements and support)
git clone
cd ggllm.cpp
rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release
# find binaries in ./bin
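
Following the note above, a CPU-only build uses the same steps with the flag flipped (a minimal variation, assuming the default configuration needs nothing else):

cmake -DLLAMA_CUBLAS=0 ..
cmake --build . --config Release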


2) Installing on WSL (Windows Subsystem for Linux)
# I am getting slightly better timings on WSL than on native Windows
# Use --no-mmap in WSL OR copy the model into a native directory (not /mnt/), or loading will get stuck (thanks @nauful)
# Choose a current distro:
wsl.exe --list --online
wsl --install -d distro
# cmake 3.16 and the CUDA toolkit are required
# If you run an old distro you can upgrade it (e.g. apt update; apt upgrade; apt full-upgrade; pico /etc/apt/sources.list; apt update; apt upgrade; apt full-upgrade; apt autoremove; lsb_release -a), then wsl --shutdown and restart it
# install cuda WSL toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update; apt-get -y install cuda
# you might need to add it to your path:
export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-12.1/bin:$PATH"
# now start with a fresh cmake and all should work 
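
The exports above only last for the current shell. A common way to persist them across WSL sessions (a general shell technique, not specific to this repo) is to append them to ~/.bashrc:

echo 'export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
echo 'export PATH="/usr/local/cuda-12.1/bin:$PATH"' >> ~/.bashrc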

CUDA:
Currently only some tensors and only the mul_mat operation are supported
q3_k timing of Falcon 40B on a 3090:
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)

q4_k timing of Falcon 40B on a 3090 (partial offload):
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)

q4_1 timing of Falcon 7B on a 3090:
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)

CUDA sidenote:

  1. use one thread less than you have physical processor cores
  2. If it's too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory does not saturate fully at first inference (see the example below)
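
For example, on a hypothetical machine with 16 physical cores and a card that cannot hold the full model, a launch following both points might look like this (model path and layer count are purely illustrative; the flags are the same ones used in the logs below):

./build/bin/falcon_main -t 15 -m /path/to/falcon-40b/q5_1 -p "Love relates to hate like" -n 50 -ngl 40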

It appears that Q5 Falcon 40B inference on CPU is about as fast as A100 fp16 inference at 2 tk/second (the CPU-only log below shows ~561 ms per token, i.e. roughly 1.8 tokens/second).
CPU inference examples:

 Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed  = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: version      = 40
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required  = 33513.70 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0


Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You cant
falcon_print_timings:        load time =  4420.23 ms
falcon_print_timings:      sample time =    11.34 ms /    50 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   785.42 ms /     5 tokens (  157.08 ms per token)
falcon_print_timings:        eval time = 27512.23 ms /    49 runs   (  561.47 ms per token)
falcon_print_timings:       total time = 28315.91 ms

Below are Falcon 7B tests: Q5_1 is working and comes with ggml v3 as a bonus (mmap support)

falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_print_timings:        load time =   952.24 ms
falcon_print_timings:      sample time =    67.91 ms /   300 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   370.94 ms /    14 tokens (   26.50 ms per token)
falcon_print_timings:        eval time = 50367.68 ms /   299 runs   (  168.45 ms per token)

Q4_1 is working as well

falcon_print_timings:        load time =   864.40 ms
falcon_print_timings:      sample time =    22.68 ms /   100 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   287.00 ms /    14 tokens (   20.50 ms per token)
falcon_print_timings:        eval time = 12233.39 ms /    99 runs   (  123.57 ms per token)

Q_K_*: not working (no segfaults anymore; looks like an error in the qkv handling, as it's outputting garbage)