llama.cpp modification to run Falcon (work in progress)

TheBloke provides well-known fine-tuned variants with quantization:
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML

The official HF models are here:
https://huggingface.co/tiiuae/falcon-40b/ https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct

Conversion:

  1. use falcon_convert_demo.py to produce a GGMLv0 binary from HF - not recommended to be used directly
  2. use examples/falcon_quantize to convert these into GGMLv3 binaries of your choice, with mmap support from there on (see the sketch below)
    Important: the Falcon 7B model features tensor sizes that do not support the K-type quantizers - use the traditional quantizers (such as Q4_1 or Q5_1) for those
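
A minimal sketch of the two-step conversion. The exact arguments of falcon_convert_demo.py and falcon_quantize are assumptions here (check the script source and the tool's usage output for the real interface), and all file names are placeholders:

# step 1 (assumed usage): HF checkpoint directory -> GGMLv0 file
python falcon_convert_demo.py <hf-model-dir> <output-dir>
# step 2 (assumed usage): GGMLv0 file -> quantized GGMLv3 file of your choice; no K-quants for the 7B model
./build/bin/falcon_quantize <ggml-v0-file> <quantized-output-file> q5_1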

Status/Bugs:

  • CUDA integration branch: demo ready
  • python conversion script is very basic (produces ggml v0)
  • On Linux, a user reports a batch token ingestion context memory issue with the Q5_1 7B model; with -b 1 it's gone. Not reproduced on Windows
  • The VRAM scratch/overhead calculation on CUDA can fail - if GPU RAM fills to 100%, manually reduce --ngl until it fits

How to compile:

1) Recommended with cmake (change the LLAMA_CUBLAS flag below to 0 to disable CUDA requirements and support)
git clone
cd ggllm.cpp
rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release
# find binaries in ./bin
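
Following the note above, a CPU-only build uses the same steps with the flag flipped (a minimal variation, assuming the default configuration needs nothing else):

cmake -DLLAMA_CUBLAS=0 ..
cmake --build . --config Release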


2) Installing on WSL (Windows Subsystem for Linux)
# I am getting slightly better timings on WSL than on native Windows
# Use --no-mmap in WSL OR copy the model into a native directory (not /mnt/), or loading will get stuck (thanks @nauful)
# Choose a current distro:
wsl.exe --list --online
wsl --install -d distro
# cmake 3.16 and the CUDA toolkit are required
# If you run an old distro you can upgrade it (e.g. apt update; apt upgrade; apt full-upgrade; pico /etc/apt/sources.list; apt update; apt upgrade; apt full-upgrade; apt autoremove; lsb_release -a), then wsl --shutdown and restart it
# install cuda WSL toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update; apt-get -y install cuda
# you might need to add it to your path:
export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-12.1/bin:$PATH"
# now start with a fresh cmake and all should work 
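
The exports above only last for the current shell. A common way to persist them across WSL sessions (a general shell technique, not specific to this repo) is to append them to ~/.bashrc:

echo 'export LD_LIBRARY_PATH="/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
echo 'export PATH="/usr/local/cuda-12.1/bin:$PATH"' >> ~/.bashrc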

CUDA:
Currently only some tensors and only the mul_mat operation are supported
q3_k timing of Falcon 40B on a 3090:
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)

q4_k timing of Falcon 40B on a 3090 (partial offload):
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)

q4_1 timing of Falcon 7B on a 3090:
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)

CUDA sidenote:

  1. use one thread less than you have physical processor cores
  2. If it's too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory does not saturate fully at first inference (see the example below)
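
For example, on a hypothetical machine with 16 physical cores and a card that cannot hold the full model, a launch following both points might look like this (model path and layer count are purely illustrative; the flags are the same ones used in the logs below):

./build/bin/falcon_main -t 15 -m /path/to/falcon-40b/q5_1 -p "Love relates to hate like" -n 50 -ngl 40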

It appears that Q5 Falcon 40B inference on CPU is about as fast as A100 fp16 inference at 2 tk/second (the CPU-only log below shows ~561 ms per token, i.e. roughly 1.8 tokens/second).
CPU inference examples:

 Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed  = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: version      = 40
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required  = 33513.70 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0


Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You cant
falcon_print_timings:        load time =  4420.23 ms
falcon_print_timings:      sample time =    11.34 ms /    50 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   785.42 ms /     5 tokens (  157.08 ms per token)
falcon_print_timings:        eval time = 27512.23 ms /    49 runs   (  561.47 ms per token)
falcon_print_timings:       total time = 28315.91 ms

Below are Falcon 7B tests: Q5_1 is working and comes with ggml v3 as a bonus (mmap support)

falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_print_timings:        load time =   952.24 ms
falcon_print_timings:      sample time =    67.91 ms /   300 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   370.94 ms /    14 tokens (   26.50 ms per token)
falcon_print_timings:        eval time = 50367.68 ms /   299 runs   (  168.45 ms per token)

Q4_1 is working as well

falcon_print_timings:        load time =   864.40 ms
falcon_print_timings:      sample time =    22.68 ms /   100 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   287.00 ms /    14 tokens (   20.50 ms per token)
falcon_print_timings:        eval time = 12233.39 ms /    99 runs   (  123.57 ms per token)

Q_K_*: not working (no segfaults anymore; looks like an error in the qkv handling, as it's outputting garbage)