llama.cpp modification to run Falcon (work in progress)

Status/Bugs:

  • Quantization works except for Q_K_ types
  • CUDA not yet functional
  • The Python conversion script is very basic (produces ggml v0)
  • On Linux, a user reports a context-memory issue during batched token ingestion with the Q5_1 7B model; it disappears with -b 1 (see the example below). Not reproduced on Windows
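
A minimal illustration of the reported workaround, assuming a Linux build exposes the same falcon_main flags as the Windows example further below (thread count, model path and prompt are placeholders):

  ./falcon_main -t 8 -m /path/to/falcon-7b/q5_1 -p "Love relates to hate like" -n 50 -b 1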

It appears that Q5 Falcon 40B inference on the CPU runs at about 2 tokens/second, matching A100 fp16 inference speed.
CPU inference examples:

 Q:\ggllm.cpp> .\build\bin\Release\falcon_main.exe -t 31 -m Q:\models\falcon-40b\q5_1 -p "Love relates to hate like" -n 50 -ngl 0
main: build = 677 (dd3d346)
main: seed  = 1687010794
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from Q:\models\falcon-40b\q5_1
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: version      = 40
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 29929.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: mem required  = 33513.70 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 layers to GPU
falcon_model_load_internal: total VRAM used: 512 MB
...................................................................................................
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0


Love relates to hate like light relates to darkness.
Love is the strongest thing in the world, but hate is the second strongest force.
Love is a force multiplier.
For every moment of love, there is a parallel moment of hate.
You cant
falcon_print_timings:        load time =  4420.23 ms
falcon_print_timings:      sample time =    11.34 ms /    50 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   785.42 ms /     5 tokens (  157.08 ms per token)
falcon_print_timings:        eval time = 27512.23 ms /    49 runs   (  561.47 ms per token)
falcon_print_timings:       total time = 28315.91 ms
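
For reference, the ~2 tokens/second figure above follows directly from the eval timing, and the reported kv self size is consistent with f32 K/V entries at these hyperparameters. A quick sanity check using the numbers from the log (the KV formula is the usual llama.cpp-style estimate, stated here as an assumption rather than taken from this repository's code):

  # Throughput: 561.47 ms per generated token
  ms_per_token = 561.47
  print(f"~{1000.0 / ms_per_token:.2f} tokens/s")   # ~1.78 tokens/s, i.e. roughly 2 tok/s

  # KV cache: n_layer=60, n_ctx=512, n_head_kv=8, head_dim = n_embd / n_head = 8192 / 128 = 64
  n_layer, n_ctx, n_head_kv, head_dim = 60, 512, 8, 8192 // 128
  kv_bytes = 2 * n_layer * n_ctx * n_head_kv * head_dim * 4   # K and V tensors, 4 bytes per f32 value
  print(f"{kv_bytes / (1024 * 1024):.2f} MB")       # 120.00 MB, matching "kv self size = 120.00 MB"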

Below are Falcon 7B tests: Q5_1 is working and comes with ggml v3 as a bonus (mmap support).

falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_print_timings:        load time =   952.24 ms
falcon_print_timings:      sample time =    67.91 ms /   300 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   370.94 ms /    14 tokens (   26.50 ms per token)
falcon_print_timings:        eval time = 50367.68 ms /   299 runs   (  168.45 ms per token)

Q4_1 is working as well

falcon_print_timings:        load time =   864.40 ms
falcon_print_timings:      sample time =    22.68 ms /   100 runs   (    0.23 ms per token)
falcon_print_timings: prompt eval time =   287.00 ms /    14 tokens (   20.50 ms per token)
falcon_print_timings:        eval time = 12233.39 ms /    99 runs   (  123.57 ms per token)
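
Applying the same ms-per-token conversion as above, the 7B runs come out to roughly 5.9 tokens/s for Q5_1 (168.45 ms/token) and 8.1 tokens/s for Q4_1 (123.57 ms/token).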

Q_K_*: not working (no segfaults anymore; it looks like an error in the QKV handling, as the output is garbage).