From 6e137abe56e8a3b15626e9a4505ccce547182887 Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sat, 17 Jun 2023 16:42:23 +0200
Subject: [PATCH 1/7] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ae56b6d1a..52b7ed300 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,8 @@ llama.cpp modification to run Falcon (work in progress)
 
 Status:
 * Quantization works except for Q_K_ types
-* CUDA not yet functional
+* CUDA not yet functional
+* context size calculation not proper (cuda as well as cpu)
 
 It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second
 

From c72bc02695f85957496b9ef110f75973be7d491e Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sat, 17 Jun 2023 16:51:34 +0200
Subject: [PATCH 2/7] Update README.md

---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 52b7ed300..be2eefeba 100644
--- a/README.md
+++ b/README.md
@@ -3,8 +3,7 @@ llama.cpp modification to run Falcon (work in progress)
 Status:
 * Quantization works except for Q_K_ types
 * CUDA not yet functional
-* context size calculation not proper (cuda as well as cpu)
-
+* python conversion script is very basic (produces ggml v0)
 It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second
 
 CPU inference examples:

From f89c7592ebfbb473d8cebe34b730f2912b67a29d Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sat, 17 Jun 2023 18:57:40 +0200
Subject: [PATCH 3/7] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index be2eefeba..e122703e5 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,10 @@
 llama.cpp modification to run Falcon (work in progress)
 
-Status:
+Status/Bugs:
 * Quantization works except for Q_K_ types
 * CUDA not yet functional
 * python conversion script is very basic (produces ggml v0)
+* On Linux, a Q5_1 7B user reports a context memory issue during batch token ingestion; with -b 1 it is gone. Not reproduced on Windows
 It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second
 
 CPU inference examples:

From cbb31807a38485a27aba73657ec01b358e491b34 Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sat, 17 Jun 2023 21:34:24 +0200
Subject: [PATCH 4/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e122703e5..4e8867a1d 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 llama.cpp modification to run Falcon (work in progress)
 
 Status/Bugs:
-* Quantization works except for Q_K_ types
+* Quantization with QK_ types appears to fail on 7B models. (Q_ works on both, QK_ works on 40B)
 * CUDA not yet functional
 * python conversion script is very basic (produces ggml v0)
 * On Linux, a Q5_1 7B user reports a context memory issue during batch token ingestion; with -b 1 it is gone. Not reproduced on Windows

From 80f654631e3d8e28cc58eeccbf8874b685a74a81 Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sun, 18 Jun 2023 05:57:19 +0200
Subject: [PATCH 5/7] Update README.md

---
 README.md | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4e8867a1d..ad590cf80 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,30 @@ llama.cpp modification to run Falcon (work in progress)
 
 Status/Bugs:
 * Quantization with QK_ types appears to fail on 7B models. (Q_ works on both, QK_ works on 40B)
-* CUDA not yet functional
+* CUDA-integration branch demo ready
 * python conversion script is very basic (produces ggml v0)
 * On Linux, a Q5_1 7B user reports a context memory issue during batch token ingestion; with -b 1 it is gone. Not reproduced on Windows
+
+CUDA (cuda-integration branch, not in master yet):
+Only some tensors are supported currently, and only the mul_mat operation
+q3_k timing on 3090 of Falcon 40B:
+falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
+falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
+
+q4_k timing on 3090 of Falcon 40B (partial offload):
+falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
+falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)
+
+q4_1 timing on 3090 of Falcon 7B:
+falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
+falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)
+
+
+CUDA sidenote:
+1) use one thread fewer than you have physical processor cores
+2) If it is too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at the first inference
+
 It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second
 
 CPU inference examples:
 ```

From 72f358150c8d9b311c7e345d2f2aa50291fa4bad Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sun, 18 Jun 2023 16:37:21 +0200
Subject: [PATCH 6/7] Update README.md

---
 README.md | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index ad590cf80..06ac9f815 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,31 @@
 llama.cpp modification to run Falcon (work in progress)
 
-Status/Bugs:
-* Quantization with QK_ types appears to fail on 7B models. (Q_ works on both, QK_ works on 40B)
-* CUDA-integration branch demo ready
-* python conversion script is very basic (produces ggml v0)
-* On Linux, a Q5_1 7B user reports a context memory issue during batch token ingestion; with -b 1 it is gone. Not reproduced on Windows
+**The Bloke features well-known fine-tuned variants with quantization:**
+https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
+https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
 
-CUDA (cuda-integration branch, not in master yet):
-Only some tensors are supported currently, and only the mul_mat operation
+
+**The official HF models are here:**
+https://huggingface.co/tiiuae/falcon-40b/
+https://huggingface.co/tiiuae/falcon-7b/
+https://huggingface.co/tiiuae/falcon-40b-instruct
+https://huggingface.co/tiiuae/falcon-7b-instruct
+
+**Conversion:**
+1) use falcon_convert_demo.py to produce a GGMLv0 binary from HF - not recommended to be used directly
+2) use examples/falcon_quantize to convert these into GGMLv3 binaries of your choice, including mmap support from there on
+_Important: The Falcon 7B model features tensor sizes that do not support K-type quantizers - use the traditional quantization for those_
+
+**Status/Bugs:**
+* CUDA-integration branch demo ready
+* python conversion script is very basic (produces ggml v0)
+* On Linux, a Q5_1 7B user reports a context memory issue during batch token ingestion; with -b 1 it is gone. Not reproduced on Windows
+* VRAM scratch/overhead calculation on CUDA can fail - if GPU RAM fills to 100%, manually reduce the number of --ngl layers until it fits
+
+
+
+**CUDA (cuda-integration branch, not in master yet):**
+Only some tensors are supported currently, and only the mul_mat operation
 q3_k timing on 3090 of Falcon 40B:
 falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
 falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)

From 3984d36542acfa35c7e3c50ba1b29a53f6c00b8d Mon Sep 17 00:00:00 2001
From: Alexey Parfenov
Date: Sun, 18 Jun 2023 11:10:24 -0700
Subject: [PATCH 7/7] Fixes typo

---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 2a1ed8c7e..512ff12f3 100644
--- a/Makefile
+++ b/Makefile
@@ -258,7 +258,7 @@ libfalcon.o: libfalcon.cpp ggml.h ggml-cuda.h libfalcon.h llama-util.h
 common.o: examples/common.cpp examples/common.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 
-falcom_common.o: examples/falcon_common.cpp examples/falcon_common.h
+falcon_common.o: examples/falcon_common.cpp examples/falcon_common.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 
 libllama.so: llama.o ggml.o $(OBJS)
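
As a rough illustration of the two-step conversion described in PATCH 6/7: a minimal sketch, assuming a llama.cpp-style argument order for falcon_convert_demo.py and falcon_quantize; both invocations and all paths below are placeholder assumptions, not verified command lines (check each tool's usage output).

```
# Step 1 (assumed invocation): convert the HF checkpoint into a GGMLv0 file
python falcon_convert_demo.py /path/to/falcon-7b /path/to/ggml-out

# Step 2 (assumed invocation): requantize the GGMLv0 file into a GGMLv3 type of your choice.
# Falcon 7B tensor shapes do not support the K-type quantizers, so use a
# traditional type such as q5_1 for the 7B model.
./falcon_quantize /path/to/ggml-out/falcon-7b-v0.bin /path/to/falcon-7b-q5_1.bin q5_1
```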