was reverted on cuda merge

This commit is contained in:
John 2023-06-19 13:43:12 +02:00
parent c5399d1cf7
commit eb22d7e504


@@ -1,10 +1,48 @@
llama.cpp modification to run Falcon (work in progress)
**TheBloke provides well-known fine-tuned variants with quantization:**
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
**The official HF models are here:**
https://huggingface.co/tiiuae/falcon-40b/
https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct
**Conversion:**
1) Use falcon_convert_demo.py to produce a GGMLv0 binary from the HF weights; the intermediate file is not recommended for direct use.
2) Use examples/falcon_quantize to convert it into a GGMLv3 binary of your chosen quantization type, with mmap support from there on (see the sketch after this list).
_Important: The Falcon 7B model has tensor sizes that are not compatible with K-type quantizers; use the traditional quantization types for it._
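A minimal sketch of the two-step conversion, assuming the quantization tool builds as a falcon_quantize binary, that falcon_convert_demo.py takes the HF model directory, an output directory and an output type as positional arguments, and that the intermediate file is named ggml-model-f16.bin; the paths, argument order and file names here are assumptions, not verified usage:

```bash
# Step 1: HF weights -> intermediate GGMLv0 file (arguments are assumed)
python3 falcon_convert_demo.py ~/models/falcon-7b-instruct ~/models/ggml 1

# Step 2: GGMLv0 -> GGMLv3 with the chosen quantization type (Q4_1 here);
# avoid K-type quants such as Q3_K for the 7B model, as noted above
./falcon_quantize ~/models/ggml/ggml-model-f16.bin ~/models/falcon-7b-q4_1.bin q4_1
```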
**Status/Bugs:**
* CUDA-integration branch demo ready
* The Python conversion script is very basic (produces GGML v0)
* On Linux, a user reports a context-memory issue during batched token ingestion with the Q5_1 7B model; with -b 1 it is gone. Not reproduced on Windows (see the sketch after this list)
* The VRAM scratch/overhead calculation on CUDA can fail; if GPU memory fills to 100%, manually reduce --ngl until the model fits
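A minimal sketch of the -b 1 workaround, assuming the main example builds as a falcon_main binary and keeps llama.cpp-style flags (-m for the model, -n for the token count, -p for the prompt); the binary name and every flag except -b are assumptions carried over from the parent project:

```bash
# Force batch size 1 to avoid the reported Linux Q5_1 7B batch-ingestion issue
# (falcon_main and the -m/-n/-p flags are assumed llama.cpp-style conventions)
./falcon_main -m ~/models/falcon-7b-q5_1.bin -b 1 -n 64 -p "Write a haiku about GPUs."
```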
**CUDA (cuda-integration branch, not in master yet):**
Only some tensors and only the mul_mat operation are supported at the moment.
q3_k timing on 3090 of Falcon 40B:
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
q4_k timing on 3090 of Falcon 40B (partial offload):
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)
q4_1 timing on 3090 of Falcon 7B:
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)
**CUDA sidenote:**
1) Use one thread fewer than your number of physical processor cores.
2) If inference is too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at the first inference (see the sketch below).
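A minimal sketch of both sidenotes combined, assuming a machine with 8 physical cores, the falcon_main binary name, and llama.cpp-style -t/-m/-n/-p flags; apart from --ngl, which is quoted from the note above, these names and flags are assumptions:

```bash
# 8 physical cores -> run with 7 threads; start --ngl low and raise it only
# while GPU memory does not saturate fully at the first inference
./falcon_main -m ~/models/falcon-40b-q4_k.bin -t 7 --ngl 40 \
    -n 128 -p "Explain quantization in one paragraph."
```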
It appears that Q5 Falcon 40B inference on CPU is about as fast as A100 fp16 inference, at roughly 2 tokens/second (around 500 ms per token).
**CPU inference examples:**