llama.cpp modification to run Falcon (work in progress)

**TheBloke provides well-known fine-tuned variants with quantization:**

https://huggingface.co/TheBloke/falcon-40b-instruct-GGML

https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML

**The official HF models are here:**

https://huggingface.co/tiiuae/falcon-40b/

https://huggingface.co/tiiuae/falcon-7b/

https://huggingface.co/tiiuae/falcon-40b-instruct

https://huggingface.co/tiiuae/falcon-7b-instruct

**Conversion:**

1) Use falcon_convert_demo.py to produce a GGML v0 binary from the HF checkpoint; this intermediate file is not recommended for direct use
2) Use examples/falcon_quantize to convert it into a GGML v3 binary of the quantization type of your choice, with mmap support from there on (see the sketch below)

_Important: The Falcon 7B model features tensor sizes that do not support K-type quantizers - use the traditional quantization types for those_
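
A minimal sketch of the two steps for a 7B model; the argument layout of falcon_convert_demo.py and falcon_quantize, the paths, and the q4_0 target type are assumptions - check each tool's usage output before running:

```bash
# Hypothetical conversion pipeline (paths and arguments are placeholders)

# Step 1: HF checkpoint -> intermediate GGML v0 file (not meant to be used directly)
python falcon_convert_demo.py ~/models/falcon-7b ~/models/falcon-7b/ggml-model-f32.bin

# Step 2: GGML v0 -> GGML v3 with the quantization of your choice
# (q4_0 shown here; K-types such as q3_k will not work for the 7B model)
./falcon_quantize ~/models/falcon-7b/ggml-model-f32.bin ~/models/falcon-7b/ggml-model-q4_0.bin q4_0
```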
**Status/Bugs:**

* CUDA-integration branch demo ready
* The Python conversion script is very basic (produces GGML v0)
* On Linux with the Q5_1 7B model, a user reports a context memory issue during batch token ingestion; with -b 1 it is gone (see the example below). Not reproduced on Windows
* The VRAM scratch/overhead calculation on CUDA can fail - if GPU RAM fills to 100%, manually reduce the number of layers given to --ngl until it fits
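
A hypothetical invocation of the -b 1 workaround; the binary name falcon_main, the model path, and the prompt are assumptions in the style of the usual llama.cpp options:

```bash
# Workaround sketch: force a batch size of 1 to avoid the reported context memory issue
./falcon_main -m ~/models/falcon-7b/ggml-model-q5_1.bin -b 1 -p "Hello, Falcon!"
```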
**CUDA (cuda-integration branch, not in master yet):**

Only some tensors are supported so far, and only the mul_mat operation is currently implemented on CUDA.

q3_k timing on a 3090 for Falcon 40B:

falcon_print_timings: prompt eval time =  702.55 ms /   3 tokens ( 234.18 ms per token)
falcon_print_timings:        eval time = 3350.65 ms /  24 runs   ( 139.61 ms per token)

q4_k timing on a 3090 for Falcon 40B (partial offload):

falcon_print_timings: prompt eval time =  590.82 ms /   3 tokens ( 196.94 ms per token)
falcon_print_timings:        eval time = 2817.37 ms /  24 runs   ( 117.39 ms per token)

q4_1 timing on a 3090 for Falcon 7B:

falcon_print_timings: prompt eval time =  115.30 ms /   3 tokens (  38.43 ms per token)
falcon_print_timings:        eval time = 5926.74 ms / 147 runs   (  40.32 ms per token)
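
The "ms per token" figures translate to throughput as 1000 / (ms per token) tokens per second; a quick check of the eval timings above (the numbers are copied from the logs, the conversion is just the reciprocal):

```bash
# tokens/second = 1000 / (ms per token), applied to the eval timings above
awk 'BEGIN {
    printf "q3_k 40B               : %.1f tokens/s\n", 1000 / 139.61   # ~7.2
    printf "q4_k 40B (part. offl.) : %.1f tokens/s\n", 1000 / 117.39   # ~8.5
    printf "q4_1 7B                : %.1f tokens/s\n", 1000 / 40.32    # ~24.8
}'
```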
CUDA sidenotes:

1) Use one thread less than you have physical processor cores
2) If it is too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at first inference (see the example below)
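
A hypothetical launch line combining both notes; the binary name, model path, prompt, and the starting --ngl value are assumptions - pick -t as your physical core count minus one and lower --ngl step by step if VRAM still fills up:

```bash
# Sketch: 8 physical cores -> -t 7; start with a guessed --ngl and reduce it if GPU memory hits 100%
./falcon_main -m ~/models/falcon-40b/ggml-model-q3_k.bin -t 7 --ngl 60 -p "Hello, Falcon!"
```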

It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time, at about 2 tokens/second (roughly 500 ms per token).

CPU inference examples:
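
An illustrative CPU-only run (not copied from upstream); the falcon_main binary name, the model path, and the flag values are assumptions following the usual llama.cpp-style CLI:

```bash
# Sketch of a CPU-only inference run of the Q5_1 40B model (no --ngl, so nothing is offloaded to the GPU)
./falcon_main -m ~/models/falcon-40b/ggml-model-q5_1.bin -t 7 -n 128 -p "Write a short poem about falcons."
```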