Update README.md

John 2023-06-18 05:57:19 +02:00 committed by GitHub
parent cbb31807a3
commit 80f654631e

@@ -2,10 +2,30 @@ llama.cpp modification to run Falcon (work in progress)
Status/Bugs:
* Quantization with QK_ types appears to fail on 7B models (Q_ works on both 7B and 40B; QK_ works on 40B)
* CUDA not yet functional in master; the CUDA-integration branch has a working demo
* Python conversion script is very basic (produces ggml v0)
* On Linux, a user reports a context memory issue during batched token ingestion with a Q5_1 7B model; with -b 1 it's gone (see the example below). Not reproduced on Windows
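
A minimal sketch of that workaround, assuming a llama.cpp-style `falcon_main` binary and the usual -m/-p/-t/-n flags (binary name, model path, and prompt are placeholders):

```
# Hypothetical invocation: -b 1 forces single-token batches, which reportedly
# avoids the context memory issue during batch ingestion on Linux.
./falcon_main -m /path/to/falcon-7b-q5_1.bin -p "Hello" -t 8 -n 64 -b 1
```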
CUDA (cuda-integration branch, not in master yet):
Only some tensors are supported so far, and only the mul_mat operation.
q3_k timing on 3090 of Falcon 40B:
```
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
```
q4_k timing on 3090 of Falcon 40B (partial offload):
```
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)
```
q4_1 timing on 3090 of Falcon 7B:
```
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)
```
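For comparison, the eval times above are simply reciprocals of the per-token figures: 1000 / 139.61 ≈ 7.2 tokens/s for q3_k 40B, 1000 / 117.39 ≈ 8.5 tokens/s for q4_k 40B with partial offload, and 1000 / 40.32 ≈ 24.8 tokens/s for q4_1 7B on the 3090.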
CUDA sidenote:
1) Use one fewer thread than you have physical processor cores.
2) If it's too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at the first inference (see the example below).
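
A hedged example of those two settings, assuming an 8-core machine, a llama.cpp-style `falcon_main` binary, and the -t flag for threads alongside the --ngl flag mentioned above (binary name, model path, and the layer count are placeholders):

```
# Hypothetical run: 7 threads (one fewer than the 8 physical cores) and a
# reduced --ngl so GPU memory does not saturate fully on the first inference.
./falcon_main -m /path/to/falcon-40b-q4_k.bin -p "Hello" -t 7 --ngl 40
```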
It appears Q5 Falcon 40B inference on CPU is about as fast as A100 fp16 inference, at roughly 2 tokens/second.
CPU inference examples:
```