was reverted on cuda merge

This commit is contained in:
John 2023-06-19 13:43:12 +02:00
parent c5399d1cf7
commit eb22d7e504


@@ -1,10 +1,48 @@
llama.cpp modification to run Falcon (work in progress)
**TheBloke provides well-known fine-tuned variants with quantization:**
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
**The official HF models are here:**
https://huggingface.co/tiiuae/falcon-40b/
https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct
**Conversion:**
1) Use falcon_convert_demo.py to produce a GGMLv0 binary from the HF weights; the intermediate file is not recommended for direct use.
2) Use examples/falcon_quantize to convert it into a GGMLv3 binary of your chosen quantization type, with mmap support from there on (see the sketch after this list).
_Important: The Falcon 7B model has tensor sizes that are not compatible with K-type quantizers; use the traditional quantization types for it._
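A minimal sketch of the two-step conversion, assuming the quantization tool builds as a falcon_quantize binary, that falcon_convert_demo.py takes the HF model directory, an output directory and an output type as positional arguments, and that the intermediate file is named ggml-model-f16.bin; the paths, argument order and file names here are assumptions, not verified usage:

```bash
# Step 1: HF weights -> intermediate GGMLv0 file (arguments are assumed)
python3 falcon_convert_demo.py ~/models/falcon-7b-instruct ~/models/ggml 1

# Step 2: GGMLv0 -> GGMLv3 with the chosen quantization type (Q4_1 here);
# avoid K-type quants such as Q3_K for the 7B model, as noted above
./falcon_quantize ~/models/ggml/ggml-model-f16.bin ~/models/falcon-7b-q4_1.bin q4_1
```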
**Status/Bugs:**
* CUDA-integration branch demo ready
* The Python conversion script is very basic (produces GGML v0)
* On Linux, a user reports a context-memory issue during batched token ingestion with the Q5_1 7B model; with -b 1 it is gone. Not reproduced on Windows (see the sketch after this list)
* The VRAM scratch/overhead calculation on CUDA can fail; if GPU memory fills to 100%, manually reduce --ngl until the model fits
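A minimal sketch of the -b 1 workaround, assuming the main example builds as a falcon_main binary and keeps llama.cpp-style flags (-m for the model, -n for the token count, -p for the prompt); the binary name and every flag except -b are assumptions carried over from the parent project:

```bash
# Force batch size 1 to avoid the reported Linux Q5_1 7B batch-ingestion issue
# (falcon_main and the -m/-n/-p flags are assumed llama.cpp-style conventions)
./falcon_main -m ~/models/falcon-7b-q5_1.bin -b 1 -n 64 -p "Write a haiku about GPUs."
```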
**CUDA (cuda-integration branch, not in master yet):**
Only some tensors and only the mul_mat operation are supported at the moment.
q3_k timing on 3090 of Falcon 40B:
falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
q4_k timing on 3090 of Falcon 40B (partial offload):
falcon_print_timings: prompt eval time = 590.82 ms / 3 tokens ( 196.94 ms per token)
falcon_print_timings: eval time = 2817.37 ms / 24 runs ( 117.39 ms per token)
q4_1 timing on 3090 of Falcon 7B:
falcon_print_timings: prompt eval time = 115.30 ms / 3 tokens ( 38.43 ms per token)
falcon_print_timings: eval time = 5926.74 ms / 147 runs ( 40.32 ms per token)
**CUDA sidenote:**
1) Use one thread fewer than your number of physical processor cores.
2) If inference is too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at the first inference (see the sketch below).
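A minimal sketch of both sidenotes combined, assuming a machine with 8 physical cores, the falcon_main binary name, and llama.cpp-style -t/-m/-n/-p flags; apart from --ngl, which is quoted from the note above, these names and flags are assumptions:

```bash
# 8 physical cores -> run with 7 threads; start --ngl low and raise it only
# while GPU memory does not saturate fully at the first inference
./falcon_main -m ~/models/falcon-40b-q4_k.bin -t 7 --ngl 40 \
    -n 128 -p "Explain quantization in one paragraph."
```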
It appears that Q5 Falcon 40B inference on CPU is about as fast as A100 fp16 inference, at roughly 2 tokens/second (around 500 ms per token).
**CPU inference examples:**