llama.cpp modification to run Falcon (work in progress)

**TheBloke provides well-known fine-tuned variants with quantization:**

https://huggingface.co/TheBloke/falcon-40b-instruct-GGML

https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML

**The official HF models are here:**

https://huggingface.co/tiiuae/falcon-40b/

https://huggingface.co/tiiuae/falcon-7b/

https://huggingface.co/tiiuae/falcon-40b-instruct

https://huggingface.co/tiiuae/falcon-7b-instruct

**Conversion:**

1) Use falcon_convert_demo.py to produce a GGML v0 binary from the HF checkpoint; this intermediate file is not recommended for direct use
2) Use examples/falcon_quantize to convert it into a GGML v3 binary of the quantization type of your choice, with mmap support from there on (see the sketch below)

_Important: The Falcon 7B model features tensor sizes that do not support K-type quantizers - use the traditional quantization types for those_
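
A minimal sketch of the two steps for a 7B model; the argument layout of falcon_convert_demo.py and falcon_quantize, the paths, and the q4_0 target type are assumptions - check each tool's usage output before running:

```bash
# Hypothetical conversion pipeline (paths and arguments are placeholders)

# Step 1: HF checkpoint -> intermediate GGML v0 file (not meant to be used directly)
python falcon_convert_demo.py ~/models/falcon-7b ~/models/falcon-7b/ggml-model-f32.bin

# Step 2: GGML v0 -> GGML v3 with the quantization of your choice
# (q4_0 shown here; K-types such as q3_k will not work for the 7B model)
./falcon_quantize ~/models/falcon-7b/ggml-model-f32.bin ~/models/falcon-7b/ggml-model-q4_0.bin q4_0
```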
**Status/Bugs:**

* CUDA-integration branch demo ready
* The Python conversion script is very basic (produces GGML v0)
* On Linux with the Q5_1 7B model, a user reports a context memory issue during batch token ingestion; with -b 1 it is gone (see the example below). Not reproduced on Windows
* The VRAM scratch/overhead calculation on CUDA can fail - if GPU RAM fills to 100%, manually reduce the number of layers given to --ngl until it fits
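
A hypothetical invocation of the -b 1 workaround; the binary name falcon_main, the model path, and the prompt are assumptions in the style of the usual llama.cpp options:

```bash
# Workaround sketch: force a batch size of 1 to avoid the reported context memory issue
./falcon_main -m ~/models/falcon-7b/ggml-model-q5_1.bin -b 1 -p "Hello, Falcon!"
```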
**CUDA (cuda-integration branch, not in master yet):**

Only some tensors are supported so far, and only the mul_mat operation is currently implemented on CUDA.

q3_k timing on a 3090 for Falcon 40B:

falcon_print_timings: prompt eval time =  702.55 ms /   3 tokens ( 234.18 ms per token)
falcon_print_timings:        eval time = 3350.65 ms /  24 runs   ( 139.61 ms per token)

q4_k timing on a 3090 for Falcon 40B (partial offload):

falcon_print_timings: prompt eval time =  590.82 ms /   3 tokens ( 196.94 ms per token)
falcon_print_timings:        eval time = 2817.37 ms /  24 runs   ( 117.39 ms per token)

q4_1 timing on a 3090 for Falcon 7B:

falcon_print_timings: prompt eval time =  115.30 ms /   3 tokens (  38.43 ms per token)
falcon_print_timings:        eval time = 5926.74 ms / 147 runs   (  40.32 ms per token)
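
The "ms per token" figures translate to throughput as 1000 / (ms per token) tokens per second; a quick check of the eval timings above (the numbers are copied from the logs, the conversion is just the reciprocal):

```bash
# tokens/second = 1000 / (ms per token), applied to the eval timings above
awk 'BEGIN {
    printf "q3_k 40B               : %.1f tokens/s\n", 1000 / 139.61   # ~7.2
    printf "q4_k 40B (part. offl.) : %.1f tokens/s\n", 1000 / 117.39   # ~8.5
    printf "q4_1 7B                : %.1f tokens/s\n", 1000 / 40.32    # ~24.8
}'
```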
CUDA sidenotes:

1) Use one thread less than you have physical processor cores
2) If it is too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory no longer saturates fully at first inference (see the example below)
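
A hypothetical launch line combining both notes; the binary name, model path, prompt, and the starting --ngl value are assumptions - pick -t as your physical core count minus one and lower --ngl step by step if VRAM still fills up:

```bash
# Sketch: 8 physical cores -> -t 7; start with a guessed --ngl and reduce it if GPU memory hits 100%
./falcon_main -m ~/models/falcon-40b/ggml-model-q3_k.bin -t 7 --ngl 60 -p "Hello, Falcon!"
```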

It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time, at about 2 tokens/second (roughly 500 ms per token).

CPU inference examples:
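
An illustrative CPU-only run (not copied from upstream); the falcon_main binary name, the model path, and the flag values are assumptions following the usual llama.cpp-style CLI:

```bash
# Sketch of a CPU-only inference run of the Q5_1 40B model (no --ngl, so nothing is offloaded to the GPU)
./falcon_main -m ~/models/falcon-40b/ggml-model-q5_1.bin -t 7 -n 128 -p "Write a short poem about falcons."
```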