From 72f358150c8d9b311c7e345d2f2aa50291fa4bad Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sun, 18 Jun 2023 16:37:21 +0200
Subject: [PATCH] Update README.md

---
 README.md | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index ad590cf80..06ac9f815 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,31 @@
 llama.cpp modification to run Falcon (work in progress)
 
-Status/Bugs:
-* Quantization with QK_ type appear to fail on 7B models. (Q_ works on both, QK_ works on 40B)
-* CUDA-integration branch demo ready
-* python conversion script is very basic (produces ggml v0)
-* On linux Q5_1 7B user reports a batch token ingestion context memory issue, with -b 1 it's gone. Not reproduced on Windows
+**TheBloke provides well-known fine-tuned variants with quantization:**
+https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
+https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
 
-CUDA (cuda-integration branch, not in master yet):
-Only some tensors supported currently, only mul_mat operation supported currently
+
+**The official HF models are here:**
+https://huggingface.co/tiiuae/falcon-40b/
+https://huggingface.co/tiiuae/falcon-7b/
+https://huggingface.co/tiiuae/falcon-40b-instruct
+https://huggingface.co/tiiuae/falcon-7b-instruct
+
+**Conversion:**
+1) Use falcon_convert_demo.py to produce a GGMLv0 binary from the HF weights - not recommended for direct use
+2) Use examples/falcon_quantize to convert that binary into a GGMLv3 binary of your chosen quantization type, with mmap support from there on
+_Important: The Falcon 7B model has tensor sizes that do not support K-type quantizers - use the traditional quantization types for it_
+
+**Status/Bugs:**
+* CUDA-integration branch demo ready
+* Python conversion script is very basic (produces GGML v0)
+* On Linux, a user reports a batch token ingestion context memory issue with Q5_1 7B; with -b 1 it is gone. Not reproduced on Windows
+* VRAM scratch/overhead calculation on CUDA can fail - if GPU RAM fills to 100%, manually reduce the layer count passed to --ngl until it fits
+
+
+
+**CUDA (cuda-integration branch, not in master yet):**
+Only some tensors and only the mul_mat operation are supported currently
 
 q3_k timing on 3090 of Falcon 40B:
 falcon_print_timings: prompt eval time = 702.55 ms / 3 tokens ( 234.18 ms per token)
 falcon_print_timings: eval time = 3350.65 ms / 24 runs ( 139.61 ms per token)
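The Conversion steps in the patched README can be strung together as a short command sequence. This is only a sketch: the argument order for falcon_convert_demo.py and falcon_quantize, the q4_0 type name, the falcon_main binary name, and all paths are assumptions for illustration rather than anything confirmed by the patch, so check each tool's usage output before running.

```bash
# Sketch of the Conversion workflow; argument order, type names, binary names and
# paths are assumptions, not confirmed by the patch - check each tool's help output.

# 1) Produce a GGMLv0 binary from the Hugging Face checkout (hypothetical paths).
python falcon_convert_demo.py ~/models/falcon-7b ~/models/falcon-7b/ggml-model-f32.bin

# 2) Re-quantize it into a GGMLv3 binary. Falcon 7B needs a traditional type
#    such as q4_0 or q5_1; K-type quantizers are not supported for its tensor sizes.
./falcon_quantize ~/models/falcon-7b/ggml-model-f32.bin ~/models/falcon-7b/ggml-model-q4_0.bin q4_0

# 3) Run it. If VRAM fills to 100%, reduce the --ngl layer count until the model fits;
#    -b 1 works around the Linux batch ingestion issue noted under Status/Bugs.
./falcon_main -m ~/models/falcon-7b/ggml-model-q4_0.bin --ngl 20 -b 1 -p "Hello, Falcon!"
```

Keeping the intermediate GGMLv0 file around makes it easy to re-run step 2 with a different quantization type later without touching the HF weights again.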