From f75125615af0889cec38d16ff2aafde7a77ee64d Mon Sep 17 00:00:00 2001
From: John <78893154+cmp-nct@users.noreply.github.com>
Date: Sat, 17 Jun 2023 18:57:40 +0200
Subject: [PATCH] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index be2eefeba..e122703e5 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,10 @@
 llama.cpp modification to run Falcon (work in progress)
 
-Status:
+Status/Bugs:
 * Quantization works except for Q_K_ types
 * CUDA not yet functional
 * python conversion script is very basic (produces ggml v0)
+* On linux Q5_1 7B user reports a batch token ingestion context memory issue, with -b 1 it's gone. Not reproduced on Windows
 
 It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second
 
 CPU inference examples:
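
The `-b 1` workaround added by this patch could be exercised along these lines; a hypothetical sketch assuming a llama.cpp-style `main` binary with the usual `-m`/`-p`/`-b` flags, and a made-up model filename:

```shell
# Hypothetical invocation; the model path is an assumption, not from the patch.
# -b 1 forces single-token batch ingestion, which the bug report above says
# avoids the Q5_1 7B context memory issue observed on Linux.
./main -m models/falcon-7b-q5_1.bin -p "Hello Falcon" -b 1
```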