diff --git a/README.md b/README.md
index ae56b6d1a..52b7ed300 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,8 @@ llama.cpp modification to run Falcon (work in progress)
 
 Status:
 * Quantization works except for Q_K_ types
-* CUDA not yet functional
+* CUDA not yet functional
+* context size calculation is not correct yet (CUDA as well as CPU)
 
 It appears the Q5 Falcon 40B inference time on CPU is as fast as the A100 fp16 inference time at 2 tk/second