diff --git a/README.md b/README.md
index 00571d8e1..2788f5423 100644
--- a/README.md
+++ b/README.md
@@ -293,6 +293,8 @@ Building the program with BLAS support may lead to some performance improvements
 cmake --build . -config Release
 ```
 
+
+
 - **cuBLAS**
 
 This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
diff --git a/BLIS.md b/docs/BLIS.md
similarity index 100%
rename from BLIS.md
rename to docs/BLIS.md
diff --git a/docs/token_generation_performance_tips.md b/docs/token_generation_performance_tips.md
new file mode 100644
index 000000000..f318c11cd
--- /dev/null
+++ b/docs/token_generation_performance_tips.md
@@ -0,0 +1,7 @@
+# Token generation performance tips
+
+## Verifying that the model is running on the GPU
+Make sure you compiled llama with the correct env variables according to [this guide](../README.md#cublas).
+
+When running, `llama.cpp` outputs some helpful diagnostic information to stderr.
+To verify that the workload is