cuBLAS doc + error if -ngl > 0 and no cuBLAS

JohannesGaessler 2023-05-15 11:27:32 +02:00
parent 63d20469b8
commit 9b1f955083
2 changed files with 7 additions and 2 deletions

README.md

@@ -278,7 +278,7 @@ Building the program with BLAS support may lead to some performance improvements
 - cuBLAS
-  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
+  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. It also enables GPU accelerated token generation via llama.cpp CUDA kernels. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
   - Using `make`:
     ```bash
     make LLAMA_CUBLAS=1
@@ -292,6 +292,8 @@ Building the program with BLAS support may lead to some performance improvements
     cmake --build . --config Release
     ```
+  Prompt processing will automatically be GPU accelerated. To enable token generation acceleration use the `-ngl` or `--n-gpu-layers` argument and specify how many layers should be offloaded to the GPU. A higher value will enable more GPU acceleration but also increase VRAM usage. Maximum effective values: 33 for 7b, 41 for 13b, 61 for 33b, 81 for 65b. Multi-GPU setups and iGPUs are currently not supported.
   Note: Because llama.cpp uses multiple CUDA streams for matrix multiplication results [are not guaranteed to be reproducible](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility). If you need reproducibility, set `GGML_CUDA_MAX_STREAMS` in the file `ggml-cuda.cu` to 1.
 ### Prepare Data & Run
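
As a usage sketch of the documentation change above (not part of the commit): assuming the `main` example binary and a quantized 7B model at a placeholder path, a cuBLAS build with partial GPU offload could be invoked roughly as follows; the model path, prompt, and layer count are illustrative only.

```bash
# Build with cuBLAS enabled, as documented above (requires the CUDA toolkit).
make LLAMA_CUBLAS=1

# Offload 32 of the model's layers to the GPU via -ngl / --n-gpu-layers.
# Larger values give more GPU acceleration but also use more VRAM.
./main -m models/7B/ggml-model-q4_0.bin \
    -p "Building a website can be done in 10 simple steps:" \
    -n 128 -ngl 32
```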

llama.cpp

@@ -1054,7 +1054,10 @@ static void llama_model_load_internal(
         fprintf(stderr, "%s: [cublas] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
     }
 #else
-    (void) n_gpu_layers;
+    if (n_gpu_layers > 0) {
+        throw format("llama.cpp was compiled without cuBLAS. "
+                     "It is not possible to offload the requested %d layers onto the GPU.\n", n_gpu_layers);
+    }
 #endif
     // loading time will be recalculate after the first eval, so
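
A behavioral sketch of the new guard (again not part of the commit): on a build compiled without cuBLAS, requesting GPU offload should now abort model loading with the error text introduced above, instead of the flag being silently discarded as the removed `(void) n_gpu_layers;` did. Binary name and model path are the same placeholders as before.

```bash
# Default CPU-only build, i.e. compiled without LLAMA_CUBLAS=1.
make

# With this change, asking for GPU offload on a non-cuBLAS build fails while
# loading the model, reporting roughly:
#   llama.cpp was compiled without cuBLAS. It is not possible to offload
#   the requested 32 layers onto the GPU.
./main -m models/7B/ggml-model-q4_0.bin -p "Hello" -ngl 32
```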