cuBLAS doc + error if -ngl > 0 and no cuBLAS
parent 63d20469b8
commit 9b1f955083

2 changed files with 7 additions and 2 deletions
README.md

@@ -278,7 +278,7 @@ Building the program with BLAS support may lead to some performance improvements
 - cuBLAS
 
-  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
+  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. It also enables GPU accelerated token generation via llama.cpp CUDA kernels. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
   - Using `make`:
 
     ```bash
     make LLAMA_CUBLAS=1
@@ -292,6 +292,8 @@ Building the program with BLAS support may lead to some performance improvements
     cmake --build . --config Release
     ```
 
+  Prompt processing will automatically be GPU accelerated. To enable token generation acceleration use the `-ngl` or `--n-gpu-layers` argument and specify how many layers should be offloaded to the GPU. A higher value will enable more GPU acceleration but also increase VRAM usage. Maximum effective values: 33 for 7b, 41 for 13b, 61 for 33b, 81 for 65b. Multi-GPU setups and iGPUs are currently not supported.
+
   Note: Because llama.cpp uses multiple CUDA streams for matrix multiplication, results [are not guaranteed to be reproducible](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility). If you need reproducibility, set `GGML_CUDA_MAX_STREAMS` in the file `ggml-cuda.cu` to 1.
 
 ### Prepare Data & Run
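The new paragraph documents the `-ngl`/`--n-gpu-layers` flag. A usage sketch, assuming a cuBLAS build and a quantized 7B model (the model path, prompt, and token count below are illustrative placeholders, not part of this commit):

```bash
# Build with cuBLAS, then offload up to 33 layers of a 7B model to the GPU.
make LLAMA_CUBLAS=1
./main -m ./models/7B/ggml-model-q4_0.bin \
       -p "Building a website can be done in 10 simple steps:" \
       -n 64 --n-gpu-layers 33
```

A higher `--n-gpu-layers` value offloads more of the model, speeding up token generation at the cost of additional VRAM, as the README paragraph above notes.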
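The reproducibility note refers to a compile-time constant in `ggml-cuda.cu`. Assuming it is defined as a preprocessor macro (the surrounding code is not part of this diff), the suggested change would look roughly like:

```cpp
// ggml-cuda.cu (sketch, not part of this commit):
// force a single CUDA stream so matrix multiplication results are reproducible;
// the default is larger, which is what makes results non-deterministic.
#define GGML_CUDA_MAX_STREAMS 1
```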
llama.cpp

@@ -1054,7 +1054,10 @@ static void llama_model_load_internal(
         fprintf(stderr, "%s: [cublas] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
     }
 #else
-    (void) n_gpu_layers;
+    if (n_gpu_layers > 0) {
+        throw format("llama.cpp was compiled without cuBLAS. "
+                     "It is not possible to offload the requested %d layers onto the GPU.\n", n_gpu_layers);
+    }
 #endif
 
     // loading time will be recalculate after the first eval, so
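For readers unfamiliar with the error path this hunk adds: `format()` in llama.cpp builds a `std::string` from printf-style arguments, and the thrown string is caught higher up and reported. A minimal self-contained sketch of that pattern, with a simplified stand-in for `format()` and a made-up `load_model()` wrapper (neither is the actual llama.cpp code):

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Simplified stand-in for llama.cpp's printf-style format() helper.
static std::string format(const char * fmt, ...) {
    char buf[512];
    va_list args;
    va_start(args, fmt);
    vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    return std::string(buf);
}

// Sketch of the guard: without cuBLAS, asking for GPU layers is a hard error
// instead of being silently ignored as before.
static void load_model(int n_gpu_layers) {
#ifdef GGML_USE_CUBLAS
    // ... offload n_gpu_layers layers to the GPU ...
#else
    if (n_gpu_layers > 0) {
        throw format("llama.cpp was compiled without cuBLAS. "
                     "It is not possible to offload the requested %d layers onto the GPU.\n", n_gpu_layers);
    }
#endif
}

int main() {
    try {
        load_model(33); // request GPU offloading from a CPU-only build
    } catch (const std::string & err) {
        fprintf(stderr, "error: %s", err.c_str());
        return 1;
    }
    return 0;
}
```

Run without `GGML_USE_CUBLAS` defined, this prints the same message the commit introduces and exits with a non-zero status.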