CUDA full GPU acceleration, KV cache in VRAM (#1827)

* Fixed CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
2023-06-14 19:47:19 +02:00 · 2023-06-14 19:47:19 +02:00 · 254a7a7a5f
commit 254a7a7a5f
parent 9254920265
11 changed files with 853 additions and 149 deletions
--- a/llama.h
+++ b/llama.h
@ -77,6 +77,7 @@ extern "C" {
        int n_gpu_layers;                      // number of layers to store in VRAM
        int main_gpu;                          // the GPU that is used for scratch and small tensors
        float tensor_split[LLAMA_MAX_DEVICES]; // how to split layers across multiple GPUs
+        bool low_vram;                         // if true, reduce VRAM usage at the cost of performance
        int seed;                              // RNG seed, -1 for random

        bool f16_kv;     // use fp16 for KV cache