CUDA full GPU acceleration, KV cache in VRAM (#1827)

* Fixed CUDA RoPE
* ggml_cuda_mul_mat_vec_p021
* ggml_cuda_scale
* ggml_cuda_diag_mask_inf
* ggml_is_permuted
* ggml_cuda_cpy
* flatten rows for ggml_cuda_op
* Added a --low-vram option
* Fixed Windows performance
* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
2023-06-14 19:47:19 +02:00
commit 254a7a7a5f
parent 9254920265
11 changed files with 853 additions and 149 deletions
--- a/ggml.h
+++ b/ggml.h
@@ -485,6 +485,7 @@ extern "C" {

    GGML_API bool ggml_is_transposed(const struct ggml_tensor * tensor);
    GGML_API bool ggml_is_contiguous(const struct ggml_tensor * tensor);
+    GGML_API bool ggml_is_permuted  (const struct ggml_tensor * tensor);

    // use this to compute the memory overhead of a tensor
    GGML_API size_t ggml_tensor_overhead(void);
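
The hunk above only declares `ggml_is_permuted`; its definition lives in ggml.c and is not shown here. As a rough sketch of what such a predicate could check, assuming "permuted" means that the byte strides are no longer in non-decreasing order across dimensions: `struct sketch_tensor` and every `sketch_*` name below are hypothetical stand-ins for illustration, not part of ggml.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DIMS 4

/* Hypothetical stand-in for the shape/stride fields of a tensor. */
struct sketch_tensor {
    size_t ne[SKETCH_MAX_DIMS]; /* number of elements per dimension */
    size_t nb[SKETCH_MAX_DIMS]; /* stride in bytes per dimension    */
};

/* Build a densely packed tensor: each stride is the previous stride
 * multiplied by the previous dimension's element count. */
static struct sketch_tensor sketch_make_contiguous(size_t elem_size,
        size_t ne0, size_t ne1, size_t ne2, size_t ne3) {
    struct sketch_tensor t = { { ne0, ne1, ne2, ne3 }, { 0, 0, 0, 0 } };
    t.nb[0] = elem_size;
    for (int i = 1; i < SKETCH_MAX_DIMS; ++i) {
        t.nb[i] = t.nb[i - 1] * t.ne[i - 1];
    }
    return t;
}

/* Return a view with dimensions a and b swapped (no data is moved). */
static struct sketch_tensor sketch_swap_dims(struct sketch_tensor t, int a, int b) {
    size_t ne = t.ne[a], nb = t.nb[a];
    t.ne[a] = t.ne[b]; t.nb[a] = t.nb[b];
    t.ne[b] = ne;      t.nb[b] = nb;
    return t;
}

/* Assumed check: any stride larger than the next one means the
 * dimensions are not in their original memory order. */
static bool sketch_is_permuted(const struct sketch_tensor * t) {
    return t->nb[0] > t->nb[1] || t->nb[1] > t->nb[2] || t->nb[2] > t->nb[3];
}

/* Small usage example: a packed 4x3x2x1 float tensor is not permuted,
 * but swapping its dimensions 0 and 2 makes the strides out of order. */
static void sketch_self_check(void) {
    struct sketch_tensor a = sketch_make_contiguous(sizeof(float), 4, 3, 2, 1);
    struct sketch_tensor p = sketch_swap_dims(a, 0, 2);
    assert(!sketch_is_permuted(&a));
    assert(sketch_is_permuted(&p));
}
```

Under this assumed definition a plain transpose (swapping dimensions 0 and 1) also counts as permuted, so a caller that needs to distinguish the two cases would have to test for the transpose separately.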