CUDA: use tensor cores for MMQ (#7676)

* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early
Johannes Gäßler, 2024-06-10 11:45:13 +02:00, committed by GitHub
parent af4ae502dd
commit 1f0dabda8d
7 changed files with 550 additions and 55 deletions


@@ -40,7 +40,7 @@ static __global__ void flash_attn_vec_ext_f16(
         const int ne1,
         const int ne2,
         const int ne3) {
-#if FP16_AVAILABLE
+#ifdef FP16_AVAILABLE
     //In this kernel Q, K, V are matrices while i, j, k are matrix indices.
     constexpr vec_dot_KQ_f16_t vec_dot_KQ = get_vec_dot_KQ_f16<D>(type_K);