CUDA: quantized KV support for FA vec (#7527)

* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
This commit is contained in:
Johannes Gäßler 2024-06-01 08:44:14 +02:00 committed by GitHub
parent a323ec60af
commit 9b596417af
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
110 changed files with 2697 additions and 1200 deletions

View file

@ -779,6 +779,12 @@ static bool ggml_metal_supports_op(const struct ggml_metal_context * ctx, const
case GGML_OP_LEAKY_RELU:
return true;
case GGML_OP_FLASH_ATTN_EXT:
if (op->src[1]->type != GGML_TYPE_F16) {
return false;
}
if (op->src[2]->type != GGML_TYPE_F16) {
return false;
}
if (op->src[0]->ne[0] == 256) {
return false;
}