Commit graph

22 commits

Author SHA1 Message Date
Georgi Gerganov
ce281b904c llama : disable FA for AMD 2024-04-24 17:54:32 +03:00
Georgi Gerganov
c70bfd7bcb cuda : "constexpr dim3" -> "const dim3" 2024-04-22 20:31:23 +03:00
    ggml-ci
Georgi Gerganov
5408d55506 cuda : uint -> uint32_t 2024-04-22 19:12:06 +03:00
Johannes Gäßler
87968de9a9 fix KQ FP32 precision fpr parallel_blocks > 1 2024-04-18 13:15:32 +02:00
Johannes Gäßler
0bc67dd1c8 Calculate KQ as FP32 if KQV has GGML_PREC_F32 2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0 store temp KQ in registers 2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3 flush softmax exp below threshold to 0 2024-04-18 13:15:32 +02:00
Johannes Gäßler
6a3b84236d fix flash_attn_vec_f16 race condition 2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39 CUDA: refactor host code, dyn. par. blocks 2024-04-18 13:15:32 +02:00
Johannes Gäßler
ee19a4ab7e fix KV cache padding, NaN from INFINITY (#6438) 2024-04-02 17:26:22 +02:00
Johannes Gäßler
c63dfdf765 fix cmake build 2024-04-02 13:48:13 +03:00
Johannes Gäßler
bb0d51accd fix excessive KQ_b loads 2024-04-02 13:48:13 +03:00
Johannes Gäßler
e1ecd3b129 fix compile warnings 2024-04-02 13:48:13 +03:00
Johannes Gäßler
3f777acf06 Multiple parallel blocks for batch size 1 2024-04-02 13:48:13 +03:00
Johannes Gäßler
68d793bee8 no ncols == 64 2024-04-02 13:48:13 +03:00
Johannes Gäßler
cca6d027a3 4 warps, 256 stride for all D 2024-04-02 13:48:13 +03:00
Johannes Gäßler
269374ed81 adjust kernel selection logic 2024-04-02 13:48:13 +03:00
Johannes Gäßler
81da919864 no vec for hs, no hs==256 ncols==32 for Volta 2024-04-02 13:48:13 +03:00
Johannes Gäßler
d59ac670bf 16 cols for Phi-2 2024-04-02 13:48:13 +03:00
Johannes Gäßler
75aa7b4b18 CUDA: faster FlashAttention, kernel for bs == 1 2024-04-02 13:48:13 +03:00
Georgi Gerganov
6be02b5969 cuda : fix build 2024-03-27 10:31:52 +02:00
Georgi Gerganov
013721df2b Merge branch 'master' into gg/flash-attn 2024-03-27 10:24:09 +02:00