llama.cpp

Author	SHA1	Message	Date
Georgi Gerganov	ce281b904c	llama : disable FA for AMD	2024-04-24 17:54:32 +03:00
Georgi Gerganov	c70bfd7bcb	cuda : "constexpr dim3" -> "const dim3" ggml-ci	2024-04-22 20:31:23 +03:00
Georgi Gerganov	5408d55506	cuda : uint -> uint32_t	2024-04-22 19:12:06 +03:00
Johannes Gäßler	87968de9a9	fix KQ FP32 precision for parallel_blocks > 1	2024-04-18 13:15:32 +02:00
Johannes Gäßler	0bc67dd1c8	Calculate KQ as FP32 if KQV has GGML_PREC_F32	2024-04-18 13:15:32 +02:00
Johannes Gäßler	a5b0e2dea0	store temp KQ in registers	2024-04-18 13:15:32 +02:00
Johannes Gäßler	ef9e1593f3	flush softmax exp below threshold to 0	2024-04-18 13:15:32 +02:00
Johannes Gäßler	6a3b84236d	fix flash_attn_vec_f16 race condition	2024-04-18 13:15:32 +02:00
Johannes Gäßler	34f93bbb39	CUDA: refactor host code, dyn. par. blocks	2024-04-18 13:15:32 +02:00
Johannes Gäßler	ee19a4ab7e	fix KV cache padding, NaN from INFINITY (#6438)	2024-04-02 17:26:22 +02:00
Johannes Gäßler	c63dfdf765	fix cmake build	2024-04-02 13:48:13 +03:00
Johannes Gäßler	bb0d51accd	fix excessive KQ_b loads	2024-04-02 13:48:13 +03:00
Johannes Gäßler	e1ecd3b129	fix compile warnings	2024-04-02 13:48:13 +03:00
Johannes Gäßler	3f777acf06	Multiple parallel blocks for batch size 1	2024-04-02 13:48:13 +03:00
Johannes Gäßler	68d793bee8	no ncols == 64	2024-04-02 13:48:13 +03:00
Johannes Gäßler	cca6d027a3	4 warps, 256 stride for all D	2024-04-02 13:48:13 +03:00
Johannes Gäßler	269374ed81	adjust kernel selection logic	2024-04-02 13:48:13 +03:00
Johannes Gäßler	81da919864	no vec for hs, no hs==256 ncols==32 for Volta	2024-04-02 13:48:13 +03:00
Johannes Gäßler	d59ac670bf	16 cols for Phi-2	2024-04-02 13:48:13 +03:00
Johannes Gäßler	75aa7b4b18	CUDA: faster FlashAttention, kernel for bs == 1	2024-04-02 13:48:13 +03:00
Georgi Gerganov	6be02b5969	cuda : fix build	2024-03-27 10:31:52 +02:00
Georgi Gerganov	013721df2b	Merge branch 'master' into gg/flash-attn	2024-03-27 10:24:09 +02:00

22 commits