Commit graph

3857 commits

Author SHA1 Message Date
Georgi Gerganov
20f1789dfb vulkan : fix build (#0)
ggml-ci
2024-08-27 22:41:27 +03:00
Georgi Gerganov
231cff5f6f sync : ggml 2024-08-27 22:41:27 +03:00
Xie Yanbo
3246fe84d7
Fix minicpm example directory (#9111) 2024-08-27 14:33:08 +02:00
compilade
78eb487bb0
llama : fix qs.n_attention_wv for DeepSeek-V2 (#9156) 2024-08-27 13:09:23 +03:00
Xuan Son Nguyen
a77feb5d71
server : add some missing env variables (#9116)
* server : add some missing env variables

* add LLAMA_ARG_HOST to server dockerfile

* also add LLAMA_ARG_CONT_BATCHING
2024-08-27 11:07:01 +02:00
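
The variables named above (LLAMA_ARG_HOST, LLAMA_ARG_CONT_BATCHING) let server arguments be supplied through the environment. Below is a minimal sketch of such an env-fallback pattern; the helper name and defaults are hypothetical, not the actual llama.cpp argument parser.

```cpp
// Illustrative sketch only: how an environment-variable fallback for a server
// argument might look. Helper name and defaults are placeholders, not the
// actual llama.cpp argument-parsing code.
#include <cstdlib>
#include <string>

// Return the value of `name` from the environment, or `def` if it is unset.
static std::string get_env_or(const char * name, const std::string & def) {
    const char * val = std::getenv(name);
    return val ? std::string(val) : def;
}

int main() {
    // LLAMA_ARG_HOST and LLAMA_ARG_CONT_BATCHING are among the variables the
    // commit mentions; the defaults shown here are placeholders.
    const std::string host   = get_env_or("LLAMA_ARG_HOST", "127.0.0.1");
    const bool cont_batching = get_env_or("LLAMA_ARG_CONT_BATCHING", "1") != "0";
    (void) host; (void) cont_batching;
    return 0;
}
```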
CausalLM
2e59d61c1b
llama : fix ChatGLM4 wrong shape (#9194)
This should fix THUDM/glm-4-9b-chat-1m and CausalLM/miniG
2024-08-27 09:58:22 +03:00
Carsten Kragelund Jørgensen
75e1dbbaab
llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141)
* fix: llama3.1 rope_freqs not respecting custom head_dim

* fix: use potential head_dim for Exaone
2024-08-27 09:53:40 +03:00
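
For context: RoPE inverse frequencies depend on the per-head dimension, so a model with a custom head_dim needs that value rather than n_embd / n_head. A minimal sketch of the standard formula, with illustrative names only (not the conversion-script or llama.cpp code):

```cpp
// Minimal sketch of standard RoPE inverse-frequency computation, assuming the
// usual formula freq = 1 / theta^(i / head_dim) with i stepping by 2.
// Variable names are illustrative.
#include <cmath>
#include <vector>

std::vector<float> rope_inv_freqs(int head_dim, float theta = 10000.0f) {
    std::vector<float> freqs;
    for (int i = 0; i < head_dim; i += 2) {
        // Using n_embd / n_head here instead of the model's configured
        // head_dim is exactly the kind of mismatch the commit addresses.
        freqs.push_back(1.0f / std::pow(theta, (float) i / (float) head_dim));
    }
    return freqs;
}
```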
arch-btw
ad76569f8e
common : Update stb_image.h to latest version (#9161)
* Update stb_image.h to latest version

Fixes https://github.com/ggerganov/llama.cpp/issues/7431

* Update .ecrc
2024-08-27 08:58:50 +03:00
slaren
7d787ed96c
ggml : do not crash when quantizing q4_x_x with an imatrix (#9192) 2024-08-26 19:44:43 +02:00
Georgi Gerganov
06658ad7c3
metal : separate scale and mask from QKT in FA kernel (#9189)
* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory
2024-08-26 18:31:02 +03:00
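
The idea, roughly: compute the raw QK^T products first, then apply the scale and the additive mask as a separate step. A scalar C++ sketch of that ordering (illustrative only, not the Metal kernel):

```cpp
// Rough scalar sketch of the ordering described in the commit: the raw QK^T
// values are computed first, then scale and mask are applied in a separate
// pass. Illustrative C++ pseudocode, not the Metal kernel.
#include <cstddef>
#include <vector>

void scale_and_mask(std::vector<float> & scores,      // raw QK^T values
                    const std::vector<float> & mask,  // additive mask (e.g. -INF for masked positions)
                    float scale) {                    // typically 1/sqrt(head_dim)
    for (size_t i = 0; i < scores.size(); ++i) {
        scores[i] = scores[i] * scale + mask[i];      // applied separately from the matmul
    }
}
```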
Georgi Gerganov
fc18425b6a
ggml : add SSM Metal kernels (#8546)
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
2024-08-26 17:55:36 +03:00
Georgi Gerganov
879275ac98
tests : fix compile warnings for unreachable code (#9185)
ggml-ci
2024-08-26 16:30:25 +03:00
Georgi Gerganov
7a3df798fc
ci : add VULKAN support to ggml-ci (#9055) 2024-08-26 12:19:39 +03:00
Georgi Gerganov
e5edb210cd
server : update deps (#9183) 2024-08-26 12:16:57 +03:00
slaren
0c41e03ceb
metal : gemma2 flash attention support (#9159) 2024-08-26 11:08:59 +02:00
slaren
f12ceaca0c
ggml-ci : try to improve build time (#9160) 2024-08-26 11:03:30 +02:00
Justine Tunney
436787f170
llama : fix time complexity of string replacement (#9163)
This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. This is because the algorithm used
was quadratic, due to memmove() when s.replace() is called in a loop. It
seems most search results and LLM responses actually provide the O(n**2)
algorithm, which is a great tragedy. Using a builder string fixes things
2024-08-26 09:09:53 +03:00
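
The builder-string fix described above appends unchanged spans and replacements to a fresh output string instead of editing the original in place. A minimal sketch of such a linear-time replace (names are illustrative, not the exact function from the commit):

```cpp
// Minimal sketch of the builder-string idea: append unchanged pieces and
// replacements to a new string rather than calling replace() on the original
// in a loop, avoiding the repeated memmove() that makes the naive version
// quadratic in the string length.
#include <string>

std::string replace_all(const std::string & s, const std::string & from, const std::string & to) {
    if (from.empty()) {
        return s;
    }
    std::string builder;
    builder.reserve(s.size());
    size_t pos = 0;
    for (size_t hit; (hit = s.find(from, pos)) != std::string::npos; pos = hit + from.size()) {
        builder.append(s, pos, hit - pos);       // copy the unchanged span
        builder.append(to);                      // then the replacement
    }
    builder.append(s, pos, std::string::npos);   // copy the remaining tail
    return builder;
}
```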
Herman Semenov
93bc3839f9
common : fix the --n-gpu-layers-draft argument not being found (#9175) 2024-08-26 00:54:37 +02:00
Johannes Gäßler
f91fc5639b
CUDA: fix Gemma 2 numerical issues for FA (#9166) 2024-08-25 22:11:48 +02:00
Nexesenex
16aee45179 correction 2024-08-25 14:26:29 +02:00
Nexesenex
dd3df754b2 Fix bad indents and trailing whitespace 2024-08-25 03:30:43 +02:00
Nexesenex
f63860eaac Put back ffn_down tree where it was before. 2024-08-25 03:20:29 +02:00
Nexesenex
8fc46df134 Bump ffn_gate and ffn_down a bit for some GQA<2 models 2024-08-25 03:12:29 +02:00
Nexesenex
53b8eaa316 Remove deprecated rules for token embeddings 2024-08-25 03:12:29 +02:00
Nexesenex
844d11b8f3 Fix bad indent 2024-08-25 03:12:29 +02:00
Nexesenex
5ae59714d2 Revamp Q2_K and Q3_K quants
Q3_K_XL takes the place of Q3_K_L.
Q3_K_L becomes an intermediary between Q3_K_M and Q3_K_XL.
2024-08-25 03:12:29 +02:00
Nexesenex
1bde168c07 Use n_head to discriminate very small models,
whose size is more sensitive to the non-repeating tensors
2024-08-25 03:04:17 +02:00
Nexesenex
16e9c3771a Various corrections to IQ2_S+ and IQ3 quants 2024-08-25 03:04:17 +02:00
Nexesenex
380b53d061 Fix IQ4_XSR 2024-08-25 03:04:17 +02:00
Nexesenex
608108597c Revamp attn_output 2024-08-25 03:04:17 +02:00
Nexesenex
6b5cebfb2b Revamp the output weight a bit
for more granularity in low quants.
2024-08-25 03:04:16 +02:00
Nexesenex
f796954872 Revamp FFN down and attn_k
And complete FFN up
Shrink non-GQA models a bit more
2024-08-25 03:04:16 +02:00
Nexesenex
596a4aec86 Re-add variable attn_k, attn_q, attn_o after merge 2024-08-25 03:00:13 +02:00
Nexesenex
fb2b9ea667 Merge branch 'master' into pr/8836 2024-08-25 02:59:57 +02:00
Nexesenex
3a027b878b Revamp IQ4_XSR, remove IQ3_XXXL 2024-08-25 02:54:45 +02:00
Nexesenex
e05da54eff Overhaul of FFN, with and without GQA 2024-08-25 02:54:45 +02:00
Nexesenex
1607a02bdd Further adjustments to the difquant formulas 2024-08-25 02:54:45 +02:00
Nexesenex
179ad0fad4 Little rework of the difquant formulas 2024-08-25 02:54:45 +02:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support (#8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-24 21:34:59 +02:00
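
Gemma 2 softcaps attention logits with a tanh; the commit notes the softcap can be folded into the kernel's scale. A sketch of the softcapping formula only (the in-kernel folding is not reproduced here):

```cpp
// Hedged sketch of attention-logit softcapping as used by Gemma 2: the raw
// score is squashed into the range (-softcap, softcap) via tanh. Only the
// formula is shown; how the scale is folded into the CUDA kernel is not.
#include <cmath>

float softcap_logit(float score, float softcap /* e.g. 50.0f for attention */) {
    return softcap * std::tanh(score / softcap);
}
```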
João Dinis Ferreira
8f824ffe8e
quantize : fix typo in usage help of quantize.cpp (#9145) 2024-08-24 09:22:45 +03:00
Xuan Son Nguyen
3ba780e2a8
lora : fix llama conversion script with ROPE_FREQS (#9117) 2024-08-23 12:58:53 +02:00
piDack
a07c32ea54
llama : use F32 precision in GLM4 attention and no FA (#9130) 2024-08-23 10:27:17 +03:00
Akarshan Biswas
11b84eb457
[SYCL] Add a space to suppress a CMake warning (#9133) 2024-08-22 22:09:47 +08:00
luoyu-intel
1731d4238f
[SYCL] Add oneDNN primitive support (#9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-22 12:50:10 +08:00
compilade
a1631e53f6
llama : simplify Mamba with advanced batch splits (#8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 17:58:11 -04:00
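
The llama_model_is_recurrent() helper mentioned above lets callers branch on recurrent models such as Mamba. A minimal usage sketch, assuming the public C API of this period; the model path is a placeholder:

```cpp
// Minimal usage sketch of llama_model_is_recurrent(), added by this commit so
// callers can detect recurrent models such as Mamba. Assumes the llama.cpp
// C API as of this point in history; the model path is a placeholder.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    if (llama_model_is_recurrent(model)) {
        // e.g. pooled embeddings: only one sequence per ubatch for now
        printf("recurrent model: restrict to a single sequence per ubatch\n");
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```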
Nexesenex
644aa9fd41 Correction for embeddings tensors too small to quantize
IQ2_XS doesn't seem to work as such; back to IQ2_S
2024-08-21 13:07:32 +02:00
Nexesenex
32f6ead0d9 Improve IQ1 and IQ2 quants
And fix mistakes for the attn_output of IQ2_XL and the ffn_gate and ffn_up of IQ2_XS

Reformat the attn_output mess and split GQA4/GQA2
2024-08-21 12:52:45 +02:00
Nexesenex
d7b9d214fb Shrink IQ3_XXS a bit, bump IQ3_M a bit 2024-08-21 12:49:40 +02:00
Nexesenex
dbadcdd5cf Harmonize formatting of tensor type conditions 2024-08-21 12:30:38 +02:00
Nexesenex
ce86019770 Rename the use_*_bits functions to difquant_*_tensors
to clarify what they do, especially with the 5 additional difquant levels
2024-08-21 12:26:12 +02:00