Commit graph

3686 commits

Author SHA1 Message Date
Nexesenex
f63860eaac Put back ffn_down tree where it was before. 2024-08-25 03:20:29 +02:00
Nexesenex
8fc46df134 Bump ffn_gate and ffn_down a bit for some GQA<2 models 2024-08-25 03:12:29 +02:00
Nexesenex
53b8eaa316 Remove deprecated rules for token embeddings 2024-08-25 03:12:29 +02:00
Nexesenex
844d11b8f3 Fix bad indent 2024-08-25 03:12:29 +02:00
Nexesenex
5ae59714d2 Revamp Q2_K and Q3_K quants
Q3_K_XL takes the place of Q3_K_L.
Q3_K_L becomes an intermediary between Q3_K_M and XL.
2024-08-25 03:12:29 +02:00
Nexesenex
1bde168c07 Use n_head to discriminate very small models,
whose size is more sensitive to the non-repeating tensors
2024-08-25 03:04:17 +02:00
Nexesenex
16e9c3771a Various corrections on IQ2_S+ and IQ3 quants 2024-08-25 03:04:17 +02:00
Nexesenex
380b53d061 Fix IQ4_XSR 2024-08-25 03:04:17 +02:00
Nexesenex
608108597c Revamp attn_output 2024-08-25 03:04:17 +02:00
Nexesenex
6b5cebfb2b Revamp the output weight a bit
for more granularity in low quants.
2024-08-25 03:04:16 +02:00
Nexesenex
f796954872 Revamp FFN down and attn_k,
and complete FFN up.
Shrink non-GQA models a bit more.
2024-08-25 03:04:16 +02:00
Nexesenex
596a4aec86 Re-add variable attn_k, attn_q, attn_o after merge 2024-08-25 03:00:13 +02:00
Nexesenex
fb2b9ea667 Merge branch 'master' into pr/8836 2024-08-25 02:59:57 +02:00
Nexesenex
3a027b878b Revamp IQ4_XSR, remove IQ3_XXXL 2024-08-25 02:54:45 +02:00
Nexesenex
e05da54eff Overhaul of FFN, both for GQA and non-GQA models 2024-08-25 02:54:45 +02:00
Nexesenex
1607a02bdd Further adjustments to the difquant formulas 2024-08-25 02:54:45 +02:00
Nexesenex
179ad0fad4 Little rework of the difquant formulas 2024-08-25 02:54:45 +02:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support (#8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-24 21:34:59 +02:00
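For context, Gemma 2's logit soft-capping (which this commit applies to the scale inside the kernel) squashes attention logits smoothly into (-cap, +cap). A minimal standalone sketch of the formula, not the kernel code itself:

```cpp
#include <cmath>

// Gemma 2 style soft-capping: bound a logit smoothly within (-cap, +cap)
// before the softmax. The commit folds this into the FlashAttention scale.
static float softcap(float x, float cap) {
    return cap * std::tanh(x / cap);
}
```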
João Dinis Ferreira
8f824ffe8e
quantize : fix typo in usage help of quantize.cpp (#9145) 2024-08-24 09:22:45 +03:00
Xuan Son Nguyen
3ba780e2a8
lora : fix llama conversion script with ROPE_FREQS (#9117) 2024-08-23 12:58:53 +02:00
piDack
a07c32ea54
llama : use F32 precision in GLM4 attention and no FA (#9130) 2024-08-23 10:27:17 +03:00
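ggml's usual mechanism for this kind of precision fix is to tag the matmul node; a hedged fragment (ctx, k, and q stand in for the surrounding graph build, and this is illustrative rather than the commit's exact diff):

```cpp
// Request F32 accumulation for the K*Q attention matmul so backends
// don't compute it in reduced precision.
struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
```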
Akarshan Biswas
11b84eb457
[SYCL] Add a space to suppress a cmake warning (#9133) 2024-08-22 22:09:47 +08:00
luoyu-intel
1731d4238f
[SYCL] Add oneDNN primitive support (#9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-22 12:50:10 +08:00
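For readers unfamiliar with oneDNN: the "engine", "stream", and "fp16fp16fp16" notes above map onto its primitive API roughly as in this sketch (shapes and the GPU engine kind are illustrative, not the ggml-sycl integration itself):

```cpp
#include <dnnl.hpp>

int main() {
    using namespace dnnl;
    const memory::dim M = 4, K = 8, N = 16;

    engine eng(engine::kind::gpu, 0);  // the PR keeps a map of engines per device
    stream strm(eng);

    // f16 src, f16 weights, f16 dst -- the "fp16fp16fp16" configuration
    memory::desc a_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f16, memory::format_tag::ab);
    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC,     a_mem},
                              {DNNL_ARG_WEIGHTS, b_mem},
                              {DNNL_ARG_DST,     c_mem}});
    strm.wait();
    return 0;
}
```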
compilade
a1631e53f6
llama : simplify Mamba with advanced batch splits (#8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 17:58:11 -04:00
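The llama_model_is_recurrent helper added by this commit is public API; a minimal usage sketch (the model path is a placeholder):

```cpp
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder
    if (model != NULL) {
        // Recurrent models (e.g. Mamba) are the ones constrained to a single
        // sequence per ubatch for pooled embeddings, per the notes above.
        printf("recurrent: %s\n", llama_model_is_recurrent(model) ? "yes" : "no");
        llama_free_model(model);
    }
    llama_backend_free();
    return 0;
}
```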
Nexesenex
644aa9fd41 Correction for embedding tensors too small to quantize
IQ2_XS doesn't seem to work as such; back to IQ2_S
2024-08-21 13:07:32 +02:00
Nexesenex
32f6ead0d9 Improve IQ1 and IQ2 quants
And fix mistakes for the attn.output of IQ2_XL and the ffn gate and up of IQ2_XS

Reformat the attn_output mess and split GQA4/GQA2
2024-08-21 12:52:45 +02:00
Nexesenex
d7b9d214fb Shrink IQ3_XXS a bit, bump IQ3_M a bit 2024-08-21 12:49:40 +02:00
Nexesenex
dbadcdd5cf Harmonize formatting of tensor type conditions 2024-08-21 12:30:38 +02:00
Nexesenex
ce86019770 Rename the use_*_bits functions to difquant_*_tensors
to clarify what they do, especially with the 5 additional levels of difquant
2024-08-21 12:26:12 +02:00
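For reference, the upstream helper this rename generalizes (as it appears in llama.cpp's quantization path at the time of this branch) bumps roughly half the layers: the first and last eighth, plus every third layer in between:

```cpp
// Upstream llama.cpp predicate renamed/extended into difquant_*_tensors:
// true for the first eighth, the last eighth, and every third layer of
// the middle three quarters -- about half the layers overall.
static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 || (i_layer - n_layers/8)%3 == 2;
}
```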
Nexesenex
cfe866e152 Merge branch 'master' into pr/8836 2024-08-21 12:23:41 +02:00
Xuan Son Nguyen
fc54ef0d1c
server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
2024-08-21 11:04:34 +02:00
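The pattern behind reading arguments from environment variables is a fallback chain in which an explicit flag wins; a hypothetical sketch of the idea (arg_or_env is not the server's actual helper):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: take the CLI value if given, else the environment
// variable, else a built-in default.
static std::string arg_or_env(const char * cli_value, const char * env_name,
                              const char * def) {
    if (cli_value != nullptr) return cli_value;
    if (const char * v = std::getenv(env_name)) return v;
    return def;
}
```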
Younes Belkada
b40eb84895
llama : support for falcon-mamba architecture (#9074)
* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <git@compilade.net>
2024-08-21 11:06:36 +03:00
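The two printf fixes above come down to the usual bool-format pitfall; a minimal illustration (the field name is taken from the commit notes):

```cpp
#include <cstdio>

int main() {
    bool dt_b_c_rms = true;
    // "%d" only works via integer promotion; printing the value as text
    // needs an explicit string.
    printf("dt_b_c_rms = %s\n", dt_b_c_rms ? "true" : "false");
    return 0;
}
```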
fairydreaming
f63f603c87
llava : zero-initialize clip_ctx structure fields with aggregate initialization 908)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-21 09:45:49 +02:00
Daniel Bevenius
8455340b87
llama : std::move llm_bigram_bpe from work_queue (#9062)
* llama : std::move llm_bigram_bpe from work_queue

This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.

The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.

* squash! llama : std::move llm_bigram_bpe from work_queue

Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename MovablePriorityQueue to lama_priority_queue.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename lama_priority_queue -> llama_priority_queue.
2024-08-21 10:32:58 +03:00
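std::priority_queue::top() returns a const reference, so its elements cannot be moved from directly; the llama_priority_queue introduced here subclasses to reach the protected container. A sketch of the idea (the class name below is illustrative):

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// std::priority_queue exposes its container and comparator to subclasses
// as the protected members c and comp, which allows a move-out pop.
template <typename T, typename Container = std::vector<T>,
          typename Compare = std::less<typename Container::value_type>>
class movable_priority_queue : public std::priority_queue<T, Container, Compare> {
public:
    T pop_move() {
        T item = std::move(this->c.front());
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        this->c.pop_back();
        return item; // the llm_bigram_bpe's std::string text is moved, not copied
    }
};
```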
Changyeon Kim
2f3c1466ff
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984)
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* fix-up coding style.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
2024-08-20 21:00:00 +02:00
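For reference, the op the new shader implements: ggml_acc adds b into a view of a described by the byte strides nb1/nb2/nb3 and a byte offset — the "Use nb1 and nb2 for dst" fix above concerns exactly these strides. A hedged fragment (tensor names are placeholders, not the clip.cpp code):

```cpp
// Accumulate b into a at the given view: strides are in bytes and here
// simply reuse the destination's own layout, with no offset.
struct ggml_tensor * acc = ggml_acc(ctx, a, b,
                                    a->nb[1], a->nb[2], a->nb[3],
                                    /*offset =*/ 0);
```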
Meng, Hengyu
50addec9a5
[SYCL] fallback mmvq (#9088)
* fall back mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
2024-08-20 23:50:17 +08:00
zhentaoyu
4f8d19ff17
[SYCL] Fix SYCL im2col and convert Overflow with Large Dims (#9052)
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: refine convert

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-20 23:06:51 +08:00
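The im2col/convert overflows fixed here are the classic 32-bit flattened-index wrap on large tensors; an illustrative reduction of the bug class (not the SYCL kernel itself):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // 70000 * 70000 = 4.9e9 > INT32_MAX, so the 32-bit product overflows
    // (undefined behavior; in practice it wraps).
    int32_t row = 70000, col = 0, n_cols = 70000;
    int32_t bad  = row * n_cols + col;
    int64_t good = (int64_t) row * n_cols + col; // fix: widen before multiplying
    printf("bad=%d good=%lld\n", bad, (long long) good);
    return 0;
}
```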
fairydreaming
90db8146d5
tests : add missing comma in grammar integration tests (#9099)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-20 12:09:55 +03:00
Nexesenex
fddff02915 Rework IQ3_XXS and IQ3_XS
and fix a parenthesis mistake on IQ3_S
2024-08-20 01:16:24 +02:00
Nexesenex
207ffe681f Reordering, corrections, and settling of the lower IQ3 quants 2024-08-20 00:59:54 +02:00
Nexesenex
8c1a3c5ba2 Merge branch 'master' into pr/8836 2024-08-20 00:48:05 +02:00
Nexesenex
a7f91643bb Fix mistake 2024-08-19 20:02:21 +02:00
wangshuai09
cfac111e2b
cann: add doc for cann backend (#8867)
Co-authored-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2024-08-19 16:46:38 +08:00
Radoslav Gerganov
1b6ff90ff8
rpc : print error message when failed to connect endpoint (#9042) 2024-08-19 10:11:45 +03:00
Radoslav Gerganov
18eaf29f4c
rpc : prevent crashes on invalid input (#9040)
Add more checks to prevent the RPC server from crashing if invalid input
is received from a client
2024-08-19 10:10:21 +03:00
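The hardening pattern is the standard one: bounds-check any size read off the wire against the payload actually received before trusting it. A hypothetical sketch (not the actual rpc-server code):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: parse a length-prefixed blob defensively.
static bool parse_blob(const uint8_t * buf, size_t len) {
    uint64_t n;
    if (len < sizeof(n)) return false;      // too short to hold the prefix
    std::memcpy(&n, buf, sizeof(n));
    if (n > len - sizeof(n)) return false;  // claimed size exceeds the payload
    // ... it is now safe to read n bytes starting at buf + sizeof(n)
    return true;
}
```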
Nexesenex
caeb839ae3 Boost embeddings and output weights for MoEs.
They are single and non-repeating, so the boost is reasonable compared to the size of the 4 or more experts.
2024-08-18 22:20:58 +02:00
Nexesenex
503048a197 Correct IQ3_M 2024-08-18 22:14:05 +02:00
Nexesenex
ddb13732c4 Add IQ3_XXL and IQ3_XXXL
We now have a full range of quants between IQ3_M and IQ4_XS.
2024-08-18 22:14:04 +02:00
Nexesenex
a79633b49e Merge branch 'master' into pr/8836 2024-08-18 22:12:39 +02:00
Nexesenex
b02eaf6803 Use the few/some/more/many bits bump logic across the board
Add the 'few bits' logic and rework the 4 settings for a 25/37.5/50/75% quant bump when used.
2024-08-18 22:11:24 +02:00
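A hedged sketch of how four such bump fractions can be expressed, in the spirit of the difquant_*_tensors predicates (the branch's real functions target specific layers, the ends of the stack first, rather than an even spread):

```cpp
// Illustrative only: predicates selecting ~25/37.5/50/75% of layer
// indices for a quant bump, matching the four settings named above.
static bool difquant_few_tensors (int i_layer) { return i_layer % 4 == 0; }      // ~25%
static bool difquant_some_tensors(int i_layer) { return (i_layer * 3) % 8 < 3; } // ~37.5%
static bool difquant_more_tensors(int i_layer) { return i_layer % 2 == 0; }      // ~50%
static bool difquant_many_tensors(int i_layer) { return i_layer % 4 != 3; }      // ~75%
```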