llama.cpp

Author	SHA1	Message	Date
Nexesenex	eeccd31a9c	Merge branch 'master' into pr/8836	2024-08-15 02:30:10 +02:00
0cc4m	5fd89a70ea	Vulkan Optimizations and Fixes (#8959 ) * Optimize Vulkan REPEAT performance * Use Vulkan GLSL fused multiply-add instruction where possible * Add GGML_VULKAN_PERF option to output performance data per operator * Rework and fix Vulkan descriptor set and descriptor pool handling * Fix float32 concat f16 shader validation error * Add Vulkan GROUP_NORM eps parameter * Fix validation error with transfer queue memory barrier flags * Remove trailing whitespaces	2024-08-14 18:32:53 +02:00
compilade	98a532d474	server : fix segfault on long system prompt (#8987 ) * server : fix segfault on long system prompt * server : fix parallel generation with very small batch sizes * server : fix typo in comment	2024-08-14 09:51:02 +03:00
Georgi Gerganov	43bdd3ce18	cmake : remove unused option GGML_CURL (#9011 )	2024-08-14 09:14:49 +03:00
Daniel Bevenius	06943a69f6	ggml : move rope type enum to ggml.h (#8949 ) * ggml : move rope type enum to ggml.h This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml. Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types. * squash! ggml : move rope type enum to ggml.h This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet. * squash! ggml : move rope type enum to ggml.h This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has been updated to reflect this change. * squash! ggml : move rope type enum to ggml.h This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler. * squash! ggml : move rope type enum to ggml.h This commit fixes the editorconfig-checker warnings. * squash! ggml : move rope type enum to ggml.h Update comment for ggml_rope function. * Revert "squash! ggml : move rope type enum to ggml.h" This reverts commit `6261222bd0`. * squash! ggml : move rope type enum to ggml.h Add GGML_ROPE_TYPE_NEOX to rope_common.comp. * remove extra line --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-13 21:13:15 +02:00
Xuan Son Nguyen	828d6ff7d7	export-lora : throw error if lora is quantized (#9002 )	2024-08-13 11:41:14 +02:00
Nexesenex	8c9017bfbe	Simplify IQ4_XSR But leave in place as a "demo" the more complex template set by Ikawrakow to customize the layers quants, with the added attn_q, attn_k, and attn_output tensors.	2024-08-12 22:20:02 +02:00
Nexesenex	8c10533409	Merge branch 'master' into pr/8836	2024-08-12 20:28:38 +02:00
Nexesenex	cd92ba612f	IQ4_XSR (test FTYPE) and attention_wv logic for all attn_*.weights Also, Advise iMatrix for IQ2_M and Q2_K FTypes	2024-08-12 20:27:36 +02:00
Diogo Teles Sant'Anna	fc4ca27b25	ci : fix github workflow vulnerable to script injection (#9008 ) Signed-off-by: Diogo Teles Sant'Anna <diogoteles@google.com>	2024-08-12 19:28:23 +03:00
Radoslav Gerganov	1f67436c5e	ci : enable RPC in all of the released builds (#9006 ) ref: #8912	2024-08-12 19:17:03 +03:00
Nico Bosshard	0fd93cdef5	llama : model-based max number of graph nodes calculation (#8970 ) * llama : model-based max number of graph nodes calculation * Update src/llama.cpp --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-12 17:13:59 +02:00
Frank Mai	84eb2f4fad	docs: introduce gpustack and gguf-parser (#8873 ) * readme: introduce gpustack GPUStack is an open-source GPU cluster manager for running large language models, which uses llama.cpp as the backend. Signed-off-by: thxCode <thxcode0824@gmail.com> * readme: introduce gguf-parser GGUF Parser is a tool to review/check the GGUF file and estimate the memory usage without downloading the whole model. Signed-off-by: thxCode <thxcode0824@gmail.com> --------- Signed-off-by: thxCode <thxcode0824@gmail.com>	2024-08-12 14:45:50 +02:00
DavidKorczynski	1262e7ed13	grammar-parser : fix possible null-deref (#9004 ) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <david@adalogics.com>	2024-08-12 15:36:41 +03:00
Nexesenex	3e2eb6dc57	Merge branch 'master' into pr/8836	2024-08-12 14:25:23 +02:00
DavidKorczynski	df5478fbea	ggml: fix div-by-zero (#9003 ) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724 In order to access the above bug you need to login using one of the emails in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5 Signed-off-by: David Korczynski <david@adalogics.com>	2024-08-12 14:21:41 +02:00
Liu Jia	2589292cde	Fix a spelling mistake (#9001 )	2024-08-12 11:46:03 +02:00
Georgi Gerganov	d3ae0ee8d7	py : fix requirements check '==' -> '~=' (#8982 ) * py : fix requirements check '==' -> '~=' * cont : fix the fix * ci : run on all requirements.txt	2024-08-12 11:02:01 +03:00
Georgi Gerganov	5ef07e25ac	server : handle models with missing EOS token (#8997 ) ggml-ci	2024-08-12 10:21:50 +03:00
Nexesenex	df9e6fda50	Adjustments on output and embeddings	2024-08-11 21:49:23 +02:00
Nexesenex	1ad18f80e9	Adjustments on attn_k	2024-08-11 21:44:29 +02:00
compilade	4134999e01	gguf-py : Numpy dequantization for most types (#8939 ) * gguf-py : Numpy dequantization for most types * gguf-py : Numpy dequantization for grid-based i-quants	2024-08-11 14:45:41 -04:00
Nexes the Old	8c2c03f4a7	Merge b3569 b3569	2024-08-11 16:46:15 +02:00
Nexesenex	91db53b645	IQ1_XL and some corrections notably on attn_q and parenthesis	2024-08-11 16:41:23 +02:00
Georgi Gerganov	8cd1bcfd3f	flake.lock: Update (#8979 )	2024-08-11 06:58:58 -07:00
Neo Zhang	a21c6fd450	update guide (#8909 ) Co-authored-by: Neo Zhang <>	2024-08-11 14:07:43 +05:30
fairydreaming	33309f661a	llama : check all graph nodes when searching for result_embd_pooled (#8956 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-11 10:35:26 +02:00
Markus Tavenrath	7c5bfd57f8	Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943 ) * Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove. - ggml_vk_sync_buffer introduces a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors. * Fix small typo --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-08-11 10:09:09 +02:00
Nexesenex	1268d58ca8	More adjustments	2024-08-11 03:05:52 +02:00
Nexesenex	ef83a87cfe	Revert of ffn gate and up on IQ3_M and indent	2024-08-11 01:30:18 +02:00
Nexesenex	e2e2d77e8e	misplaced file lol	2024-08-11 01:13:12 +02:00
Nexesenex	8ad71f4469	IQ1_XS and small adjustments.	2024-08-11 01:11:24 +02:00
Nexes the Old	14f4f404d5	Merge b3565 Merge b3565	2024-08-10 20:45:26 +02:00
Nexesenex	8bc7a9849e	2 forgotten files	2024-08-10 20:40:27 +02:00
Nexesenex	f0806ac943	IQ2_XL , IQ3_XL , Q2_K_L Plus some adjustments on the FFNs	2024-08-10 20:36:49 +02:00
Nexesenex	49617b1960	Advancing on several tensors - Progressivity for token embeddings and attn_qkv - FFN down for IQ1 and IQ2 quants - FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.	2024-08-10 18:37:29 +02:00
Nexesenex	415d5e40e1	Refactor furthermore attn.v And also lower attn_q for IQ2_XS, in order to separate it more for the quite misnamed IQ2_S	2024-08-10 17:32:29 +02:00
Nexesenex	8c8e43ce20	Settings for MOE >= 8 experts applied to >= 4 experts	2024-08-10 16:38:11 +02:00
Nexesenex	aa4eb594ef	Further refactor attn_k With attn_k set for all quants below 3bpw except Q2_K_S.	2024-08-10 16:33:55 +02:00
slaren	6e02327e8b	metal : fix uninitialized abort_callback (#8968 )	2024-08-10 15:42:10 +02:00
Nexesenex	8f1b99fee8	Shortening formatting	2024-08-10 13:09:11 +02:00
Xuan Son Nguyen	7eb23840ed	llama : default n_swa for phi-3 (#8931 ) * default n_swa for phi-3 * fix * double check swa	2024-08-10 13:04:40 +02:00
Nexesenex	7212098755	IQ1 and IQ2 refactor Attn_q in Q3_K for experts >= 8 Attn_k in Q5_K for experts >= 8 Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS Attn_output in Q4_K for experts >= 8	2024-08-10 12:52:57 +02:00
fairydreaming	7c3f55c100	Add support for encoder-only T5 models (#8900 ) * gguf-py : add T5ENCODER model architecture * common : call llama_decode() during warmup only if the model has decoder * convert-hf : add T5EncoderModel * llama : add llama_model_has_decoder() API function * llama : split build_t5() into build_t5_encoder() and build_t5_decoder() * llama : add support for LLM_ARCH_T5ENCODER * llama-embedding : add support for LLAMA_POOLING_TYPE_NONE * llama-embedding : add support for encoder-only models --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-10 11:43:26 +02:00
Matteo Mortari	911b437f22	gguf-py : fix double call to add_architecture() (#8952 ) Signed-off-by: tarilabs <matteo.mortari@gmail.com>	2024-08-10 08:58:49 +03:00
Nexesenex	1bc4dc5c15	Bump IQ3_M attn.v in Q5_K attn.k in IQ4_XS	2024-08-09 22:49:42 +02:00
Georgi Gerganov	b72942fac9	Merge commit from fork	2024-08-09 23:03:21 +03:00
fairydreaming	6afd1a99dc	llama : add support for lora adapters in T5 model (#8938 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-08-09 18:53:09 +02:00
Georgi Gerganov	272e3bd95e	make : fix llava obj file race (#8946 ) ggml-ci	2024-08-09 18:24:30 +03:00
Georgi Gerganov	45a55b91aa	llama : better replace_all (cont) (#8926 ) * llama : better replace_all (cont) ggml-ci * code : deduplicate replace_all ggml-ci	2024-08-09 18:23:52 +03:00

3615 commits