Commit graph

4208 commits

Author SHA1 Message Date
Gabe Goodhart
b83e9a6cd2 fix: Remove unused LLM_KV_ATTENTION_LAYER_COUNT
I'd added this at one point, but it's not actually needed

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 15:02:38 -07:00
Gabe Goodhart
97e6ba8d99 fix: Remove outdated TODO in conversion script
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 15:02:05 -07:00
Gabe Goodhart
204e78fba1 fix: A number of places where hybrid needs to be handled
Still not fully working, but worth committing these:

* per-layer n_embd_[kv]_s (probably a no-op since first layer is ssm)
* fix setting n_kv_hybrid when not worst_case
* Use the right n_kv for build_inp_s_copy when hybrid
* Use the right n_kv for recurrent section of llama_set_inputs
* Use the right logic to determine batch splitting for hybrid

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:24:02 -07:00
Gabe Goodhart
4543ed5640 feat: Update the logic in llama_decode_internal for kv_hybrid cache
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:58 -07:00
Gabe Goodhart
44bf431ab4 fix: Only allocate kv cache tensors for the appropriate layers in hybrid models
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:54 -07:00
Gabe Goodhart
92653d05fd WIP: Partial work towards separate hybrid cache
This also seems like not _quite_ the right direction

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:51 -07:00
Gabe Goodhart
d3a34e0282 fix: per-layer recurrent embd_[kv]_s
For hybrid models, this value should be 0 for the non-recurrent layers

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:48 -07:00
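The per-layer sizing described in the commit above can be sketched as follows. This is an illustrative Python sketch, not the actual llama.cpp C++ code: the function names mirror the `n_embd_[kv]_s` hparams mentioned in the message, the recurrent-case formulas follow the usual Mamba state shapes (conv state and SSM state), and attention layers contribute 0 to the recurrent cache.

```python
def n_embd_k_s(is_recurrent: bool, d_conv: int, d_inner: int) -> int:
    """Per-layer conv-state size for the recurrent K cache.

    Hybrid models interleave SSM and attention layers; only the
    recurrent (SSM) layers carry conv state, so attention layers get 0.
    """
    return (d_conv - 1) * d_inner if is_recurrent else 0


def n_embd_v_s(is_recurrent: bool, d_state: int, d_inner: int) -> int:
    """Per-layer SSM-state size for the recurrent V cache; 0 for attention layers."""
    return d_state * d_inner if is_recurrent else 0
```

With e.g. d_conv=4 and d_inner=8, a recurrent layer reserves 24 conv-state entries while an attention layer in the same model reserves none.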
Gabe Goodhart
f2478bcab5 fix: Get n_head_kv per-layer in build_bamba
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:43 -07:00
Gabe Goodhart
e7b1abbc0a feat(bamba): Partially complete work on constructing the forward graph
There are still problems at inference around matrix dimensions not lining
up, so there are likely still places where the per-layer sizes are not
being used correctly.

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:38 -07:00
Gabe Goodhart
41fc019057 fix(bamba): Remove ssm_head_count and ssm_chunk_size in llama.cpp
Not necessary despite their presence in the model config.

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:34 -07:00
Gabe Goodhart
dfe8d3ddb8 fix(bamba conv): Remove chunk size and consolidate head count w/ time step rank
Head count and time step rank are used for the same purpose in the model,
so we stick with the existing key. Chunk size is not used in this
implementation because the mixer is implemented without chunking.

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:30 -07:00
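The consolidation described in the commit above can be sketched on the conversion side. This is a hypothetical illustration, not the actual convert-script API: the GGUF key name and config key names are stand-ins for whatever the real script uses.

```python
def map_ssm_hparams(config: dict) -> dict:
    """Map model-config SSM hparams to GGUF metadata (illustrative key names)."""
    mapped = {
        # Head count and time step rank serve the same purpose in this model,
        # so the existing time-step-rank key is reused for the head count
        # instead of introducing a new key.
        "ssm.time_step_rank": config["n_heads"],
    }
    # "chunk_size" is deliberately not mapped: this implementation runs the
    # mixer without chunking, so the value is unused at inference time.
    return mapped
```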
Gabe Goodhart
3ee0ae3b90 feat(bamba): Full tensor parsing for bamba
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:26 -07:00
Gabe Goodhart
fd3bb30118 fix(bamba conv): Fixes in tensor name and hparam conversion for llama.cpp parsing
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:21 -07:00
Gabe Goodhart
e0af809b05 feat(bamba): hparam parsing in llama.cpp
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:17 -07:00
Gabe Goodhart
1c1e0080ed fix(bamba): Jamba->Bamba in llama.cpp
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:12 -07:00
Gabe Goodhart
fd98682ec3 fix(bamba conv): Jamba -> Bamba
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:23:04 -07:00
Gabe Goodhart
e3525e9e50 feat(convert): Full pass at hparam conversion
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:22:58 -07:00
Gabe Goodhart
246dfdba65 feat(jamba): Add jamba architecture to llama.cpp enums
Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:22:51 -07:00
Gabe Goodhart
9a68f7537b feat(jamba): First pass at GGUF conversion for Jamba models
There are likely still some missing hparams, but the tensor mapping should
be correct

Branch: BambaArchitecture

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-12 12:22:29 -07:00
Francis Couture-Harpin
1ee6c482d0 Merge branch 'master' into compilade/mamba2 2024-11-25 12:06:56 -05:00
brucepro
a9a678a6b2
Add download chat feature to server chat (#10481)
* Add download chat feature to server chat

Add a download feature next to the delete chat feature in the server vue chat interface.

* code style

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-25 17:11:55 +01:00
Georgi Gerganov
9ca2e67762
server : add speculative decoding support (#10455)
* server : add speculative decoding support

ggml-ci

* server : add helper function slot.can_speculate()

ggml-ci
2024-11-25 16:31:38 +02:00
Diego Devesa
5931c1f233
ggml : add support for dynamic loading of backends (#10469)
* ggml : add support for dynamic loading of backends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
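As a rough analogy for the dynamic backend loading added in the commit above: ggml loads backends from shared libraries at runtime, while the sketch below uses Python's importlib in place of dlopen/LoadLibrary. The module path and the `backend_init` entry-point name are hypothetical, chosen only to show the load-then-look-up-symbol pattern.

```python
import importlib.util


def load_backend(path: str):
    """Load a backend module from a file path and return its entry point.

    Mirrors the dlopen pattern: open the library, then resolve a
    well-known registration symbol by name.
    """
    spec = importlib.util.spec_from_file_location("ggml_backend", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # A real backend would expose a fixed registration symbol; here we
    # look up a hypothetical "backend_init" and return None if absent.
    return getattr(module, "backend_init", None)
```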
Georgi Gerganov
f6d12e7df8
tests : fix compile warning 2024-11-25 15:17:32 +02:00
Georgi Gerganov
b756441104
metal : minor code formatting 2024-11-25 15:08:04 +02:00
Neo Zhang Jianyu
5a8987793f
[SYCL] Fix building Win package for oneAPI 2025.0 update (#10483)
* fix build package for 2025.0

* debug

* debug

* fix

* rm debug

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-25 17:31:10 +08:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example (#10362)
* speculative : refactor and add a simpler example

ggml-ci

* speculative : clean-up and add comments and TODOs [no ci]

* speculative : manage context in common_speculative

ggml-ci

* speculative : simplify

ggml-ci

* speculative : simplify (cont)

ggml-ci

* speculative : add --draft-min CLI arg

* speculative : minor fixup

* make : build fixes

* speculative : do not redraft previous drafts

ggml-ci

* speculative : fix the draft sampling

ggml-ci

* speculative : fix compile warning

* common : refactor args

ggml-ci

* common : change defaults [no ci]

* common : final touches

ggml-ci
2024-11-25 09:58:41 +02:00
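The speculative-decoding commits above follow the standard draft-and-verify scheme: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is kept. A minimal greedy sketch, with toy models as callables and all names hypothetical (this is the general technique, not the server's implementation):

```python
def speculative_step(target, draft, prefix, n_draft=4):
    """Draft n_draft tokens with the cheap model, keep the longest prefix
    the target model agrees with, then append one target-model token."""
    # Phase 1: let the draft model extend the context greedily.
    drafted = []
    ctx = list(prefix)
    for _ in range(n_draft):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Phase 2: verify against the target model, stopping at the first mismatch.
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        if target(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)

    # The target's own next token comes from the same verify pass, so it is
    # always gained even when every drafted token is rejected.
    accepted.append(target(ctx))
    return accepted
```

When the draft model agrees perfectly, one step yields n_draft + 1 tokens; when it is always wrong, the step still yields one target token, which is why a `--draft-min` style cutoff only skips drafting that is unlikely to pay off.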
Georgi Gerganov
cce5a90075
flake.lock: Update (#10470)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5e4fbfb6b3de1aa2872b76d49fafc942626e2add?narHash=sha256-OZiZ3m8SCMfh3B6bfGC/Bm4x3qc1m2SVEAlkV6iY7Yg%3D' (2024-11-15)
  → 'github:NixOS/nixpkgs/23e89b7da85c3640bbc2173fe04f4bd114342367?narHash=sha256-y/MEyuJ5oBWrWAic/14LaIr/u5E0wRVzyYsouYY3W6w%3D' (2024-11-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-24 08:03:25 -08:00
Diego Devesa
dc39012cba
llama : fix op mul check with command-r-plus (#10476) 2024-11-24 16:10:26 +01:00
Gabe Goodhart
9336db462c
convert : XLMRoberta Type Vocab Size (#10458)
This matches the key in common bert-based embedding models and may have a
value other than 1 in it.

Branch: XLMRobertaTypeVocabSize

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-24 11:02:34 +02:00
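The conversion detail in the commit above amounts to reading the BERT-style key from the model config instead of hard-coding 1. The sketch below assumes a Hugging Face-style config dict; the helper name is illustrative, not the convert script's actual function.

```python
def token_type_count(config: dict) -> int:
    """Return the token-type (segment) vocabulary size for conversion.

    Common BERT-based embedding models carry "type_vocab_size" in their
    config, and it can be greater than 1 (e.g. 2 for models with segment
    embeddings), so it must be read rather than assumed.
    """
    return config.get("type_vocab_size", 1)
```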
momonga
96fa2c5e2d
fix gguf-py: Conversion error when multiple licenses are configured (#9807)
* fix general.license list to str

* fix join license list

---------

Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
2024-11-24 01:09:22 +01:00
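The fix above boils down to: general.license must be written as a string, so when a model config lists multiple licenses the list is joined instead of passed through. A minimal sketch of that shape; gguf-py's exact field handling may differ, and the separator here is an assumption.

```python
def normalize_license(license_field):
    """Coerce a config license field to a string for the general.license key.

    Model configs may specify either a single license string or a list of
    licenses; a list would fail to serialize as a string-valued key.
    """
    if isinstance(license_field, list):
        return ",".join(license_field)
    return license_field
```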
Diego Devesa
55ed008b2d
ggml : do not use ARM features not included in the build (#10457) 2024-11-23 14:41:12 +01:00
蕭澧邦
6dfcfef078
ci: Update oneAPI runtime dll packaging (#10428)
This is the minimum runtime dll dependencies for oneAPI 2025.0
2024-11-22 10:44:08 +01:00
Johannes Gäßler
599b3e0cd4
GitHub: ask for more info in issue templates (#10426)
* GitHub: ask for more info in issues [no ci]

* refactor issue templates to be component-specific

* more understandable issue description

* add dropdown for llama.cpp module
2024-11-22 08:32:40 +01:00
leo-pony
c18610b4ee
CANN: Support Ascend310P to accelerate F32 and F16 Model (#10216)
* CANN Support Ascend310P to accelerate F32 and F16 Model

* Add compile option soc type macro ASCEND_310P to ggml-cann lib

* Remove unused code

* Remove the ascend soc_type hard code compile option in CMakelist.txt
2024-11-22 14:07:20 +08:00
Diego Devesa
a5e47592b6
cuda : optimize argmax (#10441)
* cuda : optimize argmax

* remove unused parameter

ggml-ci

* fixup : use full warps

ggml-ci

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix ub

* ggml : check ne00 <= INT32_MAX in argmax and argsort

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-21 18:18:50 +01:00
Georgi Gerganov
1bb30bf28c
llama : handle KV shift for recurrent models (#10402)
ggml-ci
2024-11-21 10:22:47 +02:00
Georgi Gerganov
87a533be57
sync : ggml 2024-11-21 09:22:11 +02:00
slaren
59b9172822
ggml/sched : do not skip views in pre-assignments 2024-11-21 09:22:05 +02:00
Johannes Gäßler
02e4eaf22f
ggml-opt: fix data corruption (ggml/1022) 2024-11-21 09:22:02 +02:00
Jeff Bolz
9abe9eeae9
vulkan: predicate max operation in soft_max shaders/soft_max (#10437)
Fixes #10434
2024-11-20 20:47:36 +01:00
bandoti
f95caa7954
cmake: add link dependencies to cmake find pkg (#10433)
* cmake pkg: find accelerate, openmp, memkind libs

* cmake pkg: find BLAS libs

* try BLAS_LIBRARIES instead

* Add BLAS link opts

* Add more link deps. and set GGML_ vars
2024-11-20 17:22:19 +01:00
Diego Devesa
fab5d30ff6
llama : add .clang-format file (#10415) 2024-11-20 12:57:53 +01:00
Jeff Bolz
8fd4b7fa29
vulkan: copy iq4_nl LUT into shared memory (#10409) 2024-11-20 08:40:18 +01:00
Jeff Bolz
1bacb9f625
vulkan: further optimize mul_mat_vec using larger loads (#10387)
* vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.

Add some early returns for nonexistent rows in mul_mat_vec shaders. These
can only be hit when dispatching a 2D grid of workgroups. Fix the logic
for the 2D grid of workgroups to round up.

Enable the pipeline robustness extension if it's available, and use it to
disable robustness for these pipelines. The instructions to do the bounds
checking contend for the same ALU resources as the bit twiddling dequant
instructions.

* vulkan: Add GLSL structure aliases for quant types to allow larger loads

In Vulkan it's not possible to cast pointer types, so instead you have to
declare an aliased binding for the memory with a different type. This
commit adds aliases for the quant formats using 16b ints, and in a few
places where the struct size is a multiple of 4 also using 32b ints.
Currently only q4_k's aliases are used, but others will be used in
subsequent commits.

* vulkan: use larger loads in q5_k and q6_k shaders.

Similar to the optimization I did in q4_k recently, this vectorizes some loads
and reduces the number of bit twiddling instructions.

* vulkan: use larger K step per iteration in mul_mat_vec.

Add vec4 dequantization functions, and use them to do K=8 per iteration in
mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B
which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only
because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed
unrolling opportunities.
2024-11-20 08:11:00 +01:00
Neo Zhang Jianyu
ad21c9e1f1
update rel to 4040 (#10395)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-20 13:54:25 +08:00
Anthony Van de Gejuchte
3952a221af
Fix missing file renames in Makefile due to changes in commit ae8de6d50a (#10413) 2024-11-19 23:18:17 +01:00
haopeng
42ae10bbcd
add cmake rvv support (#10411) 2024-11-19 21:10:31 +01:00
Georgi Gerganov
9fe0fb0626 sync : ggml 2024-11-19 20:03:21 +02:00
Plamen Minev
611fabd792 metal : fix offset integer overflows in im2col (ggml/1015)
-- While running StableDiffusion.cpp locally with Metal, some offsets overflow, resulting in incorrect calculations
2024-11-19 20:03:21 +02:00