Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize and score the buffer on the fly, visualise token logits, and switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation
* Fixed a minor gcc lint warning
* Removed trailing whitespace
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.
ref #11435
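For illustration, here is a minimal sketch of the intended ordering, using a hypothetical device descriptor (the names below are illustrative, not the actual llama.cpp/ggml API):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical device descriptor, for illustration only; not the actual ggml API.
struct device_info {
    std::string name;
    bool        is_rpc;
};

// Move RPC devices to the front while keeping the relative order of all
// devices, mirroring the order used when splitting tensors across devices.
static void order_devices(std::vector<device_info> & devices) {
    std::stable_partition(devices.begin(), devices.end(),
                          [](const device_info & d) { return d.is_rpc; });
}
```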
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite a while, so I figured it's fair to add it now.
It installs from crates.io, and it automatically downloads the llama.cpp repo and builds it for the target platform, with the goal of providing the easiest user experience possible.
It also integrates model presets and picks the largest quant that fits the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download the model from Hugging Face.
So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision-making and control-flow tasks.
This makes git an optional dependency, which is useful in the case where
ggml is built not from git but from a tarball or a distribution source
package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem to be much value in factoring it
out, or even requiring it.
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.
The motivation for this is that when using a debug build the server
would crash when an exception was thrown and terminate the server
process, as the exception was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS
is set, cpp-httplib will not call the exception handler, which would
normally return a 500 error to the client. This caused tests to fail
when using a debug build.
Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
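For context, here is a minimal sketch of how an exception handler in cpp-httplib can turn an uncaught handler exception into a 500 response. The custom handler below is only illustrative, it is only invoked when exceptions are enabled (i.e. when CPPHTTPLIB_NO_EXCEPTIONS is not defined), and the `std::exception_ptr` signature assumes a recent cpp-httplib version:

```cpp
#include "httplib.h"

#include <exception>
#include <stdexcept>
#include <string>

int main() {
    httplib::Server svr;

    // A handler that throws and does not catch its own exception.
    svr.Get("/boom", [](const httplib::Request &, httplib::Response &) {
        throw std::runtime_error("something went wrong");
    });

    // With exceptions enabled, cpp-httplib catches the exception and calls this
    // handler, so the client receives a 500 instead of the process terminating.
    svr.set_exception_handler([](const httplib::Request &, httplib::Response & res,
                                 std::exception_ptr ep) {
        std::string msg = "unknown error";
        try {
            if (ep) {
                std::rethrow_exception(ep);
            }
        } catch (const std::exception & e) {
            msg = e.what();
        }
        res.status = 500;
        res.set_content(msg, "text/plain");
    });

    svr.listen("127.0.0.1", 8080);
}
```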
* Fix Shift+Enter handling
Using the `exact` modifier on the Enter handler ensures the message is not sent when Shift+Enter is pressed
* build index.html.gz
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>