Alex-Brooks
17bf6ad304
Update notes for alternative to legacy llm conversion script
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
78f765e8a5
Update comment for vision feature layer init
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
2327897175
Cleanup logs
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
188a068a04
Standardize vision feature layers
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
3a191f8edb
Use 10 for max number of patches
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
d85580c41c
Avoid dropping last image encoder layer in llava models
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
65935431b4
fix num gridpoints and use all layers
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
ab71c9e9c4
Pull vision feature layers out of gguf keys
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
ae291e5405
Fix hardcoded concat for multiple feature layers
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
e1ec851121
Increase max flattened gridpoints to 64
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
987f76840a
Fix linear 2 substitution index
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
7905f9dd40
Fix projector linear substitution
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
61d4ae4699
Make siglip / openclip mutually exclusive
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
50504063b2
Add transformers llava next tensor name mapping
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
cc1c135367
Clean up llava surgery and remove name substitution hacks
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
92046a103d
Add vision feature layer to gguf params
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
bc66d1931b
remove hardcoded path
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
fd0111c043
Add example for converting mmgranite to gguf
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
6ccf234031
Add super wip scripts for multimodal granite gguf
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Georgi Gerganov
d774ab3acc
metal : adjust support conditions for norm operators ( #11671 )
...
cont #11659
ggml-ci
2025-02-05 10:57:42 +02:00
Johannes Gäßler
fa62da9b2d
CUDA: support for mat. mul. with ne03 != ne13 ( #11656 )
2025-02-05 08:58:31 +01:00
SAMI
1ec208083c
llava: add quantization for the visual projector LLAVA, Qwen2VL ( #11644 )
...
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file
* Fixed the gcc warning regarding minor linting
* Removed trailing whitespace
2025-02-05 10:45:40 +03:00
Olivier Chafik
9f4cc8f8d3
sync : minja (#11641)
...
* `sync`: minja
182de30cda
https://github.com/google/minja/pull/46
https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
Johannes Gäßler
fd08255d0d
CUDA: non-contiguous (RMS) norm support ( #11659 )
...
* CUDA: non-contiguous (RMS) norm support
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
fxzjshm
3ec9fd4b77
HIP: force max threads per block to be 1024 ( #11621 )
...
Some old/vendor-forked versions of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.
Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-04 19:18:38 +01:00
Xuan-Son Nguyen
3962fc1a79
server : add try..catch to places not covered by set_exception_handler ( #11620 )
...
* server : add try..catch to places not covered by set_exception_handler
* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
Radoslav Gerganov
1bef571f6a
arg : list RPC devices first when using --list-devices ( #11655 )
...
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.
ref #11435
2025-02-04 18:16:20 +02:00
Olivier Chafik
db288b60cb
tool-call : command r7b fix for normal responses (#11608)
...
* fix command r7b normal response regex + add to server test
* test multiline non-tool-call responses in test-chat
2025-02-04 15:48:53 +00:00
Shelby Jenkins
106045e7bb
readme : add llm_client Rust crate to readme bindings ( #11628 )
...
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite a while, so I figured it's fair to add it now.
It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.
It also integrates model presets and choosing the largest quant given the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from hugging face.
So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
2025-02-04 13:20:55 +02:00
Jhen-Jie Hong
f117d84b48
swift : fix llama-vocab api usage ( #11645 )
...
* swiftui : fix vocab api usage
* batched.swift : fix vocab api usage
2025-02-04 13:15:24 +02:00
Jhen-Jie Hong
534c46b53c
metal : use residency set for other platforms ( #11648 )
2025-02-04 13:07:18 +02:00
Georgi Gerganov
387a1598ca
authors : update
2025-02-04 13:04:10 +02:00
Georgi Gerganov
7c9e0ca520
sync : ggml
2025-02-04 12:59:21 +02:00
Christian Kastner
8f8290ada9
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
...
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem to be much value in factoring it
out, or even requiring it.
2025-02-04 12:59:15 +02:00
Georgi Gerganov
b34aedd558
ci : do not stale-close roadmap issues
2025-02-04 09:31:01 +02:00
Olivier Chafik
cde3833239
tool-call : allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
...
* tool-call: allow `--jinja --chat-template chatml`
* fix double bos issue (drop bos/eos tokens from jinja template)
* add missing try catch around jinja parsing to default to chatml
* Simplify default chatml logic
2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen
b3451785ac
server : (webui) revert hacky solution from #11626 ( #11634 )
2025-02-04 00:10:52 +01:00
Woof Dog
1d1e6a90bc
server : (webui) allow typing and submitting during llm response ( #11626 )
2025-02-03 23:16:27 +01:00
Daniel Bevenius
5598f475be
server : remove CPPHTTPLIB_NO_EXCEPTIONS define ( #11622 )
...
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.
The motivation for this is that when using a debug build the server
would crash when an exception was thrown and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.
Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
Georgi Gerganov
8ec05832fa
sync : ggml
2025-02-03 14:57:08 +02:00
Johannes Gäßler
21c84b5d2d
CUDA: fix Volta FlashAttention logic ( #11615 )
2025-02-03 14:25:56 +02:00
mashdragon
d92cb67e37
server : (webui) Fix Shift+Enter handling ( #11609 )
...
* Fix Shift+Enter handling
`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway
* build index.html.gz
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
Johannes Gäßler
6eecde3cc8
HIP: fix flash_attn_stream_k_fixup warning ( #11604 )
2025-02-02 23:48:29 +01:00
uvos
396856b400
CUDA/HIP: add support for selectable warp size to mmv ( #11519 )
...
CUDA/HIP: add support for selectable warp size to mmv
2025-02-02 22:40:09 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other ( #11601 )
...
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly
2025-02-02 22:08:05 +01:00
Olivier Chafik
90f9b88afb
nit: more informative crash when grammar sampler fails ( #11593 )
2025-02-02 19:58:34 +00:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
Eric Curtin
84ec8a58f7
Name colors ( #11573 )
...
It's more descriptive, and uses #define's so we can use compile-time
concatenation.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-02 15:14:48 +00:00
Olivier Chafik
bfcce4d693
tool-call : support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
...
* `tool-call`: support Command R7B (w/ tool_plan return)
* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override
* `tool-call`: test cleanup / handle lazy grammar triggers
2025-02-02 09:25:38 +00:00
Olivier Chafik
69804487e0
Fix exotic ci env that lacks ostringstream::str ( #11581 )
2025-02-02 09:10:15 +00:00