Alex-Brooks
17bf6ad304
Update notes for alternative to legacy llm conversion script
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
78f765e8a5
Update comment for vision feature layer init
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
2327897175
Cleanup logs
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
188a068a04
Standardize vision feature layers
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
3a191f8edb
Use 10 for max number of patches
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
d85580c41c
Avoid dropping last image encoder layer in llava models
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
65935431b4
fix num gridpoints and use all layers
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
ab71c9e9c4
Pull vision feature layers out of gguf keys
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
ae291e5405
Fix hardcoded concat for multiple feature layers
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
e1ec851121
Increase max flattened gridpoints to 64
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
987f76840a
Fix linear 2 substitution index
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
7905f9dd40
Fix projector linear substitution
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
61d4ae4699
Make siglip / openclip mutually exclusive
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
50504063b2
Add transformers llava next tensor name mapping
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
cc1c135367
Clean up llava surgery and remove name substitution hacks
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
92046a103d
Add vision feature layer to gguf params
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
bc66d1931b
remove hardcoded path
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
fd0111c043
Add example for converting mmgranite to gguf
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Alex-Brooks
6ccf234031
Add super wip scripts for multimodal granite gguf
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-10 07:53:37 -07:00
Georgi Gerganov
d774ab3acc
metal : adjust support conditions for norm operators ( #11671 )
...
cont #11659
ggml-ci
2025-02-05 10:57:42 +02:00
Johannes Gäßler
fa62da9b2d
CUDA: support for mat. mul. with ne03 != ne13 ( #11656 )
2025-02-05 08:58:31 +01:00
SAMI
1ec208083c
llava: add quantization for the visual projector LLAVA, Qwen2VL ( #11644 )
...
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file
* Fixed the gcc warning regarding minor linting
* Removed trailing whitespace
2025-02-05 10:45:40 +03:00
Olivier Chafik
9f4cc8f8d3
sync : minja (#11641)
...
* `sync`: minja
182de30cda
https://github.com/google/minja/pull/46
https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
Johannes Gäßler
fd08255d0d
CUDA: non-contiguous (RMS) norm support ( #11659 )
...
* CUDA: non-contiguous (RMS) norm support
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
fxzjshm
3ec9fd4b77
HIP: force max threads per block to be 1024 ( #11621 )
...
Some old/vendor-forked versions of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.
Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-04 19:18:38 +01:00
Xuan-Son Nguyen
3962fc1a79
server : add try..catch to places not covered by set_exception_handler ( #11620 )
...
* server : add try..catch to places not covered by set_exception_handler
* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
Radoslav Gerganov
1bef571f6a
arg : list RPC devices first when using --list-devices ( #11655 )
...
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.
ref #11435
2025-02-04 18:16:20 +02:00
Olivier Chafik
db288b60cb
tool-call : command r7b fix for normal responses (#11608)
...
* fix command r7b normal response regex + add to server test
* test multiline non-tool-call responses in test-chat
2025-02-04 15:48:53 +00:00
Shelby Jenkins
106045e7bb
readme : add llm_client Rust crate to readme bindings ( #11628 )
...
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite a while, so I figured it's fair to add it now.
It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.
It also integrates model presets and choosing the largest quant given the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from hugging face.
So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
2025-02-04 13:20:55 +02:00
Jhen-Jie Hong
f117d84b48
swift : fix llama-vocab api usage ( #11645 )
...
* swiftui : fix vocab api usage
* batched.swift : fix vocab api usage
2025-02-04 13:15:24 +02:00
Jhen-Jie Hong
534c46b53c
metal : use residency set for other platforms ( #11648 )
2025-02-04 13:07:18 +02:00
Georgi Gerganov
387a1598ca
authors : update
2025-02-04 13:04:10 +02:00
Georgi Gerganov
7c9e0ca520
sync : ggml
2025-02-04 12:59:21 +02:00
Christian Kastner
8f8290ada9
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
...
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem to be much value in factoring it
out, or even requiring it.
2025-02-04 12:59:15 +02:00
Georgi Gerganov
b34aedd558
ci : do not stale-close roadmap issues
2025-02-04 09:31:01 +02:00
Olivier Chafik
cde3833239
tool-call : allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
...
* tool-call: allow `--jinja --chat-template chatml`
* fix double bos issue (drop bos/eos tokens from jinja template)
* add missing try catch around jinja parsing to default to chatml
* Simplify default chatml logic
2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen
b3451785ac
server : (webui) revert hacky solution from #11626 ( #11634 )
2025-02-04 00:10:52 +01:00
Woof Dog
1d1e6a90bc
server : (webui) allow typing and submitting during llm response ( #11626 )
2025-02-03 23:16:27 +01:00
Daniel Bevenius
5598f475be
server : remove CPPHTTPLIB_NO_EXCEPTIONS define ( #11622 )
...
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.
The motivation for this is that when using a debug build the server
would crash when an exception was thrown and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.
Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
Georgi Gerganov
8ec05832fa
sync : ggml
2025-02-03 14:57:08 +02:00
Johannes Gäßler
21c84b5d2d
CUDA: fix Volta FlashAttention logic ( #11615 )
2025-02-03 14:25:56 +02:00
mashdragon
d92cb67e37
server : (webui) Fix Shift+Enter handling ( #11609 )
...
* Fix Shift+Enter handling
`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway
* build index.html.gz
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
Johannes Gäßler
6eecde3cc8
HIP: fix flash_attn_stream_k_fixup warning ( #11604 )
2025-02-02 23:48:29 +01:00
uvos
396856b400
CUDA/HIP: add support for selectable warp size to mmv ( #11519 )
...
CUDA/HIP: add support for selectable warp size to mmv
2025-02-02 22:40:09 +01:00
uvos
4d0598e144
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other ( #11601 )
...
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly
2025-02-02 22:08:05 +01:00
Olivier Chafik
90f9b88afb
nit: more informative crash when grammar sampler fails ( #11593 )
2025-02-02 19:58:34 +00:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
Eric Curtin
84ec8a58f7
Name colors ( #11573 )
...
It's more descriptive, and uses #define's so we can use compile-time
concatenation.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-02 15:14:48 +00:00
Olivier Chafik
bfcce4d693
tool-call : support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
...
* `tool-call`: support Command R7B (w/ tool_plan return)
* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override
* `tool-call`: test cleanup / handle lazy grammar triggers
2025-02-02 09:25:38 +00:00
Olivier Chafik
69804487e0
Fix exotic ci env that lacks ostringstream::str ( #11581 )
2025-02-02 09:10:15 +00:00