Eve
6145fc79e5
q2_k separate out
2025-01-09 21:41:50 -05:00
Eve
973bc4069f
q3_k separate out calculation
2025-01-09 21:25:21 -05:00
Eve
51b5ac507d
make the caches happy
2025-01-09 17:06:54 -05:00
Eve
c9463641af
Merge https://github.com/ggerganov/llama.cpp into vulkan
2025-01-08 21:59:37 -05:00
Eve
923e9a8377
q3_k use hmask simd from cpu avx version
2025-01-08 21:51:27 -05:00
Eve
fe71a8c4a1
q3_k optimizations
2025-01-08 21:51:27 -05:00
Eve
cc28742ca3
q2_k better dequant
2025-01-08 21:51:27 -05:00
Eve
91f1d9ce99
better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations
2025-01-08 21:51:27 -05:00
Eve
6f5d62b098
q5_k
2025-01-08 21:51:27 -05:00
Eve
cdf70cf27f
better q4_k scales
2025-01-08 21:51:27 -05:00
Eve
b4ae7005e6
unpack should be u16, add vim swap to gitignore (about time)
2025-01-08 21:51:27 -05:00
Eve
173077180f
Revert "try precalculating products of a and q2_k scales"
This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.
2025-01-08 21:51:27 -05:00
Eve
bdd98c74e2
try precalculating products of a and q2_k scales
2025-01-08 21:51:27 -05:00
hydai
8d59d91171
fix: add missing msg in static_assert (#11143)
Signed-off-by: hydai <z54981220@gmail.com>
2025-01-08 20:03:28 +00:00
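For context on the fix above: `static_assert` only accepts a missing message from C++17 onward, and even then a message makes the diagnostic far more useful. A minimal illustration (the condition is invented here, not the one from #11143):

```cpp
#include <cstdint>

// Ill-formed before C++17 without the second argument; with it, a failed
// assertion produces a readable error instead of just the raw condition.
static_assert(sizeof(uint32_t) == 4, "uint32_t is expected to be 4 bytes");
```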
Vinesh Janarthanan
8a1d9c25fa
gguf-py : move scripts directory (#11116)
* Moved scripts dir and fixed pyproject.toml
* updated readme
* fixed README urls
* bump pypi gguf to v0.14.0
* retrigger ci
* empty commit - trigger ci
2025-01-08 20:54:58 +02:00
Eric Curtin
1bf839b1e8
Enhance user input handling for llama-run (#11138)
The main motivation for this change is that ctrl-c/ctrl-d were not being
handled correctly. Modify `read_user_input` to handle the EOF,
"/bye" command, and empty input cases. Introduce a `get_user_input`
function to manage the user input loop and handle the different return
cases.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-08 18:47:05 +00:00
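A rough sketch of the control flow this commit describes; the function names follow the message above, but the signatures and return conventions are assumptions, not the actual llama-run code:

```cpp
#include <iostream>
#include <string>

// 1 = terminate (EOF or "/bye"), 2 = re-prompt (empty line), 0 = input ready.
static int read_user_input(std::string & user) {
    std::getline(std::cin, user);
    if (std::cin.eof() || user == "/bye") {
        return 1;
    }
    return user.empty() ? 2 : 0;
}

// Loop until the user enters something actionable or asks to quit.
static int get_user_input(std::string & user_input) {
    for (;;) {
        std::cout << "> " << std::flush;
        const int ret = read_user_input(user_input);
        if (ret != 2) {
            return ret;
        }
    }
}
```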
Xuan Son Nguyen
f7cd13301c
ci : use actions from ggml-org (#11140)
2025-01-08 16:09:20 +01:00
Xuan Son Nguyen
4d2b3d8804
lora : improve compat with mergekit-extract-lora (#11131)
* (wip) support mergekit-extracted lora
* support mergekit-extract-lora
* use lora->get_scale
* correct comment
* correct norm name & condition
* add some hints
2025-01-08 15:59:53 +01:00
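For reference on the `lora->get_scale` note above: the usual LoRA convention applies the low-rank delta B·A to the base weight with a scale derived from alpha and rank. A hedged sketch of that arithmetic (assumed to mirror what the helper computes; the standalone function name here is hypothetical):

```cpp
// Hypothetical free-function version of the LoRA scale computation; the real
// helper lives on the adapter weight struct. s = alpha/rank is the standard
// LoRA scaling, further multiplied by the user-supplied adapter strength.
static float lora_get_scale(float alpha, float rank, float adapter_strength) {
    const float s = alpha > 0.0f ? alpha / rank : 1.0f;
    return s * adapter_strength;
}
```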
Georgi Gerganov
c07d437bbd
llama : avoid hardcoded QK_K (#11061)
ggml-ci
2025-01-08 16:19:36 +02:00
Georgi Gerganov
99a3755a3c
sync : ggml
2025-01-08 13:40:30 +02:00
Radoslav Gerganov
c792dcf488
ggml : allow loading backend with env variable (ggml/1059)
ref: #1058
2025-01-08 13:40:18 +02:00
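A hedged sketch of what loading a backend via an environment variable can look like from the application side: `ggml_backend_load()` is the public dynamic-loading entry point in ggml-backend.h, while the variable name below is an assumption for illustration (see ggml/1059 for the actual mechanism):

```cpp
#include <cstdio>
#include <cstdlib>

#include "ggml-backend.h"

int main() {
    // Assumed variable name, for illustration only.
    if (const char * path = std::getenv("GGML_BACKEND_PATH")) {
        ggml_backend_reg_t reg = ggml_backend_load(path); // dlopen the backend
        if (reg == nullptr) {
            fprintf(stderr, "failed to load backend from %s\n", path);
            return 1;
        }
    }
    return 0;
}
```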
Xuan Son Nguyen
80ccf5d725
ci : pin dependency to specific version (#11137)
* ci : pin dependency to specific version
* will this fix ec?
2025-01-08 12:07:20 +01:00
Georgi Gerganov
a3c1232c3f
arg : option to exclude arguments from specific examples (#11136)
* arg : option to exclude arguments from specific examples
ggml-ci
* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
amritahs-ibm
8cef75c743
llamafile : ppc64le MMA INT8 implementation (#10912)
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.
It results in a 10% to 70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.
The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-01-08 12:54:19 +02:00
Georgi Gerganov
0d52a69e4b
ci : fix cmake option (#11125)
2025-01-08 11:29:34 +02:00
Mathieu Baudier
02f0430141
Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (#11117)
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
* Perform Vulkan extensions checks in a more sensible order
* Remove unnecessary #ifdef directive
2025-01-08 09:18:13 +01:00
ag2s20150909
bec2183f2c
fix: Vulkan shader gen binary path when cross-compiling (#11096)
* fix: Vulkan shader gen binary path when cross-compiling
2025-01-08 09:17:29 +01:00
Johannes Gäßler
53ff6b9b9f
GGUF: C++ refactor, backend support, misc fixes (#11030)
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
2025-01-07 18:01:58 +01:00
Diego Devesa
017cc5f446
ggml-backend : only offload from host buffers (fix) (#11124)
2025-01-07 16:11:57 +01:00
Diego Devesa
a3d50bc022
ggml-backend : only offload from host buffers (#11120)
2025-01-07 12:38:05 +01:00
Radoslav Gerganov
a4dd490069
rpc : code cleanup (#11107)
Remove duplicated macros, use GGML_LOG_ERROR for errors
2025-01-07 08:37:02 +02:00
Akarshan Biswas
c0d6f790d0
SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 (#11087)
* SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6
* Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6"
This reverts commit f62dc45f31.
* Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6
2025-01-07 14:26:07 +08:00
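A minimal before/after of the migration above; the kernel is illustrative, not the actual wkv6 code. `get_multi_ptr` is the SYCL 2020 replacement for the deprecated `accessor::get_pointer`:

```cpp
#include <sycl/sycl.hpp>

// Scales a buffer in place; shows only the accessor-to-pointer change.
void scale(sycl::queue & q, sycl::buffer<float, 1> & buf, float s) {
    q.submit([&](sycl::handler & h) {
        sycl::accessor acc{buf, h, sycl::read_write};
        h.parallel_for(sycl::range<1>(buf.size()), [=](sycl::id<1> i) {
            // deprecated: float * p = acc.get_pointer();
            float * p = acc.get_multi_ptr<sycl::access::decorated::no>().get();
            p[i] *= s;
        });
    });
}
```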
Eric Curtin
dc7cef9f37
llama-run : fix context size (#11094)
Set `n_ctx` equal to `n_batch` in `Opt` class. Now context size is
a more reasonable 2048.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-06 23:45:28 +01:00
Georgi Gerganov
ecebbd292d
llama : remove unused headers (#11109)
ggml-ci
2025-01-06 17:52:35 +02:00
Xuan Son Nguyen
96be8c3264
github : add cmd line field to bug report (#11090)
* github : cmd line to bug report
* codeowners : (@ngxson) only watch dockerfile
* Apply suggestions from code review [no ci]
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* rm cmd in log output [no ci]
* rm 2 [no ci]
* no need backticks [no ci]
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-01-06 16:34:49 +01:00
Georgi Gerganov
e6e7c75d94
server : fix extra BOS in infill endpoint (#11106)
* server : fix extra BOS in infill endpoint
ggml-ci
* server : update infill tests
2025-01-06 15:36:08 +02:00
Xuan Son Nguyen
09186fabbe
llama : remove check flash_attn with lora (#11104)
2025-01-06 13:41:12 +01:00
Asghar Ghorbani
96a1dc27c3
llama : prevent system info string accumulation across calls (#11101)
2025-01-06 13:21:46 +02:00
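The bug class behind this fix is a function-local static accumulator that is appended to on every call, so the string grows each time. Clearing it before rebuilding makes the function idempotent; a sketch of the pattern (illustrative, not the actual `llama_print_system_info` body):

```cpp
#include <string>

const char * system_info() {
    static std::string s;
    s.clear(); // without this, repeated calls keep appending to the static buffer
    s += "AVX = 1 | ";
    s += "FMA = 1 | ";
    return s.c_str();
}
```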
Daniel Bevenius
6369f867a4
llama : rename missed batch params/vars to ubatch (#10059)
This commit renames the `batch` parameter to `ubatch` in the
`llama_kv_cache_find_slot`, `llm_build_inp_embd`, and
`llm_build_mamba` functions.
The motivation is that this should have been done as part of
commit 19d900a756 ("llama : rename batch to ubatch (#9950)"),
but I missed these functions in that commit and only
noticed them now (sorry).
2025-01-06 11:28:17 +02:00
Georgi Gerganov
47182dd03f
llama : update llama_model API names (#11063)
* llama : deprecate llama_free_model, add llama_model_free
ggml-ci
* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`
ggml-ci
2025-01-06 10:55:18 +02:00
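Usage under the renamed API from this commit, with a placeholder model path (`llama_model_default_params` is the existing helper for default parameters):

```cpp
#include "llama.h"

int main() {
    llama_model_params params = llama_model_default_params();
    // was: llama_load_model_from_file
    llama_model * model = llama_model_load_from_file("model.gguf", params);
    if (model == nullptr) {
        return 1;
    }
    llama_model_free(model); // was: llama_free_model (now deprecated)
    return 0;
}
```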
Georgi Gerganov
3e6e7a6bc2
tokenize : escape the prompt (#11058)
* tokenize : escape the prompt
* tokenize : update help
2025-01-06 10:54:25 +02:00
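One plausible reading of "escape the prompt": process escape sequences from the command line before tokenizing, so a literal `\n` in the argument becomes a real newline. A minimal sketch of that idea (not the actual helper used by the example):

```cpp
#include <string>

// Converts literal two-character sequences like "\n" into control characters.
static std::string process_escapes(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); i++) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[++i]) {
                case 'n':  out += '\n'; break;
                case 't':  out += '\t'; break;
                case '\\': out += '\\'; break;
                default:   out += '\\'; out += in[i]; break;
            }
        } else {
            out += in[i];
        }
    }
    return out;
}
```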
Georgi Gerganov
ae2f606bb5
mmap : fix fileno macro clash (#11076)
* mmap : fix fileno macro clash
ggml-ci
* cont
ggml-ci
2025-01-06 10:52:38 +02:00
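The clash this commit fixes is a classic one: some C libraries define `fileno` as a macro, so any member or method with that name is rewritten by the preprocessor before the compiler sees it. An illustration of the failure mode (assumed, not the exact llama.cpp code):

```cpp
#include <cstdio>

struct file_wrapper {
    FILE * fp;
    // int fileno;  // can break where stdio.h defines fileno as a macro
    int fd;         // renaming the member sidesteps the macro entirely
};
```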
Georgi Gerganov
727368c60f
llama : use LLAMA_TOKEN_NULL (#11062)
ggml-ci
2025-01-06 10:52:15 +02:00
Georgi Gerganov
5047dd3546
llama : use _impl suffix instead of _internal (#11060)
ggml-ci
2025-01-06 10:52:01 +02:00
Johannes Gäßler
46e3556e01
CUDA: add BF16 support (#11093)
* CUDA: add BF16 support
2025-01-06 02:33:52 +01:00
Eve
c01ccf8288
little stuff
2025-01-04 21:41:50 -05:00
Eve
d70a731639
q2_k
2025-01-04 21:00:45 -05:00
Eve
07d0d58bef
q3_k
2025-01-04 20:50:59 -05:00
Eve
b0e4ccbeb9
revert it
2025-01-04 20:50:59 -05:00
Eve
21c6b805c9
q4_k test (slow)
2025-01-04 20:50:44 -05:00