pidack
63b6e73500
recommit for CI pass
2024-08-29 11:17:12 +08:00
pidack
99f2ac1a9d
Merge branch 'master' of github.com:ggerganov/llama.cpp into mfalcon_mamba_cuda
2024-08-29 10:36:51 +08:00
pidack
316a049533
add restrict for dst
2024-08-29 10:36:33 +08:00
slaren
9fe94ccac9
docker : build images only once ( #9225 )
2024-08-28 17:28:00 +02:00
slaren
66b039a501
docker : update CUDA images ( #9213 )
2024-08-28 13:20:36 +02:00
pidack
5999d6d06e
fix conflicts
2024-08-28 09:49:17 +08:00
Georgi Gerganov
20f1789dfb
vulkan : fix build ( #0 )
...
ggml-ci
2024-08-27 22:41:27 +03:00
Georgi Gerganov
231cff5f6f
sync : ggml
2024-08-27 22:41:27 +03:00
pidack
0e682ced5e
add restrict
2024-08-27 20:54:39 +08:00
pidack
eec0e8ca81
memory access pattern
2024-08-27 20:51:26 +08:00
Xie Yanbo
3246fe84d7
Fix minicpm example directory ( #9111 )
2024-08-27 14:33:08 +02:00
pidack
e53b14f152
del debug info
2024-08-27 19:33:28 +08:00
pidack
21c16fa5ed
fix trailing whitespace
2024-08-27 19:10:57 +08:00
compilade
78eb487bb0
llama : fix qs.n_attention_wv for DeepSeek-V2 ( #9156 )
2024-08-27 13:09:23 +03:00
pidack
1928967874
resolve test-backend-ops conflicts
2024-08-27 17:31:40 +08:00
pidack
40f47872b3
Merge branch 'master' of github.com:ggerganov/llama.cpp into mfalcon_mamba_cuda
2024-08-27 17:08:23 +08:00
Xuan Son Nguyen
a77feb5d71
server : add some missing env variables ( #9116 )
...
* server : add some missing env variables
* add LLAMA_ARG_HOST to server dockerfile
* also add LLAMA_ARG_CONT_BATCHING
2024-08-27 11:07:01 +02:00
pidack
b423a6df5e
fix ssm_scan numerical error & other updates
2024-08-27 16:51:21 +08:00
CausalLM
2e59d61c1b
llama : fix ChatGLM4 wrong shape ( #9194 )
...
This should fix THUDM/glm-4-9b-chat-1m and CausalLM/miniG
2024-08-27 09:58:22 +03:00
Carsten Kragelund Jørgensen
75e1dbbaab
llama : fix llama3.1 rope_freqs not respecting custom head_dim ( #9141 )
...
* fix: llama3.1 rope_freqs not respecting custom head_dim
* fix: use potential head_dim for Exaone
2024-08-27 09:53:40 +03:00
arch-btw
ad76569f8e
common : Update stb_image.h to latest version ( #9161 )
...
* Update stb_image.h to latest version
Fixes https://github.com/ggerganov/llama.cpp/issues/7431
* Update .ecrc
2024-08-27 08:58:50 +03:00
pidack
8dd323b496
Merge branch 'master' of github.com:ggerganov/llama.cpp into mfalcon_mamba_cuda
2024-08-27 09:44:18 +08:00
slaren
7d787ed96c
ggml : do not crash when quantizing q4_x_x with an imatrix ( #9192 )
2024-08-26 19:44:43 +02:00
Georgi Gerganov
06658ad7c3
metal : separate scale and mask from QKT in FA kernel ( #9189 )
...
* metal : separate scale and mask from QKT in FA kernel
* metal : ne01 check no longer necessary
* metal : keep data in local memory
2024-08-26 18:31:02 +03:00
Georgi Gerganov
fc18425b6a
ggml : add SSM Metal kernels ( #8546 )
...
* ggml : add ggml_ssm_conv metal impl
* ggml : add ssm_scan metal impl
ggml-ci
2024-08-26 17:55:36 +03:00
Georgi Gerganov
879275ac98
tests : fix compile warnings for unreachable code ( #9185 )
...
ggml-ci
2024-08-26 16:30:25 +03:00
pidack
20d390bea4
10x performance improvement for CUDA ssm conv & scan
2024-08-26 17:33:23 +08:00
Georgi Gerganov
7a3df798fc
ci : add VULKAN support to ggml-ci ( #9055 )
2024-08-26 12:19:39 +03:00
Georgi Gerganov
e5edb210cd
server : update deps ( #9183 )
2024-08-26 12:16:57 +03:00
slaren
0c41e03ceb
metal : gemma2 flash attention support ( #9159 )
2024-08-26 11:08:59 +02:00
slaren
f12ceaca0c
ggml-ci : try to improve build time ( #9160 )
2024-08-26 11:03:30 +02:00
Justine Tunney
436787f170
llama : fix time complexity of string replacement ( #9163 )
...
This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. This is because the algorithm used
was quadratic, due to memmove() when s.replace() is called in a loop. It
seems most search results and LLM responses actually provide the O(n**2)
algorithm, which is a great tragedy. Using a builder string fixes things.
2024-08-26 09:09:53 +03:00
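For context, a minimal sketch of the builder-string idea described in the commit above (illustrative only, not the exact code from #9163): instead of calling std::string::replace in a loop, which shifts the tail of the string on every hit, the unmatched pieces and the replacements are each appended once to a new string, giving linear time.

    // Illustrative sketch of linear-time replace-all via a builder string.
    // Not the actual llama.cpp implementation from #9163.
    #include <string>

    static std::string replace_all(const std::string & s,
                                   const std::string & search,
                                   const std::string & replace) {
        if (search.empty()) {
            return s; // nothing to search for; avoids an infinite loop
        }
        std::string builder;
        builder.reserve(s.size());
        size_t pos = 0;
        for (;;) {
            const size_t hit = s.find(search, pos);
            if (hit == std::string::npos) {
                builder.append(s, pos, std::string::npos); // copy the tail once
                break;
            }
            builder.append(s, pos, hit - pos); // copy the unmatched prefix once
            builder.append(replace);
            pos = hit + search.size();
        }
        return builder;
    }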
Herman Semenov
93bc3839f9
common : fix --n-gpu-layers-draft argument not being found ( #9175 )
2024-08-26 00:54:37 +02:00
Johannes Gäßler
f91fc5639b
CUDA: fix Gemma 2 numerical issues for FA ( #9166 )
2024-08-25 22:11:48 +02:00
Jan Ploski
fae826fb56
Fix failed assertions while running Falcon Mamba
2024-08-25 14:57:47 +02:00
Jan Ploski
061e520075
Update CUDA ops and tests to match implementation from commit 8fb57ac0 (llama : use im2col and mul_mat to perform convolution for Mamba); GPU version breaks with an assert because of unsupported MUL_MAT
2024-08-25 00:19:37 +02:00
Jan Ploski
12c913c52c
Fix backend test for ssm_conv CUDA op not working
2024-08-24 23:43:42 +02:00
Jan Ploski
64fbd320ef
Add patch to test cases provided by @compilade; test for ssm_conv fails
2024-08-24 23:43:36 +02:00
Jan Ploski
25f9e65d3a
Update CUDA ops ssm_conv and ssm_scan to match CPU implementation from PR #7531 (as per eb589d5e)
2024-08-24 23:43:30 +02:00
Jan Ploski
cc365b045b
Add GGML_OP_SSM_CONV, GGML_OP_SSM_SCAN to supported ops for CUDA backend + test case for each op
2024-08-24 23:43:24 +02:00
Jan Ploski
f809568fa1
Add initial/naive CUDA kernels for the GGML_OP_SSM_CONV and GGML_OP_SSM_SCAN ops
2024-08-24 23:43:10 +02:00
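To make the shape of these ops concrete, here is a hedged sketch of what a naive CUDA kernel for the ssm_conv op could look like: one thread per output element, computing a depthwise causal 1D convolution over the time dimension. The memory layout and launch configuration below are assumptions for illustration; this is not the kernel added in this commit.

    // Hypothetical naive ssm_conv kernel: y[s][t][i] = sum_k x[s][i][t + k] * w[i][k],
    // where x carries d_conv - 1 previous time steps (the conv state) in front.
    __global__ void ssm_conv_naive(const float * x, const float * w, float * y,
                                   int d_conv, int d_inner, int n_t) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x; // inner channel
        const int t = blockIdx.y;                            // time step
        const int s = blockIdx.z;                            // sequence
        if (i >= d_inner) {
            return;
        }
        // assumed layout: x is [n_seqs][d_inner][d_conv - 1 + n_t], w is [d_inner][d_conv]
        const float * xi = x + ((size_t) s * d_inner + i) * (d_conv - 1 + n_t);
        const float * wi = w + (size_t) i * d_conv;
        float acc = 0.0f;
        for (int k = 0; k < d_conv; ++k) {
            acc += xi[t + k] * wi[k];
        }
        // assumed layout: y is [n_seqs][n_t][d_inner]
        y[((size_t) s * n_t + t) * d_inner + i] = acc;
    }

    // Assumed launch:
    // ssm_conv_naive<<<dim3((d_inner + 255) / 256, n_t, n_seqs), 256>>>(x, w, y, d_conv, d_inner, n_t);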
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support ( #8542 )
...
* CPU/CUDA: Gemma 2 FlashAttention support
* apply logit_softcap to scale in kernel
* disable logit softcapping tests on Metal
* remove metal check
2024-08-24 21:34:59 +02:00
João Dinis Ferreira
8f824ffe8e
quantize : fix typo in usage help of quantize.cpp ( #9145 )
2024-08-24 09:22:45 +03:00
Xuan Son Nguyen
3ba780e2a8
lora : fix llama conversion script with ROPE_FREQS ( #9117 )
2024-08-23 12:58:53 +02:00
piDack
a07c32ea54
llama : use F32 precision in GLM4 attention and no FA ( #9130 )
2024-08-23 10:27:17 +03:00
Akarshan Biswas
11b84eb457
[SYCL] Add a space to suppress a CMake warning ( #9133 )
2024-08-22 22:09:47 +08:00
luoyu-intel
1731d4238f
[SYCL] Add oneDNN primitive support ( #9091 )
...
* add onednn
* add sycl_f16
* add dnnl stream
* add engine map
* use dnnl for intel only
* use fp16fp16fp16
* update doc
2024-08-22 12:50:10 +08:00
compilade
a1631e53f6
llama : simplify Mamba with advanced batch splits ( #8526 )
...
* llama : advanced batch splits
This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.
* llama : always make recurrent state slots contiguous
* ggml : simplify mamba operators
* llama : fix integer signedness mixing
* llama : logits_all has priority over batch->logits
Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.
* llama : apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : fix t5 segfault
* llama : fix Mamba session save and restore
* llama : minor cosmetic changes
* llama : rename llama_reorder_outputs to llama_output_reorder
Also move it closer to llama_output_reserve.
* llama : fix pooled embeddings when using batches with equal_seqs
* minor : add struct members for clarity
ggml-ci
* llama : fix T5 segfault again
* llama : fix Mamba pooled embeddings with multiple sequences
Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.
* llama : add llama_model_is_recurrent to simplify figuring that out
This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
* llama : fix simple splits when the batch contains embeddings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 17:58:11 -04:00
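The llama_model_is_recurrent helper mentioned above is part of the public llama.h API as of this PR. A minimal usage sketch of the idea described in the commit message, capping pooled-embedding ubatches to one sequence for recurrent models; the surrounding batching policy is hypothetical:

    // Illustrative only: llama_model_is_recurrent() comes from this PR; the
    // one-sequence-per-ubatch policy mirrors the pooled-embeddings limitation
    // described above. The default value is made up for the example.
    #include "llama.h"

    static int max_seqs_per_ubatch(const struct llama_model * model, int default_n_seqs) {
        // Recurrent models (e.g. Mamba) keep per-sequence state, so pooled
        // embeddings are computed one sequence per ubatch for now.
        return llama_model_is_recurrent(model) ? 1 : default_n_seqs;
    }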
Xuan Son Nguyen
fc54ef0d1c
server : support reading arguments from environment variables ( #9105 )
...
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
2024-08-21 11:04:34 +02:00
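A hedged sketch of the env-var fallback idea from this commit: when a CLI argument is not given, its value is read from an environment variable such as LLAMA_ARG_HOST. The helper below is illustrative and not the actual argument-parsing code from the server.

    // Hypothetical helper: explicit CLI value wins, then the environment
    // variable (e.g. LLAMA_ARG_HOST), then a built-in default.
    #include <cstdlib>
    #include <string>

    static std::string arg_or_env(const std::string & cli_value,
                                  const char * env_name,
                                  const std::string & def) {
        if (!cli_value.empty()) {
            return cli_value;
        }
        if (const char * v = std::getenv(env_name)) {
            return v; // e.g. LLAMA_ARG_HOST=0.0.0.0 ./llama-server
        }
        return def;
    }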
Younes Belkada
b40eb84895
llama : support for falcon-mamba architecture ( #9074 )
...
* feat: initial support for llama.cpp
* fix: lint
* refactor: better refactor
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* fix: address comments
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net>
* fix: add more cleanup and harmonization
* fix: lint
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <git@compilade.net>
* fix: change name
* Apply suggestions from code review
Co-authored-by: compilade <git@compilade.net>
* add in operator
* fix: add `dt_b_c_rms` in `llm_load_print_meta`
* fix: correct printf format for bool
* fix: correct print format
* Update src/llama.cpp
Co-authored-by: compilade <git@compilade.net>
* llama : quantize more Mamba tensors
* llama : use f16 as the fallback of fallback quant types
---------
Co-authored-by: compilade <git@compilade.net>
2024-08-21 11:06:36 +03:00