Commit graph

3857 commits

Author SHA1 Message Date
Georgi Gerganov
20f1789dfb vulkan : fix build (#0)
ggml-ci
2024-08-27 22:41:27 +03:00
Georgi Gerganov
231cff5f6f sync : ggml 2024-08-27 22:41:27 +03:00
Xie Yanbo
3246fe84d7
Fix minicpm example directory (#9111) 2024-08-27 14:33:08 +02:00
compilade
78eb487bb0
llama : fix qs.n_attention_wv for DeepSeek-V2 (#9156) 2024-08-27 13:09:23 +03:00
Xuan Son Nguyen
a77feb5d71
server : add some missing env variables (#9116)
* server : add some missing env variables

* add LLAMA_ARG_HOST to server dockerfile

* also add LLAMA_ARG_CONT_BATCHING
2024-08-27 11:07:01 +02:00
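
The variables named above (LLAMA_ARG_HOST, LLAMA_ARG_CONT_BATCHING) let server arguments be supplied through the environment. Below is a minimal sketch of such an env-fallback pattern; the helper name and defaults are hypothetical, not the actual llama.cpp argument parser.

```cpp
// Illustrative sketch only: how an environment-variable fallback for a server
// argument might look. Helper name and defaults are placeholders, not the
// actual llama.cpp argument-parsing code.
#include <cstdlib>
#include <string>

// Return the value of `name` from the environment, or `def` if it is unset.
static std::string get_env_or(const char * name, const std::string & def) {
    const char * val = std::getenv(name);
    return val ? std::string(val) : def;
}

int main() {
    // LLAMA_ARG_HOST and LLAMA_ARG_CONT_BATCHING are among the variables the
    // commit mentions; the defaults shown here are placeholders.
    const std::string host   = get_env_or("LLAMA_ARG_HOST", "127.0.0.1");
    const bool cont_batching = get_env_or("LLAMA_ARG_CONT_BATCHING", "1") != "0";
    (void) host; (void) cont_batching;
    return 0;
}
```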
CausalLM
2e59d61c1b
llama : fix ChatGLM4 wrong shape (#9194)
This should fix THUDM/glm-4-9b-chat-1m and CausalLM/miniG
2024-08-27 09:58:22 +03:00
Carsten Kragelund Jørgensen
75e1dbbaab
llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141)
* fix: llama3.1 rope_freqs not respecting custom head_dim

* fix: use potential head_dim for Exaone
2024-08-27 09:53:40 +03:00
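
For context: RoPE inverse frequencies depend on the per-head dimension, so a model with a custom head_dim needs that value rather than n_embd / n_head. A minimal sketch of the standard formula, with illustrative names only (not the conversion-script or llama.cpp code):

```cpp
// Minimal sketch of standard RoPE inverse-frequency computation, assuming the
// usual formula freq = 1 / theta^(i / head_dim) with i stepping by 2.
// Variable names are illustrative.
#include <cmath>
#include <vector>

std::vector<float> rope_inv_freqs(int head_dim, float theta = 10000.0f) {
    std::vector<float> freqs;
    for (int i = 0; i < head_dim; i += 2) {
        // Using n_embd / n_head here instead of the model's configured
        // head_dim is exactly the kind of mismatch the commit addresses.
        freqs.push_back(1.0f / std::pow(theta, (float) i / (float) head_dim));
    }
    return freqs;
}
```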
arch-btw
ad76569f8e
common : Update stb_image.h to latest version (#9161)
* Update stb_image.h to latest version

Fixes https://github.com/ggerganov/llama.cpp/issues/7431

* Update .ecrc
2024-08-27 08:58:50 +03:00
slaren
7d787ed96c
ggml : do not crash when quantizing q4_x_x with an imatrix (#9192) 2024-08-26 19:44:43 +02:00
Georgi Gerganov
06658ad7c3
metal : separate scale and mask from QKT in FA kernel (#9189)
* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory
2024-08-26 18:31:02 +03:00
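
The idea, roughly: compute the raw QK^T products first, then apply the scale and the additive mask as a separate step. A scalar C++ sketch of that ordering (illustrative only, not the Metal kernel):

```cpp
// Rough scalar sketch of the ordering described in the commit: the raw QK^T
// values are computed first, then scale and mask are applied in a separate
// pass. Illustrative C++ pseudocode, not the Metal kernel.
#include <cstddef>
#include <vector>

void scale_and_mask(std::vector<float> & scores,      // raw QK^T values
                    const std::vector<float> & mask,  // additive mask (e.g. -INF for masked positions)
                    float scale) {                    // typically 1/sqrt(head_dim)
    for (size_t i = 0; i < scores.size(); ++i) {
        scores[i] = scores[i] * scale + mask[i];      // applied separately from the matmul
    }
}
```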
Georgi Gerganov
fc18425b6a
ggml : add SSM Metal kernels (#8546)
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
2024-08-26 17:55:36 +03:00
Georgi Gerganov
879275ac98
tests : fix compile warnings for unreachable code (#9185)
ggml-ci
2024-08-26 16:30:25 +03:00
Georgi Gerganov
7a3df798fc
ci : add VULKAN support to ggml-ci (#9055) 2024-08-26 12:19:39 +03:00
Georgi Gerganov
e5edb210cd
server : update deps (#9183) 2024-08-26 12:16:57 +03:00
slaren
0c41e03ceb
metal : gemma2 flash attention support (#9159) 2024-08-26 11:08:59 +02:00
slaren
f12ceaca0c
ggml-ci : try to improve build time (#9160) 2024-08-26 11:03:30 +02:00
Justine Tunney
436787f170
llama : fix time complexity of string replacement (#9163)
This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. This is because the algorithm used
was quadratic, due to memmove() when s.replace() is called in a loop. It
seems most search results and LLM responses actually provide the O(n**2)
algorithm, which is a great tragedy. Using a builder string fixes things
2024-08-26 09:09:53 +03:00
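
The builder-string fix described above appends unchanged spans and replacements to a fresh output string instead of editing the original in place. A minimal sketch of such a linear-time replace (names are illustrative, not the exact function from the commit):

```cpp
// Minimal sketch of the builder-string idea: append unchanged pieces and
// replacements to a new string rather than calling replace() on the original
// in a loop, avoiding the repeated memmove() that makes the naive version
// quadratic in the string length.
#include <string>

std::string replace_all(const std::string & s, const std::string & from, const std::string & to) {
    if (from.empty()) {
        return s;
    }
    std::string builder;
    builder.reserve(s.size());
    size_t pos = 0;
    for (size_t hit; (hit = s.find(from, pos)) != std::string::npos; pos = hit + from.size()) {
        builder.append(s, pos, hit - pos);       // copy the unchanged span
        builder.append(to);                      // then the replacement
    }
    builder.append(s, pos, std::string::npos);   // copy the remaining tail
    return builder;
}
```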
Herman Semenov
93bc3839f9
common : fix the --n-gpu-layers-draft argument not being found (#9175) 2024-08-26 00:54:37 +02:00
Johannes Gäßler
f91fc5639b
CUDA: fix Gemma 2 numerical issues for FA (#9166) 2024-08-25 22:11:48 +02:00
Nexesenex
16aee45179 correction 2024-08-25 14:26:29 +02:00
Nexesenex
dd3df754b2 Fix bad indents and trailing whitespace 2024-08-25 03:30:43 +02:00
Nexesenex
f63860eaac Put back ffn_down tree where it was before. 2024-08-25 03:20:29 +02:00
Nexesenex
8fc46df134 Bump ffn_gate and ffn_down a bit for some GQA<2 models 2024-08-25 03:12:29 +02:00
Nexesenex
53b8eaa316 Remove deprecated rules for token embeddings 2024-08-25 03:12:29 +02:00
Nexesenex
844d11b8f3 Fix bad indent 2024-08-25 03:12:29 +02:00
Nexesenex
5ae59714d2 Revamp Q2_K and Q3_K quants
Q3_K_XL takes the place of Q3_K_L.
Q3_K_L becomes an intermediary between Q3_K_M and Q3_K_XL.
2024-08-25 03:12:29 +02:00
Nexesenex
1bde168c07 Use n_head to discriminate very small models,
whose size is more sensitive to the non-repeating tensors
2024-08-25 03:04:17 +02:00
Nexesenex
16e9c3771a Various corrections to IQ2_S+ and IQ3 quants 2024-08-25 03:04:17 +02:00
Nexesenex
380b53d061 Fix IQ4_XSR 2024-08-25 03:04:17 +02:00
Nexesenex
608108597c Revamp attn_output 2024-08-25 03:04:17 +02:00
Nexesenex
6b5cebfb2b Revamp the output weight a bit
for more granularity in low quants.
2024-08-25 03:04:16 +02:00
Nexesenex
f796954872 Revamp FFN down and attn_k
And complete FFN up
Shrink non-GQA models a bit more
2024-08-25 03:04:16 +02:00
Nexesenex
596a4aec86 Re-add variable attn_k, attn_q, attn_o after merge 2024-08-25 03:00:13 +02:00
Nexesenex
fb2b9ea667 Merge branch 'master' into pr/8836 2024-08-25 02:59:57 +02:00
Nexesenex
3a027b878b Revamp IQ4_XSR, remove IQ3_XXXL 2024-08-25 02:54:45 +02:00
Nexesenex
e05da54eff Overhaul of FFN, with and without GQA 2024-08-25 02:54:45 +02:00
Nexesenex
1607a02bdd Further adjustments to the difquant formulas 2024-08-25 02:54:45 +02:00
Nexesenex
179ad0fad4 Little rework of the difquant formulas 2024-08-25 02:54:45 +02:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support (#8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-24 21:34:59 +02:00
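
Gemma 2 softcaps attention logits with a tanh; the commit notes the softcap can be folded into the kernel's scale. A sketch of the softcapping formula only (the in-kernel folding is not reproduced here):

```cpp
// Hedged sketch of attention-logit softcapping as used by Gemma 2: the raw
// score is squashed into the range (-softcap, softcap) via tanh. Only the
// formula is shown; how the scale is folded into the CUDA kernel is not.
#include <cmath>

float softcap_logit(float score, float softcap /* e.g. 50.0f for attention */) {
    return softcap * std::tanh(score / softcap);
}
```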
João Dinis Ferreira
8f824ffe8e
quantize : fix typo in usage help of quantize.cpp (#9145) 2024-08-24 09:22:45 +03:00
Xuan Son Nguyen
3ba780e2a8
lora : fix llama conversion script with ROPE_FREQS (#9117) 2024-08-23 12:58:53 +02:00
piDack
a07c32ea54
llama : use F32 precision in GLM4 attention and no FA (#9130) 2024-08-23 10:27:17 +03:00
Akarshan Biswas
11b84eb457
[SYCL] Add a space to suppress a CMake warning (#9133) 2024-08-22 22:09:47 +08:00
luoyu-intel
1731d4238f
[SYCL] Add oneDNN primitive support (#9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-22 12:50:10 +08:00
compilade
a1631e53f6
llama : simplify Mamba with advanced batch splits (#8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 17:58:11 -04:00
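
The llama_model_is_recurrent() helper mentioned above lets callers branch on recurrent models such as Mamba. A minimal usage sketch, assuming the public C API of this period; the model path is a placeholder:

```cpp
// Minimal usage sketch of llama_model_is_recurrent(), added by this commit so
// callers can detect recurrent models such as Mamba. Assumes the llama.cpp
// C API as of this point in history; the model path is a placeholder.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    if (llama_model_is_recurrent(model)) {
        // e.g. pooled embeddings: only one sequence per ubatch for now
        printf("recurrent model: restrict to a single sequence per ubatch\n");
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```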
Nexesenex
644aa9fd41 Correction for embeddings tensors too small to quantize
IQ2_XS doesn't seem to work as such; back to IQ2_S
2024-08-21 13:07:32 +02:00
Nexesenex
32f6ead0d9 Improve IQ1 and IQ2 quants
And fix mistakes for the attn_output of IQ2_XL and the ffn_gate and ffn_up of IQ2_XS

Reformat the attn_output mess and split GQA4/GQA2
2024-08-21 12:52:45 +02:00
Nexesenex
d7b9d214fb Shrink IQ3_XXS a bit, bump IQ3_M a bit 2024-08-21 12:49:40 +02:00
Nexesenex
dbadcdd5cf Harmonize formatting of tensor type conditions 2024-08-21 12:30:38 +02:00
Nexesenex
ce86019770 Rename the use_*_bits functions to difquant_*_tensors
to clarify what they do, especially with the 5 additional difquant levels
2024-08-21 12:26:12 +02:00