Commit graph

3686 commits

Author SHA1 Message Date
Nexesenex
f63860eaac Put back ffn_down tree where it was before. 2024-08-25 03:20:29 +02:00
Nexesenex
8fc46df134 Bump ffn_gate and ffn_down a bit for some GQA<2 models 2024-08-25 03:12:29 +02:00
Nexesenex
53b8eaa316 Remove deprecated rules for token embeddings 2024-08-25 03:12:29 +02:00
Nexesenex
844d11b8f3 Fix bad indent 2024-08-25 03:12:29 +02:00
Nexesenex
5ae59714d2 Revamp Q2_K and Q3_K quants
Q3_K_XL takes the place of Q3_K_L.
Q3_K_L becomes an intermediary between Q3_K_M and XL.
2024-08-25 03:12:29 +02:00
Nexesenex
1bde168c07 Use n_head to discriminate very small models,
whose size is more sensitive to the non-repeating tensors
2024-08-25 03:04:17 +02:00
Nexesenex
16e9c3771a Various corrections on IQ2_S+ and IQ3 quants 2024-08-25 03:04:17 +02:00
Nexesenex
380b53d061 Fix IQ4_XSR 2024-08-25 03:04:17 +02:00
Nexesenex
608108597c Revamp attn_output 2024-08-25 03:04:17 +02:00
Nexesenex
6b5cebfb2b Revamp the output weight a bit
for more granularity in low quants.
2024-08-25 03:04:16 +02:00
Nexesenex
f796954872 Revamp FFN down and attn_k,
and complete FFN up.
Shrink non-GQA models a bit more.
2024-08-25 03:04:16 +02:00
Nexesenex
596a4aec86 Re-add variable attn_k, attn_q, attn_o after merge 2024-08-25 03:00:13 +02:00
Nexesenex
fb2b9ea667 Merge branch 'master' into pr/8836 2024-08-25 02:59:57 +02:00
Nexesenex
3a027b878b Revamp IQ4_XSR, remove IQ3_XXXL 2024-08-25 02:54:45 +02:00
Nexesenex
e05da54eff Overhaul of FFN, both for GQA and non-GQA models 2024-08-25 02:54:45 +02:00
Nexesenex
1607a02bdd Further adjustments to the difquant formulas 2024-08-25 02:54:45 +02:00
Nexesenex
179ad0fad4 Little rework of the difquant formulas 2024-08-25 02:54:45 +02:00
Johannes Gäßler
e11bd856d5
CPU/CUDA: Gemma 2 FlashAttention support (#8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-24 21:34:59 +02:00
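For context, Gemma 2's logit soft-capping (which this commit applies to the scale inside the kernel) squashes attention logits smoothly into (-cap, +cap). A minimal standalone sketch of the formula, not the kernel code itself:

```cpp
#include <cmath>

// Gemma 2 style soft-capping: bound a logit smoothly within (-cap, +cap)
// before the softmax. The commit folds this into the FlashAttention scale.
static float softcap(float x, float cap) {
    return cap * std::tanh(x / cap);
}
```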
João Dinis Ferreira
8f824ffe8e
quantize : fix typo in usage help of quantize.cpp (#9145) 2024-08-24 09:22:45 +03:00
Xuan Son Nguyen
3ba780e2a8
lora : fix llama conversion script with ROPE_FREQS (#9117) 2024-08-23 12:58:53 +02:00
piDack
a07c32ea54
llama : use F32 precision in GLM4 attention and no FA (#9130) 2024-08-23 10:27:17 +03:00
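ggml's usual mechanism for this kind of precision fix is to tag the matmul node; a hedged fragment (ctx, k, and q stand in for the surrounding graph build, and this is illustrative rather than the commit's exact diff):

```cpp
// Request F32 accumulation for the K*Q attention matmul so backends
// don't compute it in reduced precision.
struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
```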
Akarshan Biswas
11b84eb457
[SYCL] Add a space to suppress a cmake warning (#9133) 2024-08-22 22:09:47 +08:00
luoyu-intel
1731d4238f
[SYCL] Add oneDNN primitive support (#9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-22 12:50:10 +08:00
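For readers unfamiliar with oneDNN: the "engine", "stream", and "fp16fp16fp16" notes above map onto its primitive API roughly as in this sketch (shapes and the GPU engine kind are illustrative, not the ggml-sycl integration itself):

```cpp
#include <dnnl.hpp>

int main() {
    using namespace dnnl;
    const memory::dim M = 4, K = 8, N = 16;

    engine eng(engine::kind::gpu, 0);  // the PR keeps a map of engines per device
    stream strm(eng);

    // f16 src, f16 weights, f16 dst -- the "fp16fp16fp16" configuration
    memory::desc a_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f16, memory::format_tag::ab);
    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC,     a_mem},
                              {DNNL_ARG_WEIGHTS, b_mem},
                              {DNNL_ARG_DST,     c_mem}});
    strm.wait();
    return 0;
}
```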
compilade
a1631e53f6
llama : simplify Mamba with advanced batch splits (#8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 17:58:11 -04:00
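The llama_model_is_recurrent helper added by this commit is public API; a minimal usage sketch (the model path is a placeholder):

```cpp
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder
    if (model != NULL) {
        // Recurrent models (e.g. Mamba) are the ones constrained to a single
        // sequence per ubatch for pooled embeddings, per the notes above.
        printf("recurrent: %s\n", llama_model_is_recurrent(model) ? "yes" : "no");
        llama_free_model(model);
    }
    llama_backend_free();
    return 0;
}
```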
Nexesenex
644aa9fd41 Correction for embedding tensors too small to quantize
IQ2_XS doesn't seem to work as such; back to IQ2_S
2024-08-21 13:07:32 +02:00
Nexesenex
32f6ead0d9 Improve IQ1 and IQ2 quants
And fix mistakes for the attn.output of IQ2_XL and the ffn gate and up of IQ2_XS

Reformat the attn_output mess and split GQA4/GQA2
2024-08-21 12:52:45 +02:00
Nexesenex
d7b9d214fb Shrink IQ3_XXS a bit, bump IQ3_M a bit 2024-08-21 12:49:40 +02:00
Nexesenex
dbadcdd5cf Harmonize formatting of tensor type conditions 2024-08-21 12:30:38 +02:00
Nexesenex
ce86019770 Rename the use_*_bits functions to difquant_*_tensors
to clarify what they do, especially with the 5 additional levels of difquant
2024-08-21 12:26:12 +02:00
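For reference, the upstream helper this rename generalizes (as it appears in llama.cpp's quantization path at the time of this branch) bumps roughly half the layers: the first and last eighth, plus every third layer in between:

```cpp
// Upstream llama.cpp predicate renamed/extended into difquant_*_tensors:
// true for the first eighth, the last eighth, and every third layer of
// the middle three quarters -- about half the layers overall.
static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 || (i_layer - n_layers/8)%3 == 2;
}
```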
Nexesenex
cfe866e152 Merge branch 'master' into pr/8836 2024-08-21 12:23:41 +02:00
Xuan Son Nguyen
fc54ef0d1c
server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
2024-08-21 11:04:34 +02:00
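The pattern behind reading arguments from environment variables is a fallback chain in which an explicit flag wins; a hypothetical sketch of the idea (arg_or_env is not the server's actual helper):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: take the CLI value if given, else the environment
// variable, else a built-in default.
static std::string arg_or_env(const char * cli_value, const char * env_name,
                              const char * def) {
    if (cli_value != nullptr) return cli_value;
    if (const char * v = std::getenv(env_name)) return v;
    return def;
}
```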
Younes Belkada
b40eb84895
llama : support for falcon-mamba architecture (#9074)
* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <git@compilade.net>
2024-08-21 11:06:36 +03:00
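The two printf fixes above come down to the usual bool-format pitfall; a minimal illustration (the field name is taken from the commit notes):

```cpp
#include <cstdio>

int main() {
    bool dt_b_c_rms = true;
    // "%d" only works via integer promotion; printing the value as text
    // needs an explicit string.
    printf("dt_b_c_rms = %s\n", dt_b_c_rms ? "true" : "false");
    return 0;
}
```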
fairydreaming
f63f603c87
llava : zero-initialize clip_ctx structure fields with aggregate initialization 908)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-21 09:45:49 +02:00
Daniel Bevenius
8455340b87
llama : std::move llm_bigram_bpe from work_queue (#9062)
* llama : std::move llm_bigram_bpe from work_queue

This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.

The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.

* squash! llama : std::move llm_bigram_bpe from work_queue

Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename MovablePriorityQueue to lama_priority_queue.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename lama_priority_queue -> llama_priority_queue.
2024-08-21 10:32:58 +03:00
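std::priority_queue::top() returns a const reference, so its elements cannot be moved from directly; the llama_priority_queue introduced here subclasses to reach the protected container. A sketch of the idea (the class name below is illustrative):

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// std::priority_queue exposes its container and comparator to subclasses
// as the protected members c and comp, which allows a move-out pop.
template <typename T, typename Container = std::vector<T>,
          typename Compare = std::less<typename Container::value_type>>
class movable_priority_queue : public std::priority_queue<T, Container, Compare> {
public:
    T pop_move() {
        T item = std::move(this->c.front());
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        this->c.pop_back();
        return item; // the llm_bigram_bpe's std::string text is moved, not copied
    }
};
```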
Changyeon Kim
2f3c1466ff
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984)
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* fix-up coding style.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
2024-08-20 21:00:00 +02:00
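For reference, the op the new shader implements: ggml_acc adds b into a view of a described by the byte strides nb1/nb2/nb3 and a byte offset — the "Use nb1 and nb2 for dst" fix above concerns exactly these strides. A hedged fragment (tensor names are placeholders, not the clip.cpp code):

```cpp
// Accumulate b into a at the given view: strides are in bytes and here
// simply reuse the destination's own layout, with no offset.
struct ggml_tensor * acc = ggml_acc(ctx, a, b,
                                    a->nb[1], a->nb[2], a->nb[3],
                                    /*offset =*/ 0);
```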
Meng, Hengyu
50addec9a5
[SYCL] fallback mmvq (#9088)
* fall back mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
2024-08-20 23:50:17 +08:00
zhentaoyu
4f8d19ff17
[SYCL] Fix SYCL im2col and convert Overflow with Large Dims (#9052)
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: refine convert

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-20 23:06:51 +08:00
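The im2col/convert overflows fixed here are the classic 32-bit flattened-index wrap on large tensors; an illustrative reduction of the bug class (not the SYCL kernel itself):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // 70000 * 70000 = 4.9e9 > INT32_MAX, so the 32-bit product overflows
    // (undefined behavior; in practice it wraps).
    int32_t row = 70000, col = 0, n_cols = 70000;
    int32_t bad  = row * n_cols + col;
    int64_t good = (int64_t) row * n_cols + col; // fix: widen before multiplying
    printf("bad=%d good=%lld\n", bad, (long long) good);
    return 0;
}
```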
fairydreaming
90db8146d5
tests : add missing comma in grammar integration tests (#9099)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-20 12:09:55 +03:00
Nexesenex
fddff02915 Rework IQ3_XXS and IQ3_XS
and fix a parenthesis mistake on IQ3_S
2024-08-20 01:16:24 +02:00
Nexesenex
207ffe681f Reordering, corrections, and settling of the lower IQ3 quants 2024-08-20 00:59:54 +02:00
Nexesenex
8c1a3c5ba2 Merge branch 'master' into pr/8836 2024-08-20 00:48:05 +02:00
Nexesenex
a7f91643bb Fix mistake 2024-08-19 20:02:21 +02:00
wangshuai09
cfac111e2b
cann: add doc for cann backend (#8867)
Co-authored-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2024-08-19 16:46:38 +08:00
Radoslav Gerganov
1b6ff90ff8
rpc : print error message when failed to connect endpoint (#9042) 2024-08-19 10:11:45 +03:00
Radoslav Gerganov
18eaf29f4c
rpc : prevent crashes on invalid input (#9040)
Add more checks to prevent the RPC server from crashing if invalid input
is received from a client
2024-08-19 10:10:21 +03:00
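The hardening pattern is the standard one: bounds-check any size read off the wire against the payload actually received before trusting it. A hypothetical sketch (not the actual rpc-server code):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: parse a length-prefixed blob defensively.
static bool parse_blob(const uint8_t * buf, size_t len) {
    uint64_t n;
    if (len < sizeof(n)) return false;      // too short to hold the prefix
    std::memcpy(&n, buf, sizeof(n));
    if (n > len - sizeof(n)) return false;  // claimed size exceeds the payload
    // ... it is now safe to read n bytes starting at buf + sizeof(n)
    return true;
}
```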
Nexesenex
caeb839ae3 Boost embeddings and output weights for MoEs.
They are single and non-repeating, so the boost is reasonable compared to the size of the 4 or more experts.
2024-08-18 22:20:58 +02:00
Nexesenex
503048a197 Correct IQ3_M 2024-08-18 22:14:05 +02:00
Nexesenex
ddb13732c4 Add IQ3_XXL and IQ3_XXXL
We now have a full range of quants between IQ3_M and IQ4_XS.
2024-08-18 22:14:04 +02:00
Nexesenex
a79633b49e Merge branch 'master' into pr/8836 2024-08-18 22:12:39 +02:00
Nexesenex
b02eaf6803 Use the few/some/more/many bits bump logic across the board
Add the 'few bits' logic and rework the 4 settings for a 25/37.5/50/75% quant bump when used.
2024-08-18 22:11:24 +02:00
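A hedged sketch of how four such bump fractions can be expressed, in the spirit of the difquant_*_tensors predicates (the branch's real functions target specific layers, the ends of the stack first, rather than an even spread):

```cpp
// Illustrative only: predicates selecting ~25/37.5/50/75% of layer
// indices for a quant bump, matching the four settings named above.
static bool difquant_few_tensors (int i_layer) { return i_layer % 4 == 0; }      // ~25%
static bool difquant_some_tensors(int i_layer) { return (i_layer * 3) % 8 < 3; } // ~37.5%
static bool difquant_more_tensors(int i_layer) { return i_layer % 2 == 0; }      // ~50%
static bool difquant_many_tensors(int i_layer) { return i_layer % 4 != 3; }      // ~75%
```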