Nexesenex
8c10533409
Merge branch 'master' into pr/8836
2024-08-12 20:28:38 +02:00
Nexesenex
cd92ba612f
IQ4_XSR (test FTYPE) and attention_wv logic for all attn_*.weights
...
Also, advise using an iMatrix for the IQ2_M and Q2_K FTypes
2024-08-12 20:27:36 +02:00
Diogo Teles Sant'Anna
fc4ca27b25
ci : fix github workflow vulnerable to script injection ( #9008 )
...
Signed-off-by: Diogo Teles Sant'Anna <diogoteles@google.com>
2024-08-12 19:28:23 +03:00
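The actual fix lives in the Actions YAML, but the underlying bug class is generic: untrusted text spliced into a command string becomes code. A minimal C++ sketch of that class, with a hypothetical `untrusted_title` input (an illustration of the vulnerability pattern, not the workflow change itself):

```cpp
#include <cstdlib>
#include <string>
#include <unistd.h>

// BAD: the input becomes part of the shell command itself, so a
// title like "; rm -rf ~" runs arbitrary commands
void vulnerable(const std::string & untrusted_title) {
    std::string cmd = "echo " + untrusted_title;
    std::system(cmd.c_str());
}

// BETTER: pass the value as an argument (data), never as code. The
// GitHub Actions analogue is routing ${{ ... }} expressions through
// an env: variable instead of inlining them into the run: script.
void safer(const std::string & untrusted_title) {
    // execlp replaces the process image; fork() first in real code
    execlp("echo", "echo", untrusted_title.c_str(), (char *) nullptr);
}
```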
Radoslav Gerganov
1f67436c5e
ci : enable RPC in all of the released builds ( #9006 )
...
ref: #8912
2024-08-12 19:17:03 +03:00
Nico Bosshard
0fd93cdef5
llama : model-based max number of graph nodes calculation ( #8970 )
...
* llama : model-based max number of graph nodes calculation
* Update src/llama.cpp
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-12 17:13:59 +02:00
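The gist of the change: derive the graph-node budget from the model instead of hard-coding it, since bigger models build bigger compute graphs. A hedged sketch of such a calculation; the stand-in struct and the constants are illustrative assumptions, not a quote of the patch:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// stand-in for the real llama_model; only the tensor count matters here
struct model_info {
    size_t n_tensors;
};

// more tensors => a bigger graph; keep a floor so small models still
// get a workable budget (both constants are illustrative)
static int32_t max_graph_nodes(const model_info & model) {
    return std::max<int32_t>(8192, (int32_t) (model.n_tensors*5));
}
```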
Frank Mai
84eb2f4fad
docs: introduce gpustack and gguf-parser ( #8873 )
...
* readme: introduce gpustack
GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.
Signed-off-by: thxCode <thxcode0824@gmail.com>
* readme: introduce gguf-parser
GGUF Parser is a tool to review/check a GGUF file and estimate its
memory usage without downloading the whole model.
Signed-off-by: thxCode <thxcode0824@gmail.com>
---------
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-08-12 14:45:50 +02:00
DavidKorczynski
1262e7ed13
grammar-parser : fix possible null-deref ( #9004 )
...
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680
Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-12 15:36:41 +03:00
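Fuzzer-found null derefs in recursive-descent parsers typically come from using the result of a sub-parse without checking it. A minimal hedged sketch of the defensive pattern; the function names are illustrative, not the actual grammar-parser code:

```cpp
#include <cctype>

// returns the position after one parsed token, or nullptr on failure
static const char * parse_rule(const char * pos) {
    if (pos == nullptr || !isalpha((unsigned char) *pos)) {
        return nullptr;
    }
    while (isalpha((unsigned char) *pos)) {
        pos++;
    }
    return pos;
}

static const char * parse_alternates(const char * pos) {
    pos = parse_rule(pos);
    if (pos == nullptr) {
        return nullptr; // propagate failure instead of dereferencing null
    }
    return *pos == '|' ? parse_alternates(pos + 1) : pos;
}
```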
Nexesenex
3e2eb6dc57
Merge branch 'master' into pr/8836
2024-08-12 14:25:23 +02:00
DavidKorczynski
df5478fbea
ggml: fix div-by-zero ( #9003 )
...
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724
To access the bug above, you need to log in using one of the emails listed in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5
Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-12 14:21:41 +02:00
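Guards like this boil down to validating shapes before computing derived quantities, since a malformed (e.g. fuzzer-supplied) file can yield a zero dimension. A hedged sketch of the general pattern; the `ne` field mirrors ggml's dimension array, the specific check is illustrative:

```cpp
#include <cstdint>
#include <cstdio>

struct tensor_shape {
    int64_t ne[4]; // elements per dimension, as in ggml tensors
};

// compute rows = total elements / row length, refusing to divide by a
// zero dimension coming from a corrupt or hostile file
static bool num_rows(const tensor_shape & t, int64_t n_elements, int64_t * out) {
    if (t.ne[0] == 0) {
        fprintf(stderr, "invalid tensor: ne[0] == 0\n");
        return false;
    }
    *out = n_elements / t.ne[0];
    return true;
}
```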
Liu Jia
2589292cde
Fix a spelling mistake ( #9001 )
2024-08-12 11:46:03 +02:00
Georgi Gerganov
d3ae0ee8d7
py : fix requirements check '==' -> '~=' ( #8982 )
...
* py : fix requirements check '==' -> '~='
* cont : fix the fix
* ci : run on all requirements.txt
2024-08-12 11:02:01 +03:00
Georgi Gerganov
5ef07e25ac
server : handle models with missing EOS token ( #8997 )
...
ggml-ci
2024-08-12 10:21:50 +03:00
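For a vocab that defines no EOS token, anything built on "stop at EOS" has to become optional. A hedged sketch of the guard using the public `llama_token_eos()` accessor; treating -1 as the "not set" sentinel is this sketch's assumption:

```cpp
#include <vector>
#include "llama.h"

// only wire up EOS-based stopping when the model actually defines one
static void add_stop_tokens(const llama_model * model, std::vector<llama_token> & stop_tokens) {
    const llama_token eos = llama_token_eos(model);
    if (eos != -1) { // -1 as "not set" is an assumption of this sketch
        stop_tokens.push_back(eos);
    }
    // otherwise generation ends via n_predict or stop strings
}
```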
Nexesenex
df9e6fda50
Adjustments on output and embeddings
2024-08-11 21:49:23 +02:00
Nexesenex
1ad18f80e9
Adjustments on attn_k
2024-08-11 21:44:29 +02:00
compilade
4134999e01
gguf-py : Numpy dequantization for most types ( #8939 )
...
* gguf-py : Numpy dequantization for most types
* gguf-py : Numpy dequantization for grid-based i-quants
2024-08-11 14:45:41 -04:00
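The gguf-py change reimplements in Numpy what the C code already does per quant type. For reference, a minimal C++ sketch of the simplest case, Q8_0: blocks of 32 int8 values sharing one fp16 scale (the fp16 conversion below flushes subnormals to zero for brevity):

```cpp
#include <cstdint>
#include <cstring>

static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t) (h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1f;
    const uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                             // subnormals -> 0 (simplified)
    } else if (exp == 31) {
        bits = sign | 0x7f800000 | (mant << 13); // inf / nan
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

struct block_q8_0 {
    uint16_t d;      // fp16 scale shared by the block
    int8_t   qs[32]; // quantized values
};

static void dequantize_q8_0(const block_q8_0 * x, float * y, int nblocks) {
    for (int b = 0; b < nblocks; ++b) {
        const float d = fp16_to_fp32(x[b].d);
        for (int i = 0; i < 32; ++i) {
            y[b*32 + i] = d * (float) x[b].qs[i];
        }
    }
}
```

The grid-based i-quants in the second bullet follow the same overall scheme, except the stored values index into a fixed codebook grid rather than being the quantized values themselves.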
Nexes the Old
8c2c03f4a7
Merge b3569
2024-08-11 16:46:15 +02:00
Nexesenex
91db53b645
IQ1_XL and some corrections
...
notably on attn_q and parentheses
2024-08-11 16:41:23 +02:00
Georgi Gerganov
8cd1bcfd3f
flake.lock: Update ( #8979 )
2024-08-11 06:58:58 -07:00
Neo Zhang
a21c6fd450
update guide ( #8909 )
...
Co-authored-by: Neo Zhang <>
2024-08-11 14:07:43 +05:30
fairydreaming
33309f661a
llama : check all graph nodes when searching for result_embd_pooled ( #8956 )
...
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-11 10:35:26 +02:00
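The bug being fixed: the pooled-embeddings tensor is not necessarily the last node of the compute graph, so checking only the tail can miss it. A hedged sketch of the broader search, using the era's public `ggml_cgraph` fields:

```cpp
#include <cstring>
#include "ggml.h"

// scan the whole graph, back to front, instead of only the last node
static struct ggml_tensor * find_result_node(struct ggml_cgraph * gf, const char * name) {
    for (int i = gf->n_nodes - 1; i >= 0; --i) {
        struct ggml_tensor * node = gf->nodes[i];
        if (strcmp(node->name, name) == 0) {
            return node;
        }
    }
    return nullptr;
}
```

Callers would then look up `find_result_node(gf, "result_embd_pooled")` and fall back gracefully when it returns `nullptr`.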
Markus Tavenrath
7c5bfd57f8
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. ( #8943 )
...
* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Judging by the code, which either launches compute kernels or copies tensors, adding barriers only for shader reads/writes and transfers seems to be sufficient.
* Fix small typo
---------
Co-authored-by: 0cc4m <picard12@live.de>
2024-08-11 10:09:09 +02:00
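On the second point: a full pipeline sync orders every stage against every stage, while a barrier scoped to the actual producer/consumer stages lets independent GPU work overlap. A hedged Vulkan sketch of the narrower form, with stage/access masks chosen for a compute-shader write followed by a compute-shader read (not the literal ggml_vk code):

```cpp
#include <vulkan/vulkan.h>

// barrier for "compute wrote this buffer, compute will read it next",
// instead of a full all-stages, all-access pipeline sync
static void barrier_compute_rw(VkCommandBuffer cmd, VkBuffer buf) {
    VkBufferMemoryBarrier barrier = {};
    barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = buf;
    barrier.offset              = 0;
    barrier.size                = VK_WHOLE_SIZE;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // producer stage
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // consumer stage
        0,
        0, nullptr,   // no global memory barriers
        1, &barrier,  // one buffer barrier
        0, nullptr);  // no image barriers
}
```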
Nexesenex
1268d58ca8
More adjustments
2024-08-11 03:05:52 +02:00
Nexesenex
ef83a87cfe
Revert ffn gate and up changes on IQ3_M
...
and fix indentation
2024-08-11 01:30:18 +02:00
Nexesenex
e2e2d77e8e
misplaced file lol
2024-08-11 01:13:12 +02:00
Nexesenex
8ad71f4469
IQ1_XS
...
and small adjustments.
2024-08-11 01:11:24 +02:00
Nexes the Old
14f4f404d5
Merge b3565
2024-08-10 20:45:26 +02:00
Nexesenex
8bc7a9849e
2 forgotten files
2024-08-10 20:40:27 +02:00
Nexesenex
f0806ac943
IQ2_XL, IQ3_XL, Q2_K_L
...
Plus some adjustments on the FFNs
2024-08-10 20:36:49 +02:00
Nexesenex
49617b1960
Advancing on several tensors
...
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
2024-08-10 18:37:29 +02:00
Nexesenex
415d5e40e1
Further refactor attn.v
...
Also lower attn_q for IQ2_XS, in order to separate it further from the quite misnamed IQ2_S
2024-08-10 17:32:29 +02:00
Nexesenex
8c8e43ce20
Apply settings for MoE with >= 8 experts to models with >= 4 experts
2024-08-10 16:38:11 +02:00
Nexesenex
aa4eb594ef
Further refactor attn_k
...
With attn_k set for all quants below 3bpw except Q2_K_S.
2024-08-10 16:33:55 +02:00
slaren
6e02327e8b
metal : fix uninitialized abort_callback ( #8968 )
2024-08-10 15:42:10 +02:00
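The usual fix for an uninitialized-callback bug is a default member initializer, so a context that never sets the callback reads a well-defined `nullptr` instead of stack or heap garbage. A minimal hedged sketch (struct and field names are illustrative; the callback typedef mirrors ggml's):

```cpp
typedef bool (*abort_callback_t)(void * data);

struct metal_context {
    // default-initialize so an unset callback compares false below
    abort_callback_t abort_callback      = nullptr;
    void *           abort_callback_data = nullptr;
};

static bool should_abort(const metal_context & ctx) {
    // short-circuit: only invoke the callback when one was installed
    return ctx.abort_callback && ctx.abort_callback(ctx.abort_callback_data);
}
```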
Nexesenex
8f1b99fee8
Shortening formatting
2024-08-10 13:09:11 +02:00
Xuan Son Nguyen
7eb23840ed
llama : default n_swa for phi-3 ( #8931 )
...
* default n_swa for phi-3
* fix
* double check swa
2024-08-10 13:04:40 +02:00
Nexesenex
7212098755
IQ1 and IQ2 refactor
...
Attn_q in Q3_K for experts >= 8
Attn_k in Q5_K for experts >= 8
Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS
Attn_output in Q4_K for experts >= 8
2024-08-10 12:52:57 +02:00
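Rules like these live in the tensor-type selection logic (`llama_tensor_get_type` in src/llama.cpp), which overrides the FType's default mapping per tensor name. A hedged, self-contained sketch of the four rules listed above; the structure imitates that function, but the code is a paraphrase of this commit message, not the patch itself:

```cpp
#include <string>

enum qtype { Q3_K, Q4_K, Q5_K, Q6_K, IQ3_XXS, KEEP_DEFAULT };

// per-tensor overrides keyed on tensor name and expert count
static qtype override_type(const std::string & name, int n_expert, bool ftype_is_iq2_xxs_or_xs) {
    if (n_expert >= 8) {
        if (name.find("attn_q.weight")      != std::string::npos) return Q3_K;
        if (name.find("attn_k.weight")      != std::string::npos) return Q5_K;
        if (name.find("attn_v.weight")      != std::string::npos) return Q6_K;
        if (name.find("attn_output.weight") != std::string::npos) return Q4_K;
    }
    if (ftype_is_iq2_xxs_or_xs && name.find("attn_v.weight") != std::string::npos) {
        return IQ3_XXS;
    }
    return KEEP_DEFAULT; // fall through to the ftype's default mapping
}
```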
fairydreaming
7c3f55c100
Add support for encoder-only T5 models ( #8900 )
...
* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-10 11:43:26 +02:00
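The pivotal API point is the new `llama_model_has_decoder()` (added here, per the bullet list), which lets callers choose between `llama_encode()` and `llama_decode()`. A hedged sketch of the warmup logic described in the second bullet:

```cpp
#include "llama.h"

// warmup: run the encoder when present, the decoder only if the model has one
static void warmup(llama_context * ctx, const llama_model * model, llama_batch batch) {
    if (llama_model_has_encoder(model)) {
        llama_encode(ctx, batch);
    }
    if (llama_model_has_decoder(model)) {
        llama_decode(ctx, batch);
    }
}
```

An encoder-only T5 model takes only the first branch; a decoder-only LLM takes only the second.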
Matteo Mortari
911b437f22
gguf-py : fix double call to add_architecture() ( #8952 )
...
Signed-off-by: tarilabs <matteo.mortari@gmail.com>
2024-08-10 08:58:49 +03:00
Nexesenex
1bc4dc5c15
Bump IQ3_M
...
attn.v in Q5_K
attn.k in IQ4_XS
2024-08-09 22:49:42 +02:00
Georgi Gerganov
b72942fac9
Merge commit from fork
2024-08-09 23:03:21 +03:00
fairydreaming
6afd1a99dc
llama : add support for lora adapters in T5 model ( #8938 )
...
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-09 18:53:09 +02:00
Georgi Gerganov
272e3bd95e
make : fix llava obj file race ( #8946 )
...
ggml-ci
2024-08-09 18:24:30 +03:00
Georgi Gerganov
45a55b91aa
llama : better replace_all (cont) ( #8926 )
...
* llama : better replace_all (cont)
ggml-ci
* code : deduplicate replace_all
ggml-ci
2024-08-09 18:23:52 +03:00
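The "better" here is the standard single-pass rewrite: build the result in a fresh string rather than repeated erase/insert, which is quadratic and can loop forever when the replacement contains the search string. A hedged sketch of that shape:

```cpp
#include <string>
#include <utility>

// single pass: copy unchanged spans and replacements into a new string
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return; // avoids matching everywhere / looping forever
    }
    std::string result;
    result.reserve(s.size());
    size_t last = 0;
    for (size_t pos; (pos = s.find(search, last)) != std::string::npos; last = pos + search.size()) {
        result.append(s, last, pos - last); // span before the match
        result += replace;
    }
    result.append(s, last); // tail after the final match
    s = std::move(result);
}
```

Resuming the search at `last` (rather than restarting from 0) is what prevents re-matching inside freshly inserted replacement text.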
tc-mb
3071c0a5f2
llava : support MiniCPM-V-2.5 ( #7599 )
...
* init
* rename
* add run android for termux in readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* update cmakelists
* update cmakelists
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when the image needs to be resized smaller
* address review comments
* address review comments
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove the directory at line 33 in the /cmakelists.txt (not the one in example, the one in the main dir)
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* clip : style changes
* del common.h in clip
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix makefile error
* fix ubuntu-make error
* try fix clip
* try fix 1
---------
Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-09 13:33:53 +03:00
Georgi Gerganov
4305b57c80
sync : ggml
2024-08-09 10:03:48 +03:00
Matt Stephenson
70c0ea3560
whisper : use vulkan as gpu backend when available (whisper/2302)
...
* ggml: use vulkan as gpu backend when available
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
* whisper: enable using vk as default buffer type
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
---------
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
2024-08-09 10:03:44 +03:00
Daniel Bevenius
5b2c04f492
embedding : add --pooling option to README.md [no ci] ( #8934 )
...
This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.
The motivation for adding this option is that currently, if the model
used does not specify a pooling type, the embedding example will fail
with the following error message:
```console
main: error: pooling type NONE not supported
```
This commit also updates the name of the executable in the examples
section.
2024-08-09 09:33:30 +03:00
Daniel Bevenius
6f6496bb09
llama : fix typo in llama_tensor_get_type comment [no ci] ( #8937 )
2024-08-09 09:32:23 +03:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings ( #8936 )
2024-08-09 09:32:02 +03:00
compilade
345a686d82
llama : reduce useless copies when saving session ( #8916 )
...
* llama : avoid useless copies in dummy session writer
* llama : avoid double tensor copy when saving session to buffer
2024-08-08 23:54:00 -04:00
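The "dummy session writer" in the first bullet is the sizing pass: computing how large the session blob will be should not copy any tensor data. A hedged sketch of the pattern, with one write interface and two implementations (names illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct data_write {
    virtual void write(const void * src, size_t size) = 0;
    virtual ~data_write() = default;
};

// sizing pass: never reads src, just accumulates the byte count
struct data_write_dummy : data_write {
    size_t size_written = 0;
    void write(const void * /*src*/, size_t size) override {
        size_written += size;
    }
};

// real pass: copies into a caller-provided buffer
struct data_write_buffer : data_write {
    uint8_t * ptr;
    explicit data_write_buffer(uint8_t * dst) : ptr(dst) {}
    void write(const void * src, size_t size) override {
        memcpy(ptr, src, size);
        ptr += size;
    }
};
```

Both passes can then share one serialization routine; only the buffer-backed writer ever moves bytes.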