Nexesenex
91db53b645
IQ1_XL and some corrections
notably on attn_q and parentheses
2024-08-11 16:41:23 +02:00
Nexesenex
1268d58ca8
More adjustments
2024-08-11 03:05:52 +02:00
Nexesenex
ef83a87cfe
Revert of ffn gate and up on IQ3_M
and indent
2024-08-11 01:30:18 +02:00
Nexesenex
e2e2d77e8e
misplaced file lol
2024-08-11 01:13:12 +02:00
Nexesenex
8ad71f4469
IQ1_XS
and small adjustments.
2024-08-11 01:11:24 +02:00
Nexes the Old
14f4f404d5
Merge b3565
2024-08-10 20:45:26 +02:00
Nexesenex
8bc7a9849e
2 forgotten files
2024-08-10 20:40:27 +02:00
Nexesenex
f0806ac943
IQ2_XL, IQ3_XL, Q2_K_L
Plus some adjustments to the FFNs
2024-08-10 20:36:49 +02:00
Nexesenex
49617b1960
Progress on several tensors
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
2024-08-10 18:37:29 +02:00
Nexesenex
415d5e40e1
Further refactor attn.v
Also lower attn_q for IQ2_XS, in order to separate it further from the rather misnamed IQ2_S
2024-08-10 17:32:29 +02:00
Nexesenex
8c8e43ce20
Settings for MOE >= 8 experts applied to >= 4 experts
2024-08-10 16:38:11 +02:00
Nexesenex
aa4eb594ef
Further refactor attn_k
With attn_k set for all quants below 3bpw except Q2_K_S.
2024-08-10 16:33:55 +02:00
slaren
6e02327e8b
metal : fix uninitialized abort_callback (#8968)
2024-08-10 15:42:10 +02:00
Nexesenex
8f1b99fee8
Shortening formatting
2024-08-10 13:09:11 +02:00
Xuan Son Nguyen
7eb23840ed
llama : default n_swa for phi-3 (#8931)
* default n_swa for phi-3
* fix
* double check swa
2024-08-10 13:04:40 +02:00
Nexesenex
7212098755
IQ1 and IQ2 refactor
Attn_q in Q3_K for experts >= 8
Attn_k in Q5_K for experts >= 8
Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS
Attn_output in Q4_K for experts >= 8
2024-08-10 12:52:57 +02:00
fairydreaming
7c3f55c100
Add support for encoder-only T5 models (#8900)
* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-10 11:43:26 +02:00
Matteo Mortari
911b437f22
gguf-py : fix double call to add_architecture() (#8952)
Signed-off-by: tarilabs <matteo.mortari@gmail.com>
2024-08-10 08:58:49 +03:00
Nexesenex
1bc4dc5c15
Bump IQ3_M
attn.v in Q5_K
attn.k in IQ4_XS
2024-08-09 22:49:42 +02:00
Georgi Gerganov
b72942fac9
Merge commit from fork
2024-08-09 23:03:21 +03:00
fairydreaming
6afd1a99dc
llama : add support for lora adapters in T5 model (#8938)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-09 18:53:09 +02:00
Georgi Gerganov
272e3bd95e
make : fix llava obj file race (#8946)
ggml-ci
2024-08-09 18:24:30 +03:00
Georgi Gerganov
45a55b91aa
llama : better replace_all (cont) (#8926)
* llama : better replace_all (cont)
ggml-ci
* code : deduplicate replace_all
ggml-ci
2024-08-09 18:23:52 +03:00
tc-mb
3071c0a5f2
llava : support MiniCPM-V-2.5 (#7599)
* init
* rename
* add instructions for running on Android via Termux to readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* update cmakelist
* update cmakelist
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when the image needs to be resized smaller
* receive review comments and modify
* receive review comments and modify
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove the directory on line 33 of /cmakelists.txt (the one in the main dir, not in examples)
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* clip : style changes
* del common.h in clip
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix makefile error
* fix ubuntu-make error
* try fix clip
* try fix 1
---------
Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-09 13:33:53 +03:00
Georgi Gerganov
4305b57c80
sync : ggml
2024-08-09 10:03:48 +03:00
Matt Stephenson
70c0ea3560
whisper : use vulkan as gpu backend when available (whisper/2302)
* ggml: use vulkan as gpu backend when available
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
* whisper: enable using vk as default buffer type
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
---------
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
2024-08-09 10:03:44 +03:00
Daniel Bevenius
5b2c04f492
embedding : add --pooling option to README.md [no ci] (#8934)
This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.
The motivation for adding this option is that, currently, if the model
used does not specify a pooling type, the embedding example will fail
with the following error message:
```console
main: error: pooling type NONE not supported
```
This commit also updates the name of the executable in the examples
section.
2024-08-09 09:33:30 +03:00
Daniel Bevenius
6f6496bb09
llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)
2024-08-09 09:32:23 +03:00
Mathieu Geli
daef3ab233
server : add one level list nesting for embeddings (#8936)
2024-08-09 09:32:02 +03:00
compilade
345a686d82
llama : reduce useless copies when saving session (#8916)
* llama : avoid useless copies in dummy session writer
* llama : avoid double tensor copy when saving session to buffer
2024-08-08 23:54:00 -04:00
compilade
3a14e00366
gguf-py : simplify support for quant types (#8838)
* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap
Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.
* gguf-py : add generic quantize and dequantize functions
The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.
2024-08-08 13:33:09 -04:00
Nexes the Old
1118c046df
correct mistake in the condition for attn.k
2024-08-08 18:56:20 +02:00
Nexes the Old
8006b15fd1
Avoid shrinking attn.k.weight for IQ3_XS and IQ3_XXS when GQA or MoE
2024-08-08 18:50:48 +02:00
Georgi Gerganov
afd27f01fe
scripts : sync cann files (#0)
2024-08-08 14:56:52 +03:00
Georgi Gerganov
366d486c16
scripts : fix sync filenames (#0)
2024-08-08 14:40:12 +03:00
Georgi Gerganov
e44a561ab0
sync : ggml
2024-08-08 13:19:47 +03:00
Borislav Stanimirov
f93d49ab1e
ggml : ignore more msvc warnings (ggml/906)
2024-08-08 13:19:31 +03:00
Georgi Gerganov
5b33ea1ee7
metal : fix struct name (ggml/912)
ggml-ci
2024-08-08 13:19:31 +03:00
Conrad Kramer
85fca8deb6
metal : add abort callback (ggml/905)
2024-08-08 13:19:30 +03:00
Pablo Duboue
ebd541a570
make : clean llamafile objects (#8923)
`ggml/src/llamafile/sgemm.o` was not deleted on `make clean`
2024-08-08 11:44:51 +03:00
slaren
15fa07a5c5
make : use C compiler to build metal embed object (#8899)
* make : use C compiler to build metal embed object
* use rm + rmdir to avoid -r flag in rm
2024-08-07 18:24:05 +02:00
slaren
be55695eff
ggml-backend : fix async copy from CPU (#8897)
* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same
2024-08-07 13:29:02 +02:00
Ouadie EL FAROUKI
0478174d59
[SYCL] Updated SYCL device filtering (#8901)
* Updated device filter to depend on default_selector (fixes non-Intel device issues)
* Small related update to example/sycl Readme
2024-08-07 11:25:36 +01:00
Johannes Gäßler
a8dbc6f753
CUDA/HIP: fix tests/test-backend-ops (#8896)
2024-08-07 09:07:52 +02:00
Zhenwei Jin
506122d854
llama-bench : add support for getting cpu info on Windows (#8824)
* Add support for getting cpu info on Windows for llama_bench
* refactor
---------
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-07 03:01:06 +02:00
Daniel Bevenius
725e3d9437
quantize : update usage comment in quantize.cpp (#8889)
This commit updates the usage comment in quantize.cpp to reflect the
new name of the executable, which is llama-quantize.
2024-08-07 01:43:00 +02:00
Nexes the Old
31958546c3
typo correction (#8891)
2024-08-07 01:41:54 +02:00
Xuan Son Nguyen
1e6f6554aa
server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint
* handle lora_no_apply
* fix build
* update docs
* clean up struct def
* fix build
* add LoRA test
* fix style
2024-08-06 17:33:39 +02:00
Johannes Gäßler
641f5dd2a6
CUDA: fix padding logic for FP16/FP32 (#8884)
2024-08-06 17:13:55 +02:00
Daniel Bevenius
5f4dcb1e60
simple : update name of executable to llama-simple (#8885)
This commit updates the name of the executable in README.md from
`simple` to `llama-simple`.
2024-08-06 16:44:35 +02:00