Iwan Kawrakow
e6e61e3158
iq3_s: partial fix for QK_K = 64
2024-02-23 16:04:28 +02:00
Iwan Kawrakow
1d47de3258
ROCm again
2024-02-23 14:03:52 +02:00
Iwan Kawrakow
0d6d185e0f
Attempt to fix ROCm
2024-02-23 13:52:33 +02:00
Iwan Kawrakow
303f3f3258
Another attempt to fix the Windows builds
2024-02-23 10:38:02 +02:00
Iwan Kawrakow
436a146f98
Attempt to fix failing tests
2024-02-23 10:16:15 +02:00
Iwan Kawrakow
cd6a0f08be
Move Q3_K_XS mix to 3.25 bpw
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
47cf30b0ee
iq3_s: make tests pass
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
2730225c5f
iq3_xs: rename to iq3_s
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
272c7f7739
Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
b25f99607d
Fix stupid warning
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
4d5feebeb6
iq3_xs: tiny Metal speed improvement
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
87038fe198
iq3_xs: tiny Metal speed improvement
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
1777825550
iq3_xs: make new version work on metal
...
Performance is very similar to Q3_K_S
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
1328331db7
iq3_s: make ARM_NEON work with new version
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
1fef4b8b68
iq3_xs: make scalar and AVX2 work for new version
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
eacff4aa81
iq3_xs: make CUDA work for new version
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
d83fddaa3b
iq3_xs: a 3.4375 bpw variant
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
2ec600b7a4
Adding IQ3_M - IQ3_XS mix with mostly Q4_K
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
38aa7b176f
iq3_xs: working Metal implementation
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
76214ab655
iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
f1255c50c0
iq3_xs: working scalar and AVX2 dot products
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
5be4e7ac4a
Minor improvement via 3 neighbours
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
76aff093b4
Minor PPL improvement via a block scale fudge factor
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
5691fecd06
Resurrecting iq3_xs
...
After all the experimentation, nothing was better than this.
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
10a47fa678
iq4_nl: squash commits for easier rebase
...
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
2024-02-23 08:00:52 +02:00
Jared Van Bortel
15499eb942
mpt : do not duplicate token_embd.weight on disk ( #5670 )
2024-02-22 17:05:23 -05:00
Georgi Gerganov
96633eeca1
gemma : use more bits for the token_embd.weight tensor ( #5650 )
...
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
2024-02-22 23:23:46 +02:00
Georgi Gerganov
847eedbdb2
py : add Gemma conversion from HF models ( #5647 )
...
* py : add gemma conversion from HF models
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update convert-hf-to-gguf.py
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
---------
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-22 23:22:48 +02:00
Georgi Gerganov
7e4f339c40
ggml : always define ggml_fp16_t as uint16_t ( #5666 )
...
* ggml : always define ggml_fp16_t as uint16_t
ggml-ci
* ggml : cont
ggml-ci
* ggml : cont
* ggml : cont
ggml-ci
* ggml : cont
ggml-ci
* cuda : no longer ggml headers last
ggml-ci
* ggml : fix q6_K FP16 -> FP32 conversion
ggml-ci
* ggml : more FP16 -> FP32 conversion fixes
ggml-ci
2024-02-22 23:21:39 +02:00
Georgi Gerganov
334f76fa38
sync : ggml
2024-02-22 23:21:05 +02:00
Georgi Gerganov
efd56b1c21
ggml : 32-bit arm compat (whisper/1891)
...
* ggml : 32-bit arm compat
* ggml : add ggml_vqtbl1q_s8 impl
* ggml : cont
2024-02-22 23:20:50 +02:00
Someone
201294ae17
nix: init singularity and docker images ( #5056 )
...
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.
Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
2024-02-22 11:44:10 -08:00
Georgi Gerganov
5a9e2f60ba
py : minor fixes ( #5668 )
2024-02-22 20:13:25 +02:00
Xuan Son Nguyen
373ee3fbba
Add Gemma chat template ( #5665 )
...
* add gemma chat template
* gemma: only apply system_prompt on non-model message
2024-02-22 19:10:21 +01:00
Someone
4cb4d8b22d
workflows: nix: hardcode cachix ids, build unconditionally ( #5663 )
...
GitHub does not expose environment and repository variables to PRs coming from forks, which means we have effectively been disabling the Nix CI actions for most PRs.
The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
2024-02-22 08:32:09 -08:00
Georgi Gerganov
3a03541ced
minor : fix trailing whitespace ( #5638 )
2024-02-22 13:54:03 +02:00
Georgi Gerganov
56d03d92be
readme : update hot topics
2024-02-22 10:35:54 +02:00
Xuan Son Nguyen
a46f50747b
server : fallback to chatml, add AlphaMonarch chat template ( #5628 )
...
* server: fallback to chatml
* add new chat template
* server: add AlphaMonarch to test chat template
* server: only check model template if there is no custom tmpl
* remove TODO
2024-02-22 10:33:24 +02:00
Alexey Parfenov
c5688c6250
server : clarify some params in the docs ( #5640 )
2024-02-22 10:27:32 +02:00
Dat Quoc Nguyen
4ef245a92a
mpt : add optional bias tensors ( #5638 )
...
Update MPT with optional bias tensors, to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
2024-02-22 10:15:13 +02:00
slaren
973053d8b0
llama : fix loading models with shared tok_embd and output ( #5651 )
...
ggml-ci
2024-02-22 00:42:09 +01:00
Xuan Son Nguyen
7c8bcc11dc
Add docs for llama_chat_apply_template ( #5645 )
...
* add docs for llama_chat_apply_template
* fix typo
2024-02-22 00:31:00 +01:00
slaren
7fe4678b02
llama : fix session save/load with quantized KV ( #5649 )
2024-02-21 22:52:39 +01:00
slaren
ba2135ccae
gemma : allow offloading the output tensor ( #5646 )
2024-02-21 22:18:23 +01:00
Jared Van Bortel
89febfed93
examples : do not assume BOS when shifting context ( #5622 )
2024-02-21 10:33:54 -05:00
Georgi Gerganov
5022cf242d
sync : ggml
2024-02-21 16:52:52 +02:00
Pierrick Hymbert
1ecea255eb
server: health: fix race condition on slots data using tasks queue ( #5634 )
...
* server: health: fix race condition on slots data using tasks queue
* server: health:
* include_slots only if slots_endpoint
* fix compile warning task.target_id not initialized.
2024-02-21 15:47:48 +01:00
Ettore Di Giacinto
a00a35cef9
readme : add LocalAI to the available UIs ( #5629 )
2024-02-21 16:39:10 +02:00
Georgi Gerganov
eccd7a26dd
sync : ggml ( #5633 )
...
* ggml : fix conv_2d batch mode (ggml/737)
Co-authored-by: bssrdf <bssrdf@gmail.com>
* ggml : compute forward no longer pass src tensors (ggml/729)
* sync : ggml
ggml-ci
---------
Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-02-21 16:17:10 +02:00
Georgi Gerganov
c14f72db9c
readme : update hot topics
2024-02-21 15:39:54 +02:00