Iwan Kawrakow
53e81ca289
k_quants: 10% faster ARM_NEON Q5_K dot product
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
2da3a59708
k_quants: AVX2 implementation for new 64-weight Q5_K
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
ccf4901334
k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
2023-06-26 12:58:32 +03:00
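
For context: in k_quants a "type-0" format reconstructs a weight from a scale alone, x = d * q, while a "type-1" format also stores a per-block offset, x = d * q + m. Dropping the offset frees the bytes the mins occupied, which weigh relatively more in a 64-weight super-block. A minimal sketch of the two families (function names hypothetical, not ggml's):

    #include <stdint.h>

    // "Type-0": weight is reconstructed from a scale alone, x = d * q.
    static inline float dequant_type0(float d, int8_t q) {
        return d * (float) q;
    }

    // "Type-1": weight also carries a per-block offset, x = d * q + m.
    static inline float dequant_type1(float d, uint8_t q, float m) {
        return d * (float) q + m;
    }
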
Iwan Kawrakow
4f61506929
k_quants: forgot to add the Metal changes in last commit
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
ce19b965f0
k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
aeefd4e781
k_quants: switch Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
2023-06-26 12:58:32 +03:00
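
The motivation behind the two scale-format commits above: at QK_K = 64 the per-block scales are a much larger fraction of the total size than at QK_K = 256, so halving them from 8 bits to 4 buys ~7% of file size for a small perplexity cost. A minimal sketch of packing two 4-bit scales per byte (helper names hypothetical, not the exact k_quants layout):

    #include <stdint.h>

    // Pack two 4-bit scale indices into one byte.
    static inline uint8_t pack_scales_4bit(uint8_t lo, uint8_t hi) {
        return (uint8_t) ((lo & 0x0F) | ((hi & 0x0F) << 4));
    }

    // Recover the two 4-bit scale indices.
    static inline void unpack_scales_4bit(uint8_t packed, uint8_t * lo, uint8_t * hi) {
        *lo = packed & 0x0F;
        *hi = packed >> 4;
    }
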
Iwan Kawrakow
88412a1aa0
Simplify via lambda
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
333ffcc5ba
Fixed a bug in Q4_K quantization that was introduced with the 64-weight block addition
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
558a19427b
k_quants: correctly define QK_K in llama.cpp
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
8b98d01e31
k_quants: call them _K, not _k, also on Metal
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
285eeb1531
k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
ff83e32c6a
k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
6081a65527
k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
167a0bbe34
k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
e1bbcfc5cb
k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
2023-06-26 12:58:32 +03:00
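
A note on why the preprocessor-macro point matters: QK_K is a compile-time constant that fixes every block layout, so the same definition has to reach the Metal shaders as the C code. A sketch of the idea, assuming a GGML_QKK_64-style switch and using the Q6_K layout for illustration (the exact macro name and plumbing are assumptions here):

    #include <stdint.h>

    typedef uint16_t ggml_fp16_t; // stand-in for ggml's half-precision type

    // Assumed compile-time switch; the same definition would be forwarded
    // to the Metal source compile so kernels match the host side.
    #ifdef GGML_QKK_64
    #define QK_K 64
    #else
    #define QK_K 256
    #endif

    // Q6_K layout is parametric in QK_K, so one struct serves both sizes.
    typedef struct {
        uint8_t     ql[QK_K/2];      // quants, lower 4 bits
        uint8_t     qh[QK_K/4];      // quants, upper 2 bits
        int8_t      scales[QK_K/16]; // per-16-weight 8-bit scales
        ggml_fp16_t d;               // super-block scale
    } block_q6_K;
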
Iwan Kawrakow
fae24afd01
k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
d92c5a9e29
k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
2ff543c147
k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
9d27d8d0ea
k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
2b2a13c4f9
k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
80c75fe821
k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
cda47a6b2f
k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
03f30c8eca
k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
3bd9ae79d8
k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
460dd841b1
k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
2023-06-26 12:58:29 +03:00
Iwan Kawrakow
41e46ec1c2
k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
2023-06-26 12:55:35 +03:00
Iwan Kawrakow
5aae4b8d4f
k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
2023-06-26 12:52:57 +03:00
Iwan Kawrakow
c6c35366bf
k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as fast as
with super-blocks with 256 weights: 8% slower on the 4080,
20% slower on the 1660 (but there we fit one less layer on the
GPU because of the larger model size, so some fraction of
these 20% is due to that).
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
bcf8c5c384
k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 work, and with that all
k_quants are done on AVX2 and scalar
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
2b2ab31a89
k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 work.
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
aebd5471e9
k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 work. The AVX2 version is way too slow
(it is actually slower than the scalar implementation)
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
1f6195c2f2
k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 work
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
9fe2a2b1db
k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 work
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
d2f12ac354
k_quants: WIP super-blocks with 64 weights
2023-06-26 12:42:36 +03:00
Georgi Gerganov
447ccbe8c3
readme : add new roadmap + manifesto
2023-06-25 16:08:12 +03:00
Georgi Gerganov
bd34cdde38
ggml : sync latest ggml (custom operators)
2023-06-25 14:25:08 +03:00
anon998
c2a08f87b8
fix server sampling: top-k sampler first ( #1977 )
Co-authored-by: anon <anon@example.org>
2023-06-25 10:48:36 +02:00
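
For context, the fix makes the server apply top-k truncation before the other samplers, matching the order used elsewhere in the examples. A sketch of that chain against the llama.h sampling API of the period (parameter values are illustrative, not the server defaults):

    #include "llama.h"

    // Truncate with top-k first, then apply the probability-mass filters,
    // then temperature, then draw the token.
    static llama_token sample_next(struct llama_context * ctx,
                                   llama_token_data_array * candidates) {
        llama_sample_top_k      (ctx, candidates, /*k=*/40,    /*min_keep=*/1);
        llama_sample_tail_free  (ctx, candidates, /*z=*/1.0f,  1);
        llama_sample_typical    (ctx, candidates, /*p=*/1.0f,  1);
        llama_sample_top_p      (ctx, candidates, /*p=*/0.95f, 1);
        llama_sample_temperature(ctx, candidates, /*t=*/0.8f);
        return llama_sample_token(ctx, candidates);
    }
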
Georgi Gerganov
66a2555ba6
readme : add Azure CI discussion link
2023-06-25 09:07:03 +03:00
sjinzh
e65ca7e14a
zig : upgrade build system support ( #1981 )
* upgrade zig build system support
* zig : add new line at the end of the file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-25 08:45:44 +03:00
Robyn
5ec8dd5a3c
#1869 Fix null reference errors when training from scratch with CUDA ( #1907 )
* #1869 Fix null reference errors when training from scratch with CUDA build
Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.
* ggml : do not dereference src0 if NULL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 20:10:29 +02:00
Georgi Gerganov
65bdd52a86
tests : sync test-grad0 from ggml
2023-06-24 19:40:18 +03:00
Rowan Hart
fdd1860911
flake : fix ggml-metal.metal path and run nixfmt ( #1974 )
2023-06-24 14:07:08 +03:00
AN Long
c943d823c1
convert : fix invalid params in write_vocab_only ( #1975 )
2023-06-24 14:02:06 +03:00
slaren
f2c754e1c3
ggml : improve ggml_graph_dump_dot, add ggml_format_name ( #1978 )
* Improve ggml_graph_dump_dot, add ggml_format_name
* add more automatic names to view ops
* fix name of copies
2023-06-24 13:57:18 +03:00
Georgi Gerganov
11da1a85cd
readme : fix whitespaces
2023-06-24 13:38:18 +03:00
Alberto
235b610d65
readme : fixed termux instructions ( #1973 )
2023-06-24 13:32:13 +03:00
Alex Renda
b061ba9e2a
llama : fix top-p sampling to match the canonical definition ( #1953 )
* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p)
* top-p: correct gt to gte
* add test for correct top-p behavior
2023-06-24 13:15:01 +03:00
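
In other words: sort descending, accumulate, and keep the smallest prefix whose cumulative mass reaches at least p; the previous code kept the largest prefix whose mass stayed below p, which could drop the boundary token. A minimal sketch of the corrected rule (helper name hypothetical; assumes probs is already sorted in descending order):

    #include <stddef.h>

    // Return how many of the descending-sorted probabilities to keep:
    // the smallest prefix whose cumulative mass is >= p.
    static size_t top_p_keep(const float * probs, size_t n, float p) {
        float cum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            cum += probs[i];
            if (cum >= p) {
                return i + 1; // include the token that crosses the threshold
            }
        }
        return n;
    }
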
Didzis Gosko
527b6fba1d
llama : make model stateless and context stateful (llama_state) ( #1797 )
* llama : make model stateless and context stateful
* llama : minor cleanup
* llama : update internal API declaration
* Apply suggestions from code review
fix style
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Missing model memory release
* Fix style
* Add deprecated warning for public API function llama_init_from_file
* Update public API use cases: move away from deprecated llama_init_from_file
* Deprecate public API function llama_apply_lora_from_file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 11:47:58 +03:00
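
The upshot of this change: the weights live in a llama_model that is loaded once, while each llama_context holds only mutable state, so several contexts can share one model. A minimal usage sketch of the new pattern (model path illustrative; error handling and backend setup omitted):

    #include "llama.h"

    int main(void) {
        struct llama_context_params params = llama_context_default_params();

        // Previously: ctx = llama_init_from_file(path, params); now deprecated.
        struct llama_model * model =
            llama_load_model_from_file("models/7B/ggml-model-q4_K.bin", params);

        // Two independent contexts sharing the same immutable weights.
        struct llama_context * ctx_a = llama_new_context_with_model(model, params);
        struct llama_context * ctx_b = llama_new_context_with_model(model, params);

        llama_free(ctx_a);
        llama_free(ctx_b);
        llama_free_model(model);
        return 0;
    }
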
eiery
d7b7484f74
Add OpenLLaMA instructions to the README ( #1954 )
* add openllama to readme
2023-06-23 10:38:01 +02:00
Erik Scholz
7487137227
rework convert.py to read hyper-parameters from config.json ( #1958 )
* Read hyper-parameters from the HuggingFace transformers config.json if it exists, and fall back to guessing, as before, otherwise.
This allows converting open_llama 3B and other non-standard model designs.
2023-06-22 14:20:47 +02:00