Commit graph

770 commits

Author SHA1 Message Date
Iwan Kawrakow
ce19b965f0 k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
 but the Q4_K points still move further to the left on
 the perplexity vs size curve.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
aeefd4e781 k_quants: switch Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
2023-06-26 12:58:32 +03:00
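The two commits above trade 8-bit block scales for 4-bit ones when QK_K = 64, which is where the ~7% file-size saving comes from. As a rough illustration, here is a minimal C++ sketch of packing two 4-bit scales into one byte; the helper names are hypothetical and this is not the actual block_q3_K/block_q4_K layout.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: pack two 4-bit quantization scales into a single byte,
// halving the per-block scale storage compared to 8-bit scales.
static uint8_t pack_scales_4bit(uint8_t s0, uint8_t s1) {
    return (uint8_t)((s0 & 0x0F) | ((s1 & 0x0F) << 4));
}

static void unpack_scales_4bit(uint8_t packed, uint8_t & s0, uint8_t & s1) {
    s0 = packed & 0x0F;        // low nibble
    s1 = (packed >> 4) & 0x0F; // high nibble
}

int main() {
    const uint8_t packed = pack_scales_4bit(7, 12);
    uint8_t a, b;
    unpack_scales_4bit(packed, a, b);
    // prints: packed=0xC7 -> 7, 12
    std::printf("packed=0x%02X -> %u, %u\n", (unsigned) packed, (unsigned) a, (unsigned) b);
    return 0;
}
```

The trade-off reported above is the usual one: coarser scales cost a little perplexity but move the quantization further left on the size axis.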
Iwan Kawrakow
88412a1aa0 Simplify via lambda 2023-06-26 12:58:32 +03:00
Iwan Kawrakow
333ffcc5ba Fixed bug in q4_K quantization added with the 64-block addition 2023-06-26 12:58:32 +03:00
Iwan Kawrakow
558a19427b k_quants: correctly define QK_K in llama.cpp 2023-06-26 12:58:32 +03:00
Iwan Kawrakow
8b98d01e31 k_quants: call them _K, not _k, also on Metal 2023-06-26 12:58:32 +03:00
Iwan Kawrakow
285eeb1531 k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
ff83e32c6a k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
6081a65527 k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
167a0bbe34 k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
e1bbcfc5cb k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
  compiler
* Q6_K works and is actually slightly more efficient than
  the QK_K = 256 version (25.2 ms vs 25.8 ms)
2023-06-26 12:58:32 +03:00
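The note about passing preprocessor macros to the Metal compiler matters because QK_K selects the super-block layout at compile time. The sketch below shows the same idea in plain C++ (built with e.g. -DQK_K=64); the struct and field sizes are illustrative only, not the actual k_quants block definitions.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative: the super-block size is chosen at compile time via a macro
// (e.g. -DQK_K=64), so one source file can build either block layout.
#ifndef QK_K
#define QK_K 256
#endif

struct block_example {
#if QK_K == 64
    uint8_t scales[2];      // smaller scale area for 64-weight super-blocks
#else
    uint8_t scales[12];     // larger scale area for 256-weight super-blocks
#endif
    uint8_t qs[QK_K / 2];   // 4-bit quants, two per byte
};

int main() {
    std::printf("QK_K = %d, sizeof(block_example) = %zu bytes\n",
                QK_K, sizeof(block_example));
    return 0;
}
```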
Iwan Kawrakow
fae24afd01 k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
d92c5a9e29 k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
2ff543c147 k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
9d27d8d0ea k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.

With that, we have full support for ARM_NEON, although
performance is not quite there.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
2b2a13c4f9 k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
80c75fe821 k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
cda47a6b2f k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
03f30c8eca k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
3bd9ae79d8 k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
2023-06-26 12:58:32 +03:00
Iwan Kawrakow
460dd841b1 k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
2023-06-26 12:58:29 +03:00
Iwan Kawrakow
41e46ec1c2 k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
2023-06-26 12:55:35 +03:00
Iwan Kawrakow
5aae4b8d4f k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
2023-06-26 12:52:57 +03:00
Iwan Kawrakow
c6c35366bf k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as fast as
with super-blocks with 256 weights: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because of the larger model size), so some fraction of
these 20% is due to that.
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
bcf8c5c384 k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 work, and with that all
k_quants are done on AVX2 and scalar
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
2b2ab31a89 k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 work.
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
aebd5471e9 k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 work, but the AVX2 version is way too slow
(it is actually slower than the scalar implementation)
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
1f6195c2f2 k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 work
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
9fe2a2b1db k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 work
2023-06-26 12:42:36 +03:00
Iwan Kawrakow
d2f12ac354 k_quants: WIP super-blocks with 64 weights 2023-06-26 12:42:36 +03:00
Georgi Gerganov
447ccbe8c3
readme : add new roadmap + manifesto 2023-06-25 16:08:12 +03:00
Georgi Gerganov
bd34cdde38
ggml : sync latest ggml (custom operators) 2023-06-25 14:25:08 +03:00
anon998
c2a08f87b8
fix server sampling: top k sampler first (#1977)
Co-authored-by: anon <anon@example.org>
2023-06-25 10:48:36 +02:00
Georgi Gerganov
66a2555ba6
readme : add Azure CI discussion link 2023-06-25 09:07:03 +03:00
sjinzh
e65ca7e14a
zig : upgrade build system support (#1981)
* upgrade zig build system support

* zig : add new line at the end of the file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-25 08:45:44 +03:00
Robyn
5ec8dd5a3c
#1869 Fix null reference errors when training from scratch with CUDA (#1907)
* #1869 Fix null reference errors when training from scratch with CUDA build

Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.

* ggml : do not dereference src0 if NULL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 20:10:29 +02:00
Georgi Gerganov
65bdd52a86
tests : sync test-grad0 from ggml 2023-06-24 19:40:18 +03:00
Rowan Hart
fdd1860911
flake : fix ggml-metal.metal path and run nixfmt (#1974) 2023-06-24 14:07:08 +03:00
AN Long
c943d823c1
convert : fix invalid params in write_vocab_only (#1975) 2023-06-24 14:02:06 +03:00
slaren
f2c754e1c3
ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)
* Improve ggml_graph_dump_dot, add ggml_format_name

* add more automatic names to view ops

* fix name of copies
2023-06-24 13:57:18 +03:00
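For context on the ggml_format_name helper added in the commit above, a hedged usage sketch follows; the surrounding ggml calls reflect the API as commonly used around this period and should be treated as assumptions rather than the exact code from this change.

```cpp
// Hedged sketch: naming tensors so they show up readably in
// ggml_graph_dump_dot output. ggml signatures are assumed from the
// ggml of this period and may differ from this exact revision.
#include "ggml.h"

void build_and_dump(void) {
    struct ggml_init_params params = { 16u*1024u*1024u, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    ggml_set_name(a, "input_a");
    ggml_format_name(b, "weights_layer_%d", 0);   // printf-style tensor naming

    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph gf = ggml_build_forward(c);
    ggml_graph_dump_dot(&gf, NULL, "graph.dot");  // tensor names become node labels

    ggml_free(ctx);
}
```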
Georgi Gerganov
11da1a85cd
readme : fix whitespaces 2023-06-24 13:38:18 +03:00
Alberto
235b610d65
readme : fixed termux instructions (#1973) 2023-06-24 13:32:13 +03:00
Alex Renda
b061ba9e2a
llama : fix top-p sampling to match the canonical definition (#1953)
* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p)

* top-p: correct gt to gte

* add test for correct top-p behavior
2023-06-24 13:15:01 +03:00
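To make the distinction in this fix concrete: canonical top-p keeps the smallest probability-sorted prefix whose cumulative mass is at least p, not the largest prefix whose mass stays below p. The sketch below is illustrative C++ only, not the llama.cpp sampler itself.

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

// Illustrative top-p (nucleus) filter: keep the smallest prefix of
// probability-sorted candidates whose cumulative mass is >= p.
std::vector<float> top_p_filter(std::vector<float> probs, float p) {
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    float cum = 0.0f;
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (cum >= p) {          // ">=" is the "gt to gte" correction noted above
            keep = i + 1;        // smallest set reaching mass p
            break;
        }
    }
    probs.resize(keep);
    return probs;
}

int main() {
    // With p = 0.5, the canonical definition keeps {0.4, 0.3} (mass 0.7),
    // while "largest set with mass < p" would keep only {0.4}.
    std::vector<float> kept = top_p_filter({0.4f, 0.3f, 0.2f, 0.1f}, 0.5f);
    for (float x : kept) std::printf("%.2f ", x);
    std::printf("\n");
    return 0;
}
```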
Didzis Gosko
527b6fba1d
llama : make model stateless and context stateful (llama_state) (#1797)
* llama : make model stateless and context stateful

* llama : minor cleanup

* llama : update internal API declaration

* Apply suggestions from code review

fix style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Missing model memory release

* Fix style

* Add deprecated warning for public API function llama_init_from_file

* Update public API use cases: move away from deprecated llama_init_from_file

* Deprecate public API function llama_apply_lora_from_file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 11:47:58 +03:00
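A hedged sketch of how callers are expected to use the stateless-model / stateful-context split introduced in this PR; the function names (llama_load_model_from_file, llama_new_context_with_model, llama_free_model) follow the change description above, but the exact signatures at this revision are assumed, not verified.

```cpp
// Hedged sketch of the stateless-model / stateful-context usage pattern.
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) return 1;

    llama_context_params params = llama_context_default_params();

    // Old, now-deprecated pattern:
    //   llama_context * ctx = llama_init_from_file(argv[1], params);

    // New pattern: load the (stateless) model once...
    llama_model * model = llama_load_model_from_file(argv[1], params);
    if (model == nullptr) return 1;

    // ...then create one or more (stateful) contexts that share it.
    llama_context * ctx = llama_new_context_with_model(model, params);
    if (ctx == nullptr) { llama_free_model(model); return 1; }

    // ... tokenization / evaluation would happen here ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```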
eiery
d7b7484f74
Add OpenLLaMA instructions to the README (#1954)
* add openllama to readme
2023-06-23 10:38:01 +02:00
Erik Scholz
7487137227
rework convert.py to read hyper-parameters from config.json (#1958)
* Read hyper-parameters from the HuggingFace transformers config.json when present, and otherwise fall back to guessing, like before.
  This allows converting open_llama 3B and other non-standard model designs.
2023-06-22 14:20:47 +02:00
Johannes Gäßler
bbca06e269
cmake: revert CUDA arch default to 52, 61 if f16 (#1959) 2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99
Fix typo in README.md (#1961) 2023-06-21 23:48:43 +02:00
Georgi Gerganov
049aa16b8c
readme : add link to p1 2023-06-20 19:05:54 +03:00
Xiake Sun
2322ec223a
Fix typo (#1949) 2023-06-20 15:42:40 +03:00