Commit graph

1203 commits

Author SHA1 Message Date
Concedo
375540837e updated lite 2023-06-10 19:16:29 +08:00
Concedo
a68fcfe738 only start a new thread when using sse 2023-06-10 19:03:41 +08:00
Concedo
43f7e40470 added extra endpoints for abort gen and polled streaming 2023-06-10 18:13:26 +08:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed to make operators like ggml_view() be able to store their
parameters in the ggml context's memory and not get discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
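The rationale above is easiest to see from the ggml allocation API: with no_alloc == true the context reserves no memory for tensor data, yet view operators such as ggml_view() still need context memory to hold their parameters. A minimal sketch, assuming the ggml C API of this period (buffer size and shapes are illustrative only):

    #include "ggml.h"

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16u*1024u*1024u,  // illustrative scratch size
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,            // forced off so view/opt tensors keep their params
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 128);
        struct ggml_tensor * v = ggml_view_1d(ctx, a, 64, 0); // view params stored in ctx memory
        (void) v;

        ggml_free(ctx);
        return 0;
    }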
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
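For reference, the usual shim for a GCC < 8 toolchain is to rebuild the missing composite intrinsic from an insert/cast pair. A hedged sketch (the macro name is illustrative, not necessarily the one used in #1638):

    #include <immintrin.h>

    // GCC < 8 lacks _mm256_set_m128i / _mm256_setr_m128i, so construct the
    // 256-bit vector from two 128-bit halves with older AVX intrinsics.
    #if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 8
    #define MM256_SET_M128I(hi, lo) \
        _mm256_insertf128_si256(_mm256_castsi128_si256(lo), (hi), 1)
    #else
    #define MM256_SET_M128I(hi, lo) _mm256_set_m128i((hi), (lo))
    #endif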
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
Concedo
5bd9cef9fa merging Proper SSE Token Streaming #220 with end connection fix test 2023-06-09 23:22:16 +08:00
Concedo
b92f9fe3a2 Merge remote-tracking branch 'sammcheese/sammcheese/tokenstreaming' into concedo_experimental 2023-06-09 20:41:02 +08:00
Concedo
507939c135 Merge branch 'master' into concedo_experimental 2023-06-09 20:20:04 +08:00
Concedo
788784179a Merge branch 'concedo' into concedo_experimental 2023-06-09 20:19:56 +08:00
12Boti
e1ab14c4ab
fix format string vulnerability (#223) 2023-06-09 20:16:03 +08:00
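For context, a format string fix of this kind normally means no longer passing untrusted text as the format argument itself. A generic illustration, not the exact code touched by #223:

    #include <stdio.h>

    void log_message(const char * user_text) {
        // Vulnerable: user_text is interpreted as a format string, so %s/%n
        // specifiers inside it can read or corrupt memory.
        // printf(user_text);

        // Fixed: constant format string, untrusted text passed as data.
        printf("%s", user_text);
    }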
Johannes Gäßler
ae9663f188
Windows nvcc workaround (#1753)
Fix gibberish output on Windows when using CUDA
2023-06-09 13:58:15 +02:00
SammCheese
57b0b53b54
fix kobold lite generation 2023-06-09 12:39:35 +02:00
SammCheese
c99ab9df33
Revert "Squashed commit of the following:"
This reverts commit 4f665cd63d.
2023-06-09 12:19:08 +02:00
SammCheese
e6231c3055
back to http.server, improved implementation 2023-06-09 12:17:55 +02:00
Concedo
d28ed99e59 remove unused declarations 2023-06-09 18:01:55 +08:00
SammCheese
4f665cd63d
Squashed commit of the following:
commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:10:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:09:23 2023 +0800

    added warning message for unsupported K quants

commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date:   Fri Jun 9 04:00:51 2023 -0400

    metal : add GELU implementation (#1770)

    Co-authored-by: Adam Treat <adam@nomic.ai>

commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Fri Jun 9 10:39:59 2023 +0300

    metal : faster q4_0 (#1775)

    * metal : 8% faster q4_0

    Avoid copying into local uchar4 and float4.

    * metal : 17% faster Q4_0

    Use 64 threads in a thread group.

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:53:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:38:31 2023 +0800

    merged metal patch directly into the file

commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 22:28:21 2023 +0300

    metal : add Q2_K implementation (#1762)

    * metal : add Q2_K implementation

    27.1 ms / token on M2 Max 30-core GPU, so about the
    same speed as Q4_0. Memory throughput is ~156 GB/s.

    The access pattern used in the Q2_K
    CUDA implementation resulted in significantly lower
    performance (~31 ms/token).

    * Fixing merge conflicts

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 20:48:14 2023 +0300

    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"

    This reverts commit 8432d4d9f7.

commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date:   Fri Jun 9 00:47:56 2023 +0800

    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)

commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:47:36 2023 +0900

    Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment

commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 19:46:22 2023 +0300

    metal : Q6_K implementation (#1752)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    * Metal implementation for Q6_K

    Similar to the CUDA implementation.
    No idea if this is the optimum for Metal, but the few
    alternative variants I tried all had a lower performance.

    We get 36.5 ms / token on M2 Max with 30 GPU cores.
    This corresponds to ~200 GB/second throughput.

    * clang-tidy : add config back

    * Much better Q6_K implementation for metal

    28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
    other compute graph operations, we are left with ~19 ms
    for the matrix multiplications. The model is ~5.5 GB,
    so we are getting 1000 / 19 * 5.5 = 290 GB/s!

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:24:22 2023 +0900

    Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment

commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date:   Thu Jun 8 15:58:53 2023 +0800

    Add llama.cpp docker support for non-latin languages (#1673)

    * Modify Dockerfile default character set to improve compatibility (#1673)

commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date:   Thu Jun 8 00:12:28 2023 -0700

    ggml : fix fprintf warnings (#1720)

commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 10:09:08 2023 +0300

    clang-tidy : restore dot file from accidental deletion

commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 10:08:23 2023 +0300

    metal : add Q4_K implementation (#1733)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date:   Thu Jun 8 08:02:48 2023 +0100

    k-quants : add missing compile definition to CMakeLists (#1748)
2023-06-09 10:55:07 +02:00
Georgi Gerganov
b33dee282f
metal : fix build "tanhf" -> "tanh" 2023-06-09 11:11:04 +03:00
Concedo
b617f2847b Merge branch 'master' into concedo_experimental 2023-06-09 16:10:35 +08:00
Concedo
73cc5b88fb added warning message for unsupported K quants 2023-06-09 16:09:23 +08:00
AT
92f44ff7f7
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
2023-06-09 11:00:51 +03:00
Kawrakow
245fc3c37d
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0

Avoid copying into local uchar4 and float4.

* metal : 17% faster Q4_0

Use 64 threads in a thread group.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-09 10:39:59 +03:00
Concedo
01dc509038 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/full.Dockerfile
#	.devops/main.Dockerfile
#	CMakeLists.txt
2023-06-09 14:53:35 +08:00
Concedo
0833845268 merged metal patch directly into the file 2023-06-09 14:38:31 +08:00
Kawrakow
72ff5282bf
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation

27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.

The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).

* Fixing merge conflicts

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 22:28:21 +03:00
Georgi Gerganov
0bf7cf1b29
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
2023-06-08 20:48:14 +03:00
le.chang
8432d4d9f7
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) 2023-06-08 19:47:56 +03:00
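For reference, vld4q_s8 is the NEON structured-load intrinsic named in the commit above (reverted by the entry preceding it): it loads 64 consecutive signed bytes and de-interleaves them into an int8x16x4_t, replacing four separate vld1q_s8 loads. A minimal sketch:

    #include <arm_neon.h>

    // Elements 0,4,8,... of the 64-byte block land in val[0], 1,5,9,... in
    // val[1], and so on -- one structured load instead of four plain ones.
    static inline int8x16x4_t load_64_bytes_deinterleaved(const int8_t * p) {
        return vld4q_s8(p);
    }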
Hyun-joo KIM
6fa1613f15
Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment 2023-06-09 01:47:36 +09:00
Kawrakow
0f291e1f65
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

* Metal implementation for Q6_K

Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.

We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.

* clang-tidy : add config back

* Much better Q6_K implementation for metal

28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 19:46:22 +03:00
SammCheese
dee692a63e
compatibility with basic_api, change api path to /extra 2023-06-08 18:34:24 +02:00
SammCheese
b4e9e185d3
fix legacy streaming 2023-06-08 18:34:24 +02:00
SammCheese
9a8da35ec4
working streaming. TODO: fix lite 2023-06-08 18:34:23 +02:00
SammCheese
97971291e9
draft: token streaming 2023-06-08 18:34:08 +02:00
Hyun-joo KIM
7f181600c7
Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment 2023-06-09 01:24:22 +09:00
Concedo
a6a0fa338a cleanup indentation, fixing cublas build 2023-06-08 22:40:53 +08:00
Concedo
a979e71ddc add obj flags to all output make targets 2023-06-08 16:28:26 +08:00
qingfengfenga
8fc8179919
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
2023-06-08 00:58:53 -07:00
Steven Roussey
b50b570ed9
ggml : fix fprintf warnings (#1720) 2023-06-08 10:12:28 +03:00
Georgi Gerganov
53aba3f393
clang-tidy : restore dot file from accidental deletion 2023-06-08 10:09:08 +03:00
Kawrakow
4161bdc04d
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 10:08:23 +03:00
johnson442
0035858273
k-quants : add missing compile definition to CMakeLists (#1748) 2023-06-08 10:02:48 +03:00
Concedo
6635f7efce updated lite 2023-06-08 00:20:32 +08:00
Concedo
49a6be3d87 add llama metal compile flags as an option 2023-06-07 22:29:38 +08:00
Concedo
7b0707ff26 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-06-07 17:06:56 +08:00
Georgi Gerganov
5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default
2023-06-07 10:59:52 +03:00
Concedo
e78c675a6e Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	flake.lock
#	flake.nix
#	ggml-opencl.cpp
2023-06-07 15:23:29 +08:00
jacobi petrucciani
5b57a5b726
flake : update to support metal on m1/m2 (#1724) 2023-06-07 07:15:31 +03:00