Commit graph

1227 commits

Author SHA1 Message Date
Concedo
9830871d0f pulled all Occam's fixes and the kquants are all working now 2023-06-13 16:15:13 +08:00
Concedo
9b6c35b651 rwkv speed enhancements (batch processing), fixed a rwkv token processing bug 2023-06-13 16:02:12 +08:00
Kawrakow
74a6d922f1
Metal implementation for all k_quants (#1807)
* metal : improve q4_K

28.3 -> 26.0 ms/token by avoiding a branch in the
calculation of the scales.

* metal : small improvement for Q4_K

* metal : still optimizing Q4_K

This commit pushes it down to 25.3 ms / token.

The crazy idea of using 6 bits for the scales is really costly on
Metal: if I remove the bit fiddling necessary to unpack the block
scales, the time drops almost to the 23 ms/token of Q4_0.

Before pushing the k-quants upstream I had a Q4_K variant that
used 8-bit scales. It wasn't more accurate, it used 0.125 bits more per weight,
it ran slightly slower on the CPU (due to the larger model size
and being memory bound there), and the difference was entirely
negligible under CUDA. So I decided to publish the version with 6-bit
scales. Perhaps I should reconsider and switch to 8-bit scales?
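For reference, the 0.125 bits/weight figure follows directly from the super-block layout; here is a quick back-of-the-envelope check in C, assuming the k-quants layout of 256 weights per super-block with 8 sub-blocks, each carrying one scale and one min:

```c
// Back-of-the-envelope check of the 0.125 bits/weight figure (assumes the
// k-quants super-block layout: 256 weights, 8 sub-blocks, each with one
// scale and one min).
//
//   6-bit scales+mins: 8 * (6 + 6) =  96 bits per super-block
//   8-bit scales+mins: 8 * (8 + 8) = 128 bits per super-block
//   difference:        32 bits / 256 weights = 0.125 bits per weight
double extra_bits_per_weight(void) {
    const int weights_per_superblock = 256, sub_blocks = 8;
    return (sub_blocks * (8 + 8) - sub_blocks * (6 + 6))
           / (double)weights_per_superblock; // = 0.125
}
```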

* metal : some more optimizations

Q2_K: 25.4 ms/token
Q6_K: 27.3 ms/token
Q4_0: 22.8 ms/token
Q4_1: 23.1 ms/token

* metal : Q3_K support

Something is not quite right yet.

* metal : Q5_K support

Initial version achieves 31.2 ms/token, 210 GB/s

* metal : still not able to figure out why q3_K does not work

* Minor

* metal : yet another failed attempt to make q3_K work

* metal : optimize Q5_K

31.2 ms -> 27.8 ms.
250 GB/s.

* metal : q3_K still not working

Adding a heavily commented q3_K metal kernel to explain
my obviously faulty logic. Perhaps someone could spot the issue?

* metal : q3_K finally working

Not optimized at all.

What was the issue? The scales are not 4-byte aligned,
and I was accessing them through a uint32_t pointer.
When I tried that on CUDA, I got an error (illegal memory access)
and added a memcpy to a local array of 3 uint32_t's.
But on Metal it told me there is no memcpy, so I tried
accessing directly. There is no error, just garbage results.
At some point I did try accessing the scales through a uint16_t
pointer (the scales are for sure 2-byte aligned), but was
still getting garbage. I guess there must have been another bug.

Now access to the scales is via a uint16_t pointer and, after starting
from scratch from the C dequantize function, it finally works.
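A minimal C sketch of the access patterns described here (illustrative only, not the actual Metal kernel code): reading the 2-byte-aligned scales through a uint32_t pointer relies on alignment the data does not have, while a memcpy into a properly aligned local array (the CUDA route) or a uint16_t pointer (the route that finally worked here) is safe:

```c
#include <stdint.h>
#include <string.h>

// The q3_K scale bytes are only guaranteed to be 2-byte aligned.
void read_scales(const uint8_t * scales /* 12 bytes, 2-byte aligned */) {
    // Unsafe: a uint32_t load assumes 4-byte alignment -- illegal memory
    // access on CUDA, silent garbage on Metal.
    // const uint32_t * s32 = (const uint32_t *)scales;

    // Option 1: copy into a properly aligned local array (works on CUDA).
    uint32_t aux[3];
    memcpy(aux, scales, 12);

    // Option 2: access through a uint16_t pointer, since 2-byte alignment
    // is guaranteed.
    const uint16_t * s16 = (const uint16_t *)scales;

    (void)aux; (void)s16;
}
```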

* metal : Q3_K 1st optimization pass

* metal : Q3_K second optimization pass - 29.6 ms/token

* metal : Q3_K cleanup

* metal : fixed accidentally broken Q2_K

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-12 22:39:21 +03:00
slaren
e4caa8da59
ci : run when changing only the CUDA sources (#1800) 2023-06-12 20:12:47 +03:00
Concedo
860fb026df rwkv compile fix (+1 squashed commit)
Squashed commits:

[8b0ebb1] upgraded rwkv + added memory overheads + added state_out bufs
2023-06-12 23:04:40 +08:00
Concedo
120851df53 prevent gpu offload if kquant is selected with clblast for now 2023-06-12 21:57:31 +08:00
Concedo
215edf420b Merge branch 'master' into concedo_experimental 2023-06-12 21:53:13 +08:00
Concedo
9c08017051 this patch is a work-in-progress implementation of the k-quants. The dequant kernels are working, but the DMMV ones are not. 2023-06-12 21:47:57 +08:00
Howard Su
58970a4c39
Leverage mmap for offloading tensors to GPU (#1597)
* Rebase to latest

* Show progress

* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-12 14:44:16 +02:00
Kawrakow
8c0a10e64d
metal : fix failure to load model (#1817)
The number of buffers in the ggml context was left uninitialized.
This leads to sporadic failures to load the model on
startup. It is actually strange that the failure occurred so
infrequently.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-12 14:31:36 +03:00
Concedo
b9a4da3c6f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	SHA256SUMS
2023-06-11 23:27:28 +08:00
Concedo
c44b9c3ecf added the llama_v2 cuda back (+2 squashed commits)
Squashed commit:

[1c97fd4] Revert "fix for cublas"

This reverts commit 994be9a4db.

[fce03c3] Revert "fix for cublas"

This reverts commit 33528f5b1d.
2023-06-11 23:23:24 +08:00
Kerfuffle
fa84c4b3e8
Fix issue where interactive mode crashes when input exceeds ctx size (#1789)
* Fix issue where interactive mode in the main example crashes when input exceeds ctx size

* Ensure the context size is at least 8 tokens in the main example.

Closes #1768
2023-06-11 08:19:17 -06:00
Kyle Liang
12b063f0ec
Fixed WSL cuda's OOM error (#1594)
* In the function , add the cuda error bypass.

* remove excessive code and prints

---------

Co-authored-by: liang <liangmanlai@126.com>
2023-06-11 15:20:52 +02:00
Ryan Landay
31d2b5f4a4
Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798) 2023-06-11 12:38:53 +03:00
Georgi Gerganov
4de0334f5c
cmake : fix Metal build (close #1791) 2023-06-10 22:56:53 +03:00
Artyom Lebedev
3f1223155a
k-quants : GCC12 compilation fix (#1792) 2023-06-10 22:51:36 +03:00
Concedo
fb67506c1b Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	flake.nix
#	ggml-metal.m
2023-06-10 23:04:48 +08:00
Andrei
303f5809f1
metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782)
* Fix issue with ggml-metal.metal path

* Add ggml-metal.metal as a resource for llama target

* Update flake.nix metal kernel substitution
2023-06-10 17:47:34 +03:00
Concedo
0c9cd39259 lowered streaming tickrate for greater efficiency 2023-06-10 22:12:01 +08:00
Aisuko
059e99066d
doc : fix wrong address of BLIS.md (#1772)
Signed-off-by: Aisuko <urakiny@gmail.com>
2023-06-10 17:08:11 +03:00
Concedo
b9f74db89e Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-06-10 21:07:20 +08:00
Concedo
fa64971881 encoding 2023-06-10 21:05:35 +08:00
Concedo
66a3f4e421 added support for lora base 2023-06-10 19:29:45 +08:00
Concedo
375540837e updated lite 2023-06-10 19:16:29 +08:00
Concedo
a68fcfe738 only start a new thread when using sse 2023-06-10 19:03:41 +08:00
Concedo
43f7e40470 added extra endpoints for abort gen and polled streaming 2023-06-10 18:13:26 +08:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed so that operators like ggml_view() can store their
parameters in the ggml context's memory instead of having them discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
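For context, a minimal sketch of the setting in question, using the public ggml C API (the memory size here is arbitrary): with no_alloc = true the context does not reserve data buffers for new tensors, which is why the small parameter tensors created by operators such as ggml_view() need the no_alloc == false override described above.

```c
#include "ggml.h"

struct ggml_init_params params = {
    /*.mem_size   =*/ 16u * 1024 * 1024,  // arbitrary size for this sketch
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,   // tensor data is expected to live elsewhere
};
struct ggml_context * ctx = ggml_init(params);

// Operators such as ggml_view() still need a little context memory to hold
// their parameters; the commit above forces no_alloc == false for those
// tensors so the parameters are not discarded.
```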
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
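A common shape for this kind of workaround (a sketch only; the macro name and exact definition in the PR may differ) is to rebuild the missing intrinsic from _mm256_castsi128_si256 and _mm256_insertf128_si256, which older GCC versions do provide:

```c
#include <immintrin.h>

// Fallback for toolchains lacking _mm256_setr_m128i (e.g. GCC < 8).
#if defined(__GNUC__) && (__GNUC__ < 8) && !defined(__clang__)
#define MM256_SETR_M128I(lo, hi) \
    _mm256_insertf128_si256(_mm256_castsi128_si256(lo), (hi), 1)
#else
#define MM256_SETR_M128I(lo, hi) _mm256_setr_m128i((lo), (hi))
#endif
```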
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
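The general pattern for such cleanup (a generic sketch; the PR's actual function names and call sites may differ) is to release each cl_mem handle once the host no longer needs it:

```c
#include <CL/cl.h>
#include <stddef.h>

// Generic helper for dropping an OpenCL buffer reference held by the host.
static void free_cl_buffer(cl_mem * buf) {
    if (buf && *buf) {
        clReleaseMemObject(*buf);  // decrement the buffer's reference count
        *buf = NULL;
    }
}
```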
Concedo
5bd9cef9fa merging Proper SSE Token Streaming #220 with end connection fix test 2023-06-09 23:22:16 +08:00
Concedo
b92f9fe3a2 Merge remote-tracking branch 'sammcheese/sammcheese/tokenstreaming' into concedo_experimental 2023-06-09 20:41:02 +08:00
Concedo
507939c135 Merge branch 'master' into concedo_experimental 2023-06-09 20:20:04 +08:00
Concedo
788784179a Merge branch 'concedo' into concedo_experimental 2023-06-09 20:19:56 +08:00
12Boti
e1ab14c4ab
fix format string vulnerability (#223) 2023-06-09 20:16:03 +08:00
Johannes Gäßler
ae9663f188
Windows nvcc workaround (#1753)
Fix gibberish output on Windows when using CUDA
2023-06-09 13:58:15 +02:00
SammCheese
57b0b53b54
fix kobold lite generation 2023-06-09 12:39:35 +02:00
SammCheese
c99ab9df33
Revert "Squashed commit of the following:"
This reverts commit 4f665cd63d.
2023-06-09 12:19:08 +02:00
SammCheese
e6231c3055
back to http.server, improved implementation 2023-06-09 12:17:55 +02:00
Concedo
d28ed99e59 remove unused declarations 2023-06-09 18:01:55 +08:00
SammCheese
4f665cd63d
Squashed commit of the following:
commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:10:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:09:23 2023 +0800

    added warning message for unsupported K quants

commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date:   Fri Jun 9 04:00:51 2023 -0400

    metal : add GELU implementation (#1770)

    Co-authored-by: Adam Treat <adam@nomic.ai>

commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Fri Jun 9 10:39:59 2023 +0300

    metal : faster q4_0 (#1775)

    * metal : 8% faster q4_0

    Avoid copying into local uchar4 and float4.

    * metal : 17% faster Q4_0

    Use 64 threads in a thread group.

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:53:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:38:31 2023 +0800

    merged metal patch directly into the file

commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 22:28:21 2023 +0300

    metal : add Q2_K implementation (#1762)

    * metal : add Q2_K implementation

    27.1 ms / token on M2 Max 30-core GPU, so about the
    same speed as Q4_0. Memory throughput is ~156 GB/s.

    The access pattern used in the Q2_K
    CUDA implementation resulted in significantly lower
    performance (~31 ms/token).

    * Fixing merge conflicts

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 20:48:14 2023 +0300

    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"

    This reverts commit 8432d4d9f7.

commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date:   Fri Jun 9 00:47:56 2023 +0800

    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)

commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:47:36 2023 +0900

    Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment

commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 19:46:22 2023 +0300

    metal : Q6_K implementation (#1752)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    * Metal implementation for Q6_K

    Similar to the CUDA implementation.
    No idea if this is the optimum for Metal, but the few
    alternative variants I tried all had lower performance.

    We get 36.5 ms / token on M2 Max with 30 GPU cores.
    This corresponds to ~200 GB/second throughput.

    * clang-tidy : add config back

    * Much better Q6_K implementation for metal

    28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
    other compute graph operations, we are left with ~19 ms
    for the matrix multiplications. The model is ~5.5 GB,
    so we are getting 1000 / 19 * 5.5 = 290 GB/s!

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:24:22 2023 +0900

    Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment

commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date:   Thu Jun 8 15:58:53 2023 +0800

    Add llama.cpp docker support for non-latin languages (#1673)

    * Modify Dockerfile default character set to improve compatibility (#1673)

commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date:   Thu Jun 8 00:12:28 2023 -0700

    ggml : fix fprintf warnings (#1720)

commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 10:09:08 2023 +0300

    clang-tidy : restore dot file from accidental deletion

commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 10:08:23 2023 +0300

    metal : add Q4_K implementation (#1733)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date:   Thu Jun 8 08:02:48 2023 +0100

    k-quants : add missing compile definition to CMakeLists (#1748)
2023-06-09 10:55:07 +02:00
Georgi Gerganov
b33dee282f
metal : fix build "tanhf" -> "tanh" 2023-06-09 11:11:04 +03:00
Concedo
b617f2847b Merge branch 'master' into concedo_experimental 2023-06-09 16:10:35 +08:00
Concedo
73cc5b88fb added warning message for unsupported K quants 2023-06-09 16:09:23 +08:00
AT
92f44ff7f7
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
2023-06-09 11:00:51 +03:00
Kawrakow
245fc3c37d
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0

Avoid copying into local uchar4 and float4.

* metal : 17% faster Q4_0

Use 64 threads in a thread group.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-09 10:39:59 +03:00
Concedo
01dc509038 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/full.Dockerfile
#	.devops/main.Dockerfile
#	CMakeLists.txt
2023-06-09 14:53:35 +08:00