llama.cpp

Author	SHA1	Message	Date
Concedo	788784179a	Merge branch 'concedo' into concedo_experimental	2023-06-09 20:19:56 +08:00
12Boti	e1ab14c4ab	fix format string vulnerability (#223)	2023-06-09 20:16:03 +08:00
Johannes Gäßler	ae9663f188	Windows nvcc workaround (#1753) Fix gibberish output on Windows when using CUDA	2023-06-09 13:58:15 +02:00
SammCheese	57b0b53b54	fix kobold lite generation	2023-06-09 12:39:35 +02:00
SammCheese	c99ab9df33	Revert "Squashed commit of the following:" This reverts commit `4f665cd63d`.	2023-06-09 12:19:08 +02:00
SammCheese	e6231c3055	back to http.server, improved implementation	2023-06-09 12:17:55 +02:00
Concedo	d28ed99e59	remove unused declarations	2023-06-09 18:01:55 +08:00
SammCheese	4f665cd63d	Squashed commit of the following: commit `b617f2847b` Merge: `73cc5b8` `92f44ff` Author: Concedo <39025047+LostRuins@users.noreply.github.com> Date: Fri Jun 9 16:10:35 2023 +0800 Merge branch 'master' into concedo_experimental commit `73cc5b88fb` Author: Concedo <39025047+LostRuins@users.noreply.github.com> Date: Fri Jun 9 16:09:23 2023 +0800 added warning message for unsupported K quants commit `92f44ff7f7` Author: AT <manyoso@users.noreply.github.com> Date: Fri Jun 9 04:00:51 2023 -0400 metal : add GELU implementation (#1770) Co-authored-by: Adam Treat <adam@nomic.ai> commit `245fc3c37d` Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> Date: Fri Jun 9 10:39:59 2023 +0300 metal : faster q4_0 (#1775) * metal : 8% faster q4_0 Avoid copying into local uchar4 anf float4. * metal : 17% faster Q4_0 Use 64 threads in a thread group. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> commit `01dc509038` Merge: `0833845` `72ff528` Author: Concedo <39025047+LostRuins@users.noreply.github.com> Date: Fri Jun 9 14:53:35 2023 +0800 Merge branch 'master' into concedo_experimental commit `0833845268` Author: Concedo <39025047+LostRuins@users.noreply.github.com> Date: Fri Jun 9 14:38:31 2023 +0800 merged metal patch directly into the file commit `72ff5282bf` Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> Date: Thu Jun 8 22:28:21 2023 +0300 metal : add Q2_K implementation (#1762) * metal : add Q2_K implementation 27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s. The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token). 
* Fixing merge conflicts --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> commit `0bf7cf1b29` Author: Georgi Gerganov <ggerganov@gmail.com> Date: Thu Jun 8 20:48:14 2023 +0300 Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)" This reverts commit `8432d4d9f7`. commit `8432d4d9f7` Author: le.chang <cljs118@126.com> Date: Fri Jun 9 00:47:56 2023 +0800 ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) commit `6fa1613f15` Author: Hyun-joo KIM <bebopkim@gmail.com> Date: Fri Jun 9 01:47:36 2023 +0900 Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment commit `0f291e1f65` Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> Date: Thu Jun 8 19:46:22 2023 +0300 metal : Q6_K implementation (#1752) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master * Metal implementation for Q6_K Similar to the CUDA implementation. No idea if this is the optimum for Metal, but the few alternative variants I tried all had a lower performance. We get 36.5 ms / token on M2 Max with 30 GPU cores. This corresponds to ~200 GB/second throughput. * clang-tidy : add config back * Much better Q6_K implementation for metal 28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph operations, we are left with ~19 ms for the matrix multiplications. The model is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s! 
--------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> commit `7f181600c7` Author: Hyun-joo KIM <bebopkim@gmail.com> Date: Fri Jun 9 01:24:22 2023 +0900 Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment commit `8fc8179919` Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com> Date: Thu Jun 8 15:58:53 2023 +0800 Add llama.cpp docker support for non-latin languages (#1673) * Modify Dockerfile default character set to improve compatibility (#1673) commit `b50b570ed9` Author: Steven Roussey <sroussey@gmail.com> Date: Thu Jun 8 00:12:28 2023 -0700 ggml : fix fprintf warnings (#1720) commit `53aba3f393` Author: Georgi Gerganov <ggerganov@gmail.com> Date: Thu Jun 8 10:09:08 2023 +0300 clang-tidy : restore dot file from accidental deletion commit `4161bdc04d` Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com> Date: Thu Jun 8 10:08:23 2023 +0300 metal : add Q4_K implementation (#1733) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> commit `0035858273` Author: johnson442 <56517414+johnson442@users.noreply.github.com> Date: Thu Jun 8 08:02:48 2023 +0100 k-quants : add missing compile definition to CMakeLists (#1748)	2023-06-09 10:55:07 +02:00
Georgi Gerganov	b33dee282f	metal : fix build "tanhf" -> "tanh"	2023-06-09 11:11:04 +03:00
Concedo	b617f2847b	Merge branch 'master' into concedo_experimental	2023-06-09 16:10:35 +08:00
Concedo	73cc5b88fb	added warning message for unsupported K quants	2023-06-09 16:09:23 +08:00
AT	92f44ff7f7	metal : add GELU implementation (#1770) Co-authored-by: Adam Treat <adam@nomic.ai>	2023-06-09 11:00:51 +03:00
Kawrakow	245fc3c37d	metal : faster q4_0 (#1775) * metal : 8% faster q4_0 Avoid copying into local uchar4 anf float4. * metal : 17% faster Q4_0 Use 64 threads in a thread group. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-09 10:39:59 +03:00
Concedo	01dc509038	Merge branch 'master' into concedo_experimental # Conflicts: # .devops/full.Dockerfile # .devops/main.Dockerfile # CMakeLists.txt	2023-06-09 14:53:35 +08:00
Concedo	0833845268	merged metal patch directly into the file	2023-06-09 14:38:31 +08:00
Kawrakow	72ff5282bf	metal : add Q2_K implementation (#1762) * metal : add Q2_K implementation 27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s. The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token). * Fixing merge conflicts --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-08 22:28:21 +03:00
Georgi Gerganov	0bf7cf1b29	Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)" This reverts commit `8432d4d9f7`.	2023-06-08 20:48:14 +03:00
le.chang	8432d4d9f7	ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)	2023-06-08 19:47:56 +03:00
Hyun-joo KIM	6fa1613f15	Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment	2023-06-09 01:47:36 +09:00
Kawrakow	0f291e1f65	metal : Q6_K implementation (#1752) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master * Metal implementation for Q6_K Similar to the CUDA implementation. No idea if this is the optimum for Metal, but the few alternative variants I tried all had a lower performance. We get 36.5 ms / token on M2 Max with 30 GPU cores. This corresponds to ~200 GB/second throughput. * clang-tidy : add config back * Much better Q6_K implementation for metal 28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph operations, we are left with ~19 ms for the matrix multiplications. The model is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s! --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-08 19:46:22 +03:00
SammCheese	dee692a63e	compability with basic_api, change api path to /extra	2023-06-08 18:34:24 +02:00
SammCheese	b4e9e185d3	fix legacy streaming	2023-06-08 18:34:24 +02:00
SammCheese	9a8da35ec4	working streaming. TODO: fix lite	2023-06-08 18:34:23 +02:00
SammCheese	97971291e9	draft: token streaming	2023-06-08 18:34:08 +02:00
Hyun-joo KIM	7f181600c7	Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment	2023-06-09 01:24:22 +09:00
Concedo	a6a0fa338a	cleanup indentation, fixing cublas build	2023-06-08 22:40:53 +08:00
Concedo	a979e71ddc	add obj flags to all output make targets	2023-06-08 16:28:26 +08:00
qingfengfenga	8fc8179919	Add llama.cpp docker support for non-latin languages (#1673) * Modify Dockerfile default character set to improve compatibility (#1673)	2023-06-08 00:58:53 -07:00
Steven Roussey	b50b570ed9	ggml : fix fprintf warnings (#1720)	2023-06-08 10:12:28 +03:00
Georgi Gerganov	53aba3f393	clang-tidy : restore dot file from accidental deletion	2023-06-08 10:09:08 +03:00
Kawrakow	4161bdc04d	metal : add Q4_K implementation (#1733) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-08 10:08:23 +03:00
johnson442	0035858273	k-quants : add missing compile definition to CMakeLists (#1748)	2023-06-08 10:02:48 +03:00
Concedo	6635f7efce	updated lite	2023-06-08 00:20:32 +08:00
Concedo	49a6be3d87	add llama metal compile flags as an option	2023-06-07 22:29:38 +08:00
Concedo	7b0707ff26	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile	2023-06-07 17:06:56 +08:00
Georgi Gerganov	5c64a0952e	k-quants : allow to optionally disable at compile time (#1734) * k-quants : put behind optional compile flag LLAMA_K_QUANTS * build : enable k-quants by default	2023-06-07 10:59:52 +03:00
Concedo	e78c675a6e	Merge branch 'master' into concedo_experimental # Conflicts: # README.md # flake.lock # flake.nix # ggml-opencl.cpp	2023-06-07 15:23:29 +08:00
jacobi petrucciani	5b57a5b726	flake : update to support metal on m1/m2 (#1724)	2023-06-07 07:15:31 +03:00
Georgi Gerganov	4dc62c545d	readme : add June roadmap	2023-06-07 07:15:08 +03:00
Willy Tarreau	35a84916fb	main: add the possibility to open the prompt cache read-only (#1640) The prompt cache constitutes a nice speed up when using the same prompt prefix across multiple evaluations, but when using it, it will also be updated, which is not always desirable. One use case is to have a large prompt containing some context and usage rules, and a second part containing variable data of the problem being studied. In this case it's desirable to be able to save the first part once, and to always reuse it as-is without updating it with the second part. The new argument --prompt-cache-ro enables this read-only mode on the prompt cache. The prompt's contents that match the cache are loaded from the cache but the rest is not modified. This allowed to reduce a total analysis time from 112s to 49.7s here, without having to backup and restore a copy of the prompt, which takes significant time at 500 MB. Signed-off-by: Willy Tarreau <w@1wt.eu>	2023-06-06 22:10:17 -04:00
Georgi Gerganov	2d7bf110ed	llama : fix vram_scratch var	2023-06-06 22:54:39 +03:00
Georgi Gerganov	2a4e41a086	llama : fix compile warnings	2023-06-06 22:41:53 +03:00
Johannes Gäßler	17366df842	Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703) * CUDA multi GPU + scratch ggml_cuda_compute_forward Tensor parallelism ggml_cuda_add ggml_cuda_rms_norm ggml_cuda_silu CUDA scratch buffer --main-gpu CLI option	2023-06-06 21:33:23 +02:00
Georgi Gerganov	44f906e853	metal : add f16 support	2023-06-06 20:21:56 +03:00
LostRuins	d5b111f53d	Clblast fixes + enhancements to save VRAM and offload more layers (#1675) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation * Clblast fixes + enhancements to save VRAM: 1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them. 2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer 3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it. * change max value size_t to use limits * removed flags from the CL pool malloc, apply code tidying suggestions.	2023-06-06 19:00:01 +02:00
Concedo	ed603dcafc	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # docs/BLIS.md # llama.cpp # tests/test-quantize-fns.cpp	2023-06-06 23:12:01 +08:00
Concedo	c046db5197	lite bugfixes, buffer size changes, fixed a topk bug.	2023-06-06 22:38:25 +08:00
Georgi Gerganov	2d43387daf	ggml : fix builds, add ggml-quants-k.o (close #1712, close #1710)	2023-06-06 10:18:03 +03:00
Georgi Gerganov	7ad7750c5c	gitignore : add .clang-tidy	2023-06-06 09:55:25 +03:00
Georgi Gerganov	7a74dee6b4	llama : temporary disable Q6_K output quantization (#1711)	2023-06-06 09:39:38 +03:00

1191 commits