Commit graph

794 commits

Henri Vasserman
f344d090f7
streaming shell script 2023-06-12 22:49:08 +03:00
Henri Vasserman
429ed950af
move CPPHTTPLIB settings inside server
Since they aren't configurable and were missing from the Makefile.
2023-06-12 20:46:53 +03:00
Henri Vasserman
28694f7ac9
add a simple bash script too 2023-06-12 19:53:13 +03:00
Henri Vasserman
fc4264d14a
api url 2023-06-12 18:43:40 +03:00
Henri Vasserman
1510337901
fix make flags propagation 2023-06-12 18:34:12 +03:00
Henri Vasserman
b91200a2e5
javascript chat update. 2023-06-12 18:34:01 +03:00
Henri Vasserman
13cf6929b7
more json changes and stop info 2023-06-12 17:46:16 +03:00
Henri Vasserman
dff11a14d2
json parsing improvements 2023-06-12 16:52:21 +03:00
Henri Vasserman
4148b9bd03
remove void 2023-06-12 10:28:17 +03:00
Randall Fitzgerald
eee8b28d36
Merge pull request #20 from SlyEcho/server_refactor
Logging changes
2023-06-11 15:17:46 -04:00
Henri Vasserman
6518f9c482
build settings 2023-06-11 16:32:53 +03:00
Henri Vasserman
9612d12fbf
big logging update 2023-06-11 16:18:39 +03:00
Henri Vasserman
2c00bf855d
more formatting changes 2023-06-11 14:01:42 +03:00
Randall Fitzgerald
bac0ddb58f
Merge branch 'ggerganov:master' into master 2023-06-10 06:11:31 -04:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed to make operators like ggml_view() be able to store their
parameters in the ggml context's memory and not get discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
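A rough usage sketch of the requantization support described above; the --allow-requantize flag name is an assumption based on the PR, not quoted from the commit message, and the file names are placeholders:

```sh
# Hypothetical: requantize an already-quantized Q8_0 model down to Q4_0.
# Before this change, quantize only accepted f16/f32 models as input.
./quantize --allow-requantize models/7B/ggml-model-q8_0.bin models/7B/ggml-model-q4_0.bin q4_0
```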
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Randall Fitzgerald
d6d263fc4f
Merge pull request #19 from lesaun/master
Clarify build instructions in README.
2023-06-09 23:11:02 -04:00
Lesaun Harvey
917540ce43
Clarify build instructions in README. 2023-06-09 19:06:09 -07:00
Randall Fitzgerald
1a9141b6c3
Remove model assign in main(). Clarified stop in README.
The model will now load the default from gptparams ("models/7B/ggml-model.bin")
2023-06-09 16:29:10 -04:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
Johannes Gäßler
ae9663f188
Windows nvcc workaround (#1753)
Fix gibberish output on Windows when using CUDA
2023-06-09 13:58:15 +02:00
Randall Fitzgerald
7cdeb08483
More formatting cleanup 2023-06-09 05:12:16 -04:00
Randall Fitzgerald
889d9044bf
Merge branch 'master' of https://github.com/digiwombat/llama.cpp 2023-06-09 04:57:21 -04:00
Randall Fitzgerald
7580427837
Resolving some review comments 2023-06-09 04:56:31 -04:00
Randall Fitzgerald
23a1b1841e
Merge branch 'ggerganov:master' into master 2023-06-09 04:51:20 -04:00
Randall Fitzgerald
cc2b33649d
Missed a pair of catch statements for formatting. 2023-06-09 04:50:31 -04:00
Randall Fitzgerald
a9c34779f6
Spaces to 4 and other code style cleanup. Notes in README. 2023-06-09 04:47:18 -04:00
Georgi Gerganov
b33dee282f
metal : fix build "tanhf" -> "tanh" 2023-06-09 11:11:04 +03:00
AT
92f44ff7f7
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
2023-06-09 11:00:51 +03:00
Kawrakow
245fc3c37d
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0

Avoid copying into local uchar4 and float4.

* metal : 17% faster Q4_0

Use 64 threads in a thread group.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-09 10:39:59 +03:00
Kawrakow
72ff5282bf
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation

27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.

The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).

* Fixing merge conflicts

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 22:28:21 +03:00
Henri Vasserman
ccd85e0a6b
Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-08 22:17:46 +03:00
Henri Vasserman
61befcba7b
Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-08 22:14:43 +03:00
Georgi Gerganov
0bf7cf1b29
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
2023-06-08 20:48:14 +03:00
le.chang
8432d4d9f7
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) 2023-06-08 19:47:56 +03:00
Kawrakow
0f291e1f65
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

* Metal implementation for Q6_K

Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.

We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.

* clang-tidy : add config back

* Much better Q6_K implementation for metal

28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 19:46:22 +03:00
qingfengfenga
8fc8179919
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
2023-06-08 00:58:53 -07:00
Steven Roussey
b50b570ed9
ggml : fix fprintf warnings (#1720) 2023-06-08 10:12:28 +03:00
Georgi Gerganov
53aba3f393
clang-tidy : restore dot file from accidental deletion 2023-06-08 10:09:08 +03:00
Kawrakow
4161bdc04d
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 10:08:23 +03:00
johnson442
0035858273
k-quants : add missing compile definition to CMakeLists (#1748) 2023-06-08 10:02:48 +03:00
Randall Fitzgerald
64a06536cb
Merge remote-tracking branch 'upstream/master'
# Resolved Conflicts:
#	examples/server/README.md
#	examples/server/server.cpp
2023-06-07 12:23:49 -04:00
Georgi Gerganov
5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default
2023-06-07 10:59:52 +03:00
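Since LLAMA_K_QUANTS is now an optional compile flag, it can be toggled at configure time. A minimal sketch using CMake (the equivalent Makefile variable is not shown here and may differ):

```sh
# Configure a build with k-quants disabled (they are enabled by default after this change).
mkdir -p build && cd build
cmake .. -DLLAMA_K_QUANTS=OFF
cmake --build . --config Release
```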
jacobi petrucciani
5b57a5b726
flake : update to support metal on m1/m2 (#1724) 2023-06-07 07:15:31 +03:00
Georgi Gerganov
4dc62c545d
readme : add June roadmap 2023-06-07 07:15:08 +03:00
Willy Tarreau
35a84916fb
main: add the possibility to open the prompt cache read-only (#1640)
The prompt cache constitutes a nice speed up when using the same prompt
prefix across multiple evaluations, but when using it, it will also be
updated, which is not always desirable. One use case is to have a large
prompt containing some context and usage rules, and a second part
containing variable data of the problem being studied. In this case it's
desirable to be able to save the first part once, and to always reuse it
as-is without updating it with the second part.

The new argument --prompt-cache-ro enables this read-only mode on the
prompt cache. The prompt's contents that match the cache are loaded
from the cache but the rest is not modified. This made it possible to reduce
the total analysis time from 112s to 49.7s here, without having to back up
and restore a copy of the prompt, which takes significant time at 500 MB.

Signed-off-by: Willy Tarreau <w@1wt.eu>
2023-06-06 22:10:17 -04:00
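A sketch of how the --prompt-cache-ro flag described above might be combined with --prompt-cache; the file names are placeholders, and -m / -f are the usual main options for the model and prompt file:

```sh
# First run: evaluate the shared prefix once and store it in the cache.
./main -m models/7B/ggml-model.bin --prompt-cache prefix.cache -f shared_prefix.txt

# Subsequent runs: load the matching prefix from the cache in read-only mode,
# so the cache file is not rewritten even though the prompt adds variable data.
./main -m models/7B/ggml-model.bin --prompt-cache prefix.cache --prompt-cache-ro -f full_prompt.txt
```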
Georgi Gerganov
2d7bf110ed
llama : fix vram_scratch var 2023-06-06 22:54:39 +03:00