Commit graph

820 commits

Author SHA1 Message Date
anon
bd81096927 fix typo in readme + don't ignore integers 2023-06-14 13:30:55 -03:00
Henri Vasserman
546f850796
Update examples/server/server.cpp 2023-06-14 17:41:58 +03:00
Randall Fitzgerald
6075d7862d
Merge pull request #23 from anon998/fix-linter-warnings
Fix linter warnings + stuff
2023-06-13 14:32:19 -04:00
anon
7a48ade7ef fix comment indentation 2023-06-13 14:46:40 -03:00
anon
7df316b728 fix linter warnings + make variables const 2023-06-13 14:28:52 -03:00
anon
575cf23862 remove json_indent variable 2023-06-13 14:21:40 -03:00
anon
99ef967d42 add static prefix to the other functions too 2023-06-13 14:17:22 -03:00
anon
1f3945236a remove old verbose variable
And expand macro to nothing when verbose is disabled with compilation flags.
2023-06-13 14:14:29 -03:00
Henri Vasserman
6627a02540
Allow overriding the server address 2023-06-13 13:36:31 +03:00
Henri Vasserman
b8b8a6ed00
Add log flush 2023-06-13 12:58:02 +03:00
Randall Fitzgerald
909970921e
Merge pull request #22 from anon998/bash-trim
Trim response and trim trailing space in prompt.
2023-06-12 21:06:50 -04:00
anon
9d564db9ae trim response and trim trailing space in prompt
Also add "-r" to read because of this:
https://www.shellcheck.net/wiki/SC2162
2023-06-12 21:30:33 -03:00
Randall Fitzgerald
6d72f0f070 Make chat shell script work by piping the content out of the subshell. 2023-06-12 19:44:53 -04:00
Randall Fitzgerald
fc78910bc3
Merge branch 'ggerganov:master' into master 2023-06-12 16:18:13 -04:00
Randall Fitzgerald
50e7c5434f
Merge pull request #21 from SlyEcho/server_refactor
Server refactor
2023-06-12 16:16:20 -04:00
Henri Vasserman
f344d090f7
streaming shell script 2023-06-12 22:49:08 +03:00
Kawrakow
74a6d922f1
Metal implementation for all k_quants (#1807)
* metal : improve q4_K

28.3 -> 26.0 ms/token by avoiding a branch in the
calculation of the scales.

* metal : small improvement for Q4_K

* metal : still optimizing Q4_K

This commit pushes it down to 25.3 ms / token.

The crazy idea of using 6 bits for the scales is really costly on
Metal: if I remove the bit fiddling necessary to make the block
scales, time drops almost to the 23 ms/token of Q4_0.

Before pushing the k-quants upstream I had a Q4_K variant that
used 8-bit scales. It wasn't more accurate, used 0.125 bits more per weight,
ran slightly slower on the CPU (due to the larger model size
and being memory bound there), and the difference was entirely
negligible under CUDA. So I decided to publish the version with 6-bit
scales. Perhaps I should reconsider and change to 8-bit scales?

* metal : some more optimizations

Q2_K: 25.4 ms/token
Q6_K: 27.3 ms/token
Q4_0: 22.8 ms/token
Q4_1: 23.1 ms/token

* metal : Q3_K support

Something is not quite right yet.

* metal : Q5_K support

Initial version achieves 31.2 ms/token, 210 GB/s

* metal : still not able to figure out why q3_K does not work

* Minor

* metal : yet another failed attempt to make q3_K work

* metal : optimize Q5_K

31.2 ms -> 27.8 ms.
250 GB/s.

* metal : q3_K still not working

Adding a heavily commented q3_K metal kernel to explain
my obviously faulty logic. Perhaps someone could spot the issue?

* metal : q3_K finally working

Not optimized at all.

What was the issue? The scales are not 4-byte aligned,
and I was accessing them with a uint32_t pointer.
When I tried that on CUDA, I got an error (illegal memory access)
and added a memcpy to a local array of 3 uint32_t's.
But on Metal it told me there is no memcpy, so I tried
accessing directly. There was no error, just garbage results.
At some point I did try accessing the scales with a uint16_t
pointer (the scales are for sure 2-byte aligned), but was
still getting garbage. I guess there must have been another bug.

Now access to scales is via a uint16_t pointer and, after starting
from scratch from the C dequantize function, it finally works.

* metal : Q3_K 1st optimization pass

* metal : Q3_K second optimization pass - 29.6 ms/token

* metal : Q3_K cleanup

* metal : fixed accidentally broken Q2_K

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-12 22:39:21 +03:00
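The alignment pitfall described in this commit generalizes well beyond Metal. Below is a minimal C++ sketch, assuming the Q3_K super-block layout of this era (QK_K = 256); the struct and helper names are illustrative, not the actual ggml code. Since a block is 32 + 64 + 12 + 2 = 110 bytes, back-to-back blocks in a buffer are only guaranteed 2-byte alignment, so casting the packed scales to a uint32_t pointer is an unaligned load: CUDA traps on it, while Metal silently returns garbage.

```cpp
#include <cstdint>
#include <cstring>

// Mirrors the Q3_K super-block layout: 32 + 64 + 12 + 2 = 110 bytes,
// so consecutive blocks are only 2-byte aligned.
struct block_q3K_like {
    uint8_t  hmask[32];  // high bit of each 3-bit quant
    uint8_t  qs[64];     // low 2 bits of the quants
    uint8_t  scales[12]; // packed 6-bit scales -- not 4-byte aligned in general
    uint16_t d;          // super-block scale (fp16 bits)
};

// The bug: reinterpreting the packed scales as 32-bit words.
//   const uint32_t * aux = (const uint32_t *) b->scales; // unaligned load

// The portable fix: copy the 12 scale bytes into aligned locals first.
inline void load_scales(const block_q3K_like * b, uint32_t aux[3]) {
    std::memcpy(aux, b->scales, 12); // safe for any alignment
}
```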
Henri Vasserman
429ed950af
move CPPHTTPLIB settings inside server
Since they aren't configurable and were missing from the Makefile.
2023-06-12 20:46:53 +03:00
slaren
e4caa8da59
ci : run when changing only the CUDA sources (#1800) 2023-06-12 20:12:47 +03:00
Henri Vasserman
28694f7ac9
add a simple bash script too 2023-06-12 19:53:13 +03:00
Henri Vasserman
fc4264d14a
api url 2023-06-12 18:43:40 +03:00
Henri Vasserman
1510337901
fix make flags propagation 2023-06-12 18:34:12 +03:00
Henri Vasserman
b91200a2e5
javascript chat update. 2023-06-12 18:34:01 +03:00
Henri Vasserman
13cf6929b7
more json changes and stop info 2023-06-12 17:46:16 +03:00
Henri Vasserman
dff11a14d2
json parsing improvements 2023-06-12 16:52:21 +03:00
Howard Su
58970a4c39
Leverage mmap for offloading tensors to GPU (#1597)
* Rebase to latest

* Show progress

* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-12 14:44:16 +02:00
Kawrakow
8c0a10e64d
metal : fix failure to load model (#1817)
The number of buffers in the ggml context was left uninitialized.
This leads to sporadic failures to load the model on
startup. It is actually strange that the failure occurred so
infrequently.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-12 14:31:36 +03:00
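Why the failure was only sporadic: an uninitialized field holds whatever bytes happen to be in memory, so the bug bites only when that garbage is nonzero. A hypothetical C++ sketch of the pattern (the names are not the actual ggml-metal structs):

```cpp
#include <cstdlib>

struct metal_context_like {
    int n_buffers;      // was read before ever being written
    // ... buffer table follows ...
};

metal_context_like * metal_init_like() {
    auto * ctx = static_cast<metal_context_like *>(std::malloc(sizeof(*ctx)));
    ctx->n_buffers = 0; // the fix: start the counter from a known value
    return ctx;
}
```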
Henri Vasserman
4148b9bd03
remove void 2023-06-12 10:28:17 +03:00
Randall Fitzgerald
eee8b28d36
Merge pull request #20 from SlyEcho/server_refactor
Logging changes
2023-06-11 15:17:46 -04:00
Kerfuffle
fa84c4b3e8
Fix issue where interactive mode crashes when input exceeds ctx size (#1789)
* Fix issue where interactive mode in the main example crashes when input exceeds ctx size

* Ensure the context size is at least 8 tokens in the main example.

Closes #1768
2023-06-11 08:19:17 -06:00
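Roughly what the two fixes amount to, as a hedged C++ sketch; the variable names are assumptions, not the actual main-example code:

```cpp
#include <cstdio>
#include <vector>

void guard_context(std::vector<int> & embd_inp, int & n_ctx) {
    // ensure the context is at least 8 tokens
    if (n_ctx < 8) {
        std::fprintf(stderr, "warning: minimum context size is 8\n");
        n_ctx = 8;
    }
    // truncate interactive input that would overflow the context
    const int max_inp = n_ctx - 4; // leave headroom for generation
    if ((int) embd_inp.size() > max_inp) {
        std::fprintf(stderr, "input too long: truncating %zu -> %d tokens\n",
                     embd_inp.size(), max_inp);
        embd_inp.resize(max_inp);
    }
}
```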
Henri Vasserman
6518f9c482
build settings 2023-06-11 16:32:53 +03:00
Kyle Liang
12b063f0ec
Fixed WSL cuda's OOM error (#1594)
* In the function , add the CUDA error bypass.

* Remove excess code and prints

---------

Co-authored-by: liang <liangmanlai@126.com>
2023-06-11 15:20:52 +02:00
Henri Vasserman
9612d12fbf
big logging update 2023-06-11 16:18:39 +03:00
Henri Vasserman
2c00bf855d
more formatting changes 2023-06-11 14:01:42 +03:00
Ryan Landay
31d2b5f4a4
Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798) 2023-06-11 12:38:53 +03:00
Georgi Gerganov
4de0334f5c
cmake : fix Metal build (close #1791) 2023-06-10 22:56:53 +03:00
Artyom Lebedev
3f1223155a
k-quants : GCC12 compilation fix (#1792) 2023-06-10 22:51:36 +03:00
Andrei
303f5809f1
metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782)
* Fix issue with ggml-metal.metal path

* Add ggml-metal.metal as a resource for llama target

* Update flake.nix metal kernel substitution
2023-06-10 17:47:34 +03:00
Aisuko
059e99066d
doc : fix wrong address of BLIS.md (#1772)
Signed-off-by: Aisuko <urakiny@gmail.com>
2023-06-10 17:08:11 +03:00
Randall Fitzgerald
bac0ddb58f
Merge branch 'ggerganov:master' into master 2023-06-10 06:11:31 -04:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed so that operators like ggml_view() can store their
parameters in the ggml context's memory and not have them discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
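A sketch of the mechanism, assuming the ggml C API of this era; the accessors (ggml_get_no_alloc / ggml_set_no_alloc) and the save/force/restore placement are assumptions here, since the real change lives inside ggml's tensor-creation path:

```cpp
#include "ggml.h"

// Opt tensors (e.g. the offset parameter of ggml_view_1d) need a few real
// bytes inside the context, even when it was opened with no_alloc = true
// for the big weight tensors.
ggml_tensor * view_with_params(ggml_context * ctx, ggml_tensor * src) {
    const bool saved = ggml_get_no_alloc(ctx); // accessor usage assumed
    ggml_set_no_alloc(ctx, false);             // force real allocation
    ggml_tensor * view = ggml_view_1d(ctx, src, src->ne[0], 0);
    ggml_set_no_alloc(ctx, saved);
    return view;
}
```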
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
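The core idea behind requantizing, as a hypothetical C++ sketch (the codec signatures stand in for ggml's per-type row functions and are assumptions): a quantized row cannot be re-encoded directly, so it is first dequantized into an f32 staging buffer and then quantized to the target type; the commit additionally threads this loop across rows.

```cpp
#include <cstddef>
#include <vector>

// Assumed signatures standing in for ggml's per-type row codecs.
using dequant_fn = void (*)(const void * src, float * dst, size_t n);
using quant_fn   = void (*)(const float * src, void * dst, size_t n);

// Requantize one row: old quant -> f32 staging buffer -> new quant.
void requantize_row(const void * src, void * dst, size_t n,
                    dequant_fn dequant, quant_fn quant) {
    std::vector<float> tmp(n);
    dequant(src, tmp.data(), n);
    quant(tmp.data(), dst, n);
}
```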
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Randall Fitzgerald
d6d263fc4f
Merge pull request #19 from lesaun/master
Clarify build instructions in README.
2023-06-09 23:11:02 -04:00
Lesaun Harvey
917540ce43
Clarify build instructions in README. 2023-06-09 19:06:09 -07:00
Randall Fitzgerald
1a9141b6c3 Remove model assignment in main(). Clarify stop in README.
The model now loads the default path from gptparams ("models/7B/ggml-model.bin")
2023-06-09 16:29:10 -04:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
Johannes Gäßler
ae9663f188
Windows nvcc workaround (#1753)
Fix gibberish output on Windows when using CUDA
2023-06-09 13:58:15 +02:00