llama.cpp

Author	SHA1	Message	Date
mgroeber9110	c2df36d60d	llama : consistently catch and throw only exceptions deriving from std::exception (#1599) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 23:24:29 +03:00
kiltyj	9d0693bce3	metal : use shared buffers between CPU and GPU (#1696) * Use MTLDevice.newBufferWithBytesNoCopy to share buffers between CPU and GPU * Page-align buffers used by Metal * Remove trailing whitespace * Only import unistd.h for Metal builds * metal : remove unnecessary copies --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 23:24:04 +03:00
grahameth	efe0507632	ggml : fix internal overflow in ggml_time_us on Windows (#1702) Co-authored-by: grahameth <->	2023-06-05 23:11:49 +03:00
Georgi Gerganov	e7fe66e670	ci : disable auto tidy (#1705)	2023-06-05 23:05:05 +03:00
Kawrakow	99009e72f8	ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684) * Starting to add k-quantization to ggml I think it is better to have quantization separate from ggml. For now just adding the k-quants there, but it would be better to also factor out the existing ggml quantizations. * Adding Q3_K and Q8_K (de)-quantization * Q3_K now working on CUDA and AVX2/scalar CUDA is not ideal - ~50% slower than Q4_0 for single token prediction, about the same in batch mode (perplexity). CPU single token is ~55 ms (on Ryzen 7950X). * Some improvement for Q3_K on CUDA It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0. * Some more CUDA optimizations for Q3_K Single token is now 20.5 ms/token (~20% slower than Q4_0). Perplexity is on par with Q4_0. * Adding Q4_K - scalar, AVX2, CUDA Performance is the same or perhaps very slightly better than Q4_0 on the CPU. On the GPU, single token prediction is ~10% better than Q4_0, batch mode (perplexity) is about the same. * Adding Q6_K - scalar, AVX2, CUDA Performance is ~40% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 6-bit model is ~44% larger than the 4-bit. On the GPU, single token prediction is ~6% lower than Q4_0, batch mode (perplexity) is even closer (but still slower). * Adding Q5_K - scalar, AVX2, CUDA Performance is ~20% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 5-bit model is ~22% larger than the 4-bit. On the GPU, performance is about the same as Q4_0 for both single token and batch prediction. * Per convention, all QX_K quantizations use Q5_K for output.weight * Adding quantization mixes * Quantization mixes: didn't quite get what I wanted in the last commit * Q4_K dot product for ARM_NEON * Q6_K dot product for ARM_NEON * Q5_K dot product for ARM_NEON * Adding Q3_K dot for ARM_NEON It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is quite a bit faster than Q4_K. * A very slightly faster ARM_NEON Q3_K dot * Adding Q2_K - just CUDA for now Token prediction is pretty good - about 15.5 ms on an RTX 4080. Perplexity is about the same as Q4_K. * Adding scalar and AVX2 Q2_K dot * Adding ARM_NEON Q2_K dot About the same performance as Q4_K. * A slightly faster ARM_NEON Q2_K dot Single token prediction is now ~36 ms on M2 Max. The code is much simpler too. * Fixed bug in Q2_K CUDA dot product kernel Strangely enough, for the few prompts I tried with the 7B model the responses looked perfectly reasonable. Only realized something is not quite right when I tried the larger models and started getting nonsense back. In any case, Q2_K single token evaluation times on an RTX 4080 in a Ryzen7950X box using CUDA and model fully loaded on the GPU are ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B. The max number of layers that fit in VRAM for the 65B is 32. With that, we get ~330 ms per token, which is not that much faster than just running on the CPU (~470 ms per token). * Don't print zeros/NaNs when no count histogram has been collected * A 10% faster CUDA vector dot kernel for Q3_K Q3_K is now running at ~18.5 ms / token on CUDA, so the gap to Q4_0 is only 10%. It seems memory access pattern is more important for performance than the amount of computation the kernel does. * A slightly faster Q4_K AVX2 dot product For perplexity, where we are less memory bound, time per pass drops by ~5%. Barely measurable difference for single token prediction. * A slightly faster ARM_NEON Q4_K dot product * Minor * Fix quantization error test We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit quantization variants. * Fix docker build I have been sloppy with vector reinterpret casts on ARM_NEON. It seems clang is very forgiving in that regard.
* Added forgotten ggml.o dependence on k_quants.h to the Makefile * Had unintentionally committed the Makefile with -Ofast enabled * ggml : rename k_quants -> ggml-quants-k, use lowercase in code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 22:56:18 +03:00
Henri Vasserman	5220a991a5	Increase 3B scratch buffers. (#1698) The 128 MB was too optimistic. Too bad it is not dynamically computed.	2023-06-05 13:43:08 +03:00
Georgi Gerganov	d1f563a743	llama : fix Metal KV cache sync (close #1695 )	2023-06-05 10:19:03 +03:00
Georgi Gerganov	827f5eda91	readme : update hot topics	2023-06-04 23:38:19 +03:00
Georgi Gerganov	ecb217db4f	llama : Metal inference (#1642) * mtl : export the LLaMA computation graph * ci : disable temporary * mtl : adapt the MNIST example as starter * mtl : no need for mtl-export tool, add cli arg for main instead * mtl : export just a small part of the graph for now to make it easier * mtl : move MSL code into separate file for easy editing * mtl : initial get_rows_q4_0 kernel * mtl : confirmed get_rows_q4_0 is working correctly * mtl : add rms_norm kernel + confirm working * mtl : add mul kernel + confirm working * mtl : initial mul_mat Q4 kernel (wrong results) * mtl : mul_mat fixes (still wrong) * mtl : another mul_mat Q4 (still does not work) * mtl : working mul_mat q4 * ggml : fix handling of "view" ops in ggml_graph_import() * mtl : add rope kernel * mtl : add reshape and transpose handling * ggml : store offset as opt arg for ggml_view_xd() operators * mtl : add cpy kernel + handle view ops * mtl : confirm f16 x f32 attention mul mat * mtl : add scale kernel * mtl : add diag_mask_inf kernel * mtl : fix soft_max kernel * ggml : update ggml_nbytes() to handle non-contiguous tensors * mtl : verify V tensor contents * mtl : add f32 -> f32 cpy kernel * mtl : add silu kernel * mtl : add non-broadcast mul kernel * mtl : full GPU inference of the computation graph * mtl : optimize rms_norm and soft_max kernels * mtl : add f16 mat x f32 vec multiplication kernel * mtl : fix bug in f16 x f32 mul mat + speed-up computation * mtl : faster mul_mat_q4_0_f32 kernel * mtl : fix kernel signature + roll inner loop * mtl : more threads for rms_norm + better timing * mtl : remove printfs from inner loop * mtl : simplify implementation * mtl : add save/load vocab to ggml file * mtl : plug Metal inference into llama.cpp (very quick-n-dirty) * mtl : make it work with main example Lots of hacks but at least now it generates text * mtl : preparing for merge * mtl : clean-up ggml mtl interface + support scratch / inplace * mtl : remove temp / debug code * metal : final refactoring and simplification * Revert "ci : disable temporary" This reverts commit `98c267fc77`. * metal : add comments * metal : clean-up stuff, fix typos * readme : add Metal instructions * readme : add example for main	2023-06-04 23:34:30 +03:00
0cc4m	dcb2ed4826	OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation	2023-06-04 08:12:05 +02:00
Henri Vasserman	d8bd0013e8	Add info about CUDA_VISIBLE_DEVICES (#1682)	2023-06-03 16:35:20 +03:00
Jiří Podivín	b5c85468a3	Docker: change to calling convert.py (#1641) Deprecation disclaimer was added to convert-pth-to-ggml.py	2023-06-03 15:11:53 +03:00
Evan Jones	136476e898	Fix prompt cache saving and chat-persistent rollover (#1678) * Fix prompt cache saving and chat-persistent rollover (fixes #1670) * clang-tidy Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2023-06-03 07:28:45 -04:00
Randall Fitzgerald	df2ecc942a	Merge pull request #18 from anon998/update-readme Update readme + parse --mlock and --no-mmap	2023-06-02 17:04:25 -04:00
anon	98ae2de017	parse --mlock and --no-mmap + format	2023-06-02 17:54:46 -03:00
anon	05a5a485b8	make help text load faster	2023-06-02 17:52:04 -03:00
anon	a6ed390cc6	update readme	2023-06-02 17:48:29 -03:00
anon	e1e2be2146	remove --keep from help text	2023-06-02 17:47:42 -03:00
Randall Fitzgerald	5758e9f09b	Removed embedding from flags.	2023-06-02 08:31:12 -07:00
Randall Fitzgerald	310bf61496	Merge pull request #17 from SlyEcho/server_refactor improve docs and example	2023-06-02 11:25:01 -04:00
Randall Fitzgerald	de6df486e9	Removed embedding from README	2023-06-02 08:24:46 -07:00
Henri Vasserman	bcd616700e	improve docs and example	2023-06-02 18:06:02 +03:00
digiwombat	7cebe2eaf8	Merge branch 'master' of https://github.com/digiwombat/llama.cpp	2023-06-02 10:06:04 -04:00
digiwombat	16e1c9813a	Removed the embedding api endpoint and associated code.	2023-06-02 10:05:52 -04:00
Randall Fitzgerald	4dd72fc6e4	Merge pull request #16 from anon998/fix-log-json Replace invalid characters instead of crashing.	2023-06-02 09:43:29 -04:00
anon	41bb71bde7	replace invalid characters instead of crashing While logging the requests.	2023-06-02 10:37:13 -03:00
digiwombat	3ff27d30e3	Fixed up a few things in embedding mode.	2023-06-02 09:20:53 -04:00
Randall Fitzgerald	28cc0cdc50	Merge pull request #15 from SlyEcho/server_refactor Improve long input truncation and add more verbose logging	2023-06-02 08:47:54 -04:00
Henri Vasserman	3df0192804	improve long input truncation and add more verbose logging	2023-06-02 15:19:05 +03:00
Randall Fitzgerald	1bd52c8627	Merge branch 'ggerganov:master' into master	2023-06-02 07:31:55 -04:00
Randall Fitzgerald	f5d5e7020d	Merge pull request #14 from anon998/do-completion-update Trim partial stopping strings when not streaming and move multibyte check.	2023-06-02 07:30:53 -04:00
anon	f820740dad	move multibyte check to doCompletion	2023-06-02 08:27:23 -03:00
anon	8f9e546b51	trim partial stopping strings when not streaming	2023-06-02 08:25:31 -03:00
Randall Fitzgerald	bebea657cb	Merge pull request #13 from anon998/small-fixes Small fixes.	2023-06-02 06:53:10 -04:00
anon998	abb7782745	Merge branch 'master' into small-fixes	2023-06-02 10:35:06 +00:00
Henri Vasserman	88cc7bb6f7	Stuff with logits	2023-06-02 13:29:57 +03:00
anon	47efbb5cf3	use std::isinf to check if ignore_eos is active	2023-06-02 07:19:21 -03:00
anon	2932db15a3	avoid creating element in logit_bias accidentally	2023-06-02 06:59:11 -03:00
anon	a8a9f19689	small fixes	2023-06-02 06:01:10 -03:00
anon	49dce94885	make types match gpt_params exactly	2023-06-02 06:01:10 -03:00
anon	1488a0f528	make functions that never return false void	2023-06-02 06:00:48 -03:00
anon	ebfead6e5a	remove unused variables	2023-06-02 05:45:57 -03:00
anon	731ecc0d1b	fix typo	2023-06-02 05:45:16 -03:00
Henri Vasserman	0bc047730f	Apply suggestions from code review Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2023-06-02 10:29:09 +03:00
Randall Fitzgerald	d29b6d5f55	Merge pull request #12 from anon998/clear-logit-bias Clear logit bias between requests.	2023-06-01 08:58:35 -04:00
anon	8cbc4be6c2	clear logit_bias between requests + print	2023-06-01 09:49:50 -03:00
anon	6025476e39	default penalize_nl back to true	2023-06-01 09:49:16 -03:00
anon	49a18bdd14	remove unused parameter warning	2023-06-01 09:41:35 -03:00
Randall Fitzgerald	af711263ae	Merge pull request #11 from SlyEcho/server_refactor Server refactor	2023-06-01 08:10:55 -04:00
Randall Fitzgerald	797155a0d1	Merge pull request #10 from cirk2/master Add Options endpoints and Access-Control-Allow-Headers to satisfy CORS	2023-06-01 08:10:26 -04:00

784 commits