llama.cpp

Author	SHA1	Message	Date
staviq	10151bee2e	server : support for saving templates in browser LocalStorage (#2486 ) * support for templates in browser LocalStorage * sync accepted #2409 fix from upstream * convert autosave invocation to useEffect * Apply suggestions from code review Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com> * Regen index.html.cpp, suggested from code review --------- Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>	2023-08-18 07:34:01 +08:00
Johannes Gäßler	0992a7b8b1	README: fix LLAMA_CUDA_MMV_Y documentation (#2647 )	2023-08-17 23:57:59 +02:00
Henri Vasserman	6ddeefad9b	[Zig] Fixing Zig build and improvements (#2554 ) * Fix zig after console.o was split * Better include and flag management * Change LTO to option	2023-08-17 23:11:18 +03:00
Kerfuffle	8dae7ce684	Add --cfg-negative-prompt-file option for examples (#2591 ) Add --cfg-negative-prompt-file option for examples	2023-08-17 07:29:44 -06:00
Georgi Gerganov	a73ccf1aa3	llama : replace (permute + reshape + view_1d) with (view_3d) (#2538 ) ggml-ci	2023-08-17 10:47:09 +03:00
drbh	7cf54e1f74	tests : adds simple llama grammar tests (#2618 ) * adds simple llama grammar tests * fix lint and add Makefile * 0 terminate code_points * avoid dangling pointers in candidate cleanup * cleanup grammar at end of test	2023-08-17 10:41:01 +03:00
Shouzheng Liu	a872a2b28e	ggml-alloc : fix discrepancy between measure&eval (#2639 ) The GGML memory allocator consistently places a tensor within the optimal-fit memory block, which is the smallest block capable of accommodating the tensor's size. During the measurement phase, the final block is generously sized, ensuring it never qualifies as the optimal-fit block as long as there exists another block capable of accommodating the tensor. Nevertheless, in the evaluation phase, the last block is constrained in size and could potentially qualify as the optimal-fit block. Consequently, there exists the possibility of a tensor being allocated to a different region during evaluation, leading to more memory fragmentation in our scratch buffer. This recent commit guarantees uniform behavior of the allocator across both the measurement and evaluation phases, eliminating discrepancies between the two.	2023-08-17 10:35:53 +03:00
Kolen Cheung	0919a0f73d	cmake : install ggml-meta.metal if LLAMA_METAL (#2449 )	2023-08-16 23:09:49 +03:00
Jhen-Jie Hong	ed53db86c3	metal : print error of load pipeline state (#2564 ) * metal : print error of load pipeline state * metal : return null if load pipeline failed	2023-08-16 23:09:03 +03:00
Shouzheng Liu	fc8ef549e5	metal : enable ggml-alloc (#2627 ) * metal: enable ggml-alloc Make ggml-alloc work with concurrently dispatch. * style-fix Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-08-16 23:08:28 +03:00
Shouzheng Liu	bf83bff674	metal : matrix-matrix multiplication kernel (#2615 ) * metal: matrix-matrix multiplication kernel This commit removes MPS and uses custom matrix-matrix multiplication kernels for all quantization types. This commit also adds grouped-query attention to support llama2 70B. * metal: fix performance degradation from gqa Integers are slow on the GPU, and 64-bit divides are extremely slow. In the context of GQA, we introduce a 64-bit divide that cannot be optimized out by the compiler, which results in a decrease of ~8% in inference performance. This commit fixes that issue by calculating a part of the offset with a 32-bit divide. Naturally, this limits the size of a single matrix to ~4GB. However, this limitation should suffice for the near future. * metal: fix bugs for GQA and perplexity test. I mixed up ne02 and nb02 in previous commit.	2023-08-16 23:07:04 +03:00
Georgi Gerganov	b5ffb2849d	scripts : add helper script to get wikitext	2023-08-15 10:05:25 +03:00
Jhen-Jie Hong	3ebb00935f	server : add missing /json-schema-to-grammar.mjs (#2616 ) fixes #2611	2023-08-15 06:14:14 +08:00
xaedes	3b5515bbe0	reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator with this loop order gradient checkpointing with allocator on 16 layer model saves 13% memory; 2 layer memory it saves 2% memory. the computation results are the same	2023-08-14 22:09:36 +02:00
xaedes	56228461c8	fix memory "leak" in optimizers each iteration a new cplan with new memory for work data was allocated. now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.	2023-08-14 21:12:02 +02:00
xaedes	3e6468b097	fix test when to create temporary backward graph temporary backward graph is only necessary when using checkpointing	2023-08-14 20:57:18 +02:00
xaedes	098654c277	only use ggml_allocr_alloc when tensor has NULL data and is no view	2023-08-14 20:57:18 +02:00
xaedes	faf3e21eaf	add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly	2023-08-14 20:50:09 +02:00
xaedes	6e280b24dc	remove unused forward_batch function	2023-08-14 19:02:12 +02:00
xaedes	3794dceb7f	remove unused train params: mem_compute1_gb & mem_compute2_gb mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)	2023-08-14 18:44:42 +02:00
xaedes	6f161c784b	remove trailing whitespace	2023-08-14 18:33:27 +02:00
xaedes	271e4d64b5	remove unused training parameters "use_scratch" and "use_unified"	2023-08-14 18:31:59 +02:00
xaedes	c954f41ca4	remove handwritten training functions	2023-08-14 18:30:50 +02:00
xaedes	fe788a1c7a	allocate graph on context using ggml_new_graph	2023-08-14 18:24:13 +02:00
xaedes	75baed230c	set names for tensors in unified train function for easier debugging	2023-08-14 18:17:14 +02:00
xaedes	3e99a8d653	format name of cloned tensors with " (clone)" suffix	2023-08-14 18:15:09 +02:00
xaedes	865c4cd3c1	integrate unified training function which may use memory allocator the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing	2023-08-14 18:12:58 +02:00
xaedes	4ed096c6b0	add training options whether to use allocator and/or unified training function	2023-08-14 18:10:02 +02:00
xaedes	d6c5b03858	fix ASSERT to work with zero layers	2023-08-14 18:08:19 +02:00
xaedes	38f4438c32	make sure some tensors are not reallocated by inserting new temporary nodes depending on them: output and parameter gradient tensors need to be available at the end of the graph execution parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration checkpoint tensors are allocated all together to reduce memory allocator fragmentation afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs	2023-08-14 18:07:16 +02:00
xaedes	9716eb8ef0	fix variable name and add missing boolean negation	2023-08-14 17:59:19 +02:00
xaedes	5884b43a62	add input tensors as checkpoints so that recursive tensor cloning of gradient checkpointing terminates on input tensors	2023-08-14 17:58:49 +02:00
xaedes	b2f1310196	swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`	2023-08-14 17:57:13 +02:00
xaedes	5a11b75875	fix variable names	2023-08-14 17:55:51 +02:00
xaedes	345f516f7c	correctly clone view tensors by setting data pointers without this the checkpointing would only work when being used together with memory allocator	2023-08-14 17:55:13 +02:00
xaedes	52c92c0a8c	terminate recursive tensor cloning when reaching tensor without src tensors	2023-08-14 17:53:36 +02:00
xaedes	0dd496c5e2	fix variable name and add missing type cast	2023-08-14 17:52:48 +02:00
xaedes	cfddc36be2	correctly clone reshape and permute operations by also cloning tensor->nb values	2023-08-14 17:52:15 +02:00
xaedes	d43741540b	don't use allocate hash_map on context because the context has no_alloc=True when using memory allocator resulting in NULL data pointers	2023-08-14 17:51:20 +02:00
xaedes	fc826c8ea8	in train function replace add_inplace by regular add because using add_inplace seems to result in different gradients	2023-08-14 17:49:22 +02:00
Jhen-Jie Hong	d783f7982e	metal : return null instead of exit(1) (#2573 )	2023-08-14 16:37:39 +03:00
Cheng Shao	d75561df20	server : add --numa support (#2524 )	2023-08-14 16:36:42 +03:00
Kamil Tomšík	348acf188c	llama : add missing enum keyword in function signatures (#2610 )	2023-08-14 16:35:16 +03:00
Johannes Gäßler	1cd06fa25e	CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596 )	2023-08-14 10:41:22 +02:00
Jhen-Jie Hong	2feb8934eb	server : fix default grammar by use empty string in the UI (#2604 )	2023-08-14 16:20:17 +08:00
Jhen-Jie Hong	5517d6e692	server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588 ) * server : implement json-schema-to-grammar.mjs by follow python impl * server : add grammar support in chat.mjs * server : implement grammar param in the UI * server : generate .hpp * server : remove trailing whitespaces * server : generate .hpp * server : fix sort of prop pairs * server : optimize regex & iteration	2023-08-14 15:16:54 +08:00
vxiiduu	f31b539714	Enhance Windows 7 and below compatibility. (#2592 ) * Enhance Windows 7 compatibility. * Clean away unnecessary preprocessor conditional	2023-08-13 20:59:16 -07:00
drbh	ee77efea2a	test : add simple grammar parsing tests (#2594 ) * adds simple grammar parsing tests * adds cassert header	2023-08-13 17:00:48 +03:00
Johannes Gäßler	f64d44a9b9	CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590 )	2023-08-13 00:24:45 +02:00
byte-6174	b19edd54d5	Adding support for llama2.c models (#2559 )	2023-08-12 01:17:25 +02:00
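
Note on a872a2b28e (ggml-alloc measure/eval fix): the message describes a best-fit allocator that can pick different blocks in the measurement and evaluation phases because the last free block is sized differently in each. The sketch below is a toy Python model of that situation, not the ggml-alloc code. The names best_fit and best_fit_last_resort, the block sizes, and the "last block is a last resort" policy are all assumptions made for illustration; the commit message itself only says that behavior is made uniform across both phases.

```python
def best_fit(free_blocks, size):
    """Return the index of the smallest free block that fits `size`, or None."""
    best = None
    for i, capacity in enumerate(free_blocks):
        if capacity >= size and (best is None or capacity < free_blocks[best]):
            best = i
    return best


def best_fit_last_resort(free_blocks, size):
    """Hypothetical policy: best fit among all blocks except the last one,
    falling back to the last block only if nothing else fits."""
    best = best_fit(free_blocks[:-1], size)
    if best is not None:
        return best
    last = len(free_blocks) - 1
    return last if free_blocks and free_blocks[last] >= size else None


# Same layout in both phases; only the size of the last block differs.
measure_blocks = [64, 10**9]  # measurement: last block is effectively unbounded
eval_blocks = [64, 48]        # evaluation: last block is limited to the real buffer

# Plain best fit disagrees between the two phases for a 40-byte tensor.
plain = (best_fit(measure_blocks, 40), best_fit(eval_blocks, 40))

# The last-resort policy picks the same block in both phases.
uniform = (best_fit_last_resort(measure_blocks, 40),
           best_fit_last_resort(eval_blocks, 40))
```

With these made-up sizes, plain best fit places the tensor in block 0 while measuring and in block 1 while evaluating, which is the kind of mismatch the commit message describes. The last-resort variant returns block 0 in both phases.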

1114 commits