llama.cpp

Author	SHA1	Message	Date
xaedes	27c24ffa1b	add option to save finetune output every N iterations	2023-08-20 20:16:46 +02:00
xaedes	d61ed6b431	mixing multiple LORA adapters is now possible: pass more than one '--lora FNAME' argument to apply more than one LORA. use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.	2023-08-20 18:48:35 +02:00
xaedes	37dfb544aa	resolve todo allocator will only make it inplace when they are of the same type	2023-08-18 21:22:41 +02:00
xaedes	3e47890760	remove unnecessary src tensor from ggml_repeat & ggml_repeat_back: we don't need data of src[1] for computation, only to setup the correct output shape. remove dependency on src[1], so that allocator can work more freely. the computational graph is still completely determined, because the output shape is naturally included.	2023-08-18 20:51:00 +02:00
xaedes	65b0561637	remove unnecessary src tensor from ggml_get_rows_back: we don't need data of src[2] for computation, only to setup the correct output shape. remove dependency on src[2], so that allocator can work more freely. the computational graph is still completely determined, because the output shape is naturally included. this is similar to how ggml_reshape does it.	2023-08-18 20:25:42 +02:00
xaedes	6c98640035	bug fix: make sure finetune input gradient is allocated at begin and kept until end	2023-08-18 20:10:04 +02:00
xaedes	63cb374a99	change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4	2023-08-18 19:08:15 +02:00
xaedes	7a63d429af	adjust maximal values to support finetuning 3B models	2023-08-18 17:32:31 +02:00
xaedes	113c90f1cc	improve optimization iteration prints	2023-08-18 16:24:42 +02:00
xaedes	a0c2752ba7	remove debug prints and function to compute tensor data hash	2023-08-18 16:24:13 +02:00
xaedes	011f47f972	remove trailing whitespace	2023-08-18 16:02:46 +02:00
xaedes	f358204a5f	avoid keeping in memory ALL of the gradients. The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previously freed memory location might lead to non-zero input gradients. During ggml_compute_backward the gradients are built stepwise by adding or subtracting new values, starting from an OP_NONE tensor which needs to contain zero-values. This requires the graph reset. To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There, instead of using add (or sub or similar), I test whether the existing gradient to be changed is a zero-valued tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added; otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and, more importantly, they won't need a unique memory slot just to make them zero.	2023-08-18 16:01:43 +02:00
xaedes	a252111b45	fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors	2023-08-18 15:03:57 +02:00
xaedes	44526cb261	make sure base model tensors data cannot be used in viewable operations: memory allocator would try to make lora application inplace on base model tensors. since those are memory mapped this will result in memory access violations	2023-08-18 15:03:17 +02:00
xaedes	0bb897c82a	bug fix: actually use result type passed to ggml_add_cast	2023-08-18 00:59:06 +02:00
xaedes	714fec06ee	use ggml_add_cast in finetuning lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models	2023-08-16 23:53:12 +02:00
xaedes	9198b24e4e	add ggml_add_cast API function this function works like ggml_add, but accepts a data type for the resulting tensor. only supported for quantized src0 input.	2023-08-16 23:50:46 +02:00
xaedes	f80e245d7b	add lora finetune support on quantized base model tensors	2023-08-16 22:08:44 +02:00
xaedes	83a4ad7986	remove trailing whitespace	2023-08-16 22:05:41 +02:00
xaedes	83cb9ed4f5	implement ggml_compute_forward_out_prod_q_f32	2023-08-16 22:01:06 +02:00
xaedes	79ad888768	remove unused call to not existing llama_get_layer_from_model	2023-08-16 21:56:36 +02:00
xaedes	1151653b15	replace llama API functions to get model tensors by one function to get model tensor by name LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);	2023-08-16 21:36:40 +02:00
xaedes	39a2d15461	avoid stack overflow resulting from big ggml_cgraph: replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand	2023-08-16 16:42:25 +02:00
xaedes	0ab2507ce5	fix names of lora tensors	2023-08-16 16:41:20 +02:00
xaedes	620275361d	add debug prints for training memory improvements	2023-08-16 16:23:21 +02:00
xaedes	be7e564b11	bug fixes to make finetune compile; automatic allocator does not work yet	2023-08-16 16:21:43 +02:00
xaedes	50b1e66200	remove const model and layer arguments in API functions for accessing model tensors	2023-08-16 16:21:02 +02:00
xaedes	28ee0c8583	first draft for LORA finetune training	2023-08-16 15:31:04 +02:00
xaedes	c0a372fd3d	add API functions to access remaining model parameters: mult, head and rot	2023-08-16 15:30:31 +02:00
xaedes	9eb1ef8653	move and remove code	2023-08-15 14:03:02 +02:00
xaedes	5e059ace25	add stub example for finetuning, based on train-text-from-scratch	2023-08-15 13:54:28 +02:00
xaedes	316b0707f4	add API functions to access llama model tensors	2023-08-15 13:53:13 +02:00
xaedes	3b5515bbe0	reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator. with this loop order gradient checkpointing with allocator on a 16 layer model saves 13% memory; on a 2 layer model it saves 2% memory. the computation results are the same	2023-08-14 22:09:36 +02:00
xaedes	56228461c8	fix memory "leak" in optimizers: each iteration a new cplan with new memory for work data was allocated. now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.	2023-08-14 21:12:02 +02:00
xaedes	3e6468b097	fix test when to create temporary backward graph temporary backward graph is only necessary when using checkpointing	2023-08-14 20:57:18 +02:00
xaedes	098654c277	only use ggml_allocr_alloc when tensor has NULL data and is no view	2023-08-14 20:57:18 +02:00
xaedes	faf3e21eaf	add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly	2023-08-14 20:50:09 +02:00
xaedes	6e280b24dc	remove unused forward_batch function	2023-08-14 19:02:12 +02:00
xaedes	3794dceb7f	remove unused train params: mem_compute1_gb & mem_compute2_gb. mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions. mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)	2023-08-14 18:44:42 +02:00
xaedes	6f161c784b	remove trailing whitespace	2023-08-14 18:33:27 +02:00
xaedes	271e4d64b5	remove unused training parameters "use_scratch" and "use_unified"	2023-08-14 18:31:59 +02:00
xaedes	c954f41ca4	remove handwritten training functions	2023-08-14 18:30:50 +02:00
xaedes	fe788a1c7a	allocate graph on context using ggml_new_graph	2023-08-14 18:24:13 +02:00
xaedes	75baed230c	set names for tensors in unified train function for easier debugging	2023-08-14 18:17:14 +02:00
xaedes	3e99a8d653	format name of cloned tensors with " (clone)" suffix	2023-08-14 18:15:09 +02:00
xaedes	865c4cd3c1	integrate unified training function which may use memory allocator the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing	2023-08-14 18:12:58 +02:00
xaedes	4ed096c6b0	add training options whether to use allocator and/or unified training function	2023-08-14 18:10:02 +02:00
xaedes	d6c5b03858	fix ASSERT to work with zero layers	2023-08-14 18:08:19 +02:00
xaedes	38f4438c32	make sure some tensors are not reallocated by inserting new temporary nodes depending on them: output and parameter gradient tensors need to be available at the end of the graph execution. parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration. checkpoint tensors are allocated all together to reduce memory allocator fragmentation. afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs	2023-08-14 18:07:16 +02:00
xaedes	9716eb8ef0	fix variable name and add missing boolean negation	2023-08-14 17:59:19 +02:00
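
The zero-gradient bookkeeping described in commit f358204a5f can be illustrated with a small sketch. This is not ggml code: it is written in Python for brevity, and `Grad`, `add_or_set`, and `sub_or_neg` are hypothetical names chosen for illustration. A Python set stands in for the hash table of original OP_NONE gradient tensors, and a `Grad` whose `data` is `None` stands in for a zero tensor that never receives its own memory slot.

```python
class Grad:
    """Hypothetical stand-in for a gradient tensor.

    data is None for a zero placeholder that was never allocated.
    """
    def __init__(self, data=None):
        self.data = data


def add_or_set(grad, delta, zero_table):
    """Accumulate delta into grad without needing a zeroed buffer."""
    if grad in zero_table:
        # grad is known to be zero, and 0 + delta == delta, so the
        # placeholder is replaced by delta and its memory is never read.
        return delta
    # Regular, non-inplace add for gradients that already hold values.
    return Grad([g + d for g, d in zip(grad.data, delta.data)])


def sub_or_neg(grad, delta, zero_table):
    """Same idea for subtraction: 0 - delta == -delta."""
    if grad in zero_table:
        return Grad([-d for d in delta.data])
    return Grad([g - d for g, d in zip(grad.data, delta.data)])


# The placeholder is registered once, when the backward graph is built.
placeholder = Grad()
zero_table = {placeholder}  # objects hash by identity by default

# First contribution replaces the placeholder, second one is a real add.
g = add_or_set(placeholder, Grad([1.0, 2.0]), zero_table)
g = add_or_set(g, Grad([0.5, 0.5]), zero_table)
print(g.data)  # [1.5, 2.5]
```

Because the placeholder is only ever looked up and never read, nothing in this sketch has to reset it to zero before an iteration, which is the property the commit message relies on to drop the per-gradient memory slots.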


1052 commits