Commit graph

1416 commits

Author SHA1 Message Date
xaedes
bef1e97875
move common opt_callback into common/train 2023-09-16 18:54:57 +02:00
xaedes
e9758ae1d2
move common train params into common/train 2023-09-16 18:45:59 +02:00
xaedes
ee27333b16
move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so they can't yet be moved to train.h|cpp
2023-09-16 17:50:16 +02:00
xaedes
a8c8907c62
move train state into struct train_state 2023-09-16 17:30:38 +02:00
xaedes
9f4b1bf88d
move common train functions into common/train.[h|cpp] 2023-09-16 16:17:13 +02:00
xaedes
00b656f6db
remove lbfgs related train parameters 2023-09-16 15:59:46 +02:00
xaedes
ab56b63b27
update train-text-from-scratch with tokenization, sample selection and shuffling from finetune 2023-09-15 23:45:54 +02:00
xaedes
cc60b3f639
remove outcommented old code 2023-09-15 23:45:05 +02:00
xaedes
4f2ce91b9e
add static keywords 2023-09-15 23:44:53 +02:00
xaedes
76804fab1d
exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32 2023-09-14 22:19:39 +02:00
xaedes
d88dae2980
block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized

roughly doubles the flops of out-prod
2023-09-14 19:50:02 +02:00
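[editor's note] As a rough illustration of the block-tiling idea from the commit above (not the actual ggml kernel; the tile sizes and the function name below are placeholders, not the empirically tuned values):

```cpp
#include <algorithm>

// Illustrative sketch of block tiling for an out_prod-style accumulation
// C[i][j] += sum_k A[i][k] * B[j][k]. TILE_I/TILE_J are assumed values,
// not the tuned block sizes referenced in the commit.
constexpr int TILE_I = 64;
constexpr int TILE_J = 64;

void out_prod_tiled_sketch(float * C, const float * A, const float * B,
                           int m, int n, int k) {
    for (int i0 = 0; i0 < m; i0 += TILE_I) {
        for (int j0 = 0; j0 < n; j0 += TILE_J) {
            const int i1 = std::min(i0 + TILE_I, m);
            const int j1 = std::min(j0 + TILE_J, n);
            // work on one C tile at a time so the touched rows of A and B
            // stay resident in cache across the inner loops
            for (int i = i0; i < i1; ++i) {
                for (int j = j0; j < j1; ++j) {
                    float sum = 0.0f;
                    for (int kk = 0; kk < k; ++kk) {
                        sum += A[i*k + kk] * B[j*k + kk];
                    }
                    C[i*n + j] += sum;
                }
            }
        }
    }
}
```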
xaedes
0971fee710
reshuffle original sample order instead of the previous shuffled order
otherwise a resumed reshuffle will not result in the same sample order
2023-09-14 18:21:23 +02:00
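[editor's note] A minimal sketch of what "reshuffle the original order" means in practice (function and variable names are illustrative, not the actual train code):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Always shuffle the original 0..n-1 order with the current rng state,
// instead of shuffling the previously shuffled vector. Restoring the rng
// state on resume then reproduces exactly the same sample order.
std::vector<size_t> reshuffle_sketch(size_t n, std::mt19937 & rng) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);     // original sample order
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```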
xaedes
3a9c1d7f5a
set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change the scaling of the lora adapter used in prediction
2023-09-14 17:58:31 +02:00
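[editor's note] A hedged sketch of the defaulting logic described above; the struct, field, and flag names are hypothetical, but the idea is that the adapter scaling alpha/r stays at 1 unless the user explicitly overrides alpha:

```cpp
// Hypothetical parameter struct: mirror lora_r into lora_alpha when the
// user did not pass an explicit alpha on the command line.
struct lora_params_sketch {
    int  lora_r       = 8;
    int  lora_alpha   = 8;
    bool custom_alpha = false;   // set when --lora-alpha was given (assumed flag)
};

void finalize_lora_params_sketch(lora_params_sketch & p) {
    if (!p.custom_alpha) {
        p.lora_alpha = p.lora_r;   // keeps the effective scaling alpha/r == 1
    }
}
```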
xaedes
20cf1a4589
use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.

ggml_vec_mad_f32_unroll will internally loop over x and v with same y.

GGML_VEC_MAD_UNROLL is by default defined to 32.

This value was empirically optimized using performance test runs of out-prod in an openllama-3b finetune with 256 context length and batch size 1. It gives a 23% performance boost for out_prod.

Full measurements of out-prod runtime in ms:
unroll	unroll_xv	unroll_yv
1	67014.643	87826.469
2	77117.552	89077.656
4	72091.311	109121.657
8	61077.543	88678.334
16	56914.67	79514.947
24	59024.595	84350.254
28	55952.446	83368.73
32	51476.658	85177.745
36	55973.792	84659.92
40	55139.616	93844.738
48	60736.392	93330.267
64	99856.878	116994.99

The second column shows the measurements when unrolling yv instead of xv.
2023-09-14 17:20:29 +02:00
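[editor's note] A simplified sketch of the unrolled multiply-add described above (scalar loops only; the real ggml_vec_mad_f32_unroll works on strided pointers and uses SIMD, so the signature here is an assumption):

```cpp
// y[i] += x_k[i] * v_k for k = 0..UNROLL-1, reusing the same y across the
// unrolled iterations, as the commit describes.
#define VEC_MAD_UNROLL_SKETCH 32   // default unroll factor per the commit

void vec_mad_f32_unroll_sketch(int n, float * y,
                               const float * const * xs,   // UNROLL input vectors
                               const float * vs) {         // UNROLL input scalars
    for (int k = 0; k < VEC_MAD_UNROLL_SKETCH; ++k) {
        const float * x = xs[k];
        const float   v = vs[k];
        for (int i = 0; i < n; ++i) {
            y[i] += x[i] * v;
        }
    }
}
```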
xaedes
2c59f7bea3
account for possible leading whitespace that will be added by the tokenizer
e.g. '\t' will be tokenized by the llama spm tokenizer to [29871, 12]
2023-09-14 10:48:38 +02:00
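[editor's note] A small sketch of how such a tokenizer-inserted leading whitespace token could be accounted for; the token id 29871 comes from the commit's own example, but treating it as a strippable constant is an assumption, and the helper name is hypothetical:

```cpp
#include <vector>

// If the spm tokenizer prepended a whitespace token (29871 in the example
// above) that is not part of the original text, drop it before matching
// tokenized sample-start patterns.
std::vector<int> strip_leading_whitespace_token_sketch(std::vector<int> tokens) {
    const int SPM_WHITESPACE_TOKEN = 29871;   // assumed id for the llama spm vocab
    if (tokens.size() > 1 && tokens.front() == SPM_WHITESPACE_TOKEN) {
        tokens.erase(tokens.begin());
    }
    return tokens;
}
```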
xaedes
f627e2fe9c
pass correct max number of tokens to llama_tokenize 2023-09-14 03:04:04 +02:00
xaedes
7f378a7561
remove probably unnecessary exception type flags from stringstream 2023-09-14 00:21:05 +02:00
xaedes
ec57689f64
exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32 2023-09-13 18:37:51 +02:00
xaedes
7898652dfb
update shuffle rng state on reshuffle 2023-09-13 16:20:50 +02:00
xaedes
0e32932931
add sample start patterns and options to force new or by default resume last shuffling 2023-09-13 15:36:09 +02:00
xaedes
1cef45953b
remove unused command line options 2023-09-09 21:58:36 +02:00
xaedes
54b21a397c
Merge branch 'master' into finetune-lora
# Conflicts:
#	examples/train-text-from-scratch/train-text-from-scratch.cpp
#	llama.h
2023-09-09 21:30:22 +02:00
xaedes
ace90884a6
measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB
2023-09-09 21:00:25 +02:00
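[editor's note] A hypothetical sketch of the "measure each order, keep the best" step (placeholder names, not the real ggml/train API): the measurement callback would do a dry-run allocation of the graph for the given order and return the required compute buffer size.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

enum eval_order_sketch { ORDER_LEFT_TO_RIGHT = 0, ORDER_RIGHT_TO_LEFT = 1, ORDER_COUNT = 2 };

// Pick the evaluation order whose dry-run allocation needs the least memory.
eval_order_sketch pick_best_order_sketch(const std::function<size_t(eval_order_sketch)> & measure) {
    eval_order_sketch best      = ORDER_LEFT_TO_RIGHT;
    size_t            best_size = SIZE_MAX;
    for (int o = 0; o < ORDER_COUNT; ++o) {
        const size_t sz = measure((eval_order_sketch) o);   // assumed: build graph + measure alloc
        if (sz < best_size) {
            best_size = sz;
            best      = (eval_order_sketch) o;
        }
    }
    return best;
}
```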
xaedes
917d2870b4
add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
2023-09-09 20:52:53 +02:00
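[editor's note] A sketch of the idea (identifier names are approximations, not necessarily the exact ggml enum): the graph carries an eval-order member, and the forward build visits a node's sources either src[0]-first or src[0]-last, which changes the tensor lifetimes that ggml-alloc sees.

```cpp
enum cgraph_eval_order_sketch {
    EVAL_ORDER_LEFT_TO_RIGHT = 0,   // visit src[0] first (default)
    EVAL_ORDER_RIGHT_TO_LEFT,       // visit src[0] last
};

struct node_sketch {
    static const int MAX_SRC = 2;
    node_sketch * src[MAX_SRC];
};

void visit_sources_sketch(const node_sketch * node, cgraph_eval_order_sketch order,
                          void (*visit)(node_sketch *)) {
    for (int k = 0; k < node_sketch::MAX_SRC; ++k) {
        const int i = (order == EVAL_ORDER_LEFT_TO_RIGHT) ? k : node_sketch::MAX_SRC - 1 - k;
        if (node->src[i]) {
            visit(node->src[i]);
        }
    }
}
```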
xaedes
d3f1b438a8
simplify broadcasting mul_mat backward using ggml_repeat_back 2023-09-09 18:55:18 +02:00
xaedes
d3aaf0876a
add comment briefly describing what ggml_repeat_back does 2023-09-09 18:47:27 +02:00
xaedes
9738526899
decouple random number generator of each operation test
when changing one test, the rng of the other tests is not influenced anymore
2023-09-09 18:46:35 +02:00
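[editor's note] A minimal sketch of per-test rng decoupling (the seed scheme is an assumption): each operation test gets its own seeded generator instead of sharing one global stream, so editing one test no longer shifts the random inputs of the tests that follow.

```cpp
#include <random>

std::mt19937 make_test_rng_sketch(unsigned test_index) {
    const unsigned base_seed = 1234;             // assumed fixed base seed
    return std::mt19937(base_seed + test_index); // independent stream per test
}
```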
xaedes
dd3278619d
test broadcasting mul_mat backward pass 2023-09-09 18:38:29 +02:00
xaedes
aea8b6be74
support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) 2023-09-09 18:37:45 +02:00
xaedes
35260f7d74
fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
2023-09-09 17:10:23 +02:00
xaedes
833a56c144
add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. 2023-09-09 17:07:59 +02:00
xaedes
d7aade7d8a
support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]

in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.

in the backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.

since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.

we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.

change test-grad0 to also test for repeated k/v in q.

this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
2023-09-09 17:07:07 +02:00
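[editor's note] The index mappings described above, written out as plain loops (not the actual ggml kernels): with grouped-query attention, q has n_rep times more heads along ne[2] than k/v.

```cpp
// forward:  each q head iq2 reads the shared k/v head ik2 = iq2 % nek2
// backward: parallelize over k/v heads; each thread owns one ik2 and loops
//           over the repetitions, iq2 = ik2 + irep*nek2, so no two threads
//           accumulate into the same k->grad / v->grad slice.
void gqa_index_mapping_sketch(int neq2, int nek2) {
    const int n_rep = neq2 / nek2;               // repetitions of each k/v head in q
    // forward-style mapping
    for (int iq2 = 0; iq2 < neq2; ++iq2) {
        const int ik2 = iq2 % nek2;              // k/v head used by this q head
        (void) ik2;
    }
    // backward-style mapping (one thread per ik2)
    for (int ik2 = 0; ik2 < nek2; ++ik2) {
        for (int irep = 0; irep < n_rep; ++irep) {
            const int iq2 = ik2 + irep*nek2;     // q heads contributing to this k/v head
            (void) iq2;
        }
    }
}
```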
kchro3
21ac3a1503
metal : support for Swift (#3078)
* Metal support for Swift

* update

* add a toggle for arm/arm64

* set minimum versions for all platforms

* update to use newLibraryWithURL

* bump version

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-09-09 17:12:10 +08:00
Jhen-Jie Hong
4fd5477955
metal : support build for iOS/tvOS (#3089) 2023-09-09 11:46:04 +03:00
takov751
ec2a24fedf
flake : add train-text-from-scratch to flake.nix (#3042) 2023-09-08 19:06:26 +03:00
Ikko Eltociear Ashimine
7d99aca759
readme : fix typo (#3043)
* readme : fix typo

acceleation -> acceleration

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 19:04:32 +03:00
Kawrakow
ba7ffbb251
metal : Q3_K speedup (#2995)
* Slightly faster Q3_K and Q5_K on metal

* Another Q3_K speedup on metal

Combined with previous commit, we are now +9.6% for TG.
PP is not affected as this happens via the matrix multiplication
templates.

* Slowly progressing on Q3_K on metal

We are now 13% faster than master

* Another small improvement for Q3_K on metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-09-08 19:01:04 +03:00
Cebtenzzre
e64f5b5578
examples : make n_ctx warning work again (#3066)
This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by
default (#2901)").
2023-09-08 11:43:35 -04:00
Georgi Gerganov
94f10b91ed
readme : update hot topics 2023-09-08 18:18:04 +03:00
Georgi Gerganov
b3e9852e47
sync : ggml (CUDA GLM RoPE + POSIX) (#3082)
ggml-ci
2023-09-08 17:58:07 +03:00
Przemysław Pawełczyk
cb6c44c5e0
build : do not use _GNU_SOURCE gratuitously (#2035)
* Do not use _GNU_SOURCE gratuitously.

What is needed to build llama.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions,
plus some stuff from BSD that is not specified in POSIX.1.

Well, that was true until NUMA support was added recently,
so enable GNU libc extensions for Linux builds to cover that.

Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in a 3rd-party app, as they can build it with
the FTMs set by the Makefile here or other FTMs depending on their needs.

It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.

* make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK

* make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK

* make : use BSD-specific FTMs to enable alloca on BSDs

* make : fix OpenBSD build by exposing newer POSIX definitions

* cmake : follow recent FTM improvements from Makefile
2023-09-08 15:09:21 +03:00
hongbo.mo
a21baeb122
docker : add git to full-cuda.Dockerfile main-cuda.Dockerfile (#3044) 2023-09-08 13:57:55 +03:00
Yui
6ff712a6d1
Update deprecated GGML TheBloke links to GGUF (#3079) 2023-09-08 12:32:55 +02:00
slaren
ebc96086af
ggml-alloc : correctly check mmap return value for errors (#3075) 2023-09-08 04:04:56 +02:00
Kunshang Ji
7f412dab9c
enable CPU HBM (#2603)
* add cpu hbm support

* add memalign 0 byte check

* Update ggml.c

* Update llama.cpp

* ggml : allow ggml_init with 0 size

* retrigger ci

* fix code style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 03:46:56 +02:00
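[editor's note] A hedged sketch of the CPU HBM allocation path, assuming the memkind hbwmalloc API; the guard macro, function name, and 64-byte alignment below are assumptions, and the zero-size check mirrors the "memalign 0 byte check" item in the commit.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(USE_CPU_HBM_SKETCH)       // assumed build-time guard
#include <hbwmalloc.h>
#endif

void * aligned_alloc_hbm_sketch(size_t size) {
    if (size == 0) {
        return nullptr;               // avoid a 0-byte memalign request
    }
#if defined(USE_CPU_HBM_SKETCH)
    void * ptr = nullptr;
    if (hbw_posix_memalign(&ptr, 64, size) != 0) {   // allocate from high-bandwidth memory
        return nullptr;
    }
    return ptr;
#else
    return std::malloc(size);         // fall back to normal allocation
#endif
}
```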
Cebtenzzre
6336d834ec
convert : fix F32 ftype not being saved (#3048) 2023-09-07 14:27:42 -04:00
Cebtenzzre
00d62adb79
fix some warnings from gcc and clang-tidy (#3038)
Co-authored-by: xaedes <xaedes@gmail.com>
2023-09-07 13:22:29 -04:00
Cebtenzzre
4fa2cc1750
make : improve test target (#3031) 2023-09-07 10:15:01 -04:00
Cebtenzzre
5ffab089a5
make : fix CPPFLAGS (#3035) 2023-09-07 10:13:50 -04:00
slaren
15b67a66c2
llama-bench : use two tokens in the warmup run for prompt evals (#3059) 2023-09-07 15:52:34 +02:00