llama.cpp

Author	SHA1	Message	Date
xaedes	2c59f7bea3	account for possible leading whitespace that will be added by tokenizer e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]	2023-09-14 10:48:38 +02:00
xaedes	f627e2fe9c	pass correct max number of tokens to llama_tokenize	2023-09-14 03:04:04 +02:00
xaedes	7f378a7561	remove probably unnecessary exception type flags from stringstream	2023-09-14 00:21:05 +02:00
xaedes	ec57689f64	exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32	2023-09-13 18:37:51 +02:00
xaedes	7898652dfb	update shuffle rng state on reshuffle	2023-09-13 16:20:50 +02:00
xaedes	0e32932931	add sample start patterns and options to force new or by default resume last shuffling	2023-09-13 15:36:09 +02:00
xaedes	1cef45953b	remove unused command line options	2023-09-09 21:58:36 +02:00
xaedes	54b21a397c	Merge branch 'master' into finetune-lora # Conflicts: # examples/train-text-from-scratch/train-text-from-scratch.cpp # llama.h	2023-09-09 21:30:22 +02:00
xaedes	ace90884a6	measure max compute size for each cgraph eval order and use best order this can bring huge memory savings: e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8mb down to 4627.6 MB	2023-09-09 21:00:25 +02:00
xaedes	917d2870b4	add cgraph evaluation order member and corresponding enum type this controls in which order ggml_build_forward visits source nodes. by default the nodes are visited left to right, i.e. src[0] first. in some cases it is beneficial for ggml-alloc to visit in a different order. two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).	2023-09-09 20:52:53 +02:00
xaedes	d3f1b438a8	simplify broadcasting mul_mat backward using ggml_repeat_back	2023-09-09 18:55:18 +02:00
xaedes	d3aaf0876a	add comment briefly describing what ggml_repeat_back does	2023-09-09 18:47:27 +02:00
xaedes	9738526899	decouple random number generator of each operation test when changing one test the rng of others tests is not influenced anymore	2023-09-09 18:46:35 +02:00
xaedes	dd3278619d	test broadcasting mul_mat backward pass	2023-09-09 18:38:29 +02:00
xaedes	aea8b6be74	support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)	2023-09-09 18:37:45 +02:00
xaedes	35260f7d74	fix finetune to support grouped-query-attention (using flash-attention) note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.	2023-09-09 17:10:23 +02:00
xaedes	833a56c144	add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.	2023-09-09 17:07:59 +02:00
xaedes	d7aade7d8a	support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back k and v can now be repeated in q along ne[2] in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2. in backard pass this won't work as easy, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3]. so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads. in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2. since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor. additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned. we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions. this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous. since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous. change test-grad0 to also test for repeated k/v in q. this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable. added a note to explain this.	2023-09-09 17:07:07 +02:00
kchro3	21ac3a1503	metal : support for Swift (#3078 ) * Metal support for Swift * update * add a toggle for arm/arm64 * set minimum versions for all platforms * update to use newLibraryWithURL * bump version Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com> --------- Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>	2023-09-09 17:12:10 +08:00
Jhen-Jie Hong	4fd5477955	metal : support build for iOS/tvOS (#3089 )	2023-09-09 11:46:04 +03:00
takov751	ec2a24fedf	flake : add train-text-from-scratch to flake.nix (#3042 )	2023-09-08 19:06:26 +03:00
Ikko Eltociear Ashimine	7d99aca759	readme : fix typo (#3043 ) * readme : fix typo acceleation -> acceleration * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-09-08 19:04:32 +03:00
Kawrakow	ba7ffbb251	metal : Q3_K speedup (#2995 ) * Slightly faster Q3_K and Q5_K on metal * Another Q3_K speedup on metal Combined with previous commit, we are now +9.6% for TG. PP is not affected as this happens via the matrix multiplication templates. * Slowly progressing on Q3_K on metal We are now 13% faster than master * nother small improvement for Q3_K on metal --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-09-08 19:01:04 +03:00
Cebtenzzre	e64f5b5578	examples : make n_ctx warning work again (#3066 ) This was broken by commit `e36ecdcc` ("build : on Mac OS enable Metal by default (#2901)").	2023-09-08 11:43:35 -04:00
Georgi Gerganov	94f10b91ed	readme : update hot tpoics	2023-09-08 18:18:04 +03:00
Georgi Gerganov	b3e9852e47	sync : ggml (CUDA GLM RoPE + POSIX) (#3082 ) ggml-ci	2023-09-08 17:58:07 +03:00
Przemysław Pawełczyk	cb6c44c5e0	build : do not use _GNU_SOURCE gratuitously (#2035 ) * Do not use _GNU_SOURCE gratuitously. What is needed to build llama.cpp and examples is availability of stuff defined in The Open Group Base Specifications Issue 6 (https://pubs.opengroup.org/onlinepubs/009695399/) known also as Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions, plus some stuff from BSD that is not specified in POSIX.1. Well, that was true until NUMA support was added recently, so enable GNU libc extensions for Linux builds to cover that. Not having feature test macros in source code gives greater flexibility to those wanting to reuse it in 3rd party app, as they can build it with FTMs set by Makefile here or other FTMs depending on their needs. It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2. * make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK * make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK * make : use BSD-specific FTMs to enable alloca on BSDs * make : fix OpenBSD build by exposing newer POSIX definitions * cmake : follow recent FTM improvements from Makefile	2023-09-08 15:09:21 +03:00
hongbo.mo	a21baeb122	docker : add git to full-cuda.Dockerfile main-cuda.Dockerfile (#3044 )	2023-09-08 13:57:55 +03:00
Yui	6ff712a6d1	Update deprecated GGML TheBloke links to GGUF (#3079 )	2023-09-08 12:32:55 +02:00
slaren	ebc96086af	ggml-alloc : correctly check mmap return value for errors (#3075 )	2023-09-08 04:04:56 +02:00
Kunshang Ji	7f412dab9c	enable CPU HBM (#2603 ) * add cpu hbm support * add memalign 0 byte check * Update ggml.c * Update llama.cpp * ggml : allow ggml_init with 0 size * retrigger ci * fix code style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-09-08 03:46:56 +02:00
Cebtenzzre	6336d834ec	convert : fix F32 ftype not being saved (#3048 )	2023-09-07 14:27:42 -04:00
Cebtenzzre	00d62adb79	fix some warnings from gcc and clang-tidy (#3038 ) Co-authored-by: xaedes <xaedes@gmail.com>	2023-09-07 13:22:29 -04:00
Cebtenzzre	4fa2cc1750	make : improve test target (#3031 )	2023-09-07 10:15:01 -04:00
Cebtenzzre	5ffab089a5	make : fix CPPFLAGS (#3035 )	2023-09-07 10:13:50 -04:00
slaren	15b67a66c2	llama-bench : use two tokens in the warmup run for prompt evals (#3059 )	2023-09-07 15:52:34 +02:00
Kawrakow	be8c9c245b	metal : parallel RoPE on Metal (#3024 ) * Parallel RoPE on metal * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-09-07 16:45:01 +03:00
Kawrakow	be6beeb8d7	metal : correct fix of kernel_norm (#3060 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-09-07 16:42:42 +03:00
Georgi Gerganov	c4f496648c	metal : fix kernel_norm (fixes Falcon on Metal) (#3057 ) * metal : fix kernel_norm ggml-ci * metal : put warning in kernel_norm to not combine the loops * metal : restore original F16 mat-vec multiplication It works after the norm fixes * common : don't do warm-up with more than n_batch tokens (close #3058) ggml-ci * metal : minor	2023-09-07 15:49:09 +03:00
Przemysław Pawełczyk	fec2fb19e4	ggml : posixify madvise and pagesize (#3037 ) * llama : use posix_madvise() instead of madvise() derived from BSD sed -i 's,\<madvise\>,posix_&,g;s,\<MADV_,POSIX_&,g' llama.cpp * ggml : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml.c * metal : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml-metal.m	2023-09-07 11:15:06 +03:00
xaedes	0c2c9c7545	fix gradient accumulation bug where the same batch was used for each microstep	2023-09-06 22:45:36 +02:00
xaedes	de6170d818	fix gradient accumulation bug where the same batch was used for each microstep	2023-09-06 21:35:21 +02:00
xaedes	0393116628	Merge branch 'master' into finetune-lora # Conflicts: # common/common.cpp	2023-09-06 20:15:24 +02:00
xaedes	c08fcf5947	specify default lora rank with '--lora-r N' '--lora-r N' will specify default rank for all tensors '--rank-wq N', etc. will override this default rank for specific tensor types.	2023-09-06 20:11:49 +02:00
xaedes	8c2d7e37f9	improve finetune time measurement fix printf warnings on system where int64_t is (long int). change time datatypes to double because values get big with long training times. exclude file saving from time measurement. converge faster to actual time per iteration by removing very small first duration before first iteration was performed. fix bug in output of total training time, the reported value was 1000 times to small.	2023-09-06 18:06:24 +02:00
Georgi Gerganov	178b1850eb	k-quants : fix zero-weight guard in Q6_K (ref #3040 )	2023-09-06 12:40:57 +03:00
Kerfuffle	ea2c85d5d2	convert-llama-ggml-to-gguf: Try to handle files older than GGJTv3 (#3023 ) * convert-llama-ggmlv3-to-gguf: Try to handle files older than GGJTv3 * Better error messages for files that cannot be converted * Add file type to GGUF output * Rename to convert-llama-ggml-to-gguf.py * Include original file type information in description * Improve some informational output	2023-09-06 02:49:11 -06:00
Cebtenzzre	9912b9efc8	build : add LLAMA_METAL_NDEBUG flag (#3033 )	2023-09-05 18:21:10 -04:00
Cebtenzzre	9e2023156e	make : use new flag variables for recent changes (#3019 )	2023-09-05 15:12:00 -04:00
Cebtenzzre	de2fe892af	examples : replace fprintf to stdout with printf (#3017 )	2023-09-05 15:10:27 -04:00

1 2 3 4 5 ...

1402 commits