llama.cpp

Author	SHA1	Message	Date
xaedes	917d2870b4	add cgraph evaluation order member and corresponding enum type. this controls in which order ggml_build_forward visits source nodes. by default the nodes are visited left to right, i.e. src[0] first. in some cases it is beneficial for ggml-alloc to visit in a different order. two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).	2023-09-09 20:52:53 +02:00
xaedes	d3f1b438a8	simplify broadcasting mul_mat backward using ggml_repeat_back	2023-09-09 18:55:18 +02:00
xaedes	d3aaf0876a	add comment briefly describing what ggml_repeat_back does	2023-09-09 18:47:27 +02:00
xaedes	9738526899	decouple random number generator of each operation test. when changing one test, the rng of other tests is not influenced anymore.	2023-09-09 18:46:35 +02:00
xaedes	dd3278619d	test broadcasting mul_mat backward pass	2023-09-09 18:38:29 +02:00
xaedes	aea8b6be74	support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)	2023-09-09 18:37:45 +02:00
xaedes	35260f7d74	fix finetune to support grouped-query-attention (using flash-attention). note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.	2023-09-09 17:10:23 +02:00
xaedes	833a56c144	add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.	2023-09-09 17:07:59 +02:00
xaedes	d7aade7d8a	support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back. k and v can now be repeated in q along ne[2]. in the forward pass we just use modulo to compute the k and v indices, like ik2 = iq2 % nek2. in the backward pass this does not work as easily, because multiple threads would compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3]. so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads. in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2. since ne2 is not the same for q, k and v we also change how the gradients are concatenated into the result tensor. additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned. we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions. this needs a small change to ggml_reshape, removing the assertion that the second argument is contiguous. since only the shape (ne) of the second reshape argument is relevant, its memory layout (nb) is irrelevant, so it can very well be non-contiguous. change test-grad0 to also test for repeated k/v in q. this changes the rng and now results in small gradient differences in softmax. these come solely from using the f16 exp table lookup in forward softmax: when temporarily changing softmax to use the actual exp function, the reported gradient differences go away. gradient differences coming solely from the f16 table lookup are acceptable. added a note to explain this.	2023-09-09 17:07:07 +02:00
xaedes	0c2c9c7545	fix gradient accumulation bug where the same batch was used for each microstep	2023-09-06 22:45:36 +02:00
xaedes	de6170d818	fix gradient accumulation bug where the same batch was used for each microstep	2023-09-06 21:35:21 +02:00
xaedes	0393116628	Merge branch 'master' into finetune-lora # Conflicts: # common/common.cpp	2023-09-06 20:15:24 +02:00
xaedes	c08fcf5947	specify default lora rank with '--lora-r N'. '--lora-r N' will specify the default rank for all tensors. '--rank-wq N', etc. will override this default rank for specific tensor types.	2023-09-06 20:11:49 +02:00
xaedes	8c2d7e37f9	improve finetune time measurement. fix printf warnings on systems where int64_t is (long int). change time datatypes to double because values get big with long training times. exclude file saving from time measurement. converge faster to actual time per iteration by removing the very small first duration before the first iteration was performed. fix bug in output of total training time: the reported value was 1000 times too small.	2023-09-06 18:06:24 +02:00
Georgi Gerganov	178b1850eb	k-quants : fix zero-weight guard in Q6_K (ref #3040 )	2023-09-06 12:40:57 +03:00
Kerfuffle	ea2c85d5d2	convert-llama-ggml-to-gguf: Try to handle files older than GGJTv3 (#3023 ) * convert-llama-ggmlv3-to-gguf: Try to handle files older than GGJTv3 * Better error messages for files that cannot be converted * Add file type to GGUF output * Rename to convert-llama-ggml-to-gguf.py * Include original file type information in description * Improve some informational output	2023-09-06 02:49:11 -06:00
Cebtenzzre	9912b9efc8	build : add LLAMA_METAL_NDEBUG flag (#3033 )	2023-09-05 18:21:10 -04:00
Cebtenzzre	9e2023156e	make : use new flag variables for recent changes (#3019 )	2023-09-05 15:12:00 -04:00
Cebtenzzre	de2fe892af	examples : replace fprintf to stdout with printf (#3017 )	2023-09-05 15:10:27 -04:00
Erik Scholz	c9c3220c48	convert: fix convert.py not working with int filename_stem (#3028 ) * fix implicit int to string conversion * convert : remove an obsolete pyright comment --------- Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>	2023-09-05 19:41:00 +02:00
xaedes	867e7c2255	Merge branch 'master' into finetune-lora	2023-09-05 14:48:46 +02:00
Georgi Gerganov	d375b8f3aa	ggml : fix L-BFGS linesearch loop	2023-09-05 12:05:13 +03:00
Georgi Gerganov	786e786061	build : fix compile warnings	2023-09-05 12:02:19 +03:00
Kawrakow	d59bd97065	Guard against all weights in a super-block being zero (#3010 ) * Guard against all weights in a super-block being zero * Also guard against extremely small weights Closes #2982 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-09-05 09:55:33 +02:00
Georgi Gerganov	35938ee3b0	llama : update logic for number of threads when using BLAS	2023-09-05 10:46:39 +03:00
Georgi Gerganov	921772104b	speculative : add grammar support (#2991 ) * speculative : add grammar support * grammars : add json_arr.gbnf * grammar : add comments to new grammar file * grammar : remove one nested level * common : warm-up with 2 tokens - seems to work better * speculative : print draft token pieces * speculative : reuse grammar parser + better logs and comments * speculative : avoid grammar_mem * make : fix speculative build	2023-09-05 08:46:17 +03:00
xaedes	d07b6aac77	fix tracking of train_samples and train_tokens	2023-09-05 02:18:17 +02:00
xaedes	c1c3b0e0c2	add gradient accumulation. specify the number of accumulation steps with '--grad-acc N'. this will simulate a bigger batch size of grad_acc*batch.	2023-09-05 01:09:06 +02:00
Georgi Gerganov	2ba85c8609	py : minor	2023-09-04 22:50:50 +03:00
xaedes	d3afd7131e	Merge branch 'master' into finetune-lora # Conflicts: # Makefile	2023-09-04 21:44:05 +02:00
Georgi Gerganov	e36ecdccc8	build : on Mac OS enable Metal by default (#2901 ) * build : on Mac OS enable Metal by default * make : try to fix build on Linux * make : move targets back to the top * make : fix target clean * llama : enable GPU inference by default with Metal * llama : fix vocab_only logic when GPU is enabled * common : better `n_gpu_layers` assignment * readme : update Metal instructions * make : fix merge conflict remnants * gitignore : metal	2023-09-04 22:26:24 +03:00
slaren	bd33e5ab92	ggml-opencl : store GPU buffer in ggml_tensor::extra (#2994 )	2023-09-04 14:59:52 +02:00
Cebtenzzre	3103568144	llama-bench : make cpp file non-executable (#2999 )	2023-09-04 13:40:18 +03:00
Leng Yue	5b8530d88c	make : add speculative example (#3003 )	2023-09-04 13:39:57 +03:00
Aarni Koskela	e4386f417f	server : add a subtle loading animation to the edit box (#2466 ) * editorconfig: add override for the server HTML (which already is 2-space indented) * server: add a subtle loading animation to the edit box	2023-09-04 16:28:55 +08:00
Jiahao Li	35195689cd	2x faster (rms) norm cuda kernels (3.7% e2e improvement) (#2985 ) * 2x faster (rms) norm cuda kernels * Fix code style	2023-09-04 08:53:30 +02:00
xaedes	9ea2f7ff58	Merge branch 'master' into finetune-lora # Conflicts: # ggml-alloc.c	2023-09-04 02:40:44 +02:00
slaren	cf9b08485c	ggml-alloc : use virtual memory for measurement (#2973 ) * ggml-alloc : use virtual memory for measurement * compatibility fixes for MAP_ANONYMOUS * fallback to fixed address for systems without virtual memory	2023-09-03 20:34:09 +02:00
xaedes	50589ed6be	load default rms_norm and rope parameters from base model	2023-09-03 20:05:54 +02:00
xaedes	bdb7092e82	add missing gguf_free in load_checkpoint_lora_file	2023-09-03 20:04:03 +02:00
xaedes	e07f5c57bb	fix printf format warnings	2023-09-03 20:03:39 +02:00
xaedes	406e0750cc	update README.md	2023-09-03 19:25:18 +02:00
Georgi Gerganov	47068e5170	speculative : PoC for speeding-up inference via speculative sampling (#2926 ) * speculative : initial example * speculative : print encoding speed * speculative : add --draft CLI arg	2023-09-03 15:12:08 +03:00
Georgi Gerganov	8f429fa511	perplexity : fix ETA by warming up the model with an empty run	2023-09-03 13:43:17 +03:00
Kerfuffle	6519e9c99c	gguf(python): Fix special vocab handling when id < 0 (#2984 )	2023-09-03 04:38:43 -06:00
Georgi Gerganov	b7f2aa9e51	metal : restore `363f0bf` and fix reduce in F16_F32 kernels (#2986 )	2023-09-03 13:23:33 +03:00
Alon	73a12a6344	cov : disable comment in PRs (#2989 )	2023-09-03 13:19:01 +03:00
opparco	3730134776	llama : fix bpe tokenize from byte (#2889 )	2023-09-03 13:18:09 +03:00
Georgi Gerganov	d9151e6f57	metal : revert `6af0bab` until we fix it This restores the generated text to be the same as before #2959	2023-09-03 12:40:56 +03:00
Alon	afc43d5f82	cov : add Code Coverage and codecov.io integration (#2928 ) * update .gitignore * makefile: add coverage support (lcov, gcovr) * add code-coverage workflow * update code coverage workflow * run on ubuntu 20.04 * use gcc-8 * check why the job hang * add env vars * add LLAMA_CODE_COVERAGE=1 again * - add CODECOV_TOKEN - add missing make lcov-report * install lcov * update make file -pb flag * remove unused GGML_NITER from workflows * wrap coverage output files in COV_TARGETS	2023-09-03 11:48:49 +03:00
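
Note on d7aade7d8a: the index mapping described in that commit message can be illustrated with a small sketch. This is not the ggml code; it is a toy Python model of the two formulas quoted in the message (ik2 = iq2 % nek2 in the forward pass, iq2 = ik2 + irep*nek2 in the backward pass). The function names and the neq2 parameter (number of q heads along ne[2]) are assumptions made for illustration, as is the requirement that neq2 be a multiple of nek2.

```python
def forward_kv_index(iq2: int, nek2: int) -> int:
    """Forward pass: the k/v head used by q head iq2 (ik2 = iq2 % nek2)."""
    return iq2 % nek2


def backward_q_indices(ik2: int, neq2: int, nek2: int) -> list[int]:
    """Backward pass: all q heads that read k/v head ik2.

    Iterating over k heads and visiting these q heads inside one worker
    means each worker accumulates into its own k/v gradient slice.
    Assumption: neq2 is a multiple of nek2.
    """
    assert neq2 % nek2 == 0, "assumed: q heads are a multiple of k/v heads"
    nrep = neq2 // nek2  # number of repetitions of k/v in q
    return [ik2 + irep * nek2 for irep in range(nrep)]


if __name__ == "__main__":
    neq2, nek2 = 8, 2
    for ik2 in range(nek2):
        q_heads = backward_q_indices(ik2, neq2, nek2)
        # Every q head listed for ik2 maps back to ik2 in the forward pass.
        assert all(forward_kv_index(iq2, nek2) == ik2 for iq2 in q_heads)
        print(ik2, q_heads)
```

With 8 q heads and 2 k/v heads, the q-head sets for different k heads are disjoint and together cover every q head, which is the property the commit message relies on to avoid threads competing for the same gradient rows.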


1371 commits