xaedes
52c92c0a8c
terminate recursive tensor cloning when reaching tensor without src tensors
2023-08-14 17:53:36 +02:00
xaedes
0dd496c5e2
fix variable name and add missing type cast
2023-08-14 17:52:48 +02:00
xaedes
cfddc36be2
correctly clone reshape and permute operations by also cloning tensor->nb values
2023-08-14 17:52:15 +02:00
xaedes
d43741540b
don't allocate hash_map on context
...
because the context has no_alloc=true when using the memory allocator, resulting in NULL data pointers
2023-08-14 17:51:20 +02:00
xaedes
fc826c8ea8
in train function replace add_inplace by regular add
...
because using add_inplace seems to result in different gradients
2023-08-14 17:49:22 +02:00
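The commit above notes that `add_inplace` produced different gradients than a regular `add`. A toy illustration (not ggml code) of why that can happen: for y = (a + b) * a, the backward pass needs the original value of `a`, but an in-place add overwrites it before the product's gradient is computed.

```cpp
#include <utility>

// Toy illustration of in-place aliasing breaking gradients.
// For y = (a + b) * a, dy/da = (a + b) + a, computed from values
// that were saved during the forward pass.
std::pair<float, float> forward_backward(float a, float b, bool inplace) {
    float sum;
    if (inplace) {
        a = a + b;      // a is clobbered; its original value is lost
        sum = a;
    } else {
        sum = a + b;    // a survives for the backward pass
    }
    float y  = sum * a; // forward result
    float da = sum + a; // gradient w.r.t. a (product rule)
    return {y, da};
}
```

With a = 2, b = 3 the out-of-place version yields dy/da = 7, while the in-place version yields 10: same formula, different saved values.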
xaedes
2bf422eafd
add train function using automatic gradient checkpointing backward pass and allocator
2023-08-06 23:07:57 +02:00
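Gradient checkpointing, which the commit above wires into the train function, trades compute for memory: only every k-th activation is stored during the forward pass, and the rest are recomputed from the nearest checkpoint during backward. A minimal numeric sketch with a chain of f(x) = x², hypothetical and unrelated to the actual llama.cpp implementation:

```cpp
#include <vector>

// Gradient checkpointing sketch for y = f(f(...f(x))) with f(x) = x^2.
double f(double x)  { return x * x; }
double df(double x) { return 2.0 * x; }

double grad_with_checkpoints(double x, int n, int stride) {
    std::vector<double> ckpt;            // only every stride-th activation is saved
    double v = x;
    for (int i = 0; i < n; ++i) {
        if (i % stride == 0) ckpt.push_back(v);
        v = f(v);
    }
    // backward: recompute each needed activation from its checkpoint
    double grad = 1.0;
    for (int i = n - 1; i >= 0; --i) {
        double a = ckpt[i / stride];
        for (int j = (i / stride) * stride; j < i; ++j) a = f(a);  // recompute segment
        grad *= df(a);                   // chain rule
    }
    return grad;
}
```

For n = 3 the chain computes x⁸, whose derivative at x = 2 is 8·2⁷ = 1024, regardless of the checkpoint stride.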
xaedes
d43af4b543
Merge branch 'master' into pr-train-mem-usage-improvements
2023-08-06 17:30:17 +02:00
DannyDaemonic
86c3219895
console : fix issue related to Windows 11 PowerShell console mode persistence ( #2521 )
2023-08-06 09:49:34 +03:00
Keiichi Tabata
2e8265ae17
convert.py : add missing abstract methods for quantized data ( #2491 )
2023-08-06 09:34:05 +03:00
Johannes Gäßler
f514d1b306
CUDA: faster k-quant mul_mat_q kernels ( #2525 )
2023-08-05 18:20:44 +02:00
Jonas Wunderlich
332311234a
fix firefox autoscroll ( #2519 )
2023-08-04 22:16:11 +02:00
Cebtenzzre
182af739c4
server: regenerate completion.js.hpp ( #2515 )
2023-08-04 21:00:57 +02:00
Cebtenzzre
4329d1acb0
CUDA: use min compute capability of GPUs actually used ( #2506 )
2023-08-04 17:35:22 +02:00
Cebtenzzre
02f9d96a86
CUDA: check if event is NULL before cudaStreamWaitEvent ( #2505 )
...
Fixes #2503
2023-08-04 17:34:32 +02:00
DannyDaemonic
3498588e0f
Add --simple-io option for subprocesses and break out console.h and cpp ( #1558 )
2023-08-04 08:20:12 -07:00
Stephen Nichols
5f631c2679
Fixing race condition in server and partial stream handling in frontend. ( #2391 )
...
* Fixing race condition in server.cpp and partial stream handling in completion.js
* Reverting assert edits.
* Adding newline to eof
2023-08-04 13:37:24 +02:00
l3utterfly
415e99fec2
Stream save llama context data to file instead of allocating entire buffer upfront ( #2488 )
...
* added stream saving context data to file to avoid allocating unnecessary amounts of memory
* generalised copying state data to file or buffer
* added comments explaining how copy_state_data works
* fixed trailing whitespaces
* fixed save load state example
* updated save load state to use public function in llama.cpp
* - restored breakage of the llama_copy_state_data API
- moved new logic for copying llama state data to internal function
* fixed function declaration order
* restored save load state example
* fixed whitespace
* removed unused llama-util.h include
* Apply suggestions from code review
Co-authored-by: slaren <slarengh@gmail.com>
* Apply code review suggestions
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-04 13:29:52 +02:00
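The PR above replaces a single upfront allocation of the whole state with streaming writes, routed through one generic copy routine that can target either a file or a buffer. A sketch of that pattern with illustrative names (not the llama.cpp API):

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// Illustrative sink abstraction: the same copy routine can stream state
// either to a file or into a caller-provided buffer, so saving to file
// never needs an allocation sized for the entire state.
struct data_sink {
    std::ofstream*        file = nullptr;  // write-to-file mode
    std::vector<uint8_t>* buf  = nullptr;  // write-to-buffer mode

    void write(const void* src, size_t size) {
        if (file) {
            file->write(static_cast<const char*>(src), size);
        } else {
            const uint8_t* p = static_cast<const uint8_t*>(src);
            buf->insert(buf->end(), p, p + size);
        }
    }
};

// One generic routine serves both targets (here: a length-prefixed blob
// standing in for, e.g., the KV cache).
void copy_state(data_sink& out, const std::vector<float>& kv_cache) {
    uint32_t n = static_cast<uint32_t>(kv_cache.size());
    out.write(&n, sizeof(n));
    out.write(kv_cache.data(), n * sizeof(float));
}
```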
Borislav Stanimirov
ff966e7ca6
build : fix several cast and printf warnings ( #2499 )
2023-08-04 13:07:21 +03:00
Evan Jones
8183159cf3
examples : generate JSON according to schema ( #1887 )
...
* examples : add JSON schema grammars
* complete JSON grammar
* ensure primitive types can be used as root of schema
* support integer type and adjust usage text
2023-08-02 22:05:44 -04:00
Johannes Gäßler
468ea24fb4
CUDA: faster non k-quant mul_mat_q kernels ( #2483 )
2023-08-02 18:04:04 +02:00
Johannes Gäßler
4f6b60c776
CUDA: Fix models with output size != 32000 ( #2480 )
2023-08-02 16:48:10 +02:00
ldwang
220d931864
readme : add Aquila-7B model series to supported models ( #2487 )
...
* support bpe tokenizer in convert
Signed-off-by: ldwang <ftgreat@gmail.com>
* support bpe tokenizer in convert
Signed-off-by: ldwang <ftgreat@gmail.com>
* support bpe tokenizer in convert, fix
Signed-off-by: ldwang <ftgreat@gmail.com>
* Add Aquila-7B models in README.md
Signed-off-by: ldwang <ftgreat@gmail.com>
* Up Aquila-7B models in README.md
Signed-off-by: ldwang <ftgreat@gmail.com>
---------
Signed-off-by: ldwang <ftgreat@gmail.com>
Co-authored-by: ldwang <ftgreat@gmail.com>
2023-08-02 11:21:11 +03:00
Eve
81844fbcfd
tests : Fix compilation warnings (Linux/GCC) ( #2451 )
...
* fix hellaswag print format, cast away warning in test-double-float
* c++11 cannot use designated initializers
* add static to test-grad0.c internal functions
* use memcpy in test-double-float.c
* port c tests to c++
* use initializer list for ggml_init_params
2023-08-02 11:06:19 +03:00
Yiming Cui
a312193e18
readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models ( #2475 )
...
* add support for chinese llama-2 / alpaca-2
* remove white spaces
2023-08-02 09:18:31 +03:00
Bono Lv
c574bddb36
fix a typo in examples/server/README.md ( #2478 )
2023-08-01 14:54:28 +02:00
ebraminio
86aeb27734
server : Support dark mode ( #2414 )
...
* server : Support dark mode
So it respects user system light / dark settings.
* Update index.html.hpp by running ./deps.sh
2023-08-01 10:56:23 +02:00
Matteo Boschini
1873ff586b
metal : add gqa8 kernel to allow llama-2-70B on metal ( #2459 )
...
* Added gqa8 kernel to allow llama-2-70B on metal
* Update ggml-metal.m
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
* Extend kernel_mul_mat_f16_f32 to handle gqa broadcast
* Added ne03==ne13 assertion
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
2023-08-01 10:43:12 +03:00
Johannes Gäßler
49e7cb5bb1
CUDA: fixed LLAMA_FAST compilation option ( #2473 )
2023-07-31 21:02:19 +02:00
Johannes Gäßler
b772bba42e
CUDA: fixed cmake F16 option ( #2471 )
2023-07-31 19:52:22 +02:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues ( #2453 )
2023-07-31 15:44:35 +02:00
Johannes Gäßler
1215ed7d5c
CUDA: Implemented row flattening for non-glm RoPE ( #2468 )
2023-07-31 14:32:30 +02:00
Johannes Gäßler
2dbf518911
CUDA: fewer memory bank conflicts for mul_mat_q ( #2458 )
2023-07-31 13:18:51 +02:00
slaren
9d2382b3e4
Fix Metal backend broken from the allocator changes ( #2455 )
...
* fix Metal backend broken from the allocator changes
2023-07-31 11:02:53 +02:00
slaren
a113689571
ggml : add graph tensor allocator ( #2411 )
...
* ggml : add graph tensor allocator
* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset
* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
2023-07-30 15:58:01 +02:00
Johannes Gäßler
11f3ca06b8
CUDA: Quantized matrix matrix multiplication ( #2160 )
...
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_k
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
Johannes Gäßler
9baf9ef304
CUDA: faster multi GPU synchronization ( #2448 )
2023-07-29 23:04:10 +02:00
xaedes
22cb368dd9
remove trailing whitespace
2023-07-28 23:55:30 +02:00
xaedes
c1a5e116a4
llama training : fix ggml_rms_norm_back calls to pass configurable eps
2023-07-28 23:13:20 +02:00
xaedes
ecdc16163e
ggml : update ggml_rms_norm_back with configurable eps
2023-07-28 23:13:20 +02:00
xaedes
87035b96f7
remove out-commented vectorized code of opt_adam
...
the vectorized code might be a bit faster for a low number of parameters, but it had a big memory usage overhead
2023-07-28 23:13:20 +02:00
xaedes
0f6a8ab519
tighten abs error bounds for sqrt in test-grad0
2023-07-28 23:13:20 +02:00
xaedes
47055c929f
tighten abs error bounds for flash_attn in test-grad0
2023-07-28 23:13:20 +02:00
xaedes
dbbc263313
add conditional compilation of using F16 exp in flash attention
...
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
2023-07-28 23:13:20 +02:00
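The macro name above comes straight from the commit; the surrounding code below is purely illustrative. The pattern is a compile-time toggle: the f16 exp path is compiled in only when the define is uncommented.

```cpp
#include <cmath>

// Compile-time toggle sketch. The macro name is from the commit message;
// the function and its bodies are illustrative placeholders.
// Uncomment the next line to enable the f16 exp path:
// #define GGML_FLASH_ATTN_EXP_FP16

float attn_exp(float x) {
#ifdef GGML_FLASH_ATTN_EXP_FP16
    // faster, lower-precision path: a real build would use a
    // half-precision exp approximation here
    return expf(x);
#else
    return expf(x);  // default: full f32 exp
#endif
}
```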
xaedes
1065c3b7b9
tighten abs error bounds for cross_entropy_loss in test-grad0
2023-07-28 23:13:20 +02:00
xaedes
24a4b099f3
change sampling parameters for prediction after training to defaults of common.h
...
and clarify what is the context for prediction and what are the generated tokens
2023-07-28 23:13:19 +02:00
xaedes
17a0898d50
fix increase of model.train_samples and model.train_tokens
...
now that each optimizer iteration gets its own batch, we need to multiply by the number of opt iterations
2023-07-28 23:13:19 +02:00
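The arithmetic behind the fix above, with assumed names (not the actual struct): once every optimizer iteration consumes a fresh batch, the counters must be scaled by the iteration count rather than incremented once per optimizer call.

```cpp
#include <cstdint>

// Illustrative bookkeeping; field and parameter names are assumptions.
struct train_state {
    uint64_t train_samples = 0;
    uint64_t train_tokens  = 0;
};

// Each of the opt_iters iterations processes n_batch samples, so the
// totals grow by opt_iters * n_batch (and the token count likewise).
void update_counts(train_state& s, int opt_iters, int n_batch, int n_tokens_per_sample) {
    s.train_samples += (uint64_t) opt_iters * n_batch;
    s.train_tokens  += (uint64_t) opt_iters * n_batch * n_tokens_per_sample;
}
```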
xaedes
58024d3e5f
rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
2023-07-28 23:13:19 +02:00
xaedes
e6ff0728e0
add minimum number of tensor dimensions to apply weight decay (default 2)
...
this allows skipping weight decay for bias parameters
2023-07-28 23:13:19 +02:00
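The rule above can be sketched in a few lines (simplified, with assumed names, not the ggml optimizer code): decay is applied only to tensors with at least `min_dims` dimensions, so 1-D bias vectors are excluded by default.

```cpp
// Simplified AdamW-style step; struct and names are illustrative.
struct param {
    int   n_dims;  // number of tensor dimensions
    float value;
    float grad;
};

void adamw_decay_step(param& p, float lr, float wd, int min_dims = 2) {
    if (p.n_dims >= min_dims) {
        p.value -= lr * wd * p.value;  // decoupled weight decay, weights only
    }
    p.value -= lr * p.grad;            // gradient step (momentum terms omitted)
}
```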
xaedes
d7aa4d9576
use optimization callback in training
...
allows a dynamic learning schedule and different batch data for each iteration, without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restarts of the optimization function, and improves training convergence by providing a different batch for each iteration
2023-07-28 23:13:19 +02:00
xaedes
bfc3119139
add optimization callback to ggml_opt_resume_g
...
this callback is called before each iteration with custom data and a pointer to the learning schedule parameter (only used in Adam(W)).
it can be used for a dynamic learning schedule and for setting the input data of batches before each iteration
2023-07-28 23:13:18 +02:00
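The mechanism described in the last two commits can be sketched as follows; the signature and names are illustrative, not the `ggml_opt` API. The optimizer invokes a user callback before every iteration, which can stage the next batch and write the learning-schedule multiplier for that step.

```cpp
// Illustrative per-iteration optimizer callback, called with user data
// and a pointer to the learning-schedule value for the coming step.
typedef void (*opt_callback)(void* user_data, float* sched);

struct toy_opt {
    int n_iter = 0;
};

void opt_run(toy_opt& opt, int iters, opt_callback cb, void* user_data) {
    for (int i = 0; i < iters; ++i) {
        float sched = 1.0f;
        if (cb) cb(user_data, &sched);  // callback sets batch + schedule here
        // ... one optimizer iteration would scale its lr by `sched` ...
        opt.n_iter++;
    }
}

// Example callback: simple linear warmup of the schedule (illustrative).
void warmup_cb(void* user_data, float* sched) {
    int* counter = static_cast<int*>(user_data);
    (*counter)++;
    *sched = (*counter < 10) ? (*counter) / 10.0f : 1.0f;
}
```

Because the callback runs inside the optimizer loop, training no longer needs to restart the optimization function just to swap batches.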