Pavol Rusnak
bb98e77be7
nix: use convert.py instead of legacy wrapper convert-pth-to-ggml.py ( #981 )
2023-04-25 23:19:57 +02:00
Georgi Gerganov
7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) ( #1179 )
...
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)
* tests : fix test-quantize-fns
* ggml : finalize Q8_0 implementation
* ggml : use q4_0_q8_0 and q4_2_q8_0
* ggml : fix Q8_0 dot product bug (ARM)
* ggml : Q8_0 unroll x2
* ggml : fix bug - using wrong block type
* ggml : extend quantize_fns_t with "vec_dot_type"
* ggml : fix Q8_0 to use 255 values out of 256
* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
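For reference, a minimal sketch of the Q8_0 block layout and reference quantizer described above; QK8_0 = 32 and the struct fields follow ggml conventions of the time but should be treated as illustrative. Note the scale of amax/127 rather than amax/128, which is what the "255 values out of 256" fix amounts to:

    #include <math.h>
    #include <stdint.h>

    #define QK8_0 32

    typedef struct {
        float  d;            // scale factor for the block
        int8_t qs[QK8_0];    // quantized values in [-127, 127]
    } block_q8_0;

    // reference (scalar) quantizer for k floats, k divisible by QK8_0
    static void quantize_row_q8_0_ref(const float * x, block_q8_0 * y, int k) {
        for (int i = 0; i < k/QK8_0; i++) {
            float amax = 0.0f;   // absolute max of the block
            for (int j = 0; j < QK8_0; j++) {
                const float v = fabsf(x[i*QK8_0 + j]);
                if (v > amax) amax = v;
            }
            // scaling by amax/127 keeps values in [-127, 127]: 255 of the
            // 256 int8 levels, so the range stays symmetric around zero
            const float d  = amax / 127.0f;
            const float id = d != 0.0f ? 1.0f/d : 0.0f;
            y[i].d = d;
            for (int j = 0; j < QK8_0; j++) {
                y[i].qs[j] = (int8_t) roundf(x[i*QK8_0 + j] * id);
            }
        }
    }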
unbounded
dd0eabc049
ggml : use full range for Q4_0 and Q4_2 quantization ( #729 )
...
* Use full range for q4_0 quantization
By keeping the sign of the highest magnitude, we can make sure the
highest value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.
* Update quantize_row_q4_0 for AVX/AVX2
* Update quantize_row_q4_0 for WASM
Untested
* Update quantize_row_q4_0 for Arm NEON
* Update quantize_row_q4_0 for PowerPC
Untested
* Use full range for q4_2 quantization
2023-04-25 20:20:46 +03:00
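A sketch of the full-range trick described above: dividing by the signed maximum-magnitude value maps that value exactly onto -8, the previously unused level, while remaining bit-compatible with the existing format. This is an illustrative scalar version, not the SIMD paths the commit updates:

    #include <math.h>
    #include <stdint.h>

    #define QK4_0 32

    // quantize one block of QK4_0 floats to 4-bit values in [-8, 7]
    static void quantize_block_q4_0_full(const float * x, float * d_out, uint8_t * qs) {
        float max  = 0.0f;   // value with the highest magnitude, sign kept
        float amax = 0.0f;
        for (int j = 0; j < QK4_0; j++) {
            if (fabsf(x[j]) > amax) { amax = fabsf(x[j]); max = x[j]; }
        }
        const float d  = max / -8.0f;  // the extreme value now maps to -8
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        *d_out = d;
        for (int j = 0; j < QK4_0; j += 2) {
            // bias by 8.5 to round and shift into [0, 16], then clamp to 15
            int v0 = (int)(x[j + 0]*id + 8.5f); if (v0 > 15) v0 = 15;
            int v1 = (int)(x[j + 1]*id + 8.5f); if (v1 > 15) v1 = 15;
            qs[j/2] = (uint8_t)(v0 | (v1 << 4));   // two nibbles per byte
        }
    }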
Concedo
0aa3d839fb
free old ctx on retry
2023-04-25 23:42:57 +08:00
Concedo
a696b0a16c
missed another thing
2023-04-25 23:16:04 +08:00
Concedo
8c9c218609
missed a thing
2023-04-25 23:02:08 +08:00
Concedo
235daf4016
Merge branch 'master' into concedo
...
# Conflicts:
# .github/workflows/build.yml
# README.md
2023-04-25 20:44:22 +08:00
Concedo
72b2331ad6
edge cases with memory crashes? needs verification
2023-04-25 20:42:30 +08:00
Concedo
5eec5d6ed9
Added backwards compatibility with an earlier version of NeoX.
2023-04-25 20:34:18 +08:00
Concedo
bff998f871
Slight refactor of the Python code: credits to @LuxF3rre
2023-04-25 19:20:14 +08:00
xaedes
54bb60e268
ggml : fix bug in ggml_compute_forward_sum_f32 ( #1162 )
...
The sum over all rows is now computed instead of just the last row
2023-04-24 23:02:02 +02:00
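The gist of the fix, sketched (the names ne0/ne1/src/dst are illustrative stand-ins for the tensor dimensions and data): the per-row partial sums must be accumulated rather than overwritten, otherwise only the last row's sum survives:

    // sum every element of an ne0 x ne1 float tensor into dst[0]
    float sum = 0.0f;
    for (int i1 = 0; i1 < ne1; i1++) {       // all rows, not just the last
        float row_sum = 0.0f;
        for (int i0 = 0; i0 < ne0; i0++) {
            row_sum += src[i1*ne0 + i0];
        }
        sum += row_sum;   // the bug: this was effectively "sum = row_sum"
    }
    dst[0] = sum;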
Georgi Gerganov
8a0f8673ba
ggml : export symbols ( #1155 )
2023-04-24 22:18:25 +03:00
xaedes
0c5692345d
examples : add save_load_state example ( #1150 )
...
* add save_load_state example
* use <cstdio> instead of <iostream> and fprintf / printf instead of cout
* renamed save-load-state example files replacing underscores by dashes
2023-04-24 19:23:31 +03:00
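A hedged sketch of the flow the example demonstrates, using the llama.h state API of that period (llama_get_state_size / llama_copy_state_data / llama_set_state_data); error handling is omitted for brevity:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include "llama.h"

    // serialize the full context state (RNG, logits, KV cache) to a file
    static void save_state(struct llama_context * ctx, const char * path) {
        const size_t n = llama_get_state_size(ctx);
        uint8_t * buf = malloc(n);
        llama_copy_state_data(ctx, buf);
        FILE * f = fopen(path, "wb");
        fwrite(buf, 1, n, f);
        fclose(f);
        free(buf);
    }

    // restore the state into a context built from the same model
    static void load_state(struct llama_context * ctx, const char * path) {
        const size_t n = llama_get_state_size(ctx);
        uint8_t * buf = malloc(n);
        FILE * f = fopen(path, "rb");
        fread(buf, 1, n, f);
        fclose(f);
        llama_set_state_data(ctx, buf);
        free(buf);
    }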
Georgi Gerganov
957c8ae21d
llama : increase scratch buffer size for 65B (ref #1152 )
...
Temporary solution
2023-04-24 18:47:30 +03:00
mgroeber9110
9b0a4d4214
examples/main README improvements and some light refactoring ( #1131 )
2023-04-24 15:45:32 +00:00
Stephan Walter
2ec83428de
Fix build for gcc 8 and test in CI ( #1154 )
2023-04-24 15:38:26 +00:00
slaren
e4cf982e0d
Fix cuda compilation ( #1128 )
...
* Fix: cuBLAS compilation error due to missing -fPIC flag
---------
Co-authored-by: B1gM8c <89020353+B1gM8c@users.noreply.github.com>
2023-04-24 17:29:58 +02:00
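For context, the kind of build change such a fix involves: nvcc does not pass -fPIC to the host compiler on its own, so linking ggml-cuda.o into a shared object fails with a relocation error unless the flag is forwarded explicitly. A Makefile sketch, not the repo's exact rule:

    # -Xcompiler forwards the flag to the host C++ compiler behind nvcc
    ggml-cuda.o: ggml-cuda.cu ggml-cuda.h
    	nvcc -Xcompiler -fPIC -c ggml-cuda.cu -o ggml-cuda.o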
Concedo
59fb174678
fixed compile errors, made mmap automatic when LoRA is selected, added updated quantizers and quantization handling for GPT-NeoX, GPT-2 and GPT-J
2023-04-24 23:20:06 +08:00
Concedo
3962eb39c7
added token unbanning
2023-04-24 21:50:20 +08:00
Concedo
1b9b9068b1
merged q4_2 and q4_3 dequants and FIXED CLBlast SLOWNESS!
2023-04-24 21:33:01 +08:00
Concedo
e58f1d1336
Merge branch 'master' into concedo_experimental
2023-04-24 19:43:17 +08:00
Georgi Gerganov
c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API ( #1143 )
2023-04-24 07:40:02 +03:00
Concedo
8e615c8245
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# README.md
2023-04-24 12:20:08 +08:00
slaren
1d78fecdab
Fix LoRA acronym ( #1145 )
2023-04-23 23:03:44 +02:00
Georgi Gerganov
284685f169
scripts : add helper scripts to sync ggml repo
2023-04-23 19:57:09 +03:00
DannyDaemonic
edce63baa9
Added README.md for main with examples and explanations ( #1139 )
2023-04-23 15:37:02 +00:00
Georgi Gerganov
ec9cdb6752
ggml : do not print perf ops that have not been used at all
2023-04-23 18:32:52 +03:00
Georgi Gerganov
e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make"
2023-04-23 18:15:39 +03:00
Stephan Walter
53c8434398
Improve AVX2 for vec_dot_q4_3_q8_0 ( #1138 )
2023-04-23 11:01:03 +00:00
Pavol Rusnak
c6524f46eb
readme : update gpt4all instructions ( #980 )
2023-04-23 10:21:26 +02:00
Concedo
9129e937f9
only llama models can use batch sizes above 256, to prevent unacceptably high memory usage
2023-04-23 15:57:06 +08:00
Yishuo Wang
c9e2c26f41
A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 ( #1119 )
2023-04-23 07:57:05 +00:00
Concedo
432cc91649
still needs to be a bit higher for very small contexts
2023-04-23 15:01:38 +08:00
Concedo
4e1ea2ac61
hopefully fixed the OOMs for good
2023-04-23 13:49:50 +08:00
Gustavo Rocha Dias
3f21bd81f3
doc - Better explanation of how to build the libraries on Windows. ( #107 )
2023-04-23 13:40:09 +08:00
Concedo
d41490c27b
just revert to the working commit
2023-04-23 00:35:42 +08:00
Concedo
c60fb5ef4b
fixed rwkv build errors on ARM devices
2023-04-23 00:18:38 +08:00
Concedo
b5d6284190
increase initial buffer too
2023-04-23 00:07:33 +08:00
Concedo
d2f14b2b1f
add an extra buffer to mem allocations
2023-04-23 00:04:32 +08:00
Concedo
7c60441d71
Merge branch 'master' into concedo
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
2023-04-22 23:46:14 +08:00
Concedo
eb73b4c261
remove writing to cl_buffer_c and change it to a write-only buffer - should work since beta is always zero.
2023-04-22 23:19:17 +08:00
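The reasoning behind that change, sketched: sgemm computes C = alpha*A*B + beta*C, so when beta is always zero the kernel never reads C, and the output buffer can be created write-only with no host upload. Illustrative OpenCL, with the surrounding names (context, m, n, err) assumed:

    cl_int err;
    // before: C was read-write and copied to the device ahead of the GEMM;
    // after:  beta == 0 means its old contents are never read, only written
    cl_mem cl_buffer_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                        m * n * sizeof(float), NULL, &err);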
Concedo
cd6c121357
reinstated the reusable buffers -> approx 10% speedup for prompt processing
2023-04-22 22:49:27 +08:00
Georgi Gerganov
0e018fe008
ggml : fix Q4_3 cuBLAS
2023-04-22 16:32:07 +03:00
Stephan Walter
857308d1e8
ci : trigger CI for drafts, but not most PR actions ( #1125 )
2023-04-22 16:12:29 +03:00
Stephan Walter
c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig ( #1122 )
2023-04-22 10:54:13 +00:00
unbounded
5f939498d5
ggml : unit test for quantization functions ( #953 )
...
* Unit test for quantization functions
Use the ggml_internal_get_quantize_fn function to loop through all
quantization formats and run a sanity check on the result.
Also add a microbenchmark that times these functions directly without
running the rest of the GGML graph.
* test-quantize-fns: CI fixes
Fix issues uncovered in CI
- need to use sizes divisible by 32*8 for loop unrolling
- use intrinsic header that should work on Mac
* test-quantize: remove
Per PR comment, subsumed by test-quantize-fns
* test-quantize: fix for q8_0 intermediates
2023-04-22 12:10:39 +03:00
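A condensed sketch of the loop the test runs, assuming the quantize_fns_t layout of the time (quantize_row_q / dequantize_row_q pointers returned by ggml_internal_get_quantize_fn); the exact field names may differ slightly:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include "ggml.h"

    #define N 1024   // divisible by 32*8 so unrolled SIMD paths are covered

    int main(void) {
        float src[N], dst[N];
        uint8_t q[2*N];   // generous scratch for any block format
        for (int i = 0; i < N; i++) {
            src[i] = 0.1f + 2.0f*cosf(i);   // synthetic, non-trivial data
        }
        for (int t = 0; t < GGML_TYPE_COUNT; t++) {
            quantize_fns_t fns = ggml_internal_get_quantize_fn(t);
            if (!fns.quantize_row_q || !fns.dequantize_row_q) continue;
            fns.quantize_row_q(src, q, N);     // round-trip through the format
            fns.dequantize_row_q(q, dst, N);
            double err = 0.0;
            for (int i = 0; i < N; i++) {
                err += (dst[i] - src[i])*(dst[i] - src[i]);
            }
            printf("type %d: rmse %g\n", t, sqrt(err/N));
        }
        return 0;
    }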
wbpxre150
36b4f7e064
llama : print timings on ctrl+c exit ( #1021 )
...
* print timings on ctrl+c exit
* remove redundant free memory call.
* add global pointer to ctx.
2023-04-22 11:56:35 +03:00
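A minimal sketch of the mechanism those three bullets describe: a file-scope pointer to the context plus a SIGINT handler that prints timings before exiting (illustrative; the real example also cooperates with interactive mode):

    #include <signal.h>
    #include <stdlib.h>
    #include "llama.h"

    static struct llama_context * g_ctx = NULL;   // global pointer to ctx

    static void sigint_handler(int signo) {
        if (signo == SIGINT && g_ctx != NULL) {
            llama_print_timings(g_ctx);   // report timings on ctrl+c
            exit(130);
        }
    }

    // in main(), once the context exists:
    //   g_ctx = ctx;
    //   signal(SIGINT, sigint_handler);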
Concedo
811989c2ad
fixed PyInstaller
2023-04-22 16:31:42 +08:00
eiery
10f19c1121
llama : have n_batch default to 512 ( #1091 )
...
* set default n_batch to 512 when using BLAS
* spacing
* alternate implementation of setting different n_batch for BLAS
* set n_batch to 512 for all cases
2023-04-22 11:27:05 +03:00
Concedo
1b7aa2b815
Merge branch 'master' into concedo
...
# Conflicts:
# .github/workflows/build.yml
# CMakeLists.txt
# Makefile
2023-04-22 16:22:08 +08:00