Concedo
0485fa65a2
wstring convert for mpt
2023-06-24 11:43:42 +08:00
Concedo
6d718525c4
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-23 23:56:31 +08:00
Concedo
f7b096374d
fixed string too long CI issue
2023-06-23 23:56:22 +08:00
Concedo
490cf395f8
better alloc error
2023-06-23 22:51:51 +08:00
Concedo
ece453ed09
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# README.md
2023-06-23 22:46:54 +08:00
Concedo
f39a746089
bug fixes for openblas
2023-06-23 22:45:22 +08:00
Concedo
43c2891afa
option to not use scratch
2023-06-23 19:01:36 +08:00
Concedo
d5e4cf7ffe
handle ctx manip
2023-06-23 19:01:15 +08:00
Concedo
df9135e3a9
fixing memory bugs
2023-06-23 18:41:23 +08:00
eiery
d7b7484f74
Add OpenLLaMA instructions to the README (#1954)
...
* add openllama to readme
2023-06-23 10:38:01 +02:00
Erik Scholz
7487137227
rework convert.py to read hyper-parameters from config.json (#1958)
...
* Read hyper-parameters from the HuggingFace-transformers config.json, if it exists; otherwise fall back to guessing, as before.
This allows converting open_llama 3B and other non-standard model designs.
2023-06-22 14:20:47 +02:00
Concedo
0eedccaf06
Merge branch 'master' into optimize_quants_upstream
2023-06-22 17:59:58 +08:00
Concedo
e6ddb15c3a
cleanup
2023-06-22 10:38:27 +08:00
Johannes Gäßler
bbca06e269
cmake: revert CUDA arch default to 52, 61 if f16 (#1959)
2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99
Fix typo in README.md (#1961)
2023-06-21 23:48:43 +02:00
Concedo
1b71752a9f
Implemented basic GPU offloading for MPT, GPT-2, GPT-J and GPT-NeoX
2023-06-22 00:43:25 +08:00
Ycros
b1f00fa9cc
Fix hordeconfig max context setting, and add Makefile flags for cuda F16/KQuants per iter. (#252)
...
* Fix hordeconfig maxcontext setting.
* cuda: Bring DMMV_F16 and KQUANTS_ITER Makefile flags over from llama.
2023-06-21 23:01:46 +08:00
Concedo
dfdd20240c
gpt j use scratch buffers
2023-06-21 16:10:31 +08:00
Georgi Gerganov
049aa16b8c
readme : add link to p1
2023-06-20 19:05:54 +03:00
Concedo
266d47a4b9
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 22:46:35 +08:00
Concedo
da668e685f
fixing address spaces
2023-06-20 22:46:11 +08:00
Concedo
cce6e67f44
fixing address spaces
2023-06-20 22:45:16 +08:00
Concedo
1f1735f5ad
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 21:39:35 +08:00
Concedo
6b75fc48b9
fixed global const struct types
2023-06-20 21:38:48 +08:00
Xiake Sun
2322ec223a
Fix typo (#1949)
2023-06-20 15:42:40 +03:00
Concedo
537ff22ec9
fixed a bug with token timings, updated lite
2023-06-20 20:41:42 +08:00
Concedo
c5ae3f50a7
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 18:41:13 +08:00
Concedo
a6e8b0216d
remove old dot kernels and template
2023-06-20 18:37:48 +08:00
Concedo
93247a11cd
ported q2k and q5k speedups
2023-06-20 18:37:41 +08:00
Concedo
029bed6446
ported q3k speedup successfully
2023-06-20 18:37:26 +08:00
Concedo
d754915269
Merge branch 'optimize_quants_upstream' into concedo_experimental
2023-06-20 17:26:39 +08:00
Concedo
b4c532e862
Merge branch 'master' into concedo_experimental
2023-06-20 17:26:27 +08:00
0cc4m
8d816d19d1
Add q6_k fast matmul kernel
2023-06-20 08:41:35 +02:00
0cc4m
34a4917984
Use preprocessor for QK_K
2023-06-20 08:04:16 +02:00
0cc4m
069cbe530d
Fix q2_k fast kernel
2023-06-20 08:01:40 +02:00
Ettore Di Giacinto
aacdbd4056
llama : fix params struct alignment (#1936)
...
* Workaround struct misalignment during value-copy
Signed-off-by: mudler <mudler@localai.io>
* Move booleans at the bottom of the structure
Signed-off-by: mudler <mudler@localai.io>
* Add comment
Signed-off-by: mudler <mudler@localai.io>
---------
Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
Henri Vasserman
20568fe60f
[Fix] Reenable server embedding endpoint (#1937)
...
* Add back embedding feature
* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3
ggml : fix bug in LBFGS optimizer (found by ggml tests)
2023-06-19 20:43:30 +03:00
Concedo
69fd31d18c
Merge branch 'master' into optimize_quants_upstream
2023-06-19 23:38:59 +08:00
Concedo
5e8e99f206
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
2023-06-19 23:37:53 +08:00
l3utterfly
ba4e85a833
llama : use aligned memory during ggml_init call from loading saved sessions (#1934)
...
* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions
* - removed commented out old code from fix
- updated another instance of same issue below original
2023-06-19 18:20:06 +03:00
Georgi Gerganov
23fc5c219a
cmake : fix trailing whitespaces
2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69
llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)
...
* Only use Q6_K for output weights if tensor size is multiple of 256
* Fixed copy/paste mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
Kawrakow
ca7c3f4da5
cuda : faster k-quants on older GPUs (#1930)
...
* k_quants: hopefully much faster Q4_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs
It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I have written in the faster
k-quants PR.
* k_quants: faster Q5_K on older GPUs
68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.
It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
Georgi Gerganov
b97ca431db
ggml : sync latest ggml repo (#1924)
...
* ggml : sync latest ggml repo
* ggml : remove unused comments
* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0
cmake : fix build shared ggml when CUDA is enabled (#1929)
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00
Concedo
c94a438328
xx + ib0
2023-06-19 23:01:49 +08:00
Concedo
266d436746
Added broken new q4k quant
2023-06-19 22:41:35 +08:00
Concedo
51e834c27b
keep duplicate targets for now
2023-06-19 22:38:23 +08:00
Concedo
cf94340dfc
Merge branch 'master' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# Makefile
# README.md
2023-06-19 22:28:38 +08:00