niansa
d1f84db4b6
Implemented GGML_OP_NORM
2023-06-30 15:18:10 +02:00
niansa
8fa60134b1
Added missing break to mul_mat_f16 case
2023-06-30 12:47:17 +02:00
niansa
0dc5f2f2ba
Fixed mul mat dispatch size
2023-06-30 12:31:13 +02:00
niansa
f093bf2e5e
Minor MUL_MAT fix and implemented DIAG_MASK_INF
2023-06-30 12:19:29 +02:00
niansa
964fe8c546
Added mul_mat (needs fixes)
2023-06-30 11:47:10 +02:00
niansa
749d6179a8
Snake case all functions
2023-06-29 14:23:00 +02:00
niansa
5ac68ccacb
Cleanups
2023-06-29 11:14:21 +02:00
niansa
de7d1823ed
Implemented ggml_vk_soft_max
2023-06-28 12:48:41 +02:00
niansa
e2b721db65
Allow vk add row
2023-06-28 10:19:18 +02:00
niansa
ed14f0764a
Fixed ggml_vk_abmath row argument
2023-06-28 10:15:23 +02:00
niansa
072007b1e8
Add buffer qualifiers
2023-06-23 21:21:16 +02:00
niansa
acb7d90398
Reenabled unknown op message
2023-06-23 20:39:32 +02:00
niansa
5d5f66d1d9
More little fixes and stuff
2023-06-23 20:37:58 +02:00
niansa
e0814f86a2
Free vk context
2023-06-23 20:02:46 +02:00
niansa
55815b67f4
Improved memory safety
2023-06-23 19:58:41 +02:00
niansa
4b267e88b6
Temporarily handle all layers
2023-06-23 18:40:58 +02:00
niansa
40621ea0ec
Added more debugging
2023-06-23 18:26:21 +02:00
niansa
e6da9bd96b
Added ggml_vk_mem_used()
2023-06-23 17:57:09 +02:00
niansa
1a68195408
Add mutexes for gpu tensors
2023-06-23 17:46:09 +02:00
niansa
46f577bfc1
Copy tensors host-to-device (h2d) during model load
2023-06-23 17:10:45 +02:00
niansa
98e588c6eb
Fix ggml_vk_h2d_tensor throwing on second call
2023-06-23 16:50:37 +02:00
niansa
09b0b3a49b
Wait for all threads to finish
2023-06-23 16:13:32 +02:00
niansa
2589cb0c70
Prevent compileSource race
2023-06-23 16:02:49 +02:00
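The compileSource race fixed above is plausibly concurrent shader compilation from multiple graph-compute threads; a minimal sketch of the usual fix, with a hypothetical compile_source() wrapper standing in for the backend's actual compileSource (whose signature is not shown in the log), is to serialize the calls with a static mutex:

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical: the real compileSource lives in the Vulkan backend and
// its signature may differ. Assume it turns GLSL source into SPIR-V.
std::vector<uint32_t> compile_source_impl(const std::string &src);

// Sketch of the fix: shader compilers often keep global state, so two
// threads compiling at once can race. A static mutex serializes them.
std::vector<uint32_t> compile_source(const std::string &src) {
    static std::mutex compile_mutex;
    const std::lock_guard<std::mutex> lock(compile_mutex);
    return compile_source_impl(src);
}
```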
niansa
5c0d8dd0f2
Specify program output size
2023-06-23 15:58:13 +02:00
niansa
e830264c92
Share sequence to functions and add scale()
2023-06-23 15:10:24 +02:00
niansa
5e9403342b
Minor fixes
2023-06-23 15:01:09 +02:00
niansa
b6264542b7
Added vk_mul to ggml_vk_graph_compute
2023-06-23 14:19:31 +02:00
niansa
18d6f7f8da
More progress...
2023-06-23 14:08:45 +02:00
niansa
d539247996
Began implementing ggml_graph_compute
2023-06-23 14:03:33 +02:00
niansa
b8a4594f89
More fixes...
2023-06-23 12:19:33 +02:00
niansa
9d643755a6
Fixed compile error
2023-06-23 11:51:25 +02:00
niansa
339bc36cdd
Added more functions from Metal
2023-06-23 11:50:30 +02:00
niansa
9cdaea9240
Implemented dequantize_row_q4_1
2023-06-22 16:30:36 +02:00
niansa
b0f11fa9c1
More code cleanups
2023-06-22 16:05:56 +02:00
niansa
3b3d30e4ad
Cleanups
2023-06-22 13:57:04 +02:00
niansa
2f3fe0c0a4
Updated gitignore
2023-06-22 13:57:04 +02:00
niansa
4f598dd973
Initial working stuff
2023-06-22 13:57:04 +02:00
Johannes Gäßler
bbca06e269
cmake: revert CUDA arch default to 52, 61 if f16 (#1959)
2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99
Fix typo in README.md (#1961)
2023-06-21 23:48:43 +02:00
Georgi Gerganov
049aa16b8c
readme : add link to p1
2023-06-20 19:05:54 +03:00
Xiake Sun
2322ec223a
Fix typo (#1949)
2023-06-20 15:42:40 +03:00
Ettore Di Giacinto
aacdbd4056
llama : fix params struct alignment (#1936)
* Workaround struct misalignment during value-copy
Signed-off-by: mudler <mudler@localai.io>
* Move booleans at the bottom of the structure
Signed-off-by: mudler <mudler@localai.io>
* Add comment
Signed-off-by: mudler <mudler@localai.io>
---------
Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
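The fix in #1936 works around the misalignment by moving the booleans to the bottom of the structure; a hedged illustration of the idea follows, with invented field names rather than the actual llama_context_params layout:

```cpp
#include <cstdint>

// Invented example, not the real llama_context_params. Grouping the
// 1-byte bools after the wider members removes padding holes in the
// middle of the struct, so a value-copy across an ABI boundary with a
// slightly different idea of the layout is less likely to shear fields.
struct example_params {
    uint32_t n_ctx;      // wider members first
    uint32_t seed;
    float    rope_scale;
    bool     use_mmap;   // booleans moved to the bottom
    bool     use_mlock;
    bool     embedding;
};
```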
Henri Vasserman
20568fe60f
[Fix] Reenable server embedding endpoint (#1937)
* Add back embedding feature
* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3
ggml : fix bug in LBFGS optimizer (found by ggml tests)
2023-06-19 20:43:30 +03:00
l3utterfly
ba4e85a833
llama : use aligned memory during the ggml_init call when loading saved sessions (#1934)
* Fixed issue: memory is not guaranteed to be aligned properly during the ggml_init call when loading saved sessions
* Removed commented-out old code from the fix
* Updated another instance of the same issue below the original
2023-06-19 18:20:06 +03:00
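Conceptually, the fix means handing ggml_init a mem_buffer that is explicitly aligned instead of a raw malloc pointer. A minimal sketch, assuming C++17 std::aligned_alloc and a 32-byte alignment requirement (the alignment ggml actually requires may differ):

```cpp
#include <cstdlib>

// Sketch only: allocate the scratch buffer passed to ggml_init with an
// explicit alignment so tensor data placed inside it is aligned too.
constexpr size_t SCRATCH_ALIGN = 32; // assumed; ggml's requirement may differ

void *alloc_scratch(size_t size) {
    // std::aligned_alloc requires size to be a multiple of the alignment
    const size_t padded = (size + SCRATCH_ALIGN - 1) / SCRATCH_ALIGN * SCRATCH_ALIGN;
    return std::aligned_alloc(SCRATCH_ALIGN, padded);
}
```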
Georgi Gerganov
23fc5c219a
cmake : fix trailing whitespaces
2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69
llama : only use Q6_K for output weights if tensor size is a multiple of 256 (#1932)
* Only use Q6_K for output weights if the tensor size is a multiple of 256
* Fixed copy/paste mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
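The rule in #1932 is a divisibility guard: k-quants pack weights into 256-element super-blocks, so a tensor whose element count is not a multiple of 256 cannot be stored as Q6_K and must fall back to another type. A hedged sketch (the function and the fallback type here are illustrative, not the actual quantization code):

```cpp
// Illustrative only: choose Q6_K for the output weights only when the
// element count divides evenly into 256-weight super-blocks.
enum class qtype { Q6_K, Q5_1 };

constexpr long QK_K = 256; // weights per k-quant super-block

qtype choose_output_qtype(long n_elements) {
    if (n_elements % QK_K == 0) {
        return qtype::Q6_K;
    }
    return qtype::Q5_1; // assumed fallback; the real choice may differ
}
```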
Kawrakow
ca7c3f4da5
cuda : faster k-quants on older GPUs (#1930)
* k_quants: hopefully much faster Q4_K on older GPUs. On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs. On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs. It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I had written in the faster k-quants PR.
* k_quants: faster Q5_K on older GPUs. 68.5 ms/tok -> 62.0 ms/tok on the GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so the output, token embeddings, and KV cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
Georgi Gerganov
b97ca431db
ggml : sync latest ggml repo (#1924)
* ggml : sync latest ggml repo
* ggml : remove unused comments
* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0
cmake : fix build shared ggml when CUDA is enabled (#1929)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00