Commit graph

748 commits

niansa
e0814f86a2 Free vk context 2023-06-23 20:02:46 +02:00
niansa
55815b67f4 Improved memory safety 2023-06-23 19:58:41 +02:00
niansa
4b267e88b6 Temporarily care for all layers 2023-06-23 18:40:58 +02:00
niansa
40621ea0ec Added more debugging 2023-06-23 18:26:21 +02:00
niansa
e6da9bd96b Added ggml_vk_mem_used() 2023-06-23 17:57:09 +02:00
niansa
1a68195408 Add mutexes for gpu tensors 2023-06-23 17:46:09 +02:00
niansa
46f577bfc1 h2d tensors during loadup 2023-06-23 17:10:45 +02:00
niansa
98e588c6eb Fix ggml_vk_h2d_tensor throwing on second call 2023-06-23 16:50:37 +02:00
niansa
09b0b3a49b Wait for all threads to finish 2023-06-23 16:13:32 +02:00
niansa
2589cb0c70 Prevent compileSource race 2023-06-23 16:02:49 +02:00
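
A note on the threading fixes above ("Add mutexes for gpu tensors", "Wait for all threads to finish", "Prevent compileSource race"): the common pattern is serializing access to shared backend state. A minimal sketch of the compileSource case, with an invented cache and types rather than the actual Vulkan backend code:

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: serialize shader compilation so concurrent
// callers cannot race on the shared SPIR-V cache.
static std::mutex g_compile_mutex;
static std::unordered_map<std::string, std::vector<uint32_t>> g_spirv_cache;

const std::vector<uint32_t> & compileSource(const std::string & glsl) {
    std::lock_guard<std::mutex> lock(g_compile_mutex); // one compile at a time
    auto it = g_spirv_cache.find(glsl);
    if (it == g_spirv_cache.end()) {
        // compile GLSL -> SPIR-V here; the result is cached so repeated
        // requests for the same source are cheap
        it = g_spirv_cache.emplace(glsl, std::vector<uint32_t>{}).first;
    }
    return it->second;
}
```
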
niansa
5c0d8dd0f2 Specify program output size 2023-06-23 15:58:13 +02:00
niansa
e830264c92 Share sequence to functions and add scale() 2023-06-23 15:10:24 +02:00
niansa
5e9403342b Minor fixes 2023-06-23 15:01:09 +02:00
niansa
b6264542b7 Added vk_mul to ggml_vk_graph_compute 2023-06-23 14:19:31 +02:00
niansa
18d6f7f8da More progress... 2023-06-23 14:08:45 +02:00
niansa
d539247996 Began implementing ggml_graph_compute 2023-06-23 14:03:33 +02:00
niansa
b8a4594f89 More fixes... 2023-06-23 12:19:33 +02:00
niansa
9d643755a6 Fixed compile error 2023-06-23 11:51:25 +02:00
niansa
339bc36cdd Added more functions from Metal 2023-06-23 11:50:30 +02:00
niansa
9cdaea9240 Implemented dequantize_row_q4_1 2023-06-22 16:30:36 +02:00
niansa
b0f11fa9c1 More code cleanups 2023-06-22 16:05:56 +02:00
niansa
3b3d30e4ad Cleanups 2023-06-22 13:57:04 +02:00
niansa
2f3fe0c0a4 Updated gitignore 2023-06-22 13:57:04 +02:00
niansa
4f598dd973 Initial working stuff 2023-06-22 13:57:04 +02:00
Johannes Gäßler
bbca06e269 cmake: revert CUDA arch default to 52, 61 if f16 (#1959) 2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99 Fix typo in README.md (#1961) 2023-06-21 23:48:43 +02:00
Georgi Gerganov
049aa16b8c readme : add link to p1 2023-06-20 19:05:54 +03:00
Xiake Sun
2322ec223a Fix typo (#1949) 2023-06-20 15:42:40 +03:00
Ettore Di Giacinto
aacdbd4056 llama : fix params struct alignment (#1936)
* Workaround struct misalignment during value-copy

Signed-off-by: mudler <mudler@localai.io>

* Move booleans at the bottom of the structure

Signed-off-by: mudler <mudler@localai.io>

* Add comment

Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
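
The workaround above moves the booleans to the bottom of the structure. A hedged illustration of why field order matters (the field names here are invented, not the real llama.cpp params struct): interleaving 1-byte bools with 4-byte members forces the compiler to insert padding, and inconsistent assumptions about that layout across a value-copy boundary are exactly the kind of misalignment the commit works around.

```cpp
#include <cstdio>

struct params_interleaved {
    bool  use_mmap;   // 1 byte + 3 bytes padding
    int   n_ctx;      // 4 bytes
    bool  use_mlock;  // 1 byte + 3 bytes padding
    float rope_base;  // 4 bytes
};

struct params_bools_last {
    int   n_ctx;
    float rope_base;
    bool  use_mmap;   // 1-byte members packed together at the end...
    bool  use_mlock;  // ...leaving only 2 bytes of tail padding
};

int main() {
    std::printf("%zu vs %zu\n",
                sizeof(params_interleaved), sizeof(params_bools_last));
}
```

On typical 64-bit ABIs this prints 16 vs 12: the same members, one third less padding.
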
Henri Vasserman
20568fe60f [Fix] Reenable server embedding endpoint (#1937)
* Add back embedding feature

* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3 ggml : fix bug in LBFGS optimizer (found by ggml tests) 2023-06-19 20:43:30 +03:00
l3utterfly
ba4e85a833 llama : use aligned memory during ggml_init call from loading saved sessions (#1934)
* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions

* - removed commented out old code from fix
- updated another instance of same issue below original
2023-06-19 18:20:06 +03:00
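
The fix above ensures the buffer handed to ggml during session loading meets ggml's alignment requirement. A minimal sketch of the idea, assuming a 16-byte requirement (the constant and helper names are illustrative, not the actual patch):

```cpp
#include <cstddef>
#include <new>

constexpr std::size_t GGML_MEM_ALIGN = 16; // assumed for illustration

// Allocate a context buffer with explicit alignment (C++17 aligned new);
// a plain byte-array allocation need not be aligned this strictly.
void * alloc_ctx_buffer(std::size_t size) {
    return ::operator new(size, std::align_val_t{GGML_MEM_ALIGN});
}

void free_ctx_buffer(void * p) {
    // must pair with the matching aligned delete
    ::operator delete(p, std::align_val_t{GGML_MEM_ALIGN});
}
```
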
Georgi Gerganov
23fc5c219a cmake : fix trailing whitespaces 2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69 llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)
* Only use Q6_K for output weights if tensor size is multiple of 256

* Fixed copy/paste mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
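
Background for the commit above: k-quants pack weights into 256-element super-blocks, so they only apply when the row length divides evenly by 256. A hedged sketch of the guard (the enum and fallback type are placeholders, not ggml's actual selection logic):

```cpp
#include <cstdint>

enum class qtype { Q6_K, Q8_0 }; // placeholder names for illustration

constexpr int64_t QK_K = 256; // k-quant super-block size

// Use the k-quant type only when 256 divides the row; otherwise fall
// back to a non-k quantization that has no such constraint.
qtype pick_output_quant(int64_t n_per_row) {
    return (n_per_row % QK_K == 0) ? qtype::Q6_K : qtype::Q8_0;
}
```
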
Kawrakow
ca7c3f4da5 cuda : faster k-quants on older GPUs (#1930)
* k_quants: hopefully much faster Q4_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!

* k_quants: hopefully much faster Q3_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!

* k_quants: faster Q2_K on older GPUs

It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I had written in the faster
k-quants PR.

* k_quants: faster Q5_K on older GPUs

68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.

It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
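
For scale, the quoted Q4_K numbers work out to roughly a 1.6x speedup: 65.5 / 41.5 ≈ 1.58, i.e. about 15.3 tok/s rising to 24.1 tok/s on the GTX-1660.
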
Georgi Gerganov
b97ca431db ggml : sync latest ggml repo (#1924)
* ggml : sync latest ggml repo

* ggml : remove unused comments

* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0 cmake : fix build shared ggml when CUDA is enabled (#1929)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00
Johannes Gäßler
16b9cd1939 Convert vector to f16 for dequantize mul mat vec (#1913)
* Convert vector to f16 for dmmv

* compile option

* Added compilation option description to README

* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-19 10:23:56 +02:00
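
The change above halves the bytes the dequantize-mul-mat-vec kernel streams for the activation vector by converting it to f16 once up front. A simplified CPU-side sketch of that conversion (truncating, with no NaN/denormal handling, unlike the round-to-nearest conversion real code uses; on CUDA this is simply __float2half):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

static uint16_t fp32_to_fp16(float f) {
    uint32_t x; std::memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  e    = int32_t((x >> 23) & 0xffu) - 127 + 15;
    if (e <= 0)  return uint16_t(sign);            // underflow -> signed zero
    if (e >= 31) return uint16_t(sign | 0x7c00u);  // overflow  -> infinity
    return uint16_t(sign | (uint32_t(e) << 10) | ((x >> 13) & 0x3ffu));
}

// Convert the activation vector once; every row of the quantized weight
// matrix then reads half as many bytes for it during the mat-vec product.
std::vector<uint16_t> to_half(const float * y, int n) {
    std::vector<uint16_t> y16(n);
    for (int i = 0; i < n; ++i) y16[i] = fp32_to_fp16(y[i]);
    return y16;
}
```
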
Johannes Gäßler
b24c3049d9 Added tokens per second to info prints (#1928) 2023-06-18 17:41:26 +02:00
Johannes Gäßler
0ede372a51 Fixed incorrectly applying RMS norm twice (#1925) 2023-06-18 16:07:09 +02:00
l3utterfly
8596af4277 ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918) 2023-06-18 14:19:16 +03:00
Mike
e1886cf4fe readme : update Android build instructions (#1922)
Add steps for using Termux on Android devices to prevent common errors.
2023-06-18 11:28:26 +03:00
Kawrakow
8ab8ba62eb llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921)
* Fix examples/metal

* k-quants: prevent usage when tensor size is not divisible by 256

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 11:13:43 +03:00
Kawrakow
90cc59d6ab examples : fix examples/metal (#1920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 10:52:10 +03:00
Georgi Gerganov
ce2c7d72e2 metal : handle buffers larger than device's maxBufferLength (#1826)
* metal : handle buffers larger than device's maxBufferLength

* metal : print more verbose device info + handle errors

* metal : fix prints for overlapping views

* metal : minimize view overlap to try to utilize device memory better
2023-06-18 09:09:47 +03:00
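
A sketch of the splitting arithmetic behind the first item above (no Metal API calls here): a weights buffer larger than the device's maxBufferLength must be covered by several smaller views. Real code also aligns view boundaries and, per the last item, minimizes how much adjacent views overlap so no tensor straddles two views.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct view { std::size_t offset, len; };

// Cover `size` bytes with views of at most `max_len` bytes each.
std::vector<view> split_buffer(std::size_t size, std::size_t max_len) {
    std::vector<view> views;
    for (std::size_t off = 0; off < size; off += max_len) {
        views.push_back({off, std::min(max_len, size - off)});
    }
    return views;
}
```
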
Howard Su
57cd69460f cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917) 2023-06-18 07:29:47 +03:00
Georgi Gerganov
b2416493ab make : do not print help for simple example 2023-06-17 20:55:03 +03:00
Georgi Gerganov
4f9c43e3bd minor : warning fixes 2023-06-17 20:24:11 +03:00
Johannes Gäßler
2c9380dd2f Only one CUDA stream per device for async compute (#1898) 2023-06-17 19:15:02 +02:00
Georgi Gerganov
051e1b0e6a llama : fix kv_cache n init (close #1903) 2023-06-17 19:31:20 +03:00