niansa
d1f84db4b6
Implemented GGML_OP_NORM
2023-06-30 15:18:10 +02:00
niansa
8fa60134b1
Added missing break to mul_mat_f16 case
2023-06-30 12:47:17 +02:00
niansa
0dc5f2f2ba
Fixed mul mat dispatch size
2023-06-30 12:31:13 +02:00
niansa
f093bf2e5e
Minor MUL_MAT fix and implemented DIAG_MASK_INF
2023-06-30 12:19:29 +02:00
niansa
964fe8c546
Added mul_mat (needs fixes)
2023-06-30 11:47:10 +02:00
niansa
749d6179a8
Snake case all functions
2023-06-29 14:23:00 +02:00
niansa
5ac68ccacb
Cleanups
2023-06-29 11:14:21 +02:00
niansa
de7d1823ed
Implemented ggml_vk_soft_max
2023-06-28 12:48:41 +02:00
niansa
e2b721db65
Allow vk add row
2023-06-28 10:19:18 +02:00
niansa
ed14f0764a
Fixed ggml_vk_abmath row argument
2023-06-28 10:15:23 +02:00
niansa
072007b1e8
Add buffer qualifiers
2023-06-23 21:21:16 +02:00
niansa
acb7d90398
Reenabled unknown op message
2023-06-23 20:39:32 +02:00
niansa
5d5f66d1d9
More little fixes and stuff
2023-06-23 20:37:58 +02:00
niansa
e0814f86a2
Free vk context
2023-06-23 20:02:46 +02:00
niansa
55815b67f4
Improved memory safety
2023-06-23 19:58:41 +02:00
niansa
4b267e88b6
Temporarily handle all layers
2023-06-23 18:40:58 +02:00
niansa
40621ea0ec
Added more debugging
2023-06-23 18:26:21 +02:00
niansa
e6da9bd96b
Added ggml_vk_mem_used()
2023-06-23 17:57:09 +02:00
niansa
1a68195408
Add mutexes for gpu tensors
2023-06-23 17:46:09 +02:00
niansa
46f577bfc1
Copy tensors host-to-device (h2d) during model load
2023-06-23 17:10:45 +02:00
niansa
98e588c6eb
Fix ggml_vk_h2d_tensor throwing on second call
2023-06-23 16:50:37 +02:00
niansa
09b0b3a49b
Wait for all threads to finish
2023-06-23 16:13:32 +02:00
niansa
2589cb0c70
Prevent compileSource race
2023-06-23 16:02:49 +02:00
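The compileSource race fixed above is plausibly concurrent shader compilation from multiple graph-compute threads; a minimal sketch of the usual fix, with a hypothetical compile_source() wrapper standing in for the backend's actual compileSource (whose signature is not shown in the log), is to serialize the calls with a static mutex:

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical: the real compileSource lives in the Vulkan backend and
// its signature may differ. Assume it turns GLSL source into SPIR-V.
std::vector<uint32_t> compile_source_impl(const std::string &src);

// Sketch of the fix: shader compilers often keep global state, so two
// threads compiling at once can race. A static mutex serializes them.
std::vector<uint32_t> compile_source(const std::string &src) {
    static std::mutex compile_mutex;
    const std::lock_guard<std::mutex> lock(compile_mutex);
    return compile_source_impl(src);
}
```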
niansa
5c0d8dd0f2
Specify program output size
2023-06-23 15:58:13 +02:00
niansa
e830264c92
Share sequence to functions and add scale()
2023-06-23 15:10:24 +02:00
niansa
5e9403342b
Minor fixes
2023-06-23 15:01:09 +02:00
niansa
b6264542b7
Added vk_mul to ggml_vk_graph_compute
2023-06-23 14:19:31 +02:00
niansa
18d6f7f8da
More progress...
2023-06-23 14:08:45 +02:00
niansa
d539247996
Began implementing ggml_graph_compute
2023-06-23 14:03:33 +02:00
niansa
b8a4594f89
More fixes...
2023-06-23 12:19:33 +02:00
niansa
9d643755a6
Fixed compile error
2023-06-23 11:51:25 +02:00
niansa
339bc36cdd
Added more functions from Metal
2023-06-23 11:50:30 +02:00
niansa
9cdaea9240
Implemented dequantize_row_q4_1
2023-06-22 16:30:36 +02:00
niansa
b0f11fa9c1
More code cleanups
2023-06-22 16:05:56 +02:00
niansa
3b3d30e4ad
Cleanups
2023-06-22 13:57:04 +02:00
niansa
2f3fe0c0a4
Updated gitignore
2023-06-22 13:57:04 +02:00
niansa
4f598dd973
Initial working stuff
2023-06-22 13:57:04 +02:00
Johannes Gäßler
bbca06e269
cmake: revert CUDA arch default to 52, 61 if f16 (#1959)
2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99
Fix typo in README.md (#1961)
2023-06-21 23:48:43 +02:00
Georgi Gerganov
049aa16b8c
readme : add link to p1
2023-06-20 19:05:54 +03:00
Xiake Sun
2322ec223a
Fix typo (#1949)
2023-06-20 15:42:40 +03:00
Ettore Di Giacinto
aacdbd4056
llama : fix params struct alignment (#1936)
* Workaround struct misalignment during value-copy
Signed-off-by: mudler <mudler@localai.io>
* Move booleans at the bottom of the structure
Signed-off-by: mudler <mudler@localai.io>
* Add comment
Signed-off-by: mudler <mudler@localai.io>
---------
Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
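The fix in #1936 works around the misalignment by moving the booleans to the bottom of the structure; a hedged illustration of the idea follows, with invented field names rather than the actual llama_context_params layout:

```cpp
#include <cstdint>

// Invented example, not the real llama_context_params. Grouping the
// 1-byte bools after the wider members removes padding holes in the
// middle of the struct, so a value-copy across an ABI boundary with a
// slightly different idea of the layout is less likely to shear fields.
struct example_params {
    uint32_t n_ctx;      // wider members first
    uint32_t seed;
    float    rope_scale;
    bool     use_mmap;   // booleans moved to the bottom
    bool     use_mlock;
    bool     embedding;
};
```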
Henri Vasserman
20568fe60f
[Fix] Reenable server embedding endpoint (#1937)
* Add back embedding feature
* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3
ggml : fix bug in LBFGS optimizer (found by ggml tests)
2023-06-19 20:43:30 +03:00
l3utterfly
ba4e85a833
llama : use aligned memory during the ggml_init call when loading saved sessions (#1934)
* Fixed issue: memory is not guaranteed to be aligned properly during the ggml_init call when loading saved sessions
* Removed commented-out old code from the fix
* Updated another instance of the same issue below the original
2023-06-19 18:20:06 +03:00
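Conceptually, the fix means handing ggml_init a mem_buffer that is explicitly aligned instead of a raw malloc pointer. A minimal sketch, assuming C++17 std::aligned_alloc and a 32-byte alignment requirement (the alignment ggml actually requires may differ):

```cpp
#include <cstdlib>

// Sketch only: allocate the scratch buffer passed to ggml_init with an
// explicit alignment so tensor data placed inside it is aligned too.
constexpr size_t SCRATCH_ALIGN = 32; // assumed; ggml's requirement may differ

void *alloc_scratch(size_t size) {
    // std::aligned_alloc requires size to be a multiple of the alignment
    const size_t padded = (size + SCRATCH_ALIGN - 1) / SCRATCH_ALIGN * SCRATCH_ALIGN;
    return std::aligned_alloc(SCRATCH_ALIGN, padded);
}
```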
Georgi Gerganov
23fc5c219a
cmake : fix trailing whitespaces
2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69
llama : only use Q6_K for output weights if tensor size is a multiple of 256 (#1932)
* Only use Q6_K for output weights if the tensor size is a multiple of 256
* Fixed copy/paste mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
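The rule in #1932 is a divisibility guard: k-quants pack weights into 256-element super-blocks, so a tensor whose element count is not a multiple of 256 cannot be stored as Q6_K and must fall back to another type. A hedged sketch (the function and the fallback type here are illustrative, not the actual quantization code):

```cpp
// Illustrative only: choose Q6_K for the output weights only when the
// element count divides evenly into 256-weight super-blocks.
enum class qtype { Q6_K, Q5_1 };

constexpr long QK_K = 256; // weights per k-quant super-block

qtype choose_output_qtype(long n_elements) {
    if (n_elements % QK_K == 0) {
        return qtype::Q6_K;
    }
    return qtype::Q5_1; // assumed fallback; the real choice may differ
}
```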
Kawrakow
ca7c3f4da5
cuda : faster k-quants on older GPUs (#1930)
* k_quants: hopefully much faster Q4_K on older GPUs. On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs. On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs. It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I had written in the faster k-quants PR.
* k_quants: faster Q5_K on older GPUs. 68.5 ms/tok -> 62.0 ms/tok on the GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so the output, token embeddings, and KV cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
Georgi Gerganov
b97ca431db
ggml : sync latest ggml repo (#1924)
* ggml : sync latest ggml repo
* ggml : remove unused comments
* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0
cmake : fix build shared ggml when CUDA is enabled (#1929)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00