Commit graph

748 commits

niansa
e0814f86a2 Free vk context 2023-06-23 20:02:46 +02:00
niansa
55815b67f4 Improved memory safety 2023-06-23 19:58:41 +02:00
niansa
4b267e88b6 Temporarily care for all layers 2023-06-23 18:40:58 +02:00
niansa
40621ea0ec Added more debugging 2023-06-23 18:26:21 +02:00
niansa
e6da9bd96b Added ggml_vk_mem_used() 2023-06-23 17:57:09 +02:00
niansa
1a68195408 Add mutexes for gpu tensors 2023-06-23 17:46:09 +02:00
niansa
46f577bfc1 h2d tensors during loadup 2023-06-23 17:10:45 +02:00
niansa
98e588c6eb Fix ggml_vk_h2d_tensor throwing on second call 2023-06-23 16:50:37 +02:00
niansa
09b0b3a49b Wait for all threads to finish 2023-06-23 16:13:32 +02:00
niansa
2589cb0c70 Prevent compileSource race 2023-06-23 16:02:49 +02:00
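
A note on the threading fixes above ("Add mutexes for gpu tensors", "Wait for all threads to finish", "Prevent compileSource race"): the common pattern is serializing access to shared backend state. A minimal sketch of the compileSource case, with an invented cache and types rather than the actual Vulkan backend code:

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: serialize shader compilation so concurrent
// callers cannot race on the shared SPIR-V cache.
static std::mutex g_compile_mutex;
static std::unordered_map<std::string, std::vector<uint32_t>> g_spirv_cache;

const std::vector<uint32_t> & compileSource(const std::string & glsl) {
    std::lock_guard<std::mutex> lock(g_compile_mutex); // one compile at a time
    auto it = g_spirv_cache.find(glsl);
    if (it == g_spirv_cache.end()) {
        // compile GLSL -> SPIR-V here; the result is cached so repeated
        // requests for the same source are cheap
        it = g_spirv_cache.emplace(glsl, std::vector<uint32_t>{}).first;
    }
    return it->second;
}
```
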
niansa
5c0d8dd0f2 Specify program output size 2023-06-23 15:58:13 +02:00
niansa
e830264c92 Share sequence to functions and add scale() 2023-06-23 15:10:24 +02:00
niansa
5e9403342b Minor fixes 2023-06-23 15:01:09 +02:00
niansa
b6264542b7 Added vk_mul to ggml_vk_graph_compute 2023-06-23 14:19:31 +02:00
niansa
18d6f7f8da More progress... 2023-06-23 14:08:45 +02:00
niansa
d539247996 Began implementing ggml_graph_compute 2023-06-23 14:03:33 +02:00
niansa
b8a4594f89 More fixes... 2023-06-23 12:19:33 +02:00
niansa
9d643755a6 Fixed compile error 2023-06-23 11:51:25 +02:00
niansa
339bc36cdd Added more functions from Metal 2023-06-23 11:50:30 +02:00
niansa
9cdaea9240 Implemented dequantize_row_q4_1 2023-06-22 16:30:36 +02:00
niansa
b0f11fa9c1 More code cleanups 2023-06-22 16:05:56 +02:00
niansa
3b3d30e4ad Cleanups 2023-06-22 13:57:04 +02:00
niansa
2f3fe0c0a4 Updated gitignore 2023-06-22 13:57:04 +02:00
niansa
4f598dd973 Initial working stuff 2023-06-22 13:57:04 +02:00
Johannes Gäßler
bbca06e269 cmake: revert CUDA arch default to 52, 61 if f16 (#1959) 2023-06-21 23:49:25 +02:00
Rahul Vivek Nair
fb98254f99 Fix typo in README.md (#1961) 2023-06-21 23:48:43 +02:00
Georgi Gerganov
049aa16b8c readme : add link to p1 2023-06-20 19:05:54 +03:00
Xiake Sun
2322ec223a Fix typo (#1949) 2023-06-20 15:42:40 +03:00
Ettore Di Giacinto
aacdbd4056 llama : fix params struct alignment (#1936)
* Workaround struct misalignment during value-copy

Signed-off-by: mudler <mudler@localai.io>

* Move booleans at the bottom of the structure

Signed-off-by: mudler <mudler@localai.io>

* Add comment

Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: mudler <mudler@localai.io>
2023-06-20 04:24:39 +03:00
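
The workaround above moves the booleans to the bottom of the structure. A hedged illustration of why field order matters (the field names here are invented, not the real llama.cpp params struct): interleaving 1-byte bools with 4-byte members forces the compiler to insert padding, and inconsistent assumptions about that layout across a value-copy boundary are exactly the kind of misalignment the commit works around.

```cpp
#include <cstdio>

struct params_interleaved {
    bool  use_mmap;   // 1 byte + 3 bytes padding
    int   n_ctx;      // 4 bytes
    bool  use_mlock;  // 1 byte + 3 bytes padding
    float rope_base;  // 4 bytes
};

struct params_bools_last {
    int   n_ctx;
    float rope_base;
    bool  use_mmap;   // 1-byte members packed together at the end...
    bool  use_mlock;  // ...leaving only 2 bytes of tail padding
};

int main() {
    std::printf("%zu vs %zu\n",
                sizeof(params_interleaved), sizeof(params_bools_last));
}
```

On typical 64-bit ABIs this prints 16 vs 12: the same members, one third less padding.
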
Henri Vasserman
20568fe60f [Fix] Reenable server embedding endpoint (#1937)
* Add back embedding feature

* Update README
2023-06-20 01:12:39 +03:00
Georgi Gerganov
18b35625c3 ggml : fix bug in LBFGS optimizer (found by ggml tests) 2023-06-19 20:43:30 +03:00
l3utterfly
ba4e85a833 llama : use aligned memory during ggml_init call from loading saved sessions (#1934)
* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions

* - removed commented out old code from fix
- updated another instance of same issue below original
2023-06-19 18:20:06 +03:00
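
The fix above ensures the buffer handed to ggml during session loading meets ggml's alignment requirement. A minimal sketch of the idea, assuming a 16-byte requirement (the constant and helper names are illustrative, not the actual patch):

```cpp
#include <cstddef>
#include <new>

constexpr std::size_t GGML_MEM_ALIGN = 16; // assumed for illustration

// Allocate a context buffer with explicit alignment (C++17 aligned new);
// a plain byte-array allocation need not be aligned this strictly.
void * alloc_ctx_buffer(std::size_t size) {
    return ::operator new(size, std::align_val_t{GGML_MEM_ALIGN});
}

void free_ctx_buffer(void * p) {
    // must pair with the matching aligned delete
    ::operator delete(p, std::align_val_t{GGML_MEM_ALIGN});
}
```
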
Georgi Gerganov
23fc5c219a cmake : fix trailing whitespaces 2023-06-19 18:18:34 +03:00
Kawrakow
cb40dfca69 llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)
* Only use Q6_K for output weights if tensor size is multiple of 256

* Fixed copy/paste mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:17:03 +03:00
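
Background for the commit above: k-quants pack weights into 256-element super-blocks, so they only apply when the row length divides evenly by 256. A hedged sketch of the guard (the enum and fallback type are placeholders, not ggml's actual selection logic):

```cpp
#include <cstdint>

enum class qtype { Q6_K, Q8_0 }; // placeholder names for illustration

constexpr int64_t QK_K = 256; // k-quant super-block size

// Use the k-quant type only when 256 divides the row; otherwise fall
// back to a non-k quantization that has no such constraint.
qtype pick_output_quant(int64_t n_per_row) {
    return (n_per_row % QK_K == 0) ? qtype::Q6_K : qtype::Q8_0;
}
```
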
Kawrakow
ca7c3f4da5 cuda : faster k-quants on older GPUs (#1930)
* k_quants: hopefully much faster Q4_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!

* k_quants: hopefully much faster Q3_K on older GPUs

On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!

* k_quants: faster Q2_K on older GPUs

It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I had written in the faster
k-quants PR.

* k_quants: faster Q5_K on older GPUs

68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.

It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19 18:14:09 +03:00
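
For scale, the quoted Q4_K numbers work out to roughly a 1.6x speedup: 65.5 / 41.5 ≈ 1.58, i.e. about 15.3 tok/s rising to 24.1 tok/s on the GTX-1660.
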
Georgi Gerganov
b97ca431db ggml : sync latest ggml repo (#1924)
* ggml : sync latest ggml repo

* ggml : remove unused comments

* ggml : asserts
2023-06-19 18:12:33 +03:00
Howard Su
1e3abfcef0 cmake : fix build shared ggml when CUDA is enabled (#1929)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19 18:10:37 +03:00
Johannes Gäßler
16b9cd1939 Convert vector to f16 for dequantize mul mat vec (#1913)
* Convert vector to f16 for dmmv

* compile option

* Added compilation option description to README

* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-19 10:23:56 +02:00
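
The change above halves the bytes the dequantize-mul-mat-vec kernel streams for the activation vector by converting it to f16 once up front. A simplified CPU-side sketch of that conversion (truncating, with no NaN/denormal handling, unlike the round-to-nearest conversion real code uses; on CUDA this is simply __float2half):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

static uint16_t fp32_to_fp16(float f) {
    uint32_t x; std::memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  e    = int32_t((x >> 23) & 0xffu) - 127 + 15;
    if (e <= 0)  return uint16_t(sign);            // underflow -> signed zero
    if (e >= 31) return uint16_t(sign | 0x7c00u);  // overflow  -> infinity
    return uint16_t(sign | (uint32_t(e) << 10) | ((x >> 13) & 0x3ffu));
}

// Convert the activation vector once; every row of the quantized weight
// matrix then reads half as many bytes for it during the mat-vec product.
std::vector<uint16_t> to_half(const float * y, int n) {
    std::vector<uint16_t> y16(n);
    for (int i = 0; i < n; ++i) y16[i] = fp32_to_fp16(y[i]);
    return y16;
}
```
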
Johannes Gäßler
b24c3049d9 Added tokens per second to info prints (#1928) 2023-06-18 17:41:26 +02:00
Johannes Gäßler
0ede372a51 Fixed incorrectly applying RMS norm twice (#1925) 2023-06-18 16:07:09 +02:00
l3utterfly
8596af4277 ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918) 2023-06-18 14:19:16 +03:00
Mike
e1886cf4fe readme : update Android build instructions (#1922)
Add steps for using Termux on Android devices to prevent common errors.
2023-06-18 11:28:26 +03:00
Kawrakow
8ab8ba62eb llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921)
* Fix examples/metal

* k-quants: prevent usage when tensor size is not divisible by 256

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 11:13:43 +03:00
Kawrakow
90cc59d6ab examples : fix examples/metal (#1920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18 10:52:10 +03:00
Georgi Gerganov
ce2c7d72e2 metal : handle buffers larger than device's maxBufferLength (#1826)
* metal : handle buffers larger than device's maxBufferLength

* metal : print more verbose device info + handle errors

* metal : fix prints for overlapping views

* metal : minimize view overlap to try to utilize device memory better
2023-06-18 09:09:47 +03:00
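
A sketch of the splitting arithmetic behind the first item above (no Metal API calls here): a weights buffer larger than the device's maxBufferLength must be covered by several smaller views. Real code also aligns view boundaries and, per the last item, minimizes how much adjacent views overlap so no tensor straddles two views.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct view { std::size_t offset, len; };

// Cover `size` bytes with views of at most `max_len` bytes each.
std::vector<view> split_buffer(std::size_t size, std::size_t max_len) {
    std::vector<view> views;
    for (std::size_t off = 0; off < size; off += max_len) {
        views.push_back({off, std::min(max_len, size - off)});
    }
    return views;
}
```
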
Howard Su
57cd69460f cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917) 2023-06-18 07:29:47 +03:00
Georgi Gerganov
b2416493ab make : do not print help for simple example 2023-06-17 20:55:03 +03:00
Georgi Gerganov
4f9c43e3bd minor : warning fixes 2023-06-17 20:24:11 +03:00
Johannes Gäßler
2c9380dd2f Only one CUDA stream per device for async compute (#1898) 2023-06-17 19:15:02 +02:00
Georgi Gerganov
051e1b0e6a llama : fix kv_cache n init (close #1903) 2023-06-17 19:31:20 +03:00