Commit graph

1293 commits

Author SHA1 Message Date
0cc4m
0e3cc8e6f7 Improve code formatting 2023-06-13 16:10:25 +02:00
0cc4m
f1ac03ed37 Shorten switch statements 2023-06-13 15:21:44 +02:00
Concedo
f345347e5c updated lite 2023-06-13 20:44:22 +08:00
Concedo
561ce6a153 Merge remote-tracking branch 'occam/kquant-opencl' into concedo_experimental 2023-06-13 20:27:11 +08:00
Concedo
67559a15f3 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	Makefile
2023-06-13 20:26:51 +08:00
Concedo
871009dfab integrated world tokenizer for RWKV 2023-06-13 20:06:19 +08:00
Kerfuffle
74d4cfa343
Allow "quantizing" to f16 and f32 (#1787)
* Allow "quantizing" to f16 and f32

Fix an issue where quantizing didn't respect LLAMA_NO_K_QUANTS

Add brief help to the list of quantization types in the quantize tool

Ignore case for quantization type arguments in the quantize tool
2023-06-13 04:23:23 -06:00
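For context, a minimal C sketch of the case-insensitive type matching the last bullet describes, with a type list extended to include f16 and f32. The table and function names here are illustrative only, not the quantize tool's actual code:

```c
#include <stdio.h>
#include <strings.h>   // strcasecmp (POSIX)

// Illustrative type table; the real tool maps names to llama_ftype values.
static const char *k_types[] = { "q4_0", "q4_1", "q5_0", "q5_1", "q8_0", "q6_k", "f16", "f32" };

// Case-insensitive lookup, so "F16", "f16" and "Q4_0" all resolve the same way.
int find_quant_type(const char *arg) {
    for (size_t i = 0; i < sizeof(k_types)/sizeof(k_types[0]); i++) {
        if (strcasecmp(arg, k_types[i]) == 0) return (int) i;
    }
    return -1;
}

int main(void) {
    printf("%d %d %d\n", find_quant_type("F16"), find_quant_type("q4_0"), find_quant_type("nope"));
    return 0;
}
```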
0cc4m
2a972f3649 Fix q3_k 2023-06-13 12:05:35 +02:00
0cc4m
fc8c823f34 Fix q2_k, improve code 2023-06-13 12:05:35 +02:00
Concedo
6e20827f93 Added OpenCL DMMV kernels 2023-06-13 12:05:35 +02:00
Concedo
f558e4c297 Finish dequant kernels 2023-06-13 12:05:35 +02:00
Concedo
56151bb875 Replace uchar with uint8_t 2023-06-13 12:05:35 +02:00
0cc4m
a4ee2b89d2 Fix q4_k opencl struct order 2023-06-13 12:05:35 +02:00
Concedo
1506affd0a Added q6_k kernel 2023-06-13 12:05:35 +02:00
0cc4m
44422fd567 Set global and local sizes for kernel calls for dequantizing k-quants 2023-06-13 12:05:35 +02:00
Concedo
9b41865312 Porting q2_k kernel to OpenCL 2023-06-13 12:05:23 +02:00
Concedo
9830871d0f pulled all Occam's fixes and the kquants are all working now 2023-06-13 16:15:13 +08:00
Concedo
9b6c35b651 rwkv speed enhancements (batch processing), fixed a rwkv token processing bug 2023-06-13 16:02:12 +08:00
Kawrakow
74a6d922f1
Metal implementation for all k_quants (#1807)
* metal : improve q4_K

28.3 -> 26.0 ms/token by avoiding a branch in the
calculation of the scales.

* metal : small improvement for Q4_K

* metal : still optimizing Q4_K

This commit pushes it down to 25.3 ms / token.

The crazy idea of using 6 bits for the scales is really costly on
Metal: if I remove the bit fiddling necessary to build the block
scales, the time drops almost to Q4_0's 23 ms/token.

Before pushing the k-quants upstream I had a Q4_K variant that
used 8-bit scales. It wasn't more accurate, used 0.125 bits more per weight,
ran slightly slower on the CPU (due to the larger model size
and being memory bound there), and the difference was entirely
negligible under CUDA. So I decided to publish the version with 6-bit
scales. Perhaps I should reconsider and change to 8-bit scales?

* metal : some more optimizations

Q2_K: 25.4 ms/token
Q6_K: 27.3 ms/token
Q4_0: 22.8 ms/token
Q4_1: 23.1 ms/token

* metal : Q3_K support

Something is not quite right yet.

* metal : Q5_K support

Initial version achieves 31.2 ms/token, 210 GB/s

* metal : still not able to figure out why q3_K does not work

* Minor

* metal : yet another failed attempt to make q3_K work

* metal : optimize Q5_K

31.2 ms -> 27.8 ms.
250 GB/s.

* metal : q3_K still not working

Adding a heavily commented q3_K metal kernel to explain
my obviously faulty logic. Perhaps someone could spot the issue?

* metal : q3_K finally working

Not optimized at all.

What was the issue? The scales are not 4-byte aligned,
and I was accessing them with a uint32_t pointer.
When I tried that on CUDA, I got an error (illegal memory access)
and added a memcpy to a local array of 3 uint32_t's.
But on Metal it told me there is no memcpy, so I tried
accessing directly. There is no error, just garbage results.
At some point I did try accessing the scales with a uint16_t
pointer (the scales are for sure 2-byte aligned), but was
still getting garbage. I guess there must have been another bug.

Now access to scales is via a uint16_t pointer and, after starting
from scratch from the C dequantize function, it finally works.

* metal : Q3_K 1st optimization pass

* metal : Q3_K second optimization pass - 29.6 ms/token

* metal : Q3_K cleanup

* metal : fixed accidentally broken Q2_K

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-12 22:39:21 +03:00
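The alignment pitfall this commit describes is easy to reproduce in plain C. A hedged sketch follows, using a hypothetical block layout rather than ggml's real block_q3_K: reading the 2-byte-aligned scale bytes through a uint32_t pointer is undefined behaviour, while a memcpy into an aligned local array or a uint16_t pointer is safe.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical block layout for illustration (not ggml's actual block_q3_K):
// the 12 scale bytes are only guaranteed to be 2-byte aligned.
typedef struct {
    uint16_t d;          // super-block scale
    uint8_t  scales[12]; // packed 6-bit scales, starts at offset 2
} block_q3_sketch;

// Unsafe: a uint32_t pointer assumes 4-byte alignment. This is undefined
// behaviour -- CUDA trapped on it, Metal silently returned garbage.
uint32_t read_scales_unsafe(const block_q3_sketch *b) {
    return *(const uint32_t *) b->scales;
}

// Safe on CUDA/CPU: memcpy the 12 bytes into a properly aligned local array.
void read_scales_memcpy(const block_q3_sketch *b, uint32_t aux[3]) {
    memcpy(aux, b->scales, 12);
}

// Safe here: the scales are at least 2-byte aligned, so a uint16_t pointer
// works -- this is the approach the final kernel ended up using.
uint16_t read_scale_u16(const block_q3_sketch *b, int i) {
    const uint16_t *p = (const uint16_t *) b->scales;
    return p[i];
}
```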
slaren
e4caa8da59
ci : run when changing only the CUDA sources (#1800) 2023-06-12 20:12:47 +03:00
Concedo
860fb026df rwkv compile fix (+1 squashed commit)
Squashed commits:

[8b0ebb1] upgraded rwkv + added memory overheads + added state_out bufs
2023-06-12 23:04:40 +08:00
Concedo
120851df53 prevent gpu offload if kquant is selected with clblast for now 2023-06-12 21:57:31 +08:00
Concedo
215edf420b Merge branch 'master' into concedo_experimental 2023-06-12 21:53:13 +08:00
Concedo
9c08017051 this patch is a work-in-progress implementation of the k-quants. the dequant kernels are working, but the DMMV ones are not. 2023-06-12 21:47:57 +08:00
Howard Su
58970a4c39
Leverage mmap for offloading tensors to GPU (#1597)
* Rebase to latest

* Show progress

* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-12 14:44:16 +02:00
Kawrakow
8c0a10e64d
metal : fix failure to load model (#1817)
The number of buffers in the ggml context was left uninitialized.
This leads to sporadic failures to load the model on
startup. It is actually strange that the failure occurred so
infrequently.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-12 14:31:36 +03:00
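A minimal sketch of the failure mode, with a hypothetical struct in place of the real Metal context: an uninitialized count holds leftover memory, so loading fails only when that garbage happens to be out of range.

```c
#include <stdlib.h>

// Hypothetical context struct, only to show the failure mode described above.
struct metal_ctx_sketch {
    int   n_buffers;     // the field that was left uninitialized
    void *buffers[16];
};

struct metal_ctx_sketch *ctx_new_buggy(void) {
    struct metal_ctx_sketch *ctx = malloc(sizeof(*ctx));
    // BUG: n_buffers is never set, so it holds whatever bytes malloc returned.
    // Loading only fails when that garbage value is out of range -- which is
    // why the failure looked sporadic rather than consistent.
    return ctx;
}

struct metal_ctx_sketch *ctx_new_fixed(void) {
    // Zero-initializing (calloc, or an explicit n_buffers = 0) removes the
    // dependence on leftover memory contents.
    return calloc(1, sizeof(struct metal_ctx_sketch));
}
```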
Concedo
b9a4da3c6f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	SHA256SUMS
2023-06-11 23:27:28 +08:00
Concedo
c44b9c3ecf added the llama_v2 cuda back (+2 squashed commits)
Squashed commit:

[1c97fd4] Revert "fix for cublas"

This reverts commit 994be9a4db.

[fce03c3] Revert "fix for cublas"

This reverts commit 33528f5b1d.
2023-06-11 23:23:24 +08:00
Kerfuffle
fa84c4b3e8
Fix issue where interactive mode crashes when input exceeds ctx size (#1789)
* Fix issue where interactive mode in the main example crashes when input exceeds ctx size

* Ensure the context size is at least 8 tokens in the main example.

Closes #1768
2023-06-11 08:19:17 -06:00
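A hedged sketch of the kind of guard this fix implies, with hypothetical names: the minimum of 8 tokens comes from the commit message, while the truncation step is an assumption about how the overflow is avoided.

```c
#include <stdio.h>

// Hypothetical helper; not the main example's actual code.
int enforce_ctx_limits(int n_ctx, int *n_prompt_tokens) {
    if (n_ctx < 8) {
        fprintf(stderr, "warning: minimum context size is 8, using 8\n");
        n_ctx = 8;
    }
    // Assumption: leave room for generated tokens and clamp oversized prompts
    // instead of letting them overflow the context.
    const int max_prompt = n_ctx - 4;
    if (*n_prompt_tokens > max_prompt) {
        fprintf(stderr, "warning: prompt truncated from %d to %d tokens\n",
                *n_prompt_tokens, max_prompt);
        *n_prompt_tokens = max_prompt;
    }
    return n_ctx;
}
```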
Kyle Liang
12b063f0ec
Fixed WSL cuda's OOM error (#1594)
* In the function, add the CUDA error bypass.

* remove excessive code and prints

---------

Co-authored-by: liang <liangmanlai@126.com>
2023-06-11 15:20:52 +02:00
Ryan Landay
31d2b5f4a4
Update SHA256SUMS with current hashes for models quantized using q4_0 (#1798) 2023-06-11 12:38:53 +03:00
Georgi Gerganov
4de0334f5c
cmake : fix Metal build (close #1791) 2023-06-10 22:56:53 +03:00
Artyom Lebedev
3f1223155a
k-quants : GCC12 compilation fix (#1792) 2023-06-10 22:51:36 +03:00
Concedo
fb67506c1b Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	flake.nix
#	ggml-metal.m
2023-06-10 23:04:48 +08:00
Andrei
303f5809f1
metal : fix issue with ggml-metal.metal path. Closes #1769 (#1782)
* Fix issue with ggml-metal.metal path

* Add ggml-metal.metal as a resource for llama target

* Update flake.nix metal kernel substitution
2023-06-10 17:47:34 +03:00
Concedo
0c9cd39259 lowered streaming tickrate for greater efficiency 2023-06-10 22:12:01 +08:00
Aisuko
059e99066d
doc : fix wrong address of BLIS.md (#1772)
Signed-off-by: Aisuko <urakiny@gmail.com>
2023-06-10 17:08:11 +03:00
Concedo
b9f74db89e Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
2023-06-10 21:07:20 +08:00
Concedo
fa64971881 encoding 2023-06-10 21:05:35 +08:00
Concedo
66a3f4e421 added support for lora base 2023-06-10 19:29:45 +08:00
Concedo
375540837e updated lite 2023-06-10 19:16:29 +08:00
Concedo
a68fcfe738 only start a new thread when using sse 2023-06-10 19:03:41 +08:00
Concedo
43f7e40470 added extra endpoints for abort gen and polled streaming 2023-06-10 18:13:26 +08:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed so that operators like ggml_view() can store their
parameters in the ggml context's memory and not get discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
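Requantizing as described above reduces to a two-step path per tensor: dequantize the existing quants to an f32 staging buffer, then quantize that buffer to the requested type. A runnable toy sketch of that flow, using made-up block formats rather than ggml's real layouts or threading:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Toy formats for illustration only -- not ggml's real block layouts.
// Dequantize a toy 8-bit block (scale d, int8 quants) back to f32.
void dequant_q8_toy(const int8_t *q, float d, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = d * q[i];
}

// Quantize f32 values to a toy 4-bit format (one value per byte for simplicity).
float quant_q4_toy(const float *x, uint8_t *q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) amax = fmaxf(amax, fabsf(x[i]));
    const float d = amax / 7.0f;                                  // signed range -7..7
    for (int i = 0; i < n; i++)
        q[i] = (uint8_t)(roundf(x[i] / (d > 0 ? d : 1.0f)) + 8);  // bias into 0..15
    return d;
}

int main(void) {
    // Requantizing = step 1: dequantize the existing quants to f32,
    //                step 2: quantize the f32 staging buffer to the new type.
    int8_t  q8[4] = { -100, -3, 50, 127 };
    float   d8    = 0.01f;
    float   f32[4];
    uint8_t q4[4];

    dequant_q8_toy(q8, d8, f32, 4);
    float d4 = quant_q4_toy(f32, q4, 4);

    printf("new scale %.4f, quants %d %d %d %d\n", d4, q4[0], q4[1], q4[2], q4[3]);
    return 0;
}
```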
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
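Workarounds like the one referenced in #1638 typically compose the missing intrinsic from ones that older GCC does provide. A hedged sketch follows; the exact macro in the PR may differ:

```c
#include <immintrin.h>

// _mm256_setr_m128i(lo, hi) should place `lo` in bits 0..127 and `hi` in
// bits 128..255. GCC < 8 does not ship it, but it can be built from
// intrinsics that are available.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 8
#define _mm256_setr_m128i(lo, hi) \
    _mm256_insertf128_si256(_mm256_castsi128_si256((lo)), (hi), 1)
#endif
```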
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
Concedo
5bd9cef9fa merging Proper SSE Token Streaming #220 with end connection fix test 2023-06-09 23:22:16 +08:00