This is needed so that operators like ggml_view() can store their
parameters in the ggml context's memory and have them survive, rather
than be discarded, when no_alloc is true (a minimal sketch follows this
commit message).
* Add support for quantizing already quantized models
* Threaded dequantizing and f16 to f32 conversion
* Clean up the thread-block and spares calculation a bit
* Use std::runtime_error exceptions.
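For context, here is a minimal sketch of the no_alloc behaviour the first point above refers to, written against the public ggml C API of that period (ggml_init, ggml_new_tensor_1d, ggml_view_1d, ggml_element_size). The buffer size and offsets are hypothetical; the sketch only illustrates that view parameters have to live in the context's own memory when no tensor data is allocated.

    #include "ggml.h"

    int main(void) {
        // Metadata-only context: tensor *data* is not allocated from this
        // buffer, but tensor/operator metadata (including view parameters)
        // still lives here, which is why it must not be discarded.
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16*1024*1024,   // hypothetical size
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,
        };
        struct ggml_context * ctx = ggml_init(params);

        // A tensor whose data will be provided elsewhere (e.g. mmap'ed weights).
        struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);

        // The view's byte offset is an operator parameter; it is stored in the
        // context's memory, so it is still available even with no_alloc == true.
        struct ggml_tensor * v = ggml_view_1d(ctx, a, 256, 128*ggml_element_size(a));
        (void) v;

        ggml_free(ctx);
        return 0;
    }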
commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 16:10:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 16:09:23 2023 +0800
added warning message for unsupported K quants
commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date: Fri Jun 9 04:00:51 2023 -0400
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Fri Jun 9 10:39:59 2023 +0300
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0
Avoid copying into local uchar4 and float4.
* metal : 17% faster Q4_0
Use 64 threads in a thread group.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 14:53:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 14:38:31 2023 +0800
merged metal patch directly into the file
commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 22:28:21 2023 +0300
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation
27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.
The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).
* Fixing merge conflicts
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu Jun 8 20:48:14 2023 +0300
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date: Fri Jun 9 00:47:56 2023 +0800
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)
commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date: Fri Jun 9 01:47:36 2023 +0900
Metal inference enhancement - hard-wire the relative path of the ggml-model.model file via a patch file, due to the lack of an NSBundle environment
commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 19:46:22 2023 +0300
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
* Metal implementation for Q6_K
Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.
We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.
* clang-tidy : add config back
* Much better Q6_K implementation for metal
28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 ≈ 290 GB/s! A quick arithmetic check
of this figure follows the commit message.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
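As a sanity check on the ~290 GB/s figure above, this tiny stand-alone C snippet just redoes the arithmetic from the commit message (model size divided by the time attributed to the matrix multiplications). The numbers are the ones quoted above; nothing is measured.

    #include <stdio.h>

    int main(void) {
        const double model_gb  = 5.5;   // ~5.5 GB Q6_K 7B model (from the commit message)
        const double matmul_ms = 19.0;  // ~28.3 ms/token minus ~9 ms of other graph ops
        printf("effective bandwidth ~= %.0f GB/s\n", model_gb / (matmul_ms / 1000.0));
        return 0;
    }

This prints roughly 289 GB/s, which matches the ~290 GB/s quoted above after rounding.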
commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date: Fri Jun 9 01:24:22 2023 +0900
Metal inference enhancement - hard-wire the relative path of the ggml-model.model file, due to the lack of an NSBundle environment
commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date: Thu Jun 8 15:58:53 2023 +0800
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date: Thu Jun 8 00:12:28 2023 -0700
ggml : fix fprintf warnings (#1720)
commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu Jun 8 10:09:08 2023 +0300
clang-tidy : restore dot file from accidental deletion
commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 10:08:23 2023 +0300
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date: Thu Jun 8 08:02:48 2023 +0100
k-quants : add missing compile definition to CMakeLists (#1748)