commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 16:10:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 16:09:23 2023 +0800
added warning message for unsupported K quants
commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date: Fri Jun 9 04:00:51 2023 -0400
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
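For context, ggml computes GELU with the common tanh approximation rather than the exact erf form, and the new Metal kernel mirrors that. A minimal C sketch of the formula (the function is illustrative, not the kernel itself):

    #include <math.h>
    #include <stdio.h>

    // GELU via the widely used tanh approximation:
    // 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
    static float gelu_approx(float x) {
        const float c = 0.7978845608028654f; // sqrt(2/pi)
        return 0.5f*x*(1.0f + tanhf(c*(x + 0.044715f*x*x*x)));
    }

    int main(void) {
        for (float x = -2.0f; x <= 2.0f; x += 1.0f) {
            printf("gelu(% .1f) = % .6f\n", x, gelu_approx(x));
        }
        return 0;
    }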
commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Fri Jun 9 10:39:59 2023 +0300
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0
Avoid copying into local uchar4 and float4.
* metal : 17% faster Q4_0
Use 64 threads in a thread group.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
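For readers unfamiliar with the format being optimized here: Q4_0 packs 32 weights per block as one scale plus 16 bytes of 4-bit quants. A hedged scalar C sketch of dequantization (ggml stores the scale as fp16 and its exact in-block element ordering has changed across versions; both are simplified here):

    #include <stdint.h>
    #include <stdio.h>

    #define QK4_0 32

    typedef struct {
        float   d;             // scale (fp16 in ggml)
        uint8_t qs[QK4_0/2];   // two 4-bit quants per byte
    } block_q4_0;

    static void dequantize_q4_0(const block_q4_0 *b, float *y) {
        for (int j = 0; j < QK4_0/2; ++j) {
            const int x0 = (b->qs[j] & 0x0F) - 8; // low nibble
            const int x1 = (b->qs[j] >>   4) - 8; // high nibble
            y[j]           = x0 * b->d;
            y[j + QK4_0/2] = x1 * b->d;
        }
    }

    int main(void) {
        block_q4_0 b = { 0.1f, {0} };
        for (int j = 0; j < QK4_0/2; ++j) b.qs[j] = (uint8_t)(j | (j << 4));
        float y[QK4_0];
        dequantize_q4_0(&b, y);
        printf("y[0]=%.2f y[16]=%.2f\n", y[0], y[16]);
        return 0;
    }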
commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 14:53:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 14:38:31 2023 +0800
merged metal patch directly into the file
commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 22:28:21 2023 +0300
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation
27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.
Reusing the access pattern of the Q2_K CUDA implementation
resulted in significantly lower performance (~31 ms/token).
* Fixing merge conflicts
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
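For reference, a sketch of the Q2_K super-block layout roughly as it appeared in k_quants.h around this time (treat the details as an approximation; the authoritative definition is in the source):

    #include <stdint.h>
    #include <stdio.h>

    #define QK_K 256
    typedef uint16_t ggml_fp16_t; // raw fp16 bits; plain typedef for the sketch

    // 2-bit "K" quantization: super-blocks of 256 weights.
    typedef struct {
        uint8_t scales[QK_K/16]; // 4-bit scales and mins for 16 sub-blocks
        uint8_t qs[QK_K/4];      // 2-bit quants, four per byte
        ggml_fp16_t d;           // super-block scale for the quantized scales
        ggml_fp16_t dmin;        // super-block scale for the quantized mins
    } block_q2_K;

    int main(void) {
        // 16 + 64 + 2 + 2 = 84 bytes per 256 weights -> 2.625 bits/weight
        printf("sizeof(block_q2_K) = %zu bytes -> %.3f bpw\n",
               sizeof(block_q2_K), 8.0*sizeof(block_q2_K)/QK_K);
        return 0;
    }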
commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu Jun 8 20:48:14 2023 +0300
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date: Fri Jun 9 00:47:56 2023 +0800
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)
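A note on the intrinsic involved in this change and the revert above: vld4q_s8 loads 64 contiguous bytes into four int8x16_t registers, but it de-interleaves with stride 4, so the element order differs from four consecutive vld1q_s8 loads. That is harmless for order-insensitive reductions but matters for order-sensitive code. A small self-contained arm64 C example (function name illustrative):

    #include <arm_neon.h>
    #include <stdint.h>
    #include <stdio.h>

    // vld4q_s8 DE-INTERLEAVES: v.val[0] receives bytes 0,4,8,...,
    // not bytes 0..15. A plain sum is unaffected by the reordering.
    static int32_t sum64(const int8_t *p) {
        int8x16x4_t v = vld4q_s8(p);
        int16x8_t s = vaddl_s8(vget_low_s8(v.val[0]), vget_high_s8(v.val[0]));
        for (int i = 1; i < 4; ++i) {
            s = vaddq_s16(s, vaddl_s8(vget_low_s8(v.val[i]), vget_high_s8(v.val[i])));
        }
        return vaddvq_s32(vpaddlq_s16(s));
    }

    int main(void) {
        int8_t x[64];
        for (int i = 0; i < 64; ++i) x[i] = (int8_t)(i - 32);
        printf("sum = %d\n", sum64(x)); // expect -32
        return 0;
    }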
commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date: Fri Jun 9 01:47:36 2023 +0900
Metal inference enhancement - hard-wire the relative path of the ggml-model.model file via a patch file, due to the lack of an NSBundle environment
commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 19:46:22 2023 +0300
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, whereas Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
* Metal implementation for Q6_K
Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.
We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.
* clang-tidy : add config back
* Much better Q6_K implementation for metal
28.3 ms / token for 7B. Subtracting the ~9 ms spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
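The 290 GB/s figure is plain arithmetic: model bytes read per token divided by the time spent in the matrix multiplications. The same back-of-the-envelope calculation in C, using only the numbers quoted in the message above:

    #include <stdio.h>

    int main(void) {
        const double model_gb  = 5.5;   // ~5.5 GB Q6_K 7B model
        const double ms_total  = 28.3;  // measured ms/token
        const double ms_other  = 9.0;   // other compute graph operations
        const double ms_matmul = ms_total - ms_other; // ~19 ms
        // each token reads all weights once: GB / s
        printf("~%.0f GB/s\n", model_gb * 1000.0 / ms_matmul);
        return 0;
    }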
commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date: Fri Jun 9 01:24:22 2023 +0900
Metal inference enhancement - hard-wire the relative path of the ggml-model.model file, due to the lack of an NSBundle environment
commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date: Thu Jun 8 15:58:53 2023 +0800
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date: Thu Jun 8 00:12:28 2023 -0700
ggml : fix fprintf warnings (#1720)
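Such fprintf warnings are typically format-specifier mismatches for fixed-width and size types; the portable fix uses the <inttypes.h> macros and %zu. A hedged illustration of that kind of change (not the literal diff from the commit):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    // Specifiers like %lld for int64_t or %lu for size_t trigger -Wformat
    // warnings on some platforms; PRId64 and %zu are portable.
    static void report(int64_t n_tokens, size_t n_bytes) {
        fprintf(stderr, "tokens: %" PRId64 ", bytes: %zu\n", n_tokens, n_bytes);
    }

    int main(void) {
        report(INT64_C(123456789012), sizeof(int64_t));
        return 0;
    }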
commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu Jun 8 10:09:08 2023 +0300
clang-tidy : restore dot file from accidental deletion
commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 10:08:23 2023 +0300
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, whereas Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
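The measurement note above (the first token is slower because the Metal kernels are compiled on first use, so timing is averaged over n = 128 tokens) amounts to a warm-up-aware benchmark loop. A generic C sketch of that pattern; eval_token is a stand-in, not a real llama.cpp function:

    #include <stdio.h>
    #include <time.h>

    // Stand-in for one token of inference; NOT a real llama.cpp function.
    static volatile double sink;
    static void eval_token(void) {
        double x = 0.0;
        for (int i = 0; i < 1000000; ++i) x += i * 1e-9;
        sink = x;
    }

    int main(void) {
        eval_token(); // warm-up: the first call pays one-time setup costs
                      // (for Metal, kernel compilation)
        const int n = 128; // as in the message above
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; ++i) eval_token();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        const double ms = (t1.tv_sec - t0.tv_sec)*1e3
                        + (t1.tv_nsec - t0.tv_nsec)/1e6;
        printf("%.2f ms/token over %d tokens\n", ms/n, n);
        return 0;
    }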
commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date: Thu Jun 8 08:02:48 2023 +0100
k-quants : add missing compile definition to CMakeLists (#1748)
The prompt cache provides a nice speedup when the same prompt
prefix is reused across multiple evaluations, but using it also
updates it, which is not always desirable. One use case is a large
prompt whose first part contains context and usage rules, and whose
second part contains the variable data of the problem being studied.
In this case it is desirable to save the first part once, and to
always reuse it as-is without it being updated with the second part.
The new argument --prompt-cache-ro enables this read-only mode on the
prompt cache. The parts of the prompt that match the cache are loaded
from it, but the cache itself is not modified. This reduced a total
analysis time here from 112s to 49.7s, without having to back up and
restore a copy of the prompt cache, which takes significant time at
500 MB.
Signed-off-by: Willy Tarreau <w@1wt.eu>
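A possible invocation combining the new flag with the existing --prompt-cache option (model and file paths are placeholders):

    ./main -m ./models/7B/ggml-model-q4_0.bin -f prompt.txt \
           --prompt-cache prompt.cache --prompt-cache-ro

Presumably the cache would be populated by a first run without --prompt-cache-ro; subsequent runs add the flag so the saved prefix is reused unchanged.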