* Rebase to latest
* Show progress
* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The number of buffers in the ggml context was left uninitialized.
This leads to sporadic failures to load the model on
startup. It is actually strange that the failure occurred so
infrequently.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix issue where interactive mode in the main example crashes when input exceeds ctx size
* Ensure the context size is at least 8 tokens in the main example.
Closes #1768
This is needed to make operators like ggml_view() be able to store their
parameters in the ggml context's memory and not get discarded when
no_alloc is true
* Add support for quantizing already quantized models
* Threaded dequantizing and f16 to f32 conversion
* Clean up the thread-block and spare-work calculation a bit
* Use std::runtime_error exceptions.
commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 16:10:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 16:09:23 2023 +0800
added warning message for unsupported K quants
commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date: Fri Jun 9 04:00:51 2023 -0400
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Fri Jun 9 10:39:59 2023 +0300
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0
Avoid copying into local uchar4 and float4.
* metal : 17% faster Q4_0
Use 64 threads in a thread group.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 14:53:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Fri Jun 9 14:38:31 2023 +0800
merged metal patch directly into the file
commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 22:28:21 2023 +0300
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation
27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.
The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).
* Fixing merge conflicts
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu Jun 8 20:48:14 2023 +0300
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date: Fri Jun 9 00:47:56 2023 +0800
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)
commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date: Fri Jun 9 01:47:36 2023 +0900
Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment
commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 19:46:22 2023 +0300
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
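The measurement approach described above (the first token pays a one-time kernel-compilation cost, so per-token time is averaged over a long run of n = 128 rather than a single token) can be sketched as follows; the function name and timing numbers are hypothetical stand-ins, not from the commit:

```python
# Sketch of the timing methodology: skip the jit-compile warm-up token,
# then average per-token latency over the steady-state run.
def amortized_ms_per_token(token_times_ms, skip_first=1):
    """Average per-token latency in ms, skipping warm-up tokens."""
    steady = token_times_ms[skip_first:]
    return sum(steady) / len(steady)

# Fabricated numbers: one slow jit-compiled first token, then steady state.
times = [250.0] + [29.5] * 127   # n = 128 tokens total
print(amortized_ms_per_token(times))  # 29.5 ms/token steady state
```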
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
* Metal implementation for Q6_K
Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.
We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.
* clang-tidy : add config back
* Much better Q6_K implementation for metal
28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!
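The bandwidth arithmetic in the message above (model size divided by time spent in the matrix multiplications) can be reproduced directly; all figures come from the commit text itself:

```python
# Effective memory throughput for the Q6_K matmuls, per the figures above:
# 28.3 ms/token total, ~9 ms spent in other graph ops, ~5.5 GB of weights.
total_ms = 28.3
other_ops_ms = 9.0
model_gb = 5.5

matmul_ms = round(total_ms - other_ops_ms)        # ~19 ms streaming the weights
throughput_gbs = 1000.0 / matmul_ms * model_gb    # ≈ 290 GB/s, as in the message
print(round(throughput_gbs, 1))
```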
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date: Fri Jun 9 01:24:22 2023 +0900
Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment
commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date: Thu Jun 8 15:58:53 2023 +0800
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date: Thu Jun 8 00:12:28 2023 -0700
ggml : fix fprintf warnings (#1720)
commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu Jun 8 10:09:08 2023 +0300
clang-tidy : restore dot file from accidental deletion
commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Thu Jun 8 10:08:23 2023 +0300
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date: Thu Jun 8 08:02:48 2023 +0100
k-quants : add missing compile definition to CMakeLists (#1748)