Commit graph

1203 commits

Author SHA1 Message Date
Concedo
375540837e updated lite 2023-06-10 19:16:29 +08:00
Concedo
a68fcfe738 only start a new thread when using sse 2023-06-10 19:03:41 +08:00
Concedo
43f7e40470 added extra endpoints for abort gen and polled streaming 2023-06-10 18:13:26 +08:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed to make operators like ggml_view() be able to store their
parameters in the ggml context's memory and not get discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
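The rationale above is easiest to see from the ggml allocation API: with no_alloc == true the context reserves no memory for tensor data, yet view operators such as ggml_view() still need context memory to hold their parameters. A minimal sketch, assuming the ggml C API of this period (buffer size and shapes are illustrative only):

    #include "ggml.h"

    int main(void) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 16u*1024u*1024u,  // illustrative scratch size
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,            // forced off so view/opt tensors keep their params
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 128);
        struct ggml_tensor * v = ggml_view_1d(ctx, a, 64, 0); // view params stored in ctx memory
        (void) v;

        ggml_free(ctx);
        return 0;
    }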
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
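For reference, the usual shim for a GCC < 8 toolchain is to rebuild the missing composite intrinsic from an insert/cast pair. A hedged sketch (the macro name is illustrative, not necessarily the one used in #1638):

    #include <immintrin.h>

    // GCC < 8 lacks _mm256_set_m128i / _mm256_setr_m128i, so construct the
    // 256-bit vector from two 128-bit halves with older AVX intrinsics.
    #if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 8
    #define MM256_SET_M128I(hi, lo) \
        _mm256_insertf128_si256(_mm256_castsi128_si256(lo), (hi), 1)
    #else
    #define MM256_SET_M128I(hi, lo) _mm256_set_m128i((hi), (lo))
    #endif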
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
Concedo
5bd9cef9fa merging Proper SSE Token Streaming #220 with end connection fix test 2023-06-09 23:22:16 +08:00
Concedo
b92f9fe3a2 Merge remote-tracking branch 'sammcheese/sammcheese/tokenstreaming' into concedo_experimental 2023-06-09 20:41:02 +08:00
Concedo
507939c135 Merge branch 'master' into concedo_experimental 2023-06-09 20:20:04 +08:00
Concedo
788784179a Merge branch 'concedo' into concedo_experimental 2023-06-09 20:19:56 +08:00
12Boti
e1ab14c4ab
fix format string vulnerability (#223) 2023-06-09 20:16:03 +08:00
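For context, a format string fix of this kind normally means no longer passing untrusted text as the format argument itself. A generic illustration, not the exact code touched by #223:

    #include <stdio.h>

    void log_message(const char * user_text) {
        // Vulnerable: user_text is interpreted as a format string, so %s/%n
        // specifiers inside it can read or corrupt memory.
        // printf(user_text);

        // Fixed: constant format string, untrusted text passed as data.
        printf("%s", user_text);
    }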
Johannes Gäßler
ae9663f188
Windows nvcc workaround (#1753)
Fix gibberish output on Windows when using CUDA
2023-06-09 13:58:15 +02:00
SammCheese
57b0b53b54
fix kobold lite generation 2023-06-09 12:39:35 +02:00
SammCheese
c99ab9df33
Revert "Squashed commit of the following:"
This reverts commit 4f665cd63d.
2023-06-09 12:19:08 +02:00
SammCheese
e6231c3055
back to http.server, improved implementation 2023-06-09 12:17:55 +02:00
Concedo
d28ed99e59 remove unused declarations 2023-06-09 18:01:55 +08:00
SammCheese
4f665cd63d
Squashed commit of the following:
commit b617f2847b
Merge: 73cc5b8 92f44ff
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:10:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 73cc5b88fb
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 16:09:23 2023 +0800

    added warning message for unsupported K quants

commit 92f44ff7f7
Author: AT <manyoso@users.noreply.github.com>
Date:   Fri Jun 9 04:00:51 2023 -0400

    metal : add GELU implementation (#1770)

    Co-authored-by: Adam Treat <adam@nomic.ai>

commit 245fc3c37d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Fri Jun 9 10:39:59 2023 +0300

    metal : faster q4_0 (#1775)

    * metal : 8% faster q4_0

    Avoid copying into local uchar4 and float4.

    * metal : 17% faster Q4_0

    Use 64 threads in a thread group.

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 01dc509038
Merge: 0833845 72ff528
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:53:35 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 0833845268
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jun 9 14:38:31 2023 +0800

    merged metal patch directly into the file

commit 72ff5282bf
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 22:28:21 2023 +0300

    metal : add Q2_K implementation (#1762)

    * metal : add Q2_K implementation

    27.1 ms / token on M2 Max 30-core GPU, so about the
    same speed as Q4_0. Memory throughput is ~156 GB/s.

    The access pattern used in the Q2_K
    CUDA implementation resulted in significantly lower
    performance (~31 ms/token).

    * Fixing merge conflicts

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0bf7cf1b29
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 20:48:14 2023 +0300

    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"

    This reverts commit 8432d4d9f7.

commit 8432d4d9f7
Author: le.chang <cljs118@126.com>
Date:   Fri Jun 9 00:47:56 2023 +0800

    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)

commit 6fa1613f15
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:47:36 2023 +0900

    Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment

commit 0f291e1f65
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 19:46:22 2023 +0300

    metal : Q6_K implementation (#1752)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    * Metal implementation for Q6_K

    Similar to the CUDA implementation.
    No idea if this is the optimum for Metal, but the few
    alternative variants I tried all had a lower performance.

    We get 36.5 ms / token on M2 Max with 30 GPU cores.
    This corresponds to ~200 GB/second throughput.

    * clang-tidy : add config back

    * Much better Q6_K implementation for metal

    28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
    other compute graph operations, we are left with ~19 ms
    for the matrix multiplications. The model is ~5.5 GB,
    so we are getting 1000 / 19 * 5.5 = 290 GB/s!

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 7f181600c7
Author: Hyun-joo KIM <bebopkim@gmail.com>
Date:   Fri Jun 9 01:24:22 2023 +0900

    Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment

commit 8fc8179919
Author: qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
Date:   Thu Jun 8 15:58:53 2023 +0800

    Add llama.cpp docker support for non-latin languages (#1673)

    * Modify Dockerfile default character set to improve compatibility (#1673)

commit b50b570ed9
Author: Steven Roussey <sroussey@gmail.com>
Date:   Thu Jun 8 00:12:28 2023 -0700

    ggml : fix fprintf warnings (#1720)

commit 53aba3f393
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jun 8 10:09:08 2023 +0300

    clang-tidy : restore dot file from accidental deletion

commit 4161bdc04d
Author: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date:   Thu Jun 8 10:08:23 2023 +0300

    metal : add Q4_K implementation (#1733)

    * Metal implementation for Q4_K

    Very slow for now:
    42 ms / token, Q4_0 runs in 28 ms/token on my
    30-core M2 Max GPU.

    * Optimizing Q4_K on metal

    The first token always takes longer, I guess because
    the metal kernel is being jit-compiled.
    So, using n = 128 to measure time.

    At this point Q4_K takes 29.5 ms / token
    compared to 27.2 ms / token for Q4_0.
    Quite a bit better than the initial attempt,
    but still not good enough.

    * Optimizing q4_K metal dot some more

    For n = 256 it is now 28.1 ms/token compared to
    27 ms/token for q4_0.

    * Fix after merge with master

    ---------

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

commit 0035858273
Author: johnson442 <56517414+johnson442@users.noreply.github.com>
Date:   Thu Jun 8 08:02:48 2023 +0100

    k-quants : add missing compile definition to CMakeLists (#1748)
2023-06-09 10:55:07 +02:00
Georgi Gerganov
b33dee282f
metal : fix build "tanhf" -> "tanh" 2023-06-09 11:11:04 +03:00
Concedo
b617f2847b Merge branch 'master' into concedo_experimental 2023-06-09 16:10:35 +08:00
Concedo
73cc5b88fb added warning message for unsupported K quants 2023-06-09 16:09:23 +08:00
AT
92f44ff7f7
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
2023-06-09 11:00:51 +03:00
Kawrakow
245fc3c37d
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0

Avoid copying into local uchar4 and float4.

* metal : 17% faster Q4_0

Use 64 threads in a thread group.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-09 10:39:59 +03:00
Concedo
01dc509038 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/full.Dockerfile
#	.devops/main.Dockerfile
#	CMakeLists.txt
2023-06-09 14:53:35 +08:00
Concedo
0833845268 merged metal patch directly into the file 2023-06-09 14:38:31 +08:00
Kawrakow
72ff5282bf
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation

27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.

The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).

* Fixing merge conflicts

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 22:28:21 +03:00
Georgi Gerganov
0bf7cf1b29
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
2023-06-08 20:48:14 +03:00
le.chang
8432d4d9f7
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) 2023-06-08 19:47:56 +03:00
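For reference, vld4q_s8 is the NEON structured-load intrinsic named in the commit above (reverted by the entry preceding it): it loads 64 consecutive signed bytes and de-interleaves them into an int8x16x4_t, replacing four separate vld1q_s8 loads. A minimal sketch:

    #include <arm_neon.h>

    // Elements 0,4,8,... of the 64-byte block land in val[0], 1,5,9,... in
    // val[1], and so on -- one structured load instead of four plain ones.
    static inline int8x16x4_t load_64_bytes_deinterleaved(const int8_t * p) {
        return vld4q_s8(p);
    }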
Hyun-joo KIM
6fa1613f15
Metal inference enhancement - put hard-wired relative path of ggml-model.model file using a patch file due to lack of NSBundle environment 2023-06-09 01:47:36 +09:00
Kawrakow
0f291e1f65
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

* Metal implementation for Q6_K

Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.

We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.

* clang-tidy : add config back

* Much better Q6_K implementation for metal

28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 19:46:22 +03:00
SammCheese
dee692a63e
compatibility with basic_api, change api path to /extra 2023-06-08 18:34:24 +02:00
SammCheese
b4e9e185d3
fix legacy streaming 2023-06-08 18:34:24 +02:00
SammCheese
9a8da35ec4
working streaming. TODO: fix lite 2023-06-08 18:34:23 +02:00
SammCheese
97971291e9
draft: token streaming 2023-06-08 18:34:08 +02:00
Hyun-joo KIM
7f181600c7
Metal inference enhancement - put hard-wired relative path of ggml-model.model file due to lack of NSBundle environment 2023-06-09 01:24:22 +09:00
Concedo
a6a0fa338a cleanup indentation, fixing cublas build 2023-06-08 22:40:53 +08:00
Concedo
a979e71ddc add obj flags to all output make targets 2023-06-08 16:28:26 +08:00
qingfengfenga
8fc8179919
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
2023-06-08 00:58:53 -07:00
Steven Roussey
b50b570ed9
ggml : fix fprintf warnings (#1720) 2023-06-08 10:12:28 +03:00
Georgi Gerganov
53aba3f393
clang-tidy : restore dot file from accidental deletion 2023-06-08 10:09:08 +03:00
Kawrakow
4161bdc04d
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 10:08:23 +03:00
johnson442
0035858273
k-quants : add missing compile definition to CMakeLists (#1748) 2023-06-08 10:02:48 +03:00
Concedo
6635f7efce updated lite 2023-06-08 00:20:32 +08:00
Concedo
49a6be3d87 add llama metal compile flags as an option 2023-06-07 22:29:38 +08:00
Concedo
7b0707ff26 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-06-07 17:06:56 +08:00
Georgi Gerganov
5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default
2023-06-07 10:59:52 +03:00
Concedo
e78c675a6e Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	flake.lock
#	flake.nix
#	ggml-opencl.cpp
2023-06-07 15:23:29 +08:00
jacobi petrucciani
5b57a5b726
flake : update to support metal on m1/m2 (#1724) 2023-06-07 07:15:31 +03:00