Commit graph

794 commits

Henri Vasserman
f344d090f7
streaming shell script 2023-06-12 22:49:08 +03:00
Henri Vasserman
429ed950af
move CPPHTTPLIB settings inside server
Since they aren't configurable and were missing from the Makefile.
2023-06-12 20:46:53 +03:00
Henri Vasserman
28694f7ac9
add a simple bash script too 2023-06-12 19:53:13 +03:00
Henri Vasserman
fc4264d14a
api url 2023-06-12 18:43:40 +03:00
Henri Vasserman
1510337901
fix make flags propagation 2023-06-12 18:34:12 +03:00
Henri Vasserman
b91200a2e5
javascript chat update. 2023-06-12 18:34:01 +03:00
Henri Vasserman
13cf6929b7
more json changes and stop info 2023-06-12 17:46:16 +03:00
Henri Vasserman
dff11a14d2
json parsing improvements 2023-06-12 16:52:21 +03:00
Henri Vasserman
4148b9bd03
remove void 2023-06-12 10:28:17 +03:00
Randall Fitzgerald
eee8b28d36
Merge pull request #20 from SlyEcho/server_refactor
Logging changes
2023-06-11 15:17:46 -04:00
Henri Vasserman
6518f9c482
build settings 2023-06-11 16:32:53 +03:00
Henri Vasserman
9612d12fbf
big logging update 2023-06-11 16:18:39 +03:00
Henri Vasserman
2c00bf855d
more formatting changes 2023-06-11 14:01:42 +03:00
Randall Fitzgerald
bac0ddb58f
Merge branch 'ggerganov:master' into master 2023-06-10 06:11:31 -04:00
Georgi Gerganov
17c10acfb4
ggml : force no_alloc == false when creating opt tensors (close #1699)
This is needed to make operators like ggml_view() be able to store their
parameters in the ggml context's memory and not get discarded when
no_alloc is true
2023-06-10 12:08:15 +03:00
Kawrakow
e9b66ee982
metal : add Q4_1 implementation (#1785)
23.3 ms / token, so just ~1% slower than q4_0.
Achieves 290 GB/s memory throughput.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-10 11:28:11 +03:00
Kerfuffle
4f0154b0ba
llama : support requantizing models instead of only allowing quantization from 16/32bit (#1691)
* Add support for quantizing already quantized models

* Threaded dequantizing and f16 to f32 conversion

* Clean up thread blocks with spares calculation a bit

* Use std::runtime_error exceptions.
2023-06-10 10:59:17 +03:00
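A rough usage sketch of the requantization support described above; the --allow-requantize flag name is an assumption based on the PR, not quoted from the commit message, and the file names are placeholders:

```sh
# Hypothetical: requantize an already-quantized Q8_0 model down to Q4_0.
# Before this change, quantize only accepted f16/f32 models as input.
./quantize --allow-requantize models/7B/ggml-model-q8_0.bin models/7B/ggml-model-q4_0.bin q4_0
```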
Xingchen Song(宋星辰)
ef3171d162
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 (#1638) 2023-06-10 10:49:40 +03:00
rankaiyx
555275a693
make : add SSSE3 compilation use case (#1659) 2023-06-10 09:41:59 +03:00
Randall Fitzgerald
d6d263fc4f
Merge pull request #19 from lesaun/master
Clarify build instructions in README.
2023-06-09 23:11:02 -04:00
Lesaun Harvey
917540ce43
Clarify build instructions in README. 2023-06-09 19:06:09 -07:00
Randall Fitzgerald
1a9141b6c3
Remove model assign in main(). Clarified stop in README.
The model will now load the default from gptparams ("models/7B/ggml-model.bin")
2023-06-09 16:29:10 -04:00
Robert Sung-wook Shin
98ed165574
OpenCL: Add release memory (#1741)
* Add opencl release memory

* Rename function name
2023-06-09 18:24:40 +02:00
Johannes Gäßler
ae9663f188
Windows nvcc workaround (#1753)
Fix gibberish output on Windows when using CUDA
2023-06-09 13:58:15 +02:00
Randall Fitzgerald
7cdeb08483
More formatting cleanup 2023-06-09 05:12:16 -04:00
Randall Fitzgerald
889d9044bf
Merge branch 'master' of https://github.com/digiwombat/llama.cpp 2023-06-09 04:57:21 -04:00
Randall Fitzgerald
7580427837
Resolving some review comments 2023-06-09 04:56:31 -04:00
Randall Fitzgerald
23a1b1841e
Merge branch 'ggerganov:master' into master 2023-06-09 04:51:20 -04:00
Randall Fitzgerald
cc2b33649d
Missed a pair of catch statements for formatting. 2023-06-09 04:50:31 -04:00
Randall Fitzgerald
a9c34779f6
Spaces to 4 and other code style cleanup. Notes in README. 2023-06-09 04:47:18 -04:00
Georgi Gerganov
b33dee282f
metal : fix build "tanhf" -> "tanh" 2023-06-09 11:11:04 +03:00
AT
92f44ff7f7
metal : add GELU implementation (#1770)
Co-authored-by: Adam Treat <adam@nomic.ai>
2023-06-09 11:00:51 +03:00
Kawrakow
245fc3c37d
metal : faster q4_0 (#1775)
* metal : 8% faster q4_0

Avoid copying into local uchar4 and float4.

* metal : 17% faster Q4_0

Use 64 threads in a thread group.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-09 10:39:59 +03:00
Kawrakow
72ff5282bf
metal : add Q2_K implementation (#1762)
* metal : add Q2_K implementation

27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.

The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).

* Fixing merge conflicts

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 22:28:21 +03:00
Henri Vasserman
ccd85e0a6b
Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-08 22:17:46 +03:00
Henri Vasserman
61befcba7b
Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-08 22:14:43 +03:00
Georgi Gerganov
0bf7cf1b29
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738)"
This reverts commit 8432d4d9f7.
2023-06-08 20:48:14 +03:00
le.chang
8432d4d9f7
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (#1738) 2023-06-08 19:47:56 +03:00
Kawrakow
0f291e1f65
metal : Q6_K implementation (#1752)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

* Metal implementation for Q6_K

Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had a lower performance.

We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.

* clang-tidy : add config back

* Much better Q6_K implementation for metal

28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 19:46:22 +03:00
qingfengfenga
8fc8179919
Add llama.cpp docker support for non-latin languages (#1673)
* Modify Dockerfile default character set to improve compatibility (#1673)
2023-06-08 00:58:53 -07:00
Steven Roussey
b50b570ed9
ggml : fix fprintf warnings (#1720) 2023-06-08 10:12:28 +03:00
Georgi Gerganov
53aba3f393
clang-tidy : restore dot file from accidental deletion 2023-06-08 10:09:08 +03:00
Kawrakow
4161bdc04d
metal : add Q4_K implementation (#1733)
* Metal implementation for Q4_K

Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.

* Optimizing Q4_K on metal

The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.

At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.

* Optimizing q4_K metal dot some more

For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.

* Fix after merge with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-08 10:08:23 +03:00
johnson442
0035858273
k-quants : add missing compile definition to CMakeLists (#1748) 2023-06-08 10:02:48 +03:00
Randall Fitzgerald
64a06536cb
Merge remote-tracking branch 'upstream/master'
# Resolved Conflicts:
#	examples/server/README.md
#	examples/server/server.cpp
2023-06-07 12:23:49 -04:00
Georgi Gerganov
5c64a0952e
k-quants : allow to optionally disable at compile time (#1734)
* k-quants : put behind optional compile flag LLAMA_K_QUANTS

* build : enable k-quants by default
2023-06-07 10:59:52 +03:00
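Since LLAMA_K_QUANTS is now an optional compile flag, it can be toggled at configure time. A minimal sketch using CMake (the equivalent Makefile variable is not shown here and may differ):

```sh
# Configure a build with k-quants disabled (they are enabled by default after this change).
mkdir -p build && cd build
cmake .. -DLLAMA_K_QUANTS=OFF
cmake --build . --config Release
```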
jacobi petrucciani
5b57a5b726
flake : update to support metal on m1/m2 (#1724) 2023-06-07 07:15:31 +03:00
Georgi Gerganov
4dc62c545d
readme : add June roadmap 2023-06-07 07:15:08 +03:00
Willy Tarreau
35a84916fb
main: add the possibility to open the prompt cache read-only (#1640)
The prompt cache constitutes a nice speed up when using the same prompt
prefix across multiple evaluations, but when using it, it will also be
updated, which is not always desirable. One use case is to have a large
prompt containing some context and usage rules, and a second part
containing variable data of the problem being studied. In this case it's
desirable to be able to save the first part once, and to always reuse it
as-is without updating it with the second part.

The new argument --prompt-cache-ro enables this read-only mode on the
prompt cache. The prompt's contents that match the cache are loaded
from the cache but the rest is not modified. This made it possible to reduce
the total analysis time from 112s to 49.7s here, without having to back up
and restore a copy of the prompt, which takes significant time at 500 MB.

Signed-off-by: Willy Tarreau <w@1wt.eu>
2023-06-06 22:10:17 -04:00
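A sketch of how the --prompt-cache-ro flag described above might be combined with --prompt-cache; the file names are placeholders, and -m / -f are the usual main options for the model and prompt file:

```sh
# First run: evaluate the shared prefix once and store it in the cache.
./main -m models/7B/ggml-model.bin --prompt-cache prefix.cache -f shared_prefix.txt

# Subsequent runs: load the matching prefix from the cache in read-only mode,
# so the cache file is not rewritten even though the prompt adds variable data.
./main -m models/7B/ggml-model.bin --prompt-cache prefix.cache --prompt-cache-ro -f full_prompt.txt
```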
Georgi Gerganov
2d7bf110ed
llama : fix vram_scratch var 2023-06-06 22:54:39 +03:00