Commit graph

775 commits

Author SHA1 Message Date
WangHaoranRobin
58828c209a
Merge pull request #8 from WangHaoranRobin/robin_fork_master
server: fix llama_sample_top_k order
2023-06-26 18:11:48 -07:00
Wang Haoran(Robin)
bc88fece87 server: fix llama_sample_top_k order 2023-06-26 18:11:27 -07:00
WangHaoranRobin
c7f7f13650
Merge branch 'ggerganov:master' into master 2023-06-26 18:08:40 -07:00
Georgi Gerganov
181e8d9755
llama : fix rope usage after ChatGLM change 2023-06-27 00:37:33 +03:00
Georgi Gerganov
d9779021bd
ggml : add support for ChatGLM RoPE 2023-06-27 00:06:51 +03:00
Roman Parykin
d38e451578
readme : add Scala 3 bindings repo (#2010) 2023-06-26 22:47:59 +03:00
David Yang
eaa6ca5a61
ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)
* Clean up compiler warnings in train-text

Some brackets to disambiguate order of operations

* Increase GGML_MAX_NAME

Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
2023-06-26 22:45:32 +03:00
Gustavo Rocha Dias
aa777abbb7
readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007)
* docs - Alternative way to build on Android with CLBlast.

* doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.

* doc - fix typo
2023-06-26 22:34:45 +03:00
WangHaoranRobin
b5c5c8e2b9
Merge branch 'ggerganov:master' into master 2023-06-26 11:56:01 -07:00
Georgi Gerganov
c824d2e368
ggml : avoid conv 2d kernel round up 2023-06-26 21:03:59 +03:00
zrm
b853d45601
ggml : add NUMA support (#1556)
* detect NUMA systems and pin work threads to nodes (linux)

* disable mmap prefetch/readahead for NUMA systems

* avoid sending finalize op to thread pool if it does nothing

* silence robot

* fix args

* make --numa a param

* the recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

* lower synchronization overhead

* statically allocate

* move numa state to g_state

* add description for --numa

* ggml : minor style changes

* ggml : minor style + try fix sanitizer build

* llama : allow to initialize backend with NUMA support

* llama : avoid ggml include in llama-util.h

* ggml : style / formatting

* ggml : fix handling of ops with n_threads > n_tasks > 1

* server : utilize numa parameter

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26 20:57:59 +03:00
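To make the approach in #1556 concrete (detect NUMA systems, pin worker threads to nodes, and avoid prefetch/readahead on mmap'd weights), here is a minimal C sketch assuming Linux and libnuma; the helper names pin_worker_to_node and disable_readahead are hypothetical and are not the actual ggml functions.

```c
// Minimal sketch of the ideas in #1556: pin worker threads to NUMA nodes and
// disable readahead on mmap'd model data. Assumes Linux + libnuma (-lnuma).
// Helper names are hypothetical, not the real ggml implementation.
#include <numa.h>      // numa_available, numa_num_configured_nodes, numa_run_on_node
#include <sys/mman.h>  // madvise, MADV_RANDOM
#include <stdio.h>

// Pin the calling worker thread to a NUMA node chosen round-robin by thread id.
static void pin_worker_to_node(int thread_id) {
    if (numa_available() < 0) {
        return; // not a NUMA system, nothing to do
    }
    const int n_nodes = numa_num_configured_nodes();
    numa_run_on_node(thread_id % n_nodes); // restrict execution to that node
}

// On NUMA systems, aggressive readahead pulls pages onto one node; advising
// random access keeps page placement closer to first-touch by each worker.
static void disable_readahead(void * addr, size_t len) {
    if (numa_available() >= 0 && madvise(addr, len, MADV_RANDOM) != 0) {
        perror("madvise");
    }
}
```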
Georgi Gerganov
9225baef71
k-quants : fix indentation 2023-06-26 20:10:52 +03:00
katsu560
a84ab1da8d
tests : fix quantize perf (#1990)
* fix test quantize perf

* avoid the global state
2023-06-26 19:47:02 +03:00
katsu560
5743ca8092
k-quants : add AVX support to dot functions (#1916)
* k_quants : add AVX support

* k_quants : apply review comments
2023-06-26 19:46:07 +03:00
Georgi Gerganov
412c60e473
readme : add link to new k-quants for visibility 2023-06-26 19:45:09 +03:00
Kawrakow
6769e944c7
k-quants : support for super-block size of 64 (#2001)
* k_quants: WIP super-blocks with 64 weights

* k_quants: WIP super-blocks with 64 weights

Q6_K scalar and AVX2 work

* k_quants: WIP super-blocks with 64 weights

Q4_K scalar and AVX2 work

* k_quants: WIP super-blocks with 64 weights

Q2_K scalar and AVX2 work. The AVX2 version is way too slow (it is actually
slower than the scalar implementation).

* k_quants: WIP super-blocks with 64 weights

Q3_K scalar and AVX2 work.

* k_quants: WIP super-blocks with 64 weights

Q5_K scalar and AVX2 work, and with that all
k_quants are done on AVX2 and scalar.

* k_quants: WIP super-blocks with 64 weights

Q6_K working on CUDA. Cannot make it run quite as fast as
with super-blocks with 256 weights: 8% slower on 4080,
20% slower on the 1660 (but there we fit one less layer on the
GPU because of the larger model size), so some fraction of
those 20% is due to that.

* k_quants: WIP super-blocks with 64 weights

Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.

* k_quants: WIP super-blocks with 64 weights

Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.

* k_quants: WIP super-blocks with 64 weights

Q3_K working on CUDA.

* k_quants: WIP super-blocks with 64 weights

Q5_K working on CUDA, and with this CUDA is done.

* k_quants: WIP super-blocks with 64 weights

Q6_K working on ARM_NEON

* k_quants: WIP super-blocks with 64 weights

Q4_K working on ARM_NEON, but quite a bit slower than 256 weights

* k_quants: WIP super-blocks with 64 weights

Q2_K working on ARM_NEON, but quite a bit slower than 256 weights

* k_quants: WIP super-blocks with 64 weights

Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.

* k_quants: WIP super-blocks with 64 weights

Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.

With that, we have full support for ARM_NEON, although
performance is not quite there.

* k_quants: WIP super-blocks with 64 weights

Slightly more efficient Q3_K and Q5_K

* k_quants: WIP super-blocks with 64 weights

Another small improvement for Q3_K and Q5_K on ARM_NEON

* k_quants: WIP super-blocks with 64 weights

Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.

* k_quants: WIP super-blocks with 64 weights

* We are able to pass preprocessor macros to the Metal
  compiler
* Q6_K works and is actually slightly more efficient than
  the QK_K = 256 version (25.2 ms vs 25.8 ms)

* k_quants: WIP super-blocks with 64 weights

Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).

* k_quants: WIP super-blocks with 64 weights

Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).

* k_quants: WIP super-blocks with 64 weights

Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).

* k_quants: WIP super-blocks with 64 weights

Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).

* k_quants: call them _K, not _k, also on Metal

* k_quants: correctly define QK_K in llama.cpp

* Fixed a bug in q4_K quantization introduced by the 64-block addition

* Simplify via lambda

* k_quants: switch Q3_K to 4-bit scales when QK_K = 64

Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.

* k_quants: switch Q4_K to 4-bit scales when QK_K = 64

Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.

* k_quants: forgot to add the Metal changes in last commit

* k_quants: change Q5_K to be type 0 when QK_K = 64

Still needs AVX2 implementation

* k_quants: AVX2 implementation for new 64-weight Q5_K

* k_quants: 10% faster ARM_NEON Q5_K dot product

* k_quants: fixed issue caused by merging with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26 19:43:07 +03:00
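For orientation, a k-quant super-block packs a group of weights together with per-sub-block scales, and the QK_K = 64 variants above shrink some of those scales to 4 bits. The struct below is a hypothetical, simplified illustration of such a layout; the real block_q4_K definition in k_quants.h differs in detail.

```c
// Hypothetical, simplified QK_K = 64 super-block with 4-bit quants and 4-bit
// packed sub-block scales. Illustrative only; not the actual k_quants.h layout.
#include <stdint.h>

#define QK_K 64

typedef uint16_t ggml_fp16_t;      // half-precision value stored as raw bits

typedef struct {
    ggml_fp16_t d;                 // super-block scale
    ggml_fp16_t dmin;              // super-block minimum
    uint8_t scales[QK_K / 16];     // 4 sub-blocks x (scale, min), 4 bits each -> 4 bytes
    uint8_t qs[QK_K / 2];          // 4-bit quants, two per byte -> 32 bytes
} block_q4_K_64;
```

At 40 bytes per 64 weights this hypothetical layout works out to 5.0 bits per weight, which is the kind of size versus accuracy trade-off the commit log above is measuring.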
Howard Su
cbebf61ca7
Fix assert when freeing an invalid CUDA pointer (#2005)
Fix the assert by always initializing the extra structure.
CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
2023-06-26 23:15:47 +08:00
WangHaoranRobin
77edee7d9b
Merge pull request #7 from WangHaoranRobin/robin_fork_master
server: handle probs output when temp=0; handle final response probs output
2023-06-25 16:33:38 -07:00
WangHaoranRobin
13f5d697ce
Merge branch 'master' into robin_fork_master 2023-06-25 16:33:31 -07:00
Wang Haoran(Robin)
c9e6642cf7 server: handle probs output when temp=0; handle final response probs output 2023-06-25 16:29:34 -07:00
WangHaoranRobin
bd6550bd8b
Merge pull request #6 from WangHaoranRobin/robin_fork_master
server: remove n_probs upper limit of 5
2023-06-25 14:16:35 -07:00
Wang Haoran(Robin)
e815b69579 server: remove n_probs upper limit of 5 2023-06-25 14:15:14 -07:00
WangHaoranRobin
af058cf820
Merge branch 'ggerganov:master' into master 2023-06-25 08:51:59 -07:00
Georgi Gerganov
447ccbe8c3
readme : add new roadmap + manifesto 2023-06-25 16:08:12 +03:00
Georgi Gerganov
bd34cdde38
ggml : sync latest ggml (custom operators) 2023-06-25 14:25:08 +03:00
anon998
c2a08f87b8
fix server sampling: top k sampler first (#1977)
Co-authored-by: anon <anon@example.org>
2023-06-25 10:48:36 +02:00
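The ordering matters because each truncation sampler operates on whatever candidate set the previous one left behind, so top-k has to run before the other filters. A hedged sketch of that ordering, using the llama.cpp sampling calls of this period (llama_sample_top_k, llama_sample_top_p, llama_sample_temperature, llama_sample_token); the exact signatures are an assumption.

```cpp
// Hedged sketch of the sampler ordering these commits enforce: truncate with
// top-k first, then top-p, then apply temperature, then pick a token.
#include "llama.h"

static llama_token sample_next(llama_context * ctx, llama_token_data_array & candidates,
                               int top_k, float top_p, float temp) {
    llama_sample_top_k      (ctx, &candidates, top_k, 1); // keep the k most likely tokens
    llama_sample_top_p      (ctx, &candidates, top_p, 1); // nucleus truncation on what is left
    llama_sample_temperature(ctx, &candidates, temp);     // rescale the remaining logits
    return llama_sample_token(ctx, &candidates);          // sample from the final distribution
}
```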
Georgi Gerganov
66a2555ba6
readme : add Azure CI discussion link 2023-06-25 09:07:03 +03:00
sjinzh
e65ca7e14a
zig : upgrade build system support (#1981)
* upgrade zig build system support

* zig : add new line at the end of the file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-25 08:45:44 +03:00
Robyn
5ec8dd5a3c
#1869 Fix null reference errors when training from scratch with CUDA (#1907)
* #1869 Fix null reference errors when training from scratch with CUDA build

Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.

* ggml : do not dereference src0 if NULL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 20:10:29 +02:00
Georgi Gerganov
65bdd52a86
tests : sync test-grad0 from ggml 2023-06-24 19:40:18 +03:00
WangHaoranRobin
23b516b053
Merge branch 'ggerganov:master' into master 2023-06-24 07:55:23 -07:00
WangHaoranRobin
7f7046ea01
Merge pull request #5 from WangHaoranRobin/robin_fork_master
server: remove trailing whitespace
2023-06-24 07:55:15 -07:00
Wang Haoran(Robin)
02c96a4cbb server: remove trailing whitespace 2023-06-24 07:54:26 -07:00
Rowan Hart
fdd1860911
flake : fix ggml-metal.metal path and run nixfmt (#1974) 2023-06-24 14:07:08 +03:00
AN Long
c943d823c1
convert : fix invalid params in write_vocab_only (#1975) 2023-06-24 14:02:06 +03:00
slaren
f2c754e1c3
ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)
* Improve ggml_graph_dump_dot, add ggml_format_name

* add more automatic names to view ops

* fix name of copies
2023-06-24 13:57:18 +03:00
Georgi Gerganov
11da1a85cd
readme : fix whitespaces 2023-06-24 13:38:18 +03:00
Alberto
235b610d65
readme : fixed termux instructions (#1973) 2023-06-24 13:32:13 +03:00
Alex Renda
b061ba9e2a
llama : fix top-p sampling to match the canonical definition (#1953)
* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p)

* top-p: correct gt to gte

* add test for correct top-p behavior
2023-06-24 13:15:01 +03:00
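The parenthetical in this commit message is the whole algorithm: keep the smallest probability-sorted prefix whose cumulative mass is at least p (hence the "gt to gte" follow-up). A self-contained sketch of that definition, illustrative only and not the actual llama_sample_top_p implementation:

```cpp
// Canonical top-p (nucleus) truncation: sort by probability, keep the smallest
// prefix whose cumulative mass is >= p. Illustrative sketch only.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

std::vector<float> top_p_truncate(std::vector<float> probs, float p) {
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    float cum = 0.0f;
    std::size_t keep = probs.size();
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (cum >= p) {   // ">=" is the gt -> gte correction from the commit
            keep = i + 1; // smallest set with probability mass at least p
            break;
        }
    }
    probs.resize(keep);
    return probs;
}
```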
Didzis Gosko
527b6fba1d
llama : make model stateless and context stateful (llama_state) (#1797)
* llama : make model stateless and context stateful

* llama : minor cleanup

* llama : update internal API declaration

* Apply suggestions from code review

fix style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Missing model memory release

* Fix style

* Add deprecated warning for public API function llama_init_from_file

* Update public API use cases: move away from deprecated llama_init_from_file

* Deprecate public API function llama_apply_lora_from_file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 11:47:58 +03:00
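After this change the weights live in a llama_model that can be shared, while each llama_context holds only per-session state, with llama_init_from_file kept as a deprecated shorthand. A hedged usage sketch of that split; the function names follow the llama.cpp API of this period and may differ in detail, and the model path is a placeholder.

```cpp
// Hedged usage sketch of the model/context split from #1797: load the weights
// once, then create and free contexts independently of the model.
#include "llama.h"
#include <cstdio>

int main() {
    llama_context_params params = llama_context_default_params();

    // placeholder path for illustration
    llama_model * model = llama_load_model_from_file("models/7B/ggml-model.bin", params);
    if (model == nullptr) {
        std::fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Any number of contexts can share the same (now stateless) model.
    llama_context * ctx = llama_new_context_with_model(model, params);

    // ... evaluate and sample with ctx ...

    llama_free(ctx);          // releases per-context state only
    llama_free_model(model);  // weights are released once, here
    return 0;
}
```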
eiery
d7b7484f74
Add OpenLLaMA instructions to the README (#1954)
* add openllama to readme
2023-06-23 10:38:01 +02:00
WangHaoranRobin
6c76c31184
Merge branch 'ggerganov:master' into master 2023-06-22 22:14:41 -07:00
WangHaoranRobin
7cd8fc20d0
Merge pull request #4 from WangHaoranRobin/robin_fork_master
server: fix some beginner mistakes
2023-06-22 22:00:21 -07:00
Wang Haoran(Robin)
7b93b248ef server: fix some beginner mistakes 2023-06-22 21:59:12 -07:00
WangHaoranRobin
bdb710efa2
Merge pull request #3 from WangHaoranRobin/robin_fork_master
server: fix issue for multibyte character generation
2023-06-22 21:36:50 -07:00
Wang Haoran(Robin)
cf76195223 server: fix issue when handling probability output for incomplete tokens during multibyte character generation 2023-06-22 21:35:37 -07:00
WangHaoranRobin
926664c229
Merge pull request #2 from WangHaoranRobin/robin_fork_master
server: fix comment about max n_probs
2023-06-22 09:01:42 -07:00
Wang Haoran(Robin)
ccf254bd44 server: fix comment about max n_probs 2023-06-22 08:57:35 -07:00
Erik Scholz
7487137227
rework convert.py to read hyper-parameters from config.json (#1958)
* Read hyper-parameters from the HuggingFace transformers config.json, if it exists, and otherwise fall back to guessing, as before.
  This allows converting open_llama 3B and other non-standard model designs.
2023-06-22 14:20:47 +02:00
Johannes Gäßler
bbca06e269
cmake: revert CUDA arch default to 52, 61 if f16 (#1959) 2023-06-21 23:49:25 +02:00