Commit graph

1607 commits

Author SHA1 Message Date
Concedo
993ba3b026 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-07-24 11:59:00 +08:00
Concedo
280abaf029 added stop reason in the perf endpoint 2023-07-24 11:55:35 +08:00
Kawrakow
2f9cf974a0
Some more Q4_K and Q5_K speedup on CUDA (#2346)
* Faster Q5_K on CUDA

* Small Q5_K improvement on older GPUs

* Spped up Q4_K on CUDA

GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t

* Spped up Q4_K on CUDA

GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080:  9.8 ms/t ->  9.5 ms/t

* Address PR comments

* Add some comments to satisfy PR reviewer

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-24 00:19:47 +03:00
IgnacioFDM
4f06592cc6
Add gqa parameter support to the server (#2351)
* Add gqa parameter support to the server
* Change help from stderr to stdout
2023-07-23 23:31:17 +03:00
Johannes Gäßler
70d26ac388
Fix __dp4a documentation (#2348) 2023-07-23 17:49:06 +02:00
Concedo
910744e2c0 Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	flake.nix
#	llama.cpp
2023-07-23 22:37:38 +08:00
Concedo
c28ab4e1b7 update lite, try support k80 2023-07-23 21:50:35 +08:00
wzy
57921ca6db
common : n_threads == -1 uses std:🧵:hardware_concurrency() (#2347)
* Fix #2345, fix incorrect n_threads

* Update examples/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 16:33:02 +03:00
slaren
3602ac4255
fix n_tasks (#2342)
ggml-ci
2023-07-23 15:19:39 +02:00
slaren
95a6c595e7
ggml: move op parameters from tensors to ggml_tensor::op_params (#2333)
* ggml: move op parameters from tensors to ggml_tensor::op_params

* alibi: use memcpy for float params

* remove `src[1] = NULL` in ops
2023-07-23 14:36:02 +02:00
Georgi Gerganov
e76d630df1
llama : grouped-query attention + LLaMAv2 70B support (#2276)
* CUDA: GQA implementation

* llama : support for GQA and LLaMAv2 70B

ggml-ci

* py : fix hparams parsing (if-else blocks)

ggml-ci

* py : oh boy ..

ggml-ci

* help : fix gqa value for 70B

ggml-ci

---------

Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
maddes8cht
1d0824b247
llama : print help to stdout (#2338) 2023-07-23 14:59:48 +03:00
wzy
bc3ec2cdc9
flake : support nix build '.#opencl' (#2337) 2023-07-23 14:57:02 +03:00
Christian Demsar
a940458e48
llama : print max tensor size to stderr (#2336) 2023-07-23 14:56:34 +03:00
Jose Maldonado
91171b8072
make : fix CLBLAST compile support in FreeBSD (#2331)
* Fix Makefile for CLBLAST compile support and instructions for compile llama.cpp FreeBSD

* More general use-case for CLBLAST support (Linux and FreeBSD)
2023-07-23 14:52:08 +03:00
AustinMroz
355c80f49e
examples : simplify vim plugin (#2327)
Uses builtin json_encode and json_decode functions to simplify escaping
Removes the need for temp files
2023-07-23 14:16:48 +03:00
Jiahao Li
83a00ce69b
metal : support bcast add & dup & cont op (#2323) 2023-07-23 14:00:37 +03:00
Concedo
2e84eac7f6 Merge branch 'master' into concedo_experimental 2023-07-23 16:23:00 +08:00
Concedo
aa05eadb6f Merge branch 'master' into concedo_experimental
# Conflicts:
#	llama.cpp
2023-07-23 16:22:44 +08:00
Kawrakow
d2a43664f9
Speed up Q4_K (#2322)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-23 08:49:20 +03:00
Concedo
1108232e30 Merge branch 'concedo' into concedo_experimental 2023-07-23 09:59:58 +08:00
Concedo
0cca0726fe reduce number of retries, fixed maxlength > maxctx bug 2023-07-23 09:59:34 +08:00
Ycros
56995caa48
Fix mirostatv2. (#338) 2023-07-23 09:52:03 +08:00
Johannes Gäßler
b9b7d94fc1
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313) 2023-07-22 21:27:34 +02:00
Georgi Gerganov
b47b8a9cfe
llama : optimize memory buffers (#2325) 2023-07-22 21:17:57 +03:00
Concedo
fa0270df7c added some checks to skip generation if busy 2023-07-22 23:10:04 +08:00
Concedo
2807d98fd4 touchup (+2 squashed commit)
Squashed commit:

[8b06458] fixed broken param order

[7eabdc0] very broken, do not use
2023-07-22 22:57:56 +08:00
klosax
b5fe67f8c6
Perplexity: Compute scores correlated to HellaSwag (#2312)
* Add parameter --perplexity-lines to perplexity.cpp
2023-07-22 14:21:24 +02:00
whoreson
24baa54ac1
examples : basic VIM plugin
VIM plugin for server exe
2023-07-22 13:34:51 +03:00
Concedo
3aec3038d4 bump scratch buffers 2023-07-22 18:12:18 +08:00
Georgi Gerganov
dd6c67d3cb
ci : fix args 2023-07-22 12:00:56 +03:00
Georgi Gerganov
5d500e8ccf
ci : add 7B CUDA tests (#2319)
* ci : add 7B CUDA tests

ggml-ci

* ci : add Q2_K to the tests

* ci : bump CUDA ppl chunks

ggml-ci

* ci : increase CUDA TG len + add --ignore-eos

* ci : reduce CUDA ppl cunks down to 4 to save time
2023-07-22 11:48:22 +03:00
Concedo
52c5856a08 auto populate horde model name 2023-07-22 16:03:12 +08:00
Concedo
dd3f8dabed updated cluster to horde.koboldai.net 2023-07-22 12:42:40 +08:00
Concedo
236d0e8955 add tip about using other workers 2023-07-22 12:29:22 +08:00
Concedo
701bf0a6cd reduce sleep time between jobs 2023-07-22 11:56:43 +08:00
Concedo
343ae756fa Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
#	ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Concedo
52c98228aa bugfixes for missing params 2023-07-22 11:37:44 +08:00
Concedo
d7ab6adbc1 embedded horde worker is ready 2023-07-22 11:21:32 +08:00
Richard Roberson
7d5f18468c
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311)
* Resync my fork with new llama.cpp commits

* examples : rename to use dash instead of underscore

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 22:01:10 +03:00
Concedo
75064b4ada wip on embedded horde worker 2023-07-22 01:30:25 +08:00
Kawrakow
d924522a46
Custom RoPE + bettter memory management for CUDA (#2295)
* Custom RoPE + bettter memory management for CUDA

* Adjusted look ahead in ggml_cuda_pool_malloc to 5%

This is sufficient it seems.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:27:51 +03:00
Kawrakow
4d76a5f49b
Faster Q3_K implementation on Metal (#2307)
* Faster Q3_K on Metal

* Additional Q3_K speedup on Metal

* Q3_K for QK_K = 64

* Better Q3_K for QK_K = 64

21.6 ms/t -> 21.1 ms/t

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:05:30 +03:00
Georgi Gerganov
0db14fef06
ggml : fix the rope fix (513f861953) 2023-07-21 15:16:55 +03:00
Ikko Eltociear Ashimine
03e566977b
examples : fix typo in minigpt4.py (#2298)
promt -> prompt
2023-07-21 14:53:07 +03:00
Georgi Gerganov
513f861953
ggml : fix rope args order + assert (#2054) 2023-07-21 14:51:34 +03:00
Georgi Gerganov
3973b25a64
gitignore : fix final newline 2023-07-21 14:42:41 +03:00
Guillaume "Vermeille" Sanchez
ab0e26bdfb
llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) 2023-07-21 13:58:36 +03:00
Jose Maldonado
73643f5fb1
gitignore : changes for Poetry users + chat examples (#2284)
A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thx TheBloke for models)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:53:27 +03:00
Georgi Gerganov
a814d04f81
make : fix indentation 2023-07-21 13:50:55 +03:00