Concedo
6d71e100fe
buff buffers
2023-07-24 20:33:17 +08:00
Concedo
825e34baa3
default horde name and better handling for horde (+3 squashed commits)
Squashed commit:
[fadfa60] better idle handling for horde worker
[a3971e6] updated lite
[2ca2b79] seems to not generate rubbish
2023-07-24 18:41:41 +08:00
Concedo
c7136f03d9
added support for tensor_split parameter as an advanced parameter.
2023-07-24 17:16:19 +08:00
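A tensor_split parameter divides each weight tensor's rows across multiple GPUs in proportion to user-supplied fractions. A minimal sketch of that proportional split, with hypothetical names rather than the actual koboldcpp code:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch: given per-GPU split fractions (as passed via a
// tensor_split parameter), compute how many of n_rows each device gets.
std::vector<int> split_rows(const std::vector<float> & tensor_split, int n_rows) {
    float total = 0.0f;
    for (float f : tensor_split) total += f;

    std::vector<int> rows(tensor_split.size(), 0);
    int assigned = 0;
    for (size_t i = 0; i + 1 < tensor_split.size(); ++i) {
        rows[i] = (int)(n_rows * tensor_split[i] / total);
        assigned += rows[i];
    }
    rows.back() = n_rows - assigned; // last device takes the remainder
    return rows;
}

int main() {
    for (int r : split_rows({3.0f, 1.0f}, 100)) printf("%d ", r); // 75 25
    printf("\n");
}
```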
Concedo
66328fcd80
Merge branch 'master' into concedo_experimental
# Conflicts:
# Makefile
2023-07-24 15:44:26 +08:00
Concedo
94499dba25
added support for 70b llama 2
2023-07-24 15:20:18 +08:00
Concedo
993ba3b026
Merge branch 'master' into concedo_experimental
# Conflicts:
# README.md
2023-07-24 11:59:00 +08:00
Evan Jones
84e09a7d8b
llama : add grammar-based sampling ( #1773 )
* llama, main : constrain sampling to grammar
* allow loading grammar from file
* fix whitespace errors
* handle & print parser errors
* add comments to grammar syntax and allow newlines where unambiguous
* add missing include
* support alternates in root rule
* fix bugs with empty token and EOS
* adjust JSON grammar
* remove swp file
* rewrite ternary expressions
Co-authored-by: Henri Vasserman <henv@hot.ee>
* use struct for grammar elements and add Unicode support
* add unicode escapes
* add inverse char ranges
* only sample full tokens (no peeking or truncation)
* llama : minor style changes
blindly applied in online editor - hopefully I didn't break something
* update help text
* add warning message if EOS is disabled
---------
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 23:58:10 -04:00
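The idea behind grammar-based sampling: before picking the next token, zero out every candidate whose text the grammar parser cannot accept from its current state, so only grammar-legal continuations survive. A minimal sketch of that filtering step, with a toy Grammar standing in for the real parser state (the actual llama.cpp API differs):

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy stand-in for the grammar parser state: this one accepts only
// digit strings (think of a "number" rule in a JSON grammar).
struct Grammar {
    bool accepts(const std::string & piece) const {
        for (unsigned char ch : piece) {
            if (!std::isdigit(ch)) return false;
        }
        return !piece.empty();
    }
};

struct Candidate {
    int         id;
    std::string text;  // detokenized piece for this token id
    float       logit;
};

// Mask out tokens the grammar cannot accept. Whole tokens only --
// matching the "no peeking or truncation" bullet above.
void apply_grammar(const Grammar & grammar, std::vector<Candidate> & cands) {
    const float NEG_INF = -1e30f;
    for (Candidate & c : cands) {
        if (!grammar.accepts(c.text)) {
            c.logit = NEG_INF; // ~zero probability after softmax
        }
    }
    // After sampling a surviving token, the caller advances the
    // grammar state with that token's text.
}
```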
Concedo
280abaf029
added stop reason in the perf endpoint
2023-07-24 11:55:35 +08:00
Kawrakow
2f9cf974a0
Some more Q4_K and Q5_K speedup on CUDA ( #2346 )
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Sped up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Sped up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080: 9.8 ms/t -> 9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-24 00:19:47 +03:00
IgnacioFDM
4f06592cc6
Add gqa parameter support to the server ( #2351 )
* Add gqa parameter support to the server
* Change help from stderr to stdout
2023-07-23 23:31:17 +03:00
Johannes Gäßler
70d26ac388
Fix __dp4a documentation ( #2348 )
2023-07-23 17:49:06 +02:00
Concedo
910744e2c0
Merge branch 'master' into concedo_experimental
# Conflicts:
# Makefile
# README.md
# flake.nix
# llama.cpp
2023-07-23 22:37:38 +08:00
Concedo
c28ab4e1b7
update lite, try to support K80
2023-07-23 21:50:35 +08:00
wzy
57921ca6db
common : n_threads == -1 uses std::thread::hardware_concurrency() ( #2347 )
* Fix #2345 , fix incorrect n_threads
* Update examples/common.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 16:33:02 +03:00
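The fix is small: n_threads == -1 now means "ask the standard library for the hardware thread count", with a fallback because std::thread::hardware_concurrency() is allowed to return 0 when the value is unknown. A sketch of the resulting logic (the fallback default here is an assumption):

```cpp
#include <thread>

// Resolve the effective thread count: -1 means "use all hardware threads".
int resolve_n_threads(int n_threads) {
    if (n_threads != -1) {
        return n_threads;
    }
    const unsigned hc = std::thread::hardware_concurrency();
    return hc > 0 ? (int)hc : 4; // hc == 0 means "unknown"; 4 is a guess
}
```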
slaren
3602ac4255
fix n_tasks ( #2342 )
ggml-ci
2023-07-23 15:19:39 +02:00
slaren
95a6c595e7
ggml: move op parameters from tensors to ggml_tensor::op_params ( #2333 )
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops
2023-07-23 14:36:02 +02:00
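Keeping op parameters in a fixed int32 array on the tensor means float parameters must be bit-copied in and out; the "use memcpy for float params" bullet avoids undefined type-punning through pointer casts. An illustrative sketch (field name and size are not the exact ggml layout):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative stand-in for a tensor's fixed-size parameter block.
struct tensor_params {
    int32_t op_params[16];
};

// Store a float (e.g. an alibi bias) into an int32 slot by copying
// its bit pattern -- well-defined, unlike *(int32_t *)&value.
void set_op_param_f32(tensor_params & t, int slot, float value) {
    static_assert(sizeof(float) == sizeof(int32_t), "bit-copy needs equal sizes");
    std::memcpy(&t.op_params[slot], &value, sizeof(float));
}

float get_op_param_f32(const tensor_params & t, int slot) {
    float value;
    std::memcpy(&value, &t.op_params[slot], sizeof(float));
    return value;
}
```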
Georgi Gerganov
e76d630df1
llama : grouped-query attention + LLaMAv2 70B support ( #2276 )
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
---------
Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
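Grouped-query attention uses fewer key/value heads than query heads: each group of n_head / n_head_kv query heads shares one KV head, which is what makes LLaMAv2 70B (64 query heads, 8 KV heads) tractable. A sketch of the index mapping:

```cpp
#include <cassert>

// Map a query head to the KV head it shares under GQA. For LLaMAv2 70B,
// n_head = 64 and n_head_kv = 8, so the group size (gqa) is 8.
int kv_head_for(int q_head, int n_head, int n_head_kv) {
    assert(n_head % n_head_kv == 0);
    const int gqa = n_head / n_head_kv;
    return q_head / gqa; // heads 0..7 -> 0, 8..15 -> 1, ..., 56..63 -> 7
}
```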
maddes8cht
1d0824b247
llama : print help to stdout ( #2338 )
2023-07-23 14:59:48 +03:00
wzy
bc3ec2cdc9
flake : support nix build '.#opencl' ( #2337 )
2023-07-23 14:57:02 +03:00
Christian Demsar
a940458e48
llama : print max tensor size to stderr ( #2336 )
2023-07-23 14:56:34 +03:00
Jose Maldonado
91171b8072
make : fix CLBLAST compile support in FreeBSD ( #2331 )
* Fix Makefile for CLBLAST compile support and instructions for compiling llama.cpp on FreeBSD
* More general use-case for CLBLAST support (Linux and FreeBSD)
2023-07-23 14:52:08 +03:00
AustinMroz
355c80f49e
examples : simplify vim plugin ( #2327 )
Uses builtin json_encode and json_decode functions to simplify escaping
Removes the need for temp files
2023-07-23 14:16:48 +03:00
Jiahao Li
83a00ce69b
metal : support bcast add & dup & cont op ( #2323 )
2023-07-23 14:00:37 +03:00
Concedo
2e84eac7f6
Merge branch 'master' into concedo_experimental
2023-07-23 16:23:00 +08:00
Concedo
aa05eadb6f
Merge branch 'master' into concedo_experimental
# Conflicts:
# llama.cpp
2023-07-23 16:22:44 +08:00
Kawrakow
d2a43664f9
Speed up Q4_K ( #2322 )
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-23 08:49:20 +03:00
Concedo
1108232e30
Merge branch 'concedo' into concedo_experimental
2023-07-23 09:59:58 +08:00
Concedo
0cca0726fe
reduce number of retries, fixed maxlength > maxctx bug
2023-07-23 09:59:34 +08:00
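The maxlength > maxctx bug is a request-validation issue: a client asking for more generated tokens than the context can hold. The fix amounts to a clamp; a one-liner sketch with hypothetical names:

```cpp
#include <algorithm>

// Hypothetical names: never let the requested generation length
// exceed the context size.
int clamp_max_length(int max_length, int max_ctx) {
    return std::min(max_length, max_ctx);
}
```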
Ycros
56995caa48
Fix mirostatv2. ( #338 )
2023-07-23 09:52:03 +08:00
Johannes Gäßler
b9b7d94fc1
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q ( #2313 )
2023-07-22 21:27:34 +02:00
Georgi Gerganov
b47b8a9cfe
llama : optimize memory buffers ( #2325 )
2023-07-22 21:17:57 +03:00
Concedo
fa0270df7c
added some checks to skip generation if busy
2023-07-22 23:10:04 +08:00
Concedo
2807d98fd4
touchup (+2 squashed commits)
Squashed commit:
[8b06458] fixed broken param order
[7eabdc0] very broken, do not use
2023-07-22 22:57:56 +08:00
klosax
b5fe67f8c6
Perplexity: Compute scores correlated to HellaSwag ( #2312 )
* Add parameter --perplexity-lines to perplexity.cpp
2023-07-22 14:21:24 +02:00
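Scoring each line independently (rather than one long sliding window) is what makes the result correlate with per-completion benchmarks like HellaSwag. The per-line arithmetic, assuming token log-probabilities are already computed:

```cpp
#include <cmath>
#include <vector>

// Perplexity of one line: ppl = exp(-mean(log p)) over its tokens,
// computed per line instead of over the whole corpus at once.
double line_perplexity(const std::vector<double> & token_logprobs) {
    if (token_logprobs.empty()) return 1.0;
    double sum = 0.0;
    for (double lp : token_logprobs) sum += lp;
    return std::exp(-sum / (double)token_logprobs.size());
}
```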
whoreson
24baa54ac1
examples : basic VIM plugin
VIM plugin for server exe
2023-07-22 13:34:51 +03:00
Concedo
3aec3038d4
bump scratch buffers
2023-07-22 18:12:18 +08:00
Georgi Gerganov
dd6c67d3cb
ci : fix args
2023-07-22 12:00:56 +03:00
Georgi Gerganov
5d500e8ccf
ci : add 7B CUDA tests ( #2319 )
* ci : add 7B CUDA tests
ggml-ci
* ci : add Q2_K to the tests
* ci : bump CUDA ppl chunks
ggml-ci
* ci : increase CUDA TG len + add --ignore-eos
* ci : reduce CUDA ppl chunks down to 4 to save time
2023-07-22 11:48:22 +03:00
Concedo
52c5856a08
auto populate horde model name
2023-07-22 16:03:12 +08:00
Concedo
dd3f8dabed
updated cluster to horde.koboldai.net
2023-07-22 12:42:40 +08:00
Concedo
236d0e8955
add tip about using other workers
2023-07-22 12:29:22 +08:00
Concedo
701bf0a6cd
reduce sleep time between jobs
2023-07-22 11:56:43 +08:00
Concedo
343ae756fa
Merge branch 'master' into concedo_experimental
# Conflicts:
# .gitignore
# CMakeLists.txt
# Makefile
# README.md
# flake.nix
# ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Concedo
52c98228aa
bugfixes for missing params
2023-07-22 11:37:44 +08:00
Concedo
d7ab6adbc1
embedded horde worker is ready
2023-07-22 11:21:32 +08:00
Richard Roberson
7d5f18468c
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models ( #2311 )
* Resync my fork with new llama.cpp commits
* examples : rename to use dash instead of underscore
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 22:01:10 +03:00
Concedo
75064b4ada
wip on embedded horde worker
2023-07-22 01:30:25 +08:00
Kawrakow
d924522a46
Custom RoPE + better memory management for CUDA ( #2295 )
* Custom RoPE + better memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This seems to be sufficient.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:27:51 +03:00
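Custom RoPE means the rotation frequencies are no longer hard-coded: a base and a scale parameter adjust the standard rotary embedding. A sketch of the per-pair rotation under those two knobs (standard RoPE formula; the parameter names follow common convention, not verified against this commit):

```cpp
#include <cmath>

// Rotate one (even, odd) dimension pair of a query/key vector.
// theta = freq_scale * pos * freq_base^(-2i/dim); freq_base (10000 by
// default) sets the frequency spectrum, freq_scale stretches positions,
// e.g. for extended context windows.
void rope_pair(float & x0, float & x1, int pos, int i, int dim,
               float freq_base, float freq_scale) {
    const float theta = freq_scale * pos * std::pow(freq_base, -2.0f * i / dim);
    const float c = std::cos(theta), s = std::sin(theta);
    const float r0 = x0 * c - x1 * s;
    const float r1 = x0 * s + x1 * c;
    x0 = r0; x1 = r1;
}
```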
Kawrakow
4d76a5f49b
Faster Q3_K implementation on Metal ( #2307 )
* Faster Q3_K on Metal
* Additional Q3_K speedup on Metal
* Q3_K for QK_K = 64
* Better Q3_K for QK_K = 64
21.6 ms/t -> 21.1 ms/t
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:05:30 +03:00
Georgi Gerganov
0db14fef06
ggml : fix the rope fix ( 513f861953 )
2023-07-21 15:16:55 +03:00