Commit graph

3964 commits

Author / SHA1 / Message / Date
MaggotHATE
28d2cff729 Merge branch 'master' of https://github.com/MaggotHATE/llama.cpp-xtc 2024-10-15 09:46:14 +05:00
MaggotHATE
2be814aa69 Fixed tests and outdated README 2024-10-15 09:46:04 +05:00
MaggotHATE
17ad143ead Merge branch 'ggerganov:master' into master 2024-10-14 18:36:52 +05:00
MaggotHATE
3613a6d27b Renamed random distribution 2024-10-14 18:36:03 +05:00
MaggotHATE
436a9919e3 Simplified algorithm since threshold_max is removed 2024-10-14 16:10:13 +05:00
VoidIsVoid
a89f75e1b7 server : handle "logprobs" field with false value (#9871)
Co-authored-by: Gimling <huangjl@ruyi.ai>
2024-10-14 10:04:36 +03:00
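
A minimal sketch of the behavior this fix implies, assuming the server's nlohmann::json request parsing: a JSON `"logprobs": false` should act like an absent field instead of being rejected or misread. The `parse_logprobs` helper is hypothetical, not the server's actual function.

```cpp
// Minimal sketch, assuming nlohmann::json request parsing; parse_logprobs
// is a hypothetical helper, not the actual server code.
#include <nlohmann/json.hpp>
#include <cstdint>

using json = nlohmann::json;

static int32_t parse_logprobs(const json & body) {
    if (!body.contains("logprobs") || body["logprobs"].is_null()) {
        return 0; // absent or null: logprobs disabled
    }
    if (body["logprobs"].is_boolean()) {
        // the fix: false is a valid way of saying "no logprobs"
        return body["logprobs"].get<bool>() ? 1 : 0;
    }
    return body["logprobs"].get<int32_t>(); // numeric count, as before
}
```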
MaggotHATE
dfef2c4c37 Merge branch 'ggerganov:master' into master 2024-10-14 11:44:50 +05:00
MaggotHATE
a3e652296a Merge branch 'master' of https://github.com/MaggotHATE/llama.cpp-xtc 2024-10-14 11:44:00 +05:00
MaggotHATE
44bbd6337a Quick fixes per review comments 2024-10-14 11:43:45 +05:00
agray3
13dca2a54a Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-10-14 02:49:08 +02:00
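
The speedup above comes from issuing wider load instructions. An illustrative CUDA sketch of the general technique (not the actual dmmv.cu diff): one `half2` read replaces two scalar f16 loads with a single 32-bit transaction, which helps memory-bound kernels most on HBM parts.

```cuda
// Sketch of vectorized f16 loads, not the dmmv.cu change itself.
// Assumes n is even and x is 4-byte aligned (i is always even here).
#include <cuda_fp16.h>

__global__ void scale_f16(const half * x, half * y, int n, float scale) {
    const int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 >= n) return;

    // vectorized load: two f16 values in one 32-bit transaction
    const half2 v = *reinterpret_cast<const half2 *>(x + i);

    y[i + 0] = __float2half(__half2float(__low2half(v))  * scale);
    y[i + 1] = __float2half(__half2float(__high2half(v)) * scale);
}
```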
Georgi Gerganov
d4c19c0f5c server : accept extra_context for the infill endpoint (#9874)
* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci
2024-10-13 21:31:35 +03:00
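
A hedged sketch of what a request carrying extra context chunks might look like. The field names "extra_context", "filename", and "text" are assumptions inferred from the commit title, not a verified schema; the body is built with nlohmann::json, which the server examples use.

```cpp
// Hypothetical /infill request body; field names are assumptions.
#include <nlohmann/json.hpp>

using json = nlohmann::json;

json make_infill_request() {
    return {
        {"input_prefix", "int main() {\n    "},
        {"input_suffix", "\n    return 0;\n}\n"},
        {"extra_context", json::array({
            {{"filename", "utils.h"},
             {"text",     "int add(int a, int b);\n"}}
        })}
    };
}
```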
Georgi Gerganov
c7181bd294 server : reuse cached context chunks (#9866)
ggml-ci
2024-10-13 18:52:48 +03:00
MaggotHATE
ea62e65fe9 Merge branch 'ggerganov:master' into master 2024-10-13 13:45:40 +05:00
Georgi Gerganov
92be9f1216 flake.lock: Update (#9870)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6++BNjsb149fGZd1T4+KBg=' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg+XZeHgxW5hQA9fIKHsKCdOIUycTryeVw=' (2024-10-09)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-12 20:11:26 -07:00
MaggotHATE
cca842fbd3 Fixed arg after update 2024-10-12 18:46:13 +05:00
MaggotHATE
ea85a51af1 Merge branch 'ggerganov:master' into master 2024-10-12 18:38:06 +05:00
MaggotHATE
68557eb7a0 Merge branch 'master' of https://github.com/MaggotHATE/llama.cpp-xtc 2024-10-12 18:36:14 +05:00
MaggotHATE
9c43a01c5d Removed xtc_threshold_max 2024-10-12 18:35:56 +05:00
Georgi Gerganov
edc265661c server : add option to time limit the generation phase (#9865)
ggml-ci
2024-10-12 16:14:27 +03:00
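
Conceptually, time-limiting the generation phase means checking a wall-clock budget inside the decode loop. A generic sketch of the idea (not the server's code; the request parameter the commit introduces is not reproduced here):

```cpp
// Generic time-budget check inside a decode loop; decode_one_token is a
// hypothetical stand-in for producing one token (true = not EOS).
#include <chrono>
#include <cstdint>
#include <cstdio>

static bool decode_one_token() { return true; }

void generate_with_budget(int n_predict, int64_t budget_ms) {
    const auto t_start = std::chrono::steady_clock::now();
    for (int n = 0; n < n_predict; ++n) {
        if (!decode_one_token()) break; // EOS
        const int64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - t_start).count();
        if (budget_ms > 0 && elapsed_ms >= budget_ms) {
            std::printf("generation stopped after %lld ms\n", (long long) elapsed_ms);
            break; // time budget exhausted
        }
    }
}
```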
Georgi Gerganov
1bde94dd02 server : remove self-extend features (#9860)
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
2024-10-12 16:06:31 +03:00
Georgi Gerganov
95c76e8e92 server : remove legacy system_prompt feature (#9857)
* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props
2024-10-12 14:51:54 +03:00
Georgi Gerganov
11ac9800af llama : improve infill support and special token detection (#9798)
* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (#9800)

* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00
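
For background: infill builds on the standard fill-in-the-middle (FIM) prompt layout, with prefix and suffix sentinel tokens around the surrounding code and a middle sentinel marking where the model generates. A self-contained sketch, with the sentinel token IDs passed in rather than read from a model vocabulary:

```cpp
// Standard FIM prompt layout; sentinel token IDs are supplied by the caller.
#include <cstdint>
#include <vector>

typedef int32_t llama_token;

std::vector<llama_token> build_fim_prompt(
        const std::vector<llama_token> & prefix,
        const std::vector<llama_token> & suffix,
        llama_token tok_pre, llama_token tok_suf, llama_token tok_mid) {
    std::vector<llama_token> out;
    out.push_back(tok_pre);                              // <FIM_PRE>
    out.insert(out.end(), prefix.begin(), prefix.end()); // code before the cursor
    out.push_back(tok_suf);                              // <FIM_SUF>
    out.insert(out.end(), suffix.begin(), suffix.end()); // code after the cursor
    out.push_back(tok_mid);                              // <FIM_MID>: model fills in here
    return out;
}
```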
R0CKSTAR
943d20b411 musa : update doc (#9856)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-12 08:09:53 +03:00
MaggotHATE
dfe587a5f3 Merge branch 'ggerganov:master' into master 2024-10-12 00:41:34 +05:00
Diego Devesa
96776405a1 ggml : move more prints to the ggml log system (#9839)
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
2024-10-11 15:34:45 +02:00
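
Why this matters downstream: once prints go through the ggml log system, an application can filter them with a single callback. A sketch using llama_log_set from llama.h; treat the exact enum spellings as subject to the revision in use.

```cpp
// Install a log callback that surfaces only warnings and errors.
#include "llama.h"
#include <cstdio>

static void my_log(enum ggml_log_level level, const char * text, void * /*user_data*/) {
    if (level == GGML_LOG_LEVEL_ERROR || level == GGML_LOG_LEVEL_WARN) {
        std::fputs(text, stderr);
    }
}

int main() {
    llama_log_set(my_log, nullptr);
    // ... normal usage; BLAS/OpenMP debug warnings now flow through my_log ...
}
```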
MaggotHATE
acada1a5e7 Made algorithm safer and more readable 2024-10-11 15:36:25 +05:00
MaggotHATE
3968369071 Fixed labels in old server UI 2024-10-11 11:53:19 +05:00
MaggotHATE
882a603bda Merge branch 'master' into master 2024-10-11 11:26:05 +05:00
Diego Devesa
7eee341bee common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
Diego Devesa
0e9f760eb1 rpc : add backend registry / device interfaces (#9812)
* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
2024-10-10 20:14:55 +02:00
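
A hedged sketch of how these pieces fit together. llama_supports_rpc is named by the commit itself; the exact ggml_backend_rpc_start_server parameters are not shown here, so that call is left as a comment rather than guessed.

```cpp
// Probe RPC availability before trying to host or attach to a remote backend.
#include "llama.h"
#include <cstdio>

int main() {
    if (!llama_supports_rpc()) {
        std::fprintf(stderr, "this build was compiled without the RPC backend\n");
        return 1;
    }
    // A host process would create a local backend and expose it over the
    // network, roughly: ggml_backend_rpc_start_server(backend, "0.0.0.0:50052", ...);
    return 0;
}
```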
R0CKSTAR
cf8e0a3bb9 musa: add docker image support (#9685)
* mtgpu: add docker image support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable docker workflow

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-10 20:10:37 +02:00
MaggotHATE
72db625bd4 Added XTC to server UIs 2024-10-10 22:59:23 +05:00
Diego Devesa
c7499c557c examples : do not use common library in simple example (#9803)
* examples : do not use common library in simple example

* add command line parser, simplify code
2024-10-10 19:50:49 +02:00
MaggotHATE
f7a383ffb3 Initial server support 2024-10-10 21:48:49 +05:00
MaggotHATE
2107882cf5 Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
2024-10-10 19:35:28 +05:00
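
A hedged sketch of the XTC ("exclude top choices") step as commonly described, not necessarily this branch's exact code: with probability `prob`, every candidate at or above `threshold` is removed except the least likely of them. It also makes the second note above concrete: probabilities sum to 1, so at most one token can exceed 0.5, and XTC needs at least two qualifying candidates before it removes anything, which is why a threshold above 0.5 switches it off.

```cpp
// XTC sampling step, sketched; struct and parameter names are illustrative.
#include <algorithm>
#include <random>
#include <vector>

struct candidate { int id; float p; };

void apply_xtc(std::vector<candidate> & cands, float threshold, float prob, std::mt19937 & rng) {
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    if (prob <= 0.0f || dist(rng) >= prob) {
        return; // XTC not triggered on this sampling step
    }
    std::sort(cands.begin(), cands.end(),
              [](const candidate & a, const candidate & b) { return a.p > b.p; });

    size_t n_above = 0;
    while (n_above < cands.size() && cands[n_above].p >= threshold) {
        n_above++;
    }
    if (n_above >= 2) {
        // drop all qualifying tokens except the last (least probable) one
        cands.erase(cands.begin(), cands.begin() + (n_above - 1));
    }
}
```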
MaggotHATE
ba29d31fb7 Merge branch 'ggerganov:master' into master 2024-10-10 11:42:50 +05:00
Diego Devesa
c81f3bbb05
cmake : do not build common library by default when standalone (#9804) 2024-10-09 18:49:52 +02:00
Georgi Gerganov
e7022064ab perplexity : fix integer overflow (#9783)
* perplexity : fix integer overflow

ggml-ci

* perplexity : keep n_vocab as int and make appropriate casts

ggml-ci
2024-10-09 17:00:18 +03:00
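
The overflow class here is the usual one in perplexity-style code: a 32-bit product of a token index and n_vocab used as a buffer offset wraps once it exceeds INT_MAX. An illustrative sketch of the pattern and the cast-based fix the commit body describes (not the actual perplexity.cpp diff):

```cpp
// Widen the index computation while keeping n_vocab itself an int.
#include <cstdint>

float get_logit(const float * logits, int i_token, int n_vocab, int j) {
    // bad:  logits[i_token * n_vocab + j]   (32-bit multiply, can overflow)
    // good: cast one operand so the whole index is computed in 64 bits
    return logits[(int64_t) i_token * n_vocab + j];
}
```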
MaggotHATE
37e02e34a1 Added XTC to README 2024-10-09 14:08:02 +05:00
MaggotHATE
ed535bb2ae Merge branch 'ggerganov:master' into master 2024-10-09 14:00:55 +05:00
Georgi Gerganov
3dc48fe75a examples : remove llama.vim
An updated version will be added in #9787
2024-10-09 10:55:42 +03:00
MaggotHATE
d0b1053897 Fixed incorrect min_keep check 2024-10-09 00:59:46 +05:00
MaggotHATE
6feb6b399c Update dump info in common 2024-10-08 21:15:37 +05:00
MaggotHATE
c19fb26042 Merged back lost commits in common and arg 2024-10-08 21:11:35 +05:00
MaggotHATE
09bc6d507c Updated info in common and args 2024-10-08 20:57:36 +05:00
MaggotHATE
81a0c2603c Simplified algorithm and more tests 2024-10-08 18:38:43 +05:00
MaggotHATE
8110f783d1 Merge branch 'ggerganov:master' into master 2024-10-08 18:36:38 +05:00
Diego Devesa
dca1d4b58a ggml : fix BLAS with unsupported types (#9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
2024-10-08 14:21:43 +02:00
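
A sketch of the guard the first bullet describes, using the renamed, pointer-returning ggml_get_type_traits named in the commit body; the struct and field spellings are assumed to match ggml's type traits.

```cpp
// BLAS operates on floats, so a type is only usable on the BLAS path if it
// is already f32 or can be converted to f32 via to_float.
#include "ggml.h"

bool can_use_blas_for(enum ggml_type type) {
    const struct ggml_type_traits * traits = ggml_get_type_traits(type);
    return type == GGML_TYPE_F32 || traits->to_float != nullptr;
}
```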
Xuan Son Nguyen
458367a906 server : better security control for public deployments (#9776)
* server : more explicit endpoint access settings

* protect /props endpoint

* fix tests

* update server docs

* fix typo

* fix tests
2024-10-08 13:27:04 +02:00
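
A generic sketch in the spirit of this change, with hypothetical names throughout (the real server flags and checks are not reproduced here): sensitive endpoints such as /props are served only to authenticated callers once an API key is configured.

```cpp
// Hypothetical per-endpoint access control; names are illustrative only.
#include <set>
#include <string>

struct server_access {
    std::string api_key;                    // empty = authentication disabled
    std::set<std::string> public_endpoints; // reachable without a key
};

bool endpoint_allowed(const server_access & cfg,
                      const std::string & path,
                      const std::string & presented_key) {
    if (cfg.public_endpoints.count(path) > 0) return true;
    if (cfg.api_key.empty())                  return true;
    return presented_key == cfg.api_key;      // e.g. gate /props behind the key
}
```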
standby24x7
fa42aa6d89 scripts : fix spelling typo in messages and comments (#9782)
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-10-08 09:19:53 +03:00