Commit graph

2965 commits

Author SHA1 Message Date
Justine Tunney
ebbc728e37
Make atomic operations explicit 2024-05-22 00:35:13 -07:00
Justine Tunney
8435ab0ae8
Avoid INIT synchronization barrier when possible
This change makes inference go ~5% faster for me.
2024-05-22 00:31:28 -07:00
Justine Tunney
7deec14bd9
Make MUL_MAT initialization go fast
Using an atomic to delegate slices of the matrix to separate threads is
slow, because all the threads have to contend for the same memory spot.
The right thing to do here is use a `chore` variable, where all threads
perform the same computation independently.

This change introduces the ggml_once() and ggml_syncthreads() functions
which works the exact same way as CUDA. This is nice, since it means if
BLAS or LLAMAFILE doesn't need `B` requantized, then it can skip paying
for the synchronization barrier between the INIT and COMPUTE phases. We
can further refactor along this path, to remove all the INIT/COMP/FINAL
code too. All ops can should be in charge of their own synchronization.
2024-05-22 00:20:24 -07:00
Justine Tunney
ae6ee0b777
Revert "ggml : use dynamic thread scheduling for matrix multiplication (#6915)"
This reverts commit e1b40ac3b9.
2024-05-21 23:51:05 -07:00
liuwei-git
201cc11afa
llama : add phi3 128K model support (#7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' frin gguf converter

* fix flint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is small than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
Georgi Gerganov
6369bf0433
metal : handle F16 inf values, fix FA partial offload (#7434)
ggml-ci
2024-05-21 23:03:42 +03:00
Olivier Chafik
e402de364b
grammars: fix resampling logic regression (#7424) 2024-05-21 20:40:00 +01:00
Johannes Gäßler
fcf6538ba6
CUDA: fix unused warning in mmq.cu (#7442) 2024-05-21 20:27:12 +03:00
Georgi Gerganov
c3f8d58356
tests : test-tokenizer-0.sh print more info (#7402) 2024-05-21 19:53:48 +03:00
Amir
11474e756d
examples: cache hf model when --model not provided (#7353)
* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided

* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
Johannes Gäßler
d8ee902227
CUDA: deduplicate mmq code (#7397) 2024-05-21 16:02:12 +02:00
jaime-m-p
d7e852c1bc
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Georgi Gerganov
fabf30b4c4
llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00
Johannes Gäßler
20385cebcc
perplexity: update README FP16 results [no ci] (#7413) 2024-05-20 18:15:38 +02:00
Radoslav Gerganov
db10f01310
rpc : track allocated buffers (#7411)
* rpc : track allocated buffers

ref: #7407

* rpc : pack rpc_tensor tightly
2024-05-20 16:36:55 +03:00
Georgi Gerganov
3bc10cb485
server : fix temperature + disable some tests (#7409)
* server : fix temperature

* server : disable tests relying on parallel determinism

* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
AidanBeltonS
6bf9b66fa3
[SYCL] Update SYCL upscale operation (#7321)
* Update SYCL upscale operation

* Formatting

* Remove messages
2024-05-20 16:38:23 +05:30
Bingan
26cd4237bc
Update README.md (#7410) 2024-05-20 11:55:34 +02:00
Herman Semenov
213e90ed73
ggml-opencl, llama: using reserve() if count already known (#7272) 2024-05-20 10:33:21 +03:00
junchao-loongson
65c58207ec
ggml : add loongarch lsx and lasx support (#6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-05-20 10:19:21 +03:00
Georgi Gerganov
1cc0155d04
server : tuning tests (#7388)
* server : don't pass temperature as string

* server : increase timeout

* tests : fix the fix 0.8f -> 0.8

ggml-ci

* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
Georgi Gerganov
2789baf480
tests : fix --keep_split -> --keep-split (#7374) 2024-05-20 08:55:09 +03:00
Srihari-mcw
33c8d50acc
Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (#7258) 2024-05-20 12:18:39 +10:00
slaren
d359f30921
llama : remove MPI backend (#7395) 2024-05-20 01:17:03 +02:00
Fred Douglas
1ea2a0036e
quantize : fix --keep-split check (#7374) 2024-05-19 19:37:04 +03:00
0cc4m
f030ec1f7a
Vulkan Embedding Fix (#7360)
* Fix empty Vulkan host buffers

Add fp32 fp16 matmul shader

Fix matmul shader alignment

* Remove deprecated tensor->backend uses

* Fix Vulkan validation errors on embedding models with no offloaded layers

* Fix Vulkan llava segfault when not offloading layers
2024-05-19 17:19:53 +02:00
slaren
e4e6f67be6
ggml : fix another case of quants nans (#7387) 2024-05-19 17:08:46 +02:00
Johannes Gäßler
5ca49cbecd
ggml: implement quantized KV cache for FA (#7372) 2024-05-19 16:46:13 +02:00
Johannes Gäßler
1b01f06db0
server: add test for token probs (#7347) 2024-05-19 16:26:02 +02:00
Johannes Gäßler
41858392e1
server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
Anas Ahouzi
6aade19ee7
Add StableLM2 pre-tokenizer (#7349)
* Add StableLM pre-tokenizer

* Fix space

* Fix trailing whitespace
2024-05-19 22:46:46 +10:00
slaren
ab33f7a338
cuda : clear error after buffer allocation failure (#7376) 2024-05-19 14:19:37 +02:00
Brian
e23b974f4c
labeler.yml: Use settings from ggerganov/llama.cpp [no ci] (#7363)
https://github.com/actions/labeler#using-configuration-path-input-together-with-the-actionscheckout-action
Recommends the use of checkout action to use the correct repo context
when applying settings for PR labels

e.g.

    steps:
    - uses: actions/checkout@v4 # Uploads repository content to the runner
      with:
        repository: "owner/repositoryName" # The one of the available inputs, visit https://github.com/actions/checkout#readme to find more
    - uses: actions/labeler@v5
      with:
        configuration-path: 'path/to/the/uploaded/configuration/file'
2024-05-19 20:51:03 +10:00
Georgi Gerganov
854d365aba
cmake : update android comments (#7341) 2024-05-19 11:01:01 +03:00
fraxy-v
f5bf761747
Capture CUDA logging output (#7298)
* logging: output capture in cuda module

* fix compile error

* fix: vsnprintf terminates with 0, string use not correct

* post review

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-19 00:44:42 +02:00
Georgi Gerganov
059031b8c4
ci : re-enable sanitizer runs (#7358)
* Revert "ci : temporary disable sanitizer builds (#6128)"

This reverts commit 4f6d1337ca.

* ci : trigger
2024-05-18 18:55:54 +03:00
Georgi Gerganov
511182eabb
android : use "ci-android" branch for CI (#7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler
133d99c599
CUDA: deduplicate FlashAttention code (#7352) 2024-05-18 12:36:25 +02:00
Johannes Gäßler
cb42c29427
server: correct --threads documentation [no ci] (#7362) 2024-05-18 11:10:47 +02:00
Engininja2
d233b507cd
cuda : add half2 __shfl_xor() for ROCm 5.5 (#7263) 2024-05-18 10:05:17 +02:00
Steffen Röcker
0f98acfac6
llama : add support for larger Granite Code Models (20B, 34B) (#7324)
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116

There still remains to be a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
2024-05-18 11:04:55 +03:00
strawberrymelonpanda
ca57e0f35e
perplexity : ndot progress and show stats with < 100 tasks (#7348)
Fix floating point error with ndot printing, allow end stats on lower task numbers if multiple-choice tasks.
2024-05-18 10:57:08 +03:00
0cc4m
c1b295eea5
Update and fix Vulkan soft_max and argsort implementations (#7237)
* Update and fix Vulkan softmax implementation

* Update and fix Vulkan argsort implementation
2024-05-18 08:10:58 +02:00
Brian
de73196344
github-actions-labeler: initial commit (#7330)
* github-actions-labeler: initial commit [no ci]

* github actions: remove priority auto labeling [no ci]
2024-05-18 16:04:23 +10:00
Georgi Gerganov
b49a13dd2f
convert : fix set_vocab_sentencepiece (#6866)
* convert : fix set_vocab_sentencepiece

* Update convert-hf-to-gguf.py
2024-05-18 08:46:20 +03:00
slaren
05834841dc
ggml : fix quants nans when all the group weights are very close to zero (#7313) 2024-05-18 02:39:54 +02:00
Engininja2
ef277de2ad
cmake : fix typo in AMDGPU_TARGETS (#7356) 2024-05-18 02:39:25 +02:00
jaime-m-p
b43272afa2
Unicode codepoint flags for custom regexs (#7245)
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM
2024-05-18 01:09:13 +02:00