Commit graph

2973 commits

Author SHA1 Message Date
Oleksandr Kuvshynov
1d6d9497a8 readme 2024-05-27 12:36:57 -04:00
Oleksandr Kuvshynov
de26d49fbe duo: v5 2024-05-25 22:19:23 -04:00
Oleksandr Kuvshynov
7c8699add6 pass user data 2024-05-25 22:10:19 -04:00
Oleksandr Kuvshynov
534093878b duo: v3 2024-05-25 14:41:30 -04:00
Oleksandr Kuvshynov
96811fdf63 duo: v2 2024-05-25 14:23:57 -04:00
Oleksandr Kuvshynov
78938bc0c9 duo: v0 2024-05-25 13:59:28 -04:00
Oleksandr Kuvshynov
83aabb3fb7 readme 2024-05-24 23:56:48 -04:00
Oleksandr Kuvshynov
10d5aefed5 logging 2024-05-24 22:21:41 -04:00
Oleksandr Kuvshynov
66982abcb1 fixes 2024-05-24 12:22:59 -04:00
Oleksandr Kuvshynov
02e2c91d01 correct split id 2024-05-24 09:52:28 -04:00
Oleksandr Kuvshynov
60fe62e6eb some renaming 2024-05-22 23:52:36 -04:00
Oleksandr Kuvshynov
479c80a0db duo: cleanup v2 2024-05-22 23:31:23 -04:00
Oleksandr Kuvshynov
eecdd3b0ce duo: first ~working option 2024-05-22 23:02:31 -04:00
Oleksandr Kuvshynov
2849247c4f duo: more cleanup 2024-05-21 22:45:59 -04:00
Oleksandr Kuvshynov
f3965704fd duo: simplify a little 2024-05-21 22:31:52 -04:00
Oleksandr Kuvshynov
d52d193e58 duo v0
setting up RPC + callback on each split completion

1. start rpc server on local instance on two different ports with 5GB
   allocated each.
2. set up another callback on completion of a split. This seems cleaner
   than trying to second-guess which tensor is the boundary of a split.
3. run it with 8B model @ 4bit, observe split_done captured at a reasonable place.

Next step: bring back linear speculation and start speculating on other remote
   instances.
2024-05-21 16:11:30 -04:00
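
The "duo v0" message above (d52d193e58) describes firing a callback when each split completes instead of guessing which tensor marks a split boundary. A minimal sketch of that idea follows. The names `run_splits`, `on_split_done`, and `Op` are hypothetical illustrations, not the commit's code and not llama.cpp's API, and plain Python callables stand in for graph operations.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a graph operation; not a llama.cpp type.
Op = Callable[[float], float]


def run_splits(
    splits: Sequence[Sequence[Op]],
    x: float,
    on_split_done: Callable[[int, float], None],
) -> float:
    """Run each split's ops in order and call on_split_done once per split."""
    for split_id, ops in enumerate(splits):
        for op in ops:
            x = op(x)
        # Fire only after the whole split has finished, so the caller never
        # has to guess which op (tensor) is the last one in the split.
        on_split_done(split_id, x)
    return x


events: List[Tuple[int, float]] = []
splits = [
    [lambda v: v + 1, lambda v: v * 2],  # split 0 (assumed: first backend)
    [lambda v: v - 3],                   # split 1 (assumed: second backend)
]
result = run_splits(splits, 1.0, lambda i, v: events.append((i, v)))
```

Here the callback receives the split index and the value at the boundary, which is roughly where a speculation step could hook in under these assumptions.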
Georgi Gerganov
c3f8d58356
tests : test-tokenizer-0.sh print more info (#7402) 2024-05-21 19:53:48 +03:00
Amir
11474e756d
examples: cache hf model when --model not provided (#7353)
* examples: cache hf model when --model not provided
2024-05-21 17:13:12 +03:00
Johannes Gäßler
d8ee902227
CUDA: deduplicate mmq code (#7397) 2024-05-21 16:02:12 +02:00
jaime-m-p
d7e852c1bc
Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
917dc8cfa6
Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Georgi Gerganov
fabf30b4c4
llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00
Johannes Gäßler
20385cebcc
perplexity: update README FP16 results [no ci] (#7413) 2024-05-20 18:15:38 +02:00
Radoslav Gerganov
db10f01310
rpc : track allocated buffers (#7411)
* rpc : track allocated buffers

ref: #7407

* rpc : pack rpc_tensor tightly
2024-05-20 16:36:55 +03:00
Georgi Gerganov
3bc10cb485
server : fix temperature + disable some tests (#7409)
* server : fix temperature

* server : disable tests relying on parallel determinism

* ci : change server Debug -> RelWithDebInfo
2024-05-20 22:10:03 +10:00
AidanBeltonS
6bf9b66fa3
[SYCL] Update SYCL upscale operation (#7321)
* Update SYCL upscale operation

* Formatting

* Remove messages
2024-05-20 16:38:23 +05:30
Bingan
26cd4237bc
Update README.md (#7410) 2024-05-20 11:55:34 +02:00
Herman Semenov
213e90ed73
ggml-opencl, llama: using reserve() if count already known (#7272) 2024-05-20 10:33:21 +03:00
junchao-loongson
65c58207ec
ggml : add loongarch lsx and lasx support (#6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-05-20 10:19:21 +03:00
Georgi Gerganov
1cc0155d04
server : tuning tests (#7388)
* server : don't pass temperature as string

* server : increase timeout

* tests : fix the fix 0.8f -> 0.8

ggml-ci

* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
Georgi Gerganov
e932094d58
server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
Georgi Gerganov
2789baf480
tests : fix --keep_split -> --keep-split (#7374) 2024-05-20 08:55:09 +03:00
Srihari-mcw
33c8d50acc
Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (#7258) 2024-05-20 12:18:39 +10:00
slaren
d359f30921
llama : remove MPI backend (#7395) 2024-05-20 01:17:03 +02:00
Fred Douglas
1ea2a0036e
quantize : fix --keep-split check (#7374) 2024-05-19 19:37:04 +03:00
0cc4m
f030ec1f7a
Vulkan Embedding Fix (#7360)
* Fix empty Vulkan host buffers

Add fp32 fp16 matmul shader

Fix matmul shader alignment

* Remove deprecated tensor->backend uses

* Fix Vulkan validation errors on embedding models with no offloaded layers

* Fix Vulkan llava segfault when not offloading layers
2024-05-19 17:19:53 +02:00
slaren
e4e6f67be6
ggml : fix another case of quants nans (#7387) 2024-05-19 17:08:46 +02:00
Johannes Gäßler
5ca49cbecd
ggml: implement quantized KV cache for FA (#7372) 2024-05-19 16:46:13 +02:00
Johannes Gäßler
1b01f06db0
server: add test for token probs (#7347) 2024-05-19 16:26:02 +02:00
Johannes Gäßler
41858392e1
server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
Anas Ahouzi
6aade19ee7
Add StableLM2 pre-tokenizer (#7349)
* Add StableLM pre-tokenizer

* Fix space

* Fix trailing whitespace
2024-05-19 22:46:46 +10:00
slaren
ab33f7a338
cuda : clear error after buffer allocation failure (#7376) 2024-05-19 14:19:37 +02:00
Brian
e23b974f4c
labeler.yml: Use settings from ggerganov/llama.cpp [no ci] (#7363)
https://github.com/actions/labeler#using-configuration-path-input-together-with-the-actionscheckout-action
Recommends the use of checkout action to use the correct repo context
when applying settings for PR labels

e.g.

    steps:
    - uses: actions/checkout@v4 # Uploads repository content to the runner
      with:
        repository: "owner/repositoryName" # One of the available inputs; see https://github.com/actions/checkout#readme for more
    - uses: actions/labeler@v5
      with:
        configuration-path: 'path/to/the/uploaded/configuration/file'
2024-05-19 20:51:03 +10:00
Georgi Gerganov
854d365aba
cmake : update android comments (#7341) 2024-05-19 11:01:01 +03:00
fraxy-v
f5bf761747
Capture CUDA logging output (#7298)
* logging: output capture in cuda module

* fix compile error

* fix: vsnprintf terminates with 0, string use not correct

* post review

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-19 00:44:42 +02:00
Georgi Gerganov
059031b8c4
ci : re-enable sanitizer runs (#7358)
* Revert "ci : temporary disable sanitizer builds (#6128)"

This reverts commit 4f6d1337ca.

* ci : trigger
2024-05-18 18:55:54 +03:00
Georgi Gerganov
511182eabb
android : use "ci-android" branch for CI (#7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-05-18 20:40:39 +10:00
Johannes Gäßler
133d99c599
CUDA: deduplicate FlashAttention code (#7352) 2024-05-18 12:36:25 +02:00
Johannes Gäßler
cb42c29427
server: correct --threads documentation [no ci] (#7362) 2024-05-18 11:10:47 +02:00
Engininja2
d233b507cd
cuda : add half2 __shfl_xor() for ROCm 5.5 (#7263) 2024-05-18 10:05:17 +02:00