Commit graph

2553 commits

Author SHA1 Message Date
Kawrakow
76aa30a263
Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (#6183)
* k_cache: be able to use Q5_0

* k_cache: be able to use Q5_1 on CUDA

* k_cache: be able to use Q5_0 on Metal

* k_cache: be able to use Q5_1 on Metal

* k_cache: be able to use IQ4_NL - just CUDA for now

* k_cache: be able to use IQ4_NL on Metal

* k_cache: add newly added supported types to llama-bench and CUDA supports_op

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-21 08:27:57 +01:00
AidanBeltonS
c5b8595e3f
Add nvidia and amd backends (#6157) 2024-03-21 11:40:52 +05:30
Francis Couture-Harpin
5f33a675ca perplexity : make hellaswag and multiple-choice outputs identical to master
Due to how the KV cache is updated, the logprobs for tokens in a batch
are very slightly affected by the other tokens present in the batch,
so to make hellaswag and multiple-choice return exactly the same results
as on master, the last token of each sequence needs to be evaluated
even though its output is not used at all.

This will probably be changed back in the future to make these benchmarks
a tiny bit faster.

* perplexity : fix division by zero when using less than 100 multiple-choice tasks
2024-03-20 23:05:18 -04:00
Francis Couture-Harpin
7d8d6b589f llama : handle errors from llama_output_reserve at call sites 2024-03-20 23:05:12 -04:00
slaren
42e21c6882
cuda : fix conflict with std::swap (#6186) 2024-03-21 01:47:46 +01:00
slaren
1c51f98adc
cuda : print the returned error when CUDA initialization fails (#6185) 2024-03-20 21:03:26 +01:00
Ziang Wu
f9c7ba3447
llava : update MobileVLM-README.md (#6180) 2024-03-20 17:29:51 +02:00
Ziang Wu
272935b281
llava : add MobileVLM_V2 backup (#6175)
* Add MobileVLM_V2 backup

* Update MobileVLM-README.md

* Update examples/llava/MobileVLM-README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/convert-image-encoder-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* clip :  fix whitespace

* fix definition mistake in clip.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-20 17:02:32 +02:00
slaren
ccf58aa3ec
cuda : refactor to remove global resources (#6170)
* cuda : refactor to remove global resources
2024-03-20 14:42:59 +01:00
Xuan Son Nguyen
91f8ad167d
Server: version bump for httplib and json (#6169)
* server: version bump for httplib and json

* fix build

* bring back content_length
2024-03-20 13:30:36 +01:00
Georgi Gerganov
6b7e76d28c
gitignore : ignore curl-related files 2024-03-20 14:17:34 +02:00
Georgi Gerganov
bc0baab2ea
server : allow to override -ngl in tests (#6170) 2024-03-20 14:14:32 +02:00
Georgi Gerganov
d795988d9e
Revert "llava : add a MobileVLM_V2-1.7B backup (#6152)"
This reverts commit f8c4e745e1.
2024-03-20 13:29:49 +02:00
Ziang Wu
f8c4e745e1
llava : add a MobileVLM_V2-1.7B backup (#6152)
* Add MobileVLM_V2 backup

* Update MobileVLM-README.md

* Update examples/llava/MobileVLM-README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/convert-image-encoder-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* clip :  fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-20 13:20:37 +02:00
Karthick
47cc7a7bf9
Server: Handle n_keep parameter in the request (#6174) 2024-03-20 12:02:34 +01:00
Jared Van Bortel
bd60d82d0c
server tests : more pythonic process management; fix bare except: (#6146)
* server tests : remove seemingly redundant newlines in print()

* server tests : use built-in subprocess features, not os.kill and psutil

* server tests : do not catch e.g. SystemExit; use print_exc

* server tests: handle TimeoutExpired exception

* server tests: fix connect on dual-stack systems

* server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127)

* server: tests: remove the hack on windows since now we get the good socket family

* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)

* server: tests: add new tokens regex following new repeat penalties default changed in (#6127)

---------

Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-03-20 06:33:49 +01:00
Neo Zhang Jianyu
6c0b287748
update readme sycl for new update (#6151)
* update readme sycl for new update

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

* Update README-sycl.md

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

* update by review comments

* update w64devkit link

* update for verify device id part

* Update README-sycl.md

Co-authored-by: Meng, Hengyu <airdldl@163.com>

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-03-20 11:21:41 +08:00
Abhilash Majumder
d26e8b669d
increase igpu cluster limit (#6159) 2024-03-20 08:28:49 +05:30
Francis Couture-Harpin
615a3a4a50 llama : clearer error messages for invalid logits or embeddings ids
* llama : assert all models that can have inp_out_ids

Since the graph topology is now constant, this presence check
can be done even when there are no outputs.

* llama : assert logits and embd buffers exist before writing to them
2024-03-19 15:32:18 -04:00
Francis Couture-Harpin
8f70dcb0f3 perplexity : make Winogrande work as it does on master
The problems with the Winogrande implementation will
need to be fixed in a separate PR to ease review.
2024-03-19 14:07:48 -04:00
DAN™
d8b009a945
Remove unneeded header file. (#6158) 2024-03-19 17:16:09 +01:00
Pierrick Hymbert
d0d5de42e5
gguf-split: split and merge gguf per batch of tensors (#6135)
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-19 12:05:44 +01:00
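The "per batch of tensors" idea in commit d0d5de42e5 above can be pictured as chunking an ordered tensor list so that no split holds more than a fixed number of tensors. This is a minimal sketch of that grouping only, not the gguf-split implementation: `plan_splits` and the tensor names are hypothetical, and `max_tensors` merely mirrors the meaning suggested by the `--split-max-tensors` option name.

```python
from typing import List, Sequence


def plan_splits(tensor_names: Sequence[str], max_tensors: int) -> List[List[str]]:
    """Group tensor names, in order, into splits of at most max_tensors each.

    Hypothetical helper for illustration; it plans the grouping only and
    does not read or write any GGUF data.
    """
    if max_tensors <= 0:
        raise ValueError("max_tensors must be positive")
    return [
        list(tensor_names[start:start + max_tensors])
        for start in range(0, len(tensor_names), max_tensors)
    ]


# Made-up tensor names, used only to show the grouping.
example_names = ["tok_embd", "blk.0.attn_q", "blk.0.attn_k", "blk.0.ffn_up", "output"]
example_plan = plan_splits(example_names, max_tensors=2)
```

Merging is the inverse under this model: concatenating the groups in order recovers the original tensor list.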
Georgi Gerganov
b80cf3b2d1
common : disable repeat penalties by default (#6127) 2024-03-19 10:21:54 +02:00
slaren
970a48060a
ci : exempt some labels from being tagged as stale (#6140) 2024-03-19 10:06:54 +02:00
DAN™
4c28b82529
common : print usage on '-h' and '--help' (#6145) 2024-03-19 07:59:36 +02:00
Francis Couture-Harpin
d04cfaf2f5 llama : fix llama_output_reserve nullptr deref when new_size is 0 2024-03-18 21:26:08 -04:00
Francis Couture-Harpin
8b826c5b08 ggml : skip empty tensors in all backends 2024-03-18 21:15:00 -04:00
Francis Couture-Harpin
4551e7eba8 llama : use a vector for ctx->output_ids
* llama : rework reallocation logic for llama_output_reserve

Now comparing the actual size with the new total size of the output buffer
to allow more efficient enabling and disabling of the embeddings
and/or logits output in the future.
2024-03-18 20:51:32 -04:00
Francis Couture-Harpin
09bb15a66a ggml : make ggml_is_empty public and work with views 2024-03-18 20:21:02 -04:00
github-actions[bot]
2d15886bb0 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9df3e30ce24fd28c7b3e2de0d986769db5d6225d' (2024-03-06)
  → 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
2024-03-18 18:51:30 +00:00
Jared Van Bortel
d199ca79f2
mpt : implement backwards compatibility with duped output tensor (#6139) 2024-03-18 12:49:02 -04:00
Felix
104f5e0fc1
clip : fix memory leak (#6138) 2024-03-18 17:40:22 +02:00
slaren
5e1b7f94a0
backend : set max split inputs to GGML_MAX_SRC (#6137) 2024-03-18 16:33:44 +01:00
Georgi Gerganov
ac9ee6a4ad
ci : disable stale issue messages (#6126) 2024-03-18 13:45:38 +02:00
Georgi Gerganov
4f6d1337ca
ci : temporarily disable sanitizer builds (#6128) 2024-03-18 13:45:27 +02:00
slaren
2bf8d0f7c4
backend : offload large batches to GPU (#6083)
* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-03-18 11:03:04 +01:00
DAN™
496bc79bc2
common : tidy-up argument parsing (#6105)
* Tidy-up argument parsing.

* Missing ref.

* common : minor

* common : add static classifier

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-18 10:27:44 +02:00
Thérence
9b03719ad7
convert : add support for CamembertModel architecture (#6119)
Adding support for CamembertModel architecture used by :
https://huggingface.co/dangvantuan/sentence-camembert-large
2024-03-18 10:17:00 +02:00
Romain D
3a6efdd03c
convert : use f32 outtype for bf16 tensors (#6106)
The old behaviour is to use f16, but bf16 to f16 is not a lossless conversion.
Change the outtype to f32 to default to a lossless conversion.
2024-03-18 10:04:41 +02:00
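The claim in commit 3a6efdd03c above, that bf16 to f16 is lossy while bf16 to f32 is not, follows from the formats: bf16 is the upper 16 bits of an IEEE-754 float32, so widening only appends zero bits, whereas f16 has a much narrower exponent range. The sketch below illustrates this with the Python standard library only. It is not the convert script's code, and the helper names are made up for the example.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Widen a raw bf16 bit pattern to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16_bits(value: float) -> int:
    """Take the upper 16 bits of the float32 encoding (plain truncation)."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


def survives_f16(value: float) -> bool:
    """Return True if value is unchanged by a round trip through IEEE half precision."""
    try:
        packed = struct.pack("<e", value)
    except OverflowError:
        # Magnitude exceeds the largest finite f16 value (65504).
        return False
    return struct.unpack("<e", packed)[0] == value


one = bf16_bits_to_f32(0x3F80)   # 1.0, representable in every format
tiny = bf16_bits_to_f32(0x3000)  # 2**-31, below the smallest f16 subnormal
huge = bf16_bits_to_f32(0x7F00)  # 2**127, far above the f16 maximum
```

Here `one` survives the f16 round trip, while `tiny` underflows to zero and `huge` does not fit at all. All three map back to their original bf16 bit patterns from f32.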
Francis Couture-Harpin
6bf7f3f41c ggml : do not multi-thread ops returning empty tensors 2024-03-18 00:35:03 -04:00
Francis Couture-Harpin
99c37ccb6b ggml : saner ggml_can_repeat with empty tensors
*  ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1
2024-03-18 00:14:36 -04:00
Francis Couture-Harpin
d100502251 llama : keep same graph topology even when n_outputs == 0 2024-03-17 22:04:42 -04:00
Francis Couture-Harpin
711b0bcb11 llama : fix running a batch with n_outputs == 0
It previously worked because lctx.inp_out_ids was not initialized,
so it pointed to some garbage address which was somehow still valid when I
ran my tests.
2024-03-17 20:41:21 -04:00
Francis Couture-Harpin
a57fa7faa4 llama : fix not-skipping outputs of non-causal models 2024-03-17 20:19:25 -04:00
Francis Couture-Harpin
e19cb3aeb7 llama : fix wrong n_outputs in llama_set_inputs
A mismatch happened when using a smaller n_ubatch than n_batch and then using
llama_batch_get_one(). The decision of what n_outputs should be now almost
fully depends on how lctx.n_outputs is set in llama_decode_internal.
The conditions are simpler this way.

* llama : when saving the state, recalculate n_outputs

This ensures the correct number of outputs for the entire previous batch
is stored in the session file, even when n_ubatch is smaller than n_batch.
2024-03-17 17:04:10 -04:00
Francis Couture-Harpin
408fcb0f91 llama : fix llama_get_embeddings_ith when the resulting id is 0 2024-03-17 15:36:41 -04:00
Francis Couture-Harpin
487f89ec2e llama : fix embedding conditions 2024-03-17 15:36:41 -04:00
Francis Couture-Harpin
d0129e8e29 perplexity : normalize spaces and punctuation in Winogrande sentences 2024-03-17 15:36:41 -04:00
Francis Couture-Harpin
17b45c96ed perplexity : fix Winogrande, use correct logits for second choice start
The first logits used to evaluate the second choice were not from
the end of the common prefix; instead, they were the logits from the end
of the first choice. This has been corrected.

The previous implementation sometimes had outliers in the scores of
choices for some tasks, and the logic to skip choices words
in the log-likelihood evaluation probably was an attempt to reduce those,
but it was complex and didn't quite seem to be the right thing.

This is simpler now, and the outlier scores aren't there anymore.
2024-03-17 15:36:41 -04:00
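The fix described in commit 17b45c96ed above depends on which logits row is used to score the first token where two choices diverge. In a causal model, the logits at position i predict the token at position i + 1, so that first divergent token should be scored from the last position of the common prefix. The sketch below shows only this index arithmetic. The function names and token IDs are hypothetical, and it is not the perplexity tool's code.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def first_scoring_position(choice_a: Sequence[int], choice_b: Sequence[int]) -> int:
    """Index of the logits row that predicts the first divergent token.

    Logits at position i predict token i + 1, so the row to use is the
    last position of the common prefix.
    """
    n = common_prefix_len(choice_a, choice_b)
    if n == 0:
        raise ValueError("choices share no prefix, so no logits precede the first token")
    return n - 1


# Made-up token IDs: both choices share the prefix [10, 11, 12].
first_choice = [10, 11, 12, 20, 21]
second_choice = [10, 11, 12, 30]
row = first_scoring_position(first_choice, second_choice)
```

With these inputs `row` is 2 for both choices. The bug described in the commit corresponds to scoring the second choice from the row at the end of the first choice instead.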
Francis Couture-Harpin
25981fca37 perplexity : adapt to the logits API changes 2024-03-17 15:36:41 -04:00