Commit graph

2309 commits

Author SHA1 Message Date
Jared Van Bortel
89febfed93
examples : do not assume BOS when shifting context (#5622) 2024-02-21 10:33:54 -05:00
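Context shifting here means keeping a prefix of the prompt and discarding a window of older tokens to make room; the fix is that the kept prefix no longer implicitly assumes a BOS token. A minimal sketch of the pattern (function and parameter names hypothetical, not the actual example code):

```python
def shift_context(tokens, n_keep, n_discard):
    """Keep the first n_keep tokens, drop the next n_discard, keep the rest.

    n_keep counts exactly the tokens kept: no BOS token is assumed, so a
    caller whose model uses BOS must account for it in n_keep itself.
    """
    return tokens[:n_keep] + tokens[n_keep + n_discard:]
```

Usage: `shift_context([1, 2, 3, 4, 5, 6], 2, 2)` keeps `[1, 2]`, drops `[3, 4]`, and yields `[1, 2, 5, 6]`.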
Georgi Gerganov
5022cf242d
sync : ggml 2024-02-21 16:52:52 +02:00
Pierrick Hymbert
1ecea255eb
server: health: fix race condition on slots data using tasks queue (#5634)
* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.
2024-02-21 15:47:48 +01:00
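The race fix routes slot-state reads through the task queue, so only the worker thread ever touches the slots. A minimal Python sketch of that ownership pattern (all names hypothetical; the server itself is C++):

```python
import queue
import threading

class SlotOwner:
    """Single-writer pattern: only the worker loop reads or mutates slots.

    A health request posts a task and waits for the worker's snapshot
    instead of reading shared slot data concurrently.
    """
    def __init__(self, n_slots):
        self.slots = [{"id": i, "busy": False} for i in range(n_slots)]
        self.tasks = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            reply = self.tasks.get()
            # Only this thread touches self.slots, so the copy is consistent.
            reply.put([dict(s) for s in self.slots])

    def health(self):
        reply = queue.Queue(maxsize=1)
        self.tasks.put(reply)
        return reply.get(timeout=1.0)
```

The design choice is the same in both languages: serializing reads through the queue removes the need for a lock around the slot data.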
Ettore Di Giacinto
a00a35cef9
readme : add LocalAI to the available UIs (#5629) 2024-02-21 16:39:10 +02:00
Georgi Gerganov
eccd7a26dd
sync : ggml (#5633)
* ggml : fix conv_2d batch mode (ggml/737)

Co-authored-by: bssrdf <bssrdf@gmail.com>

* ggml : compute forward no longer pass src tensors (ggml/729)

* sync : ggml

ggml-ci

---------

Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-02-21 16:17:10 +02:00
Georgi Gerganov
c14f72db9c
readme : update hot topics 2024-02-21 15:39:54 +02:00
Daniel Bevenius
cc6cac08e3
llava : add --skip-unknown to 1.6 convert.py (#5632)
This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints to avoid updating
possibly checked out files.

The motivation for this change is that this was done for 1.5
in Commit fc0c8d286a ("llava :
update surgery script to not remove tensors") and makes the examples
more consistent.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-21 15:36:57 +02:00
postmasters
580111d42b
llama : add gemma model (#5631)
There are a couple of notable things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
2024-02-21 15:08:22 +02:00
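The two architecture points can be illustrated with a toy sketch (all dimensions made up): one matrix serves both as the input embedding and, transposed, as the output projection, and the per-head key/value length is an independent hyperparameter rather than `n_embd / n_head`.

```python
import numpy as np

n_vocab, n_embd = 10, 8
n_head, head_dim = 2, 16  # note: head_dim != n_embd // n_head

rng = np.random.default_rng(0)
tok_embd = rng.standard_normal((n_vocab, n_embd))

def embed(token_id):
    # Input side: look up the token's embedding row.
    return tok_embd[token_id]

def output_logits(hidden):
    # Output side: the same matrix is reused as the output projection
    # (weight tying), so no separate lm_head tensor is stored.
    return hidden @ tok_embd.T
```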
Georgi Gerganov
f1d4138c13
server : fix initialization thread issues 2024-02-21 13:08:57 +02:00
Meng, Hengyu
88c46cbdac
[SYCL] context add name (#5624)
* [SYCL] context add name

* name should start with SYCL*
2024-02-21 17:52:06 +08:00
Kawrakow
a14679cc30
IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq thing missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-21 11:39:52 +02:00
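The core idea of a non-linear 4-bit quant with blocks of 32 can be sketched as follows: each block stores one scale plus 32 four-bit indices into a fixed non-uniform value table. This is a simplified illustration only; the table below is hypothetical (not the actual `kvalues_iq4nl`), and the real format also packs the scale into 6 bits and two indices per byte.

```python
import numpy as np

# Hypothetical non-uniform table: 16 entries, denser near zero where
# weight values cluster, which is the point of a non-linear quant.
KVALUES = np.array([-127, -104, -83, -65, -49, -35, -22, -10,
                    1, 13, 25, 38, 53, 69, 89, 113], dtype=np.float32)

def quantize_block32(x):
    """Quantize 32 floats to one scale + 32 four-bit indices into KVALUES."""
    assert x.shape == (32,)
    scale = np.max(np.abs(x)) / np.max(np.abs(KVALUES))
    # Map each scaled value to the nearest table entry.
    idx = np.argmin(np.abs(x[:, None] / max(scale, 1e-12) - KVALUES[None, :]),
                    axis=1)
    return scale, idx.astype(np.uint8)

def dequantize_block32(scale, idx):
    return scale * KVALUES[idx]
```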
Pierrick HYMBERT
2a37bd6b86 server: tests: fix the multi users infinite loop test 2024-02-21 02:29:50 +01:00
Pierrick HYMBERT
469af4b4ec server: tests: change CI workflow trigger 2024-02-21 02:20:44 +01:00
Pierrick HYMBERT
3322bfa980 server: tests: add a small check to be sure all started threads have generated response 2024-02-21 02:04:59 +01:00
Pierrick HYMBERT
672d98f6f0 server: tests: CORS and api key checks scenario 2024-02-21 01:51:33 +01:00
Pierrick HYMBERT
6dcbcfe6ba server: tests: simplify completion scenario 2024-02-21 00:43:50 +01:00
Pierrick HYMBERT
19664b9f01 server: tests: detokenize endpoint issue reference added 2024-02-21 00:17:38 +01:00
Pierrick HYMBERT
1065f6d41b server: tests: add tokenize/detokenize scenario 2024-02-21 00:13:53 +01:00
Pierrick HYMBERT
e6d482088d server: tests: add embeddings scenario 2024-02-21 00:02:30 +01:00
Pierrick HYMBERT
1ecda0d13e server: tests: disable issue 3969 scenario 2024-02-20 23:35:44 +01:00
Pierrick HYMBERT
b0b6d83c76 server: tests: add infinite loop scenario 2024-02-20 23:17:00 +01:00
Pierrick HYMBERT
68574c6f98 server: tests: add infinite loop scenario 2024-02-20 23:11:59 +01:00
Pierrick HYMBERT
6b9dc4f291 server: tests: add infinite loop 2024-02-20 23:05:27 +01:00
Pierrick HYMBERT
0772884b06 server: tests: add a constant seed in completion request 2024-02-20 22:55:29 +01:00
Pierrick HYMBERT
b9f8390d28 server: tests: check for infinite loops 2024-02-20 22:49:36 +01:00
Pierrick HYMBERT
367b59a15c server: tests: check for infinite loops 2024-02-20 22:45:30 +01:00
Pierrick HYMBERT
c355f76427 server: tests: slots endpoint checks 2024-02-20 22:32:11 +01:00
Pierrick HYMBERT
11adf1d864 server: tests: add OAI multi user scenario 2024-02-20 22:00:09 +01:00
Pierrick HYMBERT
9b7ea97979 server: tests: add OAI stream test, fix file end of line, fast fail behave 2024-02-20 21:34:35 +01:00
Pierrick HYMBERT
56583bee41 server: tests: refactor steps and vocabulary 2024-02-20 20:52:24 +01:00
Pierrick HYMBERT
6c95ec6587 server: tests: change model to: @karpathy's tinyllamas 2024-02-20 20:50:14 +01:00
CJ Pais
6560bed3f0
server : support llava 1.6 (#5553)
* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build
2024-02-20 21:07:22 +02:00
slaren
06bf2cf8c4
make : fix debug build with CUDA (#5616) 2024-02-20 20:06:17 +01:00
Pierrick HYMBERT
8bb586bf06 server: tests: add health check and concurrent request example 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
1680599b01 server: tests: build only the server 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
fe9866a52d server: tests: use ngxson llama_xs_q4.bin 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
30aa323fb9 server: tests: fix ci workflow 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
4e5245e6b8 server: tests: fix ci workflow 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
6497755de5 server: tests: fix ci workflow 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
9b63d7057a server: tests: reduce number of files, all in one tests shell script 2024-02-20 19:05:21 +01:00
Pierrick HYMBERT
157bcf2286 server: init functional test 2024-02-20 19:05:21 +01:00
Daniel Bevenius
4ed8e4fbef
llava : add explicit instructions for llava-1.6 (#5611)
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.

The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-20 19:30:27 +02:00
Xuan Son Nguyen
9c405c9f9a
Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 15:58:27 +01:00
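What delegating to `llama_chat_apply_template` buys is a single place that knows how to turn a message list into the model's prompt format, instead of the server hand-rolling it. A sketch of one such format, ChatML, for illustration only (this is not the library function itself):

```python
def format_chatml(messages, add_assistant_prompt=True):
    """Render a list of {"role", "content"} dicts in ChatML style."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_assistant_prompt:
        # Open the assistant turn so generation continues from here.
        out.append("<|im_start|>assistant\n")
    return "".join(out)
```

Other model families use different delimiters, which is exactly why centralizing template handling in the library matters.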
Dane Madsen
5207b3fbc5
readme : update UI list (#5605)
* Add maid to ui list

* Specify licence
2024-02-20 12:00:23 +02:00
Haoxiang Fei
8dbbd75754
metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 11:58:36 +02:00
Pierrick Hymbert
c0a8c6db37
server : health endpoint configurable failure on no slot (#5594) 2024-02-20 09:48:19 +02:00
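The configurable behaviour can be sketched in a few lines (flag and field names hypothetical): when enabled, the health check reports failure as soon as every slot is busy, rather than always reporting OK while the server is alive.

```python
def health_status(slots, fail_on_no_slot):
    """Return (http_status, body) for a /health-style check."""
    idle = sum(1 for s in slots if not s["busy"])
    if idle == 0 and fail_on_no_slot:
        # Opt-in failure mode: saturated server counts as unhealthy.
        return 503, {"status": "no slot available"}
    return 200, {"status": "ok", "slots_idle": idle}
```

With the flag off, a fully busy server still reports 200; with it on, a load balancer can use the 503 to steer traffic elsewhere.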
AidanBeltonS
b9111bd209
Update ggml_sycl_op_mul_mat_vec_q (#5502)
* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-20 12:31:25 +05:30
Mathijs de Bruin
633782b8d9 nix: now that we can do so, allow MacOS to build Vulkan binaries
Author:    Philip Taron <philip.taron@gmail.com>
Date:      Tue Feb 13 20:28:02 2024 +0000
2024-02-19 14:49:49 -08:00
0cc4m
22f83f0c38 Enable Vulkan MacOS CI 2024-02-19 14:49:49 -08:00
0cc4m
bb9dcd560a Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 2024-02-19 14:49:49 -08:00