Commit graph

2255 commits

Author SHA1 Message Date
pudepiedj
6f0bfdbe55 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch 2024-02-25 09:29:35 +00:00
pudepiedj
c80d429c42 Various server updates 2024-02-25 09:29:31 +00:00
pudepiedj
62ef858d00
Update README.md 2024-02-23 10:01:24 +00:00
pudepiedj
9aa1457066
Update README.md 2024-02-23 10:00:02 +00:00
pudepiedj
a9dd5f3769 Revised server hostname 2024-02-23 09:56:14 +00:00
pudepiedj
011ea9852a Minor updates 2024-02-23 07:48:00 +00:00
pudepiedj
9c99ef43d7 small alterations 2024-02-22 16:45:05 +00:00
pudepiedj
298207185d small changes and threads 64 2024-02-21 21:10:54 +00:00
pudepiedj
3800bc6c7f Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch 2024-02-21 17:33:06 +00:00
pudepiedj
1b04d5907b improved python client lower threadpool 2024-02-21 17:33:03 +00:00
pudepiedj
fd0ccdb601
Merge branch 'ggerganov:master' into server_branch 2024-02-21 13:28:07 +00:00
postmasters
580111d42b
llama : add gemma model (#5631)
There are couple things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
2024-02-21 15:08:22 +02:00
pudepiedj
f7e29e5248 More requests and threads 2024-02-21 11:39:48 +00:00
Meng, Hengyu
88c46cbdac
[SYCL] conext add name (#5624)
* [SYCL] conext add name

* name should start with SYCL*
2024-02-21 17:52:06 +08:00
Kawrakow
a14679cc30
IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq thing missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-21 11:39:52 +02:00
pudepiedj
760b6d639b minor change 2024-02-21 09:32:29 +00:00
pudepiedj
e500a14ab0 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch 2024-02-20 19:31:33 +00:00
pudepiedj
4904b0a06e kvgraphics with interactio 2024-02-20 19:31:29 +00:00
pudepiedj
ecbb531b1f
Merge branch 'ggerganov:master' into server_branch 2024-02-20 19:23:44 +00:00
CJ Pais
6560bed3f0
server : support llava 1.6 (#5553)
* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build
2024-02-20 21:07:22 +02:00
slaren
06bf2cf8c4
make : fix debug build with CUDA (#5616) 2024-02-20 20:06:17 +01:00
Daniel Bevenius
4ed8e4fbef
llava : add explicit instructions for llava-1.6 (#5611)
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.

The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-20 19:30:27 +02:00
Xuan Son Nguyen
9c405c9f9a
Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 15:58:27 +01:00
Dane Madsen
5207b3fbc5
readme : update UI list (#5605)
* Add maid to ui list

* Specify licence
2024-02-20 12:00:23 +02:00
Haoxiang Fei
8dbbd75754
metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 11:58:36 +02:00
Pierrick Hymbert
c0a8c6db37
server : health endpoint configurable failure on no slot (#5594) 2024-02-20 09:48:19 +02:00
AidanBeltonS
b9111bd209
Update ggml_sycl_op_mul_mat_vec_q (#5502)
* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-02-20 12:31:25 +05:30
Mathijs de Bruin
633782b8d9 nix: now that we can do so, allow MacOS to build Vulkan binaries
Author:    Philip Taron <philip.taron@gmail.com>
Date:      Tue Feb 13 20:28:02 2024 +0000
2024-02-19 14:49:49 -08:00
0cc4m
22f83f0c38 Enable Vulkan MacOS CI 2024-02-19 14:49:49 -08:00
0cc4m
bb9dcd560a Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 2024-02-19 14:49:49 -08:00
0cc4m
f50db6ae0b Add check for VK_KHR_portability_enumeration for MoltenVK support 2024-02-19 14:49:49 -08:00
Mathijs de Bruin
d8c054517d Add preprocessor checks for Apple devices.
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
2024-02-19 14:49:49 -08:00
Mathijs de Bruin
42f664a382 Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
2024-02-19 14:49:49 -08:00
Mathijs de Bruin
5dde540897 Allow for Vulkan build with Accelerate.
Closes #5304
2024-02-19 14:49:49 -08:00
slaren
40c3a6c1e1
cuda : ignore peer access already enabled errors (#5597)
* cuda : ignore peer access already enabled errors

* fix hip
2024-02-19 23:40:26 +01:00
pudepiedj
d261e7f8f8
Merge branch 'ggerganov:master' into server_branch 2024-02-19 22:14:25 +00:00
Jared Van Bortel
f24ed14ee0
make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598) 2024-02-19 15:54:12 -05:00
pudepiedj
69cb1ef0b1
Merge branch 'ggerganov:master' into server_branch 2024-02-19 16:38:20 +00:00
pudepiedj
ea0e8ac758 Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch 2024-02-19 16:37:02 +00:00
pudepiedj
efe38c636f server changes 2024-02-19 16:36:59 +00:00
nopperl
9d679f0fcc
examples : support minItems/maxItems in JSON grammar converter (#5039)
* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-19 16:14:07 +02:00
Georgi Gerganov
1387cf60f7
llava : remove extra cont (#5587) 2024-02-19 15:23:17 +02:00
slaren
6fd413791a llava : replace ggml_cpy with ggml_cont 2024-02-19 15:09:43 +02:00
Georgi Gerganov
337c9cbd52 sync : ggml
ggml-ci
2024-02-19 15:09:43 +02:00
Georgi Gerganov
a3145bdc30 ggml-alloc : apply ggml/731 2024-02-19 15:09:43 +02:00
Didzis Gosko
890559ab28 metal : option to embed MSL source into compiled binary (whisper/1842)
* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on-fly during build process
2024-02-19 15:09:43 +02:00
pudepiedj
491e11b283
Merge branch 'ggerganov:master' into server_branch 2024-02-19 12:48:37 +00:00
Georgi Gerganov
d0e3ce51f4
ci : enable -Werror for CUDA builds (#5579)
* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci
2024-02-19 14:45:41 +02:00
pudepiedj
aaffb2387f
Merge branch 'ggerganov:master' into server_branch 2024-02-19 12:44:51 +00:00
pudepiedj
8a4d202957 minor changes 2024-02-19 12:41:27 +00:00