Commit graph

1960 commits

Author SHA1 Message Date
Jared Van Bortel
df687b10ab kompute : support mask parameter of softmax 2024-01-24 16:51:27 -05:00
Jared Van Bortel
8bd38fe32d test-backend-ops : test mask parameter of ggml_soft_max_ext 2024-01-24 16:28:41 -05:00
Jared Van Bortel
308f279622 kompute : support scale parameter of softmax 2024-01-24 16:17:37 -05:00
Jared Van Bortel
1450966071 test-backend-ops : test scale parameter of ggml_soft_max_ext 2024-01-24 16:17:37 -05:00
Jared Van Bortel
2852902eda test-backend-ops : add llama test 2024-01-24 16:17:29 -05:00
Jared Van Bortel
2b0f642fec fix f16 mmv, 49 -> 41 failures 2024-01-24 13:43:49 -05:00
Jared Van Bortel
1a14099c43 fix q4_0/q4_1 mmv, 65 -> 49 failures 2024-01-24 13:43:48 -05:00
Jared Van Bortel
0787b80db8 kompute : remove broken mulrow kernel -> 1 less test failure 2024-01-24 13:43:48 -05:00
Jared Van Bortel
2755ae3d10 kompute : fix more dispatch ambiguity -> 12 less failures 2024-01-24 13:43:47 -05:00
Jared Van Bortel
08e23fd78c kompute : fix op_mul kernel -> 13 less test failures 2024-01-24 13:43:47 -05:00
Jared Van Bortel
0899adf86e kompute : fix get_rows dispatch -> 4 less failures 2024-01-24 13:43:47 -05:00
Jared Van Bortel
cb9ceff966 minor cleanup 2024-01-24 13:43:46 -05:00
Georgi Gerganov
33e8d6abe1 kompute : fix ggml_add kernel (#5027) 2024-01-24 13:43:46 -05:00
Jared Van Bortel
2f6a279e29 fix supported ops for kompute backend 2024-01-24 13:43:45 -05:00
Jared Van Bortel
07530731ba never try to evaluate an empty command buffer
This fixes the immediate crashes with test-backend-ops - when
evaluatating individual no-ops like OP_VIEW, it tries to submit an empty
command buffer, which crashes RADV and hangs AMDVLK.
2024-01-24 13:43:45 -05:00
Jared Van Bortel
729e1a4cc1 sync op_rope_f16 with recent op_rope_f32 changes 2024-01-24 13:43:45 -05:00
Jared Van Bortel
e9d5223da3 actually fix this assertion 2024-01-24 13:43:44 -05:00
Jared Van Bortel
9431026a84 clean up old backend code 2024-01-24 13:43:44 -05:00
Georgi Gerganov
d6bd471693 kompute : fix rope_f32 and scale ops (#5008) 2024-01-24 13:43:44 -05:00
Jared Van Bortel
76474a7c0d kompute : ignore exceptions in ggml_vk_available_devices (#12)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
2024-01-24 13:43:43 -05:00
Jared Van Bortel
cad72e1252 add sanity check and fix kompute teardown order 2024-01-24 13:43:43 -05:00
Jared Van Bortel
070919dbf7 attempt to get test-backend-ops working 2024-01-24 13:43:43 -05:00
Jared Van Bortel
5f660dada8 fix assertion failure 2024-01-24 13:43:42 -05:00
Jared Van Bortel
298d6eec09 kompute : initial attempt at ggml-backend v2 support 2024-01-24 13:43:40 -05:00
Jared Van Bortel
7c527eb568 Merge commit 'e7e4df031b' into HEAD 2024-01-24 13:39:17 -05:00
slaren
e7e4df031b
llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
Georgi Gerganov
584d674be6
llama : remove redundant assert for StableLM (#4901) 2024-01-12 20:54:12 +02:00
Daniel Bevenius
930f907d3e
export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12 19:54:53 +02:00
Zay
e790eef21c
llama.swiftui : update models layout (#4826)
* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout
2024-01-12 14:48:00 +02:00
Georgi Gerganov
5537d9d36b
gitignore : imatrix 2024-01-12 14:33:21 +02:00
Johannes Gäßler
1b280c9fff
CUDA: fix softmax compile for old CUDA versions (#4862) 2024-01-12 12:30:41 +01:00
Georgi Gerganov
3cabe80630
llama : fix typo "imp_embd" -> "inp_embd" 2024-01-12 13:11:15 +02:00
howlger
4315a94366
common : streamline the formatting of help (#4890)
* common : streamline the formatting of help

- Separate alternative parameters by a comma

- Do not indent `--version` differently

* Update common/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 13:05:32 +02:00
Georgi Gerganov
2d00741e12
py : fix lint (#4889) 2024-01-12 13:03:38 +02:00
Georgi Gerganov
f445c0e68c
llama : fix llm_build_k_shift to use correct n_rot (#4889)
* llama : fix llm_build_k_shift to use correct n_rot

ggml-ci

* llama : always use hparams.n_rot for ggml_rope_custom

ggml-ci

* convert : fix persimmon conversion to write correct n_rot
2024-01-12 13:01:56 +02:00
Kawrakow
326b418b59
Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 06:59:57 +01:00
Georgi Gerganov
1d118386fe
server : fix infill when prompt is empty (#4833) 2024-01-11 23:23:49 +02:00
Georgi Gerganov
7edefbd79c
main : better name for variable n_print (#4874) 2024-01-11 22:46:26 +02:00
Georgi Gerganov
3ca63b4538
main : disable token count by default (#4874) 2024-01-11 22:43:05 +02:00
Georgi Gerganov
b037787548
swift : track ggml release branch (#4867) 2024-01-11 21:58:28 +02:00
Kawrakow
469e75d0a3
llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00
Kawrakow
49662cbed3
ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)
* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-11 21:39:39 +02:00
Georgi Gerganov
3ba5b8ca8e
swift : pin ggml commit + remove ggml.h from spm-headers (#4878)
ggml-ci
2024-01-11 21:31:31 +02:00
Laura
4330bd83fe
server : implement credentialed CORS (#4514)
* Implement credentialed CORS according to MDN

* Fix syntax error

* Move validate_api_key up so it is defined before its first usage
2024-01-11 20:02:48 +02:00
Michael Coppola
27379455c3
server : support for multiple api keys (#4864)
* server: added support for multiple api keys, added loading api keys from file

* minor: fix whitespace

* added file error handling to --api-key-file, changed code to better
reflect current style

* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11 19:51:17 +02:00
Behnam M
eab6795006
server : add LOG_INFO when model is successfully loaded (#4881)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too

* used LOG_INFO after successful model loading
2024-01-11 19:41:39 +02:00
Someone
d8d90aa343
ci: nix-flake-update: new token with pr permissions (#4879)
* ci: nix-flake-update: new token with pr permissions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 17:22:34 +00:00
pudepiedj
43f76bf1c3
main : print total token count and tokens consumed so far (#4874)
* Token count changes

* Add show token count

* Updating before PR

* Two requested changes

* Move param def posn
2024-01-11 18:14:52 +02:00
Isaac McFadyen
2f043328e3
server : fix typo in model name (#4876) 2024-01-11 16:33:26 +02:00
Paul Tsochantaris
2a7c94db5f
metal : put encoder debug group behind a define (#4873) 2024-01-11 16:31:52 +02:00