Commit graph

2606 commits

Author SHA1 Message Date
slaren
86f3666ab4 cuda : fix warning 2024-04-03 00:46:56 +02:00
slaren
31adc93486 llama : more loader cleanup, better error checking 2024-04-03 00:46:15 +02:00
slaren
fe62909618 metal : add support for non-pow-2 argsort 2024-04-02 20:36:42 +02:00
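The argsort commit above lifts a power-of-two length restriction. A common way to handle arbitrary lengths in a bitonic-style sort is to pad the row with sentinel keys up to the next power of two and discard the sentinel indices afterwards. The sketch below illustrates only that padding idea in plain Python; it is not the actual Metal kernel, and the function names are invented for illustration:

```python
# Illustrative sketch of the pad-to-pow2 idea behind non-pow-2 argsort.
# Not the actual Metal/CUDA kernel: a real bitonic sort runs on the GPU,
# here plain sorted() stands in for it.

def next_pow2(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p <<= 1
    return p

def argsort_non_pow2(row):
    n = len(row)
    # Pad with +inf sentinels so they sort to the end.
    padded = list(row) + [float("inf")] * (next_pow2(n) - n)
    order = sorted(range(len(padded)), key=lambda i: padded[i])
    # Drop the sentinel indices to recover the argsort of the original row.
    return [i for i in order if i < n]
```

The same trick works for any sorting network that requires a power-of-two input size.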
Georgi Gerganov
c704c778f6 convert : fix grok tensor names 2024-04-02 21:35:13 +03:00
slaren
f421b32d5a cuda/argsort : use shared memory instead of pool memory 2024-04-02 20:09:25 +02:00
slaren
9530398013 make linter happy 2024-04-02 18:21:45 +02:00
slaren
d08a1f4860 convert-hf-to-gguf.py : update grok (untested) 2024-04-02 18:19:37 +02:00
slaren
f27cbf3610 fix quantizing of merged experts 2024-04-02 17:07:14 +02:00
slaren
68d21debe4 gguf : bump version 2024-04-02 16:38:05 +02:00
slaren
6f33852f3d minor 2024-04-02 16:08:55 +02:00
slaren
6875369909 llama : add merged experts tensors to the grok tensor map 2024-04-02 16:08:45 +02:00
slaren
5de4a5da07 update grok model loading 2024-04-02 03:08:04 +02:00
slaren
8f84ca3cd9 test-backend-ops : test qwen argsort 2024-04-02 02:07:22 +02:00
slaren
b4a62062db update imatrix 2024-04-02 02:05:38 +02:00
slaren
deea2007b4 cleanup + disable mmap automatically with split tensors models 2024-04-02 01:55:22 +02:00
slaren
6886fdb887 allow quantize to work for split and merged experts models in the same way 2024-04-02 01:35:19 +02:00
slaren
4531b029ee cuda : support non-pow-2 number of experts 2024-04-02 01:11:59 +02:00
slaren
8c2f7b8169
Update convert-hf-to-gguf.py
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-31 19:52:46 +02:00
slaren
3b3298af17 update convert.py for mixtral hf models 2024-03-31 01:35:10 +01:00
slaren
4a5d50eb61 update convert-hf-to-gguf.py 2024-03-31 01:24:05 +01:00
slaren
6203d72651 update convert.py 2024-03-30 23:51:21 +01:00
slaren
2abb6c7225
Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-30 11:42:28 +01:00
slaren
26c09adce6 fix cuda 2024-03-30 00:44:16 +01:00
slaren
325e5efa0d update test-backend-ops 2024-03-29 23:48:10 +01:00
slaren
93db37e274 update metal 2024-03-29 22:22:52 +01:00
slaren
2479900a1c minor 2024-03-29 20:41:27 +01:00
slaren
9c9fe60f53 update cuda 2024-03-29 20:06:00 +01:00
slaren
0c7e21d7b2 ggml : update mul_mat_id to use the same tensor for all the experts 2024-03-29 19:10:20 +01:00
0cc4m
ba0c7c70ab
Vulkan k-quant mmq and ggml-backend offload functionality (#6155)
* Fix Vulkan no kv offload incoherence

* Add k-quant mul mat mat shaders

* Rework working buffer allocation, reduces vram use noticeably

Clean up cpu assist code, replaced with ggml-backend offload function

* Default to all dedicated GPUs

* Add fallback for integrated GPUs if no dedicated GPUs are found

* Add debug info which device is allocating memory

* Fix Intel dequant issue

Fix validation issue

* Fix Vulkan GGML_OP_GET_ROWS implementation

* Clean up merge artifacts

* Remove Vulkan warning
2024-03-29 17:29:21 +01:00
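The device-selection policy described in the Vulkan PR above (use all dedicated GPUs by default, fall back to integrated GPUs only when no dedicated GPU is found) can be sketched as follows. This is a minimal illustration, not the backend's actual Vulkan enumeration code; the dict keys are assumptions for the example:

```python
# Hypothetical sketch of the "default to dedicated GPUs, fall back to
# integrated" policy from the PR above. Real code would query
# VkPhysicalDeviceProperties.deviceType instead of these dicts.

def select_devices(devices):
    dedicated = [d for d in devices if d["type"] == "discrete"]
    if dedicated:
        return dedicated  # prefer every dedicated GPU found
    # No dedicated GPU present: fall back to integrated GPUs.
    return [d for d in devices if d["type"] == "integrated"]
```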
Georgi Gerganov
d48ccf3ad4
sync : ggml (#6351)
* sync : ggml

ggml-ci

* cuda : move GGML_CUDA_DMMV constants to dmmv.cuh

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-29 17:45:46 +02:00
hxer7963
069574775c
[Model] Add support for xverse (#6301)
* Support xverse model convert to gguf format.

* 1. Convert xverse models to gguf;
2. Add LLM_ARCH_XVERSE inference in llama.cpp;
3. Add xverse item in Supported models in README.md;

* gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter

* llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers.

* Fix format issues
* Remove duplicate set kqv_out to llm_build_kv

* Update llama.cpp

---------

Co-authored-by: willhe <willhe@xverse.cn>
Co-authored-by: willhe <hexin@xverse.cn>
2024-03-29 14:37:03 +01:00
Georgi Gerganov
cfde806eb9
ci : fix BGE wget (#6383)
ggml-ci
2024-03-29 14:34:28 +02:00
zhouwg
b910287954
readme : add project (#6356)
* readme: add Android UI binding

* Update README.md
2024-03-29 09:33:46 +02:00
Matt Clayton
8093987090
cmake : add explicit metal version options (#6370)
* cmake: add explicit metal version options

* Update CMakeLists.txt

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-29 09:27:42 +02:00
Daniel Bevenius
057400a3fd
llama : remove redundant reshape in build_kv_store (#6369)
* llama: remove redundant reshape in build_kv_store

This commit removes the reshape of the V matrix in the build_kv_store.

The motivation for this is that V matrix has the shape:
```console
(gdb) p *v_cur
$46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU,
       buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608,
       8388608}, op = GGML_OP_MUL_MAT, op_params = {
       0 <repeats 16 times>}, flags = 0, grad = 0x0,
       src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
       0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
       view_src = 0x0, view_offs = 0, data = 0x0,
       name = "Vcur-0", '\000' <repeats 57 times>, extra = 0x0,
       padding = "\000\000\000\000\000\000\000"}
```
And after reshaping this tensor we get:
```console
(gdb) p *ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens)
$44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU,
       buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608,
       8388608}, op = GGML_OP_RESHAPE, op_params = {
       0 <repeats 16 times>}, flags = 0, grad = 0x0,
       src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
       0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
       view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0,
       name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0,
       padding = "\000\000\000\000\000\000\000"}
```
I noticed that the `src` and `view_src` fields differ, but the dimensions
(`ne`) and strides (`nb`) are identical. Together with the code comment,
this suggests the reshape call is not needed, which motivates removing it.
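The shape bookkeeping argued here can be illustrated with a small sketch: when a tensor already has shape `(n_embd_v_gqa, n_tokens)`, a 2-D reshape to those same dimensions leaves `ne` unchanged, so the call only adds a view node. This is a plain-Python analogy mimicking ggml's 4-slot `ne` array, not the actual `ggml_reshape_2d` implementation:

```python
# Analogy for ggml's shape bookkeeping: ne holds up to 4 dimensions,
# and a 2-D reshape must preserve the total element count.

def reshape_2d(ne, d0, d1):
    """Mimic ggml_reshape_2d's shape check: element counts must match."""
    assert ne[0] * ne[1] * ne[2] * ne[3] == d0 * d1, "element count mismatch"
    return (d0, d1, 1, 1)

n_embd_v_gqa, n_tokens = 4096, 512
ne_before = (4096, 512, 1, 1)  # v_cur->ne from the gdb dump above
ne_after = reshape_2d(ne_before, n_embd_v_gqa, n_tokens)
# ne_after equals ne_before, i.e. the reshape changes nothing but the graph.
```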

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama : add assert

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-29 09:23:22 +02:00
Pedro Cuenca
b75c38166c
convert : allow conversion of Mistral HF models (#6144)
* Allow conversion of Mistral HF models

* Homogenize Llama, Mistral, Mixtral under the same entry.

* Fix tokenizer, permute tensors

* Use sentencepiece tokenizer, or fall back to hfft.

* convert-hf : small fix for mypy

* convert-hf : fix duplicated block_count

* convert-hf : add vocab size to metadata

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-03-29 09:15:00 +02:00
Georgi Gerganov
bfe7dafc9c
readme : add notice for UI list 2024-03-28 22:56:03 +02:00
Ouadie EL FAROUKI
5106ef482c
[SYCL] Revisited & updated SYCL build documentation (#6141)
* Revisited & updated SYCL build documentation

* removed outdated comment

* Addressed PR comments

* Trimmed white spaces

* added new end line
2024-03-28 16:01:47 +00:00
Jared Van Bortel
be55134a53
convert : refactor vocab selection logic (#6355) 2024-03-28 11:44:36 -04:00
Ziang Wu
66ba560256
llava : fix MobileVLM (#6364)
* fix empty bug

* Update MobileVLM-README.md

added more results on devices

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update examples/llava/MobileVLM-README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update MobileVLM-README.md

remove gguf links

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-28 16:33:10 +02:00
compilade
0308f5e3d7
llama : fix command-r inference when omitting outputs (#6367) 2024-03-28 14:05:54 +02:00
Pierrick Hymbert
28cb9a09c4
ci: bench: fix master not schedule, fix commit status failed on external repo (#6365) 2024-03-28 11:27:56 +01:00
Ting Sun
cfc4d75df6
doc: fix outdated default value of batch size (#6336)
* doc: fix outdated default value of batch size

* doc: add doc for ubatch-size
2024-03-28 09:51:06 +01:00
Eric Zhang
6902cb7f2e
server : stop gracefully on SIGTERM (#6348) 2024-03-28 09:50:48 +01:00
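Graceful shutdown on SIGTERM, as the commit above adds to the server, generally means installing a signal handler that releases resources before exiting instead of letting the default action kill the process. A minimal Python sketch of the pattern (the server itself is C++; this is only an illustration of the idea):

```python
import signal
import sys

# Hypothetical sketch of graceful SIGTERM handling, not the server's code:
# on SIGTERM, run cleanup and exit with status 0 instead of being killed.

def handle_sigterm(signum, frame):
    # Real code would stop accepting requests, finish in-flight work,
    # and free resources here before exiting.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```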
hutli
d2d8f38996 nix: removed unnecessary indentation 2024-03-28 07:48:27 +00:00
hutli
d39b308eaf nix: moved blas availability check to package inputs so it is still overridable 2024-03-28 07:48:27 +00:00
hutli
c873976649 using blas.meta.available to check host platform 2024-03-28 07:48:27 +00:00
hutli
dbb03e2b9c only using explicit blas if hostPlatform is allowed 2024-03-28 07:48:27 +00:00
Someone Serge
e9f17dc3bf nix: .#windows: proper cross-compilation set-up
Take all dependencies from the cross stage, rather than only stdenv
2024-03-28 07:48:27 +00:00
Someone Serge
22a462cc1f nix: package: don't introduce the dependency on python
- The generic /usr/bin/env shebangs are good enough
- Python deps are provisioned in the devShells
- We need to be able to leave python out at least on windows (currently breaks eval)
2024-03-28 07:48:27 +00:00