Commit graph

1969 commits

Author SHA1 Message Date
Jared Van Bortel
de9fba0d39 kompute : fix basic f16 get_rows, 28 -> 26 failures 2024-01-25 15:22:26 -05:00
Jared Van Bortel
11b305082b test-backend-ops : restore softmax tests 2024-01-25 15:05:55 -05:00
Jared Van Bortel
38d1f0c7a0 kompute : fix op_gelu -> Falcon is working on AMDVLK 2024-01-25 15:01:46 -05:00
Jared Van Bortel
6fc99a6e66 test-backend-ops : test larger GELU range 2024-01-25 15:01:46 -05:00
Jared Van Bortel
1849b85473 test-backend-ops : add Falcon test 2024-01-25 13:55:49 -05:00
Jared Van Bortel
f5ac635473 kompute : fix q8_0 mmv, 41 -> 28 failures 2024-01-25 11:27:11 -05:00
Jared Van Bortel
987335ea0a kompute : fix algorithm names 2024-01-25 11:09:18 -05:00
Jared Van Bortel
ec68a9657f test-backend-ops : increase max_nmse_err so Llama passes 2024-01-24 17:31:34 -05:00
Jared Van Bortel
ebb5f7e968 test-backend-ops : test llama with different batch sizes 2024-01-24 16:55:44 -05:00
Jared Van Bortel
df687b10ab kompute : support mask parameter of softmax 2024-01-24 16:51:27 -05:00
Jared Van Bortel
8bd38fe32d test-backend-ops : test mask parameter of ggml_soft_max_ext 2024-01-24 16:28:41 -05:00
Jared Van Bortel
308f279622 kompute : support scale parameter of softmax 2024-01-24 16:17:37 -05:00
Jared Van Bortel
1450966071 test-backend-ops : test scale parameter of ggml_soft_max_ext 2024-01-24 16:17:37 -05:00
Jared Van Bortel
2852902eda test-backend-ops : add llama test 2024-01-24 16:17:29 -05:00
Jared Van Bortel
2b0f642fec fix f16 mmv, 49 -> 41 failures 2024-01-24 13:43:49 -05:00
Jared Van Bortel
1a14099c43 fix q4_0/q4_1 mmv, 65 -> 49 failures 2024-01-24 13:43:48 -05:00
Jared Van Bortel
0787b80db8 kompute : remove broken mulrow kernel -> 1 fewer test failure 2024-01-24 13:43:48 -05:00
Jared Van Bortel
2755ae3d10 kompute : fix more dispatch ambiguity -> 12 fewer failures 2024-01-24 13:43:47 -05:00
Jared Van Bortel
08e23fd78c kompute : fix op_mul kernel -> 13 fewer test failures 2024-01-24 13:43:47 -05:00
Jared Van Bortel
0899adf86e kompute : fix get_rows dispatch -> 4 fewer failures 2024-01-24 13:43:47 -05:00
Jared Van Bortel
cb9ceff966 minor cleanup 2024-01-24 13:43:46 -05:00
Georgi Gerganov
33e8d6abe1 kompute : fix ggml_add kernel (#5027) 2024-01-24 13:43:46 -05:00
Jared Van Bortel
2f6a279e29 fix supported ops for kompute backend 2024-01-24 13:43:45 -05:00
Jared Van Bortel
07530731ba never try to evaluate an empty command buffer
This fixes the immediate crashes with test-backend-ops - when
evaluating individual no-ops like OP_VIEW, it tries to submit an empty
command buffer, which crashes RADV and hangs AMDVLK.
2024-01-24 13:43:45 -05:00
Jared Van Bortel
729e1a4cc1 sync op_rope_f16 with recent op_rope_f32 changes 2024-01-24 13:43:45 -05:00
Jared Van Bortel
e9d5223da3 actually fix this assertion 2024-01-24 13:43:44 -05:00
Jared Van Bortel
9431026a84 clean up old backend code 2024-01-24 13:43:44 -05:00
Georgi Gerganov
d6bd471693 kompute : fix rope_f32 and scale ops (#5008) 2024-01-24 13:43:44 -05:00
Jared Van Bortel
76474a7c0d kompute : ignore exceptions in ggml_vk_available_devices (#12)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
2024-01-24 13:43:43 -05:00
Jared Van Bortel
cad72e1252 add sanity check and fix kompute teardown order 2024-01-24 13:43:43 -05:00
Jared Van Bortel
070919dbf7 attempt to get test-backend-ops working 2024-01-24 13:43:43 -05:00
Jared Van Bortel
5f660dada8 fix assertion failure 2024-01-24 13:43:42 -05:00
Jared Van Bortel
298d6eec09 kompute : initial attempt at ggml-backend v2 support 2024-01-24 13:43:40 -05:00
Jared Van Bortel
7c527eb568 Merge commit 'e7e4df031b' into HEAD 2024-01-24 13:39:17 -05:00
slaren
e7e4df031b llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
Georgi Gerganov
584d674be6 llama : remove redundant assert for StableLM (#4901) 2024-01-12 20:54:12 +02:00
Daniel Bevenius
930f907d3e export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-12 19:54:53 +02:00
Zay
e790eef21c llama.swiftui : update models layout (#4826)
* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout
2024-01-12 14:48:00 +02:00
Georgi Gerganov
5537d9d36b gitignore : imatrix 2024-01-12 14:33:21 +02:00
Johannes Gäßler
1b280c9fff CUDA: fix softmax compile for old CUDA versions (#4862) 2024-01-12 12:30:41 +01:00
Georgi Gerganov
3cabe80630 llama : fix typo "imp_embd" -> "inp_embd" 2024-01-12 13:11:15 +02:00
howlger
4315a94366 common : streamline the formatting of help (#4890)
* common : streamline the formatting of help

- Separate alternative parameters by a comma

- Do not indent `--version` differently

* Update common/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 13:05:32 +02:00
Georgi Gerganov
2d00741e12 py : fix lint (#4889) 2024-01-12 13:03:38 +02:00
Georgi Gerganov
f445c0e68c llama : fix llm_build_k_shift to use correct n_rot (#4889)
* llama : fix llm_build_k_shift to use correct n_rot

ggml-ci

* llama : always use hparams.n_rot for ggml_rope_custom

ggml-ci

* convert : fix persimmon conversion to write correct n_rot
2024-01-12 13:01:56 +02:00
Kawrakow
326b418b59 Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-12 06:59:57 +01:00
Georgi Gerganov
1d118386fe server : fix infill when prompt is empty (#4833) 2024-01-11 23:23:49 +02:00
Georgi Gerganov
7edefbd79c main : better name for variable n_print (#4874) 2024-01-11 22:46:26 +02:00
Georgi Gerganov
3ca63b4538 main : disable token count by default (#4874) 2024-01-11 22:43:05 +02:00
Georgi Gerganov
b037787548 swift : track ggml release branch (#4867) 2024-01-11 21:58:28 +02:00
Kawrakow
469e75d0a3 llama : restore intended k-quants mixes for MoE models (#4872)
* Restore intended k-quants quantization mixes for MoE models

* Update Q2_K_S values in the quantize tool

Still using LLaMA-v1 PPL values in the quant description does not
make much sense today, but let's leave that update for another PR.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-11 21:43:15 +02:00