Commit graph

1677 commits

Author SHA1 Message Date
slaren
0e05e1fec3
Merge e1241d9b46 into 799a1cb13b 2023-12-13 12:04:28 +00:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00
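The commits above ("llama : update graph to support MoE", "llama : fix expert weighting in the FFN") implement Mixtral-style sparse routing: a gating layer scores all experts, only the top-k are evaluated, and their outputs are combined with weights softmaxed over the selected logits only. A toy sketch of that routing for a single token, with hypothetical names and plain linear "experts" standing in for the real SwiGLU FFN blocks:

```python
import math

def moe_ffn(x, gate_w, experts, n_expert_used=2):
    """Toy Mixtral-style MoE FFN for one token.

    x       : activation vector
    gate_w  : one row of gating weights per expert
    experts : one weight matrix per expert (stand-in for a full FFN)
    """
    # router scores: one logit per expert
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in gate_w]
    # indices of the top-k scoring experts
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:n_expert_used]
    # softmax over the *selected* logits only (the "expert weighting" being fixed above)
    exps = [math.exp(logits[e]) for e in top]
    weights = [v / sum(exps) for v in exps]
    # weighted sum of the chosen experts' outputs; unchosen experts are never evaluated
    out = [0.0] * len(x)
    for w, e in zip(weights, top):
        y = [sum(m_i * x_i for m_i, x_i in zip(row, x)) for row in experts[e]]
        out = [o + w * y_i for o, y_i in zip(out, y)]
    return out
```

This is only a conceptual sketch; in the actual graph the routing runs batched over all tokens via `ggml_mul_mat_id` and `ggml_get_rows`, and the gating tensors are kept unquantized (see "llama : do not quantize expert gating tensors").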
Georgi Gerganov
e1241d9b46
metal : switch to execution barriers + fix one of the barriers 2023-12-13 13:56:45 +02:00
Georgi Gerganov
109e7aa8ac
metal : limit kernels to not use more than the allowed threads 2023-12-13 10:57:25 +02:00
Georgi Gerganov
ab558ac2b3
metal : fix soft_max kernels
ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92
2023-12-13 10:57:25 +02:00
Radek Pilar
82e4f64578
convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy
2023-12-12 21:04:10 +02:00
Georgi Gerganov
90c12e6b3c
ggml : do not use BLAS with ggml_mul_mat_id 2023-12-12 20:05:58 +02:00
Georgi Gerganov
ea4402bb0e
test-backend-ops : add one more sum_rows test 2023-12-12 17:03:38 +02:00
Georgi Gerganov
a51bc0c1c0
metal : fix binary ops for ne10 % 4 != 0 2023-12-12 15:55:42 +02:00
Georgi Gerganov
08eb99179a
metal : add cpy f16 -> f32 kernel 2023-12-12 14:15:08 +02:00
slaren
a742d9f9b7 gguf-py : bump version 2023-12-12 12:46:33 +01:00
Georgi Gerganov
6a419f4d19
convert : support safetensors format 2023-12-12 13:05:14 +02:00
kalomaze
fecac45658
server : tweak default sampling parameters (#4367)
* Set a more typical Top P setting as the default

* Update temp max
2023-12-12 12:12:35 +02:00
Richard Kiss
9494d7c477
english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
Jared Van Bortel
6138963fb2
build : target Windows 8 for standard mingw-w64 (#4405)
* build : target Windows 8 for standard mingw-w64

* make : fix missing console.o deps

This was causing a link error with `make all` on Windows.
2023-12-12 11:27:26 +02:00
crasm
6391817cd1
llama : document logits_all deprecation (#4418)
llama_context_params.logits_all is a parameter for controlling
llama_eval. This documents that logits_all should not be used with
llama_decode and llama_batch.
2023-12-12 11:25:57 +02:00
Vladimir Zorin
d9d4cfef64
server : fix local model name in server (#4420) 2023-12-12 11:25:29 +02:00
Taikono-Himazin
41a11aaf99
ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424) 2023-12-12 11:24:32 +02:00
slaren
f1cbfabd64 convert : fix style 2023-12-11 20:02:55 +01:00
slaren
7dc75e3923 convert : use 1e6 rope_freq_base for mixtral 2023-12-11 20:00:28 +01:00
slaren
296c945de5 cuda : fix mul_mat_id with multi gpu 2023-12-11 16:53:25 +01:00
slaren
33e50f1b53 test-backend-ops : disable MOE test with thread sanitizer 2023-12-11 12:27:48 +01:00
slaren
ffda94c87f test-backend-ops : simplify and disable slow tests to avoid CI timeout 2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts 2023-12-11 08:55:27 +02:00
slaren
b0029815e4 test-backend-ops : fix dequantize block offset 2023-12-11 02:43:52 +01:00
Yueh-Po Peng
8a7b2fa528
Update README.md (#4388)
Fix small typo.
2023-12-10 23:27:38 +01:00
slaren
f1380d7897 test-backend-ops : add cpy from f32 -> all types test 2023-12-10 22:58:31 +01:00
slaren
54d254bbed test-backend-ops : cleanup, add moe test for batches 2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe) 2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32 2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly 2023-12-10 14:18:14 +02:00
slaren
8614aa736d cuda : fix get_rows when ncols is odd 2023-12-10 13:12:18 +01:00
slaren
cefebb3660 test-backend-ops : add moe test 2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants 2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors 2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types 2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1 2023-12-10 09:38:21 +02:00
slaren
0710b0f726 llama : offload missing ffn_moe_silu 2023-12-09 23:29:47 +01:00
slaren
62b95f93d0 cuda : support non-contiguous src1 in get_rows 2023-12-09 22:39:34 +01:00
slaren
2e4db48291 ggml : update get_rows f16 and q 2023-12-09 22:38:22 +01:00
Xiang (Kevin) Li
e18f7345a3
grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396) 2023-12-09 23:29:27 +02:00
slaren
ac3f7d8e23 ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D 2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id 2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont 2023-12-09 14:25:49 +02:00
slaren
06dfde3e94 llama : add basic support for offloading moe with CUDA 2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests 2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11 2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa ggml : add n_as argument to ggml_mul_mat_id 2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) 2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN 2023-12-09 13:01:42 +02:00