Commit graph

1677 commits

Author SHA1 Message Date
slaren
0e05e1fec3
Merge e1241d9b46 into 799a1cb13b 2023-12-13 12:04:28 +00:00
slaren
799a1cb13b
llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
2023-12-13 14:04:25 +02:00
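The commits above ("llama : update graph to support MoE", "llama : fix expert weighting in the FFN") implement Mixtral-style sparse routing: a gating layer scores all experts, only the top-k are evaluated, and their outputs are combined with weights softmaxed over the selected logits only. A toy sketch of that routing for a single token, with hypothetical names and plain linear "experts" standing in for the real SwiGLU FFN blocks:

```python
import math

def moe_ffn(x, gate_w, experts, n_expert_used=2):
    """Toy Mixtral-style MoE FFN for one token.

    x       : activation vector
    gate_w  : one row of gating weights per expert
    experts : one weight matrix per expert (stand-in for a full FFN)
    """
    # router scores: one logit per expert
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in gate_w]
    # indices of the top-k scoring experts
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:n_expert_used]
    # softmax over the *selected* logits only (the "expert weighting" being fixed above)
    exps = [math.exp(logits[e]) for e in top]
    weights = [v / sum(exps) for v in exps]
    # weighted sum of the chosen experts' outputs; unchosen experts are never evaluated
    out = [0.0] * len(x)
    for w, e in zip(weights, top):
        y = [sum(m_i * x_i for m_i, x_i in zip(row, x)) for row in experts[e]]
        out = [o + w * y_i for o, y_i in zip(out, y)]
    return out
```

This is only a conceptual sketch; in the actual graph the routing runs batched over all tokens via `ggml_mul_mat_id` and `ggml_get_rows`, and the gating tensors are kept unquantized (see "llama : do not quantize expert gating tensors").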
Georgi Gerganov
e1241d9b46
metal : switch to execution barriers + fix one of the barriers 2023-12-13 13:56:45 +02:00
Georgi Gerganov
109e7aa8ac
metal : limit kernels to not use more than the allowed threads 2023-12-13 10:57:25 +02:00
Georgi Gerganov
ab558ac2b3
metal : fix soft_max kernels
ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92
2023-12-13 10:57:25 +02:00
Radek Pilar
82e4f64578
convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy
2023-12-12 21:04:10 +02:00
Georgi Gerganov
90c12e6b3c
ggml : do not use BLAS with ggml_mul_mat_id 2023-12-12 20:05:58 +02:00
Georgi Gerganov
ea4402bb0e
test-backend-ops : add one more sum_rows test 2023-12-12 17:03:38 +02:00
Georgi Gerganov
a51bc0c1c0
metal : fix binary ops for ne10 % 4 != 0 2023-12-12 15:55:42 +02:00
Georgi Gerganov
08eb99179a
metal : add cpy f16 -> f32 kernel 2023-12-12 14:15:08 +02:00
slaren
a742d9f9b7 gguf-py : bump version 2023-12-12 12:46:33 +01:00
Georgi Gerganov
6a419f4d19
convert : support safetensors format 2023-12-12 13:05:14 +02:00
kalomaze
fecac45658
server : tweak default sampling parameters (#4367)
* Set a more typical Top P setting as the default

* Update temp max
2023-12-12 12:12:35 +02:00
Richard Kiss
9494d7c477
english : use typos to fix comments and logs (#4354) 2023-12-12 11:53:36 +02:00
Jared Van Bortel
6138963fb2
build : target Windows 8 for standard mingw-w64 (#4405)
* build : target Windows 8 for standard mingw-w64

* make : fix missing console.o deps

This was causing a link error with `make all` on Windows.
2023-12-12 11:27:26 +02:00
crasm
6391817cd1
llama : document logits_all deprecation (#4418)
llama_context_params.logits_all is a parameter for controlling
llama_eval. This documents that logits_all should not be used with
llama_decode and llama_batch.
2023-12-12 11:25:57 +02:00
Vladimir Zorin
d9d4cfef64
server : fix local model name in server (#4420) 2023-12-12 11:25:29 +02:00
Taikono-Himazin
41a11aaf99
ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424) 2023-12-12 11:24:32 +02:00
slaren
f1cbfabd64 convert : fix style 2023-12-11 20:02:55 +01:00
slaren
7dc75e3923 convert : use 1e6 rope_freq_base for mixtral 2023-12-11 20:00:28 +01:00
slaren
296c945de5 cuda : fix mul_mat_id with multi gpu 2023-12-11 16:53:25 +01:00
slaren
33e50f1b53 test-backend-ops : disable MOE test with thread sanitizer 2023-12-11 12:27:48 +01:00
slaren
ffda94c87f test-backend-ops : simplify and disable slow tests to avoid CI timeout 2023-12-11 12:15:31 +01:00
Georgi Gerganov
8cbaed1d9a
llama : fix hard-coded number of experts 2023-12-11 08:55:27 +02:00
slaren
b0029815e4 test-backend-ops : fix dequantize block offset 2023-12-11 02:43:52 +01:00
Yueh-Po Peng
8a7b2fa528
Update README.md (#4388)
Fix small typo.
2023-12-10 23:27:38 +01:00
slaren
f1380d7897 test-backend-ops : add cpy from f32 -> all types test 2023-12-10 22:58:31 +01:00
slaren
54d254bbed test-backend-ops : cleanup, add moe test for batches 2023-12-10 21:52:11 +01:00
Georgi Gerganov
54ba263410
test-backend-ops : make experts more evenly probable (test_moe) 2023-12-10 15:28:07 +02:00
Georgi Gerganov
b0b83dd9e2
metal : fix ggml_mul_mat_id for F32 2023-12-10 14:30:38 +02:00
Georgi Gerganov
65923a8ede
convert : determine n_ctx correctly 2023-12-10 14:18:14 +02:00
slaren
8614aa736d cuda : fix get_rows when ncols is odd 2023-12-10 13:12:18 +01:00
slaren
cefebb3660 test-backend-ops : add moe test 2023-12-10 13:12:18 +01:00
Georgi Gerganov
e640cbe055
llama : add n_expert and n_expert_used to hparams + change quants 2023-12-10 13:57:54 +02:00
Georgi Gerganov
d1259b7b35
llama : do not quantize expert gating tensors 2023-12-10 13:00:13 +02:00
Georgi Gerganov
6cfb31f9ea
metal : add indirect mat-vec kernels for all quantization types 2023-12-10 11:48:14 +02:00
Georgi Gerganov
016f9bb55a
metal : fix ggml_get_rows to work with non-cont src1 2023-12-10 09:38:21 +02:00
slaren
0710b0f726 llama : offload missing ffn_moe_silu 2023-12-09 23:29:47 +01:00
slaren
62b95f93d0 cuda : support non-contiguous src1 in get_rows 2023-12-09 22:39:34 +01:00
slaren
2e4db48291 ggml : update get_rows f16 and q 2023-12-09 22:38:22 +01:00
Xiang (Kevin) Li
e18f7345a3
grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396) 2023-12-09 23:29:27 +02:00
slaren
ac3f7d8e23 ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D 2023-12-09 19:20:21 +01:00
Georgi Gerganov
8c5b66eeaa
metal : reduce the kernel launches for ggml_mul_mat_id 2023-12-09 15:30:34 +02:00
Georgi Gerganov
7e2006b0c0
metal : add/mul/div use general kernel when src1 not cont 2023-12-09 14:25:49 +02:00
slaren
06dfde3e94 llama : add basic support for offloading moe with CUDA 2023-12-09 13:21:30 +01:00
Georgi Gerganov
2cbcba829f
metal : add more general support for ggml_get_rows + tests 2023-12-09 14:18:42 +02:00
Georgi Gerganov
9064b1ca05
ggml : fix ggml_get_rows to take into account ne02 / ne11 2023-12-09 14:04:54 +02:00
slaren
ee8fb399aa ggml : add n_as argument to ggml_mul_mat_id 2023-12-09 12:42:25 +01:00
Georgi Gerganov
7372b62271
ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) 2023-12-09 13:19:47 +02:00
Georgi Gerganov
8b185b7030
llama : fix expert weighting in the FFN 2023-12-09 13:01:42 +02:00