Iwan Kawrakow
e6e61e3158
iq3_s: partial fix for QK_K = 64
2024-02-23 16:04:28 +02:00
Iwan Kawrakow
1d47de3258
ROCm again
2024-02-23 14:03:52 +02:00
Iwan Kawrakow
0d6d185e0f
Attempt to fix ROCm
2024-02-23 13:52:33 +02:00
Iwan Kawrakow
303f3f3258
Another attempt to fix the Windows builds
2024-02-23 10:38:02 +02:00
Iwan Kawrakow
436a146f98
Attempt to fix failing tests
2024-02-23 10:16:15 +02:00
Iwan Kawrakow
cd6a0f08be
Move Q3_K_XS mix to 3.25 bpw
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
47cf30b0ee
iq3_s: make tests pass
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
2730225c5f
iq3_xs: rename to iq3_s
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
272c7f7739
Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
b25f99607d
Fix stupid warning
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
4d5feebeb6
iq3_xs: tiny Metal speed improvement
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
87038fe198
iq3_xs: tiny Metal speed improvement
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
1777825550
iq3_xs: make new version work on metal
...
Performance is very similar to Q3_K_S
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
1328331db7
iq3_s: make ARM_NEON work with new version
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
1fef4b8b68
iq3_xs: make scalar and AVX2 work for new version
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
eacff4aa81
iq3_xs: make CUDA work for new version
2024-02-23 08:02:43 +02:00
Iwan Kawrakow
d83fddaa3b
iq3_xs: a 3.4375 bpw variant
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
2ec600b7a4
Adding IQ3_M - IQ3_XS mix with mostly Q4_K
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
38aa7b176f
iq3_xs: working Metal implementation
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
76214ab655
iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
f1255c50c0
iq3_xs: working scalar and AVX2 dot products
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
5be4e7ac4a
Minor improvement via 3 neighbours
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
76aff093b4
Minor PPL improvement via a block scale fudge factor
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
5691fecd06
Resurrecting iq3_xs
...
After all the experimentation, nothing was better than this.
2024-02-23 08:02:42 +02:00
Iwan Kawrakow
10a47fa678
iq4_nl: squash commits for easier rebase
...
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
2024-02-23 08:00:52 +02:00
Jared Van Bortel
15499eb942
mpt : do not duplicate token_embd.weight on disk ( #5670 )
2024-02-22 17:05:23 -05:00
Georgi Gerganov
96633eeca1
gemma : use more bits for the token_embd.weight tensor ( #5650 )
...
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
2024-02-22 23:23:46 +02:00
Georgi Gerganov
847eedbdb2
py : add Gemma conversion from HF models ( #5647 )
...
* py : add gemma conversion from HF models
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update convert-hf-to-gguf.py
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
---------
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-22 23:22:48 +02:00
Georgi Gerganov
7e4f339c40
ggml : always define ggml_fp16_t as uint16_t ( #5666 )
...
* ggml : always define ggml_fp16_t as uint16_t
ggml-ci
* ggml : cont
ggml-ci
* ggml : cont
* ggml : cont
ggml-ci
* ggml : cont
ggml-ci
* cuda : no longer ggml headers last
ggml-ci
* ggml : fix q6_K FP16 -> FP32 conversion
ggml-ci
* ggml : more FP16 -> FP32 conversion fixes
ggml-ci
2024-02-22 23:21:39 +02:00
Georgi Gerganov
334f76fa38
sync : ggml
2024-02-22 23:21:05 +02:00
Georgi Gerganov
efd56b1c21
ggml : 32-bit arm compat (whisper/1891)
...
* ggml : 32-bit arm compat
* ggml : add ggml_vqtbl1q_s8 impl
* ggml : cont
2024-02-22 23:20:50 +02:00
Someone
201294ae17
nix: init singularity and docker images ( #5056 )
...
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.
Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
2024-02-22 11:44:10 -08:00
Georgi Gerganov
5a9e2f60ba
py : minor fixes ( #5668 )
2024-02-22 20:13:25 +02:00
Xuan Son Nguyen
373ee3fbba
Add Gemma chat template ( #5665 )
...
* add gemma chat template
* gemma: only apply system_prompt on non-model message
2024-02-22 19:10:21 +01:00
Someone
4cb4d8b22d
workflows: nix: hardcode cachix ids, build unconditionally ( #5663 )
...
GitHub does not expose environment and repository variables to PRs coming from forks, which means we have effectively been disabling the Nix CI actions for most PRs.
The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
2024-02-22 08:32:09 -08:00
Georgi Gerganov
3a03541ced
minor : fix trailing whitespace ( #5638 )
2024-02-22 13:54:03 +02:00
Georgi Gerganov
56d03d92be
readme : update hot topics
2024-02-22 10:35:54 +02:00
Xuan Son Nguyen
a46f50747b
server : fallback to chatml, add AlphaMonarch chat template ( #5628 )
...
* server: fallback to chatml
* add new chat template
* server: add AlphaMonarch to test chat template
* server: only check model template if there is no custom tmpl
* remove TODO
2024-02-22 10:33:24 +02:00
Alexey Parfenov
c5688c6250
server : clarify some params in the docs ( #5640 )
2024-02-22 10:27:32 +02:00
Dat Quoc Nguyen
4ef245a92a
mpt : add optional bias tensors ( #5638 )
...
Update MPT with optional bias tensors, to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
2024-02-22 10:15:13 +02:00
slaren
973053d8b0
llama : fix loading models with shared tok_embd and output ( #5651 )
...
ggml-ci
2024-02-22 00:42:09 +01:00
Xuan Son Nguyen
7c8bcc11dc
Add docs for llama_chat_apply_template ( #5645 )
...
* add docs for llama_chat_apply_template
* fix typo
2024-02-22 00:31:00 +01:00
slaren
7fe4678b02
llama : fix session save/load with quantized KV ( #5649 )
2024-02-21 22:52:39 +01:00
slaren
ba2135ccae
gemma : allow offloading the output tensor ( #5646 )
2024-02-21 22:18:23 +01:00
Jared Van Bortel
89febfed93
examples : do not assume BOS when shifting context ( #5622 )
2024-02-21 10:33:54 -05:00
Georgi Gerganov
5022cf242d
sync : ggml
2024-02-21 16:52:52 +02:00
Pierrick Hymbert
1ecea255eb
server: health: fix race condition on slots data using tasks queue ( #5634 )
...
* server: health: fix race condition on slots data using tasks queue
* server: health:
* include_slots only if slots_endpoint
* fix compile warning task.target_id not initialized.
2024-02-21 15:47:48 +01:00
Ettore Di Giacinto
a00a35cef9
readme : add LocalAI to the available UIs ( #5629 )
2024-02-21 16:39:10 +02:00
Georgi Gerganov
eccd7a26dd
sync : ggml ( #5633 )
...
* ggml : fix conv_2d batch mode (ggml/737)
Co-authored-by: bssrdf <bssrdf@gmail.com>
* ggml : compute forward no longer pass src tensors (ggml/729)
* sync : ggml
ggml-ci
---------
Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-02-21 16:17:10 +02:00
Georgi Gerganov
c14f72db9c
readme : update hot topics
2024-02-21 15:39:54 +02:00