SOTA 3-bit quants (#5196)

* iq3_xxs: quantize/dequantize

RMSE seems somewhat high, roughly halfway between q2_K and
q3_K, so this needs more checking.
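
For reference, a minimal sketch of the RMSE figure discussed above: the root-mean-square difference between the original weights and the values recovered after a quantize/dequantize round trip. The helper name `rmse` is hypothetical and not part of ggml; it assumes both arrays have already been produced by whatever quantization routine is under test.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from ggml): RMSE between the original weights
// and the values restored by a quantize -> dequantize round trip.
double rmse(const std::vector<float> & original, const std::vector<float> & restored) {
    assert(original.size() == restored.size() && !original.empty());
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        // Accumulate in double to avoid float round-off on long tensors.
        const double diff = double(original[i]) - double(restored[i]);
        sum_sq += diff * diff;
    }
    return std::sqrt(sum_sq / double(original.size()));
}
```

Comparing this number across quantization types on the same tensor gives the kind of ordering mentioned above (q2_K versus q3_K).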

* iq3_xxs: CUDA dequantize works

* iq2_xxs: tuning quantization

* iq3_xxs: starting to look better

PPL on wiki.test.raw
LLaMA-v1-7B: 6.4218
LLaMA-v2-7B: 6.3560
Mistral-7B : 6.0717

This is better than Q3_K_XS, with a 5% reduction in quantized model
size.
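
The PPL values above are perplexities. As a rough sketch, perplexity is the exponential of the mean negative log-likelihood per token. The function below assumes the per-token natural-log probabilities are already available; how the evaluation text is windowed and which tokens are scored is not described in this commit, so that part is left out.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative only: perplexity = exp(-(1/N) * sum(log p_i)), where
// token_log_probs holds the natural-log probability the model assigned
// to each evaluated token.
double perplexity(const std::vector<double> & token_log_probs) {
    assert(!token_log_probs.empty());
    double nll = 0.0;
    for (const double lp : token_log_probs) {
        nll -= lp;
    }
    return std::exp(nll / double(token_log_probs.size()));
}
```

Lower is better: a model that assigns probability 1/4 to every token has a perplexity of 4.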

* iq3_xxs: CUDA dot product

We have
PP-512: 5891 t/s
TG-128: 143.9 t/s

* iq3_xxs: scalar and AVX2 dot products

* iq3_xxs: ARM_NEON and Metal

Metal performance is decent; ARM_NEON is pathetic.

* iq3_xxs: slightly better grid points

* Faster iq3_xxs and iq2_xs dot products on CUDA

* iq3_xxs: add some quant mix

* iq3_xxs: fix failing quantization test

Dot product still fails. Is this real?

* iq3_xxs: hopefully fix ROCm

* iq3_xxs: failing tests

This time the dot product accuracy test did find an actual bug
in the AVX2 implementation.
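
A minimal sketch of how such an accuracy check can catch a bug in a SIMD path: compare the result of the implementation under test against a double-precision scalar reference and fail when the error is too large. The names `dot_reference` and `dot_product_matches`, the per-element normalization, and the threshold are all assumptions for illustration; they are not the repository's actual test code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference dot product, accumulated in double precision.
double dot_reference(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += double(a[i]) * double(b[i]);
    }
    return sum;
}

// Hypothetical check: the absolute error of the candidate result, divided
// by the vector length, must not exceed max_error (an illustrative bound).
bool dot_product_matches(double candidate,
                         const std::vector<float> & a,
                         const std::vector<float> & b,
                         double max_error) {
    assert(!a.empty());
    const double err = std::fabs(candidate - dot_reference(a, b)) / double(a.size());
    return err <= max_error;
}
```

A vectorized kernel that mishandles some lanes typically produces an error far above any reasonable bound, which is what makes this kind of test useful.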

* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Kawrakow, 2024-01-30 15:14:12 +02:00, committed by GitHub

parent 2256f36b79
commit f4d7e54974

GPG key ID: B5690EEEBB952194 (no known key found for this signature in database)

14 changed files with 1215 additions and 18 deletions

									
tests/test-backend-ops.cpp
									
@@ -1890,6 +1890,7 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
         GGML_TYPE_Q4_K, GGML_TYPE_Q5_K,
         GGML_TYPE_Q6_K,
         GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS,
+        GGML_TYPE_IQ3_XXS,
     };
 
     // unary ops
