SOTA 2-bit quants (#4773)

* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change this later; for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slightly faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in dequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion
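
Two of the notes above can be made concrete with a small sketch. It is illustrative only and is not the ggml code: `QuantizedRow`, `quantize_symmetric_127` and `iq2_xxs_group_scale` are hypothetical names. The first function shows symmetric 8-bit quantization restricted to -127...127, so that every quant can be negated without overflowing int8 (a plausible reason the AVX path wants this range, though the notes do not state it). The second shows where a 0.25f factor would enter a per-group scale, assuming a 4-bit scale stored in the top bits of a 32-bit word; that layout is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one scale plus int8 quants for a row of floats.
struct QuantizedRow {
    float scale = 0.0f;
    std::vector<int8_t> q;
};

// Symmetric quantization to [-127, 127]. The value -128 is never produced,
// so -q[i] is always representable as int8.
QuantizedRow quantize_symmetric_127(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));

    QuantizedRow out;
    out.q.assign(x.size(), 0);
    if (amax == 0.0f) return out;          // all-zero row: scale stays 0

    out.scale = amax / 127.0f;
    const float inv = 127.0f / amax;
    for (std::size_t i = 0; i < x.size(); ++i) {
        long v = std::lround(x[i] * inv);
        v = std::max(-127L, std::min(127L, v));   // guard against rounding drift
        out.q[i] = static_cast<int8_t>(v);
    }
    return out;
}

// Assumed layout: the 4-bit group scale sits in the top 4 bits of `aux`.
// Leaving out the trailing 0.25f makes every dequantized value 4x too large.
float iq2_xxs_group_scale(float d, uint32_t aux) {
    const uint32_t s = aux >> 28;
    return d * (0.5f + static_cast<float>(s)) * 0.25f;
}
```

For example, quantizing {-1.0f, 0.0f, 0.5f, 1.0f} with the first function gives quants -127, 0, 64, 127 and a scale of 1/127.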

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Author: Kawrakow
Date: 2024-01-08 16:02:32 +01:00
Committed by: GitHub
Parent: 668b31fc7d
Commit: dd5ae06405
GPG key ID: 4AEE18F83AFDEB23 (no known key found for this signature in database)

10 changed files with 902 additions and 1 deletions

									
tests/test-quantize-fns.cpp
									
@@ -134,6 +134,11 @@ int main(int argc, char * argv[]) {
            continue;
        }
        if ((ggml_type)i == GGML_TYPE_IQ2_XXS) {
            printf("Skip %s due to missing quantization functionality\n", ggml_type_name((ggml_type) i));
            continue;
        }
        printf("Testing %s\n", ggml_type_name((ggml_type) i));
        if (qfns.from_float && qfns.to_float) {
