ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)

* iq2_xs: basics

* iq2_xs: this should have been in the basics

* iq2_xs: CUDA and scalar CPU works

* iq2_xs: WIP Metal

* iq2_xs: Metal now works

* iq2_xs: working, but dog slow, ARM_NEON dot product

* iq2_xs: better ARM_NEON dot product

We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.

* iq2_xs: AVX2 dot product - 19.5 t/s

* iq2_xs: faster AVX2 dit product

21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.

* iq2_xs: had forgotten to delete iq2-data.h

* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is contained in:
Kawrakow 2024-01-11 20:39:39 +01:00 committed by GitHub
parent 3ba5b8ca8e
commit 49662cbed3
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
10 changed files with 1038 additions and 28 deletions

View file

@ -134,8 +134,9 @@ int main(int argc, char * argv[]) {
continue;
}
if ((ggml_type)i == GGML_TYPE_IQ2_XXS) {
printf("Skip %s due to missing quantization functionality\n", ggml_type_name((ggml_type) i));
const ggml_type ei = (ggml_type)i;
if (ei == GGML_TYPE_IQ2_XXS || ei == GGML_TYPE_IQ2_XS) {
printf("Skip %s due to missing quantization functionality\n", ggml_type_name(ei));
continue;
}