ggml : add mmla kernels for quantized GEMM (#4966)

* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

Armv8.2-A and later can support MMLA (int8 matrix multiply-accumulate)
instructions, which have higher throughput than DOT. This commit adds an
MMLA kernel for q8_0_q8_0 GEMM. The kernel is enabled when the compiler
defines __ARM_FEATURE_MATMUL_INT8 for the target platform.

On AWS Graviton3 processors, this kernel yielded up to a 1.5x
improvement in prompt evaluation throughput compared to the
default SDOT kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

Armv8.2-A and later can support MMLA (int8 matrix multiply-accumulate)
instructions, which have higher throughput than DOT. This commit adds an
MMLA kernel for q4_0_q8_0 GEMM. The kernel is enabled when the compiler
defines __ARM_FEATURE_MATMUL_INT8 for the target platform.

On AWS Graviton3 processors, this kernel yielded up to a 1.5x
improvement in prompt evaluation throughput compared to the
default SDOT kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

Armv8.2-A and later can support MMLA (int8 matrix multiply-accumulate)
instructions, which have higher throughput than DOT. This commit adds an
MMLA kernel for q4_1_q8_1 GEMM. The kernel is enabled when the compiler
defines __ARM_FEATURE_MATMUL_INT8 for the target platform.

On AWS Graviton3 processors, this kernel yielded up to a 1.5x
improvement in prompt evaluation throughput compared to the
default SDOT kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
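The "higher throughput than DOT" argument can be illustrated with a scalar model of the two instructions. This is a sketch written from the Arm architectural description of the vector forms SDOT Vd.4S, Vn.16B, Vm.16B and SMMLA Vd.4S, Vn.16B, Vm.16B, not code from this commit; the names `sdot_model`, `smmla_model`, `kSdotMacs` and `kSmmlaMacs` are hypothetical. SDOT performs 16 int8 multiply-accumulates per instruction (four lanes of four-element dot products), while SMMLA performs 32 (a 2x2 tile of eight-element dot products). The 2x2 output tile also suggests why an MMLA-based GEMM kernel works on two rows of one operand against two columns of the other at a time.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Multiply-accumulates performed by one instruction of each kind (vector forms).
constexpr int kSdotMacs = 4 * 4;       // 4 int32 lanes x 4 int8 products each
constexpr int kSmmlaMacs = 2 * 2 * 8;  // 2x2 int32 tile x 8 int8 products each

// Scalar model of SDOT Vd.4S, Vn.16B, Vm.16B (hypothetical helper):
// int32 lane l accumulates the dot product of int8 elements 4*l .. 4*l+3.
std::array<int32_t, 4> sdot_model(std::array<int32_t, 4> acc,
                                  const std::array<int8_t, 16>& a,
                                  const std::array<int8_t, 16>& b) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int k = 0; k < 4; ++k) {
            acc[lane] += static_cast<int32_t>(a[4 * lane + k]) *
                         static_cast<int32_t>(b[4 * lane + k]);
        }
    }
    return acc;
}

// Scalar model of SMMLA Vd.4S, Vn.16B, Vm.16B (hypothetical helper):
// a holds a 2x8 int8 matrix (row i = elements 8*i .. 8*i+7), b holds the
// 8x2 operand stored as two rows of 8, and acc is a row-major 2x2 int32 tile:
//   acc[2*i + j] += sum over k of a[8*i + k] * b[8*j + k]
std::array<int32_t, 4> smmla_model(std::array<int32_t, 4> acc,
                                   const std::array<int8_t, 16>& a,
                                   const std::array<int8_t, 16>& b) {
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            for (int k = 0; k < 8; ++k) {
                acc[2 * i + j] += static_cast<int32_t>(a[8 * i + k]) *
                                  static_cast<int32_t>(b[8 * j + k]);
            }
        }
    }
    return acc;
}
```

For example, with rows (1, ..., 1) and (2, ..., 2) in `a` and rows (1, ..., 1) and (3, ..., 3) in `b`, `smmla_model` produces the tile {8, 24, 16, 48}: every pairing of an `a` row with a `b` row, in one step. Twice the MACs per instruction is only an upper bound on the gain; the realized speedup depends on the microarchitecture, as with the up to 1.5x measured on Graviton3.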

Author: snadampal
Committed by: GitHub
Date: 2024-02-11 07:22:33 -06:00
Commit: a07d0fee1f (parent e4640d8fdf)
GPG key ID: B5690EEEBB952194 (no known key found for this signature in database)

10 changed files with 441 additions and 88 deletions

									
File: llama.cpp (1 line added)

@@ -11869,6 +11869,7 @@ const char * llama_print_system_info(void) {
     s += "SSE3 = "        + std::to_string(ggml_cpu_has_sse3())        + " | ";
     s += "SSSE3 = "       + std::to_string(ggml_cpu_has_ssse3())       + " | ";
     s += "VSX = "         + std::to_string(ggml_cpu_has_vsx())         + " | ";
+    s += "MATMUL_INT8 = " + std::to_string(ggml_cpu_has_matmul_int8()) + " | ";
 
     return s.c_str();
 }
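Because the commit message gates the kernels on the compiler-defined macro __ARM_FEATURE_MATMUL_INT8, the new MATMUL_INT8 entry most likely reports a build-time capability rather than the result of runtime CPU detection. The following is a minimal sketch of such a query under that assumption; `has_matmul_int8_sketch` and `system_info_sketch` are hypothetical names, not ggml's or llama.cpp's API, and only the "NAME = value | " string format is taken from the hunk above.

```cpp
#include <cassert>
#include <string>

// Hypothetical compile-time capability query: returns 1 when the compiler
// targets the Arm int8 matrix-multiply extension (and therefore defines
// __ARM_FEATURE_MATMUL_INT8), 0 otherwise.
int has_matmul_int8_sketch() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return 1;
#else
    return 0;
#endif
}

// Hypothetical system_info fragment using the same "NAME = value | " format
// as llama_print_system_info; returns std::string by value so the caller
// owns the result.
std::string system_info_sketch() {
    std::string s;
    s += "MATMUL_INT8 = " + std::to_string(has_matmul_int8_sketch()) + " | ";
    return s;
}
```

With GCC or Clang on AArch64, building with -march=armv8.2-a+i8mm defines the macro, so the fragment reads "MATMUL_INT8 = 1 | "; an ordinary x86-64 build yields "MATMUL_INT8 = 0 | ". Returning by value sidesteps any question of how long the string backing `s.c_str()` lives, since the declaration of `s` in the original lies outside this hunk.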
