ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics
This commit is contained in:
Georgi Gerganov 2023-05-19 22:17:18 +03:00 committed by GitHub
parent 6986c7835a
commit 2d5db48371
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
6 changed files with 109 additions and 102 deletions

2
ggml.h
View file

@ -190,7 +190,7 @@
#define GGML_FILE_MAGIC 0x67676d6c // "ggml"
#define GGML_FILE_VERSION 1
#define GGML_QNT_VERSION 1 // bump this on quantization format changes
#define GGML_QNT_VERSION 2 // bump this on quantization format changes
#define GGML_QNT_VERSION_FACTOR 1000 // do not change this
#define GGML_MAX_DIMS 4