convert-hf : get bit-exact same output as ./quantize

The quantization version was missing. * convert-hf : don't round bf16 NANs * convert-hf : save some memory with np.int16 intermediate bf16 weights * convert-hf : more closely match llama.cpp with which weights to keep in f32
2024-05-09 11:27:34 -04:00 · 2024-05-09 11:27:34 -04:00 · 95930da30e
commit 95930da30e
parent 3801db12d8
3 changed files with 37 additions and 12 deletions
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@ -10,6 +10,7 @@ from typing import Any
 GGUF_MAGIC             = 0x46554747  # "GGUF"
 GGUF_VERSION           = 3
 GGUF_DEFAULT_ALIGNMENT = 32
+GGML_QUANT_VERSION     = 2  # GGML_QNT_VERSION from ggml.h

 #
 # metadata keys