Clamp out-of-range values in K quantizer

The assertion in nearest_int() fails when quantizing Mixtral 8x7b as
Q5_K_M. I converted with `convert.py --outtype f32`, and the Mixtral
weights are bf16, which has a much larger exponent range than the K
quantizer expects. With `--outtype f16` the assertion does not fail.

See ggerganov/llama.cpp#2982
Justine Tunney 2024-04-24 16:59:30 -07:00
parent 784e11dea1
commit 2ef86e7213

@@ -1023,7 +1023,7 @@ void dequantize_row_q8_0(const block_q8_0 * restrict x, float * restrict y, int6
 // ===================== Helper functions
 //
 static inline int nearest_int(float fval) {
-    assert(fval <= 4194303.f);
+    fval = fminf(fval, 4194303.f);
     float val = fval + 12582912.f;
     int i; memcpy(&i, &val, sizeof(int));
     return (i & 0x007fffff) - 0x00400000;