Clamp out of range values in K quantizer
This assertion fails when quantizing Mixtral 8x7b as Q5_K_M because I used `convert.py --outtype f32`, and the Mixtral weights use bf16, which has a much larger exponent range than the K quantizer expects. If `--outtype f16` is used, the assert doesn't fail. See ggerganov/llama.cpp#2982
parent 784e11dea1
commit 2ef86e7213
1 changed file with 1 addition and 1 deletion
@@ -1023,7 +1023,7 @@ void dequantize_row_q8_0(const block_q8_0 * restrict x, float * restrict y, int6
 // ===================== Helper functions
 //
 static inline int nearest_int(float fval) {
-    assert(fval <= 4194303.f);
+    fval = fminf(fval, 4194303.f);
     float val = fval + 12582912.f;
     int i; memcpy(&i, &val, sizeof(int));
     return (i & 0x007fffff) - 0x00400000;