mamba : very basic quantization support
Mostly works, but there is currently no difference between the variants of a k-quant (e.g. Q4_K_S and Q4_K_M are the same).

Most of the SSM-specific weights can be kept in f32 without affecting the size much, since they are relatively small (the linear projection weights are responsible for most of Mamba's size). Too much quantization makes the state degrade quickly, and the model begins to output gibberish. This seems to affect bigger models less than small ones, but I'm not sure by how much. Experimentation will be needed to figure out which weights are more important for the _M (and _L?) variants of k-quants for Mamba.

* convert : fix wrong name for layer norm weight of official Mamba models

I was previously using Q-bert/Mamba-* models, which have a slightly different naming scheme for their weights (they start with "model.layers" instead of "backbone.layers").
This commit is contained in:
parent
78a853b788
commit
5816ae687e
2 changed files with 7 additions and 1 deletion
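The claim above that the linear projections account for most of Mamba's size can be sanity-checked with a rough per-layer parameter count. The dimensions below (d_model=768, d_inner=1536, d_state=16, d_conv=4, dt_rank=d_model/16) are assumptions based on Mamba-130m-style hyperparameters, not taken from this commit:

```python
# Rough per-layer weight counts for assumed Mamba-130m-style dims;
# tensor names here loosely follow the GGUF-side naming, for illustration.
d_model, d_inner, d_state, d_conv = 768, 1536, 16, 4
dt_rank = d_model // 16  # 48

params = {
    "ssm_in (in_proj)":   d_model * 2 * d_inner,            # quantized
    "ssm_out (out_proj)": d_inner * d_model,                 # quantized
    "ssm_x (x_proj)":     d_inner * (dt_rank + 2 * d_state), # kept in f32
    "ssm_dt (dt_proj)":   dt_rank * d_inner,                 # kept in f32
    "ssm_conv1d":         d_inner * d_conv,                  # kept in f32
    "ssm_a":              d_inner * d_state,
}
total = sum(params.values())
proj = params["ssm_in (in_proj)"] + params["ssm_out (out_proj)"]
print(f"projections: {proj / total:.0%} of per-layer weights")
# → projections: 94% of per-layer weights
```

So keeping the small SSM tensors in f32 costs only a few percent of the model size under these assumed dimensions.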
@@ -91,7 +91,7 @@ class TensorNameMap:
             "model.layers.layers.{bid}.norm", # plamo
             "model.layers.{bid}.attention_norm", # internlm2
             "model.layers.{bid}.norm", # mamba
-            "backbone.layers.{bid}.mixer.norm", # mamba
+            "backbone.layers.{bid}.norm", # mamba
         ),

         # Attention norm 2
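The hunk above changes which Hugging Face tensor name the converter's mapping table matches for the official Mamba checkpoints. A minimal sketch of how such a table resolves a per-layer name (this is not the actual gguf-py TensorNameMap implementation; the GGUF target name "blk.{bid}.attn_norm" and the block count are assumptions for illustration):

```python
# "{bid}" is the block-index placeholder used in the mapping table.
MAPPING = {
    "model.layers.{bid}.norm":    "blk.{bid}.attn_norm",  # mamba (Q-bert/Mamba-*)
    "backbone.layers.{bid}.norm": "blk.{bid}.attn_norm",  # mamba (official)
}

def map_name(hf_name: str, n_blocks: int = 64):
    # Try every block index against every known source pattern.
    for bid in range(n_blocks):
        for src, dst in MAPPING.items():
            if hf_name == src.format(bid=bid):
                return dst.format(bid=bid)
    return None  # unmapped tensor

print(map_name("backbone.layers.3.norm"))        # blk.3.attn_norm
print(map_name("backbone.layers.3.mixer.norm"))  # None (the old, wrong pattern)
```

With the old, wrong pattern ("backbone.layers.{bid}.mixer.norm"), the layer norm of official Mamba models would fail to resolve, which is what this part of the commit fixes.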
@@ -11718,6 +11718,12 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         quantize &= name != LLM_TN(model.arch)(LLM_TENSOR_POS_EMBD, "weight");
         quantize &= name != LLM_TN(model.arch)(LLM_TENSOR_TOKEN_TYPES, "weight");

+        // do not quantize Mamba's small yet 2D weights
+        // NOTE: can't use LLM_TN here because the layer number is not known
+        quantize &= name.find("ssm_conv1d.weight") == std::string::npos;
+        quantize &= name.find("ssm_x.weight") == std::string::npos;
+        quantize &= name.find("ssm_dt.weight") == std::string::npos;
+
         enum ggml_type new_type;
         void * new_data;
         size_t new_size;
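The added C++ lines use substring matching (`name.find(...) == std::string::npos`) rather than exact names because the layer number is embedded in each tensor name. The same filter, sketched in Python to show the behavior (tensor names are illustrative):

```python
# Tensors containing any of these substrings are kept unquantized,
# mirroring the std::string::npos checks in the C++ hunk above.
SKIP_SUBSTRINGS = ("ssm_conv1d.weight", "ssm_x.weight", "ssm_dt.weight")

def should_quantize(name: str) -> bool:
    # True only if the name matches none of the skip patterns.
    return not any(s in name for s in SKIP_SUBSTRINGS)

print(should_quantize("blk.0.ssm_x.weight"))    # False (kept in f32)
print(should_quantize("blk.0.ssm_out.weight"))  # True  (quantized)
```

Substring matching makes one rule cover every layer, at the cost of also matching any future tensor that happens to contain one of these substrings.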