Is the silu activation function applied to MODEL_TENSOR.FFN_GATE_EXP here? If so, we must change this to w1 for DBRX. Each expert in DBRX has 3 linear layers: w1, v1 and w2. For an input tensor x, the output of the expert layer is (silu(x @ w1.T) * (x @ v1.T)) @ w2.T. The same math is used in Mixtral; the only difference is that DBRX uses v1 where Mixtral uses w3.
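A minimal PyTorch sketch of the per-expert math described above (illustrative shapes only, not DBRX's actual modeling code); the comments note which weight each GGUF expert tensor type corresponds to:

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 8, 16
x  = torch.randn(1, d_model)
w1 = torch.randn(d_ff, d_model)   # gate projection -> MODEL_TENSOR.FFN_GATE_EXP
v1 = torch.randn(d_ff, d_model)   # up projection   -> MODEL_TENSOR.FFN_UP_EXP
w2 = torch.randn(d_model, d_ff)   # down projection -> MODEL_TENSOR.FFN_DOWN_EXP

# (silu(x @ w1.T) * (x @ v1.T)) @ w2.T -- same SwiGLU form as Mixtral,
# with DBRX's v1 playing the role of Mixtral's w3.
out = (F.silu(x @ w1.t()) * (x @ v1.t())) @ w2.t()
assert out.shape == (1, d_model)
```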
Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
This commit is contained in: parent bdc4efe17f, commit 542585fbea
1 changed file with 1 addition and 1 deletion
@@ -238,7 +238,7 @@ class TensorNameMap:
         MODEL_TENSOR.FFN_UP_EXP: (
             "layers.{bid}.feed_forward.experts.w3",         # mixtral (merged)
             "transformer.decoder_layer.{bid}.moe.linear_v", # Grok (merged)
-            "transformer.blocks.{bid}.ffn.experts.mlp.w1",  # dbrx
+            "transformer.blocks.{bid}.ffn.experts.mlp.v1",  # dbrx
         ),
 
         # AWQ-activation gate
|
Loading…
Add table
Add a link
Reference in a new issue