Is the SiLU activation applied to MODEL_TENSOR.FFN_GATE_EXP here? If so, we must change this to w1 for DBRX. Each expert in DBRX has three linear layers: w1, v1 and w2. For an input tensor x, the expert's output is (silu(x @ w1.T) * (x @ v1.T)) @ w2.T. Mixtral uses the same math; the only difference is that DBRX names the up projection v1 where Mixtral uses w3. A sketch of this expert forward pass follows.
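A minimal NumPy sketch of a single DBRX expert's forward pass, assuming w1 and v1 are stored as (d_ff, d_model) matrices and w2 as (d_model, d_ff), as in Mixtral. It only illustrates the formula above; it is not the llama.cpp implementation.

import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    # SiLU / swish: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def dbrx_expert_forward(x: np.ndarray, w1: np.ndarray, v1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    # (silu(x @ w1.T) * (x @ v1.T)) @ w2.T -- same SwiGLU form as Mixtral,
    # with DBRX's v1 playing the role of Mixtral's w3 (up projection).
    gate = silu(x @ w1.T)      # gate projection + activation
    up = x @ v1.T              # up projection
    return (gate * up) @ w2.T  # down projection back to d_model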

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
Pierrick Hymbert 2024-04-12 21:40:57 +02:00 committed by GitHub
parent bdc4efe17f
commit 542585fbea


@@ -238,7 +238,7 @@ class TensorNameMap:
         MODEL_TENSOR.FFN_UP_EXP: (
             "layers.{bid}.feed_forward.experts.w3",         # mixtral (merged)
             "transformer.decoder_layer.{bid}.moe.linear_v", # Grok (merged)
-            "transformer.blocks.{bid}.ffn.experts.mlp.w1",  # dbrx
+            "transformer.blocks.{bid}.ffn.experts.mlp.v1",  # dbrx
         ),
 
         # AWQ-activation gate
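For illustration, a hedged sketch of how this mapping is consumed during conversion. It assumes the gguf-py TensorNameMap API and the MODEL_ARCH.DBRX entry added elsewhere in this PR; the printed GGUF-side name is indicative only.

import gguf

# Build the name map for DBRX (40 is DBRX's block count; any n_blocks covering
# the requested block works for the lookup shown here).
tensor_map = gguf.TensorNameMap(gguf.MODEL_ARCH.DBRX, 40)

# After this change, the dbrx "v1" expert tensor resolves to the FFN_UP_EXP slot.
name = tensor_map.get_name("transformer.blocks.0.ffn.experts.mlp.v1.weight",
                           try_suffixes=(".weight", ".bias"))
print(name)  # expected: the GGUF name for FFN_UP_EXP in block 0, e.g. "blk.0.ffn_up_exps.weight"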