Still not fully working, but worth committing these:
* per-layer n_embd_[kv]_s (probably a no-op since first layer is ssm)
* fix setting n_kv_hybrid when not worst_case
* Use the right n_kv for build_inp_s_copy when hybrid
* Use the right n_kv for recurrent section of llama_set_inputs
* Use the right logic to determine batch splitting for hybrid
Branch: BambaArchitecture
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
There are still problems at inference around matrix dimensions not lining
up, so there are likely still places where the per-layer sizes are not
being used correctly.
Branch: BambaArchitecture
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Head count and time step rank serve the same purpose in the model, so we
stick with the existing key. Chunk size is not used in this implementation
because the mixer is implemented without chunking.
Branch: BambaArchitecture
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
There are likely still some missing hparams, but the tensor mapping should
be correct.
Branch: BambaArchitecture
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Add download chat feature to server chat
Add a download feature next to the delete chat feature in the server's Vue chat interface.
* code style
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This matches the key in common BERT-based embedding models, where its
value may be something other than 1.
Branch: XLMRobertaTypeVocabSize
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* GitHub: ask for more info in issues [no ci]
* refactor issue templates to be component-specific
* more understandable issue description
* add dropdown for llama.cpp module
* CANN: Support Ascend310P to accelerate F32 and F16 models
* Add compile option soc type macro ASCEND_310P to ggml-cann lib
* Remove unused code
* Remove the hard-coded Ascend soc_type compile option in CMakeLists.txt
* vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.
Add some early returns for nonexistent rows in mul_mat_vec shaders. These
can only be hit when dispatching a 2D grid of workgroups. Fix the logic
for the 2D grid of workgroups to round up.
Enable the pipeline robustness extension if it's available, and use it to
disable robustness for these pipelines. The instructions to do the bounds
checking contend for the same ALU resources as the bit twiddling dequant
instructions.
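The round-up and early-return logic described above can be sketched on the host side as follows. This is a minimal illustration, not the ggml-vulkan code: `Grid2D`, `ceil_div`, `split_grid`, and `row_exists` are hypothetical names, and one row per workgroup is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not taken from the ggml-vulkan source.
struct Grid2D {
    uint32_t x;
    uint32_t y;
};

// Integer division that rounds up instead of truncating.
constexpr uint32_t ceil_div(uint32_t a, uint32_t b) {
    return a / b + (a % b != 0 ? 1u : 0u);
}

// Split num_rows workgroups (assumed: one row each) into a 2D grid whose
// x extent never exceeds max_x. Truncating here would drop trailing rows.
inline Grid2D split_grid(uint32_t num_rows, uint32_t max_x) {
    if (num_rows <= max_x) {
        return {num_rows, 1};
    }
    return {max_x, ceil_div(num_rows, max_x)};
}

// Mirrors the shader-side guard: a workgroup whose flattened index lands
// past the last row has nothing to do and should return early.
inline bool row_exists(uint32_t gx, uint32_t gy, const Grid2D & grid, uint32_t num_rows) {
    const uint32_t row = gy * grid.x + gx;
    return row < num_rows;
}
```

With 100000 rows and an assumed per-dimension limit of 65535, rounding up yields a 65535 x 2 grid. The second grid row then contains workgroups beyond row 99999, which is why the guard is only reachable with a 2D dispatch.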
* vulkan: Add GLSL structure aliases for quant types to allow larger loads
In Vulkan it's not possible to cast pointer types, so instead you have to
declare an aliased binding for the memory with a different type. This
commit adds aliases for the quant formats using 16b ints, and in a few
places where the struct size is a multiple of 4 also using 32b ints.
Currently only q4_k's aliases are used, but others will be used in
subsequent commits.
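As a rough C++ analogy for the aliasing idea, the sketch below declares a second struct type over the same bytes. The block layouts and the `view_as` helper are hypothetical simplifications, not the real quant structs, and `memcpy` stands in for the aliased binding that GLSL requires.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical simplified block layouts, for illustration only.
struct block_a {            // 18 bytes: a multiple of 2 but not of 4
    uint16_t d;             // scale, kept as a raw 16-bit value
    uint8_t  qs[16];        // packed 4-bit quants, two per byte
};
struct block_a_packed16 {   // same bytes viewed as 16-bit words
    uint16_t d;
    uint16_t qs[8];
};
struct block_b {            // 20 bytes: a multiple of 4
    uint16_t d;
    uint16_t m;
    uint8_t  qs[16];
};
struct block_b_packed32 {   // same bytes viewed as 32-bit words
    uint16_t d;
    uint16_t m;
    uint32_t qs[4];
};

static_assert(sizeof(block_a) == sizeof(block_a_packed16), "16-bit alias must match");
static_assert(sizeof(block_b) == sizeof(block_b_packed32), "32-bit alias must match");
static_assert(sizeof(block_a) % 4 != 0, "no 32-bit alias possible for block_a");

// Reinterpret the bytes of one layout as another of identical size.
template <typename To, typename From>
To view_as(const From & src) {
    static_assert(sizeof(To) == sizeof(From), "alias must cover the same bytes");
    To dst;
    std::memcpy(&dst, &src, sizeof(To));
    return dst;
}
```

For example, `view_as<block_a_packed16>(blk).qs[0]` holds the same two bytes as `blk.qs[0]` and `blk.qs[1]`, fetched as one 16-bit value rather than two 8-bit values.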
* vulkan: use larger loads in q5_k and q6_k shaders.
Similar to the optimization I did in q4_k recently, this vectorizes some loads
and reduces the number of bit twiddling instructions.
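To illustrate why a wider load can cut bit twiddling, here is a minimal sketch of nibble extraction done byte by byte versus on one 16-bit word. The function names are hypothetical and this is not the shader code; it only shows that one mask or shift can serve two packed bytes at once.

```cpp
#include <cassert>
#include <cstdint>

// Byte-wise path: each packed byte is masked and shifted separately.
inline void nibbles_bytewise(const uint8_t q[2], uint8_t lo[2], uint8_t hi[2]) {
    for (int i = 0; i < 2; ++i) {
        lo[i] = static_cast<uint8_t>(q[i] & 0x0F);
        hi[i] = static_cast<uint8_t>(q[i] >> 4);
    }
}

// Wide path: both bytes arrive in one 16-bit word, so a single mask
// (or a single shift followed by a mask) handles both byte lanes.
inline void nibbles_wide(uint16_t q, uint16_t & lo, uint16_t & hi) {
    lo = static_cast<uint16_t>(q & 0x0F0F);
    hi = static_cast<uint16_t>((q >> 4) & 0x0F0F);
}
```

Each byte lane of the wide results matches the corresponding byte-wise result, using roughly half as many mask and shift operations.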
* vulkan: use larger K step per iteration in mul_mat_vec.
Add vec4 dequantization functions, and use them to do K=8 per iteration in
mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B,
which helps reduce the load on the memory system.
The K_PER_ITER==2 logic is still there, just for F16/F32, and really only
because they support unaligned sizes.
Tweak the num_iters/unrolling logic to be simpler and catch a couple missed
unrolling opportunities.
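The idea of a configurable K step can be sketched as a plain dot product that consumes K_PER_ITER elements per outer iteration. This is a hedged CPU-side illustration with a hypothetical function name; the scalar tail for leftover elements is an assumption, since the commit message does not describe how the shader handles them.

```cpp
#include <cassert>
#include <cstddef>

// Dot product that consumes K_PER_ITER elements per outer iteration.
template <std::size_t K_PER_ITER>
float dot_k_step(const float * a, const float * b, std::size_t n) {
    float acc = 0.0f;
    std::size_t i = 0;
    // Main loop: full steps only. The fixed-size inner loop is easy to unroll.
    for (; i + K_PER_ITER <= n; i += K_PER_ITER) {
        for (std::size_t k = 0; k < K_PER_ITER; ++k) {
            acc += a[i + k] * b[i + k];
        }
    }
    // Assumed scalar tail for sizes that are not a multiple of K_PER_ITER.
    for (; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

A larger step means fewer loop iterations and wider contiguous reads per iteration, while a smaller step leaves fewer elements for the tail when sizes are not nicely aligned.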