Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS), with the batch_strides overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because that should cache better. Share some of the code that reduces the result values to memory in mul_mat_vec_base.

| Name |
|---|
| include |
| src |
| .gitignore |
| CMakeLists.txt |
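
The loop ordering described in the commit message can be illustrated with a small CPU-side sketch. This is not the shader code from the commit: the function name `mul_mat_vec_ref`, the parameter names, and the memory layout (row-major A, with B and D stored as `NUM_COLS` strided columns) are all assumptions made for illustration. The template parameter stands in for the `NUM_COLS` spec constant, and the loads from B sit in the innermost loop so that each value of A is loaded once and reused for every column.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, assuming:
//   A is row-major with `rows` rows of length K,
//   B holds NUM_COLS columns of length K, b_col_stride elements apart,
//   D holds NUM_COLS columns of length `rows`, d_col_stride elements apart.
template <std::size_t NUM_COLS>
void mul_mat_vec_ref(const std::vector<float>& A,
                     const std::vector<float>& B,
                     std::vector<float>& D,
                     std::size_t rows, std::size_t K,
                     std::size_t b_col_stride, std::size_t d_col_stride) {
    static_assert(NUM_COLS >= 1, "at least one column is required");
    assert(A.size() >= rows * K);
    assert(B.size() >= (NUM_COLS - 1) * b_col_stride + K);
    assert(D.size() >= (NUM_COLS - 1) * d_col_stride + rows);

    for (std::size_t r = 0; r < rows; ++r) {
        std::array<float, NUM_COLS> acc{};  // one accumulator per column, zero-initialized
        for (std::size_t k = 0; k < K; ++k) {
            const float a = A[r * K + k];   // loaded once, reused for all columns
            // Loads from B are in the innermost loop.
            for (std::size_t j = 0; j < NUM_COLS; ++j) {
                acc[j] += a * B[j * b_col_stride + k];
            }
        }
        // Write the per-column results for this row.
        for (std::size_t j = 0; j < NUM_COLS; ++j) {
            D[j * d_col_stride + r] = acc[j];
        }
    }
}
```

With `NUM_COLS` equal to 1 this reduces to an ordinary matrix-vector product; larger values amortize each load of A over several columns of B.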