* Samplers sequence: simplified and input field.
* Removed unused function
* Modify and use `settings-modal-short-input`
* rename "name" --> "label"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.
Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.
Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
* use 128 bit loads (i've tried 256->128 to death and its slower)
* double accumulator
* avx bf16 vec dot
* +3% q4_0 inference
* +7% tg +5% pp compared to master
* slower f16c version, kep for reference
* 256b version, also slow. i tried :)
* revert f16
* faster with madd
* split to functions
* Q8_0 and IQ4_NL, 5-7% faster
* fix potential overflow (performance reduced)
* 16 bit add for q4_0 only
* merge
* server : (web ui) add copy btn for code blocks
* fix problem with api key
* use settings-modal-short-input component
* always show copy btn for code snippet
* sycl: Use syclcompat::dp4a
* Using the syclcompat version allow the compiler to optimize the
operation with native function
* Update news section
* Update CI Windows oneAPI version to 2025.0
* Reword doc
* Call syclcompat::dp4a inside dpct::dp4a
This reverts commit 90cb61d692.
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 are the same dimensions and additional modulus
for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
* llama: propagating the results of `graph_compute` to the user interface
* llama: reverting kv_cache in case of failed compute
* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned
* llama: restore a kv_cache in case of failed computation
* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models
* llama: updated comments
* llama : add comments about KV cache state after error
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.
- base_model_sources (List[dict], optional)
- dataset_sources (List[dict], optional)
Dataset now represented as:
- general.dataset.count
- general.dataset.{id}.name
- general.dataset.{id}.author
- general.dataset.{id}.version
- general.dataset.{id}.organization
- general.dataset.{id}.description
- general.dataset.{id}.url
- general.dataset.{id}.doi
- general.dataset.{id}.uuid
- general.dataset.{id}.repo_url
This also adds to base model these metadata:
- general.base_model.{id}.description
* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)
* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops
* Fixes asserts in norm to fix debug builds.