* readme: introduce gpustack
GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.
Signed-off-by: thxCode <thxcode0824@gmail.com>
* readme: introduce gguf-parser
GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.
Signed-off-by: thxCode <thxcode0824@gmail.com>
---------
Signed-off-by: thxCode <thxcode0824@gmail.com>
* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors.
* Fix small typo
---------
Co-authored-by: 0cc4m <picard12@live.de>
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
Attn_q in Q3_K for experts >= 8
Attn_k in Q5_K for experts >= 8
Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS
Attn_output in Q4_K for experts >= 8
* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* ggml: use vulkan as gpu backend when available
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
* whisper: enable using vk as default buffer type
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
---------
Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.
The motivation for adding this options is that currently if the model
used does not specify a pooling type the embedding example will fail
with the following error message:
```console
main: error: pooling type NONE not supported
```
This commit also updates the name of the executable in the examples
section.
* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap
Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.
* gguf-py : add generic quantize and dequantize functions
The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.