* Fix Vulkan no kv offload incoherence
* Add k-quant mul mat mat shaders
* Rework working buffer allocation, reduces VRAM use noticeably
Clean up CPU assist code, replaced with ggml-backend offload function
* Default to all dedicated GPUs
* Add fallback for integrated GPUs if no dedicated GPUs are found
* Add debug info showing which device is allocating memory
* Fix Intel dequant issue
Fix validation issue
* Fix Vulkan GGML_OP_GET_ROWS implementation
* Clean up merge artifacts
* Remove Vulkan warning
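
The device-selection policy described above (default to all dedicated GPUs, fall back to integrated GPUs if none are found) can be sketched roughly as follows. This is an illustrative Python sketch, not the Vulkan backend code: the `Device` type and `select_default_devices` are hypothetical names, and returning all integrated GPUs in the fallback case is an assumption, since the commit message does not say how many are used.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for an enumerated GPU (not a real Vulkan type)."""
    index: int
    name: str
    kind: str  # "discrete", "integrated", or anything else


def select_default_devices(devices):
    """Return indices of the devices to use by default.

    Prefers every dedicated (discrete) GPU. If there are none, falls back
    to integrated GPUs. Assumption: the fallback returns all of them.
    """
    dedicated = [d.index for d in devices if d.kind == "discrete"]
    if dedicated:
        return dedicated
    return [d.index for d in devices if d.kind == "integrated"]


mixed = [Device(0, "iGPU", "integrated"), Device(1, "dGPU", "discrete")]
only_integrated = [Device(0, "iGPU", "integrated")]

print(select_default_devices(mixed))            # dedicated GPU only
print(select_default_devices(only_integrated))  # integrated fallback
```

Keeping the policy in one pure function makes the fallback easy to check without any GPU present.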
* Support converting xverse models to gguf format.
* 1. Convert xverse models to gguf;
2. Add LLM_ARCH_XVERSE inference in llama.cpp;
3. Add xverse item in Supported models in README.md;
* gguf-py: remove redundant logs
* llama: remove the init_mapping_prefetch custom parameter
* llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers.
* - Fix format issues
- Remove duplicate set kqv_out to llm_build_kv
* Update llama.cpp
---------
Co-authored-by: willhe <willhe@xverse.cn>
Co-authored-by: willhe <hexin@xverse.cn>
* llama: remove redundant reshape in build_kv_store
This commit removes the reshape of the V matrix in build_kv_store.
The motivation for this is that the V matrix already has the following shape:
```console
(gdb) p *v_cur
$46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU,
buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608,
8388608}, op = GGML_OP_MUL_MAT, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x0,
name = "Vcur-0", '\000' <repeats 57 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
```
And after reshaping this tensor we get:
```console
(gdb) p *ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens)
$44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU,
buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608,
8388608}, op = GGML_OP_RESHAPE, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0,
name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
```
I noticed that the `src` and `view_src` fields differ, but the dimensions
(`ne`) and strides (`nb`) are identical. Judging by the code comment, the
reshape call is not needed, and the output above supports removing it.
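
As a rough cross-check of the gdb output, the shape and stride arithmetic can be reproduced outside of ggml. The sketch below is plain Python, not ggml code: `contiguous_strides` and `reshape_2d` are hypothetical helpers, and the values of `n_embd_v_gqa` and `n_tokens` are assumed from the `ne` field in the dump.

```python
from math import prod

F32_SIZE = 4  # bytes per GGML_TYPE_F32 element


def contiguous_strides(ne, type_size=F32_SIZE):
    """Byte strides of a contiguous tensor: nb[0] = type size, nb[i] = nb[i-1] * ne[i-1]."""
    nb = [type_size]
    for dim in ne[:-1]:
        nb.append(nb[-1] * dim)
    return nb


def reshape_2d(ne, ne0, ne1):
    """Shape after a 2D reshape; the element count must stay the same."""
    if ne0 * ne1 != prod(ne):
        raise ValueError("reshape must not change the number of elements")
    return [ne0, ne1, 1, 1]


v_cur_ne = [4096, 512, 1, 1]         # ne of "Vcur-0" from the first dump
n_embd_v_gqa, n_tokens = 4096, 512   # assumed from that dump

reshaped_ne = reshape_2d(v_cur_ne, n_embd_v_gqa, n_tokens)

print(reshaped_ne == v_cur_ne)          # True: the reshape changes nothing
print(contiguous_strides(reshaped_ne))  # [4, 16384, 8388608, 8388608]
```

The strides match the `nb` values in both dumps, which is consistent with the reshape being a no-op for this tensor.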
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama : add assert
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Allow conversion of Mistral HF models
* Homogenize Llama, Mistral, Mixtral under the same entry.
* Fix tokenizer, permute tensors
* Use sentencepiece tokenizer, or fall back to hfft.
* convert-hf : small fix for mypy
* convert-hf : fix duplicated block_count
* convert-hf : add vocab size to metadata
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
- The generic /usr/bin/env shebangs are good enough
- Python deps are provisioned in the devShells
- We need to be able to leave Python out, at least on Windows (currently breaks eval)
initial nix build for windows using zig
mingwW64 build
removes nix zig windows build
removed unnecessary glibc.static
removed unnecessary import of pkgs in nix
fixed missing trailing newline on non-windows nix builds
overriding stdenv when building for crosscompiling to windows in nix
better variables when crosscompiling windows in nix
cross compile windows on macos
removed trailing whitespace
remove unnecessary overwrite of "CMAKE_SYSTEM_NAME" in nix windows build
nix: keep file extension when copying result files during cross compile for windows
nix: better checking for file extensions when using MinGW
nix: using hostPlatform instead of targetPlatform when cross compiling for Windows
using hostPlatform.extensions.executable to extract executable format
* embedding : show full embedding for single prompt
To support the use case of creating an embedding for a given prompt, print the entire embedding rather than just the first part.
Also, show the cosine similarity matrix only if there is more than one prompt, since the matrix for a single prompt is always just `1.00`.
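
To illustrate why the matrix is uninformative for a single prompt, here is a small Python sketch of cosine similarity. It is not the code from `embedding.cpp`; the function names and the sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all embeddings."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


single = [[0.1, 0.2, 0.3]]             # one prompt: 1x1 matrix
pair = [[1.0, 0.0], [0.0, 1.0]]        # two orthogonal embeddings

for row in similarity_matrix(single):
    print(" ".join(f"{v:.2f}" for v in row))
for row in similarity_matrix(pair):
    print(" ".join(f"{v:.2f}" for v in row))
```

With one prompt the only entry compares the embedding with itself, so printing the matrix adds nothing; with several prompts the off-diagonal entries carry the useful information.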
* Update examples/embedding/embedding.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq1_m: make it work for QK_K = 64 (WIP)
* iq1_m: make it work for QK_K = 64 (scalar and AVX2)
* iq1_m: QK_K = 64 seems to work on Metal and ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>