I found that parallel (in examples/parallel) was unusable when -np > 1.
I bisected the issue down to d7b800b8bc.
I don't really understand the kv-cache internals, only that this change
caused parallel to emit nonsense on my M2 Mac Studio (Apple M2 Max,
macOS 14.1.2 (23B92)). The comments around it say kv_self.n is a
heuristic (and suggest other possible values for the assignment), so I
presume it shouldn't be a problem to remove the GGML_PAD(). Empirically
it seems to work fine. That said, it does sound like the bug could run
deeper, but the root cause is beyond my ability to diagnose.
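For context, GGML_PAD rounds a value up to the next multiple of n, so the padded kv_self.n can exceed the number of KV cells actually in use. A minimal sketch of the macro's arithmetic (a Python rendering of the C macro; the padding width of 32 below is only an illustrative assumption, not taken from the commit):

```python
def ggml_pad(x: int, n: int) -> int:
    # Mirrors the C macro: #define GGML_PAD(x, n) (((x) + (n) - 1) / (n) * (n))
    return (x + n - 1) // n * n

# With 30 KV cells in use, padding to a multiple of 32 reports 32 cells,
# so downstream code may touch 2 cells that were never written.
used_cells = 30
print(ggml_pad(used_cells, 32))  # 32
print(ggml_pad(33, 32))          # 64
```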
It apparently reproduces with various models, not only tinyllama, but
tinyllama is small, so it is the most convenient to reproduce with.
While tinyllama isn't known for the quality of its output, there is
still an obvious difference between the nonsense output and the normal
output.
Reproduction:
`./parallel -c 99999 -n 30 -ns 10 -np 2 -m ~/Downloads/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf`
Before fix (example bad output):
```
Input: If you could have any superpower, what would it be?
Response: . In the 812
```
After fix (example expected output):
```
Input: If you could have any superpower, what would it be?
Response: I would choose the power of being able to control time. The power
```
----
After typing the above I realized that with larger models the problem
is less apparent, but it still exists. For example, I tried mixtral:
`./parallel -c 99999 -n 50 -ns 50 -np 2 -m ~/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf`
And one of the outputs was:
```
Input: Recommend some interesting books to read.
Response: I recommend the book "Surelyourecommend would suggest starting
with the book "The Foundation for Self-Help by Dr. Micahelle myself to
anywhere in the world"
```
The problem with the above response is obvious, but here's another that
isn't so obvious if you just glance at it:
```
Input: I want to learn how to play the piano.
Response: That's great! I could recommend a personalize piano lessons
with a piano teacher. This will allow you to learn at your own pace. You
can practice scales and chords,
```
Note that "a personalize piano lessons" is not grammatical English, a
mistake that mixtral should not make. I didn't notice any such errors
when testing with this patch applied.
* server: health: fix race condition on slots data using tasks queue
* server: health:
* include_slots only if slots_endpoint
* fix compile warning: task.target_id not initialized
This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints, to avoid modifying
possibly checked-out files.
The motivation for this change is that the same was done for llava 1.5
in Commit fc0c8d286a ("llava : update surgery script to not remove
tensors"), and it makes the examples more consistent.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
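The flag's behavior can be sketched roughly as follows (illustrative Python only; the function and variable names are mine, and the actual tensor-name mapping logic in convert.py is more involved):

```python
def map_tensor_names(tensors: dict, known: set, skip_unknown: bool) -> dict:
    # Keep tensors whose names we can map; with --skip-unknown, unmapped
    # names are warned about and dropped instead of aborting the conversion.
    out = {}
    for name, data in tensors.items():
        if name in known:
            out[name] = data
        elif skip_unknown:
            print(f"Unexpected tensor name: {name} - skipping")
        else:
            raise ValueError(f"Unexpected tensor name: {name}")
    return out
```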
There are a couple of notable things in this architecture:
1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.
More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
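To illustrate the second point: the attention width n_head * head_dim need not equal n_embd, so the K/V projection shapes must come from explicit key/value length fields rather than being derived. The numbers below are assumed, 7B-class illustrative values; check the official model config:

```python
# Assumed illustrative hyperparameters, not taken from this commit.
n_embd   = 3072
n_head   = 16
head_dim = 256  # stored explicitly (key/value length), not n_embd // n_head

print(n_embd // n_head)   # 192  -> the wrong head_dim, if derived from n_embd
print(n_head * head_dim)  # 4096 -> attention width differs from n_embd
```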
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on Metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq thing missed inside the brackets
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
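The "NL" in IQ4_NL refers to mapping each 4-bit index to a non-uniform codebook value instead of a linear step. A rough sketch of the idea (the codebook values below are my reading of kvalues_iq4nl and should be verified against the source; per-block scale and block size of 32 as in the commits above):

```python
# Non-linear 4-bit codebook (values believed to match ggml's kvalues_iq4nl;
# verify against the source). Each weight is stored as a 4-bit index into
# this table plus a per-block scale.
KVALUES = [-127, -104, -83, -65, -49, -35, -22, -10,
              1,   13,  25,  38,  53,  69,  89, 113]

def quantize_block(xs, scale):
    # Pick the nearest codebook entry for each scaled weight.
    return [min(range(16), key=lambda i: abs(x / scale - KVALUES[i]))
            for x in xs]

def dequantize_block(idxs, scale):
    return [KVALUES[i] * scale for i in idxs]

xs = [0.5, -0.25, 0.9]
scale = 0.9 / 113  # map the largest magnitude onto the largest codebook value
idxs = quantize_block(xs, scale)
print(dequantize_block(idxs, scale))
```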
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.
The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* add build support for embedded metal library
* Update Makefile
---------
Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* support minLength and maxLength in JSON schema grammar converter
* Update examples/json-schema-to-grammar.py
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
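In a GBNF grammar, a string length range can be lowered to minLength mandatory character repetitions followed by (maxLength - minLength) optional ones. A hedged sketch of that lowering (the `char` rule name and the exact output format of the converter are assumptions):

```python
def length_range_repetition(rule: str, min_len: int, max_len: int) -> str:
    # e.g. min=2, max=4 -> 'char char char? char?'
    parts = [rule] * min_len + [rule + "?"] * (max_len - min_len)
    return " ".join(parts)

print(length_range_repetition("char", 2, 4))  # char char char? char?
```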
* ggml : embed Metal library source (ggml-metal.metal) into binary
enable by setting WHISPER_EMBED_METAL_LIBRARY
* rename the build option
* rename the preprocessor directive
* generate Metal library embedding assembly on the fly during the build process
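One common way to generate such an embedding stub at build time is an assembly file that pulls the .metal source in via `.incbin`; a hedged sketch of a generator script (the section and symbol names here are assumptions, not necessarily the ones this build uses):

```python
def metallib_embed_asm(path: str) -> str:
    # Emit assembly that embeds the Metal source as a byte blob, with
    # start/end symbols the C code can reference as extern char arrays.
    return (
        ".section __DATA,__ggml_metallib\n"
        ".globl _ggml_metallib_start\n"
        "_ggml_metallib_start:\n"
        f'.incbin "{path}"\n'
        ".globl _ggml_metallib_end\n"
        "_ggml_metallib_end:\n"
    )

print(metallib_embed_asm("ggml-metal.metal"))
```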
This is a follow-up to Commit fc0c8d286a
("llava : update surgery script to not remove tensors"), but this time
the change is to the BakLLaVA-specific part of the surgery script.
I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama: add llama_chat_apply_template
* test-chat-template: remove redundant vector
* chat_template: do not use std::string for buffer
* add clarification for llama_chat_apply_template
* llama_chat_apply_template: add zephyr template
* llama_chat_apply_template: correct docs
* llama_chat_apply_template: use term "chat" everywhere
* llama_chat_apply_template: change variable name to "tmpl"
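The function takes a list of role/content messages plus a template identifier and renders a single prompt string (the C API is roughly `llama_chat_apply_template(model, tmpl, chat, n_msg, add_ass, buf, length)`). A rough Python sketch of the semantics for a ChatML-style template — this only mimics the behavior, it is not the real implementation:

```python
def apply_chatml_template(messages, add_assistant_prompt=True):
    # Render messages in ChatML form: <|im_start|>role\ncontent<|im_end|>\n
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_assistant_prompt:
        # Corresponds to add_ass=true: open the tag for the assistant's reply.
        out += "<|im_start|>assistant\n"
    return out

print(apply_chatml_template([{"role": "user", "content": "Hi"}]))
```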
* #ifdef out some code NUMA blocks for Android due to lack of support
* added __ANDROID__ #ifdef gates around the NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu instead of the wrapper
* changed gates on NUMA platform-specific code to __gnu_linux__ to skip any platforms without glibc
* harmonized the #if defined blocks for the NUMA code on __gnu_linux__, since that's the only model being followed anyway
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler