In the case of two identical devices and an equal model split, we would leave half of the compute idle. We can utilize this spare compute to speculate and then evaluate a larger sequence of tokens.
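
What one speculation step looks like, as a rough sketch rather than the actual `duo` code: draft a few tokens with the cheap model, evaluate the whole speculated sequence with the main model in one batch, and keep the longest prefix the main model agrees with. Here `draft_next` and `main_greedy` are hypothetical stand-ins for the draft-model and main-model evaluations, not functions from this repo.

```
# Illustrative sketch of one linear, greedy speculation step (not the duo implementation).
# draft_next(tokens)  -> the draft model's greedy next token (hypothetical helper)
# main_greedy(tokens) -> the main model's greedy next-token prediction for every
#                        position of `tokens`, evaluated in one batch (hypothetical helper)

def speculate_step(ctx, draft_next, main_greedy, k=8):
    # 1) Speculate: draft k tokens one by one with the cheap model (a single linear chain).
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))

    # 2) Evaluate: run the main model once over context + draft.
    #    pred[i] is the main model's greedy choice for the token following position i.
    pred = main_greedy(ctx + draft)

    # 3) Verify greedily: accept draft tokens while the main model agrees.
    accepted = []
    for i, tok in enumerate(draft):
        if pred[len(ctx) + i - 1] == tok:
            accepted.append(tok)
        else:
            break

    # The main model's own token at the first disagreement (or after a fully
    # accepted draft) comes for free from the same batch.
    bonus = pred[len(ctx) + len(accepted) - 1]
    return ctx + accepted + [bonus]
```
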
This demo is fairly limited, more like a proof of concept:

1. Expects exactly two instances running the main model
2. Only one of these instances speculates, and only while the main model is idle, so we still waste 25% of the compute
3. Speculation is linear (a single chain of draft tokens, no tree)
4. Sampling is greedy

Improving the above points is probably best done as separate changes, to make reviewing easier.

### Setup

...

Also on M2:

```
./bin/duo -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf -md ../../llms/gguf/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99 -t 1 --rpcd "localhost:20002"
...
llama_print_timings: load time = 42068.04 ms
...
llama_print_timings: total time = 42792.74 ms / 302 tokens
```

Seems like the eval-time reporting is a little off: the reported total is only ~0.7 s above the load time for 302 decoded tokens.

Compare that with running main with the same two RPC servers:

```
./bin/main -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99
...
llama_print_timings: load time = 42305.61 ms
...
llama_print_timings: total time = 58555.49 ms / 268 tokens
```
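
A rough comparison from the totals above, taking the reported numbers at face value: 58555.49 ms / 42792.74 ms ≈ 1.37, i.e. the duo run takes about 27% less wall-clock time on this prompt, with nearly identical load times (~42 s in both cases). Given the eval-time oddity noted earlier, per-token speeds are hard to read off these logs.
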
Extra:
GPU util for both devices