commit 1d6d9497a8 (parent de26d49fbe)
Oleksandr Kuvshynov, 2024-05-27 12:36:57 -04:00
@@ -12,13 +12,12 @@ In case of two identical devices and equal model split we would leave half of compute
We can utilize this compute to speculate and then evaluate a larger sequence of tokens.
This demo is fairly limited, more like a proof of concept:
1. Expects exactly two instances running the main model
2. Only one of these instances speculates while the main model is idle, so we still waste 25% of compute
3. Speculation is linear
4. Sampling is greedy (see the sketch below)
Improving the above points is probably easier to do as separate changes, to make reviewing easier.
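
To make points 3 and 4 concrete, here is a minimal, self-contained sketch of a linear (single-chain, no tree), greedy speculative-decoding loop. This is not the actual implementation in this demo: `speculative_decode`, `draft_model`, and `target_model` are hypothetical names, and the toy "models" below are stand-ins for real next-token functions. The key structural point is the verify step, which in a real implementation scores all drafted positions in one batched forward pass; that batch is where the otherwise-idle compute gets used.

```python
# Toy sketch of linear speculative decoding with greedy sampling.
# "Linear" means the draft is a single chain of tokens, not a tree of candidates.

def speculative_decode(prompt, n_new, draft_model, target_model, k=4):
    """Generate n_new tokens. draft_model proposes k tokens per round;
    target_model verifies them. Greedy: a drafted token is accepted iff
    the target model would have emitted exactly the same token."""
    tokens = list(prompt)
    produced = 0
    while produced < n_new:
        # 1. Draft k tokens linearly with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify with the main model. In a real implementation this is
        #    ONE batched forward pass over all k positions; written as a
        #    loop here for clarity. On a mismatch the target's own token
        #    is still usable, so every round yields at least one token.
        #    (A real verify pass also yields one extra token for free
        #    when all k drafted tokens are accepted; omitted here.)
        for i in range(k):
            want = target_model(tokens + draft[:i])
            tokens.append(want)
            produced += 1
            if want != draft[i] or produced >= n_new:
                break
    return tokens

# Hypothetical greedy "models" over small integers, for demonstration only:
# the draft always counts up; the target disagrees whenever the last token is 7.
draft  = lambda seq: (seq[-1] + 1) % 10
target = lambda seq: 0 if seq[-1] == 7 else (seq[-1] + 1) % 10

print(speculative_decode([1], 12, draft, target))
```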
### Setup
@@ -50,15 +49,24 @@ Also on M2:
```
./bin/duo -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf -md ../../llms/gguf/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99 -t 1 --rpcd "localhost:20002"
...
llama_print_timings: load time = 42068.04 ms
...
llama_print_timings: total time = 42792.74 ms / 302 tokens
```
Seems like the reported eval time is a little off; see the quick check below.
Compare that with running main with the same 2 RPC servers:
```
./bin/main -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99
...
llama_print_timings: load time = 42305.61 ms
...
llama_print_timings: total time = 58555.49 ms / 268 tokens
```
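
Taking the printed timings at face value gives a rough end-to-end comparison. Load time is about 42 s in both runs, so virtually all of the difference is generation time; the numbers also show why the eval-time note above is warranted, since the duo run's generation-only time comes out implausibly small. The two runs report different token counts (302 vs 268), so treat this strictly as a back-of-the-envelope check:

```python
# Quick check on the llama_print_timings lines above (times in seconds).
duo_total,  duo_load,  duo_tokens  = 42.79274, 42.06804, 302
main_total, main_load, main_tokens = 58.55549, 42.30561, 268

print(f"duo  generation: {duo_total - duo_load:.2f} s for {duo_tokens} tokens")    # ~0.72 s (implausible)
print(f"main generation: {main_total - main_load:.2f} s for {main_tokens} tokens") # ~16.25 s
print(f"end-to-end ratio: {main_total / duo_total:.2f}x")                          # ~1.37x
```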
Extra: GPU utilization for both devices