diff --git a/examples/duo/README.md b/examples/duo/README.md
index 7cb8e6e69..8644aab87 100644
--- a/examples/duo/README.md
+++ b/examples/duo/README.md
@@ -12,13 +12,12 @@ In case of two identical devices and equal model split we would leave half of co
 We can utilize this compute to speculate and then evaluate larger sequence of tokens.
 
-This demo is fairly limited:
-1. Expects two instances running main model
-2. One of these instances speculating
+This demo is fairly limited, more like a proof of concept:
+1. Expects exactly two instances running main model
+2. Only one of these instances speculating when main model is idle, so we still waste 25% of compute
 3. Speculation is linear
 4. Sampling is greedy
 
-So, in the case of two identical devices and equal model split we still are not utilizing 25% of compute.
 
 Improvement of the above points is probably easier to do as separate changes, to make reviewing easier.
 
 ### Setup
@@ -50,15 +49,24 @@ Also on M2:
 ./bin/duo -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf -md ../../llms/gguf/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99 -t 1 --rpcd "localhost:20002"
 ...
-decoded 256 tokens in 32.03 s, speed: 7.99 t/s
+llama_print_timings: load time = 42068.04 ms
+...
+llama_print_timings: total time = 42792.74 ms / 302 tokens
 ```
 
+Seems like eval time is messed up a little
+
 Compare that with running main with same 2 rpc servers:
 
 ```
-./bin/main -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99 -t 1
+./bin/main -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99
 ...
-
+llama_print_timings: load time = 42305.61 ms
+...
+llama_print_timings: total time = 58555.49 ms / 268 tokens
 ```
+
+Extra:
+
+GPU util for both devices