commit 1d6d9497a8 (parent de26d49fbe)
Oleksandr Kuvshynov, 2024-05-27 12:36:57 -04:00
@@ -12,13 +12,12 @@ In case of two identical devices and equal model split we would leave half of compute
We can utilize this compute to speculate and then evaluate a larger sequence of tokens.
This demo is fairly limited, more like a proof of concept:
1. Expects exactly two instances running the main model
2. Only one of these instances speculates while the main model is idle, so we still waste 25% of compute
3. Speculation is linear
4. Sampling is greedy (see the sketch below)
Improving the above points is probably easier to do as separate changes, to make reviewing easier.
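
To make points 3 and 4 concrete, here is a minimal, self-contained sketch of a linear (single-chain, no tree), greedy speculative-decoding loop. This is not the actual implementation in this demo: `speculative_decode`, `draft_model`, and `target_model` are hypothetical names, and the toy "models" below are stand-ins for real next-token functions. The key structural point is the verify step, which in a real implementation scores all drafted positions in one batched forward pass; that batch is where the otherwise-idle compute gets used.

```python
# Toy sketch of linear speculative decoding with greedy sampling.
# "Linear" means the draft is a single chain of tokens, not a tree of candidates.

def speculative_decode(prompt, n_new, draft_model, target_model, k=4):
    """Generate n_new tokens. draft_model proposes k tokens per round;
    target_model verifies them. Greedy: a drafted token is accepted iff
    the target model would have emitted exactly the same token."""
    tokens = list(prompt)
    produced = 0
    while produced < n_new:
        # 1. Draft k tokens linearly with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify with the main model. In a real implementation this is
        #    ONE batched forward pass over all k positions; written as a
        #    loop here for clarity. On a mismatch the target's own token
        #    is still usable, so every round yields at least one token.
        #    (A real verify pass also yields one extra token for free
        #    when all k drafted tokens are accepted; omitted here.)
        for i in range(k):
            want = target_model(tokens + draft[:i])
            tokens.append(want)
            produced += 1
            if want != draft[i] or produced >= n_new:
                break
    return tokens

# Hypothetical greedy "models" over small integers, for demonstration only:
# the draft always counts up; the target disagrees whenever the last token is 7.
draft  = lambda seq: (seq[-1] + 1) % 10
target = lambda seq: 0 if seq[-1] == 7 else (seq[-1] + 1) % 10

print(speculative_decode([1], 12, draft, target))
```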
### Setup
@@ -50,15 +49,24 @@ Also on M2:
```
./bin/duo -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf -md ../../llms/gguf/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99 -t 1 --rpcd "localhost:20002"
...
llama_print_timings: load time = 42068.04 ms
...
llama_print_timings: total time = 42792.74 ms / 302 tokens
```
Seems like the reported eval time is a little off; see the quick check below.
Compare that with running main with the same 2 RPC servers:
```
./bin/main -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99
...
llama_print_timings: load time = 42305.61 ms
...
llama_print_timings: total time = 58555.49 ms / 268 tokens
```
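
Taking the printed timings at face value gives a rough end-to-end comparison. Load time is about 42 s in both runs, so virtually all of the difference is generation time; the numbers also show why the eval-time note above is warranted, since the duo run's generation-only time comes out implausibly small. The two runs report different token counts (302 vs 268), so treat this strictly as a back-of-the-envelope check:

```python
# Quick check on the llama_print_timings lines above (times in seconds).
duo_total,  duo_load,  duo_tokens  = 42.79274, 42.06804, 302
main_total, main_load, main_tokens = 58.55549, 42.30561, 268

print(f"duo  generation: {duo_total - duo_load:.2f} s for {duo_tokens} tokens")    # ~0.72 s (implausible)
print(f"main generation: {main_total - main_load:.2f} s for {main_tokens} tokens") # ~16.25 s
print(f"end-to-end ratio: {main_total / duo_total:.2f}x")                          # ~1.37x
```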
Extra: GPU utilization for both devices