@llama.cpp
@results
Feature: Results

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/split/stories15M-00001-of-00003.gguf from HF repo ggml-org/models
    And   a model file test-model-00001-of-00003.gguf
    And   128 as batch size
    And   1024 KV cache size
    And   128 max tokens to predict
    And   continuous batching

  Scenario Outline: consistent results with same seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 4 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are equal
    Examples:
      | n_slots |
      | 1       |
      # FIXME: unified KV cache nondeterminism
      # | 2       |

  Scenario Outline: different results with different seed
    Given <n_slots> slots
    And   1.0 temperature
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 42
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 43
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 44
    Given 1 prompts "Title: Little Red Riding Hood But In Space\n\nSummary:" with seed 45

    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all predictions are different
    Examples:
      | n_slots |
      | 1       |
      | 2       |

  Scenario Outline: consistent results with same seed and varying batch size
    Given 4 slots
    And   <temp> temperature
    # And   0 as draft
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "Write a very long story about AI." with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And  all slots are idle

    Then all predictions are equal
    Examples:
      | n_parallel | temp |
      | 1          | 0.0  |
      | 1          | 1.0  |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 2          | 0.0  |
      # | 4          | 0.0  |
      # | 2          | 1.0  |
      # | 4          | 1.0  |

  Scenario Outline: consistent token probs with same seed and prompt
    Given <n_slots> slots
    And   <n_kv> KV cache size
    And   1.0 temperature
    And   <n_predict> max tokens to predict
    Then  the server is starting
    Then  the server is healthy

    Given 1 prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then  the server is idle
    And   all slots are idle

    Given <n_parallel> prompts "The meaning of life is" with seed 42
    And   concurrent completion requests
    # Then the server is busy # Not all slots will be utilized.
    Then the server is idle
    And  all slots are idle

    Then all token probabilities are equal
    Examples:
      | n_slots | n_kv | n_predict | n_parallel |
      | 4       | 1024 | 1         | 1          |
      # FIXME: unified KV cache nondeterminism
      # See https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227
      # and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574
      # and https://github.com/ggerganov/llama.cpp/pull/7347 .
      # | 4       | 1024 | 1         | 4          |
      # | 4       | 1024 | 100       | 1          |
      # This test still fails even with the above patches; the first token probabilities are already different.
      # | 4       | 1024 | 100       | 4          |
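The `all predictions are equal` / `all predictions are different` assertions above reduce to pairwise comparison of the completion texts returned for the batch of requests. A minimal sketch of that comparison logic (the function names here are hypothetical, not the actual step definitions in the test harness):

```python
from itertools import combinations


def all_predictions_equal(completions: list[str]) -> bool:
    # Same prompt + same seed should yield identical text:
    # every completion must match every other one.
    return all(a == b for a, b in combinations(completions, 2))


def all_predictions_different(completions: list[str]) -> bool:
    # Different seeds should yield distinct text:
    # no two completions may match.
    return all(a != b for a, b in combinations(completions, 2))
```

With temperature 1.0 the sampler is fully stochastic, so these checks only make sense when the seed is pinned per request, as the scenarios above do.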