* server : simplify state machine for slot * add SLOT_STATE_DONE_PROMPT * pop_deferred_task * add missing notify_one * fix passkey test * metrics : add n_busy_slots_per_decode * fix test step * add test * maybe fix AddressSanitizer? * fix deque ? * missing lock * pop_deferred_task: also notify * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
		
			
				
	
	
		
			56 lines
		
	
	
	
		
			2.8 KiB
		
	
	
	
		
			Gherkin
		
	
	
	
	
	
			
		
		
	
	
			56 lines
		
	
	
	
		
			2.8 KiB
		
	
	
	
		
			Gherkin
		
	
	
	
	
	
| # run with: ./tests.sh --no-skipped --tags passkey
 | |
| @passkey
 | |
| @slow
 | |
| Feature: Passkey / Self-extend with context shift
 | |
| 
 | |
|   Background: Server startup
 | |
|     Given a server listening on localhost:8080
 | |
| 
 | |
|   # Generates a long text of junk and inserts a secret passkey number inside it.
 | |
|   # Then we query the LLM for the secret passkey.
 | |
|   # see #3856 and #4810
 | |
|   Scenario Outline: Passkey
 | |
|     Given a model file <hf_file> from HF repo <hf_repo>
 | |
|     And   <n_batch> as batch size
 | |
|     And   <n_junk> as number of junk
 | |
|     And   <n_predicted> server max tokens to predict
 | |
|     And   42 as seed
 | |
|     And   0.0 temperature
 | |
|     And   <n_ctx> KV cache size
 | |
|     And   1 slots
 | |
|     And   <n_ga> group attention factor to extend context size through self-extend
 | |
|     And   <n_ga_w> group attention width to extend context size through self-extend
 | |
|     # Can be override with N_GPU_LAYERS
 | |
|     And   <ngl> GPU offloaded layers
 | |
|     Then  the server is starting
 | |
|     # Higher timeout because the model may need to be downloaded from the internet
 | |
|     Then  the server is healthy with timeout 120 seconds
 | |
|     Given available models
 | |
|     Then  model 0 is trained on <n_ctx_train> tokens context
 | |
|     Given a prefix prompt:
 | |
|     """
 | |
|     here is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.
 | |
|     """
 | |
|     And a passkey prompt template:
 | |
|     """
 | |
|     The pass key is <passkey> Remember it. <passkey> is the pass key.
 | |
|     """
 | |
|     And a junk suffix prompt:
 | |
|     """
 | |
|     The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.
 | |
|     """
 | |
|     And a suffix prompt:
 | |
|     """
 | |
|     What is the pass key? The pass key is
 | |
|     """
 | |
|     Given a "<passkey>" passkey challenge prompt with the passkey inserted every <i_pos> junk
 | |
|     And  a completion request with no api error
 | |
|     Then <n_predicted> tokens are predicted matching <re_content>
 | |
| 
 | |
|     Examples:
 | |
|       | hf_repo                         | hf_file                     | n_ctx_train | ngl | n_ctx | n_batch | n_ga | n_ga_w | n_junk | i_pos | passkey | n_predicted | re_content     |
 | |
|       | TheBloke/phi-2-GGUF             | phi-2.Q4_K_M.gguf           | 2048        | 5   | 8192  | 512     | 4    | 512    | 250    | 50    | 42      | 1           | 42             |
 | |
|       | TheBloke/phi-2-GGUF             | phi-2.Q4_K_M.gguf           | 2048        | 5   | 8192  | 512     | 2    | 512    | 250    | 50    | 42      | 1           | \b((?!42)\w)+\b  |
 | |
|       #| TheBloke/Llama-2-7B-GGUF        | llama-2-7b.Q2_K.gguf        | 4096        | 3   | 16384 | 512     | 4    | 512    | 500    | 300   | 1234    | 5           | 1234           |
 | |
|       #| TheBloke/Mixtral-8x7B-v0.1-GGUF | mixtral-8x7b-v0.1.Q2_K.gguf | 32768       | 2   | 16384 | 512     | 4    | 512    | 500    | 100   | 0987    | 5           | 0
 | |
|       # 987           |
 |