* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
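Several of the items above (the passkey test, the group attention parameters, self-extend) refer to new Gherkin scenarios in the server test suite. The sketch below is only a hypothetical illustration of how such a scenario could be phrased: the step wording, the tags, the PHI-2 file name and the group attention values are assumptions, not the committed feature file.

  # Hypothetical sketch: step phrasing, model file and parameter values are illustrative assumptions,
  # not the actual passkey feature committed here.
  @passkey
  @slow
  Scenario: Passkey retrieved with self-extend
    Given a server listening on localhost:8080
    And   a model file phi-2.Q4_0.gguf from HF repo ggml-org/models   # assumed PHI-2 quant file name
    And   4 group attention factor                                    # assumed step for n_ga
    And   512 group attention width                                   # assumed step for n_ga_w
    Then  the server is starting
    Given a prompt containing a hidden passkey inside long junk text  # assumed step
    Given concurrent completion requests
    Then  all predictions contain the passkey                         # assumed step; the negative test asserts the opposite when self-extend is disabled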
New feature file for the wrong-usage scenario (Gherkin, 22 lines, 794 B):
# run with: ./tests.sh --no-skipped --tags wrong_usage
@wrong_usage
Feature: Wrong usage of llama.cpp server

  #3969 The user must always set --n-predict option
  # to cap the number of tokens any completion request can generate
  # or pass n_predict/max_tokens in the request.
  Scenario: Infinite loop
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    # Uncomment below to fix the issue
    #And   64 server max tokens to predict
    Then  the server is starting
    Given a prompt:
      """
      Go to: infinite loop
      """
    # Uncomment below to fix the issue
    #And   128 max tokens to predict
    Given concurrent completion requests
    Then the server is idle
    Then all prompts are predicted
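The comments in the feature above describe the fix for #3969: cap generation either server-side with --n-predict or per request with n_predict/max_tokens. As a minimal sketch (the scenario name is mine; the two added steps are taken verbatim from the commented-out lines above), the capped variant of the same scenario would read:

  # Sketch of the "fixed" variant: identical to the scenario above, with the two commented steps enabled.
  Scenario: Infinite loop (capped)
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   64 server max tokens to predict
    Then  the server is starting
    Given a prompt:
      """
      Go to: infinite loop
      """
    And   128 max tokens to predict
    Given concurrent completion requests
    Then the server is idle
    Then all prompts are predicted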