Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens

* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.

* Fix special tokens rtrim

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : fix test regexes
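For readers without the conversion code in front of them, the "Fix added tokens" item boils down to a fallback chain over the JSON files an HF export may declare added/special tokens in. The sketch below illustrates that idea only; `load_added_tokens` and the exact merge order are illustrative assumptions, not code from this commit.

```python
# Minimal sketch (not the actual convert script) of the fallback chain the
# commit message describes: added/special tokens for an SPM model can appear
# in any of three HF export files, so try each of them in turn.
import json
from pathlib import Path


def load_added_tokens(model_dir: str) -> dict[str, int]:
    """Return {token text: token id} collected from whichever files exist."""
    base = Path(model_dir)
    tokens: dict[str, int] = {}

    # 1) added_tokens.json is a flat {"<|assistant|>": 32001, ...} mapping.
    path = base / "added_tokens.json"
    if path.is_file():
        tokens.update(json.loads(path.read_text(encoding="utf-8")))

    # 2) tokenizer_config.json keys "added_tokens_decoder" by id; each entry
    #    also carries flags such as "rstrip"/"lstrip", which are what the
    #    special-token rtrim fix above is concerned with.
    path = base / "tokenizer_config.json"
    if path.is_file():
        cfg = json.loads(path.read_text(encoding="utf-8"))
        for tok_id, entry in cfg.get("added_tokens_decoder", {}).items():
            tokens.setdefault(entry["content"], int(tok_id))

    # 3) tokenizer.json lists the same tokens under "added_tokens".
    path = base / "tokenizer.json"
    if path.is_file():
        data = json.loads(path.read_text(encoding="utf-8"))
        for entry in data.get("added_tokens", []):
            tokens.setdefault(entry["content"], int(entry["id"]))

    return tokens
```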
parent fabf30b4c4
commit 917dc8cfa6
5 changed files with 98 additions and 14 deletions
@@ -37,8 +37,8 @@ Feature: llama.cpp server
     Examples: Prompts
       | prompt | n_predict | re_content | n_prompt | n_predicted | truncated |
-      | I believe the meaning of life is | 8 | (read\|going)+ | 18 | 8 | not |
-      | Write a joke about AI from a very long prompt which will not be truncated | 256 | (princesses\|everyone\|kids\|Anna\|forest)+ | 46 | 64 | not |
+      | I believe the meaning of life is | 8 | (read\|going\|pretty)+ | 18 | 8 | not |
+      | Write a joke about AI from a very long prompt which will not be truncated | 256 | (princesses\|everyone\|kids\|Anna\|forest)+ | 45 | 64 | not |

   Scenario: Completion prompt truncated
     Given a prompt:
@@ -67,8 +67,8 @@ Feature: llama.cpp server
     Examples: Prompts
       | model | system_prompt | user_prompt | max_tokens | re_content | n_prompt | n_predicted | enable_streaming | truncated |
-      | llama-2 | Book | What is the best book | 8 | (Here\|what)+ | 77 | 8 | disabled | not |
-      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128 | (thanks\|happy\|bird\|Annabyear)+ | -1 | 64 | enabled | |
+      | llama-2 | Book | What is the best book | 8 | (Here\|what)+ | 76 | 8 | disabled | not |
+      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128 | (thanks\|happy\|bird\|fireplace)+ | -1 | 64 | enabled | |

   Scenario Outline: OAI Compatibility w/ response format
@@ -84,7 +84,7 @@ Feature: llama.cpp server
       | response_format | n_predicted | re_content |
       | {"type": "json_object", "schema": {"const": "42"}} | 5 | "42" |
       | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10 | \[ -300 \] |
-      | {"type": "json_object"} | 10 | \{ " Jacky. |
+      | {"type": "json_object"} | 10 | \{ " Saragine. |

   Scenario: Tokenize / Detokenize
@@ -26,7 +26,7 @@ Feature: llama.cpp server slot management
     # Since we have cache, this should only process the last tokens
     Given a user prompt "What is the capital of Germany?"
     And a completion request with no api error
-    Then 24 tokens are predicted matching (Thank|special)
+    Then 24 tokens are predicted matching (Thank|special|Lily)
     And 7 prompt tokens are processed
     # Loading the original cache into slot 0,
     # we should only be processing 1 prompt token and get the same output
@@ -41,7 +41,7 @@ Feature: llama.cpp server slot management
     Given a user prompt "What is the capital of Germany?"
     And using slot id 1
     And a completion request with no api error
-    Then 24 tokens are predicted matching (Thank|special)
+    Then 24 tokens are predicted matching (Thank|special|Lily)
     And 1 prompt tokens are processed

   Scenario: Erase Slot
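A note on reading the re_content columns in the tables above: inside a Gherkin table the pipe character must be escaped, so `(read\|going\|pretty)+` is the regular expression `(read|going|pretty)+`, which the generated completion has to match. Below is a minimal sketch of such an assertion, with illustrative names; the real behave step definitions live in the server test harness and are not part of this diff.

```python
# Hedged sketch of a "N tokens are predicted matching <re_content>" style check.
# `completion` is assumed to be the parsed server response; field names are
# illustrative, not the exact harness API.
import re


def assert_n_tokens_predicted(completion: dict, n_expected: int, re_content: str) -> None:
    content = completion["content"]
    n_predicted = completion["timings"]["predicted_n"]

    # The table escapes '|' as '\|' only because '|' delimits Gherkin cells.
    pattern = re_content.replace(r"\|", "|")

    assert n_predicted == n_expected, (
        f"expected {n_expected} predicted tokens, got {n_predicted}"
    )
    assert re.search(pattern, content), (
        f"completion {content!r} does not match /{pattern}/"
    )
```

Under that reading, the hunks above simply widen the accepted alternations (e.g. pretty, fireplace, Lily) and adjust the expected prompt-token counts to what the fixed SPM tokenizer now produces.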