server : Add option to return token pieces in /tokenize endpoint (#9108)
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fix windows ci?

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This commit is contained in:
parent
e6b7801bd1
commit
78203641fe
6 changed files with 139 additions and 6 deletions
@@ -407,9 +407,44 @@ Notice that each `probs` is an array of length `n_probs`.

 *Options:*

-`content`: Set the text to tokenize.
+`content`: (Required) The text to tokenize.

-`add_special`: Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
+`add_special`: (Optional) Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
+
+`with_pieces`: (Optional) Boolean indicating whether to return token pieces along with IDs. Default: `false`
+
+**Response:**
+
+Returns a JSON object with a `tokens` field containing the tokenization result. The `tokens` array contains either just token IDs or objects with `id` and `piece` fields, depending on the `with_pieces` parameter. The `piece` field is a string if the piece is valid Unicode, or a list of bytes otherwise.
+
+If `with_pieces` is `false`:
+
+```json
+{
+  "tokens": [123, 456, 789]
+}
+```
+
+If `with_pieces` is `true`:
+
+```json
+{
+  "tokens": [
+    {"id": 123, "piece": "Hello"},
+    {"id": 456, "piece": " world"},
+    {"id": 789, "piece": "!"}
+  ]
+}
+```
+
+With input 'á' (UTF-8 hex: C3 A1) on tinyllama/stories260k:
+
+```json
+{
+  "tokens": [
+    {"id": 198, "piece": [195]}, // hex C3
+    {"id": 164, "piece": [161]}  // hex A1
+  ]
+}
+```

 ### POST `/detokenize`: Convert tokens to text
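Because a token's `piece` can be either a string or a list of raw bytes (when the tokenizer splits inside a multi-byte UTF-8 character, as in the 'á' example above), clients need to handle both shapes. A minimal sketch of such client-side handling — `decode_pieces` is a hypothetical helper for illustration, not part of llama.cpp:

```python
# Hypothetical client-side helper: normalize a /tokenize response into
# (id, piece_bytes) pairs, covering both response shapes documented above.

def decode_pieces(response: dict) -> list[tuple[int, bytes]]:
    """Return (token_id, piece_bytes) pairs for either response shape."""
    out = []
    for tok in response["tokens"]:
        if isinstance(tok, int):
            # with_pieces=false: the array holds bare token IDs
            out.append((tok, b""))
        else:
            # with_pieces=true: objects with "id" and "piece"
            piece = tok["piece"]
            if isinstance(piece, list):
                # piece is a list of bytes (not valid standalone UTF-8)
                out.append((tok["id"], bytes(piece)))
            else:
                out.append((tok["id"], piece.encode("utf-8")))
    return out

# The 'á' example from the docs: two byte fragments (C3, A1) that only
# form a valid UTF-8 character once concatenated.
resp = {"tokens": [{"id": 198, "piece": [195]}, {"id": 164, "piece": [161]}]}
joined = b"".join(p for _, p in decode_pieces(resp))
print(joined.decode("utf-8"))  # prints: á
```

Concatenating byte-list pieces before decoding is the safe order of operations, since decoding each fragment on its own would raise a `UnicodeDecodeError`.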