tokenizer : special token handling (#3538)
* Rewrite special token handling from #1931 * shorten param name, add st verification by type * use offsets instead of copy by substr * formatting, remove copying iterator on delete * llama : normalize code-style * swift fix * print pfx/sfx if verb, main: split pfx input sfx * dont add space when using special tokens * minor : comment + spacing --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit is contained in:
parent
281ef73c25
commit
1a159553f9
7 changed files with 332 additions and 39 deletions
13
llama.h
13
llama.h
|
@ -511,17 +511,20 @@ extern "C" {
|
|||
// Tokenization
|
||||
//
|
||||
|
||||
// Convert the provided text into tokens.
|
||||
// The tokens pointer must be large enough to hold the resulting tokens.
|
||||
// Returns the number of tokens on success, no more than n_max_tokens
|
||||
// Returns a negative number on failure - the number of tokens that would have been returned
|
||||
/// @details Convert the provided text into tokens.
|
||||
/// @param tokens The tokens pointer must be large enough to hold the resulting tokens.
|
||||
/// @return Returns the number of tokens on success, no more than n_max_tokens
|
||||
/// @return Returns a negative number on failure - the number of tokens that would have been returned
|
||||
/// @param special Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.
|
||||
/// Does not insert a leading space.
|
||||
LLAMA_API int llama_tokenize(
|
||||
const struct llama_model * model,
|
||||
const char * text,
|
||||
int text_len,
|
||||
llama_token * tokens,
|
||||
int n_max_tokens,
|
||||
bool add_bos);
|
||||
bool add_bos,
|
||||
bool special);
|
||||
|
||||
// Token Id -> Piece.
|
||||
// Uses the vocabulary in the provided context.
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue