diff --git a/common/chaton.hpp b/common/chaton.hpp
index b50ea5708..4fa17cc98 100644
--- a/common/chaton.hpp
+++ b/common/chaton.hpp
@@ -1,60 +1,166 @@
 #pragma once
-/***
- *
+/**
+ *
  * ## Overview
+ *
+ * Helps chat with models, by tagging chat messages based on the specified
+ * chat-handshake-template-standard. This uses generic tagging code driven
+ * by a json meta data file, which specifies the handshake template details.
+ *
+ * This can be used by
+ *
+ * * main, to build on existing interactive flow and its in-prefix, in-suffix
+ * and antiprompt/reverse-prompt
+ *
+ * * server, by replacing its existing llama_chat_apply_template with the
+ * equivalent helper here.
+ *
+ *
+ * ## The common pattern
+ *
+ * As a convention, the tagging used by LLMs to differentiate between the
+ * different parts when chatting with them normally follows a general pattern of
+ *
+ * *
+ *
+ * * The Roles could include System, User and Assistant (i.e. the Model)
+ *
+ * * A chat normally consists of
+ *
+ * * a System message/prompt followed by
+ *
+ * * multiple user message/query - model message/response pairs
+ *
+ * The different models will normally have all or some subset of the tagging mentioned above.
+ *
+ * You may also notice some common patterns like
+ *
+ * * Because a user message is normally followed by model/assistant response, in most models
+ *
+ * * user messages won't have EndOfSentenceTag and
+ *
+ * * the following model response won't have BeginOfSentenceTag
+ *
+ * * Because a system message will normally be immediately followed by a user query,
+ *
+ * * in many models, there won't be an EndOfSentenceTag following the system message and
+ * BeginOfSentenceTag wrt the 1st user message following the system message.
+ *
+ * * in some models there won't even be a RoleSuffixTag following system message
+ * and RolePrefixTag wrt the 1st user message following the system message.
+ *
+ * * however in many of these models, the subsequent user messages will have the
+ * BeginOfSentenceTag and/or RolePrefixTag.
+ *
+ *
+ * ## The Strategy
+ *
+ * The template meta data json file allows the user to specify the above mentioned tags wrt
+ * each of the Roles. Depending on whether a given model uses a given tag or not, you either
+ * specify the required tag or else you specify an empty string.
+ *
+ * A tag could be a single word or multiple words, and may include newline char specified
+ * using \n and so on. The tag is always demarcated using double quotes and thus also allows
+ * spaces at the beginning or end of the tag, if needed.
+ *
+ * In order to account for the conditionality of tags between the system message and the 1st
+ * user message, flags are provided to explicitly control whether each of these possible tags
+ * is used by a specific model or not, as part of its template info.
+ *
+ * The Roles are identified in the json file using "system", "user" and "assistant". However
+ * the model may use different words to identify these roles, in which case set up RolePrefix
+ * and/or RoleSuffix appropriately.
+ *
+ * To identify that the model has finished generating its response to a user query, depending on
+ * the model's handshake template standard, one will need to set the reverse-prompt to either
+ * the assistant's suffix or end tag or to the user's begin or prefix tag, depending on what
+ * is generated by the model at the end of its response.
+ *
+ *
+ * ## The JSON File
+ *
+ * Can contain the template info wrt multiple models/handshake-standards. In turn, each
+ * unique template is identified by a unique template id string.
+ *
+ * The fields that make up a given chat-handshake-template-standard include
+ *
+ * * global-> begin & end
+ *
+ * * system -> begin, prefix, suffix & end
+ *
+ * * user -> begin, prefix, suffix & end
+ *
+ * * assistant -> begin, prefix, suffix & end
+ *
+ * * reverse-prompt
+ *
+ * * systemuser-system-has-suffix, systemuser-system-has-end,
+ * systemuser-1st-user-has-begin and systemuser-1st-user-has-prefix
+ *
  *
- * Helps chat with a model, by allowing role based special token tagging, based on the specified chat-handshake-template-standard.
- * This is used by main, to build on existing interactive flow and its in-prefix, in-suffix and antiprompt/reverse-promot
- *
- * 1. [ToDo] Use a json file to configure the needed tags for each of the supported chat-handshake-template-standard
- * * global-> begin & end
- * * system -> begin, prefix, suffix & end
- * * user -> begin, prefix, suffix & end; assistant -> begin, prefix, suffix & end
- * * [main] these override the in-prefix (begin+prefix) and in-suffix
- * c. reverse-prompt
- * * [main] this adds to any reverese-prompt specified using cmdline
- * e. systemuser-sys-has-suffix, systemuser-sys-has-end, systemuser-1st-user-has-begin and systemuser-1st-user-has-prefix
- * * [chaton-tmpl-apply] if a combination of system and user messages/prompts is passed,
- * then for system messages suffix and end, as well as
- * for the 1st user message following the 1st system message,
- * include system suffix and end and user begin and prefix only if corresponding flags is set.
- * * begin should normally relate to BoS while prefix should relate to Role Identifier tag.
- * If there is no need for seperate handling of BoS and RoleIdTag, then one could even
- * set both BoS and RoleIdTag to one of these entries itself.
- *
- * 2. [main] currently the user specified system prompt (-p + -f) is tagged using system role tags,
- * and inturn this tagged message is tokenized with parse_special flag.
- * So any special token related tags in the user specified system prompt will get parsed as special.
- *
- * 3. chaton-tmpl-apply uses the json file, which was loaded, to decide on how to generate the tagged messages for tokenisation.
- * a. input: [ { role, message }, { role, message}, ....]
- * b. output: currently a single string is returned which contains the tagged message(s).
- * [later] if it is needed to differentiate between the special tags added by this from user specified prompts/messages,
- * then return [ {flag, data}, { flag, data}, {flag, data}, ....],
- * where the flag specifies whether parse_special should be used or not for the corresponding data, during tokenization.
- *
- * ## Adding support for new model / chat-handshake-template-standard
- *
- * 1. Add suitable entries in json for that model/standard
- * 2. Update the flow in chaton-tmpl-apply, as needed.
- * Try to update and or reuse the generic flow in chaton-tmpl-apply, as much as possible,
- * before trying to add a custom logic.
- * If you update the generic flow, cross check if existing json files will need to be updated or not.
- *
- * ## Notes
- *
+ * ## Usage
+ *
+ * One needs to load the json file containing the template meta data and in turn call the
+ * other helper functions as needed.
+ *
+ * In turn, one can use the helper functions to either extract a given tag or to apply all
+ * tags specified wrt a given role to the passed message or to apply tags as needed for
+ * a bunch of messages in one go.
+ *
+ * The individual message tagging helper will apply all tags specified wrt that role.
+ *
+ * The multiple messages tagging helper, chaton-tmpl-apply, will look at the boolean flags
+ * when tagging the passed messages. Here the system suffix, system end, user begin and
+ * user prefix get included only if the corresponding flag is set.
+ *
+ * Both the single and multi messages tagging helpers provide two versions.
+ * * one which returns a single string which contains the tagged message(s)
+ * * one which returns
+ * * [tagged msg] the string containing the tagged message(s)
+ * * [parts lengths] an array of integers, which specifies the part lengths,
+ * which divide the returned string into parts.
+ * * [parts types] a string where each character indicates whether the corresponding
+ * part is a normal part which needs to be tokenized without parse_special
+ * or is a special part which needs to be tokenized with parse_special.
+ *
+ *
+ * ## example/main
+ *
+ * The interactive commandline program under example/main uses
+ *
+ * * the system role related tags to tag the system prompt
+ * * the system prompt includes contents of -p if any
+ * * followed by contents of file specified using -f if any
+ * * the user begin+prefix to map to in-prefix
+ * * the user suffix+end to map to in-suffix
+ * * the reverse-prompt to map to antiprompt
+ * * wrt tokenization
+ * * the user specified system prompt is tokenized with parse_special flag.
+ * * however the user messages are tokenized without parse_special flag.
+ *
  * Currently Main doesnt use chaton-tmpl-apply, but only
  * * chaton-tmpl-apply-single (for system prompt) and
- * * chaton-tmpl-role-part which maps the user prefix, suffix and reverse-prompt to
- * in-prefix, in-suffix and antiprompt of main.
- * These always adds any role specific prefix and suffix around the passed message.
- *
- * Sample chaton_meta.json includes template info for
- * * llama2, llama3, gemma, chatml, zephyr, deepseek, monarch
- * * llama2 doesnt apply begin+prefix to 1st user msg following system msg
- * * monarch doesnt apply begin to 1st user msg following system msg
- *
+ * * chaton-tmpl-role-kv which maps the user prefix, suffix and reverse-prompt
+ * to in-prefix, in-suffix and antiprompt of main.
+ * These always add any role specific begin+prefix and suffix+end around
+ * the passed message.
+ *
+ *
+ * ## Adding support for new model / chat-handshake-template-standard
+ *
+ * 1. Add suitable entries in json for that model/standard
+ * 2. Try to reuse the generic flow in chaton-tmpl-apply, as much as possible,
+ * before trying to add custom logic.
+ * If you update the generic flow, cross-check whether existing json files will
+ * need to be updated or not.
+ *
+ *
+ * ## Notes
+ *
+ * Look at the sample chaton_meta.json in the examples folder for how the above applies to
+ * * llama2, llama3, gemma, chatml, zephyr, deepseek (normal and coder), monarch, mistral
+ *
  */
 #include