Have looked at tokenizer_config.json, jinja file and default
hardcoded template in llama.cpp.
This is also one of the models where a Global BoS is needed.
NOTE: Have taken the liberty to also add a SYSTEM: prefix wrt the
system message, even though default vicuna doesn't seem to need it
but vicuna-orca does, so that both models can be driven from the
same chat template config. I am assuming the system prefix should
not create any problem even in default vicuna; however, if it does,
one can duplicate the existing vicuna block in chaton_meta.json and
make the system prefix empty in it.
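A minimal sketch of the shared-template idea, for illustration only.
The struct, its field names and the "<s>" stand-in for the Global BoS
are hypothetical; they are not the actual chaton_meta.json keys nor
the llama.cpp code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-template settings; names are illustrative only.
struct TemplateCfg {
    std::string globalBegin;      // emitted once at the very start (Global BoS)
    std::string systemPrefix;     // can be left empty if a model dislikes it
    std::string userPrefix;
    std::string assistantPrefix;
};

struct Msg {
    std::string role;
    std::string content;
};

// Build the prompt: global begin, then prefix + content per message,
// then the assistant prefix as the cue for the model to answer.
std::string apply_template(const TemplateCfg &cfg, const std::vector<Msg> &msgs) {
    std::string out = cfg.globalBegin;
    for (const auto &m : msgs) {
        if (m.role == "system") out += cfg.systemPrefix;
        else if (m.role == "user") out += cfg.userPrefix;
        else out += cfg.assistantPrefix;
        out += m.content + "\n";
    }
    out += cfg.assistantPrefix;
    return out;
}

void demo_apply_template() {
    const TemplateCfg vicuna = {"<s>", "SYSTEM: ", "USER: ", "ASSISTANT: "};
    const std::vector<Msg> msgs = {{"system", "Be brief."}, {"user", "Hi"}};
    assert(apply_template(vicuna, msgs) ==
           "<s>SYSTEM: Be brief.\nUSER: Hi\nASSISTANT: ");
}
```

Setting systemPrefix to an empty string in a copy of the config gives
the variant without the SYSTEM: prefix.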
The first model seen, based on templates added till now into meta
json file, that needs a Global Begin.
From the tokenizer_config json file, it appears that even the system
role should have an appropriate prefix, unlike what is seen in the
hardcoded default chat apply template of llama.cpp and the chat
jinja template.
With this and the past few commits, there is now simple yet
sufficient support to help move multi-level-hierarchy config files
into SimpCfg's flow, which is physically 1-level but, if required,
logically multi-level.
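One way to picture a physically 1-level but logically multi-level
store is to join the key path into a single flat key. The sketch
below assumes a '.' separator and a group -> key -> value map; both
are assumptions for illustration, not necessarily what SimpCfg does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical flat store: group -> (joined key -> value).
using FlatCfg = std::map<std::string, std::map<std::string, std::string>>;

// Join the logical key path into one physical key ('.' is assumed).
std::string join_keys(const std::vector<std::string> &keys) {
    std::string joined;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i > 0) joined += '.';
        joined += keys[i];
    }
    return joined;
}

void set_value(FlatCfg &cfg, const std::string &group,
               const std::vector<std::string> &keys, const std::string &value) {
    cfg[group][join_keys(keys)] = value;
}

// Look up a value; return `fallback` if the group or key is missing.
std::string get_value(const FlatCfg &cfg, const std::string &group,
                      const std::vector<std::string> &keys,
                      const std::string &fallback) {
    const auto g = cfg.find(group);
    if (g == cfg.end()) return fallback;
    const auto k = g->second.find(join_keys(keys));
    return (k == g->second.end()) ? fallback : k->second;
}

void demo_flat_cfg() {
    FlatCfg cfg;
    set_value(cfg, "vicuna", {"system", "prefix"}, "SYSTEM: ");
    assert(get_value(cfg, "vicuna", {"system", "prefix"}, "") == "SYSTEM: ");
    assert(get_value(cfg, "vicuna", {"system", "suffix"}, "none") == "none");
}
```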
Before this series of commits one could still have achieved this,
but it would have needed a bit more effort.
Use the commonality between Indian languages to show the mixup issue
with the simple-minded trim_dumb logic and how trim_oversmart
could potentially avoid that.
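A minimal sketch of the kind of mixup meant here, not the actual
SimpCfg code. Devanagari A (U+0905) and Kannada A (U+0C85) share
their first and last utf8 bytes, so a byte-wise trim damages one
while trying to strip the other. Trimming per code point (here via
u32string) is one possible way around it; trim_oversmart may do it
differently. The helper name trim_units is hypothetical.

```cpp
#include <cassert>
#include <string>

// Trim leading/trailing elements of `s` that occur in `trimChars`.
// An element is one code unit of TString: a single byte for
// std::string, a whole code point for std::u32string.
template <typename TString>
TString trim_units(const TString &s, const TString &trimChars) {
    const auto first = s.find_first_not_of(trimChars);
    if (first == TString::npos) return TString();
    const auto last = s.find_last_not_of(trimChars);
    return s.substr(first, last - first + 1);
}

void demo_trim_mixup() {
    const std::string devaA = "\xE0\xA4\x85";  // U+0905 in utf8
    const std::string kannA = "\xE0\xB2\x85";  // U+0C85 in utf8
    // Byte-wise: E0 and 85 are stripped, leaving a lone invalid byte.
    assert(trim_units(kannA, devaA) == "\xB2");
    // Code-point-wise: the two chars differ, so nothing is trimmed.
    const std::u32string devaA32 = U"\u0905";
    const std::u32string kannA32 = U"\u0C85";
    assert(trim_units(kannA32, devaA32) == kannA32);
}
```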
Given that I am using valid strings to show the pitfalls of fixed
native char size driven logic, there is no need to keep the dumb and
oversmart flows separate, so merge them into a common loop.
Update the notes to match the templated flow now and some of the
nitty gritties involved.
Update DumpHexString to be templated.
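A rough sketch of what a templated hex dump could look like; this is
not the actual DumpHexString, and the name hex_units is hypothetical.
Each element is widened through its own unsigned type rather than
through uint8_t, so 16 or 32 bit chars are not truncated.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <type_traits>

// Hex dump every code unit of any std::basic_string type, using
// two hex digits per byte of the underlying char type.
template <typename TString>
std::string hex_units(const TString &s) {
    using TChar = typename TString::value_type;
    using TUnsigned = typename std::make_unsigned<TChar>::type;
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (TChar c : s) {
        // Go via the same-sized unsigned type to avoid sign extension.
        const auto u = static_cast<std::uint64_t>(static_cast<TUnsigned>(c));
        out << std::setw(static_cast<int>(sizeof(TChar) * 2)) << u << ' ';
    }
    return out.str();
}

void demo_hex_units() {
    // Same character, seen as utf8 bytes and as one 32 bit code point.
    assert(hex_units(std::string("\xE0\xA4\x85")) == "E0 A4 85 ");
    assert(hex_units(std::u32string(U"\u0905")) == "00000905 ");
}
```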
Split the check nonenglish flow wrt trim dumb and oversmart testing,
so that things which work with one, but not the other, can be
differentiated in the flow.
The constructor method doesn't convert wstring to string when it
involves non-english chars, which will encode to multibyte chars
in utf8, even though it does work for the already utf8 u8string.
wcstombs doesn't seem to work for non-english chars when the
locale is set to the default C; need to change it to something like
en_US.UTF-8 to allow it to do the conversion properly.
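A small sketch of the locale dependency, assuming a POSIX-like
wcstombs where a null destination returns the required length. The
helper name wide_to_mb is hypothetical, and en_US.UTF-8 may not be
installed on a given system, hence the guarded check.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Convert using whatever LC_CTYPE locale is currently active.
// Returns false if a char cannot be represented in that locale.
bool wide_to_mb(const std::wstring &ws, std::string &out) {
    // With a null destination, wcstombs only reports the length needed.
    const std::size_t need = std::wcstombs(nullptr, ws.c_str(), 0);
    if (need == static_cast<std::size_t>(-1)) return false;
    out.assign(need, '\0');
    if (need > 0) std::wcstombs(&out[0], ws.c_str(), need);
    return true;
}

void demo_wide_to_mb() {
    std::string out;
    // Plain ASCII converts even in the default "C" locale.
    assert(wide_to_mb(L"hello", out) && out == "hello");
    // Non-english text needs a utf8 locale to be selected first.
    if (std::setlocale(LC_CTYPE, "en_US.UTF-8") != nullptr) {
        assert(wide_to_mb(L"\u0905", out) && out == "\xE0\xA4\x85");
        std::setlocale(LC_CTYPE, "C");  // restore the default
    }
}
```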
Separate out the checks wrt different string types.
Add a wstring_basic, which verifies that the wstring iterator handles
non-english chars properly, or at least better.
Without using imbue, I couldn't get non-english wstrings to print
on mac. Need to check on linux also.
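A hedged sketch of the imbue usage, with a hypothetical helper name
print_wide. Behaviour is platform dependent: other standard libraries
may additionally need std::setlocale or std::locale::global, which is
an untested assumption here.

```cpp
#include <cassert>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Imbue wcout with the named locale, then print the wide string.
// Returns false if the locale is unavailable or the write failed.
bool print_wide(const std::wstring &ws, const char *localeName) {
    try {
        std::wcout.imbue(std::locale(localeName));
    } catch (const std::runtime_error &) {
        return false;  // std::locale throws if the name is unknown
    }
    std::wcout << ws << std::endl;
    const bool ok = std::wcout.good();
    std::wcout.clear();  // do not leave the stream stuck in a failed state
    return ok;
}

void demo_print_wide() {
    // An unknown locale name is reported instead of crashing.
    assert(!print_wide(L"x", "no_such_locale_XX"));
}
```

A call like print_wide(L"\u0905", "en_US.UTF-8") is the case of
interest, provided that locale exists on the machine.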
Also avoid the uint8_t typecasting, given that wchar isn't 8 bit.
Also the warning wrt is it string, now also logs the line number,
group and key, to help user identify the line better.
Misc: pass time last week Another life, Anchakkallakokkan, Deadloch
TODO: string check wrt true/false doesn't seem to be working after
str_tolower was introduced. I seem to be making some silly mistake
that I am not able to make out, moving in and out of sleep; need to
check tomorrow.
string == string-literal failed
string == string-view failed
string.compare(string-literal) failed
Bit strange
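For reference, a minimal sketch of what the check is meant to do;
it is not the SimpCfg code and it doesn't explain the failures listed
above. The names to_lower_copy and parse_bool are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Lower-case a copy; the unsigned char hop keeps std::tolower defined
// for bytes with the high bit set.
std::string to_lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns true and sets `value` only if text is true/false in any case.
bool parse_bool(const std::string &text, bool &value) {
    const std::string lower = to_lower_copy(text);
    if (lower == "true") { value = true; return true; }
    if (lower == "false") { value = false; return true; }
    return false;
}

void demo_parse_bool() {
    bool v = false;
    assert(parse_bool("TRUE", v) && v);
    assert(parse_bool("False", v) && !v);
    assert(!parse_bool("maybe", v));
}
```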
test-chat-template-chaton now tries to check if meta-ok is ok wrt
the template-id being looked into.
Log template-id info also, where it was previously missed out.