* Introduce bfloat16 support
Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.
      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16
This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.
      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32
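A minimal sketch of that conversion (plain C++ with hypothetical helper names,
not necessarily the GGML API): widen the 16 bf16 bits into the top half of a
32-bit word and reinterpret it as a float; the reverse direction simply
truncates (real code may also round and handle NaN).

    #include <cstdint>
    #include <cstring>

    static float bf16_to_fp32(uint16_t h) {
        uint32_t bits = (uint32_t) h << 16;  // bf16 occupies the high 16 bits
        float f;
        std::memcpy(&f, &bits, sizeof f);    // bit-level reinterpretation
        return f;
    }

    static uint16_t fp32_to_bf16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        return (uint16_t) (bits >> 16);      // plain truncation; rounding omitted
    }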
The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16,
which in practice ends up covering 99.71% of Mistral 7b v0.2's
weights; however, there is currently no way other than fp32 to get
the others.
      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16
This change fixes that by adding a bf16 data type to GGML. Support
for CPU inference has been implemented, along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves by somewhere around 0.0024 to 0.0046 compared to using fp16.
* Remove GGML code that's not needed
* Minimize the GGML API surface area for BF16
* Remove bf16 luts
* Make the GGML header look nicer
* Fix documentation
* Apply ggerganov's fixes for test-backend-ops
* Add BF16 code for new ggml_validate_row_data() function
* Further tidy on Android instructions README.md
Fixed some logic when following the README directions
* Clean up redundant information
A new user arriving will see simple directions on llama.cpp homepage
* corrected punctuation
Period after cmake, colon after termux
* re-word for clarity
"method" seems to be more correct than "alternative" in this context
* Organized required packages per build type
Building llama.cpp with the NDK on a PC doesn't require installing clang, cmake, git, or wget in Termux.
* README.md
corrected title
* fix trailing whitespace
* Fixed save_imatrix to match old behaviour for MoE
This fix is simple and clear, but unnecessarily doubles the memory overhead.
* Fixed missing idx variable
* Unconditionally increment ncall
Co-authored-by: slaren <slarengh@gmail.com>
* Fixed 2 bugs in save_imatrix()
- Fixed a segfault caused by the counts vector not being created.
- Fixed a pre-existing bug that didn't actually add to the counts for the "--combine" option.
* ncall needs summing too
* Trailing whitespace
---------
Co-authored-by: slaren <slarengh@gmail.com>
* Update log text (EOS to EOG)
The log text "found EOS" is no longer always correct here, because there is now an is-EOG check that also returns true for EOT.
* Improve log msg. further by using "an" instead of "some".
As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
So that the size of the elements is explicit and fixed, and in turn
in sync with the fixed int size specified wrt the c-api, even with
any C compilers that have a different idea about int.
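A tiny hypothetical illustration of the idea (made-up name, not the actual
code): keep the element type explicitly 32-bit so the buffer stays in sync
with an int32_t-based C API, whatever a given compiler uses for int.

    #include <cstdint>
    #include <vector>

    // explicit 4-byte elements, matching a C API declared in terms of int32_t
    std::vector<int32_t> token_ids;   // rather than std::vector<int>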
Avoid some unused vars; need to update the compile flags later to
enable the corresponding warnings.
Have looked at tokenizer_config.json, the jinja file and the default
hardcoded template in llama.cpp.
This is also one of the models where a Global BoS is needed.
NOTE: Have taken the liberty to also add a SYSTEM: prefix to the
system message; even though default Vicuna doesn't seem to need it,
Vicuna-Orca seems to, so both models can be driven from the same
chat template config. I am assuming the system prefix should not
create any problem even in default Vicuna; however, if it does
create a problem, one can duplicate the existing vicuna block in
chaton_meta.json and make the system prefix empty in it.
This is the first model seen, based on the templates added till now
into the meta json file, that needs a Global Begin.
From the tokenizer_config json file, it appears that even the system
role should have an appropriate prefix, unlike what is seen in
llama.cpp's hardcoded default chat apply template and the chat jinja
template.
With this and the past few commits, there is now simple yet
sufficient support to help move multi-level-hierarchy config files
into SimpCfg's physically single-level, but (if required) logically
multi-level hierarchy flow.
Before this series of commits one could still have achieved this,
but it would have needed a bit more effort.
Use the commonality between Indian languages to show the mixup
issue with the simple-minded trim_dump logic and how trim_oversmart
could potentially avoid it.
Given that I am using valid strings to show the pitfalls of logic
driven by a fixed native char size, there is no need to keep the
dump and oversmart flows separate, so merge them into a common loop.
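A minimal sketch of the pitfall (hypothetical helpers, not the actual
functions): trimming raw bytes can match a byte in the middle of a multibyte
UTF-8 char and corrupt it, while trimming over wide chars compares whole
code points.

    #include <set>
    #include <string>

    // byte-wise trim: a trailing byte of a multibyte UTF-8 char may be
    // mistaken for a byte in the trim set and get chopped off
    static std::string trim_dumb(std::string s, const std::set<char> &trimSet) {
        while (!s.empty() && trimSet.count(s.back())) s.pop_back();
        return s;
    }

    // code-point-wise trim: operates on wchar_t, so chars decoded from
    // multibyte utf8 are compared as whole units
    static std::wstring trim_oversmart(std::wstring s, const std::set<wchar_t> &trimSet) {
        while (!s.empty() && trimSet.count(s.back())) s.pop_back();
        return s;
    }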
Update the notes to match the now templated flow and some of the
nitty-gritties involved.
Update DumpHexString to be templated.
Split the check-nonenglish flow wrt trim dumb and oversmart testing,
so that things which work with one, but not the other, can be
differentiated in the flow.
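A rough sketch of what the templated hex dump might look like (assumed shape,
not the actual helper), keyed off the string's value_type so it works for
std::string and std::wstring alike.

    #include <cstdio>
    #include <string>
    #include <type_traits>

    template <typename TString>
    void dump_hex_string(const TString &s) {
        // make_unsigned instead of a uint8_t cast, since wchar_t isn't 8-bit
        using CU = typename std::make_unsigned<typename TString::value_type>::type;
        for (auto c : s) {
            // print each code unit at its natural width (2 hex digits per byte)
            std::printf("%0*lx ", (int) (2 * sizeof(CU)), (unsigned long) (CU) c);
        }
        std::printf("\n");
    }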
The constructor method doesn't convert wstring to string when it
involves non-English chars which will encode to multibyte chars in
utf8, even though it does work for the already-utf8 u8string.
wcstombs doesn't seem to work for non-English chars when the locale
is set to the default "C"; it needs to be changed to something like
en_US.UTF-8 to allow it to do the conversion properly.
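A small standalone sketch of that wcstombs behaviour (not the SimpCfg code;
assumes an en_US.UTF-8 locale is installed on the machine):

    #include <clocale>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const wchar_t *wide = L"नमस्ते";  // non-English chars
        char buf[64];
        // default "C" locale: converting non-ASCII chars typically fails
        size_t n1 = std::wcstombs(buf, wide, sizeof(buf));
        std::setlocale(LC_ALL, "en_US.UTF-8");  // switch to a UTF-8 locale
        size_t n2 = std::wcstombs(buf, wide, sizeof(buf));
        std::printf("C locale: %s, UTF-8 locale: %zu bytes\n",
                    n1 == (size_t) -1 ? "failed" : "ok", n2);
        return 0;
    }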
Separate out the checks wrt the different string types.
Add a wstring_basic, which verifies that the wstring iterator
handles non-English chars properly, or at least better.
Without using imbue, I couldn't get non-English wstrings to print on
Mac. Need to check on Linux also.
Also avoid the uint8_t typecasting, given that wchar isn't 8-bit.
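And a tiny sketch of the imbue point above (again assuming an en_US.UTF-8
locale is available):

    #include <iostream>
    #include <locale>

    int main() {
        // without imbue, non-English wide chars may be dropped on Mac
        std::wcout.imbue(std::locale("en_US.UTF-8"));  // throws if locale missing
        std::wcout << L"नमस्ते" << L"\n";
        return 0;
    }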