Commit graph

3003 commits

Author SHA1 Message Date
Justine Tunney
3855416027
ggml : introduce bfloat16 support (#6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32
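
As a rough sketch of the shift described above (illustrative only, not
necessarily the exact GGML routines), widening brain16 to binary32 is just
moving the 16 bits into the upper half of a 32-bit word, and truncating
back is the reverse:

    #include <cstdint>
    #include <cstring>

    // Sketch: bf16 -> fp32. Sign and exponent line up exactly; the 7
    // mantissa bits become the top 7 of binary32's 23 mantissa bits.
    static float bf16_to_fp32_example(uint16_t bf16_bits) {
        uint32_t fp32_bits = (uint32_t)bf16_bits << 16;
        float f;
        std::memcpy(&f, &fp32_bits, sizeof f); // bit-cast without aliasing issues
        return f;
    }

    // Sketch: fp32 -> bf16 by simple truncation of the lower 16 bits.
    // Production code typically also rounds to nearest even and
    // special-cases NaN, which this example omits.
    static uint16_t fp32_to_bf16_example(float f) {
        uint32_t fp32_bits;
        std::memcpy(&fp32_bits, &f, sizeof fp32_bits);
        return (uint16_t)(fp32_bits >> 16);
    }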

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be represented exactly in fp16,
which in practice still covers 99.71% of Mistral 7b v0.2's weights;
however, there was previously no way other than fp32 to preserve the others.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16
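
As a small illustration of where the loss comes from (a sketch under the
stated assumptions, not the conversion code used here): bf16 keeps
float32's 8 exponent bits while binary16 has only 5, so a bf16 value
converts exactly only when its exponent falls inside binary16's normal
range; bf16's 7 mantissa bits always fit in binary16's 10.

    #include <cstdint>
    #include <cstring>

    // Sketch: check whether the float a bf16 value decodes to is exactly
    // representable as a normal IEEE binary16 number. Subnormals,
    // infinities and NaN are ignored for brevity.
    static bool bf16_value_fits_fp16_example(float x) {
        if (x == 0.0f) return true;
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        int exponent = (int)((bits >> 23) & 0xFF) - 127; // unbiased exponent
        return exponent >= -14 && exponent <= 15;        // binary16 normal range
    }

For example, values around 3e38 are inside bf16's range but far above
binary16's maximum of 65504, which is why fp32 was previously the only
lossless target for such weights.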

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves by somewhere around 0.0024 to 0.0046 compared to using fp16.

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-08 09:30:09 +03:00
Georgi Gerganov
c0e6fbf8c3
metal : fix unused warning 2024-05-08 09:14:50 +03:00
Jeximo
c780e75305
Further tidy on Android instructions README.md (#7077)
* Further tidy on Android instructions README.md

Fixed some logic when following the readme directions

* Clean up redundant information

A new user arriving will see simple directions on the llama.cpp homepage

* corrected punctuation

Period after cmake, colon after termux

* re-word for clarity

"method" seems to be more correct than "alternative" in this context

* Organized required packages per build type

Building llama.cpp with the NDK on a PC doesn't require installing clang, cmake, git, or wget in Termux.

* README.md

corrected title

* fix trailing whitespace
2024-05-08 02:26:43 +02:00
jukofyork
48b2f9c1fc
Fixed save_imatrix to match old behaviour for MoE (#7099)
* Fixed save_imatrix to match old behaviour for MoE

This fix is simple and clear, but unnecessarily doubles the memory overhead.

* Fixed missing idx variable

* Unconditionally increment ncall

Co-authored-by: slaren <slarengh@gmail.com>

* Fixed 2 bugs in save_imatrix()

- Fixed a segfault caused by the counts vector not being created.
- Fixed a pre-existing bug where nothing was actually added to the counts for the "--combine" option.

* ncall needs summing too

* Trailing whitespace

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-08 02:24:16 +02:00
Johannes Gäßler
af0a5b6163
server: fix incorrectly reported token probabilities (#7125)
* server: normalize token probabilities

* fix temperature == 0.0f
2024-05-07 23:07:58 +02:00
nopperl
b6aa670203
Fix OLMo HF to GGUF conversion (#6910) 2024-05-07 21:39:43 +02:00
Kyle Mistele
260b7c6529
server : update readme with undocumented options (#7013) 2024-05-07 21:44:29 +03:00
Georgi Gerganov
53d6c52e22
readme : update hot topics 2024-05-07 21:43:13 +03:00
RhinoDevel
3af34c1d1b
main : update log text (EOS to EOG) (#7104)
* Update log text (EOS to EOG)

The log text "found EOS" is no longer always correct here, because there is now an is-EOG check that also returns true for EOT.

* Improve the log message further by using "an" instead of "some".

As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
2024-05-07 20:51:31 +03:00
omahs
04976db7a8
docs: fix typos (#7124)
* fix typo

* fix typos

* fix typo

* fix typos

* fix typo

* fix typos
2024-05-07 18:20:33 +03:00
Georgi Gerganov
947d3ad27d
ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098)
* ci : add GG_BUILD_EXTRA_TESTS_0 env

ggml-ci

* Update run.sh

ggml-ci
2024-05-07 11:08:49 +03:00
HanishKVC
76791bad63 ChatON:Fix partsLengths to int32_t type, instead of int
so that the size of the elements is explicit and fixed, and is in
turn in sync with the fixed int size specified wrt the C API, even
with C compilers that have a different idea about int.

Also avoid some unused vars; the compile flags need to be updated
later to enable the corresponding warnings.
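
As a rough illustration of why the fixed-width type matters (the function
below is hypothetical, not the actual ChatON C API): plain int may be 16,
32, or 64 bits depending on the compiler and target, while int32_t is 4
bytes everywhere, so both sides of the C boundary agree on the element
size of the lengths array.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical C-API style helper: lengths are written into a
    // caller-provided int32_t buffer, so the element size is fixed
    // regardless of what a given compiler thinks "int" is.
    extern "C" int32_t fill_parts_lengths_example(int32_t *partsLengths, int32_t maxParts) {
        const int32_t demo[] = { 5, 12, 7 };     // pretend part lengths
        int32_t n = maxParts < 3 ? maxParts : 3;
        for (int32_t i = 0; i < n; i++) partsLengths[i] = demo[i];
        return n;                                // number of entries written
    }

    int main() {
        int32_t lengths[8];
        int32_t n = fill_parts_lengths_example(lengths, 8);
        for (int32_t i = 0; i < n; i++) std::printf("part %d length %d\n", (int)i, (int)lengths[i]);
    }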
2024-05-07 12:40:49 +05:30
HanishKVC
b3a56545d6 ChatON:Reposition alertAssistantAtEnd flag for consistency 2024-05-07 11:49:43 +05:30
HanishKVC
0852f3b7ec ChatON:ExCApi: Rename for consistency 2024-05-07 11:46:40 +05:30
HanishKVC
43a3a91b03 ChatON: Cleanup/Refine initial go at tmpl_apply_ex_capi 2024-05-07 11:44:25 +05:30
HanishKVC
7c288d3dfc ChatON: Rename to partstypes for consistency 2024-05-07 11:32:20 +05:30
HanishKVC
04b4a15177 ChatON: Initial go at chat-template-apply c-api with parts info 2024-05-07 11:08:47 +05:30
HanishKVC
f6a86cd209 ChatON: Update the Note a bit 2024-05-07 10:29:16 +05:30
William Tambellini
858f6b73f6
Add an option to build without CUDA VMM (#7067)
Add an option to build ggml cuda without CUDA VMM
resolves
https://github.com/ggerganov/llama.cpp/issues/6889
https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4
2024-05-06 20:12:14 +02:00
Georgi Gerganov
b3a995b416
flake.lock: Update (#7079)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d?narHash=sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm%2BGpZNw%3D' (2024-04-01)
  → 'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib&narHash=sha256-iMUFArF0WCatKK6RzfUJknjem0H9m4KgorO/p3Dopkk%3D' (2024-03-29)
  → 'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25)
  → 'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-05-06 08:36:06 -07:00
Georgi Gerganov
bcdee0daa7
minor : fix trailing whitespace 2024-05-06 09:31:30 +03:00
HanishKVC
b875b02979 ChatON:Initial go at vicuna chat template in meta.json
Have looked at tokenizer_config.json, jinja file and default
hardcoded template in llama.cpp.

This is also one of the models where a Global BoS is needed.

NOTE: Have taken the liberty to also add a SYSTEM: prefix wrt the
system message; even though default vicuna doesn't seem to need it,
vicuna-orca seems to, so both models can be driven from the same
chat template config. I am assuming the system prefix should not
create any problem even in default vicuna; however, if it does
create a problem, one can duplicate the existing vicuna block in
chaton_meta.json and make the system prefix empty in it.
2024-05-06 11:27:56 +05:30
HanishKVC
0f8f2a18c2 ChatON:chat template for OpenChat in meta.json initial go
The first model seen so far, among the templates added to the meta
json file, that needs a Global Begin.

From the tokenizer_config json file, it appears that even the system
role should have an appropriate prefix, unlike what is seen in the
hardcoded default chat apply template of llama.cpp and the chat
jinja template.
2024-05-06 11:27:56 +05:30
HanishKVC
93115a9733 ChatON: initial go at OrionStar Ai chat model template
Taken from its tokenizer config json. The same is also found in the
existing hardcoded template in llama.cpp's default chat apply template logic.
2024-05-06 11:27:56 +05:30
HanishKVC
989c6c4125 SimpCfg: Cleanup the Note a bit to avoid some ambiguities 2024-05-06 11:27:56 +05:30
HanishKVC
344c068d7b SimpCfg:MultiPart keys wrt get_vector
With this and the past few commits, there is now simple yet
sufficient support to help move multi-level-hierarchy config files
into SimpCfg's hierarchy flow, which is physically one level but can
be treated as logically multi-level where required.

Before this series of commits one could still have achieved this,
but it would have needed a bit more effort.
2024-05-06 11:27:56 +05:30
HanishKVC
19d3c88e8a SimpCfg:MultiPart keys wrt get_value etal 2024-05-06 11:27:56 +05:30
HanishKVC
623d0b60da SimpCfg: General MultiPart support, KeyParts not Key wrt SetValue 2024-05-06 11:27:56 +05:30
HanishKVC
c6ecd9316e SimpCfg: Use to_str instead of using stringstream directly 2024-05-06 11:27:56 +05:30
HanishKVC
5380b1e86e ChatON:Update meta.json wrt command-r models
Template info picked from the tokenizer config's default entry.

Verified that the same is used in the existing hardcoded chat apply
template flow.
2024-05-06 11:27:56 +05:30
HanishKVC
2b14bcaddb SimpCfg:ChatON: add by Humans for All note 2024-05-06 11:27:56 +05:30
HanishKVC
20e5b383c5 SimpCfg:Trim DumpHexString only if SC_DEBUG_VERBOSE 2024-05-06 11:27:56 +05:30
HanishKVC
f53c19baac SimpCfg: Update the notes wrt tolower and add test code 2024-05-06 11:27:56 +05:30
HanishKVC
3287fdba28 SimpCfg:Fix/cleanup trim related test samples and flow
Use the commonality between Indian languages to show the mix-up
issue with the simple-minded trim_dump logic and how trim_oversmart
could potentially avoid that.

Given that I am using valid strings to show the pitfalls of logic
driven by a fixed native char size, there is no need to keep the
dump and oversmart flows separate, so merge them into a common loop.
2024-05-06 11:27:56 +05:30
HanishKVC
33619a3b92 SimpCfg: Templatize str_lower 2024-05-06 11:27:56 +05:30
HanishKVC
32ba195a83 SimpCfg: Templatize str_trim_single
Also use NativeCharSize and MultiNativeCharSize wording to make
the note more generic
2024-05-06 11:27:56 +05:30
HanishKVC
5b8bf849c0 SimpCfg: Fixed & ~Variable Length to Native & MultiNativeCharSize
So as to make the notes more generic.
2024-05-06 11:27:56 +05:30
HanishKVC
d030a26f3c SimpCfg:Update TrimOverSmart use templated TrimDumb after wstrconv 2024-05-06 11:27:56 +05:30
HanishKVC
97ac443bba SimpCfg:Cleanup, updated notes, templated code
Update the notes to match the now templated flow and some of the
nitty-gritties involved.

Update DumpHexString to be templated.

Split the check-nonenglish flow wrt trim dumb and oversmart testing,
so that things which work with one but not the other can be
differentiated in the flow.
2024-05-06 11:27:56 +05:30
HanishKVC
bf111a83f1 SimpCfg:TemplatedDumbTrim; Test dumb and oversmart trim logics 2024-05-06 11:27:56 +05:30
HanishKVC
554b00f027 SimpCfg: Add some missing const refs 2024-05-06 11:27:56 +05:30
HanishKVC
cae0fff715 SimpCfg: Update notes; Try add a better trimming logic 2024-05-06 11:27:56 +05:30
HanishKVC
d1156cc055 SimpCfg: As locale manipulation reqd for better processing 2024-05-06 11:27:56 +05:30
HanishKVC
2325764180 SimpCfg:CheckStrings: Switch Mbs2Wcs to multithread safe calls 2024-05-06 11:27:56 +05:30
HanishKVC
23acf07bb2 SimpCfg:CheckStrings: Cleanup wstring flow to needed parts 2024-05-06 11:27:56 +05:30
HanishKVC
2cda78f1ad SimpCfg:CheckStrings: WString2String finally
The constructor method doesn't convert wstring to string when it
involves non-English chars which will encode to multibyte chars in
utf8, even though it does work for the already-utf8 u8string.

wcstombs doesn't seem to work for non-English chars when the locale
is set to the default "C"; it needs to be changed to something like
en_US.UTF-8 to allow it to do the conversion properly.
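
A minimal sketch of the locale point above (plain standard C/C++ calls,
not the SimpCfg code; the en_US.UTF-8 locale is assumed to be installed
on the system):

    #include <clocale>
    #include <cstdlib>
    #include <cstdio>

    int main() {
        // Under the default "C" locale this conversion usually fails
        // (returns (size_t)-1) for characters outside basic ASCII, so
        // switch to a UTF-8 locale first.
        std::setlocale(LC_ALL, "en_US.UTF-8");

        const wchar_t *wide = L"नमस्ते";   // non-English sample text
        char narrow[64];
        std::size_t n = std::wcstombs(narrow, wide, sizeof narrow - 1);
        if (n == (std::size_t)-1) {
            std::printf("conversion failed\n");
        } else {
            narrow[n] = '\0';
            std::printf("converted %zu bytes: %s\n", n, narrow);
        }
    }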
2024-05-06 11:27:56 +05:30
HanishKVC
7607dbc8c7 SimpCfg:CheckStrings: Try fixup wstring handling 2024-05-06 11:27:56 +05:30
HanishKVC
1a618a42f8 SimpCfg: Update the func notes with alert 2024-05-06 11:27:56 +05:30
HanishKVC
66d6fa62b7 SimpCfg: C++ and strings is a mess even after decades
Separate out the checks wrt different string types.

Add a wstring_basic, which verifies that the wstring iterator
handles non-English chars properly, or at least better.
2024-05-06 11:27:56 +05:30
HanishKVC
3ad5cec47e SimpCfg:CheckStrings:MacOS, wstring and wcout
Without using imbue, I couldn't get non-English wstrings to print
on Mac. Need to check on Linux also.

Also avoid the uint8_t typecasting, given that wchar isn't 8 bit.
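
A tiny sketch of the imbue point (standard C++; assumes an en_US.UTF-8
locale exists on the machine, otherwise the locale constructor throws):

    #include <iostream>
    #include <locale>

    int main() {
        // Without this imbue, wide characters outside ASCII may fail to
        // print on macOS with the default locale.
        std::wcout.imbue(std::locale("en_US.UTF-8"));
        std::wcout << L"ನಮಸ್ಕಾರ" << std::endl;   // non-English sample wstring
    }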
2024-05-06 11:27:56 +05:30