Commit graph

2843 commits

Author SHA1 Message Date
Francis Couture-Harpin
bffdaf4010 Merge branch 'master' into compilade/lazy-convert-hf 2024-05-08 10:56:03 -04:00
Dawid Potocki
83330d8cd6
main : add --conversation / -cnv flag (#7108) 2024-05-08 17:32:32 +03:00
Eve
465263d0cf
sgemm : AVX Q4_0 and Q8_0 (#6891)
* basic avx implementation

* style

* combine denibble with load

* reduce 256 to 128 (and back!) conversions

* sse load

* Update sgemm.cpp

* oops

oops
2024-05-08 17:29:23 +03:00
Johan
911b3900dd
server : add_special option for tokenize endpoint (#7059) 2024-05-08 15:27:58 +03:00
20kdc
ad211edef5
convert.py : --vocab-only generates false but valid params (#7027)
An example of how this might be used in the style of baby-llama will be attached with this PR.
2024-05-08 15:22:32 +03:00
Ren Xuancheng
229ffff872
llama : add BPE pre-tokenization for Qwen2 (#7114)
* Add BPE pre-tokenization for Qwen2.

* minor : fixes

---------

Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-08 15:06:43 +03:00
Xuan Son Nguyen
1fd9c1741d
clean up json_value & server_log (#7142) 2024-05-08 13:24:14 +02:00
DAN™
4cd621c26d
convert : add BPE pre-tokenization for DBRX (#7132)
* Add BPE pre-tokenization for DBRX.

* Add vocab GGUFs.

* Remove test.

* Remove GGUFs.
2024-05-08 13:43:23 +03:00
Georgi Gerganov
7e0b6a7b3b
py : also print the normalizers 2024-05-08 12:47:07 +03:00
Brian
acdce3cdef
compare-llama-bench.py: add missing basicConfig (#7138)
* compare-llama-bench.py: add missing basicConfig

* compare-llama-bench.py: Add line break between error message and print_help()

* Add regular print() markdown table
2024-05-08 10:54:39 +02:00
Justine Tunney
3855416027
ggml : introduce bfloat16 support (#6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16
which in practice ends up being 99.71% of Mistral 7b v0.2's weights
however there is currently no way other than fp32 to get the others

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-08 09:30:09 +03:00
Georgi Gerganov
c0e6fbf8c3
metal : fix unused warning 2024-05-08 09:14:50 +03:00
Jeximo
c780e75305
Further tidy on Android instructions README.md (#7077)
* Further tidy on Android instructions README.md

Fixed some logic when following readme direction

* Clean up redundent information

A new user arriving will see simple directions on llama.cpp homepage

* corrected puncuation

Period after cmake, colon after termux

* re-word for clarity

method seems to be more correct, instead of alternative in this context

* Organized required packages per build type

building llama.cpp with NDK on a pc doesn't require installing clang, cmake, git, or wget in termux.

* README.md

corrected title

* fix trailing whitespace
2024-05-08 02:26:43 +02:00
jukofyork
48b2f9c1fc
Fixed save_imatrix to match old behaviour for MoE (#7099)
* Fixed save_imatrix to match old behaviour for MoE

This fix is simple and clear, but unnecessarily doubles the memory overhead..

* Fixed missing idx variable

* Unconditionally increment ncall

Co-authored-by: slaren <slarengh@gmail.com>

* Fixed 2 bugs in save_imatrix()

- Fixed segfault bug because the counts vector needed to be created.
- Fixed pre-existing bug didn't actually add to the counts for "--combine" option.

* ncall needs summing too

* Trailing whitespace

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-08 02:24:16 +02:00
Johannes Gäßler
af0a5b6163
server: fix incorrectly reported token probabilities (#7125)
* server: normalize token probabilities

* fix temperature == 0.0f
2024-05-07 23:07:58 +02:00
nopperl
b6aa670203
Fix OLMo HF to GGUF conversion (#6910) 2024-05-07 21:39:43 +02:00
Kyle Mistele
260b7c6529
server : update readme with undocumented options (#7013) 2024-05-07 21:44:29 +03:00
Georgi Gerganov
53d6c52e22
readme : update hot topics 2024-05-07 21:43:13 +03:00
RhinoDevel
3af34c1d1b
main : update log text (EOS to EOG) (#7104)
* Update log text (EOS to EOG)

The log text "found EOS" is no longer always correct, here, because there is now an is-EOG check that also returns true for EOT.

* Improve log msg. further by using "an" instead of "some".

As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
2024-05-07 20:51:31 +03:00
omahs
04976db7a8
docs: fix typos (#7124)
* fix typo

* fix typos

* fix typo

* fix typos

* fix typo

* fix typos
2024-05-07 18:20:33 +03:00
Georgi Gerganov
947d3ad27d
ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098)
* ci : add GG_BUILD_EXTRA_TESTS_0 env

ggml-ci

* Update run.sh

ggml-ci
2024-05-07 11:08:49 +03:00
William Tambellini
858f6b73f6
Add an option to build without CUDA VMM (#7067)
Add an option to build ggml cuda without CUDA VMM
resolves
https://github.com/ggerganov/llama.cpp/issues/6889
https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4
2024-05-06 20:12:14 +02:00
Georgi Gerganov
b3a995b416
flake.lock: Update (#7079)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d?narHash=sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm%2BGpZNw%3D' (2024-04-01)
  → 'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib&narHash=sha256-iMUFArF0WCatKK6RzfUJknjem0H9m4KgorO/p3Dopkk%3D' (2024-03-29)
  → 'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25)
  → 'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-05-06 08:36:06 -07:00
Francis Couture-Harpin
94e667a9d8 gguf-py : add tqdm as a dependency
It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
2024-05-06 09:08:09 -04:00
Francis Couture-Harpin
68c5ac628f Merge branch 'master' into compilade/lazy-convert-hf 2024-05-06 08:07:08 -04:00
Georgi Gerganov
bcdee0daa7
minor : fix trailing whitespace 2024-05-06 09:31:30 +03:00
Francis Couture-Harpin
62303e7f77 convert-hf : minor changes for consistency 2024-05-05 16:49:53 -04:00
Francis Couture-Harpin
bc78bf4cdb convert-hf : faster model parts loading
Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.
2024-05-05 12:58:32 -04:00
kunnis
628b299106
Adding support for the --numa argument for llama-bench. (#7080) 2024-05-05 14:17:47 +02:00
Sigbjørn Skjæret
8f8acc8683
Disable benchmark on forked repo (#7034)
* Disable benchmark on forked repo

* only check owner on schedule event

* check owner on push also

* more readable as multi-line

* ternary won't work

* style++

* test++

* enable actions debug

* test--

* remove debug

* test++

* do debug where we can get logs

* test--

* this is driving me crazy

* correct github.event usage

* remove test condition

* correct github.event usage

* test++

* test--

* event_name is pull_request_target

* test++

* test--

* update ref checks
2024-05-05 13:38:55 +02:00
Lyle Dean
ca36326020
readme : add note that LLaMA 3 is not supported with convert.py (#7065) 2024-05-05 08:21:46 +03:00
DAN™
889bdd7686
command-r : add BPE pre-tokenization (#7063)
* Add BPE pre-tokenization for Command-R/R+.

* Bump transformers convert requirement.

* command-r : add individual digits regex

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-05 08:19:30 +03:00
Brian
6fbd432211
py : logging and flake8 suppression refactoring (#7081)
Set one as executable and add basicConfig()
to another. Also added noqa tag to test scripts.
2024-05-05 08:07:48 +03:00
Francis Couture-Harpin
98db4347e8 convert-hf : remove einops requirement for InternLM2 2024-05-04 23:57:22 -04:00
Francis Couture-Harpin
0c3833286e convert-hf : flake8 doesn't like lowercase L as a variable name 2024-05-04 23:56:43 -04:00
Francis Couture-Harpin
f09674fbbd convert-hf : save memory with lazy evaluation 2024-05-04 23:56:43 -04:00
Francis Couture-Harpin
215a0d38c8 convert-hf : fix Refact conversion 2024-05-04 23:55:42 -04:00
Xuan Son Nguyen
842500144e
gguf-split: add --no-tensor-first-split (#7072) 2024-05-04 18:56:22 +02:00
Jeximo
cf768b7e71
Tidy Android Instructions README.md (#7016)
* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <slarengh@gmail.com>

* correct grammar

Co-authored-by: slaren <slarengh@gmail.com>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-04 18:10:15 +02:00
viric
fcd84a0f5a
Fix Linux /sys cpu path to guess number of cores (#7064) 2024-05-04 15:26:53 +02:00
Francis Couture-Harpin
f2099c50ab convert-hf : align the message logged for converted tensors 2024-05-04 09:20:01 -04:00
maor-ps
03fb8a002d
If first token generated from the server is the stop word the server will crash (#7038)
This will reproduce the issue in llama13b
{
'prompt': 'Q: hello world \nA: ',
 'stop': ['\n'],
 'temperature': 0.0,
 'n_predict': 10,
 'cache_prompt': True,
 'n_probs': 10
}
2024-05-04 11:06:40 +02:00
Georgi Gerganov
92139b90af
tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)
* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
2024-05-04 08:32:32 +03:00
Francis Couture-Harpin
98f2d0e0d7 convert-hf : more consistent formatting of cmdline args 2024-05-03 23:05:41 -04:00
Francis Couture-Harpin
3e5e0dced5 Merge branch 'master' into compilade/convert-hf-refactor 2024-05-03 16:20:54 -04:00
Brian
a2ac89d6ef
convert.py : add python logging instead of print() (#6511)
* convert.py: add python logging instead of print()

* convert.py: verbose flag takes priority over dump flag log suppression

* convert.py: named instance logging

* convert.py: use explicit logger id string

* convert.py: convert extra print() to named logger

* convert.py: sys.stderr.write --> logger.error

* *.py: Convert all python scripts to use logging module

* requirements.txt: remove extra line

* flake8: update flake8 ignore and exclude to match ci settings

* gh-actions: add flake8-no-print to flake8 lint step

* pre-commit: add flake8-no-print to flake8 and also update pre-commit version

* convert-hf-to-gguf.py: print() to logger conversion

* *.py: logging basiconfig refactor to use conditional expression

* *.py: removed commented out logging

* fixup! *.py: logging basiconfig refactor to use conditional expression

* constant.py: logger.error then exit should be a raise exception instead

* *.py: Convert logger error and sys.exit() into a raise exception (for atypical error)

* gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar

* verify-checksum-model.py: This is the result of the program, it should be printed to stdout.

* compare-llama-bench.py: add blank line for readability during missing repo response

* reader.py: read_gguf_file() use print() over logging

* convert.py: warning goes to stderr and won't hurt the dump output

* gguf-dump.py: dump_metadata() should print to stdout

* convert-hf-to-gguf.py: print --> logger.debug or ValueError()

* verify-checksum-models.py: use print() for printing table

* *.py: refactor logging.basicConfig()

* gguf-py/gguf/*.py: use __name__ as logger name

Since they will be imported and not run directly.

* python-lint.yml: use .flake8 file instead

* constants.py: logger no longer required

* convert-hf-to-gguf.py: add additional logging

* convert-hf-to-gguf.py: print() --> logger

* *.py: fix flake8 warnings

* revert changes to convert-hf-to-gguf.py for get_name()

* convert-hf-to-gguf-update.py: use triple quoted f-string instead

* *.py: accidentally corrected the wrong line

* *.py: add compilade warning suggestions and style fixes
2024-05-03 22:36:41 +03:00
Daniel Bevenius
433def286e
llama : rename ctx to user_data in progress_callback (#7045)
* llama : rename ctx to user_data in progress_callback

This commit renames the `ctx` parameter to `user_data` in the
`llama_progress_callback` typedef.

The motivation for this is that other callbacks use `user_data` or
`data`, and using `ctx` in this case might be confusing as it could be
confused with `llama_context`.

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-05-03 15:24:30 +02:00
Francis Couture-Harpin
6a54973d82 Merge branch 'master' into compilade/convert-hf-refactor 2024-05-02 20:02:46 -04:00
Bartowski
60325fa56f
Remove .attention from skipped tensors to match more accurately (#7051) 2024-05-03 01:49:09 +02:00
Francis Couture-Harpin
13f4cf70db convert-hf : use a plain class for Model, and forbid direct instantiation
There are no abstract methods used anyway,
so using ABC isn't really necessary.
2024-05-02 15:52:19 -04:00