Commit graph

3378 commits

Author SHA1 Message Date
Daniel Bevenius
091a7af9fe llama : add early return for empty range (#8327)
* llama : add early return for empty range

This commit adds an early return to the llama_kv_cache_seq_add and
llama_kv_cache_seq_div functions.

The motivation for adding this is to avoid looping over the cache
when the range is empty. I ran into this when using the self-extend
feature in main.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
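As context for the change described above, here is a minimal sketch of such a guard, assuming the shift walks cells whose positions fall in the range [p0, p1); the struct and field names are illustrative stand-ins, not the actual llama.cpp internals:

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-ins for the real KV cache types.
struct kv_cell  { int32_t pos = -1; };
struct kv_cache { std::vector<kv_cell> cells; };

// Shift the positions of cells in [p0, p1) by delta.
static void kv_cache_seq_add(kv_cache & cache, int32_t p0, int32_t p1, int32_t delta) {
    // Early return: an empty range means there is nothing to shift,
    // so skip looping over the whole cache.
    if (p0 == p1) {
        return;
    }

    for (auto & cell : cache.cells) {
        if (cell.pos >= p0 && cell.pos < p1) {
            cell.pos += delta;
        }
    }
}
```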

* llama : add static_cast to fix CI warning/error

This commit attempts to fix the following warning/error:

```console
src/llama.cpp:7271:31: error:
comparison of integer expressions of different signedness:
‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
 7271 |                         if (i < hparams.n_layer_dense_lead) {
      |                             ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This can be reproduced locally by setting -Wsign-compare in the
Makefile.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
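A self-contained illustration of the kind of cast that silences this warning; the variable names mirror the snippet above, but the surrounding code is a made-up stand-in rather than the actual fix:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t n_layer_dense_lead = 3; // unsigned, like the hparams field above
    const int      n_layer            = 8;

    for (int i = 0; i < n_layer; ++i) {
        // Comparing a signed int with a uint32_t triggers -Wsign-compare;
        // an explicit static_cast makes both operands the same signedness.
        if (i < static_cast<int>(n_layer_dense_lead)) {
            printf("layer %d is a dense layer\n", i);
        }
    }
    return 0;
}
```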

* squash! llama : add early return for empty range

Remove the setting of cache.head to 0 when the range is empty.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-14 00:20:26 +08:00
jaime-m-p
0106884e98 Detokenizer fixes (#8039)
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize() (a usage sketch follows at the end of this entry)

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Treat an unexpected vocab type as a test failure instead of an error
    - Useful when automating tests:
    - If you don't know the vocab type in advance
    - Differentiates it from other loading errors
  - Skip Unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() was throwing random exceptions
  - Clean up old known-problematic codepoints
  - Minor: fix a confusingly written hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix; set to false by default
* Better leading-space removal
* Do not remove the space when decoding special tokens
* Bugfix: custom regexes were splitting undefined Unicode codepoints
* 'viking' detokenizer: clean up spaces
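As a usage sketch of the round trip this entry describes, the helper below calls llama_detokenize() with a parameter order inferred from the description above (symmetric with llama_tokenize()); treat the exact signature as an assumption and check llama.h for the authoritative declaration.

```cpp
#include "llama.h"

#include <string>
#include <vector>

// tokens -> text. A negative return value is assumed to mean the buffer was
// too small, with the required size given as the absolute value.
static std::string detokenize(const llama_model * model,
                              const std::vector<llama_token> & tokens,
                              bool unparse_special) {
    std::string text(tokens.size() * 4, '\0'); // rough initial guess
    int32_t n = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                                 &text[0], (int32_t) text.size(),
                                 /*remove_special=*/false, unparse_special);
    if (n < 0) {
        text.resize(-n);
        n = llama_detokenize(model, tokens.data(), (int32_t) tokens.size(),
                             &text[0], (int32_t) text.size(),
                             /*remove_special=*/false, unparse_special);
    }
    text.resize(n > 0 ? n : 0);
    return text;
}
```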
2024-07-14 00:20:26 +08:00
Xuan Son Nguyen
16ab65b7b9 Reorganize documentation pages (#8325)
* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
2024-07-14 00:20:26 +08:00
Georgi Gerganov
401892e563 llama : fix compile warning (#8304) 2024-07-14 00:20:26 +08:00
Natsu
c667e897e9 cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281) 2024-07-14 00:20:26 +08:00
Georgi Gerganov
c738f1bc89 convert : remove AWQ remnants (#8320) 2024-07-14 00:17:51 +08:00
Georgi Gerganov
655a624782 llama : minor indentation during tensor loading (#8304)
* llama : minor indentation during tensor loading

ggml-ci

* llama : use int for layer iterators [no ci]
2024-07-14 00:17:51 +08:00
Johannes Gäßler
1dfab16f5d CUDA: MMQ support for iq4_nl, iq4_xs (#8278) 2024-07-14 00:17:51 +08:00
Daniele
4bb7223486 CUDA: revert part of the RDNA1 optimizations (#8309)
The change to launch_bounds was causing a small performance drop of 25 t/s when computing perplexity
2024-07-14 00:17:51 +08:00
Douglas Hanley
d49328a3bf llama : streamline embeddings from "non-embedding" models (#8087) 2024-07-14 00:17:51 +08:00
Johannes Gäßler
972fbf7fbf CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311) 2024-07-14 00:17:51 +08:00
Pieter Ouwerkerk
df8b4d8e39 readme : fix minor typos [no ci] (#8314) 2024-07-14 00:17:51 +08:00
Daniel Bevenius
53da9d276e passkey : add short intro to README.md [no-ci] (#8317)
* passkey : add short intro to README.md [no-ci]

This commit adds a short introduction to the README.md file in the
examples/passkey directory.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update examples/passkey/README.md

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-14 00:17:51 +08:00
Georgi Gerganov
f5cb88cc73 llama : prefer n_ over num_ prefix (#8308) 2024-07-14 00:17:51 +08:00
Georgi Gerganov
8696144105 contributing : update guidelines (#8316) 2024-07-14 00:17:51 +08:00
Neo Zhang
a4c8edcb67 fix for multiple cards 2024-07-14 00:15:55 +08:00
Neo Zhang
aeaed61904
Merge pull request #1 from arthw/update_warp
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266) cherry-pick b549a1bbef
2024-07-13 16:44:28 +08:00
arthw
74e3185cfd fix editorconfig check format issue 2024-07-13 16:02:15 +08:00
arthw
4cd9e48670 cherry-pick b549a1bbef,
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)
    * fix group_norm ut

    * split softmax

    * fix softmax

    * add concat support condition

    * revert debug code

    * move QK_WARP_SIZE to presets.hpp

Fixes issues in the above PR:
  fix norm() nullptr leading to a crash on iGPU.
  use WARP_32_SIZE in place of QK_WARP_SIZE.
  optimize dmmv.cpp for iGPU.
  add sycl_hw.cpp to detect hardware info.
2024-07-13 14:44:38 +08:00
Georgi Gerganov
c5009e6128 py : switch to snake_case (#8305)
* py : switch to snake_case

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* cont : fix link

* gguf-py : use snake_case in scripts entrypoint export

* py : rename requirements for convert_legacy_llama.py

Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-07 21:20:52 +08:00
Xuan Son Nguyen
6d6ecd3200 cli: add EOT when user hit Ctrl+C (#8296)
* main: add need_insert_eot

* do not format system prompt if it is empty
2024-07-07 21:20:05 +08:00
Icecream95
cbfc850793 llama : add OpenELM support (#7359)
* Initial OpenELM support (270M only so far)

* Fill out missing entries in llama_model_type_name

* fixup! Initial OpenELM support (270M only so far)

Fix formatting

* llama : support all OpenELM models

* llama : add variable GQA and variable FFN sizes

Some metadata keys can now also be arrays to support setting
their value per-layer for models like OpenELM.

* llama : minor spacing changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : use std::array for per-layer hparams (see the sketch at the end of this entry)

* llama : fix save/load state

* llama : do not print hparams for vocab-only models

* llama : handle n_head == 0

* llama : use const ref for print_f and fix division by zero

* llama : fix t5 uses of n_head and n_ff

* llama : minor comment

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
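A rough sketch of the per-layer hyperparameter idea mentioned above (variable GQA and FFN sizes stored in fixed-size arrays indexed by layer); the struct, field names, and the max-layer constant are illustrative, not the actual llama.cpp definitions:

```cpp
#include <array>
#include <cstdint>

// Illustrative upper bound on the number of layers.
constexpr size_t MAX_LAYERS = 512;

struct hparams_sketch {
    uint32_t n_layer = 0;

    // Metadata keys that used to be scalars can now be arrays, so models
    // like OpenELM can vary head counts and FFN sizes per layer.
    std::array<uint32_t, MAX_LAYERS> n_head_arr    = {};
    std::array<uint32_t, MAX_LAYERS> n_head_kv_arr = {};
    std::array<uint32_t, MAX_LAYERS> n_ff_arr      = {};

    uint32_t n_head(uint32_t il) const { return n_head_arr[il]; }

    // Grouped-query attention ratio for a given layer; guard against
    // n_head_kv == 0 (the division-by-zero fix mentioned above).
    uint32_t n_gqa(uint32_t il) const {
        const uint32_t n_kv = n_head_kv_arr[il];
        return n_kv == 0 ? 0 : n_head_arr[il] / n_kv;
    }
};
```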
2024-07-07 21:20:05 +08:00
Daniel Bevenius
63c6e90eab tokenize : add --show-count (token) option (#8299)
This commit adds a new option to the tokenize example, --show-count.
When this is set, the total number of tokens is printed to stdout.

This was added as an option because I was concerned that there might be
scripts that use the output from this program, and it might be better
not to print this information by default.

The motivation for this is that it can be useful to find out how many
tokens a file contains, for example when trying to determine prompt
input file sizes for testing.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-07 21:20:05 +08:00
ditsuke
498d561ab1 build: Export hf-to-gguf as snakecase 2024-07-07 21:20:05 +08:00
ditsuke
cb46165d9e doc: Add context for why we add an explicit pytorch source 2024-07-07 21:20:05 +08:00
ditsuke
ba8aea8457 chore: Remove rebase artifacts 2024-07-07 21:20:05 +08:00
ditsuke
1d1fea0b6e chore: Fixup requirements and build 2024-07-07 21:20:05 +08:00
ditsuke
1ee5d59f67 chore: ignore all __pycache__ 2024-07-07 21:20:05 +08:00
ditsuke
3aefc742fe fix: Update script paths in CI scripts 2024-07-07 21:20:05 +08:00
ditsuke
84f249c4e8 fix: Actually include scripts in build
Not namespaced though :(
2024-07-07 21:20:05 +08:00
ditsuke
2c753017ae build(python): Package scripts with pip-0517 compliance 2024-07-07 21:20:05 +08:00
fairydreaming
ff2ca9cfb7 Inference support for T5 and FLAN-T5 model families (#5763)
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token() (a hedged usage sketch follows at the end of this entry)

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
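A hedged sketch of how the new encoder-decoder entry points named above might be wired together for a T5-style model; the batch handling is simplified and the overall flow is an assumption based on this entry, not a copy of the llama-cli code:

```cpp
#include "llama.h"

#include <vector>

// Run the encoder once over the prompt, then start decoding from the model's
// decoder start token. Sampling and the generation loop are omitted.
static int encode_then_decode(llama_context * ctx, const llama_model * model,
                              std::vector<llama_token> prompt) {
    if (llama_model_has_encoder(model)) {
        llama_batch enc_batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size(), 0, 0);
        if (llama_encode(ctx, enc_batch) != 0) {
            return 1; // encoder pass failed
        }
    }

    // Encoder-decoder models start decoding from a dedicated start token
    // rather than from the prompt itself.
    llama_token tok = llama_model_decoder_start_token(model);
    if (tok == -1) {
        tok = llama_token_bos(model); // assumed fallback when no start token is defined
    }

    llama_batch dec_batch = llama_batch_get_one(&tok, 1, 0, 0);
    if (llama_decode(ctx, dec_batch) != 0) {
        return 1; // decoder pass failed
    }
    // ... sample from the logits and keep feeding tokens back into llama_decode()
    return 0;
}
```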
2024-07-07 21:20:05 +08:00
Daniel Bevenius
3a710b6aaf tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)
This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS`
to the root cmake project.

The motivation for this is that currently the following warnings are
displayed when compiling the tests and common cmake subprojects:
```console
test-llama-grammar.cpp
C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror':
This function or variable may be unsafe. Consider using strerror_s
instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
[C:\llama.cpp\build\tests\test-llama-grammar.vcxproj]
...
```

This compile definition is currently set for the `src` subproject;
this change moves it to the root cmake project so that it is applied
to all cmake subprojects.
2024-07-07 21:20:05 +08:00
Daniel Bevenius
ef1600090f llama : suppress unref var in Windows MSVC (#8150)
* llama : suppress unref var in Windows MSVC

This commit suppresses two warnings that are currently generated for
src/llama.cpp when building with MSVC on Windows:

```console
C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
```
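For reference, two common ways to silence C4101-style unreferenced-variable warnings, shown on a made-up snippet rather than the code at the lines quoted above:

```cpp
#include <cstdio>
#include <stdexcept>

int main() {
    try {
        throw std::runtime_error("boom");
    } catch (const std::exception &) {
        // Option 1: omit the variable name when the exception object is not
        // needed, so nothing is left unreferenced.
        std::puts("caught");
    }

    try {
        throw std::runtime_error("boom");
    } catch (const std::exception & e) {
        // Option 2: explicitly "use" the variable with a cast to void.
        (void) e;
        std::puts("caught again");
    }
    return 0;
}
```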

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-07 21:20:05 +08:00
Georgi Gerganov
e9d503a5d7 convert : fix gemma v1 tokenizer convert (#8248)
ggml-ci
2024-07-07 21:20:04 +08:00
Daniele
ab0e5dee19 Define and optimize RDNA1 (#8085) 2024-07-07 21:14:44 +08:00
slaren
80ffd6e497 ppl : fix n_seq_max for perplexity (#8277)
* ppl : fix n_seq_max for perplexity

* use 1 seq for kl_divergence
2024-07-07 21:14:44 +08:00
Xuan Son Nguyen
40a2a1b936 fix phi 3 conversion (#8262) 2024-07-07 21:14:44 +08:00
Neo Zhang
fdef7d606e replace get_work_group_size() with a local buffer 2024-07-04 11:55:23 +08:00
Neo Zhang
2493479958 skip UT for BF16 2024-07-04 08:28:58 +08:00
Neo Zhang
96e3826f83 update for title 2024-07-03 12:59:34 +08:00
AidanBeltonS
51be862438 Dequant improvements rebase (#8255)
* Single load for half2

* Store scales in local mem

* Vec load quantized values
2024-07-03 12:02:33 +08:00
MistApproach
85ec6c02c2 fix: add missing short command line argument -mli for multiline-input (#8261) 2024-07-03 11:51:04 +08:00
Clint Herron
044995e2d1 Removes multiple newlines at the end of files that were breaking the editorconfig step of CI. (#8258) 2024-07-03 11:47:48 +08:00
Faisal Zaghloul
6b695b5a2c Add JAIS model(s) (#8118)
* Add `JAIS` model(s)

* cleanup

* address review comments

* remove hack

* un-hardcode max-alibi-bias

* minor tweaks

---------

Co-authored-by: fmz <quic_fzaghlou@quic.com>
2024-07-03 11:44:37 +08:00
Daniel Bevenius
785f24b954 convert-hf : print output file name when completed (#8181)
* convert-hf : print output file name when completed

This commit adds the output file name to the log message when the
conversion is completed.

The motivation for this change is that when the `--outfile` option is not
specified, it might not be obvious where the output file is written.

With this change the output of running the script will be something like
the following:
```console
INFO:hf-to-gguf:Model successfully exported to models/gemma-2-9b-it.gguf.
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Updates the output to support printing the directory if the output is
split into multiple files. The output file name is now also retrieved
from the model_instance object.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Use parent attribute of Path object and string interpolation.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Use os.sep instead of hardcoding the path separator.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-03 11:44:19 +08:00
slaren
726953cda5 cuda : update supports_op for matrix multiplication (#8245) 2024-07-03 11:42:59 +08:00
Neo Zhang
9c593619f3 fix multiple-GPU support, add device-selection mode, update the usage guide 2024-07-03 11:20:54 +08:00
Jianyu Zhang
de2763118f fix to support multiple GPUs, fix setting a single device, unify id/device_id/device_index 2024-07-03 10:21:29 +08:00
luoyu-intel
a9f3b10215
[SYCL] Fix win build conflict of math library (#8230)
* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16
2024-07-02 12:50:07 +08:00