Commit graph

3355 commits

Author SHA1 Message Date
daghanerdonmez
1052802685 readme : fix typo [no ci] (#8389)
Bakus-Naur --> Backus-Naur
2024-07-14 00:20:27 +08:00
compilade
c380b899e5 gguf-py : do not use internal numpy types (#7472) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
9ad5bcaad3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
7a8fa37316 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-14 00:20:27 +08:00
b4b4o
790e9b2a0e readme : fix web link error [no ci] (#8347) 2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
a7d7781692 sycl : fix powf call in device code (#8368) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
86d41e6e1c scripts : fix sync for sycl 2024-07-14 00:20:27 +08:00
Georgi Gerganov
a5038fc736 sync : ggml
ggml-ci
2024-07-14 00:20:27 +08:00
Georgi Gerganov
8ab505a2e9 tests : fix whitespace (#0) 2024-07-14 00:20:27 +08:00
John Balis
fec49428a6 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arguments, and fixed a bug where a test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-14 00:20:27 +08:00
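For readers unfamiliar with the op this commit ports to CUDA, a minimal sketch of exercising `ggml_conv_transpose_1d` through the public ggml API is shown below. The tensor layouts and sizes are illustrative assumptions (kernel `[K, C_out, C_in]`, input `[L, C_in]`); the CUDA kernel added here is picked up automatically when ggml is built with CUDA support, so the calling code does not change.

```cpp
// Hedged sketch: building and running ggml_conv_transpose_1d on the default backend.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = { /*mem_size=*/ 16*1024*1024, /*mem_buffer=*/ NULL, /*no_alloc=*/ false };
    struct ggml_context * ctx = ggml_init(params);

    // kernel: width 3, 2 output channels, 1 input channel
    struct ggml_tensor * kernel = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 3, 2, 1);
    // input: length 8, 1 channel
    struct ggml_tensor * input  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 1);

    // fill with dummy data
    for (int64_t i = 0; i < ggml_nelements(kernel); ++i) ((float *) kernel->data)[i] = 0.1f * i;
    for (int64_t i = 0; i < ggml_nelements(input);  ++i) ((float *) input->data)[i]  = 1.0f;

    // stride 2; the op currently requires padding 0 and dilation 1
    struct ggml_tensor * out = ggml_conv_transpose_1d(ctx, kernel, input, 2, 0, 1);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 1);

    printf("output length: %lld, channels: %lld\n", (long long) out->ne[0], (long long) out->ne[1]);
    ggml_free(ctx);
}
```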
Kevin Wang
9ff6a62845 common : preallocate sampling token data vector (#8363)
Repeatedly calling `emplace_back` is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this improves the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance, which has a more substantial impact for the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
2024-07-14 00:20:27 +08:00
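As a rough illustration of the pattern described in the commit message above (preallocating instead of repeated `emplace_back`), a self-contained sketch follows; the struct and function names are stand-ins, not the actual common.cpp code.

```cpp
// Hedged sketch: size the candidates vector once, then write entries in place.
#include <cstdint>
#include <vector>

struct token_data {        // stand-in for llama_token_data { id, logit, p }
    int32_t id;
    float   logit;
    float   p;
};

std::vector<token_data> build_candidates(const float * logits, int n_vocab) {
    std::vector<token_data> candidates;
    candidates.resize(n_vocab);                     // one allocation up front
    for (int id = 0; id < n_vocab; ++id) {
        candidates[id] = { id, logits[id], 0.0f };  // direct insert, no growth checks
    }
    return candidates;
}
```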
Georgi Gerganov
da09d77524 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-14 00:20:27 +08:00
Kevin Wang
6e022a225a common : avoid unnecessary logits fetch (#8358) 2024-07-14 00:20:27 +08:00
toyer
68d1711f73 readme : add supported glm models (#8360) 2024-07-14 00:20:27 +08:00
compilade
df044303f3 py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
2024-07-14 00:20:27 +08:00
Denis Spasyuk
b775ea0e75 Update llama-cli documentation (#8315)
* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main and templates on some commands, added chat template sections, and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md
2024-07-14 00:20:27 +08:00
Alex Tuddenham
9ee7bf007d ci : add checks for cmake,make and ctest in ci/run.sh (#8200)
* Added checks for cmake,make and ctest

* Removed erroneous whitespace
2024-07-14 00:20:27 +08:00
Andy Tai
c695235193 readme : update bindings list (#8222)
* adding guile_llama_cpp to bindings list

* fix formatting

* fix formatting
2024-07-14 00:20:27 +08:00
Brian
305b9d8892 gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
CLI to hash GGUF files to detect differences at a per-model and per-tensor level

The hash types we support are:

- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash checking programs like sha256sum, they
are designed to check entire files. This is not ideal for our purpose if we want
to check for consistency of the tensor data even if the metadata content of the
gguf KV store has been updated.

This program is designed to hash the gguf tensor payload on a 'per tensor layer'
basis in addition to an 'entire tensor model' hash. The intent is that the entire
model hash can be checked first, and if any inconsistency is detected, the per
tensor hashes can be used to narrow down the specific tensor layer that is
inconsistent.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-14 00:20:27 +08:00
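The layered-hash idea described above can be sketched as follows. This is not the gguf-hash tool itself (which reads GGUF tensor payloads and supports xxh64/sha1/sha256); FNV-1a stands in as the hash so the example is self-contained, and the tensor names and data are made up.

```cpp
// Hedged sketch: one digest per tensor payload plus one whole-model digest.
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

static uint64_t fnv1a64(const uint8_t * data, size_t n, uint64_t h = 0xcbf29ce484222325ULL) {
    for (size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main() {
    // pretend these are tensor payloads read from a GGUF file
    std::vector<std::pair<std::string, std::vector<uint8_t>>> tensors = {
        {"blk.0.attn_q.weight", std::vector<uint8_t>(1024, 0x11)},
        {"blk.0.attn_k.weight", std::vector<uint8_t>(1024, 0x22)},
    };

    uint64_t model_hash = 0xcbf29ce484222325ULL;
    for (const auto & [name, data] : tensors) {
        // per-tensor digest: narrows down which tensor layer changed
        printf("%016llx  %s\n", (unsigned long long) fnv1a64(data.data(), data.size()), name.c_str());
        // whole-model digest: the quick first check across all tensor data
        model_hash = fnv1a64(data.data(), data.size(), model_hash);
    }
    printf("%016llx  (whole model)\n", (unsigned long long) model_hash);
}
```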
toyer
bfa07c7003 llama : support glm3 and glm4 (#8031)
* add chatglm3-6b model support (huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b)

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* remove .rotary_pos_emb.inv_freq and unused code for chatglm3 model

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* fix lint error

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* optimize convert-hf-to-gguf.py for chatglm model

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* support glm-4-9b-chat

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* fix eos tokens to glm4

* remove unused log

* add preprocess to chatglm3 and chatglm4

* add eos_id_list to llama.cpp

* fix code style

* fix code style

* fix conflicts

* fix conflicts

* Revert "add eos_id_list to llama.cpp"

This reverts commit 3a4d5790bf.

* set <|endoftext|> as eos and <|user|> as eot

* fix chat template bug

* add comment to glm prefix and suffix

* fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration

* fix chat template bug

* fix codestyle

* fix conflicts

* modified the general name of glm model

* fix conflicts

* remove prefix and suffix

* use normal glm4 chat template & use LLM_FFN_SWIGLU in phi3

* fix: resolve Flake8 errors in `convert-hf-to-gguf.py`

- Fix E302 by adding two blank lines before top-level function definitions
- Replace print statements to fix NP100
- Fix E303 by ensuring only one blank line between lines of code

* fix rope ratio to solve incorrect answers

* fix by comments

---------

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
Co-authored-by: XingXing Qiao <qiaoxx@dingdao.com>
Co-authored-by: Umpire2018 <138990495+Umpire2018@users.noreply.github.com>
2024-07-14 00:20:27 +08:00
Georgi Gerganov
155ec5bf82 llama : fix n_rot default (#8348)
ggml-ci
2024-07-14 00:20:27 +08:00
compilade
78706ed9a8 py : use cpu-only torch in requirements.txt (#8335) 2024-07-14 00:20:27 +08:00
standby24x7
4dff06da04 finetune: Rename command name in README.md (#8343)
Rename an old command name "finetune" to "llama-finetune"
in README.md

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-07-14 00:20:26 +08:00
standby24x7
a490e75225 finetune: Rename an old command name in finetune.sh (#8344)
This patch replaces an old command "main" with "llama-cli"
in finetune.sh.
The part that I fixed is a comment, so it doesn't change
the script's behavior.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-07-14 00:20:26 +08:00
Bjarke Viksøe
0162c9c537 server: Retrieve prompt template in /props (#8337)
* server: Retrieve prompt template in /props

This PR adds the following:
- Expose the model's Jinja2 prompt template in the /props endpoint.
- Change log-level from Error to Warning for the warning about template mismatch.

The front-end stands a better chance of actually executing the Jinja template format correctly; the server is currently just guessing it.

Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup in the "llm_load_print_meta" function.

* Make string buffer dynamic

* Add doc and better string handling

* Using chat_template naming convention

* Use intermediate vector for string assignment
2024-07-14 00:20:26 +08:00
Derrick T. Woolworth
a82ac78c6f added support for Authorization Bearer tokens when downloading model (#8307)
* added support for Authorization Bearer tokens

* removed auth_token, removed set_ function, other small fixes

* Update common/common.cpp

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-14 00:20:26 +08:00
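A minimal libcurl sketch of the idea follows; it is not the exact common.cpp change, just an illustration of attaching an `Authorization: Bearer` header to a download request (the function name and the way the token is supplied are assumptions).

```cpp
// Hedged sketch: add a Bearer token header to a libcurl download.
#include <curl/curl.h>
#include <cstdio>
#include <string>

bool download_with_bearer(const std::string & url, const std::string & token, FILE * out) {
    CURL * curl = curl_easy_init();
    if (!curl) return false;

    struct curl_slist * headers = nullptr;
    if (!token.empty()) {
        headers = curl_slist_append(headers, ("Authorization: Bearer " + token).c_str());
    }

    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);   // default write callback writes to the FILE*

    CURLcode res = curl_easy_perform(curl);

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}
```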
Xuan Son Nguyen
6234b41211 update main readme (#8333) 2024-07-14 00:20:26 +08:00
Daniel Bevenius
091a7af9fe llama : add early return for empty range (#8327)
* llama : add early return for empty range

This commit adds an early return to the llama_kv_cache_seq_add and
llama_kv_cache_seq_div functions.

The motivation for adding this is to avoid looping over the cache
when the range is empty. I ran into this when using the self-extend
feature in main.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama : add static_cast to fix CI warning/error

This commit attempts to fix the following warning/error:

```console
src/llama.cpp:7271:31: error:
comparison of integer expressions of different signedness:
‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
 7271 |                         if (i < hparams.n_layer_dense_lead) {
      |                             ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This can be reproduced locally by setting -Wsign-compare in the
Makefile.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! llama : add early return for empty range

Remove the setting of cache.head to 0 when the range is empty.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-14 00:20:26 +08:00
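The two fixes above reduce to a small, general pattern; the sketch below is a hedged, generic rendition rather than the actual llama.cpp functions: an early return before iterating when the [p0, p1) range is empty, and a static_cast so a signed index and an unsigned hparam are compared with one signedness.

```cpp
// Hedged sketch of the early-return and sign-compare fixes, in generic form.
#include <cstdint>
#include <vector>

struct cell { int32_t pos; };

void seq_add(std::vector<cell> & cells, int32_t p0, int32_t p1, int32_t delta) {
    if (p0 == p1) {
        return;                       // empty range: nothing to shift, skip the loop
    }
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1) {
            c.pos += delta;
        }
    }
}

bool is_dense_layer(int i, uint32_t n_layer_dense_lead) {
    // cast the signed loop index so -Wsign-compare stays quiet
    return static_cast<uint32_t>(i) < n_layer_dense_lead;
}
```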
jaime-m-p
0106884e98 Detokenizer fixes (#8039)
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Treat unexpected vocab type as test failure instead of error
    - Useful when automating tests:
    - If you don't know in advance the vocab type
    - Differentiate other loading errors
  - Skip unicode surrogates and undefined
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined unicode codepoints
* 'viking' detokenizer clean spaces
2024-07-14 00:20:26 +08:00
Xuan Son Nguyen
16ab65b7b9 Reorganize documentation pages (#8325)
* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
2024-07-14 00:20:26 +08:00
Georgi Gerganov
401892e563 llama : fix compile warning (#8304) 2024-07-14 00:20:26 +08:00
Natsu
c667e897e9 cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281) 2024-07-14 00:20:26 +08:00
Georgi Gerganov
c738f1bc89 convert : remove AWQ remnants (#8320) 2024-07-14 00:17:51 +08:00
Georgi Gerganov
655a624782 llama : minor indentation during tensor loading (#8304)
* llama : minor indentation during tensor loading

ggml-ci

* llama : use int for layer iterators [no ci]
2024-07-14 00:17:51 +08:00
Johannes Gäßler
1dfab16f5d CUDA: MMQ support for iq4_nl, iq4_xs (#8278) 2024-07-14 00:17:51 +08:00
Daniele
4bb7223486 CUDA: revert part of the RDNA1 optimizations (#8309)
The change to the launch_bounds was causing a small performance drop of about 25 t/s in the perplexity benchmark
2024-07-14 00:17:51 +08:00
Douglas Hanley
d49328a3bf llama : streamline embeddings from "non-embedding" models (#8087) 2024-07-14 00:17:51 +08:00
Johannes Gäßler
972fbf7fbf CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311) 2024-07-14 00:17:51 +08:00
Pieter Ouwerkerk
df8b4d8e39 readme : fix minor typos [no ci] (#8314) 2024-07-14 00:17:51 +08:00
Daniel Bevenius
53da9d276e passkey : add short intro to README.md [no-ci] (#8317)
* passkey : add short intro to README.md [no-ci]

This commit adds a short introduction to the README.md file in the
examples/passkey directory.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update examples/passkey/README.md

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-14 00:17:51 +08:00
Georgi Gerganov
f5cb88cc73 llama : prefer n_ over num_ prefix (#8308) 2024-07-14 00:17:51 +08:00
Georgi Gerganov
8696144105 contributing : update guidelines (#8316) 2024-07-14 00:17:51 +08:00
Neo Zhang
a4c8edcb67 fix for multiple cards 2024-07-14 00:15:55 +08:00
Neo Zhang
aeaed61904
Merge pull request #1 from arthw/update_warp
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266) cherry-pick b549a1bbef
2024-07-13 16:44:28 +08:00
arthw
74e3185cfd fix editorconfig check format issue 2024-07-13 16:02:15 +08:00
arthw
4cd9e48670 cherry-pick b549a1bbef,
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)
    * fix group_norm ut

    * split softmax

    * fix softmax

    * add concat support condition

    * revert debug code

    * move QK_WARP_SIZE to presets.hpp

Fix issues in the above PR:
  fix norm() nullptr leading to a crash on iGPU.
  use WARP_32_SIZE to replace QK_WARP_SIZE.
  optimize dmmv.cpp for iGPU.
  add sycl_hw.cpp to detect hardware info.
2024-07-13 14:44:38 +08:00
Georgi Gerganov
c5009e6128 py : switch to snake_case (#8305)
* py : switch to snake_case

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* cont : fix link

* gguf-py : use snake_case in scripts entrypoint export

* py : rename requirements for convert_legacy_llama.py

Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-07 21:20:52 +08:00
Xuan Son Nguyen
6d6ecd3200 cli: add EOT when user hits Ctrl+C (#8296)
* main: add need_insert_eot

* do not format system prompt if it is empty
2024-07-07 21:20:05 +08:00
Icecream95
cbfc850793 llama : add OpenELM support (#7359)
* Initial OpenELM support (270M only so far)

* Fill out missing entries in llama_model_type_name

* fixup! Initial OpenELM support (270M only so far)

Fix formatting

* llama : support all OpenELM models

* llama : add variable GQA and variable FFN sizes

Some metadata keys can now also be arrays to support setting
their value per-layer for models like OpenELM.

* llama : minor spacing changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : use std::array for per-layer hparams

* llama : fix save/load state

* llama : do not print hparams for vocab-only models

* llama : handle n_head == 0

* llama : use const ref for print_f and fix division by zero

* llama : fix t5 uses of n_head and n_ff

* llama : minor comment

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-07 21:20:05 +08:00
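A hedged sketch of the per-layer hparams idea referenced above (variable GQA and FFN sizes stored per layer via std::array); the names and the layer bound are illustrative, not llama.cpp's actual definitions.

```cpp
// Hedged sketch: per-layer hparams instead of single scalar values.
#include <array>
#include <cstdint>

constexpr size_t MAX_LAYERS = 512;   // illustrative bound, not llama.cpp's constant

struct hparams_sketch {
    uint32_t n_layer = 0;

    std::array<uint32_t, MAX_LAYERS> n_head_arr{};
    std::array<uint32_t, MAX_LAYERS> n_head_kv_arr{};
    std::array<uint32_t, MAX_LAYERS> n_ff_arr{};

    // per-layer accessors mirror the old scalar fields
    uint32_t n_head   (uint32_t il) const { return n_head_arr[il]; }
    uint32_t n_head_kv(uint32_t il) const { return n_head_kv_arr[il]; }
    uint32_t n_ff     (uint32_t il) const { return n_ff_arr[il]; }

    // variable grouped-query attention: heads per KV head can differ per layer
    uint32_t n_gqa(uint32_t il) const {
        const uint32_t kv = n_head_kv(il);
        return kv == 0 ? 0 : n_head(il) / kv;   // guard the n_head_kv == 0 case
    }
};
```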
Daniel Bevenius
63c6e90eab tokenize : add --show-count (token) option (#8299)
This commit adds a new option to the tokenize example, --show-count.
When this is set, the total number of tokens is printed to stdout.

This was added as an option as I was concerned that there might be
scripts that use the output from this program and it might be better to
not print this information by default.

The motivation for this is that it can be useful to find out how many
tokens a file contains, for example when trying to determine prompt
input file sizes for testing.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-07 21:20:05 +08:00