Commit graph

3391 commits

compilade
4e3d43f66b llama : fix pre-tokenization of non-special added tokens (#8228)
* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There is a quirk in the HF tokenizer for Gemma: it prefers the
16-space token over longer space tokens, while the SentencePiece
tokenizer does not do this.
(the implementation in llama.cpp behaves the same as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with #8379

* test-tokenizer-random : add a failing edge case for falcon
2024-07-27 21:23:09 +08:00
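The core technique in the commit above is to split raw text on user-defined token strings before the regex pre-tokenizer ever runs, so those tokens can never be broken apart. A minimal, self-contained sketch of that idea; the `fragment`/`split_on_token` names are illustrative, not the actual llama.cpp internals:

```cpp
#include <string>
#include <vector>

// A raw-text fragment: either still pending pre-tokenization, or an
// already-matched user-defined token that must not be split further.
struct fragment { std::string text; bool is_token; };

// Split every pending fragment on one user-defined token string. Running this
// once per user-defined token, before the regex pre-tokenizer, guarantees the
// regex never sees (and never splits) those tokens.
static std::vector<fragment> split_on_token(const std::vector<fragment> & in,
                                            const std::string & tok) {
    std::vector<fragment> out;
    for (const fragment & f : in) {
        if (f.is_token) { out.push_back(f); continue; }
        size_t pos = 0, hit;
        while ((hit = f.text.find(tok, pos)) != std::string::npos) {
            if (hit > pos) out.push_back({f.text.substr(pos, hit - pos), false});
            out.push_back({tok, true});
            pos = hit + tok.size();
        }
        if (pos < f.text.size()) out.push_back({f.text.substr(pos), false});
    }
    return out;
}
```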
bandoti
08bd5616c1 vulkan : cmake integration (#8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-07-27 21:23:09 +08:00
Georgi Gerganov
2aa671745c metal : template-ify some of the kernels (#8447)
ggml-ci
2024-07-27 21:23:09 +08:00
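"Template-ify" here means collapsing near-identical per-type copies of a kernel into one template. A C++ analogy (Metal Shading Language shares this template syntax; the kernel below is invented for illustration, it is not one of the ggml-metal kernels):

```cpp
// One templated kernel replaces separate per-type copies of the same code.
template <typename T>
void kernel_scale(const T * src, T * dst, int n, float s) {
    for (int i = 0; i < n; ++i) {
        dst[i] = static_cast<T>(src[i] * s);
    }
}

// explicit instantiations pick the concrete element types:
template void kernel_scale<float>(const float *, float *, int, float);
template void kernel_scale<double>(const double *, double *, int, float);
```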
arthw
a364ec78f3 fix UT of concat 2024-07-14 11:07:56 +08:00
Neo Zhang
e700d37f68 mv softmax to separated file 2024-07-14 01:02:58 +08:00
Georgi Gerganov
07d457b83f server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-14 00:28:26 +08:00
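For context: OpenAI-style chat messages allow `content` to be either a plain string or an array of typed parts, and the commit above makes the server accept both. A hedged sketch of the idea using nlohmann::json (which the server already depends on); the helper name is illustrative:

```cpp
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Accept both `"content": "hi"` and
// `"content": [{"type": "text", "text": "hi"}, ...]`,
// concatenating the text parts into one string.
static std::string content_to_string(const json & content) {
    if (content.is_string()) {
        return content.get<std::string>();
    }
    std::string out;
    for (const json & part : content) {
        if (part.contains("text")) {
            out += part.at("text").get<std::string>();
        }
    }
    return out;
}
```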
Georgi Gerganov
21825798c2 main : print error on empty input (#8456) 2024-07-14 00:28:26 +08:00
Daniel Bevenius
318d950e79 llama : suppress unary minus operator warning (#8448)
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this is that currently the following warning is
generated on Windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
2024-07-14 00:28:26 +08:00
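The warning and its fix in a nutshell (standalone example, not the actual llama.cpp code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t u = 5;
    // warning C4146 on MSVC: the unary minus is applied to the unsigned
    // value first, so the result is still unsigned:
    //     int32_t bad = (int32_t) -u;
    // casting first and negating after avoids the warning:
    int32_t good = -(int32_t) u;
    printf("%d\n", good); // prints -5
    return 0;
}
```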
Douglas Hanley
0a7d1bf5de server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-14 00:28:26 +08:00
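A minimal sketch of the batching rule described above: flush whenever the next request's embedding flag differs, so no batch ever mixes the two modes. All names are invented for illustration and this is not the server's actual scheduling code:

```cpp
#include <vector>

struct request { int id; bool embed; };

// Submit pending requests in batches that are all-embedding or all-completion.
template <typename Flush>
void schedule(const std::vector<request> & pending, Flush flush) {
    std::vector<request> batch;
    for (const request & r : pending) {
        if (!batch.empty() && batch.front().embed != r.embed) {
            flush(batch); // mode changes: submit the homogeneous batch first
            batch.clear();
        }
        batch.push_back(r);
    }
    if (!batch.empty()) {
        flush(batch);
    }
}
```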
Armen Kaleshian
3ebd51fcad docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
2024-07-14 00:28:26 +08:00
Jiří Podivín
757ae96e5d convert : remove fsep token from GPTRefactForCausalLM (#8237)
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-14 00:28:26 +08:00
Georgi Gerganov
e0916db972 examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
2024-07-14 00:28:26 +08:00
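Both changes from the commit above, shown in isolation (the buffer and format string are illustrative):

```cpp
#include <cstdio>

int main() {
    char buf[64];
    const int id = 42;
    // sprintf(buf, "token_%d", id);            // no bound: can overflow buf
    snprintf(buf, sizeof(buf), "token_%d", id); // bounded by sizeof(buf),
                                                // not a hardcoded constant
    puts(buf);
    return 0;
}
```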
Georgi Gerganov
f6786401d2 ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-14 00:28:26 +08:00
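For readers unfamiliar with PRId64: it is the portable printf format macro for int64_t from <cinttypes>, for example:

```cpp
#include <cinttypes>
#include <cstdio>

int main() {
    int64_t n = 1234567890123LL;
    // "%ld" assumes long is 64-bit, which is false on e.g. 64-bit Windows;
    // PRId64 expands to the correct conversion specifier on every platform
    printf("n = %" PRId64 "\n", n);
    return 0;
}
```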
Chen Xi
fa700d1a84 [SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-14 00:28:26 +08:00
Nicholai Tukanov
b4caa00c7c ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-07-14 00:28:26 +08:00
Daniel Bevenius
a5e36a3518 cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-14 00:28:26 +08:00
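A standalone illustration of the 'noreturn' issue (not the actual common.cuh; __trap is a CUDA intrinsic, so abort() stands in here):

```cpp
#include <cstdio>
#include <cstdlib>

// A [[noreturn]] function must provably never return, otherwise compilers
// such as clang emit -Winvalid-noreturn at its closing brace. Ending it with
// abort(), an infinite loop, or a trap intrinsic (as in the commit above)
// satisfies the checker.
[[noreturn]] static void no_device_code() {
    fprintf(stderr, "no device code available for this architecture\n");
    abort();
}

int main() {
    no_device_code();
}
```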
Johannes Gäßler
6a9dcf01ad CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-14 00:28:26 +08:00
Georgi Gerganov
8c88cd899b gitignore : deprecated binaries 2024-07-14 00:28:26 +08:00
compilade
4e4205aa6f tokenize : add --no-parse-special option (#8423)
This should make it easier to explain
how parse_special affects tokenization.
2024-07-14 00:28:26 +08:00
Georgi Gerganov
2ed5fd58b5 llama : use F32 precision in Qwen2 attention and no FA (#8412) 2024-07-14 00:28:26 +08:00
Clint Herron
86ced79ae6 Initialize default slot sampling parameters from the global context. (#8418) 2024-07-14 00:28:26 +08:00
Clint Herron
2f027bcb15 Name Migration: Build the deprecation-warning 'main' binary every time (#8404)
* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This is to help users of tutorials and other instruction sets know what to do when the 'main' binary is missing and they are trying to follow instructions.

* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
2024-07-14 00:28:26 +08:00
AidanBeltonS
35b1aff5cf [SYCL] Use multi_ptr to clean up deprecated warnings (#8256) 2024-07-14 00:28:18 +08:00
Georgi Gerganov
e78fa06f3d ggml : move sgemm sources to llamafile subfolder (#8394)
ggml-ci
2024-07-14 00:23:01 +08:00
Dibakar Gope
528f58ff8d ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
2024-07-14 00:23:01 +08:00
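For context on the "reference scalar gemm and gemv" mentioned above, here is a hedged sketch of a scalar q4_0 x q8_0 dot product. The real ggml blocks store the scale as fp16; float is used here only to keep the example self-contained:

```cpp
#include <cstdint>

// 32 weights per block: a scale plus 16 bytes holding two 4-bit values each
// (float scale here is a simplification of the real fp16 field).
struct block_q4_0 { float d; uint8_t qs[16]; };
// 32 activations per block: a scale plus 32 signed 8-bit values.
struct block_q8_0 { float d; int8_t  qs[32]; };

static float vec_dot_q4_0_q8_0(int nblocks,
                               const block_q4_0 * x, const block_q8_0 * y) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32_t acc = 0;
        for (int j = 0; j < 16; ++j) {
            // q4_0 stores unsigned nibbles with an implicit offset of 8;
            // low nibbles hold elements 0..15, high nibbles elements 16..31
            const int v0 = (x[i].qs[j] & 0x0F) - 8;
            const int v1 = (x[i].qs[j] >> 4)   - 8;
            acc += v0 * y[i].qs[j] + v1 * y[i].qs[j + 16];
        }
        sum += x[i].d * y[i].d * acc;
    }
    return sum;
}
```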
M. Yusuf Sarıgöz
04ba8fca3e gguf-py rel pipeline (#8410)
* Upd gguf-py/readme

* Bump patch version for release
2024-07-14 00:23:01 +08:00
Borislav Stanimirov
224090c64e llama : C++20 compatibility for u8 strings (#8408) 2024-07-14 00:23:01 +08:00
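The underlying incompatibility: in C++20, u8 string literals changed type from const char[] to const char8_t[], so code that assigned them to char pointers stops compiling. The usual fix is an explicit cast, guarded by the char8_t feature-test macro (an illustration, not the exact patch):

```cpp
#include <cstdio>

int main() {
    // C++17: u8"..." is const char[];  C++20: it is const char8_t[]
#if defined(__cpp_char8_t)
    const char * s = reinterpret_cast<const char *>(u8"\u00e9"); // é as UTF-8
#else
    const char * s = u8"\u00e9";
#endif
    printf("%s\n", s);
    return 0;
}
```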
Borislav Stanimirov
35f85f71e5 msvc : silence codecvt c++17 deprecation warnings (#8395) 2024-07-14 00:23:01 +08:00
fairydreaming
f4e68cd731 llama : add assert about missing llama_encode() call (#8400)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-07-14 00:23:01 +08:00
RunningLeon
0464524ddd py : fix converter for internlm2 (#8321)
* update internlm2

* remove unused file

* fix lint
2024-07-14 00:23:01 +08:00
laik
eb16c41949 py : fix extra space in convert_hf_to_gguf.py (#8407) 2024-07-14 00:23:01 +08:00
Clint Herron
ae3a78ad34 Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-14 00:23:01 +08:00
Andy Salerno
8af17465a9 Update README.md to fix broken link to docs (#8399)
Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
2024-07-14 00:23:01 +08:00
Clint Herron
0e6506aeb0 Deprecation warning to assist with migration to new binary names (#8283)
* Adding a simple program that provides a deprecation warning, to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
2024-07-14 00:22:58 +08:00
Johannes Gäßler
c7d621d0da make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392) 2024-07-14 00:21:54 +08:00
Borislav Stanimirov
5c10e23a80 cmake : allow external ggml (#8370) 2024-07-14 00:20:27 +08:00
daghanerdonmez
1052802685 readme : fix typo [no ci] (#8389)
Bakus-Naur --> Backus-Naur
2024-07-14 00:20:27 +08:00
compilade
c380b899e5 gguf-py : do not use internal numpy types (#7472) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
9ad5bcaad3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
7a8fa37316 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-14 00:20:27 +08:00
b4b4o
790e9b2a0e readme : fix web link error [no ci] (#8347) 2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
a7d7781692 sycl : fix powf call in device code (#8368) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
86d41e6e1c scripts : fix sync for sycl 2024-07-14 00:20:27 +08:00
Georgi Gerganov
a5038fc736 sync : ggml
ggml-ci
2024-07-14 00:20:27 +08:00
Georgi Gerganov
8ab505a2e9 tests : fix whitespace (#0) 2024-07-14 00:20:27 +08:00
John Balis
fec49428a6 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-14 00:20:27 +08:00
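For reference, the operation being implemented: a 1-D transposed convolution scatters each input element as a scaled copy of the kernel into the output. A minimal CPU version (single channel, unit dilation, no padding; a sketch, not the ggml code):

```cpp
#include <vector>

// Output length is (n_in - 1) * stride + n_kernel; assumes x is non-empty.
std::vector<float> conv_transpose_1d(const std::vector<float> & x,
                                     const std::vector<float> & k, int stride) {
    std::vector<float> y((x.size() - 1) * stride + k.size(), 0.0f);
    for (size_t i = 0; i < x.size(); ++i) {
        for (size_t j = 0; j < k.size(); ++j) {
            y[i * stride + j] += x[i] * k[j]; // each input scatters a scaled kernel
        }
    }
    return y;
}
```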
Kevin Wang
9ff6a62845 common : preallocate sampling token data vector (#8363)
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change improving the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance, which has a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
2024-07-14 00:20:27 +08:00
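The change described above, as a minimal before/after sketch; the struct fields mirror llama_token_data, but this is illustrative rather than the actual code:

```cpp
#include <vector>

struct token_data { int id; float logit; float p; };

std::vector<token_data> build_candidates(const float * logits, int n_vocab) {
    std::vector<token_data> cur;
    // before: no preallocation, emplace_back called n_vocab times
    // after: size the vector once, then write each element in place
    cur.resize(n_vocab);
    for (int id = 0; id < n_vocab; ++id) {
        cur[id] = token_data{ id, logits[id], 0.0f };
    }
    return cur;
}
```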
Georgi Gerganov
da09d77524 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-14 00:20:27 +08:00
Kevin Wang
6e022a225a common : avoid unnecessary logits fetch (#8358) 2024-07-14 00:20:27 +08:00
toyer
68d1711f73 readme : add supported glm models (#8360) 2024-07-14 00:20:27 +08:00