Commit graph

3436 commits

Author SHA1 Message Date
Georgi Gerganov
07d457b83f server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
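
For context, a hedged sketch of the request shape this adds support for: in the OpenAI-style chat API, `content` may be an array of typed parts instead of a plain string (field names follow the OpenAI convention; the exact subset the server accepts lives in examples/server/utils.hpp):

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize this file" },
        { "type": "text", "text": "in one sentence." }
      ]
    }
  ]
}
```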
2024-07-14 00:28:26 +08:00
Georgi Gerganov
21825798c2 main : print error on empty input (#8456) 2024-07-14 00:28:26 +08:00
Daniel Bevenius
318d950e79 llama : suppress unary minus operator warning (#8448)
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this is that currently the following warning is
generated on Windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
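
A minimal sketch of the pattern behind the fix (names are illustrative, not the actual llama.cpp code):

```cpp
#include <cstdint>
#include <cstddef>

int32_t neg_index(size_t pos) {
    // before: minus applied to the unsigned value, then cast -- MSVC warns
    //   return static_cast<int32_t>(-pos);   // warning C4146
    // after: cast to int32_t first, then apply the unary minus
    return -static_cast<int32_t>(pos);
}
```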
2024-07-14 00:28:26 +08:00
Douglas Hanley
0a7d1bf5de server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-14 00:28:26 +08:00
Armen Kaleshian
3ebd51fcad docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
2024-07-14 00:28:26 +08:00
Jiří Podivín
757ae96e5d convert : remove fsep token from GPTRefactForCausalLM (#8237)
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-14 00:28:26 +08:00
Georgi Gerganov
e0916db972 examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
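
The general pattern, as a hedged sketch (not the actual call sites):

```cpp
#include <cstdio>

void write_label(int n_tokens) {
    char buf[64];
    // before: sprintf(buf, "n_tokens = %d", n_tokens);  // no bounds check
    // after: snprintf bounds the write, and sizeof(buf) keeps the bound in
    // sync with the array instead of repeating a hardcoded constant
    snprintf(buf, sizeof(buf), "n_tokens = %d", n_tokens);
}
```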
2024-07-14 00:28:26 +08:00
Georgi Gerganov
f6786401d2 ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-14 00:28:26 +08:00
Chen Xi
fa700d1a84 [SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-14 00:28:26 +08:00
Nicholai Tukanov
b4caa00c7c ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
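
A hedged build sketch (GGML_BLAS and GGML_BLAS_VENDOR are the existing ggml BLAS switches; NVPL as a vendor value is what this commit adds, assuming it plugs into that mechanism):

```console
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=NVPL
cmake --build build --config Release
```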
2024-07-14 00:28:26 +08:00
Daniel Bevenius
a5e36a3518 cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
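
A hedged sketch of the idea, in host C++ with illustrative names (the real code is the __trap macro in ggml/src/ggml-cuda/common.cuh):

```cpp
#include <cstdlib>

// if the trap macro expands to a call the compiler knows cannot return --
// abort() is itself noreturn -- then the function body below provably never
// returns, and -Winvalid-noreturn stays quiet without an artificial loop
#define trap_sketch() (abort())

[[noreturn]] static void no_device_code_sketch() {
    trap_sketch(); // were this call able to return, the compiler would warn
                   // at the closing brace that a 'noreturn' function returns
}
```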
2024-07-14 00:28:26 +08:00
Johannes Gäßler
6a9dcf01ad CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-14 00:28:26 +08:00
Georgi Gerganov
8c88cd899b gitignore : deprecated binaries 2024-07-14 00:28:26 +08:00
compilade
4e4205aa6f tokenize : add --no-parse-special option (#8423)
This should make it easier to explain
how parse_special affects tokenization.
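
A hedged usage sketch (--no-parse-special is from this commit; the -m/-p spellings are assumptions about the tool's other flags):

```console
# default: "<s>" is parsed as one special token
./llama-tokenize -m model.gguf -p "<s> hello"
# with the new flag, "<s>" is tokenized as plain text
./llama-tokenize -m model.gguf -p "<s> hello" --no-parse-special
```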
2024-07-14 00:28:26 +08:00
Georgi Gerganov
2ed5fd58b5 llama : use F32 precision in Qwen2 attention and no FA (#8412) 2024-07-14 00:28:26 +08:00
Clint Herron
86ced79ae6 Initialize default slot sampling parameters from the global context. (#8418) 2024-07-14 00:28:26 +08:00
Clint Herron
2f027bcb15 Name Migration: Build the deprecation-warning 'main' binary every time (#8404)
* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This helps users following tutorials and other instruction sets know what to do when the 'main' binary is missing.

* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
2024-07-14 00:28:26 +08:00
AidanBeltonS
35b1aff5cf [SYCL] Use multi_ptr to clean up deprecated warnings (#8256) 2024-07-14 00:28:18 +08:00
Georgi Gerganov
e78fa06f3d ggml : move sgemm sources to llamafile subfolder (#8394)
ggml-ci
2024-07-14 00:23:01 +08:00
Dibakar Gope
528f58ff8d ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0 and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile-time flags for building the Q4_0_4_4 quant type
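
A hedged usage sketch for the new types (the argument order follows the usual llama-quantize convention; file names are illustrative):

```console
# requantize an f16 GGUF into the AArch64-optimized Q4_0_4_4 layout
./llama-quantize model-f16.gguf model-q4_0_4_4.gguf Q4_0_4_4
```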
2024-07-14 00:23:01 +08:00
M. Yusuf Sarıgöz
04ba8fca3e gguf-py rel pipeline (#8410)
* Upd gguf-py/readme

* Bump patch version for release
2024-07-14 00:23:01 +08:00
Borislav Stanimirov
224090c64e llama : C++20 compatibility for u8 strings (#8408) 2024-07-14 00:23:01 +08:00
Borislav Stanimirov
35f85f71e5 msvc : silence codecvt c++17 deprecation warnings (#8395) 2024-07-14 00:23:01 +08:00
fairydreaming
f4e68cd731 llama : add assert about missing llama_encode() call (#8400)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
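
For context, a minimal hedged sketch of the call the assert guards (batch construction and error handling elided):

```cpp
// encoder-decoder models (e.g. T5) must run the encoder pass first; the new
// assert fires when decoding starts without it
if (llama_encode(ctx, batch) != 0) {
    // handle encoder failure
}
// ... only then run llama_decode() for token generation
```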
2024-07-14 00:23:01 +08:00
RunningLeon
0464524ddd py : fix converter for internlm2 (#8321)
* update internlm2

* remove unused file

* fix lint
2024-07-14 00:23:01 +08:00
laik
eb16c41949 py : fix extra space in convert_hf_to_gguf.py (#8407) 2024-07-14 00:23:01 +08:00
Clint Herron
ae3a78ad34 Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-14 00:23:01 +08:00
Andy Salerno
8af17465a9 Update README.md to fix broken link to docs (#8399)
Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
2024-07-14 00:23:01 +08:00
Clint Herron
0e6506aeb0 Deprecation warning to assist with migration to new binary names (#8283)
* Adding a simple program that provides a deprecation warning, to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
2024-07-14 00:22:58 +08:00
Johannes Gäßler
c7d621d0da make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392) 2024-07-14 00:21:54 +08:00
Borislav Stanimirov
5c10e23a80 cmake : allow external ggml (#8370) 2024-07-14 00:20:27 +08:00
daghanerdonmez
1052802685 readme : fix typo [no ci] (#8389)
Bakus-Naur --> Backus-Naur
2024-07-14 00:20:27 +08:00
compilade
c380b899e5 gguf-py : do not use internal numpy types (#7472) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
9ad5bcaad3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
7a8fa37316 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-14 00:20:27 +08:00
b4b4o
790e9b2a0e readme : fix web link error [no ci] (#8347) 2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
a7d7781692 sycl : fix powf call in device code (#8368) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
86d41e6e1c scripts : fix sync for sycl 2024-07-14 00:20:27 +08:00
Georgi Gerganov
a5038fc736 sync : ggml
ggml-ci
2024-07-14 00:20:27 +08:00
Georgi Gerganov
8ab505a2e9 tests : fix whitespace (#0) 2024-07-14 00:20:27 +08:00
John Balis
fec49428a6 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed hardcoded CUDA usage

* restored test-conv-transpose.c

* removed unused arguments, and fixed a bug where a test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
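
For reference, a hedged scalar sketch of what 1-D transposed convolution computes for a single channel (the textbook definition, not the ggml/CUDA code itself):

```cpp
#include <vector>

std::vector<float> conv_transpose_1d_ref(const std::vector<float> & x,
                                         const std::vector<float> & k,
                                         int stride) {
    // output length: (n_in - 1) * stride + n_kernel; assumes non-empty input
    std::vector<float> y((x.size() - 1) * stride + k.size(), 0.0f);
    for (size_t i = 0; i < x.size(); ++i) {
        for (size_t j = 0; j < k.size(); ++j) {
            // each input element scatters a scaled copy of the kernel
            y[i * stride + j] += x[i] * k[j];
        }
    }
    return y;
}
```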
2024-07-14 00:20:27 +08:00
Kevin Wang
9ff6a62845 common : preallocate sampling token data vector (#8363)
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change improving the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance, which has a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
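
The general pattern, as a hedged sketch (struct and variable names are illustrative, not the exact llama.cpp ones):

```cpp
#include <vector>

struct token_data { int id; float logit; };

void fill_candidates(std::vector<token_data> & cur,
                     const float * logits, int n_vocab) {
    // before: cur.emplace_back(...) per token, reallocating as the vector grows
    // after: size the vector once, then write each slot directly
    cur.resize(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        cur[i] = token_data{i, logits[i]};
    }
}
```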
2024-07-14 00:20:27 +08:00
Georgi Gerganov
da09d77524 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-14 00:20:27 +08:00
Kevin Wang
6e022a225a common : avoid unnecessary logits fetch (#8358) 2024-07-14 00:20:27 +08:00
toyer
68d1711f73 readme : add supported glm models (#8360) 2024-07-14 00:20:27 +08:00
compilade
df044303f3 py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
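
The last point boils down to an invocation along these lines (a sketch; the exact command is in the workflow file):

```console
# report only warnings and errors, hiding "information"-level entries
pyright --level warning
```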
2024-07-14 00:20:27 +08:00
Denis Spasyuk
b775ea0e75 Update llama-cli documentation (#8315)
* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main and templates in some commands; added chat template sections and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md
2024-07-14 00:20:27 +08:00
Alex Tuddenham
9ee7bf007d ci : add checks for cmake,make and ctest in ci/run.sh (#8200)
* Added checks for cmake,make and ctest

* Removed erroneous whitespace
2024-07-14 00:20:27 +08:00
Andy Tai
c695235193 readme : update bindings list (#8222)
* adding guile_llama_cpp to the bindings list

* fix formatting

* fix formatting
2024-07-14 00:20:27 +08:00
Brian
305b9d8892 gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
CLI to hash GGUF files to detect differences at a per-model and per-tensor level.

The hash type we support is:

- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash-checking programs like sha256sum, they
are designed to check entire files. This is not ideal for our purpose if we want
to check the consistency of the tensor data even when the metadata content of the
gguf KV store has been updated.

This program is designed to hash a gguf tensor payload on a 'per tensor layer'
basis in addition to an 'entire tensor model' hash. The intent is that the
whole-model hash can be checked first; if any inconsistency is detected, the
per-tensor hashes can be used to narrow down the specific tensor layer that
differs.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
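
A hedged usage sketch (the llama-gguf-hash binary name assumes the llama- prefix convention from #7809; the flags are the ones listed above):

```console
# whole-model and per-tensor xxhash (the default mode)
./llama-gguf-hash --xxh64 model.gguf
# or a stronger digest
./llama-gguf-hash --sha256 model.gguf
```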
2024-07-14 00:20:27 +08:00