Commit graph

3436 commits

Author SHA1 Message Date
Georgi Gerganov
07d457b83f server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
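
For context, a hedged sketch of the request shape this adds support for: in the OpenAI-style chat API, `content` may be an array of typed parts instead of a plain string (field names follow the OpenAI convention; the exact subset the server accepts lives in examples/server/utils.hpp):

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Summarize this file" },
        { "type": "text", "text": "in one sentence." }
      ]
    }
  ]
}
```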
2024-07-14 00:28:26 +08:00
Georgi Gerganov
21825798c2 main : print error on empty input (#8456) 2024-07-14 00:28:26 +08:00
Daniel Bevenius
318d950e79 llama : suppress unary minus operator warning (#8448)
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this is that currently the following warning is
generated on Windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
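
A minimal sketch of the pattern behind the fix (names are illustrative, not the actual llama.cpp code):

```cpp
#include <cstdint>
#include <cstddef>

int32_t neg_index(size_t pos) {
    // before: minus applied to the unsigned value, then cast -- MSVC warns
    //   return static_cast<int32_t>(-pos);   // warning C4146
    // after: cast to int32_t first, then apply the unary minus
    return -static_cast<int32_t>(pos);
}
```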
2024-07-14 00:28:26 +08:00
Douglas Hanley
0a7d1bf5de server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-14 00:28:26 +08:00
Armen Kaleshian
3ebd51fcad docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
2024-07-14 00:28:26 +08:00
Jiří Podivín
757ae96e5d convert : remove fsep token from GPTRefactForCausalLM (#8237)
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-14 00:28:26 +08:00
Georgi Gerganov
e0916db972 examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
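
The general pattern, as a hedged sketch (not the actual call sites):

```cpp
#include <cstdio>

void write_label(int n_tokens) {
    char buf[64];
    // before: sprintf(buf, "n_tokens = %d", n_tokens);  // no bounds check
    // after: snprintf bounds the write, and sizeof(buf) keeps the bound in
    // sync with the array instead of repeating a hardcoded constant
    snprintf(buf, sizeof(buf), "n_tokens = %d", n_tokens);
}
```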
2024-07-14 00:28:26 +08:00
Georgi Gerganov
f6786401d2 ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-14 00:28:26 +08:00
Chen Xi
fa700d1a84 [SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-14 00:28:26 +08:00
Nicholai Tukanov
b4caa00c7c ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
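
A hedged build sketch (GGML_BLAS and GGML_BLAS_VENDOR are the existing ggml BLAS switches; NVPL as a vendor value is what this commit adds, assuming it plugs into that mechanism):

```console
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=NVPL
cmake --build build --config Release
```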
2024-07-14 00:28:26 +08:00
Daniel Bevenius
a5e36a3518 cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
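
A hedged sketch of the idea, in host C++ with illustrative names (the real code is the __trap macro in ggml/src/ggml-cuda/common.cuh):

```cpp
#include <cstdlib>

// if the trap macro expands to a call the compiler knows cannot return --
// abort() is itself noreturn -- then the function body below provably never
// returns, and -Winvalid-noreturn stays quiet without an artificial loop
#define trap_sketch() (abort())

[[noreturn]] static void no_device_code_sketch() {
    trap_sketch(); // were this call able to return, the compiler would warn
                   // at the closing brace that a 'noreturn' function returns
}
```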
2024-07-14 00:28:26 +08:00
Johannes Gäßler
6a9dcf01ad CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-14 00:28:26 +08:00
Georgi Gerganov
8c88cd899b gitignore : deprecated binaries 2024-07-14 00:28:26 +08:00
compilade
4e4205aa6f tokenize : add --no-parse-special option (#8423)
This should make it easier to explain
how parse_special affects tokenization.
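
A hedged usage sketch (--no-parse-special is from this commit; the -m/-p spellings are assumptions about the tool's other flags):

```console
# default: "<s>" is parsed as one special token
./llama-tokenize -m model.gguf -p "<s> hello"
# with the new flag, "<s>" is tokenized as plain text
./llama-tokenize -m model.gguf -p "<s> hello" --no-parse-special
```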
2024-07-14 00:28:26 +08:00
Georgi Gerganov
2ed5fd58b5 llama : use F32 precision in Qwen2 attention and no FA (#8412) 2024-07-14 00:28:26 +08:00
Clint Herron
86ced79ae6 Initialize default slot sampling parameters from the global context. (#8418) 2024-07-14 00:28:26 +08:00
Clint Herron
2f027bcb15 Name Migration: Build the deprecation-warning 'main' binary every time (#8404)
* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This helps users following tutorials and other instruction sets know what to do when the 'main' binary is missing.

* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
2024-07-14 00:28:26 +08:00
AidanBeltonS
35b1aff5cf [SYCL] Use multi_ptr to clean up deprecated warnings (#8256) 2024-07-14 00:28:18 +08:00
Georgi Gerganov
e78fa06f3d ggml : move sgemm sources to llamafile subfolder (#8394)
ggml-ci
2024-07-14 00:23:01 +08:00
Dibakar Gope
528f58ff8d ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0 and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile-time flags for building the Q4_0_4_4 quant type
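
A hedged usage sketch for the new types (the argument order follows the usual llama-quantize convention; file names are illustrative):

```console
# requantize an f16 GGUF into the AArch64-optimized Q4_0_4_4 layout
./llama-quantize model-f16.gguf model-q4_0_4_4.gguf Q4_0_4_4
```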
2024-07-14 00:23:01 +08:00
M. Yusuf Sarıgöz
04ba8fca3e gguf-py rel pipeline (#8410)
* Upd gguf-py/readme

* Bump patch version for release
2024-07-14 00:23:01 +08:00
Borislav Stanimirov
224090c64e llama : C++20 compatibility for u8 strings (#8408) 2024-07-14 00:23:01 +08:00
Borislav Stanimirov
35f85f71e5 msvc : silence codecvt c++17 deprecation warnings (#8395) 2024-07-14 00:23:01 +08:00
fairydreaming
f4e68cd731 llama : add assert about missing llama_encode() call (#8400)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
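
For context, a minimal hedged sketch of the call the assert guards (batch construction and error handling elided):

```cpp
// encoder-decoder models (e.g. T5) must run the encoder pass first; the new
// assert fires when decoding starts without it
if (llama_encode(ctx, batch) != 0) {
    // handle encoder failure
}
// ... only then run llama_decode() for token generation
```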
2024-07-14 00:23:01 +08:00
RunningLeon
0464524ddd py : fix converter for internlm2 (#8321)
* update internlm2

* remove unused file

* fix lint
2024-07-14 00:23:01 +08:00
laik
eb16c41949 py : fix extra space in convert_hf_to_gguf.py (#8407) 2024-07-14 00:23:01 +08:00
Clint Herron
ae3a78ad34 Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-14 00:23:01 +08:00
Andy Salerno
8af17465a9 Update README.md to fix broken link to docs (#8399)
Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
2024-07-14 00:23:01 +08:00
Clint Herron
0e6506aeb0 Deprecation warning to assist with migration to new binary names (#8283)
* Adding a simple program that provides a deprecation warning, to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
2024-07-14 00:22:58 +08:00
Johannes Gäßler
c7d621d0da make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392) 2024-07-14 00:21:54 +08:00
Borislav Stanimirov
5c10e23a80 cmake : allow external ggml (#8370) 2024-07-14 00:20:27 +08:00
daghanerdonmez
1052802685 readme : fix typo [no ci] (#8389)
Bakus-Naur --> Backus-Naur
2024-07-14 00:20:27 +08:00
compilade
c380b899e5 gguf-py : do not use internal numpy types (#7472) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
9ad5bcaad3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
7a8fa37316 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-14 00:20:27 +08:00
b4b4o
790e9b2a0e readme : fix web link error [no ci] (#8347) 2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
a7d7781692 sycl : fix powf call in device code (#8368) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
86d41e6e1c scripts : fix sync for sycl 2024-07-14 00:20:27 +08:00
Georgi Gerganov
a5038fc736 sync : ggml
ggml-ci
2024-07-14 00:20:27 +08:00
Georgi Gerganov
8ab505a2e9 tests : fix whitespace (#0) 2024-07-14 00:20:27 +08:00
John Balis
fec49428a6 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed hardcoded CUDA usage

* restored test-conv-transpose.c

* removed unused arguments, and fixed a bug where a test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
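
For reference, a hedged scalar sketch of what 1-D transposed convolution computes for a single channel (the textbook definition, not the ggml/CUDA code itself):

```cpp
#include <vector>

std::vector<float> conv_transpose_1d_ref(const std::vector<float> & x,
                                         const std::vector<float> & k,
                                         int stride) {
    // output length: (n_in - 1) * stride + n_kernel; assumes non-empty input
    std::vector<float> y((x.size() - 1) * stride + k.size(), 0.0f);
    for (size_t i = 0; i < x.size(); ++i) {
        for (size_t j = 0; j < k.size(); ++j) {
            // each input element scatters a scaled copy of the kernel
            y[i * stride + j] += x[i] * k[j];
        }
    }
    return y;
}
```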
2024-07-14 00:20:27 +08:00
Kevin Wang
9ff6a62845 common : preallocate sampling token data vector (#8363)
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change improving the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance, which has a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
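
The general pattern, as a hedged sketch (struct and variable names are illustrative, not the exact llama.cpp ones):

```cpp
#include <vector>

struct token_data { int id; float logit; };

void fill_candidates(std::vector<token_data> & cur,
                     const float * logits, int n_vocab) {
    // before: cur.emplace_back(...) per token, reallocating as the vector grows
    // after: size the vector once, then write each slot directly
    cur.resize(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        cur[i] = token_data{i, logits[i]};
    }
}
```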
2024-07-14 00:20:27 +08:00
Georgi Gerganov
da09d77524 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-14 00:20:27 +08:00
Kevin Wang
6e022a225a common : avoid unnecessary logits fetch (#8358) 2024-07-14 00:20:27 +08:00
toyer
68d1711f73 readme : add supported glm models (#8360) 2024-07-14 00:20:27 +08:00
compilade
df044303f3 py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
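
The last point boils down to an invocation along these lines (a sketch; the exact command is in the workflow file):

```console
# report only warnings and errors, hiding "information"-level entries
pyright --level warning
```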
2024-07-14 00:20:27 +08:00
Denis Spasyuk
b775ea0e75 Update llama-cli documentation (#8315)
* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main and templates in some commands; added chat template sections and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md
2024-07-14 00:20:27 +08:00
Alex Tuddenham
9ee7bf007d ci : add checks for cmake,make and ctest in ci/run.sh (#8200)
* Added checks for cmake,make and ctest

* Removed erroneous whitespace
2024-07-14 00:20:27 +08:00
Andy Tai
c695235193 readme : update bindings list (#8222)
* adding guile_llama_cpp to the bindings list

* fix formatting

* fix formatting
2024-07-14 00:20:27 +08:00
Brian
305b9d8892 gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
CLI to hash GGUF files to detect differences at a per-model and per-tensor level.

The hash type we support is:

- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash-checking programs like sha256sum, they
are designed to check entire files. This is not ideal for our purpose if we want
to check the consistency of the tensor data even when the metadata content of the
gguf KV store has been updated.

This program is designed to hash a gguf tensor payload on a 'per tensor layer'
basis in addition to an 'entire tensor model' hash. The intent is that the
whole-model hash can be checked first; if any inconsistency is detected, the
per-tensor hashes can be used to narrow down the specific tensor layer that
differs.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
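
A hedged usage sketch (the llama-gguf-hash binary name assumes the llama- prefix convention from #7809; the flags are the ones listed above):

```console
# whole-model and per-tensor xxhash (the default mode)
./llama-gguf-hash --xxh64 model.gguf
# or a stronger digest
./llama-gguf-hash --sha256 model.gguf
```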
2024-07-14 00:20:27 +08:00