Commit graph

3391 commits

compilade
4e3d43f66b llama : fix pre-tokenization of non-special added tokens (#8228)
* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There is a quirk in the HF tokenizer for Gemma: it prefers the
16-space token over longer space tokens, while the SentencePiece
tokenizer does not do this.
(the implementation in llama.cpp behaves the same as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with #8379

* test-tokenizer-random : add a failing edge case for falcon
2024-07-27 21:23:09 +08:00
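The core technique in the commit above is to split raw text on user-defined token strings before the regex pre-tokenizer ever runs, so those tokens can never be broken apart. A minimal, self-contained sketch of that idea; the `fragment`/`split_on_token` names are illustrative, not the actual llama.cpp internals:

```cpp
#include <string>
#include <vector>

// A raw-text fragment: either still pending pre-tokenization, or an
// already-matched user-defined token that must not be split further.
struct fragment { std::string text; bool is_token; };

// Split every pending fragment on one user-defined token string. Running this
// once per user-defined token, before the regex pre-tokenizer, guarantees the
// regex never sees (and never splits) those tokens.
static std::vector<fragment> split_on_token(const std::vector<fragment> & in,
                                            const std::string & tok) {
    std::vector<fragment> out;
    for (const fragment & f : in) {
        if (f.is_token) { out.push_back(f); continue; }
        size_t pos = 0, hit;
        while ((hit = f.text.find(tok, pos)) != std::string::npos) {
            if (hit > pos) out.push_back({f.text.substr(pos, hit - pos), false});
            out.push_back({tok, true});
            pos = hit + tok.size();
        }
        if (pos < f.text.size()) out.push_back({f.text.substr(pos), false});
    }
    return out;
}
```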
bandoti
08bd5616c1 vulkan : cmake integration (#8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-07-27 21:23:09 +08:00
Georgi Gerganov
2aa671745c metal : template-ify some of the kernels (#8447)
ggml-ci
2024-07-27 21:23:09 +08:00
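"Template-ify" here means collapsing near-identical per-type copies of a kernel into one template. A C++ analogy (Metal Shading Language shares this template syntax; the kernel below is invented for illustration, it is not one of the ggml-metal kernels):

```cpp
// One templated kernel replaces separate per-type copies of the same code.
template <typename T>
void kernel_scale(const T * src, T * dst, int n, float s) {
    for (int i = 0; i < n; ++i) {
        dst[i] = static_cast<T>(src[i] * s);
    }
}

// explicit instantiations pick the concrete element types:
template void kernel_scale<float>(const float *, float *, int, float);
template void kernel_scale<double>(const double *, double *, int, float);
```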
arthw
a364ec78f3 fix UT of concat 2024-07-14 11:07:56 +08:00
Neo Zhang
e700d37f68 mv softmax to separated file 2024-07-14 01:02:58 +08:00
Georgi Gerganov
07d457b83f server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-14 00:28:26 +08:00
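For context: OpenAI-style chat messages allow `content` to be either a plain string or an array of typed parts, and the commit above makes the server accept both. A hedged sketch of the idea using nlohmann::json (which the server already depends on); the helper name is illustrative:

```cpp
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Accept both `"content": "hi"` and
// `"content": [{"type": "text", "text": "hi"}, ...]`,
// concatenating the text parts into one string.
static std::string content_to_string(const json & content) {
    if (content.is_string()) {
        return content.get<std::string>();
    }
    std::string out;
    for (const json & part : content) {
        if (part.contains("text")) {
            out += part.at("text").get<std::string>();
        }
    }
    return out;
}
```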
Georgi Gerganov
21825798c2 main : print error on empty input (#8456) 2024-07-14 00:28:26 +08:00
Daniel Bevenius
318d950e79 llama : suppress unary minus operator warning (#8448)
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this is that currently the following warning is
generated on Windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
2024-07-14 00:28:26 +08:00
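The warning and its fix in a nutshell (standalone example, not the actual llama.cpp code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t u = 5;
    // warning C4146 on MSVC: the unary minus is applied to the unsigned
    // value first, so the result is still unsigned:
    //     int32_t bad = (int32_t) -u;
    // casting first and negating after avoids the warning:
    int32_t good = -(int32_t) u;
    printf("%d\n", good); // prints -5
    return 0;
}
```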
Douglas Hanley
0a7d1bf5de server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-14 00:28:26 +08:00
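A minimal sketch of the batching rule described above: flush whenever the next request's embedding flag differs, so no batch ever mixes the two modes. All names are invented for illustration and this is not the server's actual scheduling code:

```cpp
#include <vector>

struct request { int id; bool embed; };

// Submit pending requests in batches that are all-embedding or all-completion.
template <typename Flush>
void schedule(const std::vector<request> & pending, Flush flush) {
    std::vector<request> batch;
    for (const request & r : pending) {
        if (!batch.empty() && batch.front().embed != r.embed) {
            flush(batch); // mode changes: submit the homogeneous batch first
            batch.clear();
        }
        batch.push_back(r);
    }
    if (!batch.empty()) {
        flush(batch);
    }
}
```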
Armen Kaleshian
3ebd51fcad docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py, breaking how convert is called from within a Docker
container.
2024-07-14 00:28:26 +08:00
Jiří Podivín
757ae96e5d convert : remove fsep token from GPTRefactForCausalLM (#8237)
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-14 00:28:26 +08:00
Georgi Gerganov
e0916db972 examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
2024-07-14 00:28:26 +08:00
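Both changes from the commit above, shown in isolation (the buffer and format string are illustrative):

```cpp
#include <cstdio>

int main() {
    char buf[64];
    const int id = 42;
    // sprintf(buf, "token_%d", id);            // no bound: can overflow buf
    snprintf(buf, sizeof(buf), "token_%d", id); // bounded by sizeof(buf),
                                                // not a hardcoded constant
    puts(buf);
    return 0;
}
```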
Georgi Gerganov
f6786401d2 ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-14 00:28:26 +08:00
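For readers unfamiliar with PRId64: it is the portable printf format macro for int64_t from <cinttypes>, for example:

```cpp
#include <cinttypes>
#include <cstdio>

int main() {
    int64_t n = 1234567890123LL;
    // "%ld" assumes long is 64-bit, which is false on e.g. 64-bit Windows;
    // PRId64 expands to the correct conversion specifier on every platform
    printf("n = %" PRId64 "\n", n);
    return 0;
}
```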
Chen Xi
fa700d1a84 [SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-14 00:28:26 +08:00
Nicholai Tukanov
b4caa00c7c ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-07-14 00:28:26 +08:00
Daniel Bevenius
a5e36a3518 cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-14 00:28:26 +08:00
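A standalone illustration of the 'noreturn' issue (not the actual common.cuh; __trap is a CUDA intrinsic, so abort() stands in here):

```cpp
#include <cstdio>
#include <cstdlib>

// A [[noreturn]] function must provably never return, otherwise compilers
// such as clang emit -Winvalid-noreturn at its closing brace. Ending it with
// abort(), an infinite loop, or a trap intrinsic (as in the commit above)
// satisfies the checker.
[[noreturn]] static void no_device_code() {
    fprintf(stderr, "no device code available for this architecture\n");
    abort();
}

int main() {
    no_device_code();
}
```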
Johannes Gäßler
6a9dcf01ad CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-14 00:28:26 +08:00
Georgi Gerganov
8c88cd899b gitignore : deprecated binaries 2024-07-14 00:28:26 +08:00
compilade
4e4205aa6f tokenize : add --no-parse-special option (#8423)
This should make it easier to explain
how parse_special affects tokenization.
2024-07-14 00:28:26 +08:00
Georgi Gerganov
2ed5fd58b5 llama : use F32 precision in Qwen2 attention and no FA (#8412) 2024-07-14 00:28:26 +08:00
Clint Herron
86ced79ae6 Initialize default slot sampling parameters from the global context. (#8418) 2024-07-14 00:28:26 +08:00
Clint Herron
2f027bcb15 Name Migration: Build the deprecation-warning 'main' binary every time (#8404)
* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This is to help users of tutorials and other instruction sets know what to do when the 'main' binary is missing and they are trying to follow instructions.

* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
2024-07-14 00:28:26 +08:00
AidanBeltonS
35b1aff5cf [SYCL] Use multi_ptr to clean up deprecated warnings (#8256) 2024-07-14 00:28:18 +08:00
Georgi Gerganov
e78fa06f3d ggml : move sgemm sources to llamafile subfolder (#8394)
ggml-ci
2024-07-14 00:23:01 +08:00
Dibakar Gope
528f58ff8d ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
2024-07-14 00:23:01 +08:00
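For context on the "reference scalar gemm and gemv" mentioned above, here is a hedged sketch of a scalar q4_0 x q8_0 dot product. The real ggml blocks store the scale as fp16; float is used here only to keep the example self-contained:

```cpp
#include <cstdint>

// 32 weights per block: a scale plus 16 bytes holding two 4-bit values each
// (float scale here is a simplification of the real fp16 field).
struct block_q4_0 { float d; uint8_t qs[16]; };
// 32 activations per block: a scale plus 32 signed 8-bit values.
struct block_q8_0 { float d; int8_t  qs[32]; };

static float vec_dot_q4_0_q8_0(int nblocks,
                               const block_q4_0 * x, const block_q8_0 * y) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32_t acc = 0;
        for (int j = 0; j < 16; ++j) {
            // q4_0 stores unsigned nibbles with an implicit offset of 8;
            // low nibbles hold elements 0..15, high nibbles elements 16..31
            const int v0 = (x[i].qs[j] & 0x0F) - 8;
            const int v1 = (x[i].qs[j] >> 4)   - 8;
            acc += v0 * y[i].qs[j] + v1 * y[i].qs[j + 16];
        }
        sum += x[i].d * y[i].d * acc;
    }
    return sum;
}
```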
M. Yusuf Sarıgöz
04ba8fca3e gguf-py rel pipeline (#8410)
* Upd gguf-py/readme

* Bump patch version for release
2024-07-14 00:23:01 +08:00
Borislav Stanimirov
224090c64e llama : C++20 compatibility for u8 strings (#8408) 2024-07-14 00:23:01 +08:00
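The underlying incompatibility: in C++20, u8 string literals changed type from const char[] to const char8_t[], so code that assigned them to char pointers stops compiling. The usual fix is an explicit cast, guarded by the char8_t feature-test macro (an illustration, not the exact patch):

```cpp
#include <cstdio>

int main() {
    // C++17: u8"..." is const char[];  C++20: it is const char8_t[]
#if defined(__cpp_char8_t)
    const char * s = reinterpret_cast<const char *>(u8"\u00e9"); // é as UTF-8
#else
    const char * s = u8"\u00e9";
#endif
    printf("%s\n", s);
    return 0;
}
```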
Borislav Stanimirov
35f85f71e5 msvc : silence codecvt c++17 deprecation warnings (#8395) 2024-07-14 00:23:01 +08:00
fairydreaming
f4e68cd731 llama : add assert about missing llama_encode() call (#8400)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-07-14 00:23:01 +08:00
RunningLeon
0464524ddd py : fix converter for internlm2 (#8321)
* update internlm2

* remove unused file

* fix lint
2024-07-14 00:23:01 +08:00
laik
eb16c41949 py : fix extra space in convert_hf_to_gguf.py (#8407) 2024-07-14 00:23:01 +08:00
Clint Herron
ae3a78ad34 Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-14 00:23:01 +08:00
Andy Salerno
8af17465a9 Update README.md to fix broken link to docs (#8399)
Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
2024-07-14 00:23:01 +08:00
Clint Herron
0e6506aeb0 Deprecation warning to assist with migration to new binary names (#8283)
* Adding a simple program that provides a deprecation warning, to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
2024-07-14 00:22:58 +08:00
Johannes Gäßler
c7d621d0da make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392) 2024-07-14 00:21:54 +08:00
Borislav Stanimirov
5c10e23a80 cmake : allow external ggml (#8370) 2024-07-14 00:20:27 +08:00
daghanerdonmez
1052802685 readme : fix typo [no ci] (#8389)
Bakus-Naur --> Backus-Naur
2024-07-14 00:20:27 +08:00
compilade
c380b899e5 gguf-py : do not use internal numpy types (#7472) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
9ad5bcaad3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
7a8fa37316 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-14 00:20:27 +08:00
b4b4o
790e9b2a0e readme : fix web link error [no ci] (#8347) 2024-07-14 00:20:27 +08:00
Alberto Cabrera Pérez
a7d7781692 sycl : fix powf call in device code (#8368) 2024-07-14 00:20:27 +08:00
Georgi Gerganov
86d41e6e1c scripts : fix sync for sycl 2024-07-14 00:20:27 +08:00
Georgi Gerganov
a5038fc736 sync : ggml
ggml-ci
2024-07-14 00:20:27 +08:00
Georgi Gerganov
8ab505a2e9 tests : fix whitespace (#0) 2024-07-14 00:20:27 +08:00
John Balis
fec49428a6 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-14 00:20:27 +08:00
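For reference, the operation being implemented: a 1-D transposed convolution scatters each input element as a scaled copy of the kernel into the output. A minimal CPU version (single channel, unit dilation, no padding; a sketch, not the ggml code):

```cpp
#include <vector>

// Output length is (n_in - 1) * stride + n_kernel; assumes x is non-empty.
std::vector<float> conv_transpose_1d(const std::vector<float> & x,
                                     const std::vector<float> & k, int stride) {
    std::vector<float> y((x.size() - 1) * stride + k.size(), 0.0f);
    for (size_t i = 0; i < x.size(); ++i) {
        for (size_t j = 0; j < k.size(); ++j) {
            y[i * stride + j] += x[i] * k[j]; // each input scatters a scaled kernel
        }
    }
    return y;
}
```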
Kevin Wang
9ff6a62845 common : preallocate sampling token data vector (#8363)
Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this change improving the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance, which has a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
2024-07-14 00:20:27 +08:00
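The change described above, as a minimal before/after sketch; the struct fields mirror llama_token_data, but this is illustrative rather than the actual code:

```cpp
#include <vector>

struct token_data { int id; float logit; float p; };

std::vector<token_data> build_candidates(const float * logits, int n_vocab) {
    std::vector<token_data> cur;
    // before: no preallocation, emplace_back called n_vocab times
    // after: size the vector once, then write each element in place
    cur.resize(n_vocab);
    for (int id = 0; id < n_vocab; ++id) {
        cur[id] = token_data{ id, logits[id], 0.0f };
    }
    return cur;
}
```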
Georgi Gerganov
da09d77524 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-14 00:20:27 +08:00
Kevin Wang
6e022a225a common : avoid unnecessary logits fetch (#8358) 2024-07-14 00:20:27 +08:00
toyer
68d1711f73 readme : add supported glm models (#8360) 2024-07-14 00:20:27 +08:00