llama.cpp

Author	SHA1	Message	Date
Georgi Gerganov	b49a13dd2f	convert : fix set_vocab_sentencepiece (#6866 ) * convert : fix set_vocab_sentencepiece * Update convert-hf-to-gguf.py	2024-05-18 08:46:20 +03:00
slaren	05834841dc	ggml : fix quants nans when all the group weights are very close to zero (#7313 )	2024-05-18 02:39:54 +02:00
Engininja2	ef277de2ad	cmake : fix typo in AMDGPU_TARGETS (#7356 )	2024-05-18 02:39:25 +02:00
jaime-m-p	b43272afa2	Unicode codepoint flags for custom regexs (#7245 ) * Replace CODEPOINT_TYPE_* with codepoint_flags * Update and bugfix brute force random test * Deterministic brute force random test * Unicode normalization NFD * Get rid of BOM	2024-05-18 01:09:13 +02:00
Johannes Gäßler	0fc1e820a9	CUDA: faster large batch FA without tensor cores (#7314 )	2024-05-17 18:54:52 +02:00
Gavin Zhao	82ca83db3c	ROCm: use native CMake HIP support (#5966 ) Supersedes #4024 and #4813. CMake's native HIP support has become the recommended way to add HIP code into a project (see [here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)). This PR makes the following changes: 1. The environment variable `HIPCXX` or CMake option `CMAKE_HIP_COMPILER` should be used to specify the HIP compiler. Notably this shouldn't be `hipcc`, but ROCm's clang, which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously this was controlled by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`. Note that since native CMake HIP support is not yet available on Windows, on Windows we fall back to the old behavior. 2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the GPU architectures to build for. Previously this was controlled by `GPU_TARGETS`. 3. Updated the Nix recipe to account for these new changes. 4. The GPU targets to build against in the Nix recipe are now consistent with the supported GPU targets in nixpkgs. 5. Added CI checks for HIP on both Linux and Windows. On Linux, we test both the new and old behavior. The most important part about this PR is the separation of the HIP compiler and the C/C++ compiler. This allows users to choose a different C/C++ compiler if desired, compared to the current situation where when building for ROCm support, everything must be compiled with ROCm's clang. ~~Makefile is unchanged. Please let me know if we want to be consistent on variables' naming because Makefile still uses `GPU_TARGETS` to control architectures to build for, but I feel like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're calling `make`.~~ Makefile used `GPU_TARGETS` but the README says to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of `GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`. Thanks to the suggestion of @jin-eld, to maintain backwards compatibility (and not break too many downstream users' builds), if `CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using the original behavior and emit a warning that recommends switching to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but `CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS` to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new HIP support. Signed-off-by: Gavin Zhao <git@gzgz.dev>	2024-05-17 17:03:03 +02:00
Radoslav Gerganov	f4bd8b3d26	rpc : set SO_REUSEADDR for the server socket (#7320 ) ref: #7293	2024-05-17 17:25:44 +03:00
Brian	51e9d02599	Added a single test function script and fix debug-test.sh to be more robust (#7279 ) * run-single-test.sh: added a single test function script and fix debug-test.sh to be more robust * debug-test.sh: combined execute and gdb test mode via -g flag * debug-test.sh: refactor * debug-test: refactor for clarity * debug-test.sh: comment style changes * debug-test.sh: fix gdb	2024-05-17 22:40:14 +10:00
Aarni Koskela	d273c1402b	py : convert-hf-to-gguf-update improvements (#7340 ) * convert-hf-to-gguf-update: automate updating * convert-hf-to-gguf-update: improve download * share requests session for performance * create directories only when needed, don't skip downloads when empty directory encountered * be more graceful about errors	2024-05-17 15:11:45 +03:00
fairydreaming	27b040691c	llama : use n_embd_head_v when reshaping kqv (#7327 ) * llama : use n_embd_head_v instead of n_embd_head_k when reshaping kqv * llama : use n_embd_v_gqa and n_embd_head_v instead of n_embd_k_gqa and n_embd_head_k when making a view of cached value vectors. --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2024-05-17 14:24:38 +03:00
Johannes Gäßler	29c60d8cdd	tokenization: add warning for double BOS (#7332 )	2024-05-17 09:59:57 +02:00
Herman Semenov	359cbe3f46	ggml-quants, llama : removed excess checks (#7274 )	2024-05-17 10:08:49 +03:00
amd-lalithnc	e18bc6aaf3	convert : fix Qwen/Qwen-7b conversion (#7308 )	2024-05-17 10:01:58 +03:00
Radoslav Gerganov	ee94172d33	server : add support for the RPC backend (#7305 ) ref: #7292	2024-05-17 10:00:17 +03:00
Justine Tunney	934266c0e0	ggml : rewrite silu and softmax for cpu (#7154 ) This change upstreams llamafile's vectorized expf() functions. This lets us compute softmax and silu more accurately than the short[65536] lookup table that GGML previously used to make this operation go faster. We can support aarch64 and sse2+ with the worst case rounding error of 2ulp. It makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512	2024-05-17 09:58:52 +03:00
Leon Knauer	9c4fdcbec8	[Server] Added --verbose option to README [no ci] (#7335 )	2024-05-17 10:11:03 +10:00
Pierrick Hymbert	24ecb58168	Revert "server bench: fix bench not waiting for model load (#7284 )" (#7334 ) This reverts commit `583fd6b000`.	2024-05-16 20:43:45 +02:00
HanishKVC	999bd396d0	ChatON: forgot to get c string format	2024-05-16 23:50:03 +05:30
HanishKVC	0cbfd40f18	ChatON: Option for a fallback tmpl to use wrt chat-tmpl-apply-ex	2024-05-16 23:27:34 +05:30
HanishKVC	1a0df950eb	C++17: Use and limit C++17 to common library for now C++17 provides a good enough variant as a standard feature, and chaton uses the same at its core, instead of rolling out its own struct of union based variant. And given that currently chaton is part of common library and not the base llama library, so limit the use of c++17 to common library. Initially while experimenting, had set the flag for full llama, limiting it for now. Also by now most embedded targets should be potentially having c++ compilers and libraries with support for c++17 features. So chances are it is an OK enough path to take.	2024-05-16 14:56:07 +05:30
Radoslav Gerganov	9afdffe70e	rpc : get available mem for the CPU backend This can be overridden with the -m command line option ref: #7293	2024-05-16 12:04:08 +03:00
Radoslav Gerganov	3b3963c55c	rpc : add command line arg for specifying backend memory ref: #7293	2024-05-16 09:58:29 +03:00
HanishKVC	239b5be219	ChatON+: Cleanup integration with CMake Rename chaton-meta hpp to cpp and include this cpp file which brings in the compile time built-in global chaton configurable template data into the common library, and avoid the nop hpp file references. Update chaton.hpp to not include the meta-cpp, instead just make a reference to the global ChatTemplates instance, so that the hpp can be used as a header file proper. Avoid pragma once in the chaton-meta.cpp, including the script, which helps create it.	2024-05-16 12:22:27 +05:30
Jared Van Bortel	dda64fc17c	convert : get general.name from model dir, not its parent (#5615 ) Co-authored-by: Brian <mofosyne@gmail.com>	2024-05-16 16:15:23 +10:00
Herman Semenov	0350f58152	grammar, json, llama: replace push with emplace where possible (#7273 )	2024-05-16 16:14:24 +10:00
Vaibhav Srivastav	ad52d5c259	doc: add references to hugging face GGUF-my-repo quantisation web tool. (#7288 ) * chore: add references to the quantisation space. * fix grammer lol. * Update README.md Co-authored-by: Julien Chaumond <julien@huggingface.co> * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Julien Chaumond <julien@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-16 15:38:43 +10:00
Max Krasnyansky	172b78210a	ci: fix bin/Release path for windows-arm64 builds (#7317 ) Switch to Ninja Multi-Config CMake generator to resurrect bin/Release path that broke artifact packaging in CI.	2024-05-16 15:36:43 +10:00
Max Krasnyansky	13ad16af12	Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (#7191 ) * logging: add proper checks for clang to avoid errors and warnings with VA_ARGS * build: add CMake Presets and toolchain files for Windows ARM64 * matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings * ci: add support for optimized Windows ARM64 builds with MSVC and LLVM * matmul-int8: fixed typos in q8_0_q8_0 matmuls Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * matmul-int8: remove unnecessary casts in q8_0_q8_0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-16 12:47:36 +10:00
Daniel Bevenius	8f7080bf48	readme : remove stray double quote (#7310 ) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-05-15 23:41:03 +02:00
kunnis	e1b40ac3b9	ggml : use dynamic thread scheduling for matrix multiplication (#6915 ) * Just reordering some structs. * Adding in the calls to mm_pause * Passing around the state * Renaming and moving a bunch of variables around. * Extracting the logic to its own function. * Moving some variable definitions into the chunk function. * Moving some variables around * moving src1_cont inside * Moving row_size * adding the current_chunk * Reorg the code. * Formatting to match the orig patch * starting to setup the chunking variables * Starting the buildup of the loop * The yield shouldn't be necessary. * adding the looping structure based on the chunk configuration. * Add in the re-chunking code. * Making it much more likely to rechunk. * disable resizing if numa is enabled. * Updating comments with what we've learned. * Fix formatting * Couple more formatting fixes. * More style fixes. * Fix Warnings * Going with unused because there's conditional logic that needs it. * Update ggml.c * Update ggml.c ---------	2024-05-15 19:59:12 +02:00
HanishKVC	7a3ac0cc15	Merge branch 'master' into hkvc_chaton_v3 Merge upstream as of 20240515IST11XY	2024-05-15 23:17:11 +05:30
HanishKVC	397249df61	DataUtilsString: string_as_hex and use direct log helpers	2024-05-15 21:14:24 +05:30
HanishKVC	bb3fe48c16	SimpCfg+DataUtilsString: Move string helpers to its own file	2024-05-15 19:25:31 +05:30
agray3	dc020985b8	Avoid unnecessarily disabling CUDA graphs (#7302 ) As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by avoiding the consecutive update counter from incrementing unnecessarily for tokens in which cuda graphs are disabled due to batch size > 1.	2024-05-15 15:44:49 +02:00
slaren	344f9126cc	ggml : tag ggml_tensor::backend as deprecated (#7290 )	2024-05-15 15:08:48 +02:00
HanishKVC	cdd91f5ad1	SimpCfg: Trap conversion error and raise appropriate exception	2024-05-15 18:37:15 +05:30
AidanBeltonS	9a17ab914b	Add missing " (#7303 )	2024-05-15 17:56:30 +05:30
dm4	ea3b0590ee	embedding : free the batch after execution (#7297 )	2024-05-15 15:01:12 +03:00
Georgi Gerganov	29499bb593	sync : ggml	2024-05-15 13:23:41 +03:00
John Balis	48aa8fd1f2	ggml : add `ggml_upscale_ext` (ggml/814) * initial commit with CPU implementation of upscale to shape and test, cuda implementation next * experimental commit to see if dst shape is correct * test version * test * removed unnecessary params * refactor * fixed tests * ggml : metal impl + cleanup + sycl dev warnings * patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior * metal : fix upscale op to support nb00 + style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-15 13:23:33 +03:00
HanishKVC	4f5add68c6	GroupKV:Dump/Log type of the variant instance also	2024-05-15 14:28:48 +05:30
HanishKVC	dc03a7134a	CMakeLists: base std::variantC++17, specificTest std::formatC++20	2024-05-15 12:56:05 +05:30
Johannes Gäßler	583fd6b000	server bench: fix bench not waiting for model load (#7284 )	2024-05-15 08:44:16 +02:00
HanishKVC	4a15989000	ChatON: Forgot this note earlier	2024-05-15 03:38:41 +05:30
HanishKVC	a3d641b555	ChatON: Move loading from json file into its own file Any program which wants to use json file to update/extend the chaton's configurable template data, can include this new file chaton_json.hpp, to get the reqd functionality. Update chaton_meta_ok, _chaton_meta_validate_dump and chaton_meta_load_json to either work with a passed ChatTemplates instance, or fallback to the compiled-in global instance of same.	2024-05-15 03:00:25 +05:30
HanishKVC	14c28e717e	GroupKV+: dump cleanup - forgot to commit earlier	2024-05-15 02:11:26 +05:30
HanishKVC	8975de996b	ChatON: Update Notes to match the updated semantics and flows The initial version was rooted around a json object, while the new version is rooted around a MapOfMapOfVariant (GroupKV), which could be preloaded with chat templates info at compile time itself and used as is. Or optionally one could allow the configurable template data to be extended/updated at runtime from a text(/SimpCfg)/json file.	2024-05-14 21:54:52 +05:30
Georgi Gerganov	9f773486ab	script : sync ggml-rpc	2024-05-14 19:14:38 +03:00
Georgi Gerganov	e8a7fd4fb0	metal : support FA without mask + add asserts (#7278 ) * ggml : fa without mask + add asserts ggml-ci * metal : support non-contiguous KV ggml-ci	2024-05-14 19:09:30 +03:00
Georgi Gerganov	a5e3fde857	sync : ggml ggml-ci	2024-05-14 19:08:09 +03:00

3288 commits