llama.cpp

Author	SHA1	Message	Date
HanishKVC	1574201f71	ChatON:LoadJSon:ChatTemplates: revPrompt, system-user flags WIP:NOTE: Initial go converting from json driven flow to ChatTemplatesGroupKV related flow done. Needs to be tested. An optional helper added to load ChatTemplates from a specified json file. Need to add a compile time initialized MapOfMapOfVariants wrt the chat template details of models/standards already known to the program. So that one can use the llama.cpp and this new chat template logic, even without json dependency, if one doesnt want to.	2024-05-12 01:45:19 +05:30
HanishKVC	444d2ccf9c	ChatON:LoadJSON: ChatTemplates - global/system/user/assistant Manually iterate the json object items using begin-end explicitly, because the implicit iteration for loop related helpers for the used json lib gives only the values and not a key-value pair.	2024-05-12 01:35:31 +05:30
HanishKVC	2efc09f2d0	ChatON: Unnecessarily indirect nlohmann json code used for exploring/testing committed just for future reference	2024-05-12 00:42:17 +05:30
HanishKVC	b9d9700de3	CMakeLists.txt: Compile C++ code for -std=c++20	2024-05-11 23:42:08 +05:30
HanishKVC	b944d04d08	ChatON: Add constructor for ChatTemplates which chains into GKV	2024-05-11 23:42:08 +05:30
HanishKVC	d9959b74e7	GroupKV: Get ready for use in llama.cpp ++ Avoid defining GKV_TEST_PRG, used for self testing, by default Add it to common library	2024-05-11 23:40:03 +05:30
HanishKVC	4a9a6ce256	ChatON: ChatONMetaDump switch to GKV/ChatTemplates based flow	2024-05-11 22:53:45 +05:30
HanishKVC	484c710eab	GroupKV:Add GetValue which throws exception	2024-05-11 20:49:51 +05:30
HanishKVC	9d4450d51a	GroupKV: Let dump return a string, rather than printing/logging	2024-05-11 19:43:34 +05:30
HanishKVC	e999934e91	ChatON:WIP: initial go at GroupKV based flow, instead of json	2024-05-11 19:41:58 +05:30
HanishKVC	f294fddf43	GroupKV: Add group_exists checker	2024-05-11 19:18:19 +05:30
HanishKVC	dde72df9d3	GroupKV: Rename the internal map	2024-05-11 18:23:06 +05:30
HanishKVC	fdefb39518	GroupKV:Make LDBUG macros conditional, avoid condition at usage site Also change LWARN to LDBUG wrt previously GKV_DEBUG conditional code	2024-05-11 13:30:56 +05:30
HanishKVC	7f03dd0d4b	GroupKV: Add int32_t to variant list, to simplify int use So that no need to explicitly specify <int64_t> or LL wrt int literals, which dont need 64bit space by default. Which also means one shouldnt/cant mix up type of value stored and default type specified when getting.	2024-05-11 12:45:58 +05:30
HanishKVC	0342124946	GroupKV: Add to_str wrt vectors, help avoid compiler confusion	2024-05-11 12:27:42 +05:30
HanishKVC	7d7c59ec50	GroupKV:Simplify:P2: Rename tags, Make debug logs conditional Rename all the log messages to have GKV and not SC. The log messages in get_vector made conditional to GKV_DEBUG, this was missed out earlier in simpcfg itself.	2024-05-11 11:57:27 +05:30
HanishKVC	d764a9d395	GroupKV: Simplify code to the minimal needed for GroupKV - P1	2024-05-11 11:37:06 +05:30
HanishKVC	86b842b172	GroupKV: Duplicate SimpCfg to chop down into GroupKV IE a minimal MapOfMapOfVariant, with some basic helpers. This can be the basis of a ChatTemplates object, as well as SimpCfg built on top of it.	2024-05-11 10:57:32 +05:30
HanishKVC	c0506f94bf	SimpCfg: Allow for direct initialization lists based init This should pave way for having a default chat templates dataset in the code, without needing to load it from a config file, if one doesnt want to. TODO: allow for loading config from json into simpcfg, so that a program which uses llama.cpp can decide, whether it is ok with what is already there in the internal dataset, or allow for loading template info at runtime using the simpcfg's simple text file or additionally include the json code to load template info at runtime from json file.	2024-05-11 00:33:31 +05:30
HanishKVC	fe27902964	SimpCfg: Avoid iostream/cout and format for direct library use It appears like std::format is not supported in older g++/lib still in wide use like current debian stable, so avoiding same wrt direct library use. Allow for empty VAARGS NOTE: However test program mode of the same uses cout and format	2024-05-10 22:27:07 +05:30
HanishKVC	1f9a0eb8ce	ChatON: Remove unneeded iostream	2024-05-10 21:10:44 +05:30
HanishKVC	abb406b888	Merge branch 'master' into hkvc_chaton_v3 Have merged master branch as of 20240510IST12XY with chaton_v3 branch. As part of same had to update the flow in examples/main/main.cpp wrt conversion related commit in master branch and my chaton related commits in this branch.	2024-05-10 13:14:26 +05:30
Andrei	d11afd6652	llava : fix moondream support (#7163 ) * Revert "Revert "llava : add support for moondream vision language model (#6899)"" This reverts commit `9da243b36a`. * Fix num_positions and embeddings initialization	2024-05-10 09:41:10 +03:00
Ouadie EL FAROUKI	8c570c9496	Minor arithmetic improvement to mmvq wrapper kernel (#7172 )	2024-05-10 08:32:15 +08:00
slaren	eaf4bd8b39	eval-callback : fix conversion to float (#7184 )	2024-05-10 01:04:12 +02:00
0cc4m	befddd0f15	Vulkan Bugfixes and Improvements (#7084 ) * Modify mat mat mul shader for mul_mat_id, modify mat vec mul shaders for single call batch operation * Further work towards MoE, disabled for now * Disable MoE code (not ready yet), fix a number of bugs in shaders and Vulkan code * Add softmax with f16 mask and pos buffer support * Disable mul_mat_id shaders for now * Fix flake8 * Fix validation errors caused by empty buffers on larger batch sizes	2024-05-09 20:39:54 +02:00
Georgi Gerganov	d46dbc76f8	readme : add scheduled server workflow status badge	2024-05-09 16:40:42 +03:00
l3utterfly	0961d86604	readme : add app (#6371 ) * added Layla to supported UIs * Update README.md	2024-05-09 16:32:40 +03:00
jaime-m-p	43248e5594	llama3 custom regex split (#6965 ) * merged the changes from deepseeker models to main branch * Moved regex patterns to unicode.cpp and updated unicode.h * Moved header files * Resolved issues * added and refactored unicode_regex_split and related functions * Updated/merged the deepseek coder pr * Refactored code * Adding unicode regex mappings * Adding unicode regex function * Added needed functionality, testing remains * Fixed issues * Fixed issue with gpt2 regex custom preprocessor * unicode : fix? unicode_wstring_to_utf8 * lint : fix whitespaces * tests : add tokenizer tests for numbers * unicode : remove redundant headers * tests : remove and rename tokenizer test scripts * tests : add sample usage * gguf-py : reader prints warnings on duplicate keys * llama : towards llama3 tokenization support (wip) * unicode : shot in the dark to fix tests on Windows * unicode : first try custom implementations * convert : add "tokenizer.ggml.pre" GGUF KV (wip) * llama : use new pre-tokenizer type * convert : fix pre-tokenizer type writing * lint : fix * make : add test-tokenizer-0-llama-v3 * wip * models : add llama v3 vocab file * llama : adapt punctuation regex + add llama 3 regex * minor * unicode : set bomb * unicode : set bomb * unicode : always use std::wregex * unicode : support \p{N}, \p{L} and \p{P} natively * unicode : try fix windows * unicode : category support via std::regex * unicode : clean-up * unicode : simplify * llama3 custom regex split * convert : add convert-hf-to-gguf-update.py ggml-ci * lint : update * convert : add falcon ggml-ci * unicode : normalize signatures * lint : fix * lint : fix * convert : remove unused functions * convert : add comments * convert : exercise contractions ggml-ci * Using char32_t for codepoints * lint : fix * already exists unicode_tolower() * Typing * Restore BOM * cmake : refactor test targets * tests : refactor vocab tests ggml-ci * tests : add more vocabs and tests ggml-ci * unicode : cleanup * 
scripts : ignore new update script in check-requirements.sh * Fix merge * models : add phi-3, mpt, gpt-2, starcoder * tests : disable obsolete ggml-ci * tests : use faster bpe test ggml-ci * llama : more prominent warning for old BPE models * tests : disable test-tokenizer-1-bpe due to slowness ggml-ci * Move unused variable value * GPT2 custom regex split * Add alternative regex for custom split llama3 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Style * Add bruteforce random tests for token encoding * wip: fixing unicode codepoint ranges * Fix merge * Unicode tables: separator, lowercase, uppercase and whitespace * llama3 custom regex split: fix \s * Restore BOM * Style * wip: generate NDF table * Ignore special tokens for testing * Clean gen-unicode-data.py * Refactor random tokenizer test * lint : fix * tests : add fail test for llama-bpe --------- Co-authored-by: Jaggzh <jaggz.h@gmail.com> Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: jaime-m-p <>	2024-05-09 23:30:44 +10:00
Johannes Gäßler	a743d76a01	CUDA: generalize FP16 fattn vec kernel (#7061 ) * CUDA: generalize FP16 fattn vec kernel * disable unsupported head sizes for AMD in test * try AMD fix * fix batch size 2-8 * partially revert changes	2024-05-09 14:32:02 +02:00
Galunid	f31ec120bc	Add warning if token is invalid (#7173 )	2024-05-09 14:13:05 +02:00
Daniel Bevenius	fd9f92b154	llama : update llama_timings.n_p_eval setting (#7160 ) This commit changes the value assigned to llama_timings.n_p_eval when ctx->n_p_eval is 0 to be 0 instead of 1 which is the current value. The motivation for this change is that if session caching is enabled, for example using the `--prompt-cache main-session.txt` command line argument for the main example, and if the same prompt is used then on subsequent runs, the prompt tokens will not actually be passed to llama_decode, and n_p_eval will not be updated by llama_synchronize. But the value of n_p_eval will be set to 1 by llama_get_timings because ctx->n_p_eval will be 0. This could be interpreted as 1 token was evaluated for the prompt which could be misleading for applications using this value. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-05-09 14:03:29 +03:00
Sigbjørn Skjæret	22842164bc	gguf-py : add special token modification capability (#7166 ) * Add special token modification capability To be able to fix/amend special tokens in a GGUF let's add two new arguments: * `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<｜fim▁begin｜>"` * `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006 So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following: ```bash python3 gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<｜fim▁begin｜>" --special-token middle "<｜fim▁hole｜>" --special-token suffix "<｜fim▁end｜>" ``` * improve help text * flake-- * fix multiple tokens warning * make script executable * switch to namedtuple, no need to dataclass * typing++ * add progress bar * Add special token modification capability To be able to fix/amend special tokens in a GGUF let's add two new arguments: * `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<｜fim▁begin｜>"` * `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006 So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following: ```bash gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<｜fim▁begin｜>" --special-token middle "<｜fim▁end｜>" --special-token suffix "<｜fim▁hole｜>" ``` (yes, fim_end is the `middle` token, because completion is a `prefix`/`suffix`/`middle` sequence (where `middle` is unfilled)) or ```bash gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>" ``` etc... NB: The tokens have to exist already, trying to add non-existent token name/IDs will be ignored (with a warning), while non-existent values will fail (with an error). 
* improve help text * flake-- * fix multiple tokens warning * make script executable * switch to namedtuple, no need to dataclass * typing++ * add progress bar * fail on invalid token id	2024-05-09 13:56:00 +03:00
Albert Jin	4734524882	opencl : alignment size converted from bits to bytes (#7090 ) * opencl alignment size should be converted from bits to bytes Reference: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_DEVICE_MEM_BASE_ADDR_ALIGN > Alignment requirement (in bits) for sub-buffer offsets. * Update ggml-opencl.cpp for readability using division instead of shift Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-05-09 12:34:37 +03:00
Ahmet Zeer	07cd41d096	TypoFix (#7162 )	2024-05-09 10:16:45 +02:00
Jared Van Bortel	4426e2987b	cmake : fix typo (#7151 )	2024-05-08 19:55:32 -04:00
compilade	f98eb31c51	convert-hf : save memory with lazy evaluation (#7075 ) * convert-hf : begin refactoring write_tensor * convert : upgrade to sentencepiece v0.2.0 * convert-hf : remove unused n_dims in extra__tensors convert-hf : simplify MoE weights stacking * convert-hf : flake8 linter doesn't like semicolons * convert-hf : allow unusual model part names For example, loading `model-00001-of-00001.safetensors` now works. * convert-hf : fix stacking MoE expert tensors `torch.stack` and `torch.cat` don't do the same thing. * convert-hf : fix Mamba conversion Tested to work even with a SentencePiece-based tokenizer. * convert : use a string for the SentencePiece tokenizer path * convert-hf : display tensor shape * convert-hf : convert norms to f32 by default * convert-hf : sort model part names `os.listdir` is said to list files in arbitrary order. Sorting the file names should let "model-00009-of-00042.safetensors" be loaded before "model-00010-of-00042.safetensors". * convert-hf : use an ABC for Model again It seems Protocol can't be used as a statically type-checked ABC, because its subclasses also can't be instantiated. (why did it seem to work?) At least there's still a way to throw an error when forgetting to define the `model_arch` property of any registered Model subclasses. * convert-hf : use a plain class for Model, and forbid direct instantiation There are no abstract methods used anyway, so using ABC isn't really necessary. 
* convert-hf : more consistent formatting of cmdline args * convert-hf : align the message logged for converted tensors * convert-hf : fix Refact conversion * convert-hf : save memory with lazy evaluation * convert-hf : flake8 doesn't like lowercase L as a variable name * convert-hf : remove einops requirement for InternLM2 * convert-hf : faster model parts loading Instead of pre-loading them all into a dict, iterate on the tensors in the model parts progressively as needed in Model.write_tensors Conversion for some architectures relies on checking for the presence of specific tensor names, so for multi-part models, the weight map is read from the relevant json file to quickly get these names up-front. * convert-hf : minor changes for consistency * gguf-py : add tqdm as a dependency It's small, and used for a progress bar in GGUFWriter.write_tensors_to_file	2024-05-08 18:16:38 -04:00
agray3	bc4bba364f	Introduction of CUDA Graphs to LLama.cpp (#6766 ) * DRAFT: Introduction of CUDA Graphs to LLama.cpp * Fix issues raised in comments * Tidied to now only use CUDA runtime (not mixed with driver calls) * disable for multi-gpu and batch size > 1 * Disable CUDA graphs for old GPU arch and with env var * added missing CUDA_CHECKs * Addressed comments * further addressed comments * limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake * Added more comprehensive graph node checking * With mechanism to fall back if graph capture fails * Revert "With mechanism to fall back if graph capture fails" This reverts commit `eb9f15fb6f`. * Fall back if graph capture fails and address other comments * - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS - rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS - updated Makefile build to enable CUDA graphs - removed graph capture failure checking in ggml_cuda_error using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string if this is necessary to workaround some issues with graph capture with eg. 
cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context - fixed several resource leaks - fixed issue with zero node graphs - changed fixed size arrays to vectors - removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed - removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row - changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX - code style fixes - things to look into - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes * fix build without cuda graphs * remove outdated comment * replace minimum cc value with a constant --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-05-08 22:55:49 +02:00
Johannes Gäßler	c12452c7ae	JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143 )	2024-05-08 21:53:08 +02:00
Georgi Gerganov	9da243b36a	Revert "llava : add support for moondream vision language model (#6899 )" This reverts commit `46e12c4692`.	2024-05-08 22:14:39 +03:00
JohnnyB	bd1871fa2b	server : add themes + favicon (#6848 ) * Added themes support with two sample themes and a favicon. * Newline * Newline * Newline * Trailing whitespace * Increased opacity for contrast * Increase opacity. Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY * Opacity action trigger. Trying to re-trigger the cancelled action. * One more opacity adjustment This Actions pipeline is failing for random issues. * Delete examples/server/themes/buttons_top/completion.js This will be served from the static string built-in to server. * Delete examples/server/themes/buttons_top/index.js This will be served from the static string built-in to server. * Delete examples/server/themes/wild/completion.js This will be served from the static string built-in to server. * Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs This will be served from the static string built-in to server. * Delete examples/server/themes/wild/index.js This will be served from the static string built-in to server. * Delete examples/server/themes/wild/json-schema-to-grammar.mjs This will be served from the static string built-in to server. * Replaced underscore.	2024-05-08 22:12:06 +03:00
Gilad S	26458af1d6	metal : use `vm_allocate` instead of `posix_memalign` on macOS (#7078 ) * fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron processes * fix: typo * fix: use `vm_allocate` instead of `posix_memalign` * fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL` * fix: use `vm_allocate` only on macOS	2024-05-08 22:08:10 +03:00
Dawid Potocki	83330d8cd6	main : add --conversation / -cnv flag (#7108 )	2024-05-08 17:32:32 +03:00
Eve	465263d0cf	sgemm : AVX Q4_0 and Q8_0 (#6891 ) * basic avx implementation * style * combine denibble with load * reduce 256 to 128 (and back!) conversions * sse load * Update sgemm.cpp * oops oops	2024-05-08 17:29:23 +03:00
HanishKVC	8fe8231313	ChatON:SubPartsAwareTokenizePath: Allow extract subparts testing	2024-05-08 19:51:57 +05:30
HanishKVC	a49697b488	ChatON: Keep compiler happy simply	2024-05-08 19:22:46 +05:30
HanishKVC	0d81ffe6eb	Tests:ChatON: Add partial skeleton wrt subparts tokenizing	2024-05-08 19:06:51 +05:30
HanishKVC	868ab608f0	ChatON: Add forceParseSpecial flag to subparts aware tokenizing	2024-05-08 18:42:22 +05:30
HanishKVC	b6da7d9c9d	ChatON: tokenize keeping in mind the taggedMessage subparts Initial go	2024-05-08 18:38:07 +05:30
Johan	911b3900dd	server : add_special option for tokenize endpoint (#7059 )	2024-05-08 15:27:58 +03:00


3010 commits