Commit graph

3021 commits

Author SHA1 Message Date
Georgi Gerganov
9da243b36a
Revert "llava : add support for moondream vision language model (#6899)"
This reverts commit 46e12c4692.
2024-05-08 22:14:39 +03:00
JohnnyB
bd1871fa2b
server : add themes + favicon (#6848)
* Added themes support with two sample themes and a favicon.

* Newline

* Newline

* Newline

* Trailing whitespace

* Increased opacity for contrast

* Increase opacity.

The check actions were cancelled for some other higher-priority job and I can't seem to manually re-run them, so MOAR OPACITY

* Opacity action trigger.

Trying to re-trigger the cancelled action.

* One more opacity adjustment

This Actions pipeline is failing due to random issues.

* Delete examples/server/themes/buttons_top/completion.js

This will be served from the static string built into the server.

* Delete examples/server/themes/buttons_top/index.js

This will be served from the static string built into the server.

* Delete examples/server/themes/wild/completion.js

This will be served from the static string built into the server.

* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs

This will be served from the static string built into the server.

* Delete examples/server/themes/wild/index.js

This will be served from the static string built into the server.

* Delete examples/server/themes/wild/json-schema-to-grammar.mjs

This will be served from the static string built into the server.

* Replaced underscore.
2024-05-08 22:12:06 +03:00
Gilad S
26458af1d6
metal : use vm_allocate instead of posix_memalign on macOS (#7078)
* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron processes

* fix: typo

* fix: use `vm_allocate` instead of `posix_memalign`

* fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL`

* fix: use `vm_allocate` only on macOS
2024-05-08 22:08:10 +03:00
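A minimal sketch of what a vm_allocate-based host allocation can look like on macOS, for readers unfamiliar with the Mach VM API referenced above. The host_malloc/host_free names are placeholders rather than the actual ggml-metal.m symbols; vm_allocate returns page-aligned, zero-filled memory, and vm_deallocate needs the original allocation size.

```cpp
#include <cstddef>
#include <mach/mach.h>   // vm_allocate / vm_deallocate (macOS only)

// Illustrative sketch: allocate page-aligned host memory via the Mach VM API.
static void * host_malloc(size_t n) {
    vm_address_t addr = 0;
    kern_return_t err = vm_allocate((vm_map_t) mach_task_self(), &addr, n, VM_FLAGS_ANYWHERE);
    if (err != KERN_SUCCESS) {
        return nullptr;   // callers must handle nullptr (cf. the newBufferWithBytesNoCopy fix above)
    }
    return (void *) addr;
}

// vm_deallocate needs the size, so the caller has to remember it.
static void host_free(void * ptr, size_t n) {
    vm_deallocate((vm_map_t) mach_task_self(), (vm_address_t) ptr, n);
}
```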
Dawid Potocki
83330d8cd6
main : add --conversation / -cnv flag (#7108) 2024-05-08 17:32:32 +03:00
Eve
465263d0cf
sgemm : AVX Q4_0 and Q8_0 (#6891)
* basic avx implementation

* style

* combine denibble with load

* reduce 256 to 128 (and back!) conversions

* sse load

* Update sgemm.cpp

* oops

oops
2024-05-08 17:29:23 +03:00
HanishKVC
8fe8231313 ChatON:SubPartsAwareTokenizePath: Allow testing of subpart extraction 2024-05-08 19:51:57 +05:30
HanishKVC
a49697b488 ChatON: Keep compiler happy simply 2024-05-08 19:22:46 +05:30
HanishKVC
0d81ffe6eb Tests:ChatON: Add partial skeleton wrt subparts tokenizing 2024-05-08 19:06:51 +05:30
HanishKVC
868ab608f0 ChatON: Add forceParseSpecial flag to subparts aware tokenizing 2024-05-08 18:42:22 +05:30
HanishKVC
b6da7d9c9d ChatON: tokenize keeping in mind the taggedMessage subparts
Initial go
2024-05-08 18:38:07 +05:30
Johan
911b3900dd
server : add_special option for tokenize endpoint (#7059) 2024-05-08 15:27:58 +03:00
20kdc
ad211edef5
convert.py : --vocab-only generates false but valid params (#7027)
An example of how this might be used in the style of baby-llama will be attached with this PR.
2024-05-08 15:22:32 +03:00
Ren Xuancheng
229ffff872
llama : add BPE pre-tokenization for Qwen2 (#7114)
* Add BPE pre-tokenization for Qwen2.

* minor : fixes

---------

Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-08 15:06:43 +03:00
HanishKVC
8dfa31bb91 ChatON: Make c-api wrappers a bit robust incl some cross checks
If the tagged message is of 0 length, ensure that the passed
dest char* array has a null terminator inserted appropriately.

Check that the user has passed a non-null pNumParts.

Don't hardcode the int32_t size; pick it up using sizeof.
2024-05-08 17:05:45 +05:30
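A hedged sketch of the kind of cross checks described above: null-terminating a 0-length tagged message, rejecting null out-parameters, and sizing copies with sizeof(int32_t) instead of a hardcoded byte count. The helper name and signature are hypothetical, not the actual ChatON c-api.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper; assumes dest and pPartsLengths were sized by a prior query call.
static int32_t copy_tagged_result(const std::string & tagged,
                                  const std::vector<int32_t> & partsLengths,
                                  char * dest, int32_t * pPartsLengths, int32_t * pNumParts) {
    if (dest == nullptr || pPartsLengths == nullptr || pNumParts == nullptr) {
        return -1;                                   // reject null out-parameters
    }
    std::memcpy(dest, tagged.data(), tagged.size());
    dest[tagged.size()] = '\0';                      // a 0-length result still gets a terminator
    // copy per-part lengths using sizeof(int32_t), not a hardcoded element size
    std::memcpy(pPartsLengths, partsLengths.data(), partsLengths.size() * sizeof(int32_t));
    *pNumParts = (int32_t) partsLengths.size();
    return (int32_t) tagged.size();
}
```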
Xuan Son Nguyen
1fd9c1741d
clean up json_value & server_log (#7142) 2024-05-08 13:24:14 +02:00
DAN™
4cd621c26d
convert : add BPE pre-tokenization for DBRX (#7132)
* Add BPE pre-tokenization for DBRX.

* Add vocab GGUFs.

* Remove test.

* Remove GGUFs.
2024-05-08 13:43:23 +03:00
Georgi Gerganov
7e0b6a7b3b
py : also print the normalizers 2024-05-08 12:47:07 +03:00
Brian
acdce3cdef
compare-llama-bench.py: add missing basicConfig (#7138)
* compare-llama-bench.py: add missing basicConfig

* compare-llama-bench.py: Add line break between error message and print_help()

* Add regular print() markdown table
2024-05-08 10:54:39 +02:00
Justine Tunney
3855416027
ggml : introduce bfloat16 support (#6412)
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16,
which in practice ends up covering 99.71% of Mistral 7b v0.2's weights;
however, there is currently no way other than fp32 to get the others.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves somewhere around -0.0024 to -0.0046 compared to using fp16

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
2024-05-08 09:30:09 +03:00
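As a concrete illustration of the shift-based conversion described in the commit message, a minimal self-contained sketch (truncating on the float to bf16 path; ggml's actual conversion additionally handles rounding and NaNs):

```cpp
#include <cstdint>
#include <cstring>

// Convert a bfloat16 bit pattern to float by placing it in the upper 16 bits
// of an IEEE binary32 value (same sign and exponent layout).
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Truncating float -> bfloat16 conversion: drop the low 16 mantissa bits.
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return (uint16_t) (bits >> 16);
}
```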
Georgi Gerganov
c0e6fbf8c3
metal : fix unused warning 2024-05-08 09:14:50 +03:00
Jeximo
c780e75305
Further tidy on Android instructions README.md (#7077)
* Further tidy on Android instructions README.md

Fixed some logic when following the readme directions

* Clean up redundant information

A new user arriving will see simple directions on the llama.cpp homepage

* corrected punctuation

Period after cmake, colon after termux

* re-word for clarity

"method" seems to be more correct than "alternative" in this context

* Organized required packages per build type

Building llama.cpp with the NDK on a PC doesn't require installing clang, cmake, git, or wget in Termux.

* README.md

corrected title

* fix trailing whitespace
2024-05-08 02:26:43 +02:00
jukofyork
48b2f9c1fc
Fixed save_imatrix to match old behaviour for MoE (#7099)
* Fixed save_imatrix to match old behaviour for MoE

This fix is simple and clear, but unnecessarily doubles the memory overhead.

* Fixed missing idx variable

* Unconditionally increment ncall

Co-authored-by: slaren <slarengh@gmail.com>

* Fixed 2 bugs in save_imatrix()

- Fixed a segfault caused by the counts vector not being created.
- Fixed a pre-existing bug that didn't actually add to the counts for the "--combine" option.

* ncall needs summing too

* Trailing whitespace

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-08 02:24:16 +02:00
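A rough, hedged sketch of the accumulation the fix description points at: when combining imatrix data, the per-tensor counts have to be created and added together along with the running sums, and ncall has to be summed as well. The Stats layout and all names below are illustrative, not the actual imatrix code.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative per-tensor statistics (not the real imatrix structures).
struct Stats {
    std::vector<float> values; // running sums of squared activations
    std::vector<int>   counts; // how many activations contributed to each sum
};

static void combine(std::unordered_map<std::string, Stats> & dst,
                    const std::unordered_map<std::string, Stats> & src,
                    int & ncall_dst, int ncall_src) {
    for (const auto & [name, s] : src) {
        Stats & d = dst[name];
        d.values.resize(s.values.size(), 0.0f); // create the vectors if missing
        d.counts.resize(s.counts.size(), 0);    // (the missing counts vector caused the segfault)
        for (size_t i = 0; i < s.values.size(); ++i) {
            d.values[i] += s.values[i];
            d.counts[i] += s.counts[i];         // counts must be added for "--combine" too
        }
    }
    ncall_dst += ncall_src;                     // "ncall needs summing too"
}
```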
Johannes Gäßler
af0a5b6163
server: fix incorrectly reported token probabilities (#7125)
* server: normalize token probabilities

* fix temperature == 0.0f
2024-05-07 23:07:58 +02:00
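For context on "normalize token probabilities": per-token probabilities generally come from a softmax over the candidate logits, so after any truncation they must be rescaled to sum to 1; temperature == 0.0f corresponds to greedy sampling, where the chosen token effectively has probability 1. Below is a generic, illustrative softmax helper, not the server's actual code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Generic softmax over raw logits (assumes a non-empty input);
// subtracting the max logit keeps exp() numerically stable.
static std::vector<float> softmax(const std::vector<float> & logits) {
    std::vector<float> probs(logits.size());
    const float max_l = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_l);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;   // probabilities now sum to 1
    }
    return probs;
}
```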
nopperl
b6aa670203
Fix OLMo HF to GGUF conversion (#6910) 2024-05-07 21:39:43 +02:00
Kyle Mistele
260b7c6529
server : update readme with undocumented options (#7013) 2024-05-07 21:44:29 +03:00
Georgi Gerganov
53d6c52e22
readme : update hot topics 2024-05-07 21:43:13 +03:00
RhinoDevel
3af34c1d1b
main : update log text (EOS to EOG) (#7104)
* Update log text (EOS to EOG)

The log text "found EOS" is no longer always correct here, because there is now an is-EOG check that also returns true for EOT.

* Improve the log message further by using "an" instead of "some".

As suggested, to avoid misunderstanding (no multiple EOG tokens found, just one).
2024-05-07 20:51:31 +03:00
omahs
04976db7a8
docs: fix typos (#7124)
* fix typo

* fix typos

* fix typo

* fix typos

* fix typo

* fix typos
2024-05-07 18:20:33 +03:00
Georgi Gerganov
947d3ad27d
ci : add GG_BUILD_EXTRA_TESTS_0 env (#7098)
* ci : add GG_BUILD_EXTRA_TESTS_0 env

ggml-ci

* Update run.sh

ggml-ci
2024-05-07 11:08:49 +03:00
HanishKVC
76791bad63 ChatON:Fix partsLengths to int32_t type, instead of int
so that the size of the elements is explicit and fixed, and in turn
in sync with the fixed int size specified wrt the c-api, even with
C compilers that have a different idea about int.

Avoid some unused vars; need to update the compile flags later to
enable the corresponding warnings.
2024-05-07 12:40:49 +05:30
HanishKVC
b3a56545d6 ChatON:Reposition alertAssistantAtEnd flag for consistency 2024-05-07 11:49:43 +05:30
HanishKVC
0852f3b7ec ChatON:ExCApi: Rename for consistency 2024-05-07 11:46:40 +05:30
HanishKVC
43a3a91b03 ChatON: Cleanup/Refine initial go at tmpl_apply_ex_capi 2024-05-07 11:44:25 +05:30
HanishKVC
7c288d3dfc ChatON: Rename to partstypes for consistency 2024-05-07 11:32:20 +05:30
HanishKVC
04b4a15177 ChatON: Initial go at chat-template-apply c-api with parts info 2024-05-07 11:08:47 +05:30
HanishKVC
f6a86cd209 ChatON: Update the Note a bit 2024-05-07 10:29:16 +05:30
William Tambellini
858f6b73f6
Add an option to build without CUDA VMM (#7067)
Add an option to build ggml CUDA without CUDA VMM.

Resolves:
https://github.com/ggerganov/llama.cpp/issues/6889
https://forums.developer.nvidia.com/t/potential-nvshmem-allocated-memory-performance-issue/275416/4
2024-05-06 20:12:14 +02:00
Georgi Gerganov
b3a995b416
flake.lock: Update (#7079)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/9126214d0a59633752a136528f5f3b9aa8565b7d?narHash=sha256-sB4SWl2lX95bExY2gMFG5HIzvva5AVMJd4Igm%2BGpZNw%3D' (2024-04-01)
  → 'github:hercules-ci/flake-parts/e5d10a24b66c3ea8f150e47dfdb0416ab7c3390e?narHash=sha256-yzcRNDoyVP7%2BSCNX0wmuDju1NUCt8Dz9%2BlyUXEI0dbI%3D' (2024-05-02)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/d8fe5e6c92d0d190646fb9f1056741a229980089?dir=lib&narHash=sha256-iMUFArF0WCatKK6RzfUJknjem0H9m4KgorO/p3Dopkk%3D' (2024-03-29)
  → 'https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/7bb2ccd8cdc44c91edba16c48d2c8f331fb3d856?narHash=sha256-Drmja/f5MRHZCskS6mvzFqxEaZMeciScCTFxWVLqWEY%3D' (2024-04-25)
  → 'github:NixOS/nixpkgs/63c3a29ca82437c87573e4c6919b09a24ea61b0f?narHash=sha256-4cPymbty65RvF1DWQfc%2BBc8B233A1BWxJnNULJKQ1EY%3D' (2024-05-02)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-05-06 08:36:06 -07:00
Georgi Gerganov
bcdee0daa7
minor : fix trailing whitespace 2024-05-06 09:31:30 +03:00
HanishKVC
b875b02979 ChatON:Initial go at vicuna chat template in meta.json
Have looked at tokenizer_config.json, the jinja file, and the default
hardcoded template in llama.cpp.

This is also one of the models where a Global BoS is needed.

NOTE: Have taken the liberty to also add a SYSTEM: prefix to the
system message, even though default vicuna doesn't seem to need it,
but vicuna-orca seems to, so that both models can be driven from the
same chat template config. I am assuming the system prefix should
not create any problem even in default vicuna; however, if it does
create a problem one can duplicate the existing vicuna block in
chaton_meta.json and make the system prefix empty in it.
2024-05-06 11:27:56 +05:30
HanishKVC
0f8f2a18c2 ChatON:chat template for OpenChat in meta.json initial go
The first model seen, among the templates added so far to the meta
json file, that needs a Global Begin.

From the tokenizer_config json file, it appears that even the system
role should have an appropriate prefix, unlike what is seen in the
hardcoded default chat apply template of llama.cpp and the chat
jinja template.
2024-05-06 11:27:56 +05:30
HanishKVC
93115a9733 ChatON: initial go at OrionStar Ai chat model template
Got it from its tokenizer config json. The same is also found in the
existing hardcoded template in the default chat apply template logic
of llama.cpp.
2024-05-06 11:27:56 +05:30
HanishKVC
989c6c4125 SimpCfg: Cleanup the Note a bit to avoid some ambiguities 2024-05-06 11:27:56 +05:30
HanishKVC
344c068d7b SimpCfg:MultiPart keys wrt get_vector
With this and the past few commits, there is now simple yet sufficient
support to help move multi-level-hierarchy config files into
SimpCfg's physically 1-level, but (if required) logically multi-level,
hierarchy flow.

Before this series of commits one could still have achieved this,
but it would have needed a bit more effort.
2024-05-06 11:27:56 +05:30
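A small, hypothetical sketch of the flattening idea described above: logical multi-level key parts are joined into the single physical key a 1-level store uses. The function name and separator are assumptions, not necessarily what SimpCfg actually does.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: flatten logical key parts, e.g. {"chaton", "vicuna", "system-prefix"},
// into one physical key for a flat (1-level) config store.
static std::string join_key_parts(const std::vector<std::string> & parts, const std::string & sep = "-") {
    std::string key;
    for (size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) {
            key += sep;
        }
        key += parts[i];
    }
    return key;
}
```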
HanishKVC
19d3c88e8a SimpCfg:MultiPart keys wrt get_value etal 2024-05-06 11:27:56 +05:30
HanishKVC
623d0b60da SimpCfg: General MultiPart support, KeyParts not Key wrt SetValue 2024-05-06 11:27:56 +05:30
HanishKVC
c6ecd9316e SimpCfg: Use to_str instead of using stringstream directly 2024-05-06 11:27:56 +05:30
HanishKVC
5380b1e86e ChatON:Update meta.json wrt command-r models
Template info picked from the tokenizer config's default entry.

Verified that the same is used in the existing hardcoded chat apply
template flow.
2024-05-06 11:27:56 +05:30
HanishKVC
2b14bcaddb SimpCfg:ChatON: add by Humans for All note 2024-05-06 11:27:56 +05:30
HanishKVC
20e5b383c5 SimpCfg:Trim DumpHexString only if SC_DEBUG_VERBOSE 2024-05-06 11:27:56 +05:30