llama.cpp

Author	SHA1	Message	Date
HanishKVC	bdd279c0c9	ChatOn:User Begin+Prefix note update, keep things simple consistent	2024-05-06 11:27:56 +05:30
HanishKVC	84367b9fd1	ChatON: Add template for DeepSeek Was looking at the tokenized vector, and noticed that the EOS mentioned by existing chat_apply_template of llama.cpp, is different from what I noticed in tokenizer_config.json of deepseek llm, so I have added two entries * "deepseek-alt" which matches llama.cpp's chat_apply_template and * "deepseek" which matches that in tokenizer_config.json. This impacts the assistant suffix and reverse prompt entries. CasOfThis: Need to look into other entries which I added previously at a later time. However as the default logic should be picking the EOS from model file, so I assume reverse-prompt being outofsync, may not matter beyond a limit, potentially.	2024-05-06 11:27:56 +05:30
HanishKVC	f4b54069f6	ChatON: Add template for Gemma	2024-05-06 11:27:56 +05:30
HanishKVC	2a8028fba8	ChatON: Add Zephyr template to meta-json file	2024-05-06 11:27:56 +05:30
HanishKVC	57bd772bfd	ChatON: Cleanup logging Avoid showing on screen the debug messages. meta-dump can either show on screen or not, based on how LOGXLN is defined.	2024-05-06 11:27:56 +05:30
HanishKVC	217544e5ff	ChatON: Keep compiler happy Order the functions so that no need for separate prototypes Also use kv_bool wrt boolean entries. Convert string to c char *	2024-05-06 11:27:56 +05:30
HanishKVC	3f9dfc240c	ChatON: Check for the boolean entries in meta-json	2024-05-06 11:27:56 +05:30
HanishKVC	42f6b45547	ChatON: Use the constants defined for the keys	2024-05-06 11:27:56 +05:30
HanishKVC	efb758ba7d	ChatON: Rename helpers to kv suffix, updated wrt metaok rename because they return value of specified key. [main] update metaok to take template-id, so that one can cross check that all needed entries are there wrt that template-id in the chaton-meta-json file	2024-05-06 11:27:56 +05:30
HanishKVC	e8c24c0767	ChatOn:MetaOk: Allows template-id based cross check For a given template-id, cross check, all needed entries are there in the json.	2024-05-06 11:27:56 +05:30
HanishKVC	b1055641e9	ChatON: Update the notes a bit	2024-05-06 11:27:56 +05:30
HanishKVC	11b47fbcfc	ChatON:MetaJson: Add key constants, check metaJson loaded ifNeeded	2024-05-06 11:27:56 +05:30
HanishKVC	221ccd6462	ChatOn: Add SystemUser-1st-User-Has-Prefix flag support Llama2 seems to need it, so chaton-meta-json sample file updated to use same.	2024-05-06 11:27:56 +05:30
HanishKVC	f03dd2439f	ChatOn:No global-begin/end in ChatApplyTmplSingle, ChatApplyTmpl Avoid adding global begin/end markers wrt ChatApplyTmplSingle. Add ChatApplyTmpl which goes through a vector of messages.	2024-05-06 11:27:56 +05:30
HanishKVC	c4cf0e9075	ChatON:Cleanup: BeginEnd, Debug log Update the note Rename global-prefix\|suffix to global-begin\|end. Rename chat-apply-template to chat-apply-template-single, cas it handles only a single message. Add some debug log messages to the helper functions	2024-05-06 11:27:56 +05:30
HanishKVC	d87d27512e	ChatOn: update sample meta json a bit Move [inst] [/inst] wrt llama2 from global to individual role specific parts. Avoid an extra \n wrt prefixes of llama3	2024-05-06 11:27:55 +05:30
HanishKVC	cdbe4f06ce	Chaton:Sample Meta JSON cleanup	2024-05-06 11:27:55 +05:30
HanishKVC	050d329e7e	ChatOn+Main: Initial go at chaton in main interactive flow	2024-05-06 11:27:55 +05:30
HanishKVC	1374a64200	Chaton:Meta: Add chatml meta data to sample meta json file	2024-05-06 11:27:55 +05:30
HanishKVC	093abc29a2	ChatOn: Update sample meta json to be a valid json	2024-05-06 11:27:55 +05:30
HanishKVC	dc56be951d	ChatOn:Main: Load and dump any specified chaton meta file	2024-05-06 11:27:55 +05:30
HanishKVC	35f25196a0	ChatOn:Common: Add the needed cmdline arg params and its parsing	2024-05-06 11:27:55 +05:30
HanishKVC	2146a253e8	ChatOn: Capture the idea	2024-05-06 11:27:55 +05:30
kunnis	628b299106	Adding support for the --numa argument for llama-bench. (#7080)	2024-05-05 14:17:47 +02:00
Sigbjørn Skjæret	8f8acc8683	Disable benchmark on forked repo (#7034) * Disable benchmark on forked repo * only check owner on schedule event * check owner on push also * more readable as multi-line * ternary won't work * style++ * test++ * enable actions debug * test-- * remove debug * test++ * do debug where we can get logs * test-- * this is driving me crazy * correct github.event usage * remove test condition * correct github.event usage * test++ * test-- * event_name is pull_request_target * test++ * test-- * update ref checks	2024-05-05 13:38:55 +02:00
Lyle Dean	ca36326020	readme : add note that LLaMA 3 is not supported with convert.py (#7065)	2024-05-05 08:21:46 +03:00
DAN™	889bdd7686	command-r : add BPE pre-tokenization (#7063) * Add BPE pre-tokenization for Command-R/R+. * Bump transformers convert requirement. * command-r : add individual digits regex --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-05 08:19:30 +03:00
Brian	6fbd432211	py : logging and flake8 suppression refactoring (#7081) Set one as executable and add basicConfig() to another. Also added noqa tag to test scripts.	2024-05-05 08:07:48 +03:00
Xuan Son Nguyen	842500144e	gguf-split: add --no-tensor-first-split (#7072)	2024-05-04 18:56:22 +02:00
Jeximo	cf768b7e71	Tidy Android Instructions README.md (#7016 ) * Tidy Android Instructions README.md Remove CLBlast instructions(outdated), added OpenBlas. * don't assume git is installed Added apt install git, so that git clone works * removed OpenBlas Linked to Linux build instructions * fix typo Remove word "run" * correct style Co-authored-by: slaren <slarengh@gmail.com> * correct grammar Co-authored-by: slaren <slarengh@gmail.com> * delete reference to Android API * remove Fdroid reference, link directly to Termux Fdroid is not required Co-authored-by: slaren <slarengh@gmail.com> * Update README.md Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-05-04 18:10:15 +02:00
viric	fcd84a0f5a	Fix Linux /sys cpu path to guess number of cores (#7064)	2024-05-04 15:26:53 +02:00
maor-ps	03fb8a002d	If first token generated from the server is the stop word the server will crash (#7038) This will reproduce the issue in llama13b { 'prompt': 'Q: hello world \nA: ', 'stop': ['\n'], 'temperature': 0.0, 'n_predict': 10, 'cache_prompt': True, 'n_probs': 10 }	2024-05-04 11:06:40 +02:00
Georgi Gerganov	92139b90af	tests : add test-tokenizer-0.sh + fix some tokenizers (#7036) * tests : add test-tokenizer-0.sh * unicode : add all unicode number ranges * starcoder : fix pre-tokenizer * tests : add test that fails with DeepSeek tokenizers * falcon : fix regex * unicode : regenerate unicode tables * refact : add tokenizer model * lint : fix * tests : disable failing tests ggml-ci * refact : add tests files ggml-ci * convert : print -> logging ggml-ci * lint : fix * unicode : digit -> number * phi-3 : update	2024-05-04 08:32:32 +03:00
Brian	a2ac89d6ef	convert.py : add python logging instead of print() (#6511 ) * convert.py: add python logging instead of print() * convert.py: verbose flag takes priority over dump flag log suppression * convert.py: named instance logging * convert.py: use explicit logger id string * convert.py: convert extra print() to named logger * convert.py: sys.stderr.write --> logger.error * .py: Convert all python scripts to use logging module requirements.txt: remove extra line * flake8: update flake8 ignore and exclude to match ci settings * gh-actions: add flake8-no-print to flake8 lint step * pre-commit: add flake8-no-print to flake8 and also update pre-commit version * convert-hf-to-gguf.py: print() to logger conversion * .py: logging basiconfig refactor to use conditional expression .py: removed commented out logging fixup! .py: logging basiconfig refactor to use conditional expression constant.py: logger.error then exit should be a raise exception instead * .py: Convert logger error and sys.exit() into a raise exception (for atypical error) gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar * verify-checksum-model.py: This is the result of the program, it should be printed to stdout. * compare-llama-bench.py: add blank line for readability during missing repo response * reader.py: read_gguf_file() use print() over logging * convert.py: warning goes to stderr and won't hurt the dump output * gguf-dump.py: dump_metadata() should print to stdout * convert-hf-to-gguf.py: print --> logger.debug or ValueError() * verify-checksum-models.py: use print() for printing table * .py: refactor logging.basicConfig() gguf-py/gguf/.py: use __name__ as logger name Since they will be imported and not run directly. 
python-lint.yml: use .flake8 file instead * constants.py: logger no longer required * convert-hf-to-gguf.py: add additional logging * convert-hf-to-gguf.py: print() --> logger * .py: fix flake8 warnings revert changes to convert-hf-to-gguf.py for get_name() * convert-hf-to-gguf-update.py: use triple quoted f-string instead * .py: accidentally corrected the wrong line *.py: add compilade warning suggestions and style fixes	2024-05-03 22:36:41 +03:00
Daniel Bevenius	433def286e	llama : rename ctx to user_data in progress_callback (#7045) * llama : rename ctx to user_data in progress_callback This commit renames the `ctx` parameter to `user_data` in the `llama_progress_callback` typedef. The motivation for this is that other callbacks use `user_data` or `data`, and using `ctx` in this case might be confusing as it could be confused with `llama_context`. --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-05-03 15:24:30 +02:00
Bartowski	60325fa56f	Remove .attention from skipped tensors to match more accurately (#7051)	2024-05-03 01:49:09 +02:00
alwqx	6ecf3189e0	chore: fix typo in llama.cpp (#7032) Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-05-02 11:56:41 -04:00
Andrew Downing	b0d943de17	Update LOG_IMPL and LOG_TEE_IMPL (#7029) ROCm clang defines _MSC_VER which results in the wrong implementation of LOG_IMPL and LOG_TEE_IMPL being compiled. This fixes https://github.com/ggerganov/llama.cpp/issues/6972	2024-05-01 23:31:30 +02:00
l3utterfly	8d608a81b7	main : fix off by one error for context shift (#6921)	2024-05-01 22:27:41 +03:00
Johannes Gäßler	3ea0d36000	Server: add tests for batch size, different seeds (#6950)	2024-05-01 17:52:55 +02:00
Johannes Gäßler	1613ef8d8e	CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (#7019)	2024-05-01 14:46:37 +02:00
slaren	c4ec9c0d3d	ci : exempt confirmed bugs from being tagged as stale (#7014)	2024-05-01 08:13:59 +03:00
Johannes Gäßler	a8f9b07631	perplexity: more statistics, added documentation (#6936) * perplexity: more statistics, added documentation * add LLaMA 3 8b scoreboard	2024-04-30 23:36:27 +02:00
Kevin Gibbons	f364eb6fb5	switch to using localizedDescription (#7010)	2024-04-30 17:14:02 +02:00
Georgi Gerganov	77e15bec62	metal : remove deprecated error code (#7008)	2024-04-30 15:52:21 +03:00
Kevin Gibbons	a68a1e7ed0	metal : log more info on error (#6987)	2024-04-30 12:34:50 +03:00
Georgi Gerganov	9c67c2773d	ggml : add Flash Attention (#5021 ) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : 
speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (#6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml 
: fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-04-30 12:16:08 +03:00
Georgi Gerganov	952d03dbea	convert : use utf8 encoding (#7000) * convert : use utf8 encoding * convert : update instructions and warning message	2024-04-30 11:05:25 +03:00
Olivier Chafik	8843a98c2b	Improve usability of --model-url & related flags (#6930 ) * args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf) * args: main & server now call gpt_params_handle_model_default * args: define DEFAULT_MODEL_PATH + update cli docs * curl: check url of previous download (.json metadata w/ url, etag & lastModified) * args: fix update to quantize-stats.cpp * curl: support legacy .etag / .lastModified companion files * curl: rm legacy .etag file support * curl: reuse regex across headers callback calls * curl: unique_ptr to manage lifecycle of curl & outfile * curl: nit: no need for multiline regex flag * curl: update failed test (model file collision) + gitignore *.gguf.json	2024-04-30 00:52:50 +01:00
Clint Herron	b8c1476e44	Extending grammar integration tests (#6644 ) * Cleaning up integration tests to share code between tests and make it simpler to add new tests. * Add tests around quantifiers to ensure both matching and non-matching compliance. * Add slightly more complex grammar with quantifiers to test references with quantifiers. * Fixing build when C++17 is not present. * Separating test calls to give more helpful stack traces on failure. Adding verbose messages to give visibility for what is being tested. * Adding quotes around strings to explicitly show whitespace * Removing trailing whitespace. * Implementing suggestions from @ochafik -- grammars and test strings now print and flush before tests to aid in debugging segfaults and whatnot. * Cleaning up forgotten symbols. Modifying simple test to use test harness. Added comments for more verbose descriptions of what each test is accomplishing. * Unicode symbol modifications to hopefully make log easier to parse visually.	2024-04-29 14:40:14 -04:00

2917 commits