Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext
2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help
2024-04-22 12:50:10 +03:00
Olivier Chafik
5cf5e7d490
build : generate hex dump of server assets during build (#6661)
...
* `build`: generate hex dumps of server assets on the fly
* build: workaround lack of -n on gnu xxd
* build: don't use xxd in cmake
* build: don't call xxd from build.zig
* build: more idiomatic hexing
* build: don't use xxd in Makefile (od hackery instead)
* build: avoid exceeding max cmd line limit in makefile hex dump
* build: hex dump assets at cmake build time (not config time)
2024-04-21 18:48:53 +01:00
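The build change above embeds the server's web assets as byte arrays generated at cmake build time, instead of relying on xxd. A minimal sketch of the kind of generated header this style of dump produces; the symbol names here are illustrative, not the actual generated identifiers:

```cpp
// Illustrative xxd -i style dump of a server asset such as index.html.
// The real build emits one such header per asset at build (not config) time.
unsigned char index_html[] = { 0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e /* ... */ };
unsigned int  index_html_len = sizeof(index_html);

// The server can then serve the embedded bytes without reading the filesystem:
// res.set_content(reinterpret_cast<const char *>(index_html), index_html_len, "text/html");
```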
Georgi Gerganov
40f74e4d73
llama : add option to render special/control tokens ( #6807 )
...
* make : fix common dep on llama.h
* llama : add option to render special tokens
* readme : add API change notice
ggml-ci
* swift : fix build
2024-04-21 18:36:45 +03:00
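PR #6807 above lets callers choose whether special/control tokens are rendered as text. A hedged sketch, assuming the option is the trailing boolean parameter on llama_token_to_piece described in the API change notice:

```cpp
#include <string>
#include "llama.h"

// Render a token to text, optionally keeping special/control tokens
// (e.g. <|eot_id|>) visible instead of dropping them.
static std::string token_to_text(const llama_model * model, llama_token token, bool render_special) {
    char buf[128];
    const int32_t n = llama_token_to_piece(model, token, buf, sizeof(buf), render_special);
    return n >= 0 ? std::string(buf, n) : std::string();
}
```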
Georgi Gerganov
b9cc76d87e
ggml : fix ggml_backend_cpu_supports_op() for CPY ( #0 )
2024-04-21 16:48:50 +03:00
Wouter
7dbdba5690
llama : add llama-3 chat template ( #6751 )
...
* Added llama-3 chat template
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Added EOS stop sequence according to https://github.com/ggerganov/llama.cpp/pull/6751#issuecomment-2065602862
* Removed adding of BOS token before first message
* Removed bos token from expected output from llama-3
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Added <|end_of_text|> as another stop token
* Reverted the previous change that added the end_of_text stop word for llama 3
---------
Co-authored-by: Wouter Tichelaar <tichelaarw@spar.net>
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 16:03:39 +03:00
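PR #6751 above teaches chat template handling about the Llama 3 conversation format. A minimal sketch of applying a model's built-in template with llama_chat_apply_template; passing tmpl = nullptr is assumed to pick up tokenizer.chat_template from the GGUF, which is where a converted Llama 3 model carries the template this entry adds support for:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Format a short conversation with the model's built-in chat template and
// append the assistant prefix so generation can start immediately.
static std::string format_chat(const llama_model * model) {
    std::vector<llama_chat_message> msgs = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };
    std::vector<char> buf(4096);
    const int32_t n = llama_chat_apply_template(model, /*tmpl =*/ nullptr,
                                                msgs.data(), msgs.size(),
                                                /*add_ass =*/ true,
                                                buf.data(), (int32_t) buf.size());
    if (n < 0 || n > (int32_t) buf.size()) {
        return {};
    }
    return std::string(buf.data(), n);
}
```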
pmysl
c1386c936e
gguf-py : add IQ1_M to GGML_QUANT_SIZES ( #6761 )
2024-04-21 15:49:30 +03:00
Jan Boon
e8d35f47cb
doc : add link to falcon ( #6789 )
2024-04-21 15:35:40 +03:00
Mohammadreza Hendiani
2cca09d509
readme : add Fedora instructions ( #6783 )
...
* added fedora to the list of distros that may need the package (the packages have the same name on Fedora)
* how to add clblast, which is available in the fedora repos
2024-04-21 15:32:05 +03:00
Justine Tunney
89b0bf0d5d
llava : use logger in llava-cli ( #6797 )
...
This change removes printf() logging so llava-cli is shell scriptable.
2024-04-21 15:19:04 +03:00
Pedro Cuenca
b97bc3966e
llama : support Llama 3 HF conversion ( #6745 )
...
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
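Besides the HF conversion, #6745 above introduces llama_token_is_eog(), generalizing the end-of-sequence check so generation also stops on EOT-style tokens such as Llama 3's <|eot_id|>. A hedged sketch of how a sampling loop would use it:

```cpp
#include "llama.h"

// Keep generating until the sampled token is any end-of-generation token,
// not only the classic EOS id.
static bool keep_generating(const llama_model * model, llama_token sampled) {
    return !llama_token_is_eog(model, sampled);
}
```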
Jan Boon
b8109bc013
doc : server tests require llama to be built with curl enabled ( #6788 )
2024-04-20 18:29:50 +02:00
Georgi Gerganov
aed82f6837
common : try to fix Android CI ( #6780 )
...
* common : disable get_math_cpu_count() until Android CI gets fixed
* common : another try
2024-04-20 13:27:12 +03:00
loonerin
0e4802b2ec
ci: add ubuntu latest release and fix missing build number (mac & ubuntu) ( #6748 )
2024-04-19 19:03:35 +02:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
...
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
...
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
...
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn
2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor
2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code
2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up
2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows
2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade ( #6765 )
2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes
2024-04-19 14:04:56 +03:00
Georgi Gerganov
1db66c1dac
Merge branch 'master' into gg/flash-attn
2024-04-19 14:03:55 +03:00
Georgi Gerganov
74d57f9513
llama : simplify llama_build_kv_store
...
ggml-ci
2024-04-19 13:49:57 +03:00
nopperl
9958c81b79
Implement the OLMo architecture ( #6741 )
...
* implement olmo architecture
* remove unused variable
* remove unused moe branch
* remove check for weight
* remove superfluous moe, bias and rope tensors
* clarified comment
* fix clamp_kqv setting
* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Austin
8b1b1f4982
train : add general name ( #6752 )
...
* llama : make general.name optional
* train: Add 'general.name' to model metadata
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
Neo Zhang
bca40e9814
fix wrong parameter in cmd in readme-sycl.md ( #6755 )
...
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
2024-04-19 09:16:31 +08:00
Georgi Gerganov
9ca869876e
batched-bench : add fattn arg
2024-04-18 21:41:32 +03:00
Georgi Gerganov
c16a7c2688
metal : use F32 attention accumulators
2024-04-18 21:20:30 +03:00
slaren
0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id ( #6505 )
...
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
Sigbjørn Skjæret
03c0946d73
convert : support models with multiple chat templates ( #6588 )
...
* Support converting models with multiple chat templates
Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>
Where `tokenizer.chat_templates` is an array of the template names (except `default`); `default` itself is stored in the regular `tokenizer.chat_template`.
* replace filtered characters with underscore
* New script to add/modify/remove metadata
This script creates a copy of a GGUF file and lets you add, modify, or remove metadata in the process.
Most importantly, this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.
* Add files via upload
add new script to project/readme
* flake--
2024-04-18 14:49:01 +03:00
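PR #6588 above stores extra templates under the metadata keys it lists. A minimal sketch of reading one back through the public metadata API (llama_model_meta_val_str); the "tool_use" name in the usage note is only an illustration, not a key guaranteed to exist:

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include "llama.h"

// Fetch a chat template by name: an empty name reads the default
// tokenizer.chat_template, anything else reads tokenizer.chat_template.<name>.
static std::string get_chat_template(const llama_model * model, const std::string & name) {
    const std::string key = name.empty()
        ? std::string("tokenizer.chat_template")
        : "tokenizer.chat_template." + name;
    std::vector<char> buf(8192);
    const int32_t n = llama_model_meta_val_str(model, key.c_str(), buf.data(), buf.size());
    if (n < 0) {
        return {};
    }
    return std::string(buf.data(), std::min((size_t) n, buf.size() - 1));
}

// usage: get_chat_template(model, "");          // default template
//        get_chat_template(model, "tool_use");  // named template (illustrative)
```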
Georgi Gerganov
fa9e8c6689
Merge branch 'master' into gg/flash-attn
2024-04-18 14:39:23 +03:00
Ren Xuancheng
e11b2e6e1e
Qwen2 : assume tied weights if lm_head/output weights are missing ( #6738 )
2024-04-18 14:38:04 +03:00
Georgi Gerganov
105332cc17
metal : add BS=1 kernel for flash attention ( #6508 )
...
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
2024-04-18 14:33:07 +03:00
Georgi Gerganov
260cdb2d08
llama-bench : add -fa,--flash-attn arg
2024-04-18 14:28:19 +03:00
Johannes Gäßler
87968de9a9
fix KQ FP32 precision for parallel_blocks > 1
2024-04-18 13:15:32 +02:00
Johannes Gäßler
2f538b9547
Add __hgt2_mask implementation for CUDA 11
2024-04-18 13:15:32 +02:00
Johannes Gäßler
0bc67dd1c8
Calculate KQ as FP32 if KQV has GGML_PREC_F32
2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0
store temp KQ in registers
2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3
flush softmax exp below threshold to 0
2024-04-18 13:15:32 +02:00
Johannes Gäßler
6a3b84236d
fix flash_attn_vec_f16 race condition
2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39
CUDA: refactor host code, dyn. par. blocks
2024-04-18 13:15:32 +02:00
slaren
c71bfd736e
llama : fix compatibility with old 2 expert models ( #6735 )
2024-04-18 10:04:47 +03:00
Pierrick HYMBERT
5668c79ea0
server: bench: enable flash_attn param
2024-04-17 23:26:29 +02:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed ( #6716 )
...
* build : sgemm.o only when needed
ggml-ci
* llamafile : tmp disable due to MoE bug
ggml-ci
2024-04-17 23:58:26 +03:00
Yaroslav
8dd1ec8b3f
readme : add UI ( #6724 )
...
* Update README.md
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick HYMBERT
405385726e
server: support flash_attn param
2024-04-17 14:05:02 +02:00
Georgi Gerganov
599ce84a71
llama : flash_attn cparam + fix defrag
2024-04-17 12:01:39 +03:00
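The entries above wire flash attention through as a context parameter and expose it on the tools as -fa / --flash-attn. A hedged sketch of enabling it programmatically, assuming the cparam is a boolean flash_attn field as described here:

```cpp
#include "llama.h"

// Create a context with the flash attention code path enabled.
static llama_context * make_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn = true; // the -fa / --flash-attn switch maps to this field
    return llama_new_context_with_model(model, cparams);
}
```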