Georgi Gerganov
56657e52e5
llama : fix n_batch requirements
ggml-ci
2024-04-23 17:30:37 +03:00
Georgi Gerganov
19e8982f51
llama : prep ALiBi support for BERT models
ggml-ci
2024-04-23 17:24:28 +03:00
Georgi Gerganov
78d363b0d4
llama : replace bool need_kq_pos with use_alibi
2024-04-23 17:15:13 +03:00
Georgi Gerganov
3864eea4cb
ggml : add TODOs for F16/F32 mask/pos support in other backends
2024-04-23 10:06:56 +03:00
Georgi Gerganov
c129369702
cuda : try to fix __hgt2_mask
ggml-ci
2024-04-23 09:18:55 +03:00
Georgi Gerganov
c70bfd7bcb
cuda : "constexpr dim3" -> "const dim3"
ggml-ci
2024-04-22 20:31:23 +03:00
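(An illustrative aside on the commit above: `dim3`'s constructors are not `constexpr` on every CUDA toolkit, so a `constexpr dim3` declaration can fail to compile where `const dim3` builds everywhere. A minimal sketch, not the commit's actual code:)

```cpp
#include <cuda_runtime.h>

// Rejected by some CUDA compilers, because dim3's constructor is not
// constexpr there:
//   constexpr dim3 block_dims(32, 1, 1);

// The portable spelling the commit switches to:
const dim3 block_dims(32, 1, 1);
```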
Georgi Gerganov
5408d55506
cuda : uint -> uint32_t
2024-04-22 19:12:06 +03:00
Georgi Gerganov
f725ca90fb
ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
2024-04-22 14:53:11 +03:00
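(For context: ggml's fused soft_max applies the scale, the attention mask, and the ALiBi position bias before normalizing. Schematically, with illustrative symbols rather than ggml's parameter names, where $s$ is the scale, $M$ the mask, $P$ the positions, and $m$ the per-head ALiBi slope:)

$$\operatorname{soft\_max}(x)_i = \frac{\exp(s\,x_i + M_i + m\,P_i)}{\sum_j \exp(s\,x_j + M_j + m\,P_j)}$$

(The commit above lets $M$ and $P$ be supplied as either F16 or F32.)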
Georgi Gerganov
c11d05fec0
llama : force disable flash attention for incompatible models
2024-04-22 12:50:41 +03:00
Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext
2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help
2024-04-22 12:50:10 +03:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn
2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor
2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code
2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up
2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows
2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade (#6765)
2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes
2024-04-19 14:04:56 +03:00
Georgi Gerganov
1db66c1dac
Merge branch 'master' into gg/flash-attn
2024-04-19 14:03:55 +03:00
Georgi Gerganov
74d57f9513
llama : simplify llama_build_kv_store
ggml-ci
2024-04-19 13:49:57 +03:00
nopperl
9958c81b79
Implement the OLMo architecture (#6741)
* implement olmo architecture
* remove unused variable
* remove unused moe branch
* remove check for weight
* remove superfluous moe, bias and rope tensors
* clarified comment
* fix clamp_kqv setting
* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Austin
8b1b1f4982
train : add general name (#6752)
* llama : make general.name optional
* train: Add 'general.name' to model metadata
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
Neo Zhang
bca40e9814
fix wrong parameter in cmd in readme-sycl.md (#6755)
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
2024-04-19 09:16:31 +08:00
Georgi Gerganov
9ca869876e
batched-bench : add fattn arg
2024-04-18 21:41:32 +03:00
Georgi Gerganov
c16a7c2688
metal : use F32 attention accumulators
2024-04-18 21:20:30 +03:00
slaren
0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id (#6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
Sigbjørn Skjæret
03c0946d73
convert : support models with multiple chat templates (#6588)
* Support converting models with multiple chat templates
Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>
Here `tokenizer.chat_templates` is an array of the template names (excluding `default`); the `default` template is stored in the regular `tokenizer.chat_template` key.
* replace filtered characters with underscore
* New script to add/modify/remove metadata
This script creates a copy of a GGUF file and lets you add/modify/remove metadata in the process.
Most importantly, this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.
* Add files via upload
add new script to project/readme
* flake--
2024-04-18 14:49:01 +03:00
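(The entry above only adds metadata; as a hedged sketch of the consumer side, the following reads the new keys with ggml's gguf_* C API, which is declared in ggml.h in this era of the tree. Error handling is minimal and the program is illustrative, not taken from the commit:)

```cpp
#include "ggml.h"   // gguf_* API lives here at this point in the tree
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) return 1;

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) return 1;

    // list the named templates, if the model ships more than one
    const int kid = gguf_find_key(ctx, "tokenizer.chat_templates");
    if (kid >= 0) {
        for (int i = 0; i < gguf_get_arr_n(ctx, kid); i++) {
            printf("named template: %s\n", gguf_get_arr_str(ctx, kid, i));
        }
    }

    // the default template stays in the pre-existing key
    const int def = gguf_find_key(ctx, "tokenizer.chat_template");
    if (def >= 0) {
        printf("default template:\n%s\n", gguf_get_val_str(ctx, def));
    }

    gguf_free(ctx);
    return 0;
}
```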
Georgi Gerganov
fa9e8c6689
Merge branch 'master' into gg/flash-attn
2024-04-18 14:39:23 +03:00
Ren Xuancheng
e11b2e6e1e
Qwen2 : assume tied weights if lm_head/output weights are missing (#6738)
2024-04-18 14:38:04 +03:00
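(A hedged sketch of the tied-weights idea from the commit above, using a toy tensor table rather than llama.cpp's loader API; only the tensor names follow GGUF conventions:)

```cpp
#include <map>
#include <string>

// Toy stand-in for a checkpoint's tensor table; the container and helper
// are illustrative, not llama.cpp's actual loader code.
using TensorTable = std::map<std::string, float *>;

// Tied weights: if the lm_head/output projection is missing from the
// checkpoint, reuse the token embedding matrix in its place.
float * get_output_weights(TensorTable & tensors) {
    const auto it = tensors.find("output.weight");
    if (it != tensors.end()) {
        return it->second;
    }
    return tensors.at("token_embd.weight"); // shared embedding matrix
}
```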
Georgi Gerganov
105332cc17
metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
2024-04-18 14:33:07 +03:00
Georgi Gerganov
260cdb2d08
llama-bench : add -fa,--flash-attn arg
2024-04-18 14:28:19 +03:00
Johannes Gäßler
87968de9a9
fix KQ FP32 precision for parallel_blocks > 1
2024-04-18 13:15:32 +02:00
Johannes Gäßler
2f538b9547
Add __hgt2_mask implementation for CUDA 11
2024-04-18 13:15:32 +02:00
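(`__hgt2_mask` ships only with newer CUDA toolkits. A minimal sketch of one possible CUDA 11 fallback, not necessarily the commit's implementation: derive a per-16-bit-lane all-ones mask from two scalar half comparisons.)

```cpp
#include <cuda_fp16.h>

// For each 16-bit lane of the half2 pair, produce 0xFFFF if a > b in that
// lane and 0x0000 otherwise, matching the semantics of __hgt2_mask.
static __device__ __forceinline__ unsigned int hgt2_mask_compat(const half2 a, const half2 b) {
    const unsigned int lo = __hgt(__low2half(a),  __low2half(b))  ? 0x0000FFFFu : 0u;
    const unsigned int hi = __hgt(__high2half(a), __high2half(b)) ? 0xFFFF0000u : 0u;
    return lo | hi;
}
```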
Johannes Gäßler
0bc67dd1c8
Calculate KQ as FP32 if KQV has GGML_PREC_F32
2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0
store temp KQ in registers
2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3
flush softmax exp below threshold to 0
2024-04-18 13:15:32 +02:00
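(Schematically, the flush-to-zero in the commit above: with $m = \max_j x_j$ and an illustrative threshold $\tau$, the commit's actual constant not being shown here, exponentials whose argument is too small to affect the FP16 result are forced to exactly zero:)

$$e_i = \begin{cases} \exp(x_i - m) & \text{if } x_i - m \ge \tau \\ 0 & \text{otherwise} \end{cases} \qquad p_i = \frac{e_i}{\sum_j e_j}$$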
Johannes Gäßler
6a3b84236d
fix flash_attn_vec_f16 race condition
2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39
CUDA: refactor host code, dyn. par. blocks
2024-04-18 13:15:32 +02:00
slaren
c71bfd736e
llama : fix compatibility with old 2-expert models (#6735)
2024-04-18 10:04:47 +03:00
Pierrick HYMBERT
5668c79ea0
server: bench: enable flash_attn param
2024-04-17 23:26:29 +02:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed (#6716)
* build : sgemm.o only when needed
ggml-ci
* llamafile : tmp disable due to MoE bug
ggml-ci
2024-04-17 23:58:26 +03:00
Yaroslav
8dd1ec8b3f
readme : add UI (#6724)
* Update README.md
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick HYMBERT
405385726e
server: support flash_attn param
2024-04-17 14:05:02 +02:00
Georgi Gerganov
599ce84a71
llama : flash_attn cparam + fix defrag
2024-04-17 12:01:39 +03:00
Georgi Gerganov
2c41180e88
Merge branch 'master' into gg/flash-attn
2024-04-17 10:13:09 +03:00
Zheng.Deng
facb8b56f8
convert : fix autoawq gemma (#6704)
* fix autoawq quantized gemma model convert error
Quantizing a gemma model with autoawq produces an lm_head.weight tensor in model-00001-of-00002.safetensors, which convert-hf-to-gguf.py cannot map; skipping this tensor avoids the error.
* change code to full string match and print necessary message
Change the code to use a full string match and print a short message informing users that lm_head.weight has been skipped.
---------
Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>
2024-04-16 23:51:07 +03:00
Georgi Gerganov
532c1737a1
llama : make general.name optional (#6709)
2024-04-16 23:50:38 +03:00