Commit graph

2821 commits

Author SHA1 Message Date
Georgi Gerganov
78d363b0d4
llama : replace bool need_kq_pos with use_alibi 2024-04-23 17:15:13 +03:00
Georgi Gerganov
3864eea4cb
ggml : add TODO's for F16/F32 mask/pos support in other backends 2024-04-23 10:06:56 +03:00
Georgi Gerganov
c129369702
cuda : try to fix __hgt2_mask
ggml-ci
2024-04-23 09:18:55 +03:00
Georgi Gerganov
c70bfd7bcb
cuda : "constexpr dim3" -> "const dim3"
ggml-ci
2024-04-22 20:31:23 +03:00
Georgi Gerganov
5408d55506
cuda : uint -> uint32_t 2024-04-22 19:12:06 +03:00
Georgi Gerganov
f725ca90fb
ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
2024-04-22 14:53:11 +03:00
Georgi Gerganov
c11d05fec0
llama : force disable flash attention for incompatible models 2024-04-22 12:50:41 +03:00
Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext 2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help 2024-04-22 12:50:10 +03:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn 2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor 2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code 2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up 2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows 2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade (#6765) 2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes 2024-04-19 14:04:56 +03:00
Georgi Gerganov
1db66c1dac
Merge branch 'master' into gg/flash-attn 2024-04-19 14:03:55 +03:00
Georgi Gerganov
74d57f9513
llama : simplify llama_build_kv_store
ggml-ci
2024-04-19 13:49:57 +03:00
nopperl
9958c81b79
Implement the OLMo architecture (#6741)
* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Austin
8b1b1f4982
train : add general name (#6752)
* llama : make general.name optional

* train: Add 'general.name' to model metadata

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
Neo Zhang
bca40e9814
fix wrong parameter in cmd in readme-sycl.md (#6755)
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
2024-04-19 09:16:31 +08:00
Georgi Gerganov
9ca869876e
batched-bench : add fattn arg 2024-04-18 21:41:32 +03:00
Georgi Gerganov
c16a7c2688
metal : use F32 attention accumulators 2024-04-18 21:20:30 +03:00
slaren
0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id (#6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
Sigbjørn Skjæret
03c0946d73
convert : support models with multiple chat templates (#6588)
* Support converting models with multiple chat templates

Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>

Here `tokenizer.chat_templates` is an array of the template names (excluding `default`); the `default` template is stored in the regular `tokenizer.chat_template` key.

* replace filtered characters with underscore

* New script to add/modify/remove metadata

This script creates a copy of a GGUF file and allows you to add/modify/remove metadata in the process.

Most importantly this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.

* Add files via upload

add new script to project/readme

* flake--
2024-04-18 14:49:01 +03:00
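To make the key layout described in the commit above concrete, here is a small illustrative sketch of how a named template could be resolved under that scheme. The key names come from the commit message; the template names and the helper function are hypothetical and not part of the actual conversion script.

```python
# Illustrative metadata layout for multiple chat templates, per the commit above.
# The template names ("tool_use", "rag") and this helper are hypothetical examples.
metadata = {
    "tokenizer.chat_template": "<default Jinja template>",
    "tokenizer.chat_templates": ["tool_use", "rag"],  # names of the non-default templates
    "tokenizer.chat_template.tool_use": "<tool-use Jinja template>",
    "tokenizer.chat_template.rag": "<RAG Jinja template>",
}

def get_chat_template(meta: dict, name: str = "default") -> str:
    """Return the template string for `name`, falling back to the default key."""
    if name == "default":
        return meta["tokenizer.chat_template"]
    if name in meta.get("tokenizer.chat_templates", []):
        return meta[f"tokenizer.chat_template.{name}"]
    raise KeyError(f"unknown chat template: {name}")
```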
Georgi Gerganov
fa9e8c6689
Merge branch 'master' into gg/flash-attn 2024-04-18 14:39:23 +03:00
Ren Xuancheng
e11b2e6e1e
Qwen2 : assume tied weights if lm_head/output weights are missing (#6738) 2024-04-18 14:38:04 +03:00
Georgi Gerganov
105332cc17
metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel
2024-04-18 14:33:07 +03:00
Georgi Gerganov
260cdb2d08
llama-bench : add -fa,--flash-attn arg 2024-04-18 14:28:19 +03:00
Johannes Gäßler
87968de9a9
fix KQ FP32 precision for parallel_blocks > 1 2024-04-18 13:15:32 +02:00
Johannes Gäßler
2f538b9547
Add __hgt2_mask implementation for CUDA 11 2024-04-18 13:15:32 +02:00
Johannes Gäßler
0bc67dd1c8
Calculate KQ as FP32 if KQV has GGML_PREC_F32 2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0
store temp KQ in registers 2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3
flush softmax exp below threshold to 0 2024-04-18 13:15:32 +02:00
Johannes Gäßler
6a3b84236d
fix flash_attn_vec_f16 race condition 2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39
CUDA: refactor host code, dyn. par. blocks 2024-04-18 13:15:32 +02:00
slaren
c71bfd736e
llama : fix compatibility with old 2 expert models (#6735) 2024-04-18 10:04:47 +03:00
Pierrick HYMBERT
5668c79ea0
server: bench: enable flash_attn param 2024-04-17 23:26:29 +02:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed (#6716)
* build : sgemm.o only when needed

ggml-ci

* llamafile : tmp disable due to MoE bug

ggml-ci
2024-04-17 23:58:26 +03:00
Yaroslav
8dd1ec8b3f
readme : add UI (#6724)
* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick HYMBERT
405385726e
server: support flash_attn param 2024-04-17 14:05:02 +02:00
Georgi Gerganov
599ce84a71
llama : flash_attn cparam + fix defrag 2024-04-17 12:01:39 +03:00
Georgi Gerganov
2c41180e88
Merge branch 'master' into gg/flash-attn 2024-04-17 10:13:09 +03:00
Zheng.Deng
facb8b56f8
convert : fix autoawq gemma (#6704)
* fix autoawq quantized gemma model convert error

Quantizing a Gemma model with autoawq produces an lm_head.weight tensor in model-00001-of-00002.safetensors. As a result, convert-hf-to-gguf.py can't map lm_head.weight; skipping this tensor prevents the error.

* change code to full string match and print necessary message

Change the code to a full string match and print a short message to inform users that lm_head.weight has been skipped.

---------

Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>
2024-04-16 23:51:07 +03:00
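The skip described in the commit above boils down to an exact tensor-name check during conversion. Below is a minimal sketch of that idea; `should_skip_tensor` and the surrounding loop are hypothetical illustrations, not the actual convert-hf-to-gguf.py code.

```python
# Minimal sketch of the full-string-match skip described in the commit above.
# `should_skip_tensor` is a hypothetical helper, not the real conversion code.
def should_skip_tensor(name: str) -> bool:
    if name == "lm_head.weight":  # full string match, as the commit specifies
        print("skipping tensor 'lm_head.weight' (cannot be mapped by convert-hf-to-gguf.py)")
        return True
    return False

# Illustrative use inside a conversion loop:
# for name, tensor in model_tensors.items():
#     if should_skip_tensor(name):
#         continue
#     write_tensor(name, tensor)
```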
Georgi Gerganov
532c1737a1
llama : make general.name optional (#6709) 2024-04-16 23:50:38 +03:00
Georgi Gerganov
666867b799
ggml : fix llamafile sgemm wdata offsets (#6710)
ggml-ci
2024-04-16 23:50:22 +03:00
Justine Tunney
8cc91dc63c
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leapfrogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00