llama.cpp

Author	SHA1	Message	Date
Georgi Gerganov	bc346166f9	metal : minor	2024-04-19 17:24:52 +03:00
Georgi Gerganov	1a88565b44	metal : clean-up kernel code	2024-04-19 15:52:49 +03:00
Georgi Gerganov	97eaece7d6	metal : clean-up	2024-04-19 15:30:27 +03:00
Georgi Gerganov	703c6e6528	ggml : fix arm fp16 store on windows	2024-04-19 14:20:41 +03:00
Georgi Gerganov	e32b281743	llama : adapt build_olmo to changes	2024-04-19 14:04:56 +03:00
Georgi Gerganov	1db66c1dac	Merge branch 'master' into gg/flash-attn	2024-04-19 14:03:55 +03:00
Georgi Gerganov	74d57f9513	llama : simplify llama_build_kv_store ggml-ci	2024-04-19 13:49:57 +03:00
nopperl	9958c81b79	Implement the OLMo architecture (#6741) * implement olmo architecture * remove unused variable * remove unused moe branch * remove check for weight * remove superfluous moe, bias and rope tensors * clarified comment * fix clamp_kqv setting * remove obsolete parameter name filter	2024-04-19 11:35:54 +02:00
Austin	8b1b1f4982	train : add general name (#6752) * llama : make general.name optional * train: Add 'general.name' to model metadata Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com> --------- Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-19 10:16:45 +03:00
Neo Zhang	bca40e9814	fix wrong parameter in cmd in readme-sycl.md (#6755) Co-authored-by: jianyuzh <jianyu.zhang@intel.com>	2024-04-19 09:16:31 +08:00
Georgi Gerganov	9ca869876e	batched-bench : add fattn arg	2024-04-18 21:41:32 +03:00
Georgi Gerganov	c16a7c2688	metal : use F32 attention accumulators	2024-04-18 21:20:30 +03:00
slaren	0d56246f4b	ggml : group all experts in a single ggml_mul_mat_id (#6505) * ggml : group all experts in a single ggml_mul_mat_id cuda : improve mmid row copy * cuda : fix bin bcast with non-cont src0 * test-backend-ops : only run all mul mat tests for base types * llama : disable moe offloading with SYCL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-18 15:18:48 +02:00
Sigbjørn Skjæret	03c0946d73	convert : support models with multiple chat templates (#6588) * Support converting models with multiple chat templates Adds the following metadata: * tokenizer.chat_templates * tokenizer.chat_template.<name1> * tokenizer.chat_template.<name2> * tokenizer.chat_template.<...> Where `tokenizer.chat_templates` is an array of the template names (except `default`), `default` is added to the regular `tokenizer.chat_template`. * replace filtered characters with underscore * New script to add/modify/remove metadata This scripts creates a copy of a GGUF file and allows you to add/modify/remove metadata in the process. Most importantly this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file. * Add files via upload add new script to project/readme * flake--	2024-04-18 14:49:01 +03:00
Georgi Gerganov	fa9e8c6689	Merge branch 'master' into gg/flash-attn	2024-04-18 14:39:23 +03:00
Ren Xuancheng	e11b2e6e1e	Qwen2 : assume tied weights if lm_head/output weights is missing (#6738)	2024-04-18 14:38:04 +03:00
Georgi Gerganov	105332cc17	metal : add BS=1 kernel for flash attention (#6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel	2024-04-18 14:33:07 +03:00
Georgi Gerganov	260cdb2d08	llama-bench : add -fa,--flash-attn arg	2024-04-18 14:28:19 +03:00
Johannes Gäßler	87968de9a9	fix KQ FP32 precision fpr parallel_blocks > 1	2024-04-18 13:15:32 +02:00
Johannes Gäßler	2f538b9547	Add __hgt2_mask implementation for CUDA 11	2024-04-18 13:15:32 +02:00
Johannes Gäßler	0bc67dd1c8	Calculate KQ as FP32 if KQV has GGML_PREC_F32	2024-04-18 13:15:32 +02:00
Johannes Gäßler	a5b0e2dea0	store temp KQ in registers	2024-04-18 13:15:32 +02:00
Johannes Gäßler	ef9e1593f3	flush softmax exp below threshold to 0	2024-04-18 13:15:32 +02:00
Johannes Gäßler	6a3b84236d	fix flash_attn_vec_f16 race condition	2024-04-18 13:15:32 +02:00
Johannes Gäßler	34f93bbb39	CUDA: refactor host code, dyn. par. blocks	2024-04-18 13:15:32 +02:00
slaren	c71bfd736e	llama : fix compatibility with old 2 expert models (#6735)	2024-04-18 10:04:47 +03:00
Pierrick HYMBERT	5668c79ea0	server: bench: enable flash_attn param	2024-04-17 23:26:29 +02:00
Georgi Gerganov	3b8f1ec4b1	llamafile : tmp disable + build sgemm.o when needed (#6716) * build : sgemm.o only when needed ggml-ci * llamafile : tmp disable due to MoE bug ggml-ci	2024-04-17 23:58:26 +03:00
Yaroslav	8dd1ec8b3f	readme : add UI (#6724) * Update README.md * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-17 15:47:50 +03:00
Pierrick HYMBERT	405385726e	server: support flash_attn param	2024-04-17 14:05:02 +02:00
Georgi Gerganov	599ce84a71	llama : flash_attn cparam + fix defrag	2024-04-17 12:01:39 +03:00
Georgi Gerganov	2c41180e88	Merge branch 'master' into gg/flash-attn	2024-04-17 10:13:09 +03:00
Zheng.Deng	facb8b56f8	convert : fix autoawq gemma (#6704) * fix autoawq quantized gemma model convert error using autoawq to quantize gemma model will include a lm_head.weight tensor in model-00001-of-00002.safetensors. it result in this situation that convert-hf-to-gguf.py can't map lm_head.weight. skip loading this tensor could prevent this error. * change code to full string match and print necessary message change code to full string match and print a short message to inform users that lm_head.weight has been skipped. --------- Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>	2024-04-16 23:51:07 +03:00
Georgi Gerganov	532c1737a1	llama : make general.name optional (#6709)	2024-04-16 23:50:38 +03:00
Georgi Gerganov	666867b799	ggml : fix llamafile sgemm wdata offsets (#6710) ggml-ci	2024-04-16 23:50:22 +03:00
Justine Tunney	8cc91dc63c	ggml : add llamafile sgemm (#6414) This change upstreams llamafile's cpu matrix multiplication kernels which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster thus making them faster than quantized data types for prompt evals. This change also introduces bona fide AVX512 support since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second, using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leap frogs to 464 tokens/second. On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels but also because I added support for correctly counting the number of cores on Alderlake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores. This work was sponsored by Mozilla who's given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/	2024-04-16 21:55:30 +03:00
Ashish	dbceec87c0	llama : add StableLM2 12B (#6635) * StableLM2 12B support for huggingface -> GGUF * StableLM12 tensormapping and constants * StableLM-2-12b model support * fix * Added 12B support * Removed autoformatting; resolved bug where model_arch was not selecting StableLM2 * Formatting * Do QK norm stacking in model conversion step * Converge StableLM and StableLM2 code to simplify graph construction * Fix accidental removal * Removed warnings * Revert formatter * Move QK norm stack to private function so it's easier to read * refactor stablelm graph builder to support 1.6, 3b and 12b more efficiently * Proper check for None type for new_name to avoid crash; formatting; revert change to base class `write_tensors()` * Format * Formatting * format Co-authored-by: compilade <git@compilade.net> * Fix incorrect check for K norm * space after commas; Keep indentation multiple of 4 spaces * Flake8 format * Removed unnecessary conditional branches * Removed unused comment * Fixed incorrect tensor passing * Format --------- Co-authored-by: compilade <git@compilade.net>	2024-04-16 18:48:35 +03:00
Shijie	f4dea7da18	llama : add qwen2moe (#6074) * support qwen2moe * fix-review * metal : support unary ops for nelements % 4 != 0 * metal : require contiguousness for float4 unary kernels * metal : require contiguousness for float4 unary kernels (cont) * fix-review * names : for brevity "SHARED_EXP" -> "SHEXP" * llama : reuse build_moe_ffn() * llama : add model type name --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-16 18:40:48 +03:00
Daniel Bevenius	8a56075b07	gritlm : add --outdir option to hf.sh script (#6699) This commit updates the hf.sh script usage to include the --outdir option and specifies the models directory as the output directory. The motivation for this is to avoid cluttering the root directory with model files. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-04-16 09:34:06 +03:00
Georgi Gerganov	58227ffdeb	perplexity : require positive --ctx-size arg (#6695)	2024-04-16 09:28:33 +03:00
Daniel Bevenius	4fbd8098e6	gguf : add special tokens metadata for FIM/Infill (#6689) This commit adds special token metadata for Fill-In-the-Middle (FIM)/Infill to the GGUF model. The motivation for this is that currently there is support for CodeLlama but other models exist now like CodeGemma, but the different models use different token ids for the special tokens and this commit allows for supporting multiple models. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-04-16 09:13:13 +03:00
Olivier Chafik	7593639ce3	`main`: add --json-schema / -j flag (#6659) * main: add --json-schema / -j * json: move json-schema-to-grammar to common lib * json: fix zig build	2024-04-15 18:35:21 +01:00
compilade	132f55795e	llama : fix restoring the number of outputs from state files (#6687)	2024-04-15 15:56:55 +03:00
Pierrick Hymbert	3272896d79	server : revert "minor layout improvements" (#6684) This reverts commit `b3a96f27f0`.	2024-04-15 15:18:47 +03:00
Steven Prichard	7fc16a2c32	swift : linux support (#6590) - Package.swift now supports conditional compilation based on OS - Allows for package to be used by SPM on Non-Apple platforms Co-authored-by: Steven Prichard <steven.prichard@justeattakeaway.com>	2024-04-15 13:14:46 +03:00
Neo Zhang Jianyu	17e98d4c96	fix mul_mat_id() for new input, make the ut pass (#6682)	2024-04-15 17:12:26 +08:00
David Renshaw	1958f7e06c	llama : add missing kv clear in llama_beam_search (#6664)	2024-04-14 15:24:15 -04:00
Chao Jiang	04fbc5f23e	Add Command R chat template (#6650) * Add chat template for command-r model series * Fix indentation * Add chat template test for command-r models and update the implementation to trim whitespaces * Remove debug print	2024-04-14 18:16:34 +02:00
Georgi Gerganov	f184dd9208	flake.lock: Update (#6669)	2024-04-14 06:55:30 -07:00
Dave	422c2aff1c	Added support for GGML_OP_CLAMP in Metal (#6662) * Added support for GGML_OP_CLAMP in Metal * Corrected size --------- Co-authored-by: dave-fl <dave@Davids-MacBook-Pro.local>	2024-04-14 13:14:19 +02:00

2807 commits