Meng Zhang
bb9931cf92
Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-16 01:14:55 +08:00
Meng Zhang
f989ba151d
fix: remove max_position_embeddings, use n_train_ctx
2023-09-16 00:56:19 +08:00
Meng Zhang
e1fa9dd24c
Merge pull request #3 from TabbyML/support-starcoder-mqa
feat: support starcoder mqa
2023-09-16 00:41:36 +08:00
Meng Zhang
08f35c46a6
support mqa directly
2023-09-16 00:36:47 +08:00
Meng Zhang
5ca037b9df
add other starcoder models: 3B, 7B, 15B
2023-09-16 00:10:37 +08:00
Meng Zhang
57eaa39c16
refactor: cleanup comments a bit
2023-09-16 00:05:32 +08:00
Meng Zhang
caa722095a
Merge pull request #2 from ggerganov/support-starcoder-fix
Support starcoder fix
2023-09-15 23:13:14 +08:00
Georgi Gerganov
92a4f86879
llama : make starcoder graph build more consistent with others
2023-09-15 17:57:10 +03:00
Georgi Gerganov
f82328ab65
metal : fix out-of-bounds access in soft_max kernels
2023-09-15 17:56:49 +03:00
Meng Zhang
6c353dc7c2
clean up unused code
2023-09-15 19:00:14 +08:00
Meng Zhang
a1cf66ea94
working on CPU; Metal still buggy
2023-09-15 18:45:43 +08:00
Meng Zhang
101c578715
add TBD
2023-09-15 15:23:50 +08:00
Meng Zhang
8bc76a225d
add input embeddings handling
2023-09-15 14:47:04 +08:00
Meng Zhang
ab13d071e1
store mqa directly
2023-09-15 14:18:36 +08:00
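Storing MQA directly means the checkpoint (and the KV cache) keeps the model's single shared key/value head, and attention broadcasts it across all query heads at compute time. A toy numpy sketch of that shape logic; sizes and names are illustrative, not llama.cpp's actual graph code:

```python
import numpy as np

n_head, head_dim, n_ctx = 16, 64, 8
q = np.random.randn(n_head, n_ctx, head_dim)  # per-head queries
k = np.random.randn(1, n_ctx, head_dim)       # the one shared key head
v = np.random.randn(1, n_ctx, head_dim)       # the one shared value head

# matmul broadcasts the size-1 head axis across all query heads,
# so K/V are never materialized per head
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (n_head, n_ctx, n_ctx)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)                  # softmax over keys
out = probs @ v                                        # (n_head, n_ctx, head_dim)
```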
Meng Zhang
4420cff654
fix vram calculation for starcoder
2023-09-15 13:52:43 +08:00
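The KV cache is the term an MQA-aware VRAM estimate has to get right: with head_count_kv = 1 it is n_head times smaller than the naive multi-head figure. A back-of-envelope sketch using the standard formula; the StarCoder-15B-like numbers (40 layers, 48 heads, head_dim 128, 8K context, fp16) are assumptions for illustration:

```python
def kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, bytes_per_elt=2):
    # K and V each hold n_head_kv heads of head_dim elements per position, per layer
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elt

print(kv_cache_bytes(40, 8192, 48, 128) / 1e9)  # ~8.05 GB if expanded to MHA
print(kv_cache_bytes(40, 8192, 1, 128) / 1e9)   # ~0.17 GB stored as MQA
```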
Meng Zhang
dac31da489
fix comments
2023-09-15 12:57:38 +08:00
Meng Zhang
0be15e162c
fix head count kv
2023-09-15 12:56:20 +08:00
Meng Zhang
77c7ec179c
properly load all starcoder params
2023-09-15 12:47:22 +08:00
Meng Zhang
2683611944
set n_positions to max_position_embeddings
2023-09-15 12:35:46 +08:00
Meng Zhang
a17ef39792
add max_position_embeddings
2023-09-15 12:35:17 +08:00
Meng Zhang
57f064d7c2
load starcoder weights
2023-09-15 12:12:33 +08:00
Meng Zhang
166a259f67
set head_count_kv = 1
2023-09-15 12:12:27 +08:00
Meng Zhang
7298c37e7e
add LLM_ARCH_STARCODER to llama.cpp
2023-09-15 11:49:21 +08:00
Meng Zhang
7e0a843b6a
fix ffn_down name
2023-09-15 11:45:18 +08:00
Meng Zhang
76d32cca59
convert MQA to MHA
2023-09-15 11:42:16 +08:00
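This was the first approach tried (later replaced by storing MQA directly, per the commits above): expand the single shared K/V projection into one copy per query head so the converted checkpoint looks like ordinary multi-head attention. A minimal numpy sketch assuming the usual GPT-BigCode fused-QKV layout; the function name and layout details are illustrative:

```python
import numpy as np

def mqa_to_mha_qkv(c_attn_w: np.ndarray, n_head: int, head_dim: int) -> np.ndarray:
    """Tile the shared K/V head of a fused MQA QKV weight into per-head copies."""
    n_embd = n_head * head_dim
    q = c_attn_w[:n_embd]                   # (n_embd, n_embd): per-head queries
    k = c_attn_w[n_embd:n_embd + head_dim]  # (head_dim, n_embd): single key head
    v = c_attn_w[n_embd + head_dim:]        # (head_dim, n_embd): single value head
    k = np.tile(k, (n_head, 1))             # duplicate K for every query head
    v = np.tile(v, (n_head, 1))
    return np.concatenate([q, k, v], axis=0)  # (3 * n_embd, n_embd): MHA layout
```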
Meng Zhang
eb7f0eba3e
support convert starcoder weights to gguf
2023-09-15 11:24:24 +08:00
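At its core the conversion writes architecture metadata plus renamed tensors through gguf-py. A skeleton sketch of that flow; the hyperparameter values and the dummy tensor are placeholders, tokenizer export is omitted, and the exact writer methods should be checked against the gguf-py version in use:

```python
import numpy as np
from gguf import GGUFWriter

writer = GGUFWriter("starcoder.gguf", arch="starcoder")
writer.add_context_length(8192)    # training context length
writer.add_embedding_length(6144)  # n_embd
writer.add_block_count(40)         # n_layer
writer.add_head_count(48)          # query heads
writer.add_head_count_kv(1)        # MQA: one shared key/value head

# the real script maps every HF tensor name to its gguf name;
# a tiny dummy stands in here (real shape would be n_vocab x n_embd)
writer.add_tensor("token_embd.weight", np.zeros((8, 8), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```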
Meng Zhang
0c5d4d87b0
add placeholder for starcoder in gguf / llama.cpp
2023-09-15 10:39:47 +08:00
Cebtenzzre
98311c4277
llama : make quantize example up to 2.7x faster (#3115)
2023-09-14 21:09:53 -04:00
jneem
feea179e9f
flake : allow $out/include to already exist (#3175)
2023-09-14 21:54:47 +03:00
Andrei
769266a543
cmake : compile ggml-rocm with -fpic when building shared library (#3158)
2023-09-14 20:38:16 +03:00
Asbjørn Olling
cf8238e7f4
flake : include llama.h in nix output (#3159)
2023-09-14 20:25:00 +03:00
Cebtenzzre
4b8560e72a
make : fix clang++ detection, move some definitions to CPPFLAGS (#3155)
* make : fix clang++ detection
* make : fix compiler definitions outside of CPPFLAGS
2023-09-14 20:22:47 +03:00
Alon
83a53b753a
CI: add FreeBSD & simplify CUDA Windows (#3053)
* add freebsd to ci
* bump actions/checkout to v3
* bump cuda 12.1.0 -> 12.2.0
* bump Jimver/cuda-toolkit version
* unify and simplify "Copy and pack Cuda runtime"
* install only necessary cuda sub packages
2023-09-14 19:21:25 +02:00
akawrykow
5c872dbca2
falcon : use stated vocab size (#2914)
2023-09-14 20:19:42 +03:00
bandoti
990a5e226a
cmake : add relocatable Llama package (#2960)
* Keep static libs and headers with install
* Add logic to generate Config package
* Use proper build info
* Add llama as import library
* Prefix target with package name
* Add example project using CMake package
* Update README
* Update README
* Remove trailing whitespace
2023-09-14 20:04:40 +03:00
dylan
980ab41afb
docker : add gpu image CI builds (#3103)
Enables the GPU-enabled container images to be built and pushed alongside the CPU containers.
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
2023-09-14 19:47:00 +03:00
Kerfuffle
e394084166
gguf-py : support identity operation in TensorNameMap (#3095)
Make try_suffixes keyword param optional.
2023-09-14 19:32:26 +03:00
jameswu2014
4c8643dd6e
feature : support the Baichuan series of models (#3009)
2023-09-14 12:32:10 -04:00
Leng Yue
35f73049af
speculative : add heuristic algorithm (#3006)
* Add heuristic algo for speculative
* Constrain minimum n_draft to 2
* speculative : improve heuristic impl
* speculative : be more rewarding upon guessing max drafted tokens
* speculative : fix typos
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-14 19:14:44 +03:00
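As described in the bullets above, the heuristic adapts how many tokens the draft model speculates per round from how many were accepted: a fully accepted draft is rewarded with a longer next draft, and n_draft never drops below 2. A sketch of that control logic; the exact increments are assumptions, not the committed code:

```python
def update_n_draft(n_draft: int, n_accepted: int) -> int:
    """Adapt the speculative draft length from the last round's acceptance."""
    if n_accepted >= n_draft:      # whole draft accepted: be more rewarding
        return n_draft + 2
    return max(2, n_accepted + 1)  # otherwise shrink toward what was accepted
```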
goerch
71ca2fad7d
whisper : tokenizer fix + re-enable tokenizer test for LLaMa (#3096)
* Fix for #2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fix the previous wrong fix.
* Make console usage platform specific
Work on compiler warnings.
* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.
2023-09-13 16:19:44 +03:00
Tristan Ross
1b6c650d16
cmake : add a compiler flag check for FP16 format (#3086)
2023-09-13 16:08:52 +03:00
Johannes Gäßler
0a5eebb45d
CUDA: mul_mat_q RDNA2 tunings (#2910)
* CUDA: mul_mat_q RDNA2 tunings
* Update ggml-cuda.cu
Co-authored-by: Henri Vasserman <henv@hot.ee>
---------
Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-09-13 11:20:24 +02:00
FK
84e723653c
speculative: add --n-gpu-layers-draft option (#3063)
2023-09-13 08:50:46 +02:00
Eric Sommerlade
b52b29ab9d
arm64 support for Windows (#3007)
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
2023-09-12 21:54:20 -04:00
Johannes Gäßler
4f7cd6ba9c
CUDA: fix LoRAs (#3130)
2023-09-13 00:15:33 +02:00
Johannes Gäßler
89e89599fd
CUDA: fix mul_mat_q not used for output tensor (#3127)
2023-09-11 22:58:41 +02:00
Johannes Gäßler
d54a4027a6
CUDA: lower GPU latency + fix Windows performance (#3110)
2023-09-11 19:55:51 +02:00
Jhen-Jie Hong
1b0d09259e
cmake : support build for iOS/tvOS (#3116)
* cmake : support build for iOS/tvOS
* ci : add iOS/tvOS build into macOS-latest-cmake
* ci : split ios/tvos jobs
2023-09-11 19:49:06 +08:00
Johannes Gäßler
8a4ca9af56
CUDA: add device number to error messages (#3112)
2023-09-11 13:00:24 +02:00
Kawrakow
f31b6f4e2d
metal : PP speedup (#3084)
* Minor speed gains for all quantization types
* metal: faster kernel_scale via float4
* Various other speedups for "small" kernels
* metal: faster soft_max via float4
* metal: faster diagonal infinity
Although, to me it looks like one should simply fuse scale + diagonal infinity + soft_max on the KQ tensor.
* Another faster f16 x f32 matrix multiply kernel
* Reverting the diag infinity change
It does work for PP, but somehow it fails for TG.
Need to look more into it.
* metal: add back faster diagonal infinity
This time more carefully
* metal : minor (readability)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-11 10:30:11 +03:00