Commit graph

2456 commits

Author SHA1 Message Date
Branden Butler
c6280bc3f4 Update to use backend GUID and changed signatures 2024-03-14 20:32:41 -05:00
Branden Butler
01be58caa9 Fix simple to use new per-node thread count 2024-03-14 20:32:39 -05:00
Branden Butler
619bf62acf Support new MPI backend in llama.cpp and increase GGML max split inputs 2024-03-14 20:32:35 -05:00
Branden Butler
942ce843f8 Working MPI backend implementation 2024-03-14 20:26:39 -05:00
Branden Butler
bc93545005 Allow MPI backend to wrap multiple backends 2024-03-14 20:26:37 -05:00
Branden Butler
968cefb4a9 Wrap backends with MPI backend 2024-03-14 20:26:35 -05:00
Branden Butler
b98274c76f Begin transition to backend v2 2024-03-14 20:26:32 -05:00
Branden Butler
aa166462f1 Fix draft thread args and remove grads from mpi eval_init 2024-03-14 20:26:28 -05:00
Branden Butler
c9d18263b3 Allow per-node threads to be set in command-line args, add mpi support to main 2024-03-14 20:26:24 -05:00
Branden Butler
32078d6fe1 Fix missing layer_inp_i names 2024-03-14 20:24:50 -05:00
Branden Butler
b7599f7a56 Fix some mpi mem leaks, add mpi-layer-split to help when using mpi 2024-03-14 20:24:48 -05:00
Branden Butler
888d4f591b Update MPI code to new KV seq rm and bos/eos model APIs 2024-03-14 20:24:39 -05:00
Branden Butler
bcfb190c28 Synchronize batch sequence info, fixing MPI for llama_decode() 2024-03-14 20:23:08 -05:00
Branden Butler
ede7ff0c66 Fix MPI compilation errors 2024-03-14 20:07:07 -05:00
Branden Butler
50a63eb5f9 Fix minor rebase errors 2024-03-14 20:07:02 -05:00
Branden Butler
fda60ead35 Replace vector with C-style array and length in llama_split_layers_weighted 2024-03-14 20:06:58 -05:00
Branden Butler
364b707130 Remove unrelated sections from mpi readme 2024-03-14 20:06:53 -05:00
Branden Butler
6c07d6cfa1 Remove fprintf logs from mpi main 2024-03-14 20:06:48 -05:00
Branden Butler
8fe813130a Update MPI example to follow main changes 2024-03-14 20:06:43 -05:00
Branden Butler
16eff5af69 Disable warmup under MPI 2024-03-14 20:06:36 -05:00
Branden Butler
4829c6224e Revert accidental removal of ggml_mpi_backend_init 2024-03-14 20:04:13 -05:00
Branden Butler
78112ab5c2 Remove mtest (#3177) 2024-03-14 20:04:11 -05:00
Branden Butler
1e78fa4f91 Add code comments in MPI 2024-03-14 20:04:09 -05:00
Branden Butler
40a810923a Add documentation for ggml-mpi functions 2024-03-14 20:04:07 -05:00
Branden Butler
3ca1ca0182 Refactor MPI for heterogeneous cluster support.
Adds support for different options and a different number of layers
per node.

The per-node options are implemented by parsing
command-line options from a file instead of from the
command line itself. This allows each node to have its own
version of this options file.

The differing number of layers per node is implemented
as a new option, `mpi-layer-split`, that takes
a list of percentages. These percentages are used to calculate
the range of layers to delegate to each node. The ranges
are calculated on the head node and then scattered to the other
nodes to maintain a single source of truth.
2024-03-14 20:04:05 -05:00
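
Below is a minimal sketch of how the `mpi-layer-split` percentages described in the commit above could be mapped onto per-node layer ranges. The helper name, the rounding, and the command-line spelling in the comment are assumptions for illustration only; the actual logic lives in the ggml-mpi code on this branch, which also scatters the computed ranges from the head node.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical helper: turn split fractions (e.g. 0.5 0.3 0.2) into
// [start, end) layer ranges, one per MPI rank. The head node would compute
// these and then scatter them to the other ranks.
static std::vector<std::pair<int, int>> split_layers(const std::vector<float> & splits, int n_layers) {
    float total = 0.0f;
    for (float s : splits) {
        total += s;
    }

    std::vector<std::pair<int, int>> ranges;
    int start = 0;
    for (size_t i = 0; i < splits.size(); ++i) {
        // the last rank takes whatever is left so every layer is assigned exactly once
        int end = (i + 1 == splits.size())
            ? n_layers
            : std::min(n_layers, start + (int) (n_layers * splits[i] / total + 0.5f));
        ranges.emplace_back(start, end);
        start = end;
    }
    return ranges;
}

int main() {
    // e.g. a split of 0.5 0.3 0.2 across three nodes for a 32-layer model
    for (const auto & r : split_layers({0.5f, 0.3f, 0.2f}, 32)) {
        printf("layers [%d, %d)\n", r.first, r.second);
    }
    return 0;
}
```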
Georgi Gerganov
4755afd1cb
llama : fix integer overflow during quantization (#6063) 2024-03-14 22:58:41 +02:00
Steve Grubb
6e0438da3c
gguf : fix resource leaks (#6061)
There are several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also, on Linux, llama-bench was missing an
fclose.
2024-03-14 20:29:32 +02:00
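
As a hedged illustration of the leak pattern fixed above (not the patched code itself), the sketch below frees the gguf context on every error path and pairs the fopen with an fclose. gguf_init_from_file and gguf_free are the existing ggml C API; inspect_gguf and its error handling are invented for the example.

```cpp
#include <cstdio>
#include "ggml.h" // the gguf API is declared here at this point in the tree

// Illustrative only: the point of #6061 is that every early return after
// gguf_init_from_file must free the context, and every FILE needs an fclose.
static bool inspect_gguf(const char * fname) {
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };

    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == nullptr) {
        return false;
    }

    FILE * f = fopen(fname, "rb");
    if (f == nullptr) {
        gguf_free(ctx); // error path: do not leak the gguf context
        return false;
    }

    printf("%s: %d tensors\n", fname, (int) gguf_get_n_tensors(ctx));

    fclose(f);      // the llama-bench leak: close the file handle as well
    gguf_free(ctx);
    return true;
}
```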
Ondřej Čertík
727107707a
gguf-py : bump version to 0.8.0 (#6060) 2024-03-14 19:57:31 +02:00
Michael Podvitskiy
69ff61397d
llama : support models without vocabulary (#5798)
* additional methods to read model and ctx parameters

* vocab size as a part of a model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert
2024-03-14 18:21:56 +02:00
Georgi Gerganov
044ec4b2a5
embedding : add EOS token if not present (#899) 2024-03-14 15:14:14 +02:00
Georgi Gerganov
77178eedc8
gguf-py : fix dtype check (#6045) 2024-03-14 13:32:14 +02:00
Jian Liao
15a333260a
readme : improve readme for Llava-1.6 example (#6044)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14 13:18:23 +02:00
Pierrick Hymbert
43241adf22
server: disable debug release type sanitizer, simplify trigger (#6047)
- increase timeout for server
- do not fail fast
2024-03-14 13:15:39 +02:00
Georgi Gerganov
a44bc969e4
llama : fix typo 2024-03-14 13:13:06 +02:00
Michael Podvitskiy
2c4fb69246
llama : optimize defrag moves + fix fragmentation calculation (#6037)
* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14 12:56:48 +02:00
Ondřej Čertík
3ca23481dd
gguf-py : add support for I8, I16 and I32 (#6045)
* Refactor dtype handling to be extensible

This code is equivalent to before, but now it is prepared to easily add
more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
2024-03-14 12:40:14 +02:00
Georgi Gerganov
3fe8d7a17f
ggml : designate enum vals for integer types (#6050) 2024-03-14 12:38:37 +02:00
Georgi Gerganov
68265ebfc6
embedding : print all resulting embeddings (#899) 2024-03-14 12:37:20 +02:00
Georgi Gerganov
381da2d9f0
metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
Georgi Gerganov
0fd6c1f015
embedding : print cosine similarity (#899) 2024-03-14 10:12:29 +02:00
Linwei Wang
19885d205e
readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov
76a936c893
readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Clint Herron
463628372d
grammar : handle missing "root" node (#6004) 2024-03-13 20:10:40 +02:00
slaren
f30ea47a87
llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increased to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
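
A rough sketch of how the batch sizes touched by this PR might be set through the C API: n_batch as the logical batch passed to llama_decode and n_ubatch as the physical micro-batch that pipeline parallelism works in. Treat the exact field and function names as assumptions to verify against llama.h at this commit, and "model.gguf" as a placeholder path.

```cpp
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 2048; // logical batch size (the PR reduced the default to 2048)
    cparams.n_ubatch = 512;  // physical micro-batch; n_batch is split into these chunks

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... decode work would go here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```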
slaren
d8fd0ccf6a
test-backend-ops : skip CPU backend by default (#6028) 2024-03-13 15:58:30 +02:00
AidanBeltonS
b3d978600f
Update get version (#6025) 2024-03-13 18:47:54 +05:30
Xuan Son Nguyen
99b71c068f
Server: Use multi-task for embeddings endpoint (#6001)
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
slaren
306d34be7a
ci : remove tidy-review (#6021) 2024-03-12 17:55:19 +02:00
Georgi Gerganov
8030da7afe
ggml : reuse quantum structs across backends (#5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-12 14:27:20 +02:00
Georgi Gerganov
184215e783
ggml : fix UB in IQ2_S and IQ3_S (#6012) 2024-03-12 13:49:55 +02:00