Commit graph

2456 commits

Author SHA1 Message Date
Branden Butler
c6280bc3f4 Update to use backend GUID and changed signatures 2024-03-14 20:32:41 -05:00
Branden Butler
01be58caa9 Fix simple to use new per-node thread count 2024-03-14 20:32:39 -05:00
Branden Butler
619bf62acf Support new MPI backend in llama.cpp and increase GGML max split inputs 2024-03-14 20:32:35 -05:00
Branden Butler
942ce843f8 Working MPI backend implementation 2024-03-14 20:26:39 -05:00
Branden Butler
bc93545005 Allow MPI backend to wrap multiple backends 2024-03-14 20:26:37 -05:00
Branden Butler
968cefb4a9 Wrap backends with MPI backend 2024-03-14 20:26:35 -05:00
Branden Butler
b98274c76f Begin transition to backend v2 2024-03-14 20:26:32 -05:00
Branden Butler
aa166462f1 Fix draft thread args and remove grads from mpi eval_init 2024-03-14 20:26:28 -05:00
Branden Butler
c9d18263b3 Allow per-node threads to be set in command-line args, add mpi support to main 2024-03-14 20:26:24 -05:00
Branden Butler
32078d6fe1 Fix missing layer_inp_i names 2024-03-14 20:24:50 -05:00
Branden Butler
b7599f7a56 Fix some mpi mem leaks, add mpi-layer-split to help when using mpi 2024-03-14 20:24:48 -05:00
Branden Butler
888d4f591b Update MPI code to new KV seq rm and bos/eos model APIs 2024-03-14 20:24:39 -05:00
Branden Butler
bcfb190c28 Synchronize batch sequence info, fixing MPI for llama_decode() 2024-03-14 20:23:08 -05:00
Branden Butler
ede7ff0c66 Fix MPI compilation errors 2024-03-14 20:07:07 -05:00
Branden Butler
50a63eb5f9 Fix minor rebase errors 2024-03-14 20:07:02 -05:00
Branden Butler
fda60ead35 Replace vector with C-style array and length in llama_split_layers_weighted 2024-03-14 20:06:58 -05:00
Branden Butler
364b707130 Remove unrelated sections from mpi readme 2024-03-14 20:06:53 -05:00
Branden Butler
6c07d6cfa1 Remove fprintf logs from mpi main 2024-03-14 20:06:48 -05:00
Branden Butler
8fe813130a Update MPI example to follow main changes 2024-03-14 20:06:43 -05:00
Branden Butler
16eff5af69 Disable warmup under MPI 2024-03-14 20:06:36 -05:00
Branden Butler
4829c6224e Revert accidental removal of ggml_mpi_backend_init 2024-03-14 20:04:13 -05:00
Branden Butler
78112ab5c2 Remove mtest (#3177) 2024-03-14 20:04:11 -05:00
Branden Butler
1e78fa4f91 Add code comments in MPI 2024-03-14 20:04:09 -05:00
Branden Butler
40a810923a Add documentation for ggml-mpi functions 2024-03-14 20:04:07 -05:00
Branden Butler
3ca1ca0182 Refactor MPI for heterogeneous cluster support.
Adds support for different options and a different number of layers
per node.

The per-node options are implemented by parsing
command-line options from a file instead of from the
command line itself. This allows each node to have its own
version of this options file.

The differing number of layers per node is implemented
as a new option, `mpi-layer-split`, that takes
a list of percentages. These percentages are used to calculate
the range of layers to delegate to each node. The ranges
are calculated on the head node and then scattered to the other
nodes to maintain a single source of truth.
2024-03-14 20:04:05 -05:00
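
Below is a minimal sketch of how the `mpi-layer-split` percentages described in the commit above could be mapped onto per-node layer ranges. The helper name, the rounding, and the command-line spelling in the comment are assumptions for illustration only; the actual logic lives in the ggml-mpi code on this branch, which also scatters the computed ranges from the head node.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical helper: turn split fractions (e.g. 0.5 0.3 0.2) into
// [start, end) layer ranges, one per MPI rank. The head node would compute
// these and then scatter them to the other ranks.
static std::vector<std::pair<int, int>> split_layers(const std::vector<float> & splits, int n_layers) {
    float total = 0.0f;
    for (float s : splits) {
        total += s;
    }

    std::vector<std::pair<int, int>> ranges;
    int start = 0;
    for (size_t i = 0; i < splits.size(); ++i) {
        // the last rank takes whatever is left so every layer is assigned exactly once
        int end = (i + 1 == splits.size())
            ? n_layers
            : std::min(n_layers, start + (int) (n_layers * splits[i] / total + 0.5f));
        ranges.emplace_back(start, end);
        start = end;
    }
    return ranges;
}

int main() {
    // e.g. a split of 0.5 0.3 0.2 across three nodes for a 32-layer model
    for (const auto & r : split_layers({0.5f, 0.3f, 0.2f}, 32)) {
        printf("layers [%d, %d)\n", r.first, r.second);
    }
    return 0;
}
```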
Georgi Gerganov
4755afd1cb
llama : fix integer overflow during quantization (#6063) 2024-03-14 22:58:41 +02:00
Steve Grubb
6e0438da3c
gguf : fix resource leaks (#6061)
There are several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also, on Linux, llama-bench was missing an
fclose.
2024-03-14 20:29:32 +02:00
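
As a hedged illustration of the leak pattern fixed above (not the patched code itself), the sketch below frees the gguf context on every error path and pairs the fopen with an fclose. gguf_init_from_file and gguf_free are the existing ggml C API; inspect_gguf and its error handling are invented for the example.

```cpp
#include <cstdio>
#include "ggml.h" // the gguf API is declared here at this point in the tree

// Illustrative only: the point of #6061 is that every early return after
// gguf_init_from_file must free the context, and every FILE needs an fclose.
static bool inspect_gguf(const char * fname) {
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };

    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == nullptr) {
        return false;
    }

    FILE * f = fopen(fname, "rb");
    if (f == nullptr) {
        gguf_free(ctx); // error path: do not leak the gguf context
        return false;
    }

    printf("%s: %d tensors\n", fname, (int) gguf_get_n_tensors(ctx));

    fclose(f);      // the llama-bench leak: close the file handle as well
    gguf_free(ctx);
    return true;
}
```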
Ondřej Čertík
727107707a
gguf-py : bump version to 0.8.0 (#6060) 2024-03-14 19:57:31 +02:00
Michael Podvitskiy
69ff61397d
llama : support models without vocabulary (#5798)
* additional methods to read model and ctx parameters

* vocab size as a part of a model metadata

* models without vocabulary, convert.py part

* models without vocabulary, llama.cpp part

* PR clean up

* converter script fixes

* llama_vocab_type update (renamed the new key)

* pr review fixes

* revert function renaming

* one more NoVocab assert
2024-03-14 18:21:56 +02:00
Georgi Gerganov
044ec4b2a5
embedding : add EOS token if not present (#899) 2024-03-14 15:14:14 +02:00
Georgi Gerganov
77178eedc8
gguf-py : fix dtype check (#6045) 2024-03-14 13:32:14 +02:00
Jian Liao
15a333260a
readme : improve readme for Llava-1.6 example (#6044)
Co-authored-by: Jian Liao <jianliao@adobe.com>
2024-03-14 13:18:23 +02:00
Pierrick Hymbert
43241adf22
server: disable debug release type sanitizer, simplify trigger (#6047)
- increase timeout for server
- do not fail fast
2024-03-14 13:15:39 +02:00
Georgi Gerganov
a44bc969e4
llama : fix typo 2024-03-14 13:13:06 +02:00
Michael Podvitskiy
2c4fb69246
llama : optimize defrag moves + fix fragmentation calculation (#6037)
* attempt to reduce the impact of a worst-case scenario

* fragmentation calculation fix

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-14 12:56:48 +02:00
Ondřej Čertík
3ca23481dd
gguf-py : add support for I8, I16 and I32 (#6045)
* Refactor dtype handling to be extensible

This code is equivalent to before, but now it is prepared to easily add
more NumPy dtypes.

* Add support for I8, I16 and I32

These types are allowed in the GGUF specification.

* Add support for I8, I16 and I32 to gguf_writer

* Add support for I8, I16, I32 to gguf_reader
2024-03-14 12:40:14 +02:00
Georgi Gerganov
3fe8d7a17f
ggml : designate enum vals for integer types (#6050) 2024-03-14 12:38:37 +02:00
Georgi Gerganov
68265ebfc6
embedding : print all resulting embeddings (#899) 2024-03-14 12:37:20 +02:00
Georgi Gerganov
381da2d9f0
metal : build metallib + fix embed path (#6015)
* metal : build metallib + fix embed path

ggml-ci

* metal : fix embed build + update library load logic

ggml-ci

* metal : fix embedded library build

ggml-ci

* ci : fix iOS builds to use embedded library
2024-03-14 11:55:23 +02:00
Georgi Gerganov
0fd6c1f015
embedding : print cosine similarity (#899) 2024-03-14 10:12:29 +02:00
Linwei Wang
19885d205e
readme : update details about running llama in Termux on Android (#6039) 2024-03-13 20:34:40 +02:00
Georgi Gerganov
76a936c893
readme : update API changes and hot topics 2024-03-13 20:33:56 +02:00
Clint Herron
463628372d
grammar : handle missing "root" node (#6004) 2024-03-13 20:10:40 +02:00
slaren
f30ea47a87
llama : add pipeline parallelism support (#6017)
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs

ggml-ci

* server : add -ub, --ubatch-size parameter

* fix server embedding test

* llama : fix Mamba inference for pipeline parallelism

Tested to work correctly with both `main` and `parallel` examples.

* llama : limit max batch size to n_batch

* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increased to 4 (from 2)

changing this value may improve performance for some systems, but increases memory usage

* fix hip build

* fix sycl build (disable cpy_tensor_async)

* fix hip build

* llama : limit n_batch and n_ubatch to n_ctx during context creation

* llama : fix norm backend

* batched-bench : sync after decode

* swiftui : sync after decode

* ggml : allow ggml_get_rows to use multiple threads if they are available

* check n_ubatch >= n_tokens with non-causal attention

* llama : do not limit n_batch to n_ctx with non-causal attn

* server : construct batch with size of llama_n_batch

* ggml_backend_cpu_graph_compute : fix return value when alloc fails

* llama : better n_batch and n_ubatch comment

* fix merge

* small fix

* reduce default n_batch to 2048

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-13 18:54:21 +01:00
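
A rough sketch of how the batch sizes touched by this PR might be set through the C API: n_batch as the logical batch passed to llama_decode and n_ubatch as the physical micro-batch that pipeline parallelism works in. Treat the exact field and function names as assumptions to verify against llama.h at this commit, and "model.gguf" as a placeholder path.

```cpp
#include <cstdio>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 2048; // logical batch size (the PR reduced the default to 2048)
    cparams.n_ubatch = 512;  // physical micro-batch; n_batch is split into these chunks

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... decode work would go here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```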
slaren
d8fd0ccf6a
test-backend-ops : skip CPU backend by default (#6028) 2024-03-13 15:58:30 +02:00
AidanBeltonS
b3d978600f
Update get version (#6025) 2024-03-13 18:47:54 +05:30
Xuan Son Nguyen
99b71c068f
Server: Use multi-task for embeddings endpoint (#6001)
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
slaren
306d34be7a
ci : remove tidy-review (#6021) 2024-03-12 17:55:19 +02:00
Georgi Gerganov
8030da7afe
ggml : reuse quantum structs across backends (#5943)
* ggml : reuse quant blocks across backends

ggml-ci

* ggml : define helper constants only for CUDA and SYCL

ggml-ci

* ggml : define helper quantum constants for SYCL

ggml-ci
2024-03-12 14:27:20 +02:00
Georgi Gerganov
184215e783
ggml : fix UB in IQ2_S and IQ3_S (#6012) 2024-03-12 13:49:55 +02:00