1229 commits

Author SHA1 Message Date
Henri Vasserman
6ddeefad9b
[Zig] Fixing Zig build and improvements (#2554)
* Fix zig after console.o was split

* Better include and flag management

* Change LTO to option
2023-08-17 23:11:18 +03:00
Kerfuffle
8dae7ce684
Add --cfg-negative-prompt-file option for examples (#2591)
Add --cfg-negative-prompt-file option for examples
2023-08-17 07:29:44 -06:00
Georgi Gerganov
a73ccf1aa3
llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
2023-08-17 10:47:09 +03:00
drbh
7cf54e1f74
tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0 terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
2023-08-17 10:41:01 +03:00
Shouzheng Liu
a872a2b28e
ggml-alloc : fix discrepancy between measure&eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
2023-08-17 10:35:53 +03:00
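
The selection rule described in the ggml-alloc commit above can be illustrated with a toy free list. This is a minimal sketch, not the real ggml-alloc code: `struct free_block`, `best_fit`, and `best_fit_last_resort` are hypothetical names, and treating the final block only as a last resort is one assumed way to make the measure and eval phases pick the same block.

```c
#include <stddef.h>

/* Toy free-list entry (hypothetical; not the real ggml-alloc structure). */
struct free_block {
    size_t offset;
    size_t size;
};

/* Plain best fit: index of the smallest block that can hold `need`, or -1. */
static int best_fit(const struct free_block *blocks, int n, size_t need) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (blocks[i].size >= need &&
            (best < 0 || blocks[i].size < blocks[best].size)) {
            best = i;
        }
    }
    return best;
}

/* Phase-independent variant (an assumption for illustration): search every
 * block except the last one, and fall back to the last block only when
 * nothing else fits. The size of the last block then no longer influences
 * which of the other blocks is chosen. */
static int best_fit_last_resort(const struct free_block *blocks, int n, size_t need) {
    if (n <= 0) {
        return -1;
    }
    int best = best_fit(blocks, n - 1, need);
    if (best < 0 && blocks[n - 1].size >= need) {
        best = n - 1;
    }
    return best;
}
```

With a free list of a 64-byte block followed by a final block, a 40-byte request under plain best fit lands in the first block when the final block is very large (measure) but in the final block when it is only 48 bytes (eval). The last-resort variant picks the first block in both cases.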
xaedes
714fec06ee
use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
2023-08-16 23:53:12 +02:00
xaedes
9198b24e4e
add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
2023-08-16 23:50:46 +02:00
Kolen Cheung
0919a0f73d
cmake : install ggml-meta.metal if LLAMA_METAL (#2449)
2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
ed53db86c3
metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
xaedes
f80e245d7b
add lora finetune support on quantized base model tensors
2023-08-16 22:08:44 +02:00
Shouzheng Liu
fc8ef549e5
metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-16 23:08:28 +03:00
Shouzheng Liu
bf83bff674
metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in previous commit.
2023-08-16 23:07:04 +03:00
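
The 64-bit versus 32-bit divide trade-off mentioned in the Metal commit above can be sketched in plain C. This is an illustration under assumptions, not the actual Metal kernel: `kv_offset_div64`, `kv_offset_div32`, `gqa` (query heads per shared KV head, assumed to be at least 1), and `head_stride` (bytes per KV head) are hypothetical names.

```c
#include <stdint.h>

/* Slow variant: the head index is divided with 64-bit operands. */
static uint64_t kv_offset_div64(uint64_t q_head, uint64_t gqa, uint64_t head_stride) {
    return (q_head / gqa) * head_stride;
}

/* Faster variant: the divide is done on 32-bit operands, and the result is
 * widened to 64 bits only for the final multiply. Any quantity carried in
 * 32 bits is bounded by UINT32_MAX (2^32 - 1, roughly 4 GiB when it counts
 * bytes), which is the kind of limit the commit message refers to. */
static uint64_t kv_offset_div32(uint32_t q_head, uint32_t gqa, uint64_t head_stride) {
    uint32_t kv_head = q_head / gqa;          /* 32-bit divide */
    return (uint64_t)kv_head * head_stride;   /* widen before multiplying */
}
```

Both variants return the same offset whenever the inputs fit in 32 bits, for example `kv_offset_div32(17, 8, 1024)` and `kv_offset_div64(17, 8, 1024)` both give 2048.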
xaedes
83a4ad7986
remove trailing whitespace
2023-08-16 22:05:41 +02:00
xaedes
83cb9ed4f5
implement ggml_compute_forward_out_prod_q_f32
2023-08-16 22:01:06 +02:00
xaedes
79ad888768
remove unused call to nonexistent llama_get_layer_from_model
2023-08-16 21:56:36 +02:00
xaedes
1151653b15
replace the llama API functions for getting model tensors with one function that gets a model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
2023-08-16 21:36:40 +02:00
xaedes
39a2d15461
avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
2023-08-16 16:42:25 +02:00
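
The general idea behind the stack overflow fix above, moving a large graph object off the stack, can be sketched as follows. This is a toy illustration: `struct toy_graph`, `toy_graph_new`, and `toy_graph_add` are hypothetical and use `calloc`, whereas the commit uses ggml_new_graph together with ggml_build_forward_expand.

```c
#include <stdlib.h>

#define TOY_MAX_NODES 4096

/* A graph with fixed-size node arrays is large; declaring one as a local
 * variable puts all of it on the stack. */
struct toy_graph {
    int    n_nodes;
    void * nodes[TOY_MAX_NODES];
    void * grads[TOY_MAX_NODES];
};

/* Allocate a zeroed graph on the heap; the caller releases it with free(). */
static struct toy_graph * toy_graph_new(void) {
    return calloc(1, sizeof(struct toy_graph));
}

/* Append a node; returns its index, or -1 if the graph is missing or full. */
static int toy_graph_add(struct toy_graph * g, void * node) {
    if (g == NULL || g->n_nodes >= TOY_MAX_NODES) {
        return -1;
    }
    g->nodes[g->n_nodes] = node;
    return g->n_nodes++;
}
```

Only a pointer to the graph lives on the stack in this version, so building several graphs in nested calls no longer multiplies the stack usage.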
xaedes
0ab2507ce5
fix names of lora tensors
2023-08-16 16:41:20 +02:00
xaedes
620275361d
add debug prints for training memory improvements
2023-08-16 16:23:21 +02:00
xaedes
be7e564b11
bug fixes to make finetune compile
automatic allocator does not work yet
2023-08-16 16:21:43 +02:00
xaedes
50b1e66200
remove const model and layer arguments in API functions for accessing model tensors
2023-08-16 16:21:02 +02:00
xaedes
28ee0c8583
first draft for LORA finetune training
2023-08-16 15:31:04 +02:00
xaedes
c0a372fd3d
add API functions to access remaining model parameters:
mult, head and rot
2023-08-16 15:30:31 +02:00
xaedes
9eb1ef8653
move and remove code
2023-08-15 14:03:02 +02:00
xaedes
5e059ace25
add stub example for finetuning, based on train-text-from-scratch
2023-08-15 13:54:28 +02:00
xaedes
316b0707f4
add API functions to access llama model tensors
2023-08-15 13:53:13 +02:00
Georgi Gerganov
b5ffb2849d
scripts : add helper script to get wikitext
2023-08-15 10:05:25 +03:00
Jhen-Jie Hong
3ebb00935f
server : add missing /json-schema-to-grammar.mjs (#2616)
fixes #2611
2023-08-15 06:14:14 +08:00
xaedes
3b5515bbe0
reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order, gradient checkpointing with allocator saves 13% memory on a 16 layer model; on a 2 layer model it saves 2% memory.

the computation results are the same
2023-08-14 22:09:36 +02:00
xaedes
56228461c8
fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
2023-08-14 21:12:02 +02:00
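
The pattern described in the optimizer fix above, creating the plan and its work buffer once and reusing it in every iteration, can be sketched with a toy plan type. All names here (`struct toy_plan`, `toy_plan_create`, `toy_compute`, `toy_optimize`, `toy_alloc_count`) are hypothetical stand-ins, not the ggml cplan API.

```c
#include <stdlib.h>

/* Toy stand-in for a compute plan that owns a scratch work buffer. */
struct toy_plan {
    size_t          work_size;
    unsigned char * work_data;
};

/* Counts buffer allocations so the reuse can be observed. */
static int toy_alloc_count = 0;

static struct toy_plan toy_plan_create(size_t work_size) {
    struct toy_plan p;
    p.work_size = work_size;
    p.work_data = malloc(work_size);
    toy_alloc_count++;
    return p;
}

static void toy_plan_free(struct toy_plan * p) {
    free(p->work_data);
    p->work_data = NULL;
    p->work_size = 0;
}

/* Stand-in for one graph evaluation: it only touches the work buffer. */
static void toy_compute(struct toy_plan * p, int iter) {
    if (p->work_data != NULL && p->work_size > 0) {
        p->work_data[0] = (unsigned char) iter;
    }
}

/* The plan is created once before the loop and reused by every iteration,
 * instead of being created anew inside the loop. */
static void toy_optimize(int n_iter, size_t work_size) {
    struct toy_plan plan = toy_plan_create(work_size);
    for (int i = 0; i < n_iter; i++) {
        toy_compute(&plan, i);
    }
    toy_plan_free(&plan);
}
```

Calling `toy_optimize(100, 1024)` performs a single buffer allocation regardless of the iteration count, whereas creating the plan inside the loop would perform one hundred.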
xaedes
3e6468b097
fix test for when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
2023-08-14 20:57:18 +02:00
xaedes
098654c277
only use ggml_allocr_alloc when tensor has NULL data and is not a view
2023-08-14 20:57:18 +02:00
xaedes
faf3e21eaf
add debug asserts in ggml_allocr_alloc to catch some common pitfalls when using this function directly
2023-08-14 20:50:09 +02:00
xaedes
6e280b24dc
remove unused forward_batch function
2023-08-14 19:02:12 +02:00
xaedes
3794dceb7f
remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
2023-08-14 18:44:42 +02:00
xaedes
6f161c784b
remove trailing whitespace
2023-08-14 18:33:27 +02:00
xaedes
271e4d64b5
remove unused training parameters "use_scratch" and "use_unified"
2023-08-14 18:31:59 +02:00
xaedes
c954f41ca4
remove handwritten training functions
2023-08-14 18:30:50 +02:00
xaedes
fe788a1c7a
allocate graph on context using ggml_new_graph
2023-08-14 18:24:13 +02:00
xaedes
75baed230c
set names for tensors in unified train function for easier debugging
2023-08-14 18:17:14 +02:00
xaedes
3e99a8d653
format name of cloned tensors with " (clone)" suffix
2023-08-14 18:15:09 +02:00
xaedes
865c4cd3c1
integrate unified training function which may use memory allocator
the unified training function also takes arguments for whether to use flash attention and/or gradient checkpointing
2023-08-14 18:12:58 +02:00
xaedes
4ed096c6b0
add training options for whether to use allocator and/or unified training function
2023-08-14 18:10:02 +02:00
xaedes
d6c5b03858
fix ASSERT to work with zero layers
2023-08-14 18:08:19 +02:00
xaedes
38f4438c32
make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution

parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration

checkpoint tensors are allocated all together to reduce memory allocator fragmentation

afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
2023-08-14 18:07:16 +02:00
xaedes
9716eb8ef0
fix variable name and add missing boolean negation
2023-08-14 17:59:19 +02:00
xaedes
5884b43a62
add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
2023-08-14 17:58:49 +02:00
xaedes
b2f1310196
swap arguments to commutative ops to be the same as in forward_batch_wo_cache_flash_attn
2023-08-14 17:57:13 +02:00
xaedes
5a11b75875
fix variable names
2023-08-14 17:55:51 +02:00
xaedes
345f516f7c
correctly clone view tensors by setting data pointers
without this, checkpointing would only work when used together with the memory allocator
2023-08-14 17:55:13 +02:00