Commit graph

1255 commits

Author SHA1 Message Date
xaedes
27c24ffa1b
add option to save finetune output every N iterations 2023-08-20 20:16:46 +02:00
xaedes
d61ed6b431
mixing multiple LORA adapters is now possible
pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.
2023-08-20 18:48:35 +02:00
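As a usage illustration for the commit above (file names and the -m model flag are placeholders): ./main -m base-model.bin --lora adapter-a.bin --lora-scaled adapter-b.bin 0.5 applies adapter-a at the default scale and adapter-b at a user-defined scale of 0.5.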
Kawrakow
5e9ff54a67
More efficient Hellaswag implementation (#2677)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-20 16:44:46 +03:00
Georgi Gerganov
1f0bccb279
server : better default prompt (#2646) 2023-08-19 05:45:36 +08:00
Jhen-Jie Hong
f63564adfa
server : update xxd usage for older versions compatibility (#2649)
* server : update xxd usage for older versions compatibility

* remove unused $func
2023-08-19 05:41:32 +08:00
Adrian
2d8b76a110
Add link to clojure bindings to Readme. (#2659) 2023-08-18 21:39:22 +02:00
xaedes
37dfb544aa
resolve todo
the allocator will only make an operation inplace when its tensors are of the same type
2023-08-18 21:22:41 +02:00
xaedes
3e47890760
remove unnecessary src tensor from ggml_repeat & ggml_repeat_back
we don't need the data of src[1] for the computation, only to set up the correct output shape.
remove the dependency on src[1], so that the allocator can work more freely.

the computational graph is still completely determined, because the output shape is naturally included
2023-08-18 20:51:00 +02:00
xaedes
65b0561637
remove unnecessary src tensor from ggml_get_rows_back
we don't need the data of src[2] for the computation, only to set up the correct output shape.
remove the dependency on src[2], so that the allocator can work more freely.

the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.
2023-08-18 20:25:42 +02:00
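A minimal sketch of the technique used in the two commits above (simplified; the field handling and variable names only approximate the actual ggml code): the result tensor copies its shape from a reference tensor at graph-build time, but the reference is not recorded as a source, so the allocator never needs to keep its data alive.

    // simplified sketch, not the actual ggml implementation
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, shape_ref->n_dims, shape_ref->ne);
    result->op     = GGML_OP_GET_ROWS_BACK;
    result->src[0] = grad;   // gradient values
    result->src[1] = rows;   // row indices
    // the shape reference is intentionally NOT stored in src[2]
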
xaedes
6c98640035
bug fix: make sure the finetune input gradient is allocated at the beginning and kept until the end 2023-08-18 20:10:04 +02:00
xaedes
63cb374a99
change default finetune params lora_r and lora_alpha to match the n_rank parameters' default of 4 2023-08-18 19:08:15 +02:00
xaedes
7a63d429af
adjust maximal values to support finetuning 3B models 2023-08-18 17:32:31 +02:00
Georgi Gerganov
7af633aec3
readme : incoming BREAKING CHANGE 2023-08-18 17:48:31 +03:00
xaedes
113c90f1cc
improve optimization iteration prints 2023-08-18 16:24:42 +02:00
xaedes
a0c2752ba7
remove debug prints and function to compute tensor data hash 2023-08-18 16:24:13 +02:00
xaedes
011f47f972
remove trailing whitespace 2023-08-18 16:02:46 +02:00
xaedes
f358204a5f
avoid keeping in memory ALL of the gradients
The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previously freed memory location might lead to non-zero input gradients.

During ggml_compute_backward the gradients are built stepwise by adding or subtracting new values, starting from an OP_NONE tensor which needs to contain zero values. This requires the graph reset.

To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
2023-08-18 16:01:43 +02:00
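A condensed sketch of the accumulation rule described above (hash_contains and the hash_set type are illustrative stand-ins, not necessarily the real names):

    // if the existing gradient is still one of the initial zero tensors,
    // replace it instead of adding to it, so it never needs its own memory slot
    static struct ggml_tensor * add_or_set(struct ggml_context * ctx,
                                           struct ggml_tensor * grad,   // existing gradient
                                           struct ggml_tensor * value,  // value to accumulate
                                           struct hash_set    * zero_table) {
        if (hash_contains(zero_table, grad)) {
            return value;                  // grad is known to be zero: just take the new value
        }
        return ggml_add(ctx, grad, value); // regular add (not inplace; the allocator handles that)
    }
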
xaedes
a252111b45
fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors 2023-08-18 15:03:57 +02:00
xaedes
44526cb261
make sure base model tensor data cannot be used in viewable operations
the memory allocator would try to apply the lora operations inplace on base model tensors.
since those are memory mapped, this would result in memory access violations
2023-08-18 15:03:17 +02:00
slaren
097e121e2f
llama : add benchmark example (#2626)
* llama : add benchmark example

* add to examples CMakeLists.txt

* fix msvc build

* add missing include

* add Bessel's correction to stdev calculation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve markdown formatting

* add missing include

* print warning if NDEBUG is not defined

* remove n_prompt and n_gen from the matrix, use each value separately instead

* better checks for non-optimized builds

* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call

* fix json formatting

* add sql output

* add basic cpu and gpu info (linux/cuda only)

* markdown: also show values that differ from the default

* markdown: add build id

* cleanup

* improve formatting

* formatting

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-18 12:44:58 +02:00
mdrokz
eaf98c2649
readme : add link to Rust bindings (#2656) 2023-08-18 13:17:58 +03:00
Georgi Gerganov
e9b12c332e
perplexity : more meaningful ETA number - 2 decimal points 2023-08-18 12:48:55 +03:00
Evan Jones
604b8bdfa6
Fix unicode in grammars (fixes #2501) (#2553)
* Fix unicode in grammars (fixes #2501)

* add more comments

* fix test-llama-grammar
2023-08-17 19:54:44 -04:00
staviq
10151bee2e
server : support for saving templates in browser LocalStorage (#2486)
* support for templates in browser LocalStorage

* sync accepted #2409 fix from upstream

* convert autosave invocation to useEffect

* Apply suggestions from code review

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

* Regen index.html.cpp, suggested from code review

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-08-18 07:34:01 +08:00
xaedes
0bb897c82a
bug fix: actually use result type passed to ggml_add_cast 2023-08-18 00:59:06 +02:00
Johannes Gäßler
0992a7b8b1
README: fix LLAMA_CUDA_MMV_Y documentation (#2647) 2023-08-17 23:57:59 +02:00
Henri Vasserman
6ddeefad9b
[Zig] Fixing Zig build and improvements (#2554)
* Fix zig after console.o was split

* Better include and flag management

* Change LTO to option
2023-08-17 23:11:18 +03:00
Kerfuffle
8dae7ce684
Add --cfg-negative-prompt-file option for examples (#2591)
Add --cfg-negative-prompt-file option for examples
2023-08-17 07:29:44 -06:00
Georgi Gerganov
a73ccf1aa3
llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
2023-08-17 10:47:09 +03:00
drbh
7cf54e1f74
tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* 0 terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
2023-08-17 10:41:01 +03:00
Shouzheng Liu
a872a2b28e
ggml-alloc : fix discrepancy between measure & eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
2023-08-17 10:35:53 +03:00
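The best-fit rule described above, as a simplified sketch (the free_block struct and the function name are illustrative):

    // return the index of the smallest free block that can hold the tensor,
    // or -1 if none fits; applying the same rule in the measure and eval
    // phases keeps allocations in the same places
    static int find_best_fit_block(const struct free_block * blocks, int n_blocks, size_t tensor_size) {
        int    best_fit  = -1;
        size_t best_size = SIZE_MAX;
        for (int i = 0; i < n_blocks; i++) {
            if (blocks[i].size >= tensor_size && blocks[i].size < best_size) {
                best_fit  = i;
                best_size = blocks[i].size;
            }
        }
        return best_fit;
    }
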
xaedes
714fec06ee
use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
2023-08-16 23:53:12 +02:00
xaedes
9198b24e4e
add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
2023-08-16 23:50:46 +02:00
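A usage sketch for the new function, assuming a signature of ggml_add_cast(ctx, a, b, type); the helper and tensor names are illustrative:

    // add an F32 delta onto a quantized base weight and keep the result in F32
    // instead of requantizing it
    static struct ggml_tensor * apply_lora_delta(struct ggml_context * ctx,
                                                 struct ggml_tensor * w_quant, // quantized src0
                                                 struct ggml_tensor * delta) { // F32, same shape
        return ggml_add_cast(ctx, w_quant, delta, GGML_TYPE_F32);
    }
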
Kolen Cheung
0919a0f73d
cmake : install ggml-meta.metal if LLAMA_METAL (#2449) 2023-08-16 23:09:49 +03:00
Jhen-Jie Hong
ed53db86c3
metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
xaedes
f80e245d7b
add lora finetune support on quantized base model tensors 2023-08-16 22:08:44 +02:00
Shouzheng Liu
fc8ef549e5
metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-16 23:08:28 +03:00
Shouzheng Liu
bf83bff674
metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in previous commit.
2023-08-16 23:07:04 +03:00
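A rough sketch of the offset trick mentioned above (variable and function names are illustrative, not the kernel's actual ones):

    // do the GQA head mapping with a 32-bit divide and widen only the final
    // byte offset to 64 bits
    static uint64_t kv_head_offset(uint32_t head_idx, uint32_t n_head, uint32_t n_head_kv, uint64_t bytes_per_head) {
        uint32_t heads_per_kv = n_head / n_head_kv;      // GQA grouping factor
        uint32_t kv_head      = head_idx / heads_per_kv; // 32-bit divide
        return (uint64_t) kv_head * bytes_per_head;
    }
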
xaedes
83a4ad7986
remove trailing whitespace 2023-08-16 22:05:41 +02:00
xaedes
83cb9ed4f5
implement ggml_compute_forward_out_prod_q_f32 2023-08-16 22:01:06 +02:00
xaedes
79ad888768
remove unused call to the non-existent llama_get_layer_from_model 2023-08-16 21:56:36 +02:00
xaedes
1151653b15
replace the llama API functions for getting individual model tensors with one function that gets a model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
2023-08-16 21:36:40 +02:00
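A usage sketch for the new accessor (the tensor names follow the pre-GGUF llama.cpp naming and are given only for illustration):

    struct ggml_tensor * tok_embd = llama_get_model_tensor(model, "tok_embeddings.weight");
    struct ggml_tensor * out_norm = llama_get_model_tensor(model, "norm.weight");
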
xaedes
39a2d15461
avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
2023-08-16 16:42:25 +02:00
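The replacement pattern, sketched (t stands for the graph's output tensor, ctx for the ggml context):

    // before: a huge struct ggml_cgraph lived on the stack
    //   struct ggml_cgraph gf = ggml_build_forward(t);
    // after: the graph is allocated from the ggml context instead
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, t);
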
xaedes
0ab2507ce5
fix names of lora tensors 2023-08-16 16:41:20 +02:00
xaedes
620275361d
add debug prints for training memory improvements 2023-08-16 16:23:21 +02:00
xaedes
be7e564b11
bug fixes to make finetune compile
automatic allocator does not work yet
2023-08-16 16:21:43 +02:00
xaedes
50b1e66200
remove const model and layer arguments in API functions for accessing model tensors 2023-08-16 16:21:02 +02:00
xaedes
28ee0c8583
first draft for LORA finetune training 2023-08-16 15:31:04 +02:00
xaedes
c0a372fd3d
add API functions to access remaining model parameters:
mult, head and rot
2023-08-16 15:30:31 +02:00
xaedes
9eb1ef8653
move and remove code 2023-08-15 14:03:02 +02:00