Commit graph

1416 commits

Author SHA1 Message Date
xaedes
bef1e97875
move common opt_callback into common/train 2023-09-16 18:54:57 +02:00
xaedes
e9758ae1d2
move common train params into common/train 2023-09-16 18:45:59 +02:00
xaedes
ee27333b16
move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so they can't yet be moved to train.h|cpp
2023-09-16 17:50:16 +02:00
xaedes
a8c8907c62
move train state into struct train_state 2023-09-16 17:30:38 +02:00
xaedes
9f4b1bf88d
move common train functions into common/train.[h|cpp] 2023-09-16 16:17:13 +02:00
xaedes
00b656f6db
remove lbfgs related train parameters 2023-09-16 15:59:46 +02:00
xaedes
ab56b63b27
update train-text-from-scratch with tokenization, sample selection and shuffling from finetune 2023-09-15 23:45:54 +02:00
xaedes
cc60b3f639
remove outcommented old code 2023-09-15 23:45:05 +02:00
xaedes
4f2ce91b9e
add static keywords 2023-09-15 23:44:53 +02:00
xaedes
76804fab1d
exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32 2023-09-14 22:19:39 +02:00
xaedes
d88dae2980
block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized

roughly doubles the flops of out-prod
2023-09-14 19:50:02 +02:00
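[editor's note] As a rough illustration of the block-tiling idea from the commit above (not the actual ggml kernel; the tile sizes and the function name below are placeholders, not the empirically tuned values):

```cpp
#include <algorithm>

// Illustrative sketch of block tiling for an out_prod-style accumulation
// C[i][j] += sum_k A[i][k] * B[j][k]. TILE_I/TILE_J are assumed values,
// not the tuned block sizes referenced in the commit.
constexpr int TILE_I = 64;
constexpr int TILE_J = 64;

void out_prod_tiled_sketch(float * C, const float * A, const float * B,
                           int m, int n, int k) {
    for (int i0 = 0; i0 < m; i0 += TILE_I) {
        for (int j0 = 0; j0 < n; j0 += TILE_J) {
            const int i1 = std::min(i0 + TILE_I, m);
            const int j1 = std::min(j0 + TILE_J, n);
            // work on one C tile at a time so the touched rows of A and B
            // stay resident in cache across the inner loops
            for (int i = i0; i < i1; ++i) {
                for (int j = j0; j < j1; ++j) {
                    float sum = 0.0f;
                    for (int kk = 0; kk < k; ++kk) {
                        sum += A[i*k + kk] * B[j*k + kk];
                    }
                    C[i*n + j] += sum;
                }
            }
        }
    }
}
```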
xaedes
0971fee710
reshuffle original sample order instead of the previous shuffled order
otherwise a resumed reshuffle will not result in the same sample order
2023-09-14 18:21:23 +02:00
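[editor's note] A minimal sketch of what "reshuffle the original order" means in practice (function and variable names are illustrative, not the actual train code):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Always shuffle the original 0..n-1 order with the current rng state,
// instead of shuffling the previously shuffled vector. Restoring the rng
// state on resume then reproduces exactly the same sample order.
std::vector<size_t> reshuffle_sketch(size_t n, std::mt19937 & rng) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);     // original sample order
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```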
xaedes
3a9c1d7f5a
set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change the scaling of the lora adapter used in prediction
2023-09-14 17:58:31 +02:00
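[editor's note] A hedged sketch of the defaulting logic described above; the struct, field, and flag names are hypothetical, but the idea is that the adapter scaling alpha/r stays at 1 unless the user explicitly overrides alpha:

```cpp
// Hypothetical parameter struct: mirror lora_r into lora_alpha when the
// user did not pass an explicit alpha on the command line.
struct lora_params_sketch {
    int  lora_r       = 8;
    int  lora_alpha   = 8;
    bool custom_alpha = false;   // set when --lora-alpha was given (assumed flag)
};

void finalize_lora_params_sketch(lora_params_sketch & p) {
    if (!p.custom_alpha) {
        p.lora_alpha = p.lora_r;   // keeps the effective scaling alpha/r == 1
    }
}
```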
xaedes
20cf1a4589
use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.

ggml_vec_mad_f32_unroll will internally loop over x and v with same y.

GGML_VEC_MAD_UNROLL is by default defined to 32.

This value was empirically optimized using performance test runs of out-prod in an openllama-3b finetune with 256 context length and batch size 1. It gives a 23% performance boost for out_prod.

Full measurements of out-prod runtime in ms:
unroll	unroll_xv	unroll_yv
1	67014.643	87826.469
2	77117.552	89077.656
4	72091.311	109121.657
8	61077.543	88678.334
16	56914.67	79514.947
24	59024.595	84350.254
28	55952.446	83368.73
32	51476.658	85177.745
36	55973.792	84659.92
40	55139.616	93844.738
48	60736.392	93330.267
64	99856.878	116994.99

The second column shows the measurements when unrolling yv instead of xv.
2023-09-14 17:20:29 +02:00
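[editor's note] A simplified sketch of the unrolled multiply-add described above (scalar loops only; the real ggml_vec_mad_f32_unroll works on strided pointers and uses SIMD, so the signature here is an assumption):

```cpp
// y[i] += x_k[i] * v_k for k = 0..UNROLL-1, reusing the same y across the
// unrolled iterations, as the commit describes.
#define VEC_MAD_UNROLL_SKETCH 32   // default unroll factor per the commit

void vec_mad_f32_unroll_sketch(int n, float * y,
                               const float * const * xs,   // UNROLL input vectors
                               const float * vs) {         // UNROLL input scalars
    for (int k = 0; k < VEC_MAD_UNROLL_SKETCH; ++k) {
        const float * x = xs[k];
        const float   v = vs[k];
        for (int i = 0; i < n; ++i) {
            y[i] += x[i] * v;
        }
    }
}
```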
xaedes
2c59f7bea3
account for possible leading whitespace that will be added by the tokenizer
e.g. '\t' will be tokenized by the llama spm tokenizer to [29871, 12]
2023-09-14 10:48:38 +02:00
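[editor's note] A small sketch of how such a tokenizer-inserted leading whitespace token could be accounted for; the token id 29871 comes from the commit's own example, but treating it as a strippable constant is an assumption, and the helper name is hypothetical:

```cpp
#include <vector>

// If the spm tokenizer prepended a whitespace token (29871 in the example
// above) that is not part of the original text, drop it before matching
// tokenized sample-start patterns.
std::vector<int> strip_leading_whitespace_token_sketch(std::vector<int> tokens) {
    const int SPM_WHITESPACE_TOKEN = 29871;   // assumed id for the llama spm vocab
    if (tokens.size() > 1 && tokens.front() == SPM_WHITESPACE_TOKEN) {
        tokens.erase(tokens.begin());
    }
    return tokens;
}
```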
xaedes
f627e2fe9c
pass correct max number of tokens to llama_tokenize 2023-09-14 03:04:04 +02:00
xaedes
7f378a7561
remove probably unnecessary exception type flags from stringstream 2023-09-14 00:21:05 +02:00
xaedes
ec57689f64
exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32 2023-09-13 18:37:51 +02:00
xaedes
7898652dfb
update shuffle rng state on reshuffle 2023-09-13 16:20:50 +02:00
xaedes
0e32932931
add sample start patterns and options to force new or by default resume last shuffling 2023-09-13 15:36:09 +02:00
xaedes
1cef45953b
remove unused command line options 2023-09-09 21:58:36 +02:00
xaedes
54b21a397c
Merge branch 'master' into finetune-lora
# Conflicts:
#	examples/train-text-from-scratch/train-text-from-scratch.cpp
#	llama.h
2023-09-09 21:30:22 +02:00
xaedes
ace90884a6
measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB
2023-09-09 21:00:25 +02:00
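[editor's note] A hypothetical sketch of the "measure each order, keep the best" step (placeholder names, not the real ggml/train API): the measurement callback would do a dry-run allocation of the graph for the given order and return the required compute buffer size.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

enum eval_order_sketch { ORDER_LEFT_TO_RIGHT = 0, ORDER_RIGHT_TO_LEFT = 1, ORDER_COUNT = 2 };

// Pick the evaluation order whose dry-run allocation needs the least memory.
eval_order_sketch pick_best_order_sketch(const std::function<size_t(eval_order_sketch)> & measure) {
    eval_order_sketch best      = ORDER_LEFT_TO_RIGHT;
    size_t            best_size = SIZE_MAX;
    for (int o = 0; o < ORDER_COUNT; ++o) {
        const size_t sz = measure((eval_order_sketch) o);   // assumed: build graph + measure alloc
        if (sz < best_size) {
            best_size = sz;
            best      = (eval_order_sketch) o;
        }
    }
    return best;
}
```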
xaedes
917d2870b4
add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
2023-09-09 20:52:53 +02:00
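[editor's note] A sketch of the idea (identifier names are approximations, not necessarily the exact ggml enum): the graph carries an eval-order member, and the forward build visits a node's sources either src[0]-first or src[0]-last, which changes the tensor lifetimes that ggml-alloc sees.

```cpp
enum cgraph_eval_order_sketch {
    EVAL_ORDER_LEFT_TO_RIGHT = 0,   // visit src[0] first (default)
    EVAL_ORDER_RIGHT_TO_LEFT,       // visit src[0] last
};

struct node_sketch {
    static const int MAX_SRC = 2;
    node_sketch * src[MAX_SRC];
};

void visit_sources_sketch(const node_sketch * node, cgraph_eval_order_sketch order,
                          void (*visit)(node_sketch *)) {
    for (int k = 0; k < node_sketch::MAX_SRC; ++k) {
        const int i = (order == EVAL_ORDER_LEFT_TO_RIGHT) ? k : node_sketch::MAX_SRC - 1 - k;
        if (node->src[i]) {
            visit(node->src[i]);
        }
    }
}
```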
xaedes
d3f1b438a8
simplify broadcasting mul_mat backward using ggml_repeat_back 2023-09-09 18:55:18 +02:00
xaedes
d3aaf0876a
add comment briefly describing what ggml_repeat_back does 2023-09-09 18:47:27 +02:00
xaedes
9738526899
decouple random number generator of each operation test
when changing one test, the rng of the other tests is not influenced anymore
2023-09-09 18:46:35 +02:00
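[editor's note] A minimal sketch of per-test rng decoupling (the seed scheme is an assumption): each operation test gets its own seeded generator instead of sharing one global stream, so editing one test no longer shifts the random inputs of the tests that follow.

```cpp
#include <random>

std::mt19937 make_test_rng_sketch(unsigned test_index) {
    const unsigned base_seed = 1234;             // assumed fixed base seed
    return std::mt19937(base_seed + test_index); // independent stream per test
}
```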
xaedes
dd3278619d
test broadcasting mul_mat backward pass 2023-09-09 18:38:29 +02:00
xaedes
aea8b6be74
support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) 2023-09-09 18:37:45 +02:00
xaedes
35260f7d74
fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
2023-09-09 17:10:23 +02:00
xaedes
833a56c144
add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. 2023-09-09 17:07:59 +02:00
xaedes
d7aade7d8a
support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]

in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.

in the backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.

since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.

we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.

change test-grad0 to also test for repeated k/v in q.

this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
2023-09-09 17:07:07 +02:00
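[editor's note] The index mappings described above, written out as plain loops (not the actual ggml kernels): with grouped-query attention, q has n_rep times more heads along ne[2] than k/v.

```cpp
// forward:  each q head iq2 reads the shared k/v head ik2 = iq2 % nek2
// backward: parallelize over k/v heads; each thread owns one ik2 and loops
//           over the repetitions, iq2 = ik2 + irep*nek2, so no two threads
//           accumulate into the same k->grad / v->grad slice.
void gqa_index_mapping_sketch(int neq2, int nek2) {
    const int n_rep = neq2 / nek2;               // repetitions of each k/v head in q
    // forward-style mapping
    for (int iq2 = 0; iq2 < neq2; ++iq2) {
        const int ik2 = iq2 % nek2;              // k/v head used by this q head
        (void) ik2;
    }
    // backward-style mapping (one thread per ik2)
    for (int ik2 = 0; ik2 < nek2; ++ik2) {
        for (int irep = 0; irep < n_rep; ++irep) {
            const int iq2 = ik2 + irep*nek2;     // q heads contributing to this k/v head
            (void) iq2;
        }
    }
}
```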
kchro3
21ac3a1503
metal : support for Swift (#3078)
* Metal support for Swift

* update

* add a toggle for arm/arm64

* set minimum versions for all platforms

* update to use newLibraryWithURL

* bump version

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-09-09 17:12:10 +08:00
Jhen-Jie Hong
4fd5477955
metal : support build for iOS/tvOS (#3089) 2023-09-09 11:46:04 +03:00
takov751
ec2a24fedf
flake : add train-text-from-scratch to flake.nix (#3042) 2023-09-08 19:06:26 +03:00
Ikko Eltociear Ashimine
7d99aca759
readme : fix typo (#3043)
* readme : fix typo

acceleation -> acceleration

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 19:04:32 +03:00
Kawrakow
ba7ffbb251
metal : Q3_K speedup (#2995)
* Slightly faster Q3_K and Q5_K on metal

* Another Q3_K speedup on metal

Combined with previous commit, we are now +9.6% for TG.
PP is not affected as this happens via the matrix multiplication
templates.

* Slowly progressing on Q3_K on metal

We are now 13% faster than master

* Another small improvement for Q3_K on metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-09-08 19:01:04 +03:00
Cebtenzzre
e64f5b5578
examples : make n_ctx warning work again (#3066)
This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by
default (#2901)").
2023-09-08 11:43:35 -04:00
Georgi Gerganov
94f10b91ed
readme : update hot topics 2023-09-08 18:18:04 +03:00
Georgi Gerganov
b3e9852e47
sync : ggml (CUDA GLM RoPE + POSIX) (#3082)
ggml-ci
2023-09-08 17:58:07 +03:00
Przemysław Pawełczyk
cb6c44c5e0
build : do not use _GNU_SOURCE gratuitously (#2035)
* Do not use _GNU_SOURCE gratuitously.

What is needed to build llama.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions,
plus some stuff from BSD that is not specified in POSIX.1.

Well, that was true until NUMA support was added recently,
so enable GNU libc extensions for Linux builds to cover that.

Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in a 3rd-party app, as they can build it with
the FTMs set by the Makefile here or other FTMs depending on their needs.

It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.

* make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK

* make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK

* make : use BSD-specific FTMs to enable alloca on BSDs

* make : fix OpenBSD build by exposing newer POSIX definitions

* cmake : follow recent FTM improvements from Makefile
2023-09-08 15:09:21 +03:00
hongbo.mo
a21baeb122
docker : add git to full-cuda.Dockerfile main-cuda.Dockerfile (#3044) 2023-09-08 13:57:55 +03:00
Yui
6ff712a6d1
Update deprecated GGML TheBloke links to GGUF (#3079) 2023-09-08 12:32:55 +02:00
slaren
ebc96086af
ggml-alloc : correctly check mmap return value for errors (#3075) 2023-09-08 04:04:56 +02:00
Kunshang Ji
7f412dab9c
enable CPU HBM (#2603)
* add cpu hbm support

* add memalign 0 byte check

* Update ggml.c

* Update llama.cpp

* ggml : allow ggml_init with 0 size

* retrigger ci

* fix code style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 03:46:56 +02:00
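[editor's note] A hedged sketch of the CPU HBM allocation path, assuming the memkind hbwmalloc API; the guard macro, function name, and 64-byte alignment below are assumptions, and the zero-size check mirrors the "memalign 0 byte check" item in the commit.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(USE_CPU_HBM_SKETCH)       // assumed build-time guard
#include <hbwmalloc.h>
#endif

void * aligned_alloc_hbm_sketch(size_t size) {
    if (size == 0) {
        return nullptr;               // avoid a 0-byte memalign request
    }
#if defined(USE_CPU_HBM_SKETCH)
    void * ptr = nullptr;
    if (hbw_posix_memalign(&ptr, 64, size) != 0) {   // allocate from high-bandwidth memory
        return nullptr;
    }
    return ptr;
#else
    return std::malloc(size);         // fall back to normal allocation
#endif
}
```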
Cebtenzzre
6336d834ec
convert : fix F32 ftype not being saved (#3048) 2023-09-07 14:27:42 -04:00
Cebtenzzre
00d62adb79
fix some warnings from gcc and clang-tidy (#3038)
Co-authored-by: xaedes <xaedes@gmail.com>
2023-09-07 13:22:29 -04:00
Cebtenzzre
4fa2cc1750
make : improve test target (#3031) 2023-09-07 10:15:01 -04:00
Cebtenzzre
5ffab089a5
make : fix CPPFLAGS (#3035) 2023-09-07 10:13:50 -04:00
slaren
15b67a66c2
llama-bench : use two tokens in the warmup run for prompt evals (#3059) 2023-09-07 15:52:34 +02:00