Commit graph

1176 commits

Author SHA1 Message Date
Eve
81844fbcfd
tests : Fix compilation warnings (Linux/GCC) (#2451)
* fix hellaswag print format, cast away warning in test-double-float

* c++11 cannot use designated initializers

* add static to test-grad0.c internal functions

* use memcpy in test-double-float.c

* port c tests to c++

* use initializer list for ggml_init_params
2023-08-02 11:06:19 +03:00
Yiming Cui
a312193e18
readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475)
* add support for chinese llama-2 / alpaca-2

* remove white spaces
2023-08-02 09:18:31 +03:00
Bono Lv
c574bddb36
fix a typo in examples/server/README.md (#2478) 2023-08-01 14:54:28 +02:00
ebraminio
86aeb27734
server : Support dark mode (#2414)
* server : Support dark mode

So it respects the user's system light / dark setting.

* Update index.html.hpp by running ./deps.sh
2023-08-01 10:56:23 +02:00
Matteo Boschini
1873ff586b
metal : add gqa8 kernel to allow llama-2-70B on metal (#2459)
* Added gqa8 kernel to allow llama-2-70B on metal

* Update ggml-metal.m

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>

* Extend kernel_mul_mat_f16_f32 to handle gqa broadcast

* Added ne03==ne13 assertion

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
2023-08-01 10:43:12 +03:00
Johannes Gäßler
49e7cb5bb1
CUDA: fixed LLAMA_FAST compilation option (#2473) 2023-07-31 21:02:19 +02:00
Johannes Gäßler
b772bba42e
CUDA: fixed cmake F16 option (#2471) 2023-07-31 19:52:22 +02:00
Johannes Gäßler
0728c5a8b9
CUDA: mmq CLI option, fixed mmq build issues (#2453) 2023-07-31 15:44:35 +02:00
Johannes Gäßler
1215ed7d5c
CUDA: Implemented row flattening for non-glm RoPE (#2468) 2023-07-31 14:32:30 +02:00
Johannes Gäßler
2dbf518911
CUDA: fewer memory bank conflicts for mul_mat_q (#2458) 2023-07-31 13:18:51 +02:00
slaren
9d2382b3e4
Fix Metal backend broken from the allocator changes (#2455)
* fix Metal backend broken from the allocator changes
2023-07-31 11:02:53 +02:00
slaren
a113689571
ggml : add graph tensor allocator (#2411)
* ggml : add graph tensor allocator

* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset

* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
2023-07-30 15:58:01 +02:00
Johannes Gäßler
11f3ca06b8
CUDA: Quantized matrix matrix multiplication (#2160)
* mmq implementation for non k-quants

* q6_K

* q2_K

* q3_k

* q4_K

* vdr

* q5_K

* faster q8_1 loading

* loop unrolling

* add __restrict__

* q2_K sc_high

* GGML_CUDA_MMQ_Y

* Updated Makefile

* Update Makefile

* DMMV_F16 -> F16

* Updated README, CMakeLists

* Fix CMakeLists.txt

* Fix CMakeLists.txt

* Fix multi GPU out-of-bounds
2023-07-29 23:04:44 +02:00
Johannes Gäßler
9baf9ef304
CUDA: faster multi GPU synchronization (#2448) 2023-07-29 23:04:10 +02:00
xaedes
22cb368dd9
remove trailing whitespace 2023-07-28 23:55:30 +02:00
xaedes
c1a5e116a4
llama training : fix ggml_rms_norm_back calls to pass configurable eps 2023-07-28 23:13:20 +02:00
xaedes
ecdc16163e
ggml : update ggml_rms_norm_back with configurable eps 2023-07-28 23:13:20 +02:00
xaedes
87035b96f7
remove commented-out vectorized code of opt_adam
the vectorized code might be a bit faster for a low number of parameters, but it had a big memory usage overhead
2023-07-28 23:13:20 +02:00
xaedes
0f6a8ab519
tighten abs error bounds for sqrt in test-grad0 2023-07-28 23:13:20 +02:00
xaedes
47055c929f
tighten abs error bounds for flash_attn in test-grad0 2023-07-28 23:13:20 +02:00
xaedes
dbbc263313
add conditional compilation for using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable use of F16 exp in flash attention
2023-07-28 23:13:20 +02:00
xaedes
1065c3b7b9
tighten abs error bounds for cross_entropy_loss in test-grad0 2023-07-28 23:13:20 +02:00
xaedes
24a4b099f3
change sampling parameters for prediction after training to the defaults of common.h
and clarify which tokens are prediction context and which are generated
2023-07-28 23:13:19 +02:00
xaedes
17a0898d50
fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch, we need to multiply by the number of opt iterations
2023-07-28 23:13:19 +02:00
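A minimal sketch of the corrected bookkeeping described in the commit above; the names (train_counters, n_batch, n_tokens, n_iter) are hypothetical, not from the original code:
```
#include <stdint.h>

// sketch: each optimizer call now runs n_iter iterations, each with its own
// batch, so the counters scale with the iteration count as well as the batch size
struct train_counters { int64_t train_samples; int64_t train_tokens; };

static void update_counters(struct train_counters * c,
                            int n_batch, int n_tokens, int n_iter) {
    c->train_samples += (int64_t) n_batch * n_iter;
    c->train_tokens  += (int64_t) n_batch * n_tokens * n_iter;
}
```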
xaedes
58024d3e5f
rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup 2023-07-28 23:13:19 +02:00
xaedes
e6ff0728e0
add minimum number of tensor dimensions to apply weight decay (default 2)
this allows weight decay to be skipped for bias parameters
2023-07-28 23:13:19 +02:00
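A sketch of the dimension check described above, with hypothetical names; bias tensors are 1-dimensional, so the default minimum of 2 excludes them from weight decay while 2D weight matrices still receive it:
```
#include <stdbool.h>

// sketch only: decide whether a parameter tensor gets weight decay
static bool apply_weight_decay(int n_dims, int min_ndims, float decay) {
    return decay > 0.0f && n_dims >= min_ndims;
}
```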
xaedes
d7aa4d9576
use optimization callback in training
allows a dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters

reduces runtime by avoiding restarts of the optimization function and improves training convergence by providing a different batch for each iteration
2023-07-28 23:13:19 +02:00
xaedes
bfc3119139
add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and a pointer to the learning schedule parameter (only used in Adam(W)).

it can be used for a dynamic learning schedule and for setting input data for batches before each iteration
2023-07-28 23:13:18 +02:00
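A minimal sketch of what such a callback could look like, assuming a signature matching the description above (a custom data pointer plus a pointer to the learning schedule value); the exact ggml typedef may differ, and the training-state struct is hypothetical:
```
// hypothetical training state and callback, for illustration only
struct train_state { int iter; /* plus batch buffers, RNG state, ... */ };

static void opt_callback(void * vdata, float * sched) {
    struct train_state * data = (struct train_state *) vdata;
    // dynamic learning schedule: e.g. a simple linear warmup over 100 iterations
    *sched = data->iter < 100 ? (float) data->iter / 100.0f : 1.0f;
    // ... copy the next batch into the input tensors here ...
    data->iter++;
}
```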
xaedes
e843d6e71c
measure and print total training time 2023-07-28 23:13:18 +02:00
xaedes
ff759d957c
remove unused function argument from get_example_targets_batch 2023-07-28 23:13:18 +02:00
xaedes
ce937bc431
replace memcpy with a reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
2023-07-28 23:13:18 +02:00
xaedes
c6a18e15c1
add more training parameters:
--enable-restart N         Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N        Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N               Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N              Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N              AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N         Adam minimum learning rate alpha, usually 0.1 * alpha
2023-07-28 23:13:18 +02:00
xaedes
d0fbb7d328
llama : fix rope usage in train-text-from-scratch after ChatGLM change 2023-07-28 23:13:17 +02:00
xaedes
fc379a2de3
disable gradient checkpointing debug output 2023-07-28 23:13:17 +02:00
xaedes
3744a9be74
improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when the mem size of checkpoints and the mem size of layers are equal.
since layers require more memory than the single-tensor checkpoints we use, the optimal value is computed differently:

```
  given: n, u, v
  objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
  b=n/a
  minimize(a*u+v*n/a)
  diff(a*u+v*n/a, a) = u - (v*n/a)/a
  diff(a*u+v*n/a, a) == 0
  u - (v*n/a)/a == 0
  u == v*n/(a*a)
  u*a*a = v*n
  a*a = v*n/u
  a = sqrt(n*v/u)
```

this change results in more checkpoints, requiring fewer layers to be stored between checkpoints, improving overall memory usage.
2023-07-28 23:13:17 +02:00
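A small C sketch of the formula derived above, reading the symbols as n = number of layers, u = memory per checkpoint, v = memory per layer, a = number of checkpoints (so b = n/a layers are recomputed between checkpoints); with u == v this reduces to the classic sqrt(n) rule, and with v > u it yields more checkpoints, as the commit states:
```
#include <math.h>
#include <stdio.h>

// optimal number of checkpoints minimizing a*u + (n/a)*v
static double optimal_n_checkpoints(double n, double u, double v) {
    return sqrt(n * v / u);
}

int main(void) {
    // example: 32 layers, a layer costing 4x the memory of a checkpoint
    printf("%.1f checkpoints vs sqrt(n) = %.1f\n",
           optimal_n_checkpoints(32.0, 1.0, 4.0), sqrt(32.0));
    return 0;
}
```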
xaedes
51dc77092f
change cross_entropy_loss to output average over all rows
this helps keep the loss and gradients in a sane range
2023-07-28 23:13:17 +02:00
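A one-function sketch of the averaging described above; taking the mean over rows keeps the loss magnitude independent of the batch size:
```
// sketch: average the per-row losses instead of summing them
static float mean_over_rows(const float * row_loss, int nrows) {
    float sum = 0.0f;
    for (int i = 0; i < nrows; ++i) sum += row_loss[i];
    return sum / (float) nrows;
}
```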
xaedes
87febeec91
improve finite differences of test-grad0 by using double instead of float 2023-07-28 23:13:17 +02:00
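For reference, a sketch of the central finite difference used in such gradient checks; evaluating and accumulating in double keeps the rounding error well below the test tolerances:
```
// sketch: central finite difference, computed in double precision
static double finite_diff(double (*f)(double), double x, double eps) {
    return (f(x + eps) - f(x - eps)) / (2.0 * eps);
}
```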
xaedes
864e7e3aa1
fix test-grad0 for soft_max
don't use only sum as aggregation, because the sum of softmax is always 1 -> finite differences would not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
2023-07-28 23:13:17 +02:00
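A sketch of the aggregation described above: summing the softmax output directly always gives 1, so its gradient is zero and finite differences cannot exercise the backward pass, whereas the log-based sum does:
```
#include <math.h>

// sketch: sum(log(softmax(x)*(1-eps)+eps)) over the softmax output p
static float softmax_test_aggregate(const float * p, int n, float eps) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += logf(p[i] * (1.0f - eps) + eps); // eps avoids log(0)
    }
    return sum;
}
```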
xaedes
2d1e6e0675
fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
2023-07-28 23:13:17 +02:00
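One possible way to build valid targets for the test (a sketch, not necessarily how test-grad0 does it): normalize non-negative values so each row sums to 1:
```
// sketch: normalize a row of non-negative values so it sums to 1
static void normalize_row(float * row, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += row[i];
    for (int i = 0; i < n; ++i) row[i] /= sum; // assumes sum > 0
}
```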
xaedes
2c6985f79e
bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums were not correctly added in the workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues

guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16

cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
2023-07-28 23:13:16 +02:00
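An illustrative sketch of the compile-time guard described above; GGML_CROSS_ENTROPY_EXP_FP16 is the macro named in the commit, but the fast path shown here (rounding through fp16 via the public conversion helpers) only stands in for ggml's actual f16 exp lookup table:
```
#include <math.h>
#include "ggml.h"

static float guarded_exp(float x) {
#ifdef GGML_CROSS_ENTROPY_EXP_FP16
    // faster, lower-precision path (placeholder for the f16 table lookup)
    return expf(ggml_fp16_to_fp32(ggml_fp32_to_fp16(x)));
#else
    // default: full fp32 exp -- better gradients, very slightly slower
    return expf(x);
#endif
}
```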
xaedes
97964a4cc9
change default AdamW weight decay parameter defined in ggml to 0.0, making Adam the default instead of AdamW
for reference: the default weight decay parameter for torch.optim.AdamW is 0.01
2023-07-28 23:13:16 +02:00
xaedes
f175ead6ef
change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT 2023-07-28 23:13:16 +02:00
xaedes
a80f184e6d
change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to the Adam learning rate `alpha*sched`.
Previously it was relative to `sched` only.

`alpha` is the maximum learning rate and `sched` is a scaling parameter in [0..1].
2023-07-28 23:13:16 +02:00
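A sketch of the resulting update for a single parameter, with hypothetical variable names; the decay term is now scaled by the full effective learning rate `alpha*sched`, matching torch.optim.AdamW where the decoupled decay is `lr * weight_decay * p`:
```
#include <math.h>

// sketch only: AdamW update for one parameter p
static float adamw_update(float p, float m_hat, float v_hat,
                          float alpha, float sched, float decay, float eps) {
    const float lr = alpha * sched;                // effective learning rate
    p *= 1.0f - lr * decay;                        // decay scaled by alpha*sched
    return p - lr * m_hat / (sqrtf(v_hat) + eps);  // usual Adam step
}
```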
xaedes
ed4319e1a7
add and use function ggml_build_backward_expand to avoid stack overflows with a large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
2023-07-28 23:13:16 +02:00
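A hedged sketch of how the expand-style API avoids the problem: instead of receiving a full ggml_cgraph by value (a struct that lives on the stack and grows with the maximum node count), the graphs are allocated elsewhere (here on the heap, as an assumption) and filled in place; this assumes ggml_set_param was already called on the trainable tensors:
```
#include <stdbool.h>
#include <stdlib.h>
#include "ggml.h"

static void build_graphs(struct ggml_context * ctx, struct ggml_tensor * loss,
                         struct ggml_cgraph ** out_gf, struct ggml_cgraph ** out_gb) {
    // allocate the (large) graph structs on the heap instead of the stack
    struct ggml_cgraph * gf = calloc(1, sizeof(struct ggml_cgraph));
    struct ggml_cgraph * gb = calloc(1, sizeof(struct ggml_cgraph));
    ggml_build_forward_expand(gf, loss);     // build the forward graph in place
    *gb = *gf;                               // backward graph starts as a copy
    ggml_build_backward_expand(ctx, gf, gb, /*keep=*/true);
    *out_gf = gf;
    *out_gb = gb;
}
```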
xaedes
e05e4414ac
remove unused compute buffer 3 2023-07-28 23:12:00 +02:00
xaedes
6e3f95bf06
implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))

as explained in the readme of https://github.com/cybertronai/gradient-checkpointing
2023-07-28 23:11:59 +02:00
xaedes
d7003a98cc
Fix reset of unused g->nodes and g->grads to NULL 2023-07-28 21:30:22 +02:00
xaedes
d395b19c8c
add gradient clipping to AdamW 2023-07-28 21:18:41 +02:00
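A sketch of gradient clipping by global norm, the usual form of this technique; the parameter name gclip is an assumption, and the ggml implementation may differ in detail:
```
#include <math.h>

// sketch: scale all gradients so their global L2 norm does not exceed gclip
static void clip_gradients(float * g, long n, float gclip) {
    double sum2 = 0.0;
    for (long i = 0; i < n; ++i) sum2 += (double) g[i] * g[i];
    const double gnorm = sqrt(sum2);
    if (gclip > 0.0f && gnorm > gclip) {
        const float scale = (float) (gclip / gnorm);
        for (long i = 0; i < n; ++i) g[i] *= scale;
    }
}
```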
xaedes
d39c8e6863
remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.

additionally allows optimizing models with more than 2^31 parameters by replacing int with int64_t.

bumps training checkpoint file version, but old checkpoints can still be read.
the new version with fewer tensors is saved.
2023-07-28 21:17:57 +02:00
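As a rough illustration (numbers not from the commit): for a model with 1 billion float32 parameters (about 4 GB of weights), the optimizer overhead drops from roughly 7 × 4 GB = 28 GB to roughly 2 × 4 GB = 8 GB, presumably just the first- and second-moment tensors.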
xaedes
5d124d0cb4
fix track_max_mem in forward_batch_wo_cache_flash_attn_train 2023-07-28 21:17:56 +02:00