Commit graph

1046 commits

Author SHA1 Message Date
xaedes
ce937bc431
replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
2023-07-28 23:13:18 +02:00
xaedes
c6a18e15c1
add more training parameters:
--enable-restart N         Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N        Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N               Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N              Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N              AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N         Adam minimum learning rate alpha, usually 0.1 * alpha
2023-07-28 23:13:18 +02:00
xaedes
d0fbb7d328
llama : fix rope usage in train-text-from-scratch after ChatGLM change 2023-07-28 23:13:17 +02:00
xaedes
fc379a2de3
disable gradient checkpointing debug output 2023-07-28 23:13:17 +02:00
xaedes
3744a9be74
improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are compute different:

```
  given: n, u, v
  objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
  b=n/a
  minimize(a*u+v*n/a)
  diff(a*u+v*n/a, a) = u - (v*n/a)/a
  diff(a*u+v*n/a, a) == 0
  u - (v*n/a)/a == 0
  u == v*n/(a*a)
  u*a*a = v*n
  a*a = v*n/u
  a = sqrt(n*v/u)
```

this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage.
2023-07-28 23:13:17 +02:00
xaedes
51dc77092f
change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
2023-07-28 23:13:17 +02:00
xaedes
87febeec91
improve finite differences of test-grad0 by using double instead of float 2023-07-28 23:13:17 +02:00
xaedes
864e7e3aa1
fix test-grad0 for soft_max
dont use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
2023-07-28 23:13:17 +02:00
xaedes
2d1e6e0675
fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
2023-07-28 23:13:17 +02:00
xaedes
2c6985f79e
bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums where not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues

guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16

cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
2023-07-28 23:13:16 +02:00
xaedes
97964a4cc9
change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
2023-07-28 23:13:16 +02:00
xaedes
f175ead6ef
change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT 2023-07-28 23:13:16 +02:00
xaedes
a80f184e6d
change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.

`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
2023-07-28 23:13:16 +02:00
xaedes
ed4319e1a7
add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
2023-07-28 23:13:16 +02:00
xaedes
e05e4414ac
remove unused compute buffer 3 2023-07-28 23:12:00 +02:00
xaedes
6e3f95bf06
implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))

as explained in readme of https://github.com/cybertronai/gradient-checkpointing
2023-07-28 23:11:59 +02:00
xaedes
d7003a98cc
Fix reset of unused g->nodes and g->grads to NULL 2023-07-28 21:30:22 +02:00
xaedes
d395b19c8c
add gradient clipping to AdamW 2023-07-28 21:18:41 +02:00
xaedes
d39c8e6863
remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.

additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t.

bumps training checkpoint file version, but old checkpoints can still be read.
new version with less tensors is saved.
2023-07-28 21:17:57 +02:00
xaedes
5d124d0cb4
fix track_max_mem in forward_batch_wo_cache_flash_attn_train 2023-07-28 21:17:56 +02:00
klosax
8a88e5855c
perplexity : add Hellaswag calculation (#2389)
* common.h : add hellaswag / remove perplexity-lines

* common.cpp : add hellaswag / remove perplexity-lines

* perplexity.cpp : add hellswag scores / remove perplexity-lines

* perplexity.cpp : clean up

* common.h : change default param value

* common.cpp : Change default param

* perplexity.cpp : alter wording

* common.h : alter wording

* common.cpp : alter wording
2023-07-28 21:25:36 +03:00
Lee
a9559bf77b
ggml : workaround for missing _mm256_setr_m128i in GCC < 8 in k_quants.c (#2405) 2023-07-28 21:17:45 +03:00
eric8607242
ee1b497c98
llama : support more diverse tokenizers? (#2420)
* supporting more diverse tokenizers

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-28 21:10:05 +03:00
Georgi Gerganov
d73b8d48b4
examples : fix whitespace 2023-07-28 21:05:08 +03:00
nhamanasu
34ae1caf7f
examples : server chat mode with llama2 (#2400)
* add: server chat mode with llama2

* fix: remove the unnecessary last \n
2023-07-28 21:02:10 +03:00
Weird Constructor
d91f3f0c55
readme : fix the description of the Tail free sampling (TFS) method (#2431) 2023-07-28 11:44:43 +03:00
Rand Xie
65cdf34bdc
llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433) 2023-07-28 11:42:53 +03:00
niansa/tuxifan
edcc7ae7d2
Obtaining LLaMA 2 instructions (#2308)
* Obtaining LLaMA 2 instructions

* Removed sharing warning for LLaMA 2

* Linked TheBloke's GGML repos

* Add LLaMA 2 to list of supported models

* Added LLaMA 2 usage instructions

* Added links to LLaMA 2 70B models
2023-07-28 03:14:11 +02:00
mj-shifu
7c529cede6
convert.py : Update to support 70B HF format model files (#2427)
* convert.py : fix llama 2 70b conversion from Huggingface
2023-07-27 14:39:17 -06:00
Georgi Gerganov
1a941869cb
metal : disable graph concurrency optimization due to bug (#2413) 2023-07-27 11:00:54 +03:00
slaren
b5472ea0ad
ggml : fix assert in ggml_set_unary_op (#2410) 2023-07-26 23:57:23 +02:00
Cebtenzzre
6df1f5940f
make : build with -Wmissing-prototypes (#2394) 2023-07-26 21:00:04 +03:00
slaren
5488fb789e
ggml : allocate graphs in a context (#2392)
* ggml : graph allocation in contexts

* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx

* llama.cpp : allocate graph in the context

* add GGML_PAD

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-26 15:56:53 +02:00
Kawrakow
eb542d3932
Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 18:35:53 +03:00
slaren
07aaa0f63f
ggml : fix ggml_flash_attn to use op_params (#2387)
* ggml : fix ggml_flash_attn to use op_params
2023-07-25 16:20:12 +02:00
ldwang
fce48caf9a
convert.py : support bpe tokenizer (#2228)
* support bpe tokenizer in convert

Signed-off-by: ldwang <ftgreat@gmail.com>

* support bpe tokenizer in convert

Signed-off-by: ldwang <ftgreat@gmail.com>

* support bpe tokenizer in convert, fix

Signed-off-by: ldwang <ftgreat@gmail.com>

---------

Signed-off-by: ldwang <ftgreat@gmail.com>
Co-authored-by: ldwang <ftgreat@gmail.com>
2023-07-25 16:22:09 +03:00
Jiahao Li
875086bdb9
ggml : relax contiguous constraints in activation function (#2371) 2023-07-25 15:58:32 +03:00
slaren
da1889834a
ggml : improve graph build time via hash table lookup (#2329)
* improve graph build time

* ggml_tensor : use 1 bit per flag

* use a hash table instead
2023-07-25 15:32:20 +03:00
Hesen Peng
82552b7f54
build : fix line breaking error in build-info.sh (#2349)
* fix line breaking

* build number line break removal
2023-07-25 15:24:09 +03:00
Xiao-Yong Jin
0c06204fb3
main : add --in-prefix-bos to prefix BOS to user inputs; keep EOS (#2304)
* add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS

The BOS precedes the string specified by `--in-prefix`.
Model generated EOS is now kept in the context.

It provides a way to strictly following the prompt format used in
Llama-2-chat.

The EOS handling also benefits some existing finetunes that uses
EOS to mark the end of turn.

* examples/common: move input_prefix_bos to other bools
2023-07-25 15:19:11 +03:00
Eve
1fed755b1f
ci : add non-AVX scalar build/test (#2356)
* noavx build and test

* we don't need to remove f16c in windows
2023-07-25 15:16:13 +03:00
katsu560
be2301bcda
k_quants : add AVX support to dot functions with QK_K as 64 (#2339)
* add AVX to ggml_vec_dot_q2_K_q8_K()

* add AVX to ggml_vec_dot_q3_K_q8_K()

* add AVX to ggml_vec_dot_q4_K_q8_K()

* add AVX to ggml_vec_dot_q5_K_q8_K()

* add AVX to ggml_vec_dot_q6_K_q8_K()

* refactor AVX code in ggml_vec_dot_q6_K_q8_K()
2023-07-25 15:13:41 +03:00
Shouzheng Liu
1aa18ef994
metal : concurrently dispatch commands (#2358)
* metal: concurrently dispatch commands

Function `ggml_metal_graph_find_concurrency` will run and write
commands that can be issued concurrently to metal context `concur_list`
array, when `ggml_metal_graph_compute` is called for the first time.

* metal: don't call find_concurrency automatically.

* metal : code style changes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-25 15:00:19 +03:00
Kawrakow
9a08eaf3c4
Another speed gain for Q4_0 and Q4_1 on Metal (#2375)
* Another speed gain for Q4_0 and Q4_1 on Metal

* Have N_DST, etc., be template parameters

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 13:48:29 +03:00
Kawrakow
129d844c87
Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)
* Fix Q4_K and Q5_K for QK_K = 64

* Very slightly better Q5_K bit fiddling

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-25 13:48:04 +03:00
slaren
d5512b782b
server: add rms_norm_eps parameter (#2380) 2023-07-25 12:36:17 +03:00
Henri Vasserman
c798308e3a
[Server] Escape HTML in webchat (#2368)
* escape HTML in webchat
* add amp
2023-07-25 10:27:34 +03:00
slaren
41c674161f
make rms_norm_eps a parameter (#2374)
* make rms_norm_eps a parameter

* add rms_norm_eps to command line

* fix baby llama, test-grad0

* use scientific notation for eps param in the help

ggml-ci
2023-07-24 17:57:12 +02:00
Aarni Koskela
b3f138d058
Chat UI extras (#2366)
* makefile: correct deps for server

* server: tighten settings layout a little

* server: expose all currently configured generation params in UI

* server: expose remaining generation params, for the adventurous

* server: embetter mirostat fields
2023-07-24 17:54:22 +03:00
Georgi Gerganov
5b2b2dc6ae
ggml : sync (unary ops refactor, static-correctness) (#2370)
* ggml : sync (unary ops, tests)

ggml-ci

* tests : remove unnecessary funcs
2023-07-24 14:46:21 +03:00