Commit graph

1366 commits

Author SHA1 Message Date
xaedes
aea8b6be74
support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) 2023-09-09 18:37:45 +02:00
xaedes
35260f7d74
fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
2023-09-09 17:10:23 +02:00
xaedes
833a56c144
add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. 2023-09-09 17:07:59 +02:00
xaedes
d7aade7d8a
support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]

in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.

in backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.

since ne2 is not the same for q, k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.

we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.

change test-grad0 to also test for repeated k/v in q.

this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
2023-09-09 17:07:07 +02:00
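
The index mapping described in the commit message above (d7aade7d8a) can be sketched as follows. This is a minimal illustration in Python, not the ggml implementation: the function names are hypothetical, and only the formulas ik2 = iq2 % nek2 and iq2 = ik2 + irep*nek2 are taken from the message. It assumes the number of q heads (neq2) is a whole multiple of the number of k/v heads (nek2).

```python
def kv_head_for_query_head(iq2, nek2):
    """Forward pass: the k/v head that query head iq2 reads from."""
    return iq2 % nek2


def query_heads_for_kv_head(ik2, neq2, nek2):
    """Backward pass: every query head that shares k/v head ik2."""
    # Assumption: q heads are an exact multiple of k/v heads.
    assert neq2 % nek2 == 0
    n_rep = neq2 // nek2
    return [ik2 + irep * nek2 for irep in range(n_rep)]


neq2, nek2 = 8, 2  # illustrative sizes: 8 query heads, 2 k/v heads

# Forward: one k/v index per query head.
forward = [kv_head_for_query_head(iq2, nek2) for iq2 in range(neq2)]

# Backward: iterate over k/v heads, so that each k/v gradient slot is
# written by exactly one worker, and visit the query heads that use it.
backward = {ik2: query_heads_for_kv_head(ik2, neq2, nek2) for ik2 in range(nek2)}
```

Iterating over k/v heads in the backward direction partitions the query heads into disjoint groups, which is why accumulation into the k and v gradients no longer overlaps across threads.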
xaedes
0c2c9c7545
fix gradient accumulation bug where the same batch was used for each microstep 2023-09-06 22:45:36 +02:00
xaedes
de6170d818
fix gradient accumulation bug where the same batch was used for each microstep 2023-09-06 21:35:21 +02:00
xaedes
0393116628
Merge branch 'master' into finetune-lora
# Conflicts:
#	common/common.cpp
2023-09-06 20:15:24 +02:00
xaedes
c08fcf5947
specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
2023-09-06 20:11:49 +02:00
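
The default-plus-override behaviour described in the commit above (c08fcf5947) could look roughly like this. It is a sketch only: `resolve_ranks` and the tensor type names other than `wq` are hypothetical, and the actual finetune option parsing is not shown in this log.

```python
def resolve_ranks(default_rank, overrides, tensor_types):
    """Return the rank for each tensor type: the override if given, else the default."""
    return {t: overrides.get(t, default_rank) for t in tensor_types}


# Roughly corresponds to passing '--lora-r 4 --rank-wq 16'.
# "wk" and "wv" are illustrative tensor type names, not taken from the log.
ranks = resolve_ranks(4, {"wq": 16}, ["wq", "wk", "wv"])
```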
xaedes
8c2d7e37f9
improve finetune time measurement
fix printf warnings on systems where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times too small.
2023-09-06 18:06:24 +02:00
Georgi Gerganov
178b1850eb
k-quants : fix zero-weight guard in Q6_K (ref #3040) 2023-09-06 12:40:57 +03:00
Kerfuffle
ea2c85d5d2
convert-llama-ggml-to-gguf: Try to handle files older than GGJTv3 (#3023)
* convert-llama-ggmlv3-to-gguf: Try to handle files older than GGJTv3

* Better error messages for files that cannot be converted

* Add file type to GGUF output

* Rename to convert-llama-ggml-to-gguf.py

* Include original file type information in description

* Improve some informational output
2023-09-06 02:49:11 -06:00
Cebtenzzre
9912b9efc8
build : add LLAMA_METAL_NDEBUG flag (#3033) 2023-09-05 18:21:10 -04:00
Cebtenzzre
9e2023156e
make : use new flag variables for recent changes (#3019) 2023-09-05 15:12:00 -04:00
Cebtenzzre
de2fe892af
examples : replace fprintf to stdout with printf (#3017) 2023-09-05 15:10:27 -04:00
Erik Scholz
c9c3220c48
convert: fix convert.py not working with int filename_stem (#3028)
* fix implicit int to string conversion
* convert : remove an obsolete pyright comment

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
2023-09-05 19:41:00 +02:00
xaedes
867e7c2255
Merge branch 'master' into finetune-lora 2023-09-05 14:48:46 +02:00
Georgi Gerganov
d375b8f3aa
ggml : fix L-BFGS linesearch loop 2023-09-05 12:05:13 +03:00
Georgi Gerganov
786e786061
build : fix compile warnings 2023-09-05 12:02:19 +03:00
Kawrakow
d59bd97065
Guard against all weights in a super-block being zero (#3010)
* Guard against all weights in a super-block being zero

* Also guard against extremely small weights

Closes #2982 

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-09-05 09:55:33 +02:00
Georgi Gerganov
35938ee3b0
llama : update logic for number of threads when using BLAS 2023-09-05 10:46:39 +03:00
Georgi Gerganov
921772104b
speculative : add grammar support (#2991)
* speculative : add grammar support

* grammars : add json_arr.gbnf

* grammar : add comments to new grammar file

* grammar : remove one nested level

* common : warm-up with 2 tokens - seems to work better

* speculative : print draft token pieces

* speculative : reuse grammar parser + better logs and comments

* speculative : avoid grammar_mem

* make : fix speculative build
2023-09-05 08:46:17 +03:00
xaedes
d07b6aac77
fix tracking of train_samples and train_tokens 2023-09-05 02:18:17 +02:00
xaedes
c1c3b0e0c2
add gradient accumulation
specify number of accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.
2023-09-05 01:09:06 +02:00
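
Why accumulating over grad_acc microbatches simulates a batch of size grad_acc*batch, as stated in the commit above (c1c3b0e0c2), can be checked on a toy problem. The sketch below uses a one-parameter least-squares model in plain Python. All names and data are illustrative, and it assumes the accumulated gradients are averaged; whether finetune averages or sums them is not stated in this log.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, batch, grad_acc):
    """Average the gradients of grad_acc consecutive microbatches of size batch."""
    total = 0.0
    for step in range(grad_acc):
        lo = step * batch  # each microstep must read a different slice of data
        total += mse_grad(w, xs[lo:lo + batch], ys[lo:lo + batch])
    return total / grad_acc


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8]
w = 0.3
batch, grad_acc = 2, 3

g_acc = accumulated_grad(w, xs, ys, batch, grad_acc)
g_big = mse_grad(w, xs, ys)  # a single batch of size grad_acc * batch

# With equally sized microbatches the two gradients agree up to rounding.
same = math.isclose(g_acc, g_big)
```

The equivalence only holds when every microstep sees different data; reusing one batch for all microsteps yields that batch's gradient alone.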
Georgi Gerganov
2ba85c8609
py : minor 2023-09-04 22:50:50 +03:00
xaedes
d3afd7131e
Merge branch 'master' into finetune-lora
# Conflicts:
#	Makefile
2023-09-04 21:44:05 +02:00
Georgi Gerganov
e36ecdccc8
build : on Mac OS enable Metal by default (#2901)
* build : on Mac OS enable Metal by default

* make : try to fix build on Linux

* make : move targets back to the top

* make : fix target clean

* llama : enable GPU inference by default with Metal

* llama : fix vocab_only logic when GPU is enabled

* common : better `n_gpu_layers` assignment

* readme : update Metal instructions

* make : fix merge conflict remnants

* gitignore : metal
2023-09-04 22:26:24 +03:00
slaren
bd33e5ab92
ggml-opencl : store GPU buffer in ggml_tensor::extra (#2994) 2023-09-04 14:59:52 +02:00
Cebtenzzre
3103568144
llama-bench : make cpp file non-executable (#2999) 2023-09-04 13:40:18 +03:00
Leng Yue
5b8530d88c
make : add speculative example (#3003) 2023-09-04 13:39:57 +03:00
Aarni Koskela
e4386f417f
server : add a subtle loading animation to the edit box (#2466)
* editorconfig: add override for the server HTML (which already is 2-space indented)

* server: add a subtle loading animation to the edit box
2023-09-04 16:28:55 +08:00
Jiahao Li
35195689cd
2x faster (rms) norm cuda kernels (3.7% e2e improvement) (#2985)
* 2x faster (rms) norm cuda kernels

* Fix code style
2023-09-04 08:53:30 +02:00
xaedes
9ea2f7ff58
Merge branch 'master' into finetune-lora
# Conflicts:
#	ggml-alloc.c
2023-09-04 02:40:44 +02:00
slaren
cf9b08485c
ggml-alloc : use virtual memory for measurement (#2973)
* ggml-alloc : use virtual memory for measurement

* compatibility fixes for MAP_ANONYMOUS

* fallback to fixed address for systems without virtual memory
2023-09-03 20:34:09 +02:00
xaedes
50589ed6be
load default rms_norm and rope parameters from base model 2023-09-03 20:05:54 +02:00
xaedes
bdb7092e82
add missing gguf_free in load_checkpoint_lora_file 2023-09-03 20:04:03 +02:00
xaedes
e07f5c57bb
fix printf format warnings 2023-09-03 20:03:39 +02:00
xaedes
406e0750cc
update README.md 2023-09-03 19:25:18 +02:00
Georgi Gerganov
47068e5170
speculative : PoC for speeding-up inference via speculative sampling (#2926)
* speculative : initial example

* speculative : print encoding speed

* speculative : add --draft CLI arg
2023-09-03 15:12:08 +03:00
Georgi Gerganov
8f429fa511
perplexity : fix ETA by warming up the model with an empty run 2023-09-03 13:43:17 +03:00
Kerfuffle
6519e9c99c
gguf(python): Fix special vocab handling when id < 0 (#2984) 2023-09-03 04:38:43 -06:00
Georgi Gerganov
b7f2aa9e51
metal : restore 363f0bf and fix reduce in F16_F32 kernels (#2986) 2023-09-03 13:23:33 +03:00
Alon
73a12a6344
cov : disable comment in PRs (#2989) 2023-09-03 13:19:01 +03:00
opparco
3730134776
llama : fix bpe tokenize from byte (#2889) 2023-09-03 13:18:09 +03:00
Georgi Gerganov
d9151e6f57
metal : revert 6af0bab until we fix it
This restores the generated text to be the same as before #2959
2023-09-03 12:40:56 +03:00
Alon
afc43d5f82
cov : add Code Coverage and codecov.io integration (#2928)
* update .gitignore

* makefile: add coverage support (lcov, gcovr)

* add code-coverage workflow

* update code coverage workflow

* run on ubuntu 20.04

* use gcc-8

* check why the job hangs

* add env vars

* add LLAMA_CODE_COVERAGE=1 again

* - add CODECOV_TOKEN
- add missing make lcov-report

* install lcov

* update make file -pb flag

* remove unused GGML_NITER from workflows

* wrap coverage output files in COV_TARGETS
2023-09-03 11:48:49 +03:00
Wentai Zhang
6460f758db
opencl : fix a bug in ggml_cl_pool_malloc() for ggml_cl_mul_mat_f32() (#2955)
Co-authored-by: Wentai Zhang <wentaizhang@tencent.com>
2023-09-03 11:46:44 +03:00
Kawrakow
ca82cf7bac
metal : more optimizations (#2959)
* Very minor speedup via simd-group synchronization in f16 x f32

* Another very minor speedup on metal

* Quite significant PP speedup on metal

* Another attempt

* Minor

* Massive improvement for TG for fp16

* ~4-5% improvement for Q8_0 TG on metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-03 11:06:22 +03:00
kchro3
6a31a3bd98
swift : add support for k-quants (#2983) 2023-09-03 09:21:05 +03:00
Kerfuffle
cff7b0bf07
convert.py : BPE fixes (#2938)
* convert.py: BPE fixes?

* Remove unnecessary conditional in addl token error handling
2023-09-03 08:52:13 +03:00
Ido S
340af42f09
docs : add catai to README.md (#2967) 2023-09-03 08:50:51 +03:00