Commit graph

1402 commits

Author SHA1 Message Date
xaedes
2c59f7bea3
account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
2023-09-14 10:48:38 +02:00
xaedes
f627e2fe9c
pass correct max number of tokens to llama_tokenize 2023-09-14 03:04:04 +02:00
xaedes
7f378a7561
remove probably unnecessary exception type flags from stringstream 2023-09-14 00:21:05 +02:00
xaedes
ec57689f64
exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32 2023-09-13 18:37:51 +02:00
xaedes
7898652dfb
update shuffle rng state on reshuffle 2023-09-13 16:20:50 +02:00
xaedes
0e32932931
add sample start patterns and options to force new or by default resume last shuffling 2023-09-13 15:36:09 +02:00
xaedes
1cef45953b
remove unused command line options 2023-09-09 21:58:36 +02:00
xaedes
54b21a397c
Merge branch 'master' into finetune-lora
# Conflicts:
#	examples/train-text-from-scratch/train-text-from-scratch.cpp
#	llama.h
2023-09-09 21:30:22 +02:00
xaedes
ace90884a6
measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8mb down to 4627.6 MB
2023-09-09 21:00:25 +02:00
xaedes
917d2870b4
add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
2023-09-09 20:52:53 +02:00
xaedes
d3f1b438a8
simplify broadcasting mul_mat backward using ggml_repeat_back 2023-09-09 18:55:18 +02:00
xaedes
d3aaf0876a
add comment briefly describing what ggml_repeat_back does 2023-09-09 18:47:27 +02:00
xaedes
9738526899
decouple random number generator of each operation test
when changing one test the rng of others tests is not influenced anymore
2023-09-09 18:46:35 +02:00
xaedes
dd3278619d
test broadcasting mul_mat backward pass 2023-09-09 18:38:29 +02:00
xaedes
aea8b6be74
support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) 2023-09-09 18:37:45 +02:00
xaedes
35260f7d74
fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
2023-09-09 17:10:23 +02:00
xaedes
833a56c144
add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. 2023-09-09 17:07:59 +02:00
xaedes
d7aade7d8a
support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]

in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.

in backard pass this won't work as easy, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.

since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.

we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.

change test-grad0 to also test for repeated k/v in q.

this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
2023-09-09 17:07:07 +02:00
kchro3
21ac3a1503
metal : support for Swift (#3078)
* Metal support for Swift

* update

* add a toggle for arm/arm64

* set minimum versions for all platforms

* update to use newLibraryWithURL

* bump version

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-09-09 17:12:10 +08:00
Jhen-Jie Hong
4fd5477955
metal : support build for iOS/tvOS (#3089) 2023-09-09 11:46:04 +03:00
takov751
ec2a24fedf
flake : add train-text-from-scratch to flake.nix (#3042) 2023-09-08 19:06:26 +03:00
Ikko Eltociear Ashimine
7d99aca759
readme : fix typo (#3043)
* readme : fix typo

acceleation -> acceleration

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 19:04:32 +03:00
Kawrakow
ba7ffbb251
metal : Q3_K speedup (#2995)
* Slightly faster Q3_K and Q5_K on metal

* Another Q3_K speedup on metal

Combined with previous commit, we are now +9.6% for TG.
PP is not affected as this happens via the matrix multiplication
templates.

* Slowly progressing on Q3_K on metal

We are now 13% faster than master

* nother small improvement for Q3_K on metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-09-08 19:01:04 +03:00
Cebtenzzre
e64f5b5578
examples : make n_ctx warning work again (#3066)
This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by
default (#2901)").
2023-09-08 11:43:35 -04:00
Georgi Gerganov
94f10b91ed
readme : update hot tpoics 2023-09-08 18:18:04 +03:00
Georgi Gerganov
b3e9852e47
sync : ggml (CUDA GLM RoPE + POSIX) (#3082)
ggml-ci
2023-09-08 17:58:07 +03:00
Przemysław Pawełczyk
cb6c44c5e0
build : do not use _GNU_SOURCE gratuitously (#2035)
* Do not use _GNU_SOURCE gratuitously.

What is needed to build llama.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions,
plus some stuff from BSD that is not specified in POSIX.1.

Well, that was true until NUMA support was added recently,
so enable GNU libc extensions for Linux builds to cover that.

Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in 3rd party app, as they can build it with
FTMs set by Makefile here or other FTMs depending on their needs.

It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.

* make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK

* make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK

* make : use BSD-specific FTMs to enable alloca on BSDs

* make : fix OpenBSD build by exposing newer POSIX definitions

* cmake : follow recent FTM improvements from Makefile
2023-09-08 15:09:21 +03:00
hongbo.mo
a21baeb122
docker : add git to full-cuda.Dockerfile main-cuda.Dockerfile (#3044) 2023-09-08 13:57:55 +03:00
Yui
6ff712a6d1
Update deprecated GGML TheBloke links to GGUF (#3079) 2023-09-08 12:32:55 +02:00
slaren
ebc96086af
ggml-alloc : correctly check mmap return value for errors (#3075) 2023-09-08 04:04:56 +02:00
Kunshang Ji
7f412dab9c
enable CPU HBM (#2603)
* add cpu hbm support

* add memalign 0 byte check

* Update ggml.c

* Update llama.cpp

* ggml : allow ggml_init with 0 size

* retrigger ci

* fix code style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 03:46:56 +02:00
Cebtenzzre
6336d834ec
convert : fix F32 ftype not being saved (#3048) 2023-09-07 14:27:42 -04:00
Cebtenzzre
00d62adb79
fix some warnings from gcc and clang-tidy (#3038)
Co-authored-by: xaedes <xaedes@gmail.com>
2023-09-07 13:22:29 -04:00
Cebtenzzre
4fa2cc1750
make : improve test target (#3031) 2023-09-07 10:15:01 -04:00
Cebtenzzre
5ffab089a5
make : fix CPPFLAGS (#3035) 2023-09-07 10:13:50 -04:00
slaren
15b67a66c2
llama-bench : use two tokens in the warmup run for prompt evals (#3059) 2023-09-07 15:52:34 +02:00
Kawrakow
be8c9c245b
metal : parallel RoPE on Metal (#3024)
* Parallel RoPE on metal

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-09-07 16:45:01 +03:00
Kawrakow
be6beeb8d7
metal : correct fix of kernel_norm (#3060)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-07 16:42:42 +03:00
Georgi Gerganov
c4f496648c
metal : fix kernel_norm (fixes Falcon on Metal) (#3057)
* metal : fix kernel_norm

ggml-ci

* metal : put warning in kernel_norm to not combine the loops

* metal : restore original F16 mat-vec multiplication

It works after the norm fixes

* common : don't do warm-up with more than n_batch tokens (close #3058)

ggml-ci

* metal : minor
2023-09-07 15:49:09 +03:00
Przemysław Pawełczyk
fec2fb19e4
ggml : posixify madvise and pagesize (#3037)
* llama : use posix_madvise() instead of madvise() derived from BSD

sed -i 's,\<madvise\>,posix_&,g;s,\<MADV_,POSIX_&,g' llama.cpp

* ggml : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD

sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml.c

* metal : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD

sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml-metal.m
2023-09-07 11:15:06 +03:00
xaedes
0c2c9c7545
fix gradient accumulation bug where the same batch was used for each microstep 2023-09-06 22:45:36 +02:00
xaedes
de6170d818
fix gradient accumulation bug where the same batch was used for each microstep 2023-09-06 21:35:21 +02:00
xaedes
0393116628
Merge branch 'master' into finetune-lora
# Conflicts:
#	common/common.cpp
2023-09-06 20:15:24 +02:00
xaedes
c08fcf5947
specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
2023-09-06 20:11:49 +02:00
xaedes
8c2d7e37f9
improve finetune time measurement
fix printf warnings on system where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times to small.
2023-09-06 18:06:24 +02:00
Georgi Gerganov
178b1850eb
k-quants : fix zero-weight guard in Q6_K (ref #3040) 2023-09-06 12:40:57 +03:00
Kerfuffle
ea2c85d5d2
convert-llama-ggml-to-gguf: Try to handle files older than GGJTv3 (#3023)
* convert-llama-ggmlv3-to-gguf: Try to handle files older than GGJTv3

* Better error messages for files that cannot be converted

* Add file type to GGUF output

* Rename to convert-llama-ggml-to-gguf.py

* Include original file type information in description

* Improve some informational output
2023-09-06 02:49:11 -06:00
Cebtenzzre
9912b9efc8
build : add LLAMA_METAL_NDEBUG flag (#3033) 2023-09-05 18:21:10 -04:00
Cebtenzzre
9e2023156e
make : use new flag variables for recent changes (#3019) 2023-09-05 15:12:00 -04:00
Cebtenzzre
de2fe892af
examples : replace fprintf to stdout with printf (#3017) 2023-09-05 15:10:27 -04:00