Commit graph

1477 commits

Author SHA1 Message Date
Andrei
769266a543
cmake : compile ggml-rocm with -fpic when building shared library (#3158) 2023-09-14 20:38:16 +03:00
Asbjørn Olling
cf8238e7f4
flake : include llama.h in nix output (#3159) 2023-09-14 20:25:00 +03:00
Cebtenzzre
4b8560e72a
make : fix clang++ detection, move some definitions to CPPFLAGS (#3155)
* make : fix clang++ detection

* make : fix compiler definitions outside of CPPFLAGS
2023-09-14 20:22:47 +03:00
Alon
83a53b753a
CI: add FreeBSD & simplify CUDA windows (#3053)
* add freebsd to ci

* bump actions/checkout to v3
* bump cuda 12.1.0 -> 12.2.0
* bump Jimver/cuda-toolkit version

* unify and simplify "Copy and pack Cuda runtime"
* install only necessary cuda sub packages
2023-09-14 19:21:25 +02:00
akawrykow
5c872dbca2
falcon : use stated vocab size (#2914) 2023-09-14 20:19:42 +03:00
bandoti
990a5e226a
cmake : add relocatable Llama package (#2960)
* Keep static libs and headers with install

* Add logic to generate Config package

* Use proper build info

* Add llama as import library

* Prefix target with package name

* Add example project using CMake package

* Update README

* Update README

* Remove trailing whitespace
2023-09-14 20:04:40 +03:00
dylan
980ab41afb
docker : add gpu image CI builds (#3103)
Enables the GPU enabled container images to be built and pushed
alongside the CPU containers.

Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
2023-09-14 19:47:00 +03:00
Kerfuffle
e394084166
gguf-py : support identity operation in TensorNameMap (#3095)
Make try_suffixes keyword param optional.
2023-09-14 19:32:26 +03:00
jameswu2014
4c8643dd6e
feature : support Baichuan series models (#3009) 2023-09-14 12:32:10 -04:00
xaedes
0971fee710
reshuffle original sample order instead of the previous shuffled order
otherwise resumed reshuffle will not result in same sample order
2023-09-14 18:21:23 +02:00
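The reasoning in this commit message can be illustrated with a small sketch. It is not the finetune code: `shuffle_from_original`, the seed, and the checkpoint variable are hypothetical names chosen for the example. The point it shows is that when every reshuffle starts from the original order, restoring only the RNG state is enough to reproduce the next sample order after a resume.

```python
import random

def shuffle_from_original(original, rng):
    """Return a shuffled copy of `original`; `original` itself is never modified."""
    order = list(original)
    rng.shuffle(order)
    return order

original = list(range(10))

# Uninterrupted run: two epochs, saving the RNG state in between
# (standing in for what a checkpoint would store).
rng = random.Random(1234)
epoch1 = shuffle_from_original(original, rng)
checkpoint_state = rng.getstate()
epoch2 = shuffle_from_original(original, rng)

# Resumed run: only the RNG state is restored, not the previous shuffled order.
resumed_rng = random.Random()
resumed_rng.setstate(checkpoint_state)
epoch2_resumed = shuffle_from_original(original, resumed_rng)

# Both runs see the same second-epoch order.
print(epoch2_resumed == epoch2)
```

Had the second epoch been produced by shuffling `epoch1` in place, the resumed run would also need the previous shuffled order to reproduce it.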
Leng Yue
35f73049af
speculative : add heuristic algorithm (#3006)
* Add heuristic algo for speculative

* Constrain minimum n_draft to 2

* speculative : improve heuristic impl

* speculative : be more rewarding upon guessing max drafted tokens

* speculative : fix typos

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-14 19:14:44 +03:00
xaedes
3a9c1d7f5a
set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change scaling of lora adapter used in prediction
2023-09-14 17:58:31 +02:00
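A minimal sketch of why this default matters, assuming the common LoRA convention that the adapter is scaled by `lora_alpha / lora_r`. The function name `lora_scale` is hypothetical and is not taken from the finetune example.

```python
def lora_scale(lora_r, lora_alpha=None):
    """Adapter scaling under the assumed convention scale = alpha / r.

    When alpha is not given it defaults to r, so the scale stays 1.0
    no matter which rank is chosen.
    """
    if lora_alpha is None:
        lora_alpha = lora_r
    return lora_alpha / lora_r

# Changing only the rank leaves the scaling untouched.
print(lora_scale(4), lora_scale(16))

# An explicitly set alpha still changes the scaling.
print(lora_scale(8, lora_alpha=16))
```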
xaedes
20cf1a4589
use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.

ggml_vec_mad_f32_unroll will internally loop over x and v with same y.

GGML_VEC_MAD_UNROLL is by default defined to 32.

This value was optimized empirically using performance test runs of out_prod in an openllama-3b finetune with context length 256 and batch size 1. It gives a 23% performance boost for out_prod.

Full measurements of out-prod runtime in ms:
unroll	unroll_xv	unroll_yv
1	67014.643	87826.469
2	77117.552	89077.656
4	72091.311	109121.657
8	61077.543	88678.334
16	56914.67	79514.947
24	59024.595	84350.254
28	55952.446	83368.73
32	51476.658	85177.745
36	55973.792	84659.92
40	55139.616	93844.738
48	60736.392	93330.267
64	99856.878	116994.99

The unroll_yv column shows the timings when unrolling yv instead of xv.
2023-09-14 17:20:29 +02:00
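The following Python sketch illustrates the semantics described in this commit message; it is not the C implementation. The function names, the blocking loop in `accumulate`, and the test data are assumptions made for the example. It also recomputes the quoted speedup from the unroll_xv timings for unroll 1 and unroll 32.

```python
def vec_mad(y, x, v):
    """y[i] += x[i] * v for a single input vector x and scalar v."""
    for i in range(len(y)):
        y[i] += x[i] * v

def vec_mad_unroll(y, xs, vs):
    """Apply several (x, v) pairs to the same y, touching each y[i] once."""
    for i in range(len(y)):
        acc = y[i]
        for x, v in zip(xs, vs):
            acc += x[i] * v
        y[i] = acc

def accumulate(y, xs, vs, unroll=32):
    """Process (x, v) pairs in blocks of `unroll`; leftovers use plain vec_mad."""
    n = len(vs)
    blocked = n - n % unroll
    for k in range(0, blocked, unroll):
        vec_mad_unroll(y, xs[k:k + unroll], vs[k:k + unroll])
    for k in range(blocked, n):
        vec_mad(y, xs[k], vs[k])

# Integer-valued floats keep the comparison exact.
xs = [[float(k + i) for i in range(3)] for k in range(10)]
vs = [float(k % 3) for k in range(10)]

y_unrolled = [0.0] * 3
accumulate(y_unrolled, xs, vs, unroll=4)

y_plain = [0.0] * 3
for x, v in zip(xs, vs):
    vec_mad(y_plain, x, v)

print(y_unrolled == y_plain)

# Runtime reduction from the table: unroll 1 versus unroll 32 (unroll_xv, in ms).
baseline_ms = 67014.643
best_ms = 51476.658
reduction = 1 - best_ms / baseline_ms  # about 0.23
```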
xaedes
2c59f7bea3
account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
2023-09-14 10:48:38 +02:00
xaedes
f627e2fe9c
pass correct max number of tokens to llama_tokenize 2023-09-14 03:04:04 +02:00
xaedes
7f378a7561
remove probably unnecessary exception type flags from stringstream 2023-09-14 00:21:05 +02:00
xaedes
ec57689f64
exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32 2023-09-13 18:37:51 +02:00
xaedes
7898652dfb
update shuffle rng state on reshuffle 2023-09-13 16:20:50 +02:00
xaedes
0e32932931
add sample start patterns and options to force new or by default resume last shuffling 2023-09-13 15:36:09 +02:00
goerch
71ca2fad7d
whisper : tokenizer fix + re-enable tokenizer test for LLaMa (#3096)
* Fix for #2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.
2023-09-13 16:19:44 +03:00
Tristan Ross
1b6c650d16
cmake : add a compiler flag check for FP16 format (#3086) 2023-09-13 16:08:52 +03:00
Johannes Gäßler
0a5eebb45d
CUDA: mul_mat_q RDNA2 tunings (#2910)
* CUDA: mul_mat_q RDNA2 tunings

* Update ggml-cuda.cu

Co-authored-by: Henri Vasserman <henv@hot.ee>

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-09-13 11:20:24 +02:00
FK
84e723653c
speculative: add --n-gpu-layers-draft option (#3063) 2023-09-13 08:50:46 +02:00
Eric Sommerlade
b52b29ab9d
arm64 support for windows (#3007)
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
2023-09-12 21:54:20 -04:00
Johannes Gäßler
4f7cd6ba9c
CUDA: fix LoRAs (#3130) 2023-09-13 00:15:33 +02:00
Johannes Gäßler
89e89599fd
CUDA: fix mul_mat_q not used for output tensor (#3127) 2023-09-11 22:58:41 +02:00
Johannes Gäßler
d54a4027a6
CUDA: lower GPU latency + fix Windows performance (#3110) 2023-09-11 19:55:51 +02:00
Jhen-Jie Hong
1b0d09259e
cmake : support build for iOS/tvOS (#3116)
* cmake : support build for iOS/tvOS

* ci : add iOS/tvOS build into macOS-latest-cmake

* ci : split ios/tvos jobs
2023-09-11 19:49:06 +08:00
Johannes Gäßler
8a4ca9af56
CUDA: add device number to error messages (#3112) 2023-09-11 13:00:24 +02:00
Kawrakow
f31b6f4e2d
metal : PP speedup (#3084)
* Minor speed gains for all quantization types

* metal: faster kernel_scale via float4

* Various other speedups for "small" kernels

* metal: faster soft_max via float4

* metal: faster diagonal infinity

Although, to me it looks like one should simply
fuse scale + diagonal infinity + soft_max on the
KQ tensor.

* Another faster f16 x f32 matrix multiply kernel

* Reverting the diag infinity change

It does work for PP, but somehow it fails for TG.
Need to look more into it.

* metal: add back faster diagonal infinity

This time more carefully

* metal : minor (readability)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-11 10:30:11 +03:00
Erik Scholz
6eeb4d9083
convert: remove most of the n_mult usage in convert.py (#3098) 2023-09-10 11:06:53 -04:00
xaedes
1cef45953b
remove unused command line options 2023-09-09 21:58:36 +02:00
xaedes
54b21a397c
Merge branch 'master' into finetune-lora
# Conflicts:
#	examples/train-text-from-scratch/train-text-from-scratch.cpp
#	llama.h
2023-09-09 21:30:22 +02:00
xaedes
ace90884a6
measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8 MB down to 4627.6 MB
2023-09-09 21:00:25 +02:00
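To illustrate why the evaluation order can change the maximum compute size (the figures quoted above correspond to roughly a 20x reduction), here is a toy model. It is not ggml-alloc: the graph, the buffer sizes, and the rule that a buffer is freed right after its last consumer runs are all assumptions made for the sketch.

```python
# Toy graph: name -> (output size, names of source nodes). All values hypothetical.
GRAPH = {
    "x":   (50, []),
    "B":   (100, []),
    "y":   (1, ["B"]),
    "out": (1, ["x", "y"]),
}

def peak_memory(order, graph):
    """Peak sum of live buffer sizes when nodes run in `order`.

    A buffer is assumed to be freed right after its last consumer has run.
    """
    last_use = {}
    for step, name in enumerate(order):
        for src in graph[name][1]:
            last_use[src] = step
    live = peak = 0
    for step, name in enumerate(order):
        live += graph[name][0]
        peak = max(peak, live)
        for src in set(graph[name][1]):
            if last_use[src] == step:
                live -= graph[src][0]
    return peak

# Two candidate execution orders for the same graph.
ORDERS = {
    "left_to_right": ["x", "B", "y", "out"],
    "right_to_left": ["B", "y", "x", "out"],
}

# Measure every order and keep the one with the smallest peak.
peaks = {name: peak_memory(order, GRAPH) for name, order in ORDERS.items()}
best = min(peaks, key=peaks.get)
print(peaks, best)
```

In this toy graph the large temporary `B` is released before `x` is allocated when the right-hand source is evaluated first, which is why that order has the lower peak.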
xaedes
917d2870b4
add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
2023-09-09 20:52:53 +02:00
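The two visiting orders can be sketched with a small post-order traversal. This is an illustration only: the `Node` class, the `build_forward` function, and the string order names are hypothetical and do not reproduce the ggml API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    src: list = field(default_factory=list)

def build_forward(root, order="left_to_right"):
    """Return node names in execution order: sources first, then the node itself."""
    visited = set()
    out = []

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        # left_to_right visits src[0] first; right_to_left visits src[0] last.
        sources = node.src if order == "left_to_right" else list(reversed(node.src))
        for s in sources:
            visit(s)
        out.append(node.name)

    visit(root)
    return out

a, b = Node("a"), Node("b")
d = Node("d", [a])
e = Node("e", [b])
result = Node("out", [d, e])

print(build_forward(result, "left_to_right"))
print(build_forward(result, "right_to_left"))
```

Both orders are valid topological orders of the same graph; they differ only in which branch is computed first.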
xaedes
d3f1b438a8
simplify broadcasting mul_mat backward using ggml_repeat_back 2023-09-09 18:55:18 +02:00
xaedes
d3aaf0876a
add comment briefly describing what ggml_repeat_back does 2023-09-09 18:47:27 +02:00
xaedes
9738526899
decouple random number generator of each operation test
when changing one test, the rng of the other tests is no longer influenced
2023-09-09 18:46:35 +02:00
xaedes
dd3278619d
test broadcasting mul_mat backward pass 2023-09-09 18:38:29 +02:00
xaedes
aea8b6be74
support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b) 2023-09-09 18:37:45 +02:00
xaedes
35260f7d74
fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
2023-09-09 17:10:23 +02:00
xaedes
833a56c144
add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'. 2023-09-09 17:07:59 +02:00
xaedes
d7aade7d8a
support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]

in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.

in the backward pass this won't work as easily, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.

since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.

we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.

change test-grad0 to also test for repeated k/v in q.

this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
2023-09-09 17:07:07 +02:00
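The index mapping described in this commit message can be checked with a short sketch. The formulas `ik2 = iq2 % nek2` and `iq2 = ik2 + irep*nek2` come from the message; the function names and `neq2` (the number of q heads along ne[2]) are assumptions introduced for the example.

```python
def kv_head_for_q(iq2, nek2):
    """Forward pass: the k/v head index used by q head iq2."""
    return iq2 % nek2

def q_heads_for_kv(ik2, nek2, neq2):
    """Backward pass: all q heads that read from k/v head ik2."""
    nrep = neq2 // nek2  # number of repetitions of k/v in q
    return [ik2 + irep * nek2 for irep in range(nrep)]

neq2, nek2 = 8, 2  # 8 query heads sharing 2 key/value heads

groups = {ik2: q_heads_for_kv(ik2, nek2, neq2) for ik2 in range(nek2)}
print(groups)

# Every q head in a group maps back to that group's k/v head.
consistent = all(
    kv_head_for_q(iq2, nek2) == ik2
    for ik2, heads in groups.items()
    for iq2 in heads
)
print(consistent)
```

Because the groups are disjoint and together cover every q head, a thread that owns one k/v head is the only one accumulating into that head's gradient, which is the property the change in parallelization relies on.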
kchro3
21ac3a1503
metal : support for Swift (#3078)
* Metal support for Swift

* update

* add a toggle for arm/arm64

* set minimum versions for all platforms

* update to use newLibraryWithURL

* bump version

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>

---------

Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
2023-09-09 17:12:10 +08:00
Jhen-Jie Hong
4fd5477955
metal : support build for iOS/tvOS (#3089) 2023-09-09 11:46:04 +03:00
takov751
ec2a24fedf
flake : add train-text-from-scratch to flake.nix (#3042) 2023-09-08 19:06:26 +03:00
Ikko Eltociear Ashimine
7d99aca759
readme : fix typo (#3043)
* readme : fix typo

acceleation -> acceleration

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-09-08 19:04:32 +03:00
Kawrakow
ba7ffbb251
metal : Q3_K speedup (#2995)
* Slightly faster Q3_K and Q5_K on metal

* Another Q3_K speedup on metal

Combined with previous commit, we are now +9.6% for TG.
PP is not affected as this happens via the matrix multiplication
templates.

* Slowly progressing on Q3_K on metal

We are now 13% faster than master

* Another small improvement for Q3_K on metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-09-08 19:01:04 +03:00
Cebtenzzre
e64f5b5578
examples : make n_ctx warning work again (#3066)
This was broken by commit e36ecdcc ("build : on Mac OS enable Metal by
default (#2901)").
2023-09-08 11:43:35 -04:00
Georgi Gerganov
94f10b91ed
readme : update hot topics 2023-09-08 18:18:04 +03:00