Commit graph

647 commits

Author SHA1 Message Date
Georgi Gerganov
af275faece
Merge branch 'master' into ik/k_quants 2023-06-05 22:54:21 +03:00
Georgi Gerganov
12d43443b2
ggml : rename k_quants -> ggml-quants-k, use lowercase in code 2023-06-05 22:53:07 +03:00
Henri Vasserman
5220a991a5
Increase 3B scratch buffers. (#1698)
The 128 MB was too optimistic.
Too bad it is not dynamically computed.
2023-06-05 13:43:08 +03:00
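
As context for the fix above: llama.cpp keeps these scratch sizes in static per-model tables rather than computing them from the graph. A minimal sketch of that pattern, with placeholder names and byte counts (the real table and the new 3B values are in the commit itself):

```cpp
#include <cstddef>

// Illustrative only: enum values and sizes are placeholders, not llama.cpp's actual table.
enum e_model { MODEL_3B, MODEL_7B, MODEL_13B };

static size_t scratch0_size(e_model model) {
    switch (model) {
        case MODEL_3B:  return 256u * 1024u * 1024u; // was an optimistic 128 MB
        case MODEL_7B:  return 512u * 1024u * 1024u;
        case MODEL_13B: return 512u * 1024u * 1024u;
    }
    return 512u * 1024u * 1024u;
}
```
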
Georgi Gerganov
d1f563a743
llama : fix Metal KV cache sync (close #1695) 2023-06-05 10:19:03 +03:00
Georgi Gerganov
827f5eda91
readme : update hot topics 2023-06-04 23:38:19 +03:00
Georgi Gerganov
ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + support scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
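
As a reference point for what the rms_norm kernel above computes: ggml's RMS norm scales each row by the reciprocal of its root-mean-square. A plain C++ sketch of the per-row computation (the eps value is an assumption; the Metal kernel is a parallelized equivalent):

```cpp
#include <cmath>
#include <cstddef>

// Reference RMS norm for one row: y_i = x_i / sqrt(mean(x^2) + eps).
void rms_norm_row(const float * x, float * y, size_t n, float eps = 1e-6f) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += (double) x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt((float)(sum / n) + eps);
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * scale;
    }
}
```
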
Iwan Kawrakow
32a5f3a601 Had unintentionally committed the Makefile with -Ofast enabled 2023-06-04 17:35:56 +03:00
Iwan Kawrakow
431693cb10 Added forgotten ggml.o dependency on k_quants.h to the Makefile 2023-06-04 11:28:50 +03:00
0cc4m
dcb2ed4826
OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)
* Use events instead of clFinish, where possible

* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

* Reduce queueing overhead for contiguous tensors by using single mul kernel call

* Adapt to #1612 cl_mem malloc changes

* Reduce code duplication between cuda and opencl branches

* Improve implementation
2023-06-04 08:12:05 +02:00
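
The "events instead of clFinish" change above follows a standard OpenCL pattern: enqueue asynchronously and wait on a per-operation event only when the result is actually needed, instead of draining the whole queue. A minimal sketch (buffer and function names hypothetical):

```cpp
#include <CL/cl.h>

// Instead of clFinish(queue) after every transfer, attach an event to the
// enqueue and wait on just that event when the data is required.
void upload_async(cl_command_queue queue, cl_mem dst, const void * src, size_t size) {
    cl_event ev = nullptr;
    clEnqueueWriteBuffer(queue, dst, CL_FALSE /* non-blocking */, 0, size, src, 0, nullptr, &ev);
    // ... enqueue more work that does not depend on dst ...
    clWaitForEvents(1, &ev);  // synchronize only on this transfer
    clReleaseEvent(ev);
}
```
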
Iwan Kawrakow
0a71a4e6d3 Fix docker build
I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.
2023-06-03 19:03:15 +03:00
Iwan Kawrakow
6ef13823b8 Fix quantization error test
We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.
2023-06-03 18:41:45 +03:00
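
For reference, the quantization error being tested is the root-mean-square error of the quantize-then-dequantize round trip; the point of the fix is that 2- and 3-bit types need a looser threshold than 0.002. A generic sketch (the round-trip routines stand in for the per-type ggml functions):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSE between the original values and their quantize->dequantize round trip.
float quantization_rmse(const std::vector<float> & x, const std::vector<float> & x_roundtrip) {
    double sum = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        const double d = x[i] - x_roundtrip[i];
        sum += d * d;
    }
    return std::sqrt(sum / x.size());
}
// The test then checks something like: rmse < threshold(type),
// with a larger threshold for Q2_K/Q3_K than for 4-bit and wider types.
```
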
Henri Vasserman
d8bd0013e8
Add info about CUDA_VISIBLE_DEVICES (#1682) 2023-06-03 16:35:20 +03:00
Jiří Podivín
b5c85468a3
Docker: change to calling convert.py (#1641)
Deprecation disclaimer was added to convert-pth-to-ggml.py
2023-06-03 15:11:53 +03:00
Iwan Kawrakow
8f5d42db9b Minor 2023-06-03 14:46:57 +03:00
Iwan Kawrakow
abd99a89a7 A slightly faster ARM_NEON Q4_K dot product 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
894210a351 A slightly faster Q4_K AVX2 dot product
For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
9a9c5a0c80 A 10% faster CUDA vector dot kernel for Q3_K
Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems the memory access pattern is more important for
performance than the amount of computation the kernel does.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
c5959d53ff Don't print zeros/NaNs when no count histogram has been collected 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
e51ce72e03 Fixed bug in Q2_K CUDA dot product kernel
Strangely enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. I only realized something
was not quite right when I tried the larger models and started getting
nonsense back.

In any case, Q2_K single token evaluation times on an RTX 4080 in a Ryzen 7950X
box using CUDA, with the model fully loaded on the GPU, are
  ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for the 65B model is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
7bcc37676a A slightly faster ARM_NEON Q2_K dot
Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
6ec70579cb Adding ARM_NEON Q2_K dot
About the same performance as Q4_K.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
8516fdf728 Adding scalar and AVX2 Q2_K dot 2023-06-03 14:43:08 +03:00
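
The scalar version of such a dot product dequantizes each super-block on the fly and accumulates against the activations. The struct and bit packing below are a simplified stand-in, not the actual block_q2_K layout:

```cpp
#include <cstdint>
#include <cstddef>

// Simplified stand-in for a k-quant super-block: per-group 4-bit scales/mins
// applied to low-bit quants, plus two float super-scales. Layout is illustrative.
struct block_kq {
    float   d;            // super-block scale for the group scales
    float   dmin;         // super-block scale for the group mins
    uint8_t scales[16];   // packed per-group scale (low nibble) and min (high nibble)
    uint8_t qs[64];       // packed 2-bit quants, 256 weights per super-block
};

// Scalar reference dot product: sum_i dequant(block, i) * y[i]
float vec_dot_ref(const block_kq * x, const float * y, size_t nblocks) {
    float acc = 0.0f;
    for (size_t b = 0; b < nblocks; ++b) {
        for (size_t i = 0; i < 256; ++i) {
            const int group = (int)(i / 16);
            const int scale = x[b].scales[group] & 0x0F;
            const int min   = x[b].scales[group] >> 4;
            const int q     = (x[b].qs[i / 4] >> (2 * (i % 4))) & 0x03;
            const float w   = x[b].d * scale * q - x[b].dmin * min;
            acc += w * y[b * 256 + i];
        }
    }
    return acc;
}
```
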
Iwan Kawrakow
b439efb712 Adding Q2_K - just CUDA for now
Token prediction is pretty good - about 15.5 ms on an RTX 4080.
Perplexity is about the same as Q4_K.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
4faa040c20 A very slightly faster ARM_NEON Q3_K dot 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
13264fa067 Adding Q3_K dot for ARM_NEON
It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
a197eb50d1 Q5_K dot product for ARM_NEON 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
5ca15ce155 Q6_K dot product for ARM_NEON 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
a2533a72a3 Q4_K dot product for ARM_NEON 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
54f808db2b Quantization mixes: didn't quite get what I wanted in the last commit 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
d537b97cb8 Adding quantization mixes 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
5c5191ab68 Per convention, all QX_K quantizations use Q5_K for output.weight 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
b835d0f49f Adding Q5_K - scalar, AVX2, CUDA
Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, performance is about the same as Q4_0
for both single token and batch prediction.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
cf221afb55 Adding Q6_K - scalar, AVX2, CUDA
Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).
2023-06-03 14:43:08 +03:00
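
The relative sizes quoted for Q5_K and Q6_K line up with the approximate bits per weight of the k-quant formats; assuming roughly 4.5, 5.5 and 6.56 bpw for Q4_K, Q5_K and Q6_K (my figures, not from the commits, and whole-model ratios also include tensors that are not k-quantized):

```latex
\frac{5.5}{4.5} \approx 1.22 \;\; (\sim\!22\% \text{ larger}), \qquad
\frac{6.56}{4.5} \approx 1.46 \;\; (\text{close to the quoted } \sim\!44\%)
```
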
Iwan Kawrakow
a0b8e9f3c9 Adding Q4_K - scalar, AVX2, CUDA
Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0;
batch mode (perplexity) is about the same.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
3d8b1de3f7 Some more CUDA optimizations for Q3_K
Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
a3c0673089 Some improvement for Q3_K on CUDA
It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
c93cce3a45 Q3_K now working on CUDA and AVX2/scalar
CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).
2023-06-03 14:43:08 +03:00
Iwan Kawrakow
b4f71347ff Adding Q3_K and Q8_K (de)-quantization 2023-06-03 14:43:08 +03:00
Iwan Kawrakow
8673a41385 Starting to add k-quantization to ggml
I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.
2023-06-03 14:43:08 +03:00
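
The separation described above comes down to a small standalone interface of quantize / dequantize / dot-product routines that ggml dispatches to. The declarations below are a hedged reconstruction of that interface (names, signatures, and the block size shown are my approximation, not copied from the commit):

```cpp
#include <cstdint>

#define QK_K 256  // k-quants work on super-blocks of 256 weights

// Packed super-block; the real layouts store low-bit quants plus per-group
// scales/mins and fp16 super-scales. Size shown is illustrative.
struct block_q4_K { uint8_t data[144]; };

void quantize_row_q4_K  (const float * x, block_q4_K * y, int k);
void dequantize_row_q4_K(const block_q4_K * x, float * y, int k);
// Dot product against the 8-bit activation format used for the matmuls:
void ggml_vec_dot_q4_K_q8_K(int n, float * s, const void * vx, const void * vy);
```
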
Evan Jones
136476e898
Fix prompt cache saving and chat-persistent rollover (#1678)
* Fix prompt cache saving and chat-persistent rollover (fixes #1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-06-03 07:28:45 -04:00
Henri Vasserman
ffb06a345e
OpenLLaMA 3B support (#1588)
This adds support to llama.cpp to load the model.

Currently missing are the changes required in convert.py to convert the model correctly. It needs to start reading the JSON configuration for HF models instead of deriving the values by guessing.

Co-authored-by: FNsi <125447286+FNsi@users.noreply.github.com>
2023-05-30 21:24:22 +03:00
Georgi Gerganov
7552ac5863
ggml : sync cgraph import / export API 2023-05-29 19:31:44 +03:00
Georgi Gerganov
5d1830b99d
ggml : fix bug in ggml_alibi 2023-05-29 19:30:49 +03:00
DannyDaemonic
248367605e
Work around for recalculating logits in cached prompts (Fixes #1585) (#1609)
* Work around for recalculating logits in cached prompts
2023-05-29 05:13:40 -07:00
Jiří Podivín
0e730dd23b
Adding git in container package dependencies (#1621)
Git added to build packages for version information in docker image

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-05-28 21:45:50 -07:00
Johannes Gäßler
3b126f654f
LLAMA_DEBUG adds debug symbols (#1617) 2023-05-28 21:01:02 +02:00
Kerfuffle
1b78ed2081
Only show -ngl option when relevant + other doc/arg handling updates (#1625)
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and add `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
2023-05-28 11:48:57 -06:00
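
A hedged sketch of the gating described in points 1-2 above (the backend macro names are assumptions on my part; the `LLAMA_SUPPORTS_GPU_OFFLOAD` name comes from the commit message):

```cpp
#include <cstdio>

// llama.h: advertise GPU offload only when a GPU backend is compiled in.
#if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
#define LLAMA_SUPPORTS_GPU_OFFLOAD
#endif

// common example code: only show -ngl / --n-gpu-layers when offload is possible.
static void print_gpu_offload_help() {
#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
    fprintf(stderr, "  -ngl N, --n-gpu-layers N   number of layers to store in VRAM\n");
#endif
}
```
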
Vladimir Zorin
337aea1139
examples : add --alias option to gpt_params to set a user-friendly model name (#1614) 2023-05-28 20:14:24 +03:00
Howard Su
bb051d9723
opencl : no need to allocate cl_mem on heap (#1612) 2023-05-28 20:13:36 +03:00
Howard Su
ca74884f66
opencl : use strstr to check if fp16 supported (#1611)
* Use strstr to check if fp16 supported

* Ensure ext_buffer is null terminated
2023-05-28 20:09:56 +03:00
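
The fp16 check above boils down to querying the device extension string and searching it for `cl_khr_fp16`, making sure the buffer is null-terminated before calling strstr. A minimal sketch of that idea (function and buffer names hypothetical):

```cpp
#include <CL/cl.h>
#include <cstring>

// Returns true if the OpenCL device advertises half-precision support.
bool device_supports_fp16(cl_device_id device) {
    char ext_buffer[1025] = {0};
    size_t ext_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(ext_buffer) - 1, ext_buffer, &ext_size);
    ext_buffer[sizeof(ext_buffer) - 1] = '\0';  // ensure strstr sees a terminated string
    return strstr(ext_buffer, "cl_khr_fp16") != nullptr;
}
```
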