Commit graph

904 commits

Author SHA1 Message Date
0cc4m
4e580284c0 Allow parallel execution of kernels, parallelize third and fourth dimension calls 2023-07-24 22:51:19 +02:00
0cc4m
53809c9c26 Fix trailing whitespace in CMakeLists.txt 2023-07-23 11:28:15 +02:00
0cc4m
1ac8ff3593 Handle devices with only a single queue 2023-07-22 20:05:57 +02:00
0cc4m
67843a3812 Reuse pinned allocation for f16 conversion 2023-07-22 18:48:15 +02:00
0cc4m
f2d4ca34bf Reduce usage of waitIdle 2023-07-22 18:25:07 +02:00
0cc4m
3452095089 Unroll loops in dmmv shader 2023-07-22 17:46:52 +02:00
0cc4m
2859562501 Run glslc commands in parallel 2023-07-22 17:42:34 +02:00
0cc4m
754ea680a6 Basic offloading support with mul_f32 and dmmv for q4_0 2023-07-22 10:16:18 +02:00
0cc4m
3432e378d5 Replace VMA library with native Vulkan buffer management 2023-07-20 21:57:33 +02:00
0cc4m
b5b133723a Don't free before queue done 2023-07-20 19:32:17 +02:00
0cc4m
9e97cb0baf Don't force aligned matmul 2023-07-19 21:59:03 +02:00
0cc4m
105fd199be Use pinned memory for f16 preprocessing 2023-07-19 21:03:11 +02:00
0cc4m
e4903957ec Add vectorized loading and zeropadding for matrix multiplication 2023-07-19 10:13:51 +02:00
0cc4m
8d351b8bd8 Merge upstream changes, fix conflict 2023-07-17 22:14:22 +02:00
Jiahao Li
7568d1a2b2
Support dup & cont ops on CUDA (#2242) 2023-07-17 20:39:29 +03:00
Alex Klinkhamer
b7647436cc
llama : fix t_start_sample_us initialization warning (#2238) 2023-07-17 00:01:45 +03:00
Qingyou Meng
672dda10e4
ggml : fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG (#2219)
* fixed runtime bugs and compile errors related to GGML_PERF and GGML_DEBUG

* remove ifdef GGML_PERF; update fmt
2023-07-16 22:57:28 +03:00
Jiří Podivín
27ab66e437
py : turn verify-checksum-models.py into executable (#2245)
README.md was adjusted to reflect the change.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-07-16 22:54:47 +03:00
0cc4m
931a8921de Fix F32 matmul 2023-07-16 07:55:53 +02:00
0cc4m
f58fa51fd0 Increase matmul test runs for consistent results 2023-07-15 22:46:24 +02:00
0cc4m
22a4cb7f03 Handle stage flags during command buffer submission properly 2023-07-15 22:00:47 +02:00
Xiao-Yong Jin
6e7cca4047
llama : add custom RoPE (#2054)
* Implement customizable RoPE

The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

with the default matches the original

scale = 1.0
base = 10000

The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.

Recent researches show changing these two parameters extends the context limit with minimal loss.

1. Extending Context to 8K
   kaiokendev
   https://kaiokendev.github.io/til#extending-context-to-8k

2. Extending Context Window of Large Language Models via Positional Interpolation
   Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
   https://arxiv.org/abs/2306.15595

3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
   https://www.reddit.com/user/bloc97
   https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5

* ggml-metal: fix custom rope

* common: fix argument names in help

* llama: increase MEM_REQ_EVAL for MODEL_3B

It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.

* llama: make MEM_REQ_EVAL depend on n_ctx

* server: use proper Content-Type in curl examples

Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded

Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192

With Content-Type: application/json, we can send large json data.

* style : minor fixes, mostly indentations

* ggml : fix asserts

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-15 13:34:16 +03:00
0cc4m
ad3d28ee0a Reuse semaphores 2023-07-15 09:18:54 +02:00
0cc4m
0c4d841cf1 Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints 2023-07-15 09:06:53 +02:00
Dave Della Costa
a6803cab94
flake : add runHook preInstall/postInstall to installPhase so hooks function (#2224) 2023-07-14 22:13:38 +03:00
wzy
7dabc66f3c
make : use pkg-config for OpenBLAS (#2222) 2023-07-14 22:05:08 +03:00
Bach Le
7cdd30bf1f
cuda : allocate all temporary ggml_tensor_extra_gpu from a fixed-size buffer (#2220) 2023-07-14 22:00:58 +03:00
Evan Miller
e8035f141e
ggml : fix static_assert with older compilers #2024 (#2218) 2023-07-14 21:55:56 +03:00
Bach Le
7513b7b0a1
llama : add functions that work directly on model (#2197)
* Remove vocab reference from context

* Add functions that works directly with model
2023-07-14 21:55:24 +03:00
Ali Chraghi
de8342423d
build.zig : install config header (#2216) 2023-07-14 21:50:58 +03:00
Shangning Xu
c48c525f87
examples : fixed path typos in embd-input (#2214) 2023-07-14 21:40:05 +03:00
Jiahao Li
206e01de11
cuda : support broadcast add & mul (#2192)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-14 21:38:24 +03:00
Johannes Gäßler
4304bd3cde
CUDA: mul_mat_vec_q kernels for k-quants (#2203) 2023-07-14 19:44:08 +02:00
James Reynolds
229aab351c
make : fix combination of LLAMA_METAL and LLAMA_MPI (#2208)
Fixes https://github.com/ggerganov/llama.cpp/issues/2166 by moving commands after the CFLAGS are changed.
2023-07-14 20:34:40 +03:00
Georgi Gerganov
697966680b
ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope) 2023-07-14 16:36:41 +03:00
Kawrakow
27ad57a69b
Metal: faster Q4_0 and Q4_1 matrix x vector kernels (#2212)
* 3-5% faster Q4_0 on Metal

* 7-25% faster Q4_1 on Metal

* Oops, forgot to delete the original Q4_1 kernel

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-14 11:46:21 +02:00
Howard Su
32c5411631
Revert "Support using mmap when applying LoRA (#2095)" (#2206)
Has perf regression when mlock is used.

This reverts commit 2347463201.
2023-07-13 21:58:25 +08:00
Howard Su
ff5d58faec
Fix compile error on Windows CUDA (#2207) 2023-07-13 21:58:09 +08:00
Bodo Graumann
b782422a3e
devops : add missing quotes to bash script (#2193)
This prevents accidentally expanding arguments that contain spaces.
2023-07-13 16:49:14 +03:00
Shouzheng Liu
1cbf561466
metal : new q4_0 matrix-vector kernel (#2188)
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
2023-07-12 23:10:55 +03:00
Georgi Gerganov
975221e954
ggml : broadcast mul_mat + conv batch support (#2199)
* ggml : broadcast mul_mat + conv batch support

* ggml : apply mul_mat broadcast fix by @jploski
2023-07-12 20:51:29 +03:00
Georgi Gerganov
4523d10d0c ggml : add ggml_pool_1d and ggml_pool_2d 2023-07-12 20:32:15 +03:00
Georgi Gerganov
680e6f9177 cuda : add gelu support 2023-07-12 20:32:15 +03:00
Howard Su
4e7464ef88
FP16 is supported in CM=6.0 (#2177)
* FP16 is supported in CM=6.0

* Building PTX code for both of 60 and 61

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-07-12 20:18:40 +08:00
Johannes Gäßler
2b5eb72e10
Fixed __dp4a compute capability: 6.0 -> 6.1 (#2189) 2023-07-12 10:38:52 +02:00
Georgi Gerganov
f7d278faf3
ggml : revert CUDA broadcast changes from #2183 (#2191) 2023-07-12 10:54:19 +03:00
Georgi Gerganov
20d7740a9b
ggml : sync (abort callback, mul / add broadcast, fix alibi) (#2183) 2023-07-11 22:53:34 +03:00
Spencer Sutton
5bf2a27718
ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178)
* Add ggml changes

* Update train-text-from-scratch for change

* mpi : adapt to new ggml_tensor->src

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-11 19:31:10 +03:00
Bach Le
c9c74b4e3f
llama : add classifier-free guidance (#2135)
* Initial implementation

* Remove debug print

* Restore signature of llama_init_from_gpt_params

* Free guidance context

* Make freeing of guidance_ctx conditional

* Make Classifier-Free Guidance a sampling function

* Correct typo. CFG already means context-free grammar.

* Record sampling time in llama_sample_classifier_free_guidance

* Shift all values by the max value before applying logsoftmax

* Fix styling based on review
2023-07-11 19:18:43 +03:00
Jinwoo Jeong
3ec7e596b2
docker : add '--server' option (#2174) 2023-07-11 19:12:35 +03:00