Kawrakow
2f9cf974a0
Some more Q4_K and Q5_K speedup on CUDA ( #2346 )
...
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Speed up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Speed up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080: 9.8 ms/t -> 9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-24 00:19:47 +03:00
IgnacioFDM
4f06592cc6
Add gqa parameter support to the server ( #2351 )
...
* Add gqa parameter support to the server
* Change help from stderr to stdout
2023-07-23 23:31:17 +03:00
Johannes Gäßler
70d26ac388
Fix __dp4a documentation ( #2348 )
2023-07-23 17:49:06 +02:00
wzy
57921ca6db
common : n_threads == -1 uses std::thread::hardware_concurrency() ( #2347 )
...
* Fix #2345 , fix incorrect n_threads
* Update examples/common.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-23 16:33:02 +03:00
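The fix above maps a sentinel thread count of -1 to the hardware concurrency. A minimal sketch of that pattern (the function name `resolve_n_threads` is illustrative, not the actual symbol in examples/common.cpp):

```cpp
#include <thread>

// Resolve a user-supplied thread count: -1 means "use all hardware threads".
int resolve_n_threads(int n_threads) {
    if (n_threads == -1) {
        // hardware_concurrency() may return 0 when it cannot be determined,
        // so fall back to a single thread in that case.
        unsigned hc = std::thread::hardware_concurrency();
        return hc > 0 ? (int) hc : 1;
    }
    return n_threads;
}
```

An explicit positive value passes through unchanged, so existing command lines keep their behavior.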
slaren
3602ac4255
fix n_tasks ( #2342 )
...
ggml-ci
2023-07-23 15:19:39 +02:00
slaren
95a6c595e7
ggml: move op parameters from tensors to ggml_tensor::op_params ( #2333 )
...
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops
2023-07-23 14:36:02 +02:00
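The "use memcpy for float params" bullet refers to packing a float into integer parameter storage without type-punning. A hedged sketch of the idea, with an illustrative struct rather than ggml's actual `ggml_tensor`:

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for a tensor that carries its op parameters inline
// instead of in a separate parameter tensor.
struct tensor_like {
    int32_t op_params[8]; // fixed-size parameter storage on the tensor
};

// memcpy is the well-defined way to store a float's bits in int32 storage;
// a reinterpret_cast here would be undefined behavior.
void set_f32_param(tensor_like & t, int idx, float v) {
    std::memcpy(&t.op_params[idx], &v, sizeof(float));
}

float get_f32_param(const tensor_like & t, int idx) {
    float v;
    std::memcpy(&v, &t.op_params[idx], sizeof(float));
    return v;
}
```

Keeping parameters on the tensor also removes the `src[1] = NULL` placeholders mentioned in the last bullet.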
Georgi Gerganov
e76d630df1
llama : grouped-query attention + LLaMAv2 70B support ( #2276 )
...
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
---------
Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23 15:09:47 +03:00
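The core of grouped-query attention is that several query heads share one key/value head. A sketch of the indexing (for LLaMAv2 70B the gqa factor is 8, i.e. 64 query heads over 8 KV heads); this is an illustration of the technique, not the actual llama.cpp code:

```cpp
// Map a query head index to the key/value head it attends with.
// n_head: number of query heads; n_head_kv: number of KV heads.
int kv_head_for(int q_head, int n_head, int n_head_kv) {
    int n_gqa = n_head / n_head_kv; // query heads per KV head (the gqa factor)
    return q_head / n_gqa;          // consecutive query heads share a KV head
}
```

With `n_head_kv == n_head` this degenerates to standard multi-head attention, which is why GQA can be a strict generalization.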
maddes8cht
1d0824b247
llama : print help to stdout ( #2338 )
2023-07-23 14:59:48 +03:00
wzy
bc3ec2cdc9
flake : support nix build '.#opencl' ( #2337 )
2023-07-23 14:57:02 +03:00
Christian Demsar
a940458e48
llama : print max tensor size to stderr ( #2336 )
2023-07-23 14:56:34 +03:00
Jose Maldonado
91171b8072
make : fix CLBLAST compile support in FreeBSD ( #2331 )
...
* Fix the Makefile for CLBLAST compile support and add instructions for compiling llama.cpp on FreeBSD
* More general use-case for CLBLAST support (Linux and FreeBSD)
2023-07-23 14:52:08 +03:00
AustinMroz
355c80f49e
examples : simplify vim plugin ( #2327 )
...
Uses builtin json_encode and json_decode functions to simplify escaping
Removes the need for temp files
2023-07-23 14:16:48 +03:00
Jiahao Li
83a00ce69b
metal : support bcast add & dup & cont op ( #2323 )
2023-07-23 14:00:37 +03:00
0cc4m
53809c9c26
Fix trailing whitespace in CMakeLists.txt
2023-07-23 11:28:15 +02:00
Kawrakow
d2a43664f9
Speed up Q4_K ( #2322 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-23 08:49:20 +03:00
Johannes Gäßler
b9b7d94fc1
CUDA: Fixed 7b q3_K_S with mul_mat_vec_q ( #2313 )
2023-07-22 21:27:34 +02:00
Georgi Gerganov
b47b8a9cfe
llama : optimize memory buffers ( #2325 )
2023-07-22 21:17:57 +03:00
0cc4m
1ac8ff3593
Handle devices with only a single queue
2023-07-22 20:05:57 +02:00
0cc4m
67843a3812
Reuse pinned allocation for f16 conversion
2023-07-22 18:48:15 +02:00
0cc4m
f2d4ca34bf
Reduce usage of waitIdle
2023-07-22 18:25:07 +02:00
0cc4m
3452095089
Unroll loops in dmmv shader
2023-07-22 17:46:52 +02:00
0cc4m
2859562501
Run glslc commands in parallel
2023-07-22 17:42:34 +02:00
klosax
b5fe67f8c6
Perplexity: Compute scores correlated to HellaSwag ( #2312 )
...
* Add parameter --perplexity-lines to perplexity.cpp
2023-07-22 14:21:24 +02:00
whoreson
24baa54ac1
examples : basic VIM plugin
...
VIM plugin for server exe
2023-07-22 13:34:51 +03:00
Georgi Gerganov
dd6c67d3cb
ci : fix args
2023-07-22 12:00:56 +03:00
Georgi Gerganov
5d500e8ccf
ci : add 7B CUDA tests ( #2319 )
...
* ci : add 7B CUDA tests
ggml-ci
* ci : add Q2_K to the tests
* ci : bump CUDA ppl chunks
ggml-ci
* ci : increase CUDA TG len + add --ignore-eos
* ci : reduce CUDA ppl chunks down to 4 to save time
2023-07-22 11:48:22 +03:00
0cc4m
754ea680a6
Basic offloading support with mul_f32 and dmmv for q4_0
2023-07-22 10:16:18 +02:00
Richard Roberson
7d5f18468c
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models ( #2311 )
...
* Resync my fork with new llama.cpp commits
* examples : rename to use dash instead of underscore
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 22:01:10 +03:00
Kawrakow
d924522a46
Custom RoPE + better memory management for CUDA ( #2295 )
...
* Custom RoPE + better memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This appears to be sufficient. We end up using about 200 MB less VRAM that way when running the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:27:51 +03:00
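The 5% look-ahead in `ggml_cuda_pool_malloc` over-allocates slightly so that gradually growing requests can be served from already-pooled buffers instead of forcing new device allocations. A hedged sketch of the scheme, with plain `malloc` standing in for `cudaMalloc` and a simplified pool structure that is not the actual implementation:

```cpp
#include <cstdlib>
#include <cstddef>
#include <vector>

struct pool_buf { void * ptr; size_t size; bool in_use; };
static std::vector<pool_buf> g_pool;

void * pool_malloc(size_t size) {
    // First try to reuse a free pooled buffer that is large enough.
    for (auto & b : g_pool) {
        if (!b.in_use && b.size >= size) {
            b.in_use = true;
            return b.ptr;
        }
    }
    // Otherwise allocate with 5% head-room so a slightly larger
    // request later can still hit the pool.
    size_t look_ahead = size + size / 20;
    void * ptr = std::malloc(look_ahead);
    g_pool.push_back({ptr, look_ahead, true});
    return ptr;
}

void pool_free(void * ptr) {
    // Buffers are returned to the pool, not released to the system.
    for (auto & b : g_pool) {
        if (b.ptr == ptr) { b.in_use = false; return; }
    }
}
```

Keeping the look-ahead small (5% rather than a large growth factor) is what yields the roughly 200 MB VRAM saving the commit reports for the 13B model at context 8192.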
Kawrakow
4d76a5f49b
Faster Q3_K implementation on Metal ( #2307 )
...
* Faster Q3_K on Metal
* Additional Q3_K speedup on Metal
* Q3_K for QK_K = 64
* Better Q3_K for QK_K = 64
21.6 ms/t -> 21.1 ms/t
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:05:30 +03:00
Georgi Gerganov
0db14fef06
ggml : fix the rope fix ( 513f861953 )
2023-07-21 15:16:55 +03:00
Ikko Eltociear Ashimine
03e566977b
examples : fix typo in minigpt4.py ( #2298 )
...
promt -> prompt
2023-07-21 14:53:07 +03:00
Georgi Gerganov
513f861953
ggml : fix rope args order + assert ( #2054 )
2023-07-21 14:51:34 +03:00
Georgi Gerganov
3973b25a64
gitignore : fix final newline
2023-07-21 14:42:41 +03:00
Guillaume "Vermeille" Sanchez
ab0e26bdfb
llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale ( #2280 )
2023-07-21 13:58:36 +03:00
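The smooth factor could be dropped because classifier-free guidance mixes the two logit sets linearly, so an extra smoothing coefficient folds into the guidance scale itself. A sketch of that linear mix, reconstructed from the commit subject rather than copied from the llama.cpp code:

```cpp
#include <vector>
#include <cstddef>

// Blend guided and unguided (guidance) logits: scale == 1 reproduces the
// guided logits, larger scales push further away from the unguided ones.
std::vector<float> apply_guidance(const std::vector<float> & logits,
                                  const std::vector<float> & guidance_logits,
                                  float scale) {
    std::vector<float> out(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = guidance_logits[i] + scale * (logits[i] - guidance_logits[i]);
    }
    return out;
}
```

Since any `smooth * scale` product is just another scale in this expression, exposing both parameters was redundant.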
Jose Maldonado
73643f5fb1
gitignore : changes for Poetry users + chat examples ( #2284 )
...
A fix in the Makefile for FreeBSD users: on that platform x86_64 is amd64. This fix resolves compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native.
Add two examples for interactive mode using Llama2 models (thanks to TheBloke for the models)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:53:27 +03:00
Georgi Gerganov
a814d04f81
make : fix indentation
2023-07-21 13:50:55 +03:00
Georgi Gerganov
4c013bb738
ci : fix MNT realpath usage ( #2250 )
2023-07-21 13:49:18 +03:00
Sky Yan
42c7c2e2e9
make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN ( #2275 )
...
In certain environments, nvcc and gcc are installed under a customized path rather than the standard path
Co-authored-by: Yan Lin <yanlin@baidu.com>
2023-07-21 13:38:57 +03:00
wzy
78a3d13424
flake : remove intel mkl from flake.nix due to missing files ( #2277 )
...
NixOS's mkl package is missing some libraries, such as mkl-sdl.pc. See #2261.
NixOS currently doesn't ship the Intel C compilers (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975
So remove it from flake.nix.
Some minor changes:
- Change pkgs.python310 to pkgs.python3 to keep latest
- Add pkgconfig to devShells.default
- Remove installPhase because we have `cmake --install` from #2256
2023-07-21 13:26:34 +03:00
Georgi Gerganov
ae178ab46b
llama : make tensor_split ptr instead of array ( #2272 )
2023-07-21 13:10:51 +03:00
Jiří Podivín
54e3bc76fe
make : add new target for test binaries ( #2244 )
...
Programs in the tests directory are now built with the tests target
and placed in the same location.
* clean target was expanded to remove new binaries
* test target binaries are listed in a variable
* Locations of binaries were added to the .gitignore
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:09:16 +03:00
Hatsune Miku
019fe257bb
MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 ( #2287 )
...
* Miku.sh: Set default model to llama-2-7b-chat
* Miku.sh: Set ctx_size to 4096
* Miku.sh: Add in-prefix/in-suffix opts
* Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements
2023-07-21 11:13:18 +03:00
Kawrakow
e68c96f7fe
Faster Q2_K on Metal ( #2297 )
...
* Faster Q2_K on Metal
* Delete unnoticed and dangerous trailing whitespace
* Fixed bug in new metal Q2_K implementation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 10:44:40 +03:00
Przemysław Pawełczyk
9cf022a188
make : fix embdinput library and server examples building on MSYS2 ( #2235 )
...
* make : fix embdinput library and server examples building on MSYS2
* cmake : fix server example building on MSYS2
2023-07-21 10:42:21 +03:00
0cc4m
3432e378d5
Replace VMA library with native Vulkan buffer management
2023-07-20 21:57:33 +02:00
0cc4m
b5b133723a
Don't free before queue done
2023-07-20 19:32:17 +02:00
Kawrakow
e782c9e735
Faster Q5_K and Q6_K on Metal ( #2294 )
...
* Faster Q6_K on Metal
* Faster Q5_K on Metal
* Another Q5_K speedup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-20 18:19:45 +03:00
Kawrakow
785829dfe8
Faster Q4_K on Metal ( #2290 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-20 15:18:43 +03:00
Georgi Gerganov
fff0e0eafe
llama : fix regression from #2000 - could not load no-mmap models
2023-07-20 13:47:26 +03:00