Commit graph

1375 commits

Author SHA1 Message Date
Concedo
f09debb1ec remove debug 2023-06-29 20:54:56 +08:00
Concedo
966d736582 revert cublasLt removal 2023-06-29 20:51:02 +08:00
Concedo
10a2bdfaf1 Merge remote-tracking branch 'upstream/ik/context_extend' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
2023-06-29 20:35:17 +08:00
Concedo
c7c6e522e7 bigger scratch buffers for bigger context 2023-06-29 19:43:23 +08:00
Concedo
86b061b98c wip on unified cublas integration, add all the small libraries but exclude the large ones 2023-06-29 18:35:31 +08:00
Concedo
c2f1ed6556 fix compile errors 2023-06-29 17:54:12 +08:00
Concedo
dff5575647 Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	Makefile
#	ggml-opencl.cpp
#	llama.cpp
2023-06-29 17:35:28 +08:00
Concedo
4b3a1282f0 Add flag for lowvram directly into cublas launch param
Merge remote-tracking branch 'yellowrose/pr/open/LostRuins/koboldcpp/lowvram' into concedo_experimental

# Conflicts:
#	koboldcpp.py
2023-06-29 17:07:31 +08:00
Concedo
746f5fa9e9 update lite 2023-06-29 16:44:39 +08:00
LostRuins
96a712ca1b
Porting the improved K-Quant CUDA kernels to OpenCL (#1966)
* Added broken new q4k quant

* xx + ib0

* Fix q2_k fast kernel

* Use preprocessor for QK_K

* Add q6_k fast matmul kernel

* ported q3k speedup successfully

* ported q2k and q5k speedups

* remove old dot kernels and template

* fixed global const struct types

* fixing address spaces

* fixed string too long CI issue

---------

Co-authored-by: 0cc4m <picard12@live.de>
2023-06-29 05:56:43 +02:00
m3ndax
d3494bb86b
llama : replacing auto &kv with const auto &kv (#2041)
* Replacing auto &kv with const auto &kv

* Create codacy.yml

* Delete codacy.yml
2023-06-28 21:39:08 +03:00
Salvador E. Tropea
5b351e94d0
cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)
- Not used
2023-06-28 20:27:31 +03:00
Salvador E. Tropea
6432aabb6d
cuda : fix missing const qualifier in casts (#2027) 2023-06-28 20:26:26 +03:00
Howard Su
b922bc351b
llama : remove shards weight file support (#2000)
* Remove multiple shards

* Remove multiple file loaders

* Remove llama_load_tensor_shard class

* Simplify load logic

* Remove dead code guess_n_parts function

* Remove vocab_only from constructor of llama_model_loader

* Remove alignment_prevents_mmap which is not more needed.

* Remove useless check
2023-06-28 20:13:02 +03:00
Johannes Gäßler
7f9753fa12
CUDA GPU acceleration for LoRAs + f16 models (#1970) 2023-06-28 18:35:54 +02:00
ningshanwutuobang
cfa0750bc9
llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add READMD for llava.py

* add READMD for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
Concedo
b084f4dc46 option for cublas 2023-06-28 21:16:40 +08:00
Concedo
b4698abafc Wip, CUDA porting malloc improvements, gpu accel for non-llama, backport old quants 2023-06-28 18:20:46 +08:00
Erik Scholz
9d23589d63
fix pthreads setaffinity usage on android (#2020) 2023-06-27 19:06:33 +02:00
Iwan Kawrakow
333c40b94c Fixed typo 2023-06-27 19:04:00 +03:00
Iwan Kawrakow
cda30038e4 Modified RoPE with linear scaling
When the context size is greater than the maximum context size
during training, scale the position given to RoPE with
trainign context / n_ctx.
2023-06-27 15:00:22 +03:00
Concedo
9527a783ea fix rope inplace 2023-06-27 19:44:33 +08:00
Concedo
282376c85a Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	tests/test-quantize-perf.cpp
2023-06-27 19:15:27 +08:00
Howard Su
0be54f75a6
baby-llama : fix build after ggml_rope change (#2016) 2023-06-27 08:07:13 +03:00
YellowRoseCx
8afa800fb6 Expose low_vram for CUDA
Enabling --lowvram instructs the program to not allocate a VRAM scratch buffer for holding temporary results. Reduces VRAM usage at the cost of performance, particularly prompt processing speed. Requires CUDA
2023-06-26 16:47:22 -05:00
Georgi Gerganov
181e8d9755
llama : fix rope usage after ChatGLM change 2023-06-27 00:37:33 +03:00
Georgi Gerganov
d9779021bd
ggml : add support for ChatGLM RoPE 2023-06-27 00:06:51 +03:00
Roman Parykin
d38e451578
readme : add Scala 3 bindings repo (#2010) 2023-06-26 22:47:59 +03:00
David Yang
eaa6ca5a61
ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)
* Clean up compiler warnings in train-text

Some brackets to disambiguate order of operations

* Increase GGML_MAX_NAME

Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
2023-06-26 22:45:32 +03:00
Gustavo Rocha Dias
aa777abbb7
readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007)
* docs - Alternative way to build at Android, with CLBlast.

* doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.

* doc- fix typo
2023-06-26 22:34:45 +03:00
Georgi Gerganov
c824d2e368
ggml : avoid conv 2d kernel round up 2023-06-26 21:03:59 +03:00
zrm
b853d45601
ggml : add NUMA support (#1556)
* detect NUMA systems and pin work threads to nodes (linux)

* disable mmap prefetch/readahead for NUMA systems

* avoid sending finalize op to thread pool if it does nothing

* silence robot

* fix args

* make --numa a param

* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

* lower synchronization overhead

* statically allocate

* move numa state to g_state

* add description for --numa

* ggml : minor style changes

* ggml : minor style + try fix sanitizer build

* llama : allow to initialize backend with NUMA support

* llama : avoid ggml include in llama-util.h

* ggml : style / formatting

* ggml : fix handling of ops with n_threads > n_tasks > 1

* server : utilize numa parameter

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26 20:57:59 +03:00
Georgi Gerganov
9225baef71
k-quants : fix indentation 2023-06-26 20:10:52 +03:00
katsu560
a84ab1da8d
tests : fix quantize perf (#1990)
* fix test quantize perf

* avoid the global state
2023-06-26 19:47:02 +03:00
katsu560
5743ca8092
k-quants : add AVX support to dot functions (#1916)
* k_quants : add AVX support

* k_quants : apply review comments
2023-06-26 19:46:07 +03:00
Georgi Gerganov
412c60e473
readme : add link to new k-quants for visibility 2023-06-26 19:45:09 +03:00
Kawrakow
6769e944c7
k-quants : support for super-block size of 64 (#2001)
* k_quants: WIP super-blocks with 64 weights

* k_quants: WIP super-blocks with 64 weights

Q6_K scalar and AVX2 works

* k_quants: WIP super-blocks with 64 weights

Q4_K scalar and AVX2 works

* k_quants: WIP super-blocks with 64 weights

Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)

* k_quants: WIP super-blocks with 64 weights

Q3_K scalar and AVX2 works.

* k_quants: WIP super-blocks with 64 weights

Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar

* k_quants: WIP super-blocks with 64 weights

Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,

* k_quants: WIP super-blocks with 64 weights

Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.

* k_quants: WIP super-blocks with 64 weights

Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.

* k_quants: WIP super-blocks with 64 weights

Q3_K working on CUDA.

* k_quants: WIP super-blocks with 64 weights

Q5_K working on CUDA, and with this CUDA is done.

* k_quants: WIP super-blocks with 64 weights

Q6_K working on ARM_NEON

* k_quants: WIP super-blocks with 64 weights

Q4_K working on ARM_NEON, but quite a bit slower than 256 weights

* k_quants: WIP super-blocks with 64 weights

Q2_K working on ARM_NEON, but quite a bit slower than 256 weights

* k_quants: WIP super-blocks with 64 weights

Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.

* k_quants: WIP super-blocks with 64 weights

Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.

With that, we have full support for ARM_NEON, although
performance is not quite there.

* k_quants: WIP super-blocks with 64 weights

Slightly more efficient Q3_K and Q5_K

* k_quants: WIP super-blocks with 64 weights

Another small improvement for Q3_K and Q5_K on ARM_NEON

* k_quants: WIP super-blocks with 64 weights

Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.

* k_quants: WIP super-blocks with 64 weights

* We are able to pass preprocessor macros to the Metal
  compiler
* Q6_K works and is actually slightly more efficient than
  the QK_K = 256 version (25.2 ms vs 25.8 ms)

* k_quants: WIP super-blocks with 64 weights

Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).

* k_quants: WIP super-blocks with 64 weights

Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).

* k_quants: WIP super-blocks with 64 weights

Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).

* k_quants: WIP super-blocks with 64 weights

Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).

* k_quants: call them _K, not _k, also on Metal

* k_quants: correctly define QK_K in llama.cpp

* Fixed bug in q4_K quantization added with the 64-block addition

* Simplify via lambda

* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64

Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.

* k_quants: switch Q4_K to 4-bit scales when QK_K = 64

 Here the loss in accuracy is greater than for Q3_K,
 but the Q4_K points still move further to the left on
 the perplexity vs size curve.

* k_quants: forgot to add the Metal changes in last commit

* k_quants: change Q5_K to be type 0 when QK_K = 64

Still needs AVX2 implementation

* k_quants: AVX2 implementation for new 64-weight Q5_K

* k_quants: 10% faster ARM_NEON Q5_K dot product

* k_quants: fixed issue caused by merging with master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26 19:43:07 +03:00
Howard Su
cbebf61ca7
Fix assert when free invalid cuda pointer (#2005)
Fix assert via initializing extra structure always.
CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
2023-06-26 23:15:47 +08:00
Concedo
1fdf9d1131 desc 2023-06-26 16:58:59 +08:00
Concedo
e4c9aea840 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-06-26 10:35:47 +08:00
Georgi Gerganov
447ccbe8c3
readme : add new roadmap + manifesto 2023-06-25 16:08:12 +03:00
Georgi Gerganov
bd34cdde38
ggml : sync latest ggml (custom operators) 2023-06-25 14:25:08 +03:00
Concedo
d2034ced7b Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	build.zig
#	flake.nix
#	tests/test-grad0.c
#	tests/test-sampling.cpp
#	tests/test-tokenizer-0.cpp
2023-06-25 17:01:15 +08:00
anon998
c2a08f87b8
fix server sampling: top k sampler first (#1977)
Co-authored-by: anon <anon@example.org>
2023-06-25 10:48:36 +02:00
Georgi Gerganov
66a2555ba6
readme : add Azure CI discussion link 2023-06-25 09:07:03 +03:00
sjinzh
e65ca7e14a
zig : upgrade build system support (#1981)
* upgrade zig build system support

* zig : add new line at the end of the file

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-25 08:45:44 +03:00
Robyn
5ec8dd5a3c
#1869 Fix null reference errors when training from scratch with CUDA (#1907)
* #1869 Fix null reference errors when training from scratch with CUDA build

Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.

* ggml : do not dereference src0 if NULL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24 20:10:29 +02:00
Georgi Gerganov
65bdd52a86
tests : sync test-grad0 from ggml 2023-06-24 19:40:18 +03:00
Rowan Hart
fdd1860911
flake : fix ggml-metal.metal path and run nixfmt (#1974) 2023-06-24 14:07:08 +03:00
AN Long
c943d823c1
convert : fix invalid params in write_vocab_only (#1975) 2023-06-24 14:02:06 +03:00