Commit graph

1579 commits

Author SHA1 Message Date
Concedo
0cca0726fe reduce number of retries, fixed maxlength > maxctx bug 2023-07-23 09:59:34 +08:00
Concedo
fa0270df7c added some checks to skip generation if busy 2023-07-22 23:10:04 +08:00
Concedo
2807d98fd4 touchup (+2 squashed commit)
Squashed commit:

[8b06458] fixed broken param order

[7eabdc0] very broken, do not use
2023-07-22 22:57:56 +08:00
Concedo
3aec3038d4 bump scratch buffers 2023-07-22 18:12:18 +08:00
Concedo
52c5856a08 auto populate horde model name 2023-07-22 16:03:12 +08:00
Concedo
dd3f8dabed updated cluster to horde.koboldai.net 2023-07-22 12:42:40 +08:00
Concedo
236d0e8955 add tip about using other workers 2023-07-22 12:29:22 +08:00
Concedo
701bf0a6cd reduce sleep time between jobs 2023-07-22 11:56:43 +08:00
Concedo
343ae756fa Merge branch 'master' into concedo_experimental
# Conflicts:
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README.md
#	flake.nix
#	ggml-cuda.cu
2023-07-22 11:51:30 +08:00
Concedo
52c98228aa bugfixes for missing params 2023-07-22 11:37:44 +08:00
Concedo
d7ab6adbc1 embedded horde worker is ready 2023-07-22 11:21:32 +08:00
Richard Roberson
7d5f18468c
examples : add easy python script to create quantized (k-bit support) GGML models from local HF Transformer models (#2311)
* Resync my fork with new llama.cpp commits

* examples : rename to use dash instead of underscore

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 22:01:10 +03:00
Concedo
75064b4ada wip on embedded horde worker 2023-07-22 01:30:25 +08:00
Kawrakow
d924522a46
Custom RoPE + bettter memory management for CUDA (#2295)
* Custom RoPE + bettter memory management for CUDA

* Adjusted look ahead in ggml_cuda_pool_malloc to 5%

This is sufficient it seems.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:27:51 +03:00
Kawrakow
4d76a5f49b
Faster Q3_K implementation on Metal (#2307)
* Faster Q3_K on Metal

* Additional Q3_K speedup on Metal

* Q3_K for QK_K = 64

* Better Q3_K for QK_K = 64

21.6 ms/t -> 21.1 ms/t

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 17:05:30 +03:00
Georgi Gerganov
0db14fef06
ggml : fix the rope fix (513f861953) 2023-07-21 15:16:55 +03:00
Ikko Eltociear Ashimine
03e566977b
examples : fix typo in minigpt4.py (#2298)
promt -> prompt
2023-07-21 14:53:07 +03:00
Georgi Gerganov
513f861953
ggml : fix rope args order + assert (#2054) 2023-07-21 14:51:34 +03:00
Georgi Gerganov
3973b25a64
gitignore : fix final newline 2023-07-21 14:42:41 +03:00
Guillaume "Vermeille" Sanchez
ab0e26bdfb
llama : remove cfg smooth factor as it is only a reparameterization of the guidance scale (#2280) 2023-07-21 13:58:36 +03:00
Jose Maldonado
73643f5fb1
gitignore : changes for Poetry users + chat examples (#2284)
A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thx TheBloke for models)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:53:27 +03:00
Georgi Gerganov
a814d04f81
make : fix indentation 2023-07-21 13:50:55 +03:00
Georgi Gerganov
4c013bb738
ci : fix MNT realpath usage (#2250) 2023-07-21 13:49:18 +03:00
Sky Yan
42c7c2e2e9
make : support customized LLAMA_CUDA_NVCC and LLAMA_CUDA_CCBIN (#2275)
Under certain environment, nvcc and gcc is installed under customized path but not standard path

Co-authored-by: Yan Lin <yanlin@baidu.com>
2023-07-21 13:38:57 +03:00
wzy
78a3d13424
flake : remove intel mkl from flake.nix due to missing files (#2277)
NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261
Currently NixOS doesn't have intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975
So remove it from flake.nix

Some minor changes:

- Change pkgs.python310 to pkgs.python3 to keep latest
- Add pkgconfig to devShells.default
- Remove installPhase because we have `cmake --install` from #2256
2023-07-21 13:26:34 +03:00
Georgi Gerganov
ae178ab46b
llama : make tensor_split ptr instead of array (#2272) 2023-07-21 13:10:51 +03:00
Jiří Podivín
54e3bc76fe
make : add new target for test binaries (#2244)
Programs in the tests directory are now build with target tests
and placed in the same location.

* clean target was expanded to remove new binaries

* test target binaries are listed in a variable

* Locations of binaries were added to the .gitignore

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-21 13:09:16 +03:00
Hatsune Miku
019fe257bb
MIKU MAYHEM: Upgrading the Default Model for Maximum Fun 🎉 (#2287)
* Miku.sh: Set default model to llama-2-7b-chat

* Miku.sh: Set ctx_size to 4096

* Miku.sh: Add in-prefix/in-suffix opts

* Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements
2023-07-21 11:13:18 +03:00
Kawrakow
e68c96f7fe
Faster Q2_K on Metal (#2297)
* Faster Q2_K on Metal

* Deleting unnoticed and dangereous trailing white space

* Fixed bug in new metal Q2_K implementation

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21 10:44:40 +03:00
Przemysław Pawełczyk
9cf022a188
make : fix embdinput library and server examples building on MSYS2 (#2235)
* make : fix embdinput library and server examples building on MSYS2

* cmake : fix server example building on MSYS2
2023-07-21 10:42:21 +03:00
Kawrakow
e782c9e735
Faster Q5_K and Q6_K on Metal (#2294)
* Faster Q6_K on Metal

* Faster Q5_K on Metal

* Another Q5_K speedup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-20 18:19:45 +03:00
Concedo
06c08576f7 Merge remote-tracking branch 'origin/master' into concedo_experimental 2023-07-20 21:02:40 +08:00
Concedo
f036109110 script for henky 2023-07-20 21:02:12 +08:00
Kawrakow
785829dfe8
Faster Q4_K on Metal (#2290)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-20 15:18:43 +03:00
Georgi Gerganov
fff0e0eafe llama : fix regression from #2000 - could not load no-mmap models 2023-07-20 13:47:26 +03:00
Shouzheng Liu
417a85a001
metal: minor q4 optimization and reduce code size (#2248)
* metal: use uint16_t instead of uint8_t.

Apple GPU doesn't like uint8_t. For every operation on uint8_t
the gpu need to copy the uint8_t to an empty 16 bit register, then
it can issue other instructions.

For the matrix-vector multiplication kernel only, we observed a
340~350 GB/s memory read speed on M1 Max after this commit, which is
very close to the reported hardware limit.

* metal: update rms_norm kernel

This commit double the speed of rms_norm operations by using 512 threads
per threadgroup, combining with SIMD primitives to minimize the need for
thread group barriers.

* metal: use template to reduce size

Revert modifications on block_q4_0 and block_q4_1.
2023-07-20 13:32:22 +03:00
Concedo
e85557f798 launcher for rope 2023-07-20 17:45:50 +08:00
Concedo
39dc1a46c4 added token count, updated lite 2023-07-20 14:41:06 +08:00
Concedo
c49a469a79 updated lite 2023-07-19 21:13:00 +08:00
Concedo
2a88d6d3ec Merge remote-tracking branch 'ycros/api-modelbusy-fix' into concedo_experimental 2023-07-19 18:32:13 +08:00
Concedo
13e34d5058 Merge remote-tracking branch 'origin/master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	flake.nix
#	tests/CMakeLists.txt

update readme and lite
2023-07-19 18:28:29 +08:00
Concedo
e9467f5a44 auto rope scale adjustments, added sched yield fix for apple, adjust warning for mirostat 2023-07-19 16:44:44 +08:00
Rinne
294f424554
llama : extend API to get max devices at runtime (#2253) 2023-07-19 10:06:40 +03:00
wzy
45a1b07e9b
flake : update flake.nix (#2270)
When `isx86_32 || isx86_64`, it will use mkl, else openblas

According to
https://discourse.nixos.org/t/rpath-of-binary-contains-a-forbidden-reference-to-build/12200/3,
add -DCMAKE_SKIP_BUILD_RPATH=ON

Fix #2261, Nix doesn't provide mkl-sdl.pc.
When we build with -DBUILD_SHARED_LIBS=ON, -DLLAMA_BLAS_VENDOR=Intel10_lp64
replace mkl-sdl.pc by mkl-dynamic-lp64-iomp.pc
2023-07-19 10:01:55 +03:00
wzy
b1f4290953
cmake : install targets (#2256)
fix #2252
2023-07-19 10:01:11 +03:00
Concedo
0d7240b320 modified rope for cuda 2023-07-19 14:16:27 +08:00
Concedo
374fffb9c6 Reworking rope WIP 2023-07-19 00:54:41 +08:00
Concedo
0a11f50da8 reenabled sched_yield, reduced sampler warning msg to once per session 2023-07-18 20:26:18 +08:00
Georgi Gerganov
d01bccde9f
ci : integrate with ggml-org/ci (#2250)
* ci : run ctest

ggml-ci

* ci : add open llama 3B-v2 tests

ggml-ci

* ci : disable wget progress output

ggml-ci

* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations

ggml-ci

* tests : try to fix tail free sampling test

ggml-ci

* ci : add K-quants

ggml-ci

* ci : add short perplexity tests

ggml-ci

* ci : add README.md

* ppl : add --chunks argument to limit max number of chunks

ggml-ci

* ci : update README
2023-07-18 14:24:43 +03:00
Concedo
6d32e7fc8b Merge commit 'a6803cab94' into concedo_experimental
# Conflicts:
#	.devops/tools.sh
#	Makefile
#	build.zig
#	flake.nix
#	ggml-cuda.cu
#	ggml.h
#	tests/test-grad0.c
#	tests/test-opt.c
2023-07-18 19:12:06 +08:00