Commit graph

685 commits

Author SHA1 Message Date
Concedo
59fb174678 fixed compile errors, made mmap automatic when LoRA is selected, added updated quantizers and quantization handling for GPT-NeoX, GPT-2 and GPT-J 2023-04-24 23:20:06 +08:00
Concedo
3962eb39c7 added token unbanning 2023-04-24 21:50:20 +08:00
Concedo
1b9b9068b1 merged q4_2 and q4_3 dequants and FIXED CLBLAST SLOWNESS! 2023-04-24 21:33:01 +08:00
Concedo
e58f1d1336 Merge branch 'master' into concedo_experimental 2023-04-24 19:43:17 +08:00
Georgi Gerganov
c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API (#1143) 2023-04-24 07:40:02 +03:00
Concedo
8e615c8245 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-04-24 12:20:08 +08:00
slaren
1d78fecdab
Fix LoRA acronym (#1145) 2023-04-23 23:03:44 +02:00
Georgi Gerganov
284685f169
scripts : add helper scripts to synch ggml repo 2023-04-23 19:57:09 +03:00
DannyDaemonic
edce63baa9
Added README.md for main with examples and explanations (#1139) 2023-04-23 15:37:02 +00:00
Georgi Gerganov
ec9cdb6752
ggml : do not print perf ops that have not been used at all 2023-04-23 18:32:52 +03:00
Georgi Gerganov
e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make" 2023-04-23 18:15:39 +03:00
Stephan Walter
53c8434398
Improve AVX2 for vec_dot_q4_3_q8_0 (#1138) 2023-04-23 11:01:03 +00:00
Pavol Rusnak
c6524f46eb
readme : update gpt4all instructions (#980) 2023-04-23 10:21:26 +02:00
Concedo
9129e937f9 only llama can use batch sizes above 256 to prevent unacceptably high memory usage 2023-04-23 15:57:06 +08:00
Yishuo Wang
c9e2c26f41
A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 (#1119) 2023-04-23 07:57:05 +00:00
Concedo
432cc91649 still needs to be a bit higher for very small contexts 2023-04-23 15:01:38 +08:00
Concedo
4e1ea2ac61 hopefully fixed the OOMs for good 2023-04-23 13:49:50 +08:00
Gustavo Rocha Dias
3f21bd81f3
doc - Better explanation of how to build the libraries on Windows. (#107) 2023-04-23 13:40:09 +08:00
Concedo
d41490c27b just revert to the working commit 2023-04-23 00:35:42 +08:00
Concedo
c60fb5ef4b fixed rwkv build errors on ARM devices 2023-04-23 00:18:38 +08:00
Concedo
b5d6284190 increase initial buffer too 2023-04-23 00:07:33 +08:00
Concedo
d2f14b2b1f add an extra buffer to mem allocations 2023-04-23 00:04:32 +08:00
Concedo
7c60441d71 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
2023-04-22 23:46:14 +08:00
Concedo
eb73b4c261 remove writing to cl_buffer_c and change it to a write-only buffer - should work since beta is always zero. 2023-04-22 23:19:17 +08:00
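The reasoning behind this one: SGEMM computes C = alpha*A*B + beta*C, so when beta is always zero the previous contents of the C buffer are never read; it only needs device write access, and the host-to-device upload of C can be skipped. A minimal sketch of the idea, assuming the standard OpenCL host API; the function and dimension names are illustrative, not the actual koboldcpp identifiers:

```c
#include <CL/cl.h>

// Sketch: allocate the GEMM result buffer as write-only when beta == 0.
// Since C = alpha*A*B + beta*C and beta == 0, the kernel overwrites every
// element of C without reading it, so no clEnqueueWriteBuffer is needed.
cl_mem create_result_buffer(cl_context ctx, size_t m, size_t n, cl_int *err) {
    return clCreateBuffer(ctx,
                          CL_MEM_WRITE_ONLY,   /* device never reads C */
                          m * n * sizeof(float),
                          NULL, err);
}
```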
Concedo
cd6c121357 reinstated the reusable buffers -> approx 10% speedup for prompt processing 2023-04-22 22:49:27 +08:00
Georgi Gerganov
0e018fe008
ggml : fix Q4_3 cuBLAS 2023-04-22 16:32:07 +03:00
Stephan Walter
857308d1e8
ci : trigger CI for drafts, but not most PR actions (#1125) 2023-04-22 16:12:29 +03:00
Stephan Walter
c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122) 2023-04-22 10:54:13 +00:00
unbounded
5f939498d5
ggml : unit test for quantization functions (#953)
* Unit test for quantization functions

Use the ggml_internal_get_quantize_fn function to loop through all
quantization formats and run a sanity check on the result.

Also add a microbenchmark that times these functions directly without
running the rest of the GGML graph.

* test-quantize-fns: CI fixes

Fix issues uncovered in CI
 - need to use sizes divisible by 32*8 for loop unrolling
 - use intrinsic header that should work on Mac

* test-quantize: remove

Per PR comment, subsumed by test-quantize-fns

* test-quantize: fix for q8_0 intermediates
2023-04-22 12:10:39 +03:00
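The sanity check described above amounts to a quantize/dequantize round trip whose error must stay under a per-format tolerance. A minimal sketch of that shape, with hypothetical function-pointer types standing in for the entries returned by ggml_internal_get_quantize_fn:

```c
#include <math.h>
#include <stdio.h>

// Hypothetical signatures standing in for ggml's quantize/dequantize pairs.
typedef void (*quantize_fn)(const float *src, void *dst, int n);
typedef void (*dequantize_fn)(const void *src, float *dst, int n);

// Round trip: quantize, dequantize, then check the RMS error against a
// per-format tolerance. n should be divisible by 32*8 (and at most 8192
// here), matching the loop-unrolling constraint noted in the CI fix above.
int check_round_trip(quantize_fn q, dequantize_fn dq,
                     const float *src, int n, float max_rmse) {
    unsigned char buf[32768];
    float out[8192];

    q(src, buf, n);
    dq(buf, out, n);

    double err = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)out[i] - (double)src[i];
        err += d * d;
    }
    double rmse = sqrt(err / n);
    printf("rmse = %f (limit %f)\n", rmse, max_rmse);
    return rmse <= max_rmse;
}
```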
wbpxre150
36b4f7e064
llama : print timings on ctrl+c exit (#1021)
* print timings on ctrl+c exit

* remove redundant free memory call.

* add global pointer to ctx.
2023-04-22 11:56:35 +03:00
Concedo
811989c2ad fixed pyinstaller 2023-04-22 16:31:42 +08:00
eiery
10f19c1121
llama : have n_batch default to 512 (#1091)
* set default n_batch to 512 when using BLAS

* spacing

* alternate implementation of setting different n_batch for BLAS

* set n_batch to 512 for all cases
2023-04-22 11:27:05 +03:00
Concedo
1b7aa2b815 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
2023-04-22 16:22:08 +08:00
Howard Su
7e312f165c
cmake : fix build under Windows when BUILD_SHARED_LIBS is enabled (#1100)
* Fix build under Windows when BUILD_SHARED_LIBS is enabled

* Make the AVX512 test on Windows build the shared libs
2023-04-22 11:18:20 +03:00
Georgi Gerganov
872c365a91 ggml : fix AVX build + update to new Q8_0 format 2023-04-22 11:08:12 +03:00
Concedo
1ea0e15292 Merge branch 'master' into concedo
# Conflicts:
#	llama.cpp
2023-04-22 16:07:27 +08:00
Concedo
b53c5e7c80 update kobold lite 2023-04-22 16:04:43 +08:00
Georgi Gerganov
955ef9a5d5
ggml : alternative Q4_3 implementation using modified Q8_0 (#1109)
* ggml : prefer vzip to vuzp

This way we always use the same type of instruction across all quantizations

* ggml : alternative Q4_3 implementation using modified Q8_0

* ggml : fix Q4_3 scalar implementation

* ggml : slight improvement of Q4_3 - no need for loop unrolling

* ggml : fix AVX paths for Q8_0 quantization
2023-04-22 10:55:35 +03:00
Stephan Walter
c5aa5e5777
ggml : AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring (#1099)
* AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring

* finish AVX vectorization of quantize_row_q8_0

* Rename hsum_int_8 to hsum_i32_8
2023-04-22 10:37:05 +03:00
Clint Herron
e9a9cb0c54
examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107)
* Moving parameters to separate lines for readability.

* Increasing repeat_penalty to 1.1 to make Alpaca more usable by default.

* Adding trailing newline.
2023-04-22 09:54:33 +03:00
xaedes
b6e7f9b09e
llama : add api for getting/setting the complete state: rng, logits, embedding and kv_cache (#1105)
* reserve correct size for logits

* add functions to get and set the whole llama state:

including rng, logits, embedding and kv_cache

* remove unused variables

* remove trailing whitespace

* fix comment
2023-04-22 09:21:32 +03:00
Concedo
6e908c1792 added LoRA support 2023-04-22 12:29:38 +08:00
Concedo
c454f8b848 GPT-NeoX / Pythia integration completed 2023-04-22 11:23:25 +08:00
Concedo
7b3d04e5d4 Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
2023-04-22 10:58:16 +08:00
Concedo
4fa3dfe8bc just doesn't work properly on Windows. Will leave it as a manual flag for others 2023-04-22 10:57:38 +08:00
slaren
50cb666b8a
Improve cuBLAS performance by using a memory pool (#1094)
* Improve cuBLAS performance by using a memory pool

* Move cuda specific definitions to ggml-cuda.h/cu

* Add CXX flags to nvcc

* Change memory pool synchronization mechanism to a spin lock
General code cleanup
2023-04-21 21:59:17 +02:00
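The win here comes from cudaMalloc/cudaFree being expensive, synchronizing calls: a pool recycles previously allocated device blocks, and a spin lock is cheaper than a mutex when the critical section is just a few pointer swaps. A rough sketch of the pattern, not the actual ggml-cuda code; the pool size and eviction policy are illustrative:

```c
#include <cuda_runtime.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SLOTS 16

typedef struct { void *ptr; size_t size; } pool_slot;

static pool_slot   g_pool[POOL_SLOTS];
static atomic_flag g_pool_lock = ATOMIC_FLAG_INIT;

static void pool_lock(void)   { while (atomic_flag_test_and_set(&g_pool_lock)) { /* spin */ } }
static void pool_unlock(void) { atomic_flag_clear(&g_pool_lock); }

// Reuse a cached block when one is large enough; fall back to cudaMalloc.
void *pool_malloc(size_t size) {
    pool_lock();
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (g_pool[i].ptr != NULL && g_pool[i].size >= size) {
            void *p = g_pool[i].ptr;
            g_pool[i].ptr = NULL;
            pool_unlock();
            return p;
        }
    }
    pool_unlock();
    void *p = NULL;
    cudaMalloc(&p, size);
    return p;
}

// Return a block to the pool instead of freeing it immediately.
void pool_free(void *ptr, size_t size) {
    pool_lock();
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (g_pool[i].ptr == NULL) {
            g_pool[i].ptr  = ptr;
            g_pool[i].size = size;
            pool_unlock();
            return;
        }
    }
    pool_unlock();
    cudaFree(ptr); // pool full: release to the driver
}
```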
apaz
25d7abbd1f
llama : fixed rlimit error message (#888) 2023-04-21 21:48:06 +03:00
源文雨
018f2279f5
cmake : link threads publicly to ggml (#1042)
* fix: ld link error when building test-tokenizer-0

```
cmake3 --build . --config Release
[  5%] Built target ggml
[ 16%] Built target llama
[ 22%] Linking CXX executable ../bin/test-tokenizer-0
../libllama.a(ggml.c.o): In function 'ggml_graph_compute':
ggml.c:(.text+0xf2db): undefined reference to 'pthread_create'
ggml.c:(.text+0xf9d4): undefined reference to 'pthread_join'
collect2: error: ld returned 1 exit status
gmake[2]: *** [bin/test-tokenizer-0] Error 1
gmake[1]: *** [tests/CMakeFiles/test-tokenizer-0.dir/all] Error 2
gmake: *** [all] Error 2
```

* Update CMakeLists.txt

* Update CMakeLists.txt

* Update CMakeLists.txt
2023-04-21 21:27:06 +03:00
Alex Klinkhamer
9411288271
main : evaluate tokens in batches after swapping context (#1014)
* examples : evaluate tokens in batches after swapping context

* Update examples/main/main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-21 21:18:09 +03:00
Concedo
ef13443047 wip pythia integration 2023-04-22 01:08:23 +08:00