Commit graph

712 commits

Author SHA1 Message Date
Georgi Gerganov
f9be42add0
readme : add quantization info 2023-04-26 23:24:42 +03:00
Georgi Gerganov
574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
* ggml : add Q5_0 quantization (cuBLAS only)

* ggml : fix Q5_0 qh -> uint32_t

* ggml : fix q5_0 histogram stats

* ggml : q5_0 scalar dot product

* ggml : q5_0 ARM NEON dot

* ggml : q5_0 more efficient ARM NEON using uint64_t masks

* ggml : rename Q5_0 -> Q5_1

* ggml : adding Q5_0 mode

* quantize : add Q5_0 and Q5_1 to map

* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-26 23:14:13 +03:00
Ásgeir Bjarni Ingvarsson
87a6f846d3
Allow setting the rng seed after initialization. (#1184)
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
2023-04-26 22:08:43 +02:00
DaniAndTheWeb
ea3ad7eb60
Updating build instructions to include BLAS support (#1183)
* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
2023-04-26 22:03:03 +02:00
Pavol Rusnak
859fee6dfb
quantize : use map to assign quantization type from string (#1191)
instead of `int` (while `int` option still being supported)

This allows the following usage:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`

instead of:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
2023-04-26 18:43:27 +02:00
Concedo
101f7a6e73 updated readme 2023-04-26 23:50:00 +08:00
Concedo
93a8e00dfa Merge branch 'master' into concedo
# Conflicts:
#	flake.nix
2023-04-26 18:01:35 +08:00
Disty0
27bc29128e
Update README.md (#120) 2023-04-26 17:33:34 +08:00
Stephan Walter
4afcc37869
Update SHA256SUMS after quantization change (#1181)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-04-25 23:41:56 +02:00
ostix360
667c501334
py : cast lora_alpha to int in convert-lora-to-ggml (#1170)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-04-25 23:33:08 +02:00
Pavol Rusnak
bb98e77be7
nix: use convert.py instead of legacy wrapper convert-pth-to-ggml.py (#981) 2023-04-25 23:19:57 +02:00
Georgi Gerganov
7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)

* tests : fix test-quantize-fns

* ggml : finalize Q8_0 implementation

* ggml : use q4_0_q8_0 and q4_2_q8_0

* ggml : fix Q8_0 dot product bug (ARM)

* ggml : Q8_0 unroll x2

* ggml : fix bug - using wrong block type

* ggml : extend quantize_fns_t with "vec_dot_type"

* ggml : fix Q8_0 to use 255 values out of 256

* ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25 23:40:51 +03:00
unbounded
dd0eabc049
ggml : use full range for Q4_0 and Q4_2 quantization (#729)
* Use full range for q4_0 quantization

By keeping the sign of the highest magnitude, we can make sure the
highest value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.

* Update quantize_row_q4_0 for AVX/AVX2

* Update quantize_row_q4_0 for WASM

Untested

* Update quantize_row_q4_0 for Arm NEON

* Update quantize_row_q4_0 for PowerPC

Untested

* Use full range for q4_2 quantization
2023-04-25 20:20:46 +03:00
Concedo
0aa3d839fb free old ctx on retry 2023-04-25 23:42:57 +08:00
Concedo
a696b0a16c missed another thing 2023-04-25 23:16:04 +08:00
Concedo
8c9c218609 missed a thing 2023-04-25 23:02:08 +08:00
Concedo
235daf4016 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	README.md
2023-04-25 20:44:22 +08:00
Concedo
72b2331ad6 edge cases with mem crash? need verify 2023-04-25 20:42:30 +08:00
Concedo
5eec5d6ed9 Added backwards compatibility to an earlier version of NeoX. 2023-04-25 20:34:18 +08:00
Concedo
bff998f871 Slight refactor of the python code: credits to @LuxF3rre 2023-04-25 19:20:14 +08:00
xaedes
54bb60e268
ggml : fix bug in ggml_compute_forward_sum_f32 (#1162)
The sum over all rows is now computed instead of just the last row
2023-04-24 23:02:02 +02:00
Georgi Gerganov
8a0f8673ba
ggml : export symbols (#1155) 2023-04-24 22:18:25 +03:00
xaedes
0c5692345d
examples : add save_load_state example (#1150)
* add save_load_state example

* use <cstdio> instead of <iostream> and fprintf / printf instead of cout

* renamed save-load-state example files replacing underscores by dashes
2023-04-24 19:23:31 +03:00
Georgi Gerganov
957c8ae21d
llama : increase scratch buffer size for 65B (ref #1152)
Temporary solution
2023-04-24 18:47:30 +03:00
mgroeber9110
9b0a4d4214
examples/main README improvements and some light refactoring (#1131) 2023-04-24 15:45:32 +00:00
Stephan Walter
2ec83428de
Fix build for gcc 8 and test in CI (#1154) 2023-04-24 15:38:26 +00:00
slaren
e4cf982e0d
Fix cuda compilation (#1128)
* Fix: Issue with CUBLAS compilation error due to missing -fPIC flag

---------

Co-authored-by: B1gM8c <89020353+B1gM8c@users.noreply.github.com>
2023-04-24 17:29:58 +02:00
Concedo
59fb174678 fixed compile errors, made mmap automatic when lora is selected, added updated quantizers and quantization handling for gpt neox gpt 2 and gptj 2023-04-24 23:20:06 +08:00
Concedo
3962eb39c7 added token unbanning 2023-04-24 21:50:20 +08:00
Concedo
1b9b9068b1 merged q4_2 and q4_3 dequants and FIXED CLBLAST SLOWNESS! 2023-04-24 21:33:01 +08:00
Concedo
e58f1d1336 Merge branch 'master' into concedo_experimental 2023-04-24 19:43:17 +08:00
Georgi Gerganov
c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API (#1143) 2023-04-24 07:40:02 +03:00
Concedo
8e615c8245 Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
2023-04-24 12:20:08 +08:00
slaren
1d78fecdab
Fix LoRA acronym (#1145) 2023-04-23 23:03:44 +02:00
Georgi Gerganov
284685f169
scripts : add helper scripts to synch ggml repo 2023-04-23 19:57:09 +03:00
DannyDaemonic
edce63baa9
Added README.md for main with examples and explanations (#1139) 2023-04-23 15:37:02 +00:00
Georgi Gerganov
ec9cdb6752
ggml : do not print perf ops that have not been used at all 2023-04-23 18:32:52 +03:00
Georgi Gerganov
e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make" 2023-04-23 18:15:39 +03:00
Stephan Walter
53c8434398
Improve AVX2 for vec_dot_q4_3_q8_0 (#1138) 2023-04-23 11:01:03 +00:00
Pavol Rusnak
c6524f46eb
readme : update gpt4all instructions (#980) 2023-04-23 10:21:26 +02:00
Concedo
9129e937f9 only llama can use batch sizes above 256 to prevent unacceptably high memory usage 2023-04-23 15:57:06 +08:00
Yishuo Wang
c9e2c26f41
A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 (#1119) 2023-04-23 07:57:05 +00:00
Concedo
432cc91649 still needs to be a bit higher for very small contexts 2023-04-23 15:01:38 +08:00
Concedo
4e1ea2ac61 hopefully fixed the ooms for good 2023-04-23 13:49:50 +08:00
Gustavo Rocha Dias
3f21bd81f3
doc - Better explanation of how to build the libraries at Windows. (#107) 2023-04-23 13:40:09 +08:00
Concedo
d41490c27b just revert back to the working commit 2023-04-23 00:35:42 +08:00
Concedo
c60fb5ef4b fixed rwkv build errors on ARM devices 2023-04-23 00:18:38 +08:00
Concedo
b5d6284190 increase initial buffer too 2023-04-23 00:07:33 +08:00
Concedo
d2f14b2b1f add an extra buffer to mem allocations 2023-04-23 00:04:32 +08:00
Concedo
7c60441d71 Merge branch 'master' into concedo
# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
2023-04-22 23:46:14 +08:00