Commit graph

539 commits

Author SHA1 Message Date
3ooabkhxtn
25b448a32f - rearranged defines, SSSE3 function only compiled if used 2023-05-12 20:48:41 +00:00
3ooabkhxtn
fc26f54e74 - Put the whole line into defined()
- Use __SSSE3__ instead of __SSE__
2023-05-12 13:59:20 +00:00
3ooabkhxtn
70c2b6c696 Put __SSE3__ into defined() 2023-05-12 13:32:00 +00:00
3ooabkhxtn
ca54314a2f - Improved prefetching
llama_print_timings:        load time =  2899.92 ms
llama_print_timings:      sample time =   127.62 ms /   128 runs   (    1.00 ms per token)
llama_print_timings: prompt eval time =  2705.68 ms /     8 tokens (  338.21 ms per token)
llama_print_timings:        eval time = 52500.58 ms /   127 runs   (  413.39 ms per token)
llama_print_timings:       total time = 55559.90 ms
2023-05-12 10:17:13 +00:00
3ooabkhxtn
8699fd0d43 - Cleanup 2023-05-12 09:25:00 +00:00
3ooabkhxtn
7379dd2dba - Added prefetch
llama_print_timings:        load time =  3021.72 ms
llama_print_timings:      sample time =   128.90 ms /   128 runs   (    1.01 ms per token)
llama_print_timings: prompt eval time =  2826.35 ms /     8 tokens (  353.29 ms per token)
llama_print_timings:        eval time = 53198.13 ms /   127 runs   (  418.88 ms per token)
llama_print_timings:       total time = 56380.69 ms
2023-05-12 09:20:48 +00:00
3ooabkhxtn
78bbb3cdfe - Use 4 accumulations instead of 2
- Removed first accumulation

ideas taken from here https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/

llama_print_timings:        load time =  3087.59 ms
llama_print_timings:      sample time =   132.04 ms /   128 runs   (    1.03 ms per token)
llama_print_timings: prompt eval time =  2894.28 ms /     8 tokens (  361.78 ms per token)
llama_print_timings:        eval time = 58529.67 ms /   127 runs   (  460.86 ms per token)
llama_print_timings:       total time = 61780.98 ms
2023-05-12 09:15:46 +00:00
3ooabkhxtn
607b9c7373 - Split multiplication and addition to make it easier for the compiler to optimise
- Accumulate two acc instead of one

llama_print_timings:        load time =  3137.95 ms
llama_print_timings:      sample time =   132.54 ms /   128 runs   (    1.04 ms per token)
llama_print_timings: prompt eval time =  2943.22 ms /     8 tokens (  367.90 ms per token)
llama_print_timings:        eval time = 59539.50 ms /   127 runs   (  468.81 ms per token)
llama_print_timings:       total time = 62843.23 ms
2023-05-12 08:04:54 +00:00
3ooabkhxtn
524d6c9447 - added sse instructions for ggml_vec_dot_q4_0_q8_0
Reference:

llama_print_timings:        load time = 14251.20 ms
llama_print_timings:      sample time =   129.15 ms /   128 runs   (    1.01 ms per token)
llama_print_timings: prompt eval time = 14050.58 ms /     8 tokens ( 1756.32 ms per token)
llama_print_timings:        eval time = 238504.60 ms /   127 runs   ( 1877.99 ms per token)
llama_print_timings:       total time = 252916.56 ms

SSE3 instructions

llama_print_timings:        load time =  3349.09 ms
llama_print_timings:      sample time =    53.06 ms /    52 runs   (    1.02 ms per token)
llama_print_timings: prompt eval time =  3154.19 ms /     8 tokens (  394.27 ms per token)
llama_print_timings:        eval time = 23759.20 ms /    51 runs   (  465.87 ms per token)
llama_print_timings:       total time = 27174.93 ms
2023-05-12 07:54:33 +00:00
Rinne
089b1c93ba
readme : add C#/.NET bindings repo (#1409) 2023-05-12 08:39:40 +03:00
Georgi Gerganov
b9fd7eee57
ggml : remove bit shuffling (#1405)
* ggml : remove Q4_0 bit shufling (ARM NEON)

* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)

* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)

* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)

* ggml : remove Q5_0 bit shuffling (ARM NEON)

* ggml : 2x faster scalar implementations

* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)

* ggml : simplify scalar dot

* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit

* ggml : fix Q4_1 quantization

* ggml : update cuBLAS + normalize variable names

* ggml : remove Q4_2 mode

* ggml : minor formatting

* ggml : fix Q5_0 quantization

* scripts : add script for measuring the time per token

* AVX implementations (#1370)

* ggml : uniform 5th bit extraction

* llama : produce error upon loading old model files

* llama : fix model magic/version write

* ggml : speed-up Q5_0 + Q5_1 at 4 threads

* ggml : preserve old Q4 and Q5 formats

* ggml : simplify Q8_1 - no need for low / high sums anymore

* ggml : fix Q8_0 and Q8_1 rounding

* Revert "AVX implementations (#1370)"

This reverts commit 948d124837.

* ggml : fix AVX2 implementation

* sha : update hashes for 7B and 13B

* readme : update timings + remove warning banner

* llama : update v2 PR number to 1405

* ggml : fix WASM comments

* ggml : back to original bit order

* readme : add note that Q4 and Q5 have been changed

* llama : fix return for unknown version

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-12 00:23:08 +03:00
CRD716
b608b55a3e
prompts : model agnostic DAN (#1304)
* add model-agnostic dan prompt

* quick readme update

* save a token

* Revert "quick readme update"

This reverts commit 8dc342c069.
2023-05-11 18:10:19 +03:00
Evan Jones
cf348a60e0
main : add option to save full output to session (#1338)
* main : add option to save full output to session

* split behavior into --session and --prompt-cache

* restore original implementation with new names

* PR comments

* move the check for incompatible parameters to gpt_params_parse

* Fix whitespace

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

---------

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
2023-05-10 11:37:14 -04:00
DannyDaemonic
e6a46b0ed1
Locale fix for Windows (#1379) 2023-05-09 19:53:28 +02:00
Sami Farin
9f8dbc4787
use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler (#1314)
* use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler

Tested with a 13B model.

* use _mm_pause() in busyloop

* use _mm_pause() in busyloop on x86_64 to reduce power consumption
2023-05-09 14:29:20 +02:00
DannyDaemonic
41654efea8
Interface improvements and --multiline-input (previously --author-mode) (#1040)
* Interface improvements
* Multiline input
* Track character width
* Works with all characters and control codes + Windows console fixes
2023-05-08 19:45:48 -07:00
Georgi Gerganov
56551bc11f
readme : add notice about upcoming breaking change 2023-05-08 22:52:18 +03:00
AlpinDale
fe60904eef
readme : add TOC and Pygmalion instructions (#1359) 2023-05-08 19:33:30 +03:00
Pavol Rusnak
003ba2fb43
llama : fix hparams shadow (#1367)
fixes #1363
2023-05-08 17:48:21 +03:00
Georgi Gerganov
f9a6364912
llama : require first token to be BOS (#1303)
* llama : require first token to be BOS

* scripts : add ppl-run-all.sh

* perplexity : add BOS for each chunk

* readme : update perplexity values after BOS fix

* perplexity : add clarifying comments
2023-05-08 17:41:54 +03:00
ubik2
95078cc554
convert: add ability to convert safetensors files (#1276)
* when loading a safetensors file, ignore the metadata header
* check for safetensors files first, and only use PyTorch versions when safetensors aren't available
2023-05-08 13:54:26 +02:00
Johannes Gäßler
1f48b0abcf
Documented CUDA reproducibility, added warning (#1346) 2023-05-08 02:42:01 +02:00
Henri Vasserman
e1295513a4
CI: add Windows CLBlast and OpenBLAS builds (#1277)
* Add OpenCL and CLBlast support

* Add OpenBLAS support

* Remove testing from matrix

* change build name to 'clblast'
2023-05-07 13:20:09 +02:00
swittk
1b0fd45465
ggml : Allow usage of CLBlast alongside Accelerate.framework (#1336)
Minor edit in ggml.c which originally would prevent OpenCL from loading completely if GGML_USE_ACCELERATE was defined.
Minor speedup in prompt eval time.
2023-05-06 23:03:23 -04:00
Jed Fox
3924088512
Remove default arguments from sampling functions (#1343) 2023-05-06 17:01:47 -04:00
DaniAndTheWeb
173d0e6419
makefile: automatic Arch Linux detection (#1332)
This commit is a port of a detection method used in koboldcpp's Makefile in order to automatically set the -lcblas option on Arch Linux
2023-05-05 23:57:14 +02:00
Erik Scholz
a3b85b28da
ci : add cublas to windows release (#1271) 2023-05-05 22:56:09 +02:00
Pavol Rusnak
921dcee00a
readme: add missing info (#1324) 2023-05-05 16:43:36 +02:00
Ionoclast Laboratories
2d13786e91
Fix for OpenCL / clbast builds on macOS. (#1329) 2023-05-05 14:18:21 +02:00
Benjamin Lecaillon
a90e96b266
Convert.py @staticmethod (#1327)
* Line 698 has one #staticmethod and should not

otherwise throw error at unpickle.load() as not callable

* Update convert.py

---------

Co-authored-by: Ivan Stepanov <ivanstepanovftw@gmail.com>
2023-05-05 03:17:07 +03:00
slaren
94c5652fc0
quantize: make output filename optional, default to ggml-model-<ftype>.bin (#1301) 2023-05-05 00:58:56 +02:00
Ivan Stepanov
34d9f22f44
Wrap exceptions in std::exception to verbose output on exception. (#1316) 2023-05-04 18:56:27 +02:00
Ivan Stepanov
d3e8093e9b
convert: support DT_BF16 tensors (#1309)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-05-04 18:54:37 +02:00
44670
360cfe5bec
readme : add OpenBuddy link (#1321) 2023-05-04 19:33:31 +03:00
44670
2edbdb0f99
main : add --in-suffix option (#1318)
* adding --in-suffix option

* print input suffix before generation
2023-05-04 18:41:12 +03:00
Ron Jailall
20fbf2a2a0
ggml : change immintrin.h to intrin.h for compatibility (#1307)
* change immintrin.h to intrin.h for compatibility

Building on windows11 arm throws an error on this line. Seems like using intrin.h covers x86 and and arm

* conditional def of intrin.h

* fix typo in ggml.c
2023-05-04 18:05:59 +03:00
DannyDaemonic
db1080876a
Only escape prompts when used with -e (#1311) 2023-05-04 05:08:25 -07:00
DannyDaemonic
c65a7fbfa9
Update main's README.md with new features (#1296) 2023-05-04 03:02:59 -07:00
Tomas
f647ce040f
fix #1224 reverse prompt and multi line (#1297)
* fix reverse prompt and multi line

* Code Formatting

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-04 03:02:30 -07:00
Georgi Gerganov
799fdc1b5d
ggml : vectorize Q8_0 quantization
https://github.com/ggerganov/ggml/pull/127#issuecomment-1533648531
2023-05-03 23:24:20 +03:00
khimaros
6daa09d879
examples : read chat prompts from a template file (#1196) 2023-05-03 20:58:11 +03:00
Georgi Gerganov
bca9ad938a
minor : fix whitespaces (#1302) 2023-05-03 20:09:42 +03:00
Georgi Gerganov
e2a937ca6a
minor : fix trailing whitespaces 2023-05-03 18:43:23 +03:00
KASR
b0c71c7b6d
scripts : platform independent script to verify sha256 checksums (#1203)
* python script to verify the checksum of the llama models

Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability.

* Update README.md

update to the readme for improved readability and to explain the usage of the python checksum verification script

* update the verification script

I've extended the script based on suggestions by @prusnak

The script now checks the available RAM, is there is enough to check the file at once it will do so. If not the file is read in chunks.

* minor improvment

small change so that the available ram is checked and not the total ram

* remove the part of the code that reads the file at once if enough ram is available

based on suggestions from @prusnak i removed the part of the code that checks whether the user had enough ram to read the entire model at once. the file is now always read in chunks.

* Update verify-checksum-models.py

quick fix to pass the git check
2023-05-03 18:31:28 +03:00
CRD716
a8a2efdc81
examples : various prompt and example fixes (#1298)
* fix dan.txt

* miku prompt improvements

* use common characters
2023-05-03 18:26:47 +03:00
Evan Jones
e216aa0463
llama : only copy used KV cache in get / set state (#1272)
* llama : only copy used KV cache in get / set state

* switch to ggml for copying k, v

* avoid designated initializers
2023-05-02 22:26:13 -04:00
DannyDaemonic
2485d7a4d3
Process escape sequences given in prompts (#1173) 2023-05-02 18:46:20 -07:00
DannyDaemonic
13b0c68ed7
Handle signals properly on Windows (#1123) 2023-05-02 18:01:57 -07:00
DannyDaemonic
55bc5f0900
Call sh on build-info.sh (#1294) 2023-05-02 17:52:35 -07:00
kuvaus
9daff419f6
fix build-info.h for git submodules (#1289)
* make git build info work with submodules

---------

Co-authored-by: Green Sky <green@g-s.xyz>
2023-05-03 02:43:43 +02:00