Commit graph

1067 commits

Author · SHA1 · Message · Date
Georgi Gerganov
596e1094fb
common : remove obsolete BPE API + disable test-tokenizer-1 2023-08-23 20:31:03 +03:00
Georgi Gerganov
2424e1d08e
llama : remove obsolete comment
ggml-ci
2023-08-23 20:16:40 +03:00
Georgi Gerganov
3bfb720642
llama : advanced BPE tokenizer based on ggllm.cpp implementation 2023-08-23 20:11:45 +03:00
Georgi Gerganov
c3f8a6e49f
llama : prep new tokenizer support 2023-08-23 19:08:44 +03:00
Georgi Gerganov
6938c5f474
Merge branch 'master' into falcon 2023-08-23 17:08:14 +03:00
Georgi Gerganov
176ea716b3
llama : better model naming and size reporting 2023-08-23 15:53:57 +03:00
slaren
e7299656bd
falcon : add CUDA offloading (#2739) 2023-08-23 15:51:30 +03:00
Georgi Gerganov
854ae5d030
metal : temporary workaround for the concurrency optimization bug 2023-08-23 15:25:31 +03:00
Georgi Gerganov
0a85ae7397
metal : fix GELU kernel numerical stability by using precise::tanh 2023-08-23 15:05:34 +03:00
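For context on the fix above: the numerically sensitive part of the tanh-based GELU approximation that ggml's GELU kernels follow is the tanh term itself, which is why a precise tanh matters. For reference only (this is the approximation, not the kernel source):

$$\mathrm{GELU}(x) \approx \tfrac{1}{2}\,x\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right)$$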
klosax
b693000c2e
llama.cpp : fix linefeed token 2023-08-23 13:22:41 +02:00
Kawrakow
8207214b6a
Fix values shown in the quantize tool help (#2735)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:57:12 +03:00
Kawrakow
62959e740e
Strided perplexity (#2714)
* Implementing strided computation of perplexity

* Alternative way to output PPL results

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23 12:56:42 +03:00
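For reference, the perplexity reported by the tool is the standard definition below; striding only changes how the evaluation windows over the text overlap (details in the PR), not the formula itself:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x_i \mid x_{<i}\right)\right)$$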
IgnacioFDM
7f7ddd5002
Fix ggml to gguf conversion on Windows (#2733)
This fixes `RuntimeWarning: overflow encountered in long_scalars`

Credit: anon (not mine)
2023-08-23 03:31:09 -06:00
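The warning above is typical of NumPy on Windows, where the default integer scalar is 32-bit; multiplying large tensor dimensions then overflows. A minimal sketch of the failure mode and the usual remedy (illustrative only, not the converter's actual code):

```python
import numpy as np

# On Windows, NumPy's default integer is 32-bit (C long), so products of
# large tensor dimensions can overflow and emit
# "RuntimeWarning: overflow encountered in long_scalars".
rows, cols = np.int32(32000), np.int32(70000)   # hypothetical tensor shape
n_bytes_bad = rows * cols * 4                   # overflows a 32-bit integer

# Widening to 64 bits (or using plain Python ints) avoids the overflow.
n_bytes_ok = int(rows) * int(cols) * 4
print(n_bytes_ok)                               # 8960000000
```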
Georgi Gerganov
e2d23bed1b
falcon : minor changes (still chasing the Metal problem) 2023-08-23 12:25:49 +03:00
Georgi Gerganov
a0dc47a501
metal : print extra compute pipeline info 2023-08-23 11:25:26 +03:00
Georgi Gerganov
b34ab74094
falcon : copy-paste self-attention from LLaMA 2023-08-23 11:04:26 +03:00
Georgi Gerganov
af4bbcc873
ggml : ggml_repeat always creates new tensor 2023-08-23 10:42:02 +03:00
Georgi Gerganov
99bb26078f
metal : implement RoPE (mode = 2) + avoid ggml_repeat 2023-08-23 10:41:35 +03:00
Georgi Gerganov
e3c52bd990
ggml : pass eps to ggml_norm 2023-08-23 10:40:58 +03:00
Xiao-Yong Jin
b8ad1b66b2
server : allow json array in prompt or content for direct token input (#2306)
* server: allow json array in prompt or content

We accept an array of strings and numbers representing tokens,
in addition to the current string-valued prompt or content.

This allows direct token input, so any special tokens can be
processed and inserted on the frontend while constructing the JSON
data, before sending it to the server; the server then does not
need to know about or parse special tokens in textual input.

With this, we can use the BOS and EOS tokens used in llama-2-chat models.

* server: use tokenizePrompt(json) and default "" if empty prompt

* server: fix prompt check

* server: tokenize endpoint no longer adds BOS
2023-08-23 15:12:12 +08:00
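A minimal sketch of the mixed prompt format described above, assuming a llama.cpp server listening on localhost:8080; token ID 1 is LLaMA's BOS, and the fields besides `prompt` are assumptions for this example:

```python
import requests

# "prompt" may now be an array mixing raw token IDs (numbers) and text
# (strings); here ID 1 is LLaMA's BOS token. Host, port and n_predict
# are assumptions for this illustration.
payload = {
    "prompt": [1, "[INST] Write a haiku about autumn. [/INST]"],
    "n_predict": 64,
}
resp = requests.post("http://localhost:8080/completion", json=payload)
print(resp.json().get("content", ""))
```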
Evan Jones
f5fe98d11b
docs : add grammar docs (#2701)
* docs : add grammar docs

* tweaks to grammar guide

* rework GBNF example to be a commented grammar
2023-08-22 21:01:57 -04:00
Kerfuffle
777f42ba18
Improve handling of special tokens in GGML to GGUF converter (#2725)
* Improve UNK, BOS, EOS token handling when converting without metadata.

* Allow importing as a module.

* Remove some obsolete code and minor cleanups.

* Set default UNK token mapping from -1 to 0 in llama.cpp

* Try to handle overflow due to buggy Windows Python with a better error message
2023-08-22 17:39:39 -06:00
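A hypothetical sketch of the fallback behaviour described above (the field names and helper are illustrative, not the converter's actual code): when no tokenizer metadata is available, the special tokens fall back to LLaMA's conventional IDs, with UNK now defaulting to 0 rather than -1:

```python
# LLaMA's conventional special-token IDs; UNK defaults to 0 instead of -1.
DEFAULT_SPECIAL_TOKENS = {"unk": 0, "bos": 1, "eos": 2}

def resolve_special_tokens(metadata: dict | None) -> dict:
    """Return special-token IDs, falling back to the defaults above."""
    tokens = dict(DEFAULT_SPECIAL_TOKENS)
    if metadata:
        for name in tokens:
            token_id = metadata.get(f"{name}_token_id", -1)
            if token_id >= 0:
                tokens[name] = token_id
    return tokens

print(resolve_special_tokens(None))                    # {'unk': 0, 'bos': 1, 'eos': 2}
print(resolve_special_tokens({"eos_token_id": 32000})) # custom EOS kept
```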
klosax
d561b7f724
llama.cpp : fix the fix of bpe tokenizer 2023-08-23 00:06:53 +02:00
klosax
a95ae7526a
llama.cpp : fix bpe tokenizer 2023-08-23 00:02:13 +02:00
goerch
46ef5b5fcf
llama : fix whitespace escaping in tokenizer (#2724) 2023-08-23 00:10:42 +03:00
Johannes Gäßler
c63bb1d16a
CUDA: use mul_mat_q kernels by default (#2683) 2023-08-22 22:47:05 +02:00
klosax
ffa5099c6d
llama.cpp : llama default UNK token = id 0 2023-08-22 22:34:03 +02:00
klosax
9853f2cfb2
convert-falcon-hf-to-gguf.py : fix special token mapping 2023-08-22 22:29:11 +02:00
Georgi Gerganov
7bbbf38c32
llama : minor updates
ggml-ci
2023-08-22 23:26:16 +03:00
Georgi Gerganov
0ec27ad66c
falcon : minor 2023-08-22 23:11:41 +03:00
Georgi Gerganov
2d58444dae
falcon : support non-40B models 2023-08-22 22:52:14 +03:00
Georgi Gerganov
3c7c325b98
falcon : CPU inference working 2023-08-22 22:31:49 +03:00
Georgi Gerganov
085228e1f5
llama : add arch member to llama_model 2023-08-22 22:09:56 +03:00
Alex Petenchea
3b6cfe7c92
convert.py : clarifying error message (#2718) 2023-08-22 21:58:16 +03:00
Georgi Gerganov
5c5413dc14
llama : fix loading progress bar 2023-08-22 21:53:36 +03:00
Georgi Gerganov
2f3c80a845
falcon : load tensor data (CPU only) 2023-08-22 21:42:12 +03:00
Jiahao Li
800c9635b4
Fix CUDA softmax by subtracting max value before exp (#2665) 2023-08-22 20:27:06 +02:00
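The fix above applies the standard max-subtraction trick: subtracting the row maximum leaves softmax unchanged (the common factor exp(-max) cancels) while keeping every exponent non-positive, so exp() cannot overflow. A NumPy illustration of the technique (the actual fix lives in the CUDA kernel):

```python
import numpy as np

def softmax_naive(x):
    # exp() overflows to inf for large logits, giving nan after the division.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow;
    # the result is identical because the factor exp(-max) cancels out.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))    # [nan nan nan] plus an overflow warning
print(softmax_stable(x))   # [0.09003057 0.24472847 0.66524096]
```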
Georgi Gerganov
d1b3b95dc4
convert : add dummy scores + types 2023-08-22 20:55:05 +03:00
Georgi Gerganov
9f28f73785
llm : read arch-specific KVs 2023-08-22 20:34:17 +03:00
Georgi Gerganov
b19c6e4640
Merge branch 'master' into falcon 2023-08-22 20:15:01 +03:00
Georgi Gerganov
3c025a6d07
gguf : add KV constant maps 2023-08-22 20:06:15 +03:00
Georgi Gerganov
deb7dfca4b
gguf : add ftype meta info to the model (#2710)
* llama : add ftype meta info to the model

ggml-ci

* convert.py : add ftype when converting (does not work)

* convert.py : fix Enum to IntEnum

ggml-ci
2023-08-22 20:05:59 +03:00
Georgi Gerganov
3057d6a687
llama : refactor llama_model_load_internal() 2023-08-22 19:30:02 +03:00
Kawrakow
bac66994cf
Quantization improvements for k_quants (#2707)
* Improve LLaMA-2 2-, 3- and 4-bit quantization

* Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of
  attention.wv and feed_forward.w2

This leads to a slight model size increase as follows:
Q2_K  : 2.684G vs 2.670G
Q3_K_S: 2.775G vs 2.745G
Q3_K_M: 3.071G vs 3.057G
Q4_K_S: 3.592G vs 3.563G

LLaMA-2 PPL for context 512 changes as follows:
Q2_K  : 6.6691 vs 6.8201
Q3_K_S: 6.2129 vs 6.2584
Q3_K_M: 6.0387 vs 6.1371
Q4_K_S: 5.9138 vs 6.0041

There are improvements for LLaMA-1 as well, but they are much smaller than the above.

* Minor 4-bit quantization improvement

For the same model size as the previous commit, we get
PPL = 5.9069 vs 5.9138.

* Some more fine tuning

* Adding make_qkx2_quants

With it, we get PPL = 5.8828 for L2-7B Q4_K_S.

* Another minor improvement

* Q2_K improvement

Smaller model, lower perplexity.
 7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201
12B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178

It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk,
which are Q2_K

* Iterating

* Revert Q5_K back to make_qkx1_quants

* Better Q6_K

* make_qkx2_quants is better for Q5_K after all

* Fix after rebasing on master

* Fix for changed tensor names

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-22 19:14:09 +03:00
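A hypothetical sketch of the mixing rule described in the commit message (names and structure are illustrative, not the llama.cpp implementation): the first two layers' attention.wv and feed_forward.w2 tensors are upgraded to a higher-precision k-quant than the rest of the model:

```python
# Illustrative only: map each quantization mix to the higher-precision type
# used for the first two layers of attention.wv and feed_forward.w2.
SENSITIVE_TENSORS = ("attention.wv", "feed_forward.w2")
UPGRADE = {"Q2_K": "Q5_K", "Q3_K_M": "Q5_K", "Q3_K_S": "Q5_K", "Q4_K_S": "Q6_K"}

def quant_type_for(tensor_name: str, layer: int, base_type: str) -> str:
    if tensor_name in SENSITIVE_TENSORS and layer < 2 and base_type in UPGRADE:
        return UPGRADE[base_type]
    return base_type

print(quant_type_for("attention.wv", 0, "Q4_K_S"))   # Q6_K
print(quant_type_for("attention.wv", 5, "Q4_K_S"))   # Q4_K_S
```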
Georgi Gerganov
8bd7f06b58
llama : check if model architecture is known 2023-08-22 19:03:08 +03:00
Georgi Gerganov
4ed3469c68
llama : refactor GGUF constants into static maps 2023-08-22 18:59:39 +03:00
slaren
519c981f8b
embedding : evaluate prompt in batches (#2713) 2023-08-22 16:03:12 +02:00
slaren
1123f7fbdf
ggml-cuda : use graph allocator (#2684)
use a different function for no_alloc to avoid breaking backwards compat, fixes lora

remove 512 n_batch limit

fixed 2048 batch size

cleanup

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-22 15:25:19 +02:00
Georgi Gerganov
ef3f333d37
ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)
* ggml : sync latest (SAM + SD operators, CUDA alibi)

ggml-ci

* ggml : fix tabs
2023-08-22 14:22:08 +03:00
slaren
8e4364f2af
llama-bench : minor fixes (#2695) 2023-08-22 10:56:03 +03:00