Commit graph

1955 commits

Author SHA1 Message Date
FSSRepo
0a481fe1a9 integrate tensor cores 2024-01-26 20:14:02 -05:00
FSSRepo
6e7cb0eeaf update implementation 2024-01-25 11:04:51 -05:00
FSSRepo
78da3387a8 Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda 2024-01-25 09:48:37 -05:00
Georgi Gerganov
40ea8cd1ac
metal : fix comment 2024-01-25 16:31:39 +02:00
Georgi Gerganov
432ad04ffa
metal : scale and mask in matrix form 2024-01-25 15:47:52 +02:00
Georgi Gerganov
d917746ddb
metal : avoid redundant loads of the attention 2024-01-25 15:00:49 +02:00
Georgi Gerganov
1446a12b29
metal : efficient flash_attn_f16 implementation 2024-01-25 13:40:31 +02:00
FSSRepo
0fc36d872c match to metal impl 2024-01-24 16:45:30 -05:00
FSSRepo
972c2adc15 use half2 instead of half4 2024-01-24 16:41:57 -05:00
FSSRepo
6416821499 fix equivalent fp16 math functions, compiler error 'undefined' 2024-01-24 10:57:05 -05:00
FSSRepo
6374bc5779 cuda: port metal version flash_attn_ext 2024-01-23 16:42:53 -05:00
FSSRepo
a689b02ad3 Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda 2024-01-23 13:51:59 -05:00
Georgi Gerganov
17720fad66
metal : parallel reduce across heads 2024-01-21 23:01:46 +02:00
Georgi Gerganov
77d08f3272
metal : parallelize across KV size 2024-01-21 22:26:45 +02:00
Georgi Gerganov
a4b6341c7b
wip : template for rows per warp 2024-01-21 19:06:30 +02:00
Georgi Gerganov
f31955f5d1
wip : 4 rows per simd group 2024-01-21 18:01:28 +02:00
Georgi Gerganov
8cde449b8b
wip : 8 rows per simd group 2024-01-21 17:37:24 +02:00
Georgi Gerganov
b97325800a
metal : specialize for head size 2024-01-21 12:01:55 +02:00
Georgi Gerganov
52ae085750
metal : reduce branches 2024-01-21 11:59:09 +02:00
Georgi Gerganov
528da7515e
metal : f16 precision 2024-01-21 11:13:24 +02:00
Georgi Gerganov
1173f49c3b
metal : initial implementation 2024-01-21 10:15:02 +02:00
Georgi Gerganov
a9681febd6
ggml : online attention (CPU) 2024-01-20 16:45:41 +02:00
Georgi Gerganov
c3cdfffa88
Merge branch 'master' into gg/flash-attn 2024-01-20 10:12:07 +02:00
Kylin
cca894f16a
cuda : fix compile error in jetson platform (#4975)
* cuda: fix compile error in jetson platform

* cuda: update comment in ggml-cuda.cu

* cuda: update ggml-cuda.cu comment
2024-01-20 09:01:46 +02:00
FSSRepo
fded2e6a11 apply suggestions 2024-01-19 20:18:18 -05:00
FSSRepo
09db1a7cf3 Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda 2024-01-19 17:38:47 -05:00
Uzo Nweke
381ee19572
finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)
* Fix issue with alloc causing max_compute_size to be calculated

* remove ggml_allocr_free as suggested in issue #4791
2024-01-19 20:20:50 +02:00
Georgi Gerganov
fa7ebcca99 ggml : fix GQA support in ggml_flash_attn_ext 2024-01-19 20:06:26 +02:00
Georgi Gerganov
a5cacb22b2
imatrix : add README.md 2024-01-19 15:24:47 +02:00
Shijie
9b75cb2b3c
llama : support upcoming Qwen2 (#5037) 2024-01-19 13:53:13 +02:00
Georgi Gerganov
de9a147df1 py : fix flake8 lint 2024-01-19 13:52:22 +02:00
Kawrakow
7051aacfac
winogrande: evaluate log-probs in parallel (#5036)
This is a relatively minor performance tweak resulting in
~10% speedup on my system.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:39:11 +02:00
chiranko
2b3b999cac
llama : add CodeShell support (#5016)
* llama: add codeshell support

* llama.cpp: fix codeshell with NeoX rope

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 11:07:27 +02:00
Kawrakow
993fba8180
perplexity: avoid unnecessary allocations and logit copies (#5035)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:02:39 +02:00
Georgi Gerganov
8b20858e5e
perplexity : faster Winogrande via batching (#5024)
* perplexity : faster Winogrande via batching

ggml-ci

* perplexity : remove unused function

* perplexity : only tokenize selected tasks for Winogrande
2024-01-19 10:45:06 +02:00
John
57e2a7a52a
llama : fix falcon arch for tied output embeddings (#4978)
* falcon arch fix for tied output embeddings

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 00:12:15 +02:00
Georgi Gerganov
9b6ea4263a
cmake : add ggml public headers (#5011) 2024-01-18 23:36:07 +02:00
Xuan Son Nguyen
821f0a271e
server : defer tasks when "slot unavailable" (#5018)
* server: defer task when no slot is available

* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18 22:33:05 +02:00
slaren
96d7f56d29
llama : fix mlock with no-mmap with Metal (#5025) 2024-01-18 21:12:15 +01:00
Georgi Gerganov
2d5419d08a
imatrix : fix assert for src0 non-cont check 2024-01-18 21:45:51 +02:00
Georgi Gerganov
d391ae9b49
perplexity : fix winogrande N tasks option 2024-01-18 20:49:00 +02:00
Georgi Gerganov
e9240cdfa0
scripts : add get-winogrande.sh 2024-01-18 20:45:39 +02:00
David Sommers
b46757735d
convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#5019)
PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).

Without the fix, llama2 models can't be converted. The error is:

`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
2024-01-18 19:20:59 +02:00
Kawrakow
3e945cc1e9
HellaSwag: speed up by parallelizing log-prob evaluation (#5020)
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 19:18:21 +02:00
Georgi Gerganov
a1c004ef2e
ggml : add ggml_flash_attn_ext API 2024-01-18 18:55:48 +02:00
FSSRepo
e53de2866a fix compilation 2024-01-18 11:27:07 -05:00
Georgi Gerganov
ad19812cda
perplexity : faster HellaSwag via batching (#5017)
* perplexity : faster HellaSwag

ggml-ci

* perplexity : clean-up

ggml-ci

* perplexity : no need for decode_helper

ggml-ci

* perplexity : add comments

* perplexity : option to specify max batched tasks via `n_parallel`

* perplexity : remove HellaSwag restriction for n_batch
2024-01-18 15:33:01 +02:00
Kawrakow
682986a08e
Add Winogrande evaluation (#5015)
* winogrande: simple implementation

It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.

* winogrande: somewhat better

Score for Mistral-7B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.

* winogrande: improving

Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.

It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.

It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 13:46:27 +02:00
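
The Winogrande commit message above quotes a 1-sigma statistical uncertainty of ~1.4 for 1267 tasks and describes scoring each candidate by the average log-likelihood of the tokens that follow the choice word(s). The sketch below illustrates both ideas in Python. It is not the code from the commit: the function names, the input layout (a list of per-token log-probabilities plus the index where the choice ends), and the fallback to the full sentence when the choice is last are all assumptions made for illustration. The uncertainty is assumed to be the usual binomial standard error.

```python
import math


def binomial_std_error_pct(accuracy: float, n_tasks: int) -> float:
    """1-sigma uncertainty, in percentage points, of an accuracy measured on n_tasks."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


def mean_logprob_after_choice(token_logprobs, choice_end):
    """Average log-probability of the tokens that follow the choice word(s).

    token_logprobs: per-token log-probs for the whole sentence (hypothetical input).
    choice_end: index of the first token after the choice word(s).
    """
    tail = token_logprobs[choice_end:]
    if not tail:
        # Assumption: when the choice is last in the sentence there is
        # nothing to skip to, so score the whole sentence instead.
        tail = token_logprobs
    return sum(tail) / len(tail)


def pick_choice(candidates):
    """Return the index of the candidate with the highest tail score.

    candidates: list of (token_logprobs, choice_end) pairs, one per choice.
    """
    scores = [mean_logprob_after_choice(lp, end) for lp, end in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Accuracy of ~60% on 1267 tasks, as in the first bullet of the commit.
    print(round(binomial_std_error_pct(0.60, 1267), 2))

    # Toy log-probs: the first candidate starts with a "common" word
    # (high probability) but is followed by a less likely continuation.
    common_word = ([-0.1, -2.0, -2.0], 1)
    rare_word = ([-3.0, -1.0, -1.0], 1)
    print(pick_choice([common_word, rare_word]))
```

In the toy example, averaging over all tokens would favor the first candidate because its high-probability first word dominates, while averaging only over the continuation favors the second. That is the skew the commit message describes.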
Georgi Gerganov
dcad445d0c
scripts : add helper script to get hellaswag data in txt format 2024-01-18 11:44:49 +02:00
Paul Tsochantaris
1e605f4102
metal : fix memory leak, dangling pointer and unused autorel (#5007)
* Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute

* SPM header potential fix

* Reverting symlinks
2024-01-18 10:47:24 +02:00