FSSRepo
78da3387a8
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-25 09:48:37 -05:00
Georgi Gerganov
40ea8cd1ac
metal : fix comment
2024-01-25 16:31:39 +02:00
Georgi Gerganov
432ad04ffa
metal : scale and mask in matrix form
2024-01-25 15:47:52 +02:00
Georgi Gerganov
d917746ddb
metal : avoid redundant loads of the attention
2024-01-25 15:00:49 +02:00
Georgi Gerganov
1446a12b29
metal : efficient flash_attn_f16 implementation
2024-01-25 13:40:31 +02:00
FSSRepo
0fc36d872c
match the Metal impl
2024-01-24 16:45:30 -05:00
FSSRepo
972c2adc15
use half2 instead of half4
2024-01-24 16:41:57 -05:00
FSSRepo
6416821499
fix equivalent fp16 math functions ('undefined' compiler error)
2024-01-24 10:57:05 -05:00
FSSRepo
6374bc5779
cuda: port Metal version of flash_attn_ext
2024-01-23 16:42:53 -05:00
FSSRepo
a689b02ad3
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-23 13:51:59 -05:00
Georgi Gerganov
17720fad66
metal : parallel reduce across heads
2024-01-21 23:01:46 +02:00
Georgi Gerganov
77d08f3272
metal : parallelize across KV size
2024-01-21 22:26:45 +02:00
Georgi Gerganov
a4b6341c7b
wip : template for rows per warp
2024-01-21 19:06:30 +02:00
Georgi Gerganov
f31955f5d1
wip : 4 rows per simd group
2024-01-21 18:01:28 +02:00
Georgi Gerganov
8cde449b8b
wip : 8 rows per simd group
2024-01-21 17:37:24 +02:00
Georgi Gerganov
b97325800a
metal : specialize for head size
2024-01-21 12:01:55 +02:00
Georgi Gerganov
52ae085750
metal : reduce branches
2024-01-21 11:59:09 +02:00
Georgi Gerganov
528da7515e
metal : f16 precision
2024-01-21 11:13:24 +02:00
Georgi Gerganov
1173f49c3b
metal : initial implementation
2024-01-21 10:15:02 +02:00
Georgi Gerganov
a9681febd6
ggml : online attention (CPU)
2024-01-20 16:45:41 +02:00
Georgi Gerganov
c3cdfffa88
Merge branch 'master' into gg/flash-attn
2024-01-20 10:12:07 +02:00
Kylin
cca894f16a
cuda : fix compile error on the Jetson platform ( #4975 )
* cuda: fix compile error in jetson platform
* cuda: update comment in ggml-cuda.cu
* cuda: update ggml-cuda.cu comment
2024-01-20 09:01:46 +02:00
FSSRepo
fded2e6a11
apply suggestions
2024-01-19 20:18:18 -05:00
FSSRepo
09db1a7cf3
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-19 17:38:47 -05:00
Uzo Nweke
381ee19572
finetune : fix ggml_allocr lifetimes (tmp workaround) ( #5033 )
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue #4791
2024-01-19 20:20:50 +02:00
Georgi Gerganov
fa7ebcca99
ggml : fix GQA support in ggml_flash_attn_ext
2024-01-19 20:06:26 +02:00
Georgi Gerganov
a5cacb22b2
imatrix : add README.md
2024-01-19 15:24:47 +02:00
Shijie
9b75cb2b3c
llama : support upcoming Qwen2 ( #5037 )
2024-01-19 13:53:13 +02:00
Georgi Gerganov
de9a147df1
py : fix flake8 lint
2024-01-19 13:52:22 +02:00
Kawrakow
7051aacfac
winogrande: evaluate log-probs in parallel ( #5036 )
This is a relatively minor performance tweak resulting in
~10% speedup on my system.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:39:11 +02:00
chiranko
2b3b999cac
llama : add CodeShell support ( #5016 )
* llama: add codeshell support
* llama.cpp: fix codeshell with NeoX rope
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 11:07:27 +02:00
Kawrakow
993fba8180
perplexity: avoid unnecessary allocations and logit copies ( #5035 )
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:02:39 +02:00
Georgi Gerganov
8b20858e5e
perplexity : faster Winogrande via batching ( #5024 )
* perplexity : faster Winogrande via batching
ggml-ci
* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande
2024-01-19 10:45:06 +02:00
John
57e2a7a52a
llama : fix falcon arch for tied output embeddings ( #4978 )
* falcon arch fix for tied output embeddings
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 00:12:15 +02:00
Georgi Gerganov
9b6ea4263a
cmake : add ggml public headers ( #5011 )
2024-01-18 23:36:07 +02:00
Xuan Son Nguyen
821f0a271e
server : defer tasks when "slot unavailable" ( #5018 )
* server: defer task when no slot is available
* remove unnecessary log
---------
Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18 22:33:05 +02:00
slaren
96d7f56d29
llama : fix mlock with no-mmap with Metal ( #5025 )
2024-01-18 21:12:15 +01:00
Georgi Gerganov
2d5419d08a
imatrix : fix assert for src0 non-cont check
2024-01-18 21:45:51 +02:00
Georgi Gerganov
d391ae9b49
perplexity : fix winogrande N tasks option
2024-01-18 20:49:00 +02:00
Georgi Gerganov
e9240cdfa0
scripts : add get-winogrande.sh
2024-01-18 20:45:39 +02:00
David Sommers
b46757735d
convert.py : fix llama/llama2 conversion due to vocab_size=-1 ( #5019 )
PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).
Without the fix, llama2 models can't be converted. The error is:
`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
2024-01-18 19:20:59 +02:00
Kawrakow
3e945cc1e9
HellaSwag: speed up by parallelizing log-prob evaluation ( #5020 )
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 19:18:21 +02:00
Georgi Gerganov
a1c004ef2e
ggml : add ggml_flash_attn_ext API
2024-01-18 18:55:48 +02:00
FSSRepo
e53de2866a
fix compilation
2024-01-18 11:27:07 -05:00
Georgi Gerganov
ad19812cda
perplexity : faster HellaSwag via batching ( #5017 )
* perplexity : faster HellaSwag
ggml-ci
* perplexity : clean-up
ggml-ci
* perplexity : no need for decode_helper
ggml-ci
* perplexity : add comments
* perplexity : option to specify max batched tasks via `n_parallel`
* perplexity : remove HellaSwag restriction for n_batch
2024-01-18 15:33:01 +02:00
Kawrakow
682986a08e
Add Winogrande evaluation ( #5015 )
* winogrande: simple implementation
It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leaderboard.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.
* winogrande: somewhat better
Score for Mistral-7B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.
* winogrande: improving
Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to the HF leaderboard, so I'm not expecting
we will get up to 78.4 anyway.
It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.
It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last (see the sketch after this entry).
* winogrande: add dataset instructions
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 13:46:27 +02:00
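The averaging heuristic described in the Winogrande commit above (score each candidate ending by its average token log-likelihood, but leave the choice-word tokens and the trailing punctuation out of the average) can be illustrated with a minimal standalone sketch. This is not the actual perplexity.cpp code; the Candidate struct, the avg_logprob_skipping helper, the skip mask and all numbers below are invented for illustration:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical per-token data for one candidate sentence ending.
    struct Candidate {
        std::vector<double> token_logprobs; // log p(token | previous tokens)
        std::vector<bool>   skip;           // true for choice-word / trailing-punctuation tokens
    };

    // Average log-likelihood over the tokens that are not skipped.
    static double avg_logprob_skipping(const Candidate & c) {
        double sum = 0.0;
        int    n   = 0;
        for (std::size_t i = 0; i < c.token_logprobs.size(); ++i) {
            if (c.skip[i]) {
                continue; // skip the choice word(s) and the final punctuation
            }
            sum += c.token_logprobs[i];
            n   += 1;
        }
        return n > 0 ? sum / n : -1e30;
    }

    int main() {
        // Made-up log-probabilities purely for illustration.
        Candidate a = { { -1.2, -0.4, -2.1, -0.3 }, { false, true, false, true } };
        Candidate b = { { -1.0, -1.9, -2.5, -0.2 }, { false, true, false, true } };

        const double sa = avg_logprob_skipping(a);
        const double sb = avg_logprob_skipping(b);

        std::printf("candidate A: %.3f\ncandidate B: %.3f\npick: %c\n",
                    sa, sb, sa > sb ? 'A' : 'B');
        return 0;
    }

In the real implementation the log-probabilities come from the model's logits and the skip mask from the tokenizer; the sketch only shows the averaging-with-exclusions step the commit describes.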
Georgi Gerganov
dcad445d0c
scripts : add helper script to get hellaswag data in txt format
2024-01-18 11:44:49 +02:00
Paul Tsochantaris
1e605f4102
metal : fix memory leak, dangling pointer and unused autorelease pool ( #5007 )
* Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute
* SPM header potential fix
* Reverting symlinks
2024-01-18 10:47:24 +02:00
FSSRepo
f7bcfb0566
cuda: add flash attention + test
2024-01-17 16:38:28 -05:00
Georgi Gerganov
6b6916b215
sync : ggml
2024-01-17 20:54:50 +02:00