FSSRepo
2455a8d6c3
update impl
2024-01-27 12:23:40 -05:00
FSSRepo
7cea9735ab
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-27 11:38:20 -05:00
FSSRepo
0a481fe1a9
integrate tensor cores
2024-01-26 20:14:02 -05:00
Georgi Gerganov
6fea843b24
metal : add parallel reduce version (disabled)
2024-01-25 18:09:30 +02:00
FSSRepo
6e7cb0eeaf
update implementation
2024-01-25 11:04:51 -05:00
Georgi Gerganov
f9ca5dcbe8
llama : avoid ggml_cast, use F32 query
2024-01-25 17:46:07 +02:00
FSSRepo
78da3387a8
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-25 09:48:37 -05:00
Georgi Gerganov
40ea8cd1ac
metal : fix comment
2024-01-25 16:31:39 +02:00
Georgi Gerganov
432ad04ffa
metal : scale and mask in matrix form
2024-01-25 15:47:52 +02:00
Georgi Gerganov
d917746ddb
metal : avoid redundant loads of the attention
2024-01-25 15:00:49 +02:00
Georgi Gerganov
1446a12b29
metal : efficient flash_attn_f16 implementation
2024-01-25 13:40:31 +02:00
FSSRepo
0fc36d872c
match to metal impl
2024-01-24 16:45:30 -05:00
FSSRepo
972c2adc15
use half2 instead of half4
2024-01-24 16:41:57 -05:00
FSSRepo
6416821499
fix equivalent fp16 math functions, compiler error 'undefined'
2024-01-24 10:57:05 -05:00
FSSRepo
6374bc5779
cuda: port metal version flash_attn_ext
2024-01-23 16:42:53 -05:00
FSSRepo
a689b02ad3
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-23 13:51:59 -05:00
Georgi Gerganov
17720fad66
metal : parallel reduce across heads
2024-01-21 23:01:46 +02:00
Georgi Gerganov
77d08f3272
metal : parallelize across KV size
2024-01-21 22:26:45 +02:00
Georgi Gerganov
a4b6341c7b
wip : template for rows per warp
2024-01-21 19:06:30 +02:00
Georgi Gerganov
f31955f5d1
wip : 4 rows per simd group
2024-01-21 18:01:28 +02:00
Georgi Gerganov
8cde449b8b
wip : 8 rows per simd group
2024-01-21 17:37:24 +02:00
Georgi Gerganov
b97325800a
metal : specialize for head size
2024-01-21 12:01:55 +02:00
Georgi Gerganov
52ae085750
metal : reduce branches
2024-01-21 11:59:09 +02:00
Georgi Gerganov
528da7515e
metal : f16 precision
2024-01-21 11:13:24 +02:00
Georgi Gerganov
1173f49c3b
metal : initial implementation
2024-01-21 10:15:02 +02:00
Georgi Gerganov
a9681febd6
ggml : online attention (CPU)
2024-01-20 16:45:41 +02:00
Georgi Gerganov
c3cdfffa88
Merge branch 'master' into gg/flash-attn
2024-01-20 10:12:07 +02:00
Kylin
cca894f16a
cuda : fix compile error in jetson platform (#4975)
* cuda: fix compile error in jetson platform
* cuda: update comment in ggml-cuda.cu
* cuda: update ggml-cuda.cu comment
2024-01-20 09:01:46 +02:00
FSSRepo
fded2e6a11
apply suggestions
2024-01-19 20:18:18 -05:00
FSSRepo
09db1a7cf3
Merge branch 'gg/flash-attn' of https://github.com/ggerganov/llama.cpp into flash-attn-cuda
2024-01-19 17:38:47 -05:00
Uzo Nweke
381ee19572
finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue #4791
2024-01-19 20:20:50 +02:00
Georgi Gerganov
fa7ebcca99
ggml : fix GQA support in ggml_flash_attn_ext
2024-01-19 20:06:26 +02:00
Georgi Gerganov
a5cacb22b2
imatrix : add README.md
2024-01-19 15:24:47 +02:00
Shijie
9b75cb2b3c
llama : support upcoming Qwen2 (#5037)
2024-01-19 13:53:13 +02:00
Georgi Gerganov
de9a147df1
py : fix flake8 lint
2024-01-19 13:52:22 +02:00
Kawrakow
7051aacfac
winogrande: evaluate log-probs in parallel (#5036)
This is a relatively minor performance tweak resulting in
~10% speedup on my system.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:39:11 +02:00
chiranko
2b3b999cac
llama : add CodeShell support (#5016)
* llama: add codeshell support
* llama.cpp: fix codeshell with NeoX rope
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 11:07:27 +02:00
Kawrakow
993fba8180
perplexity: avoid unnecessary allocations and logit copies (#5035)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-19 11:02:39 +02:00
Georgi Gerganov
8b20858e5e
perplexity : faster Winogrande via batching (#5024)
* perplexity : faster Winogrande via batching
ggml-ci
* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande
2024-01-19 10:45:06 +02:00
John
57e2a7a52a
llama : fix falcon arch for tied output embeddings (#4978)
* falcon arch fix for tied output embeddings
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-19 00:12:15 +02:00
Georgi Gerganov
9b6ea4263a
cmake : add ggml public headers (#5011)
2024-01-18 23:36:07 +02:00
Xuan Son Nguyen
821f0a271e
server : defer tasks when "slot unavailable" (#5018)
* server: defer task when no slot is available
* remove unnecessary log
---------
Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18 22:33:05 +02:00
slaren
96d7f56d29
llama : fix mlock with no-mmap with Metal (#5025)
2024-01-18 21:12:15 +01:00
Georgi Gerganov
2d5419d08a
imatrix : fix assert for src0 non-cont check
2024-01-18 21:45:51 +02:00
Georgi Gerganov
d391ae9b49
perplexity : fix winogrande N tasks option
2024-01-18 20:49:00 +02:00
Georgi Gerganov
e9240cdfa0
scripts : add get-winogrande.sh
2024-01-18 20:45:39 +02:00
David Sommers
b46757735d
convert.py : fix llama/llama2 conversion due to vocab_size=-1 (#5019)
PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).
Without the fix, llama2 models can't be converted. The error is:
`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
2024-01-18 19:20:59 +02:00
Kawrakow
3e945cc1e9
HellaSwag: speed up by parallelizing log-prob evaluation (#5020)
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-18 19:18:21 +02:00
Georgi Gerganov
a1c004ef2e
ggml : add ggml_flash_attn_ext API
2024-01-18 18:55:48 +02:00
FSSRepo
e53de2866a
fix compilation
2024-01-18 11:27:07 -05:00