Commit graph

2825 commits

Author · SHA1 · Message · Date
Georgi Gerganov
751591d520
server : add help for --flash-attn arg 2024-04-23 18:16:25 +03:00
Georgi Gerganov
d228bf8552
cont 2024-04-23 17:32:11 +03:00
Georgi Gerganov
56657e52e5
llama : fix n_batch requirements
ggml-ci
2024-04-23 17:30:37 +03:00
Georgi Gerganov
19e8982f51
llama : prep ALiBi support for BERT models
ggml-ci
2024-04-23 17:24:28 +03:00
Georgi Gerganov
78d363b0d4
llama : replace bool need_kq_pos with use_alibi 2024-04-23 17:15:13 +03:00
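For context on the `use_alibi` flag: ALiBi replaces positional embeddings with a per-head linear bias added to the attention scores before softmax. A minimal C++ sketch of the bias computation, using the slope formula from the ALiBi paper for power-of-two head counts — illustrative only, not the ggml implementation:

```cpp
#include <cmath>
#include <cstdio>

// ALiBi: add a bias of -slope * (i - j) to the attention score between
// query position i and key position j (j <= i for causal attention).
// Slopes form the geometric sequence 2^(-8(h+1)/n_heads) per the paper.
int main() {
    const int n_heads = 8;
    const int n_pos   = 4;

    for (int h = 0; h < n_heads; ++h) {
        const float slope = std::pow(2.0f, -8.0f * (h + 1) / n_heads);
        printf("head %d slope %.5f biases:", h, slope);
        const int i = n_pos - 1;                   // last query position
        for (int j = 0; j <= i; ++j) {
            printf(" %.3f", -slope * (i - j));     // bias added to KQ[i][j]
        }
        printf("\n");
    }
    return 0;
}
```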
Georgi Gerganov
3864eea4cb
ggml : add TODOs for F16/F32 mask/pos support in other backends 2024-04-23 10:06:56 +03:00
Georgi Gerganov
c129369702
cuda : try to fix __hgt2_mask
ggml-ci
2024-04-23 09:18:55 +03:00
Georgi Gerganov
c70bfd7bcb
cuda : "constexpr dim3" -> "const dim3"
ggml-ci
2024-04-22 20:31:23 +03:00
Georgi Gerganov
5408d55506
cuda : uint -> uint32_t 2024-04-22 19:12:06 +03:00
Georgi Gerganov
f725ca90fb
ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
2024-04-22 14:53:11 +03:00
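A minimal sketch of what F16/F32 mask/pos support in a fused soft_max means in practice: the mask (or ALiBi position bias) may arrive as either half or single precision, and each element is widened to F32 before being added to the scaled logits. The decoder and function names below are illustrative stand-ins for ggml's fp16 helpers and kernels:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Minimal IEEE-754 binary16 -> binary32 decoder (handles zero, subnormals,
// inf and NaN); stands in for ggml's fp16 helpers in this sketch.
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t mant =  h        & 0x3ff;
    uint32_t bits;
    if (exp == 0x1f) {                     // inf / NaN
        bits = sign | 0x7f800000u | (mant << 13);
    } else if (exp != 0) {                 // normal
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    } else if (mant == 0) {                // signed zero
        bits = sign;
    } else {                               // subnormal: renormalize
        int e = -1;
        do { mant <<= 1; e++; } while (!(mant & 0x400));
        bits = sign | ((127 - 15 - e) << 23) | ((mant & 0x3ff) << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Fused soft_max over one row: softmax(scale * x + mask). The mask may be
// stored as F16 or F32; either way it is widened to F32 before the add.
static void soft_max_row(const float * x, const void * mask, bool mask_is_f16,
                         float scale, float * out, int n) {
    std::vector<float> s(n);
    float maxv = -INFINITY;
    for (int i = 0; i < n; ++i) {
        const float m = mask_is_f16 ? fp16_to_fp32(((const uint16_t *)mask)[i])
                                    : ((const float  *)mask)[i];
        s[i] = scale * x[i] + m;
        maxv = std::max(maxv, s[i]);
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) { out[i] = std::exp(s[i] - maxv); sum += out[i]; }
    for (int i = 0; i < n; ++i) { out[i] /= sum; }
}

int main() {
    const float x[4]    = {1.0f, 2.0f, 3.0f, 4.0f};
    const float mask[4] = {0.0f, 0.0f, -INFINITY, 0.0f};  // causal-style mask
    float out[4];
    soft_max_row(x, mask, /*mask_is_f16=*/false, 1.0f, out, 4);
    for (float v : out) printf("%f ", v);
    printf("\n");
    return 0;
}
```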
Georgi Gerganov
c11d05fec0
llama : force disable flash attention for incompatible models 2024-04-22 12:50:41 +03:00
Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext 2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help 2024-04-22 12:50:10 +03:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn 2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor 2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code 2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up 2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows 2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade (#6765) 2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes 2024-04-19 14:04:56 +03:00
Georgi Gerganov
1db66c1dac
Merge branch 'master' into gg/flash-attn 2024-04-19 14:03:55 +03:00
Georgi Gerganov
74d57f9513
llama : simplify llama_build_kv_store
ggml-ci
2024-04-19 13:49:57 +03:00
nopperl
9958c81b79
Implement the OLMo architecture (#6741)
* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
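The `clamp_kqv` fix above concerns OLMo's clipping of the Q/K/V projections before attention. A minimal C++ sketch of such a clamp, assuming an hparams-supplied threshold (the `8.0f` in the comment is illustrative; the real value comes from the model's metadata):

```cpp
#include <algorithm>
#include <vector>

// Sketch of clamp_kqv: when the model sets a clamp value (OLMo does, e.g.
// on the order of 8.0f), the Q/K/V projections are clamped elementwise to
// [-c, c] before attention. A non-positive c disables clamping.
void clamp_kqv(std::vector<float> & t, float c) {
    if (c <= 0.0f) return;
    for (float & x : t) x = std::clamp(x, -c, c);
}
```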
Austin
8b1b1f4982
train : add general name (#6752)
* llama : make general.name optional

* train: Add 'general.name' to model metadata

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
Neo Zhang
bca40e9814
fix wrong parameter in cmd in readme-sycl.md (#6755)
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
2024-04-19 09:16:31 +08:00
Georgi Gerganov
9ca869876e
batched-bench : add fattn arg 2024-04-18 21:41:32 +03:00
Georgi Gerganov
c16a7c2688
metal : use F32 attention accumulators 2024-04-18 21:20:30 +03:00
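"F32 attention accumulators" means the Metal kernels keep the running softmax/output sums in single precision even when the inputs are F16. A self-contained C++ analogy one precision level up (float vs double accumulator) shows why accumulator precision matters for long sums:

```cpp
#include <cstdio>

// Summing many small terms into a low-precision accumulator loses updates
// once the running sum dwarfs them. Shown here with float vs double; the
// Metal kernels apply the same idea one level down (F16 terms into F32).
int main() {
    const int n = 1 << 24;                        // ~16.8M terms of 1e-4
    float  acc_f = 0.0f;
    double acc_d = 0.0;
    for (int i = 0; i < n; ++i) { acc_f += 1.0e-4f; acc_d += 1.0e-4; }
    printf("float  accumulator: %f\n", acc_f);    // visibly off
    printf("double accumulator: %f\n", acc_d);    // ~1677.7216 (exact-ish)
    return 0;
}
```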
slaren
0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id (#6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
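The change above batches all experts through a single `ggml_mul_mat_id` call instead of one matmul per expert. A conceptual, self-contained C++ sketch of the bucket-by-expert idea, with plain loops standing in for the actual ggml/CUDA kernels:

```cpp
#include <cstdio>
#include <vector>

// Conceptual MoE dispatch: tokens are routed to experts by an id array.
// Grouping token indices by expert turns many small matmuls into one dense
// batch per expert -- the idea behind the single ggml_mul_mat_id call.
int main() {
    const int n_tokens = 6, n_expert = 3;
    const int ids[n_tokens] = {2, 0, 2, 1, 0, 2};   // router output (top-1)

    std::vector<std::vector<int>> buckets(n_expert);
    for (int t = 0; t < n_tokens; ++t) buckets[ids[t]].push_back(t);

    for (int e = 0; e < n_expert; ++e) {
        printf("expert %d processes tokens:", e);
        for (int t : buckets[e]) printf(" %d", t);  // one batched matmul here
        printf("\n");
    }
    return 0;
}
```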
Sigbjørn Skjæret
03c0946d73
convert : support models with multiple chat templates (#6588)
* Support converting models with multiple chat templates

Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>

Where `tokenizer.chat_templates` is an array of the template names (except `default`); `default` itself is stored in the regular `tokenizer.chat_template` field.

* replace filtered characters with underscore

* New script to add/modify/remove metadata

This script creates a copy of a GGUF file and allows you to add/modify/remove metadata in the process.

Most importantly, this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.

* Add files via upload

add new script to project/readme

* flake--
2024-04-18 14:49:01 +03:00
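A small sketch of the key scheme described above: non-default template names are sanitized (filtered characters replaced with `_`) and appended to the `tokenizer.chat_template.` prefix, while the list of names goes into `tokenizer.chat_templates`. The allowed-character rule below is an assumption; the conversion script's own filter is authoritative:

```cpp
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

// Build the per-template metadata keys described in the commit message.
// Assumed filter: keep alphanumerics, '_' and '-', replace everything else.
static std::string sanitize(const std::string & name) {
    std::string out;
    for (char c : name)
        out += (std::isalnum((unsigned char)c) || c == '_' || c == '-') ? c : '_';
    return out;
}

int main() {
    const std::vector<std::string> names = {"default", "tool use", "rag"};
    for (const auto & n : names) {
        if (n == "default") {
            printf("tokenizer.chat_template = <template body>\n");
        } else {
            printf("tokenizer.chat_template.%s = <template body>\n", sanitize(n).c_str());
        }
    }
    return 0;
}
```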
Georgi Gerganov
fa9e8c6689
Merge branch 'master' into gg/flash-attn 2024-04-18 14:39:23 +03:00
Ren Xuancheng
e11b2e6e1e
Qwen2 : assume tied weights if lm_head/output weights are missing (#6738) 2024-04-18 14:38:04 +03:00
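A hedged sketch of the tied-weights fallback: if the checkpoint ships no separate output (lm_head) matrix, the token-embedding matrix is reused as the output projection. The tensor names match llama.cpp's conventions, but the lookup structure below is illustrative, not the loader API:

```cpp
#include <cstdio>
#include <map>
#include <string>

// Illustrative tensor lookup: when "output.weight" is absent, fall back to
// the token embedding ("tied" weights), as assumed for Qwen2 checkpoints
// that omit lm_head.
struct Tensor { /* data omitted */ };

int main() {
    std::map<std::string, Tensor> model = { {"token_embd.weight", {}} };

    const Tensor * output = nullptr;
    auto it = model.find("output.weight");
    if (it != model.end()) {
        output = &it->second;                      // dedicated lm_head
        printf("using output.weight\n");
    } else {
        output = &model.at("token_embd.weight");   // tied embeddings
        printf("output.weight missing -> reusing token_embd.weight\n");
    }
    (void)output;
    return 0;
}
```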
Georgi Gerganov
105332cc17
metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel
2024-04-18 14:33:07 +03:00
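"Switch to parallel reduce" in the list above refers to replacing a serial reduction with a tree-shaped one across simdgroup threads. A minimal C++ sketch of the access pattern (the real code is a Metal kernel using simdgroup operations; this just shows the halving loop):

```cpp
#include <cstdio>

// Tree reduction: at each step, lane i accumulates lane i + stride, halving
// the active width until one value remains -- log2(n) steps instead of n.
// In the Metal kernel the "lanes" are simdgroup threads; here, array slots.
int main() {
    float v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (int stride = 4; stride > 0; stride >>= 1) {
        for (int i = 0; i < stride; ++i) {
            v[i] += v[i + stride];      // all lanes in a step run in parallel
        }
    }
    printf("sum = %f\n", v[0]);          // 36
    return 0;
}
```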
Georgi Gerganov
260cdb2d08
llama-bench : add -fa,--flash-attn arg 2024-04-18 14:28:19 +03:00
Johannes Gäßler
87968de9a9
fix KQ FP32 precision for parallel_blocks > 1 2024-04-18 13:15:32 +02:00
Johannes Gäßler
2f538b9547
Add __hgt2_mask implementation for CUDA 11 2024-04-18 13:15:32 +02:00
Johannes Gäßler
0bc67dd1c8
Calculate KQ as FP32 if KQV has GGML_PREC_F32 2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0
store temp KQ in registers 2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3
flush softmax exp below threshold to 0 2024-04-18 13:15:32 +02:00
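A sketch of the flush-to-zero trick named above: after subtracting the row maximum, exponents below some threshold contribute nothing representable in half precision, so they can be zeroed outright and the expensive exp skipped. The cutoff value below is illustrative, not the kernel's constant:

```cpp
#include <cmath>
#include <cstdio>

// softmax numerator with flush-to-zero: exp(x - max) underflows f16 well
// before x - max reaches large negative values, so skip exp entirely below
// a cutoff. -20.0f is an illustrative threshold.
static float softmax_exp(float x, float max_val) {
    const float d = x - max_val;
    return d < -20.0f ? 0.0f : std::exp(d);
}

int main() {
    const float max_val = 0.0f;
    const float xs[] = {0.0f, -5.0f, -25.0f};
    for (float x : xs) {
        printf("exp(%.1f) -> %g\n", x, softmax_exp(x, max_val));
    }
    return 0;
}
```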
Johannes Gäßler
6a3b84236d
fix flash_attn_vec_f16 race condition 2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39
CUDA: refactor host code, dyn. par. blocks 2024-04-18 13:15:32 +02:00
slaren
c71bfd736e
llama : fix compatibility with old 2 expert models (#6735) 2024-04-18 10:04:47 +03:00
Pierrick HYMBERT
5668c79ea0
server: bench: enable flash_attn param 2024-04-17 23:26:29 +02:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed (#6716)
* build : sgemm.o only when needed

ggml-ci

* llamafile : tmp disable due to MoE bug

ggml-ci
2024-04-17 23:58:26 +03:00
Yaroslav
8dd1ec8b3f
readme : add UI (#6724)
* Update README.md

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick HYMBERT
405385726e
server: support flash_attn param 2024-04-17 14:05:02 +02:00
Georgi Gerganov
599ce84a71
llama : flash_attn cparam + fix defrag 2024-04-17 12:01:39 +03:00
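The `flash_attn` context param above is what downstream tools (server, bench, llama-bench) toggle. A hedged sketch of turning it on from the C API, assuming the field name matches the commit title and using the loader entry points as they existed around this time; "model.gguf" is a placeholder:

```cpp
#include "llama.h"

// Hedged sketch: enabling the flash_attn context param added in this commit.
int main() {
    llama_backend_init();
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn = true;   // request the fused flash-attention path
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```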
Georgi Gerganov
2c41180e88
Merge branch 'master' into gg/flash-attn 2024-04-17 10:13:09 +03:00