Georgi Gerganov
56657e52e5
llama : fix n_batch requirements
ggml-ci
2024-04-23 17:30:37 +03:00
Georgi Gerganov
19e8982f51
llama : prep ALiBi support for BERT models
ggml-ci
2024-04-23 17:24:28 +03:00
Georgi Gerganov
78d363b0d4
llama : replace bool need_kq_pos with use_alibi
2024-04-23 17:15:13 +03:00
Georgi Gerganov
3864eea4cb
ggml : add TODOs for F16/F32 mask/pos support in other backends
2024-04-23 10:06:56 +03:00
Georgi Gerganov
c129369702
cuda : try to fix __hgt2_mask
ggml-ci
2024-04-23 09:18:55 +03:00
Georgi Gerganov
c70bfd7bcb
cuda : "constexpr dim3" -> "const dim3"
ggml-ci
2024-04-22 20:31:23 +03:00
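(An illustrative aside on the commit above: `dim3`'s constructors are not `constexpr` on every CUDA toolkit, so a `constexpr dim3` declaration can fail to compile where `const dim3` builds everywhere. A minimal sketch, not the commit's actual code:)

```cpp
#include <cuda_runtime.h>

// Rejected by some CUDA compilers, because dim3's constructor is not
// constexpr there:
//   constexpr dim3 block_dims(32, 1, 1);

// The portable spelling the commit switches to:
const dim3 block_dims(32, 1, 1);
```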
Georgi Gerganov
5408d55506
cuda : uint -> uint32_t
2024-04-22 19:12:06 +03:00
Georgi Gerganov
f725ca90fb
ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
2024-04-22 14:53:11 +03:00
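(For context: ggml's fused soft_max applies the scale, the attention mask, and the ALiBi position bias before normalizing. Schematically, with illustrative symbols rather than ggml's parameter names, where $s$ is the scale, $M$ the mask, $P$ the positions, and $m$ the per-head ALiBi slope:)

$$\operatorname{soft\_max}(x)_i = \frac{\exp(s\,x_i + M_i + m\,P_i)}{\sum_j \exp(s\,x_j + M_j + m\,P_j)}$$

(The commit above lets $M$ and $P$ be supplied as either F16 or F32.)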
Georgi Gerganov
c11d05fec0
llama : force disable flash attention for incompatible models
2024-04-22 12:50:41 +03:00
Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext
2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help
2024-04-22 12:50:10 +03:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn
2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor
2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code
2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up
2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows
2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade (#6765)
2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes
2024-04-19 14:04:56 +03:00
Georgi Gerganov
1db66c1dac
Merge branch 'master' into gg/flash-attn
2024-04-19 14:03:55 +03:00
Georgi Gerganov
74d57f9513
llama : simplify llama_build_kv_store
ggml-ci
2024-04-19 13:49:57 +03:00
nopperl
9958c81b79
Implement the OLMo architecture (#6741)
* implement olmo architecture
* remove unused variable
* remove unused moe branch
* remove check for weight
* remove superfluous moe, bias and rope tensors
* clarified comment
* fix clamp_kqv setting
* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Austin
8b1b1f4982
train : add general name (#6752)
* llama : make general.name optional
* train: Add 'general.name' to model metadata
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
Neo Zhang
bca40e9814
fix wrong parameter in cmd in readme-sycl.md (#6755)
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
2024-04-19 09:16:31 +08:00
Georgi Gerganov
9ca869876e
batched-bench : add fattn arg
2024-04-18 21:41:32 +03:00
Georgi Gerganov
c16a7c2688
metal : use F32 attention accumulators
2024-04-18 21:20:30 +03:00
slaren
0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id (#6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
Sigbjørn Skjæret
03c0946d73
convert : support models with multiple chat templates (#6588)
* Support converting models with multiple chat templates
Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>
Here `tokenizer.chat_templates` is an array of the template names (excluding `default`); the `default` template is stored in the regular `tokenizer.chat_template` key.
* replace filtered characters with underscore
* New script to add/modify/remove metadata
This script creates a copy of a GGUF file and lets you add/modify/remove metadata in the process.
Most importantly, this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.
* Add files via upload
add new script to project/readme
* flake--
2024-04-18 14:49:01 +03:00
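(The entry above only adds metadata; as a hedged sketch of the consumer side, the following reads the new keys with ggml's gguf_* C API, which is declared in ggml.h in this era of the tree. Error handling is minimal and the program is illustrative, not taken from the commit:)

```cpp
#include "ggml.h"   // gguf_* API lives here at this point in the tree
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) return 1;

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) return 1;

    // list the named templates, if the model ships more than one
    const int kid = gguf_find_key(ctx, "tokenizer.chat_templates");
    if (kid >= 0) {
        for (int i = 0; i < gguf_get_arr_n(ctx, kid); i++) {
            printf("named template: %s\n", gguf_get_arr_str(ctx, kid, i));
        }
    }

    // the default template stays in the pre-existing key
    const int def = gguf_find_key(ctx, "tokenizer.chat_template");
    if (def >= 0) {
        printf("default template:\n%s\n", gguf_get_val_str(ctx, def));
    }

    gguf_free(ctx);
    return 0;
}
```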
Georgi Gerganov
fa9e8c6689
Merge branch 'master' into gg/flash-attn
2024-04-18 14:39:23 +03:00
Ren Xuancheng
e11b2e6e1e
Qwen2 : assume tied weights if lm_head/output weights are missing (#6738)
2024-04-18 14:38:04 +03:00
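(A hedged sketch of the tied-weights idea from the commit above, using a toy tensor table rather than llama.cpp's loader API; only the tensor names follow GGUF conventions:)

```cpp
#include <map>
#include <string>

// Toy stand-in for a checkpoint's tensor table; the container and helper
// are illustrative, not llama.cpp's actual loader code.
using TensorTable = std::map<std::string, float *>;

// Tied weights: if the lm_head/output projection is missing from the
// checkpoint, reuse the token embedding matrix in its place.
float * get_output_weights(TensorTable & tensors) {
    const auto it = tensors.find("output.weight");
    if (it != tensors.end()) {
        return it->second;
    }
    return tensors.at("token_embd.weight"); // shared embedding matrix
}
```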
Georgi Gerganov
105332cc17
metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
2024-04-18 14:33:07 +03:00
Georgi Gerganov
260cdb2d08
llama-bench : add -fa,--flash-attn arg
2024-04-18 14:28:19 +03:00
Johannes Gäßler
87968de9a9
fix KQ FP32 precision for parallel_blocks > 1
2024-04-18 13:15:32 +02:00
Johannes Gäßler
2f538b9547
Add __hgt2_mask implementation for CUDA 11
2024-04-18 13:15:32 +02:00
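(`__hgt2_mask` ships only with newer CUDA toolkits. A minimal sketch of one possible CUDA 11 fallback, not necessarily the commit's implementation: derive a per-16-bit-lane all-ones mask from two scalar half comparisons.)

```cpp
#include <cuda_fp16.h>

// For each 16-bit lane of the half2 pair, produce 0xFFFF if a > b in that
// lane and 0x0000 otherwise, matching the semantics of __hgt2_mask.
static __device__ __forceinline__ unsigned int hgt2_mask_compat(const half2 a, const half2 b) {
    const unsigned int lo = __hgt(__low2half(a),  __low2half(b))  ? 0x0000FFFFu : 0u;
    const unsigned int hi = __hgt(__high2half(a), __high2half(b)) ? 0xFFFF0000u : 0u;
    return lo | hi;
}
```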
Johannes Gäßler
0bc67dd1c8
Calculate KQ as FP32 if KQV has GGML_PREC_F32
2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0
store temp KQ in registers
2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3
flush softmax exp below threshold to 0
2024-04-18 13:15:32 +02:00
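(Schematically, the flush-to-zero in the commit above: with $m = \max_j x_j$ and an illustrative threshold $\tau$, the commit's actual constant not being shown here, exponentials whose argument is too small to affect the FP16 result are forced to exactly zero:)

$$e_i = \begin{cases} \exp(x_i - m) & \text{if } x_i - m \ge \tau \\ 0 & \text{otherwise} \end{cases} \qquad p_i = \frac{e_i}{\sum_j e_j}$$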
Johannes Gäßler
6a3b84236d
fix flash_attn_vec_f16 race condition
2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39
CUDA: refactor host code, dyn. par. blocks
2024-04-18 13:15:32 +02:00
slaren
c71bfd736e
llama : fix compatibility with old 2-expert models (#6735)
2024-04-18 10:04:47 +03:00
Pierrick HYMBERT
5668c79ea0
server: bench: enable flash_attn param
2024-04-17 23:26:29 +02:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed (#6716)
* build : sgemm.o only when needed
ggml-ci
* llamafile : tmp disable due to MoE bug
ggml-ci
2024-04-17 23:58:26 +03:00
Yaroslav
8dd1ec8b3f
readme : add UI (#6724)
* Update README.md
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick HYMBERT
405385726e
server: support flash_attn param
2024-04-17 14:05:02 +02:00
Georgi Gerganov
599ce84a71
llama : flash_attn cparam + fix defrag
2024-04-17 12:01:39 +03:00
Georgi Gerganov
2c41180e88
Merge branch 'master' into gg/flash-attn
2024-04-17 10:13:09 +03:00
Zheng.Deng
facb8b56f8
convert : fix autoawq gemma (#6704)
* fix autoawq quantized gemma model convert error
Quantizing a gemma model with autoawq produces an lm_head.weight tensor in model-00001-of-00002.safetensors, which convert-hf-to-gguf.py cannot map; skipping this tensor avoids the error.
* change code to full string match and print necessary message
Change the code to use a full string match and print a short message informing users that lm_head.weight has been skipped.
---------
Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>
2024-04-16 23:51:07 +03:00
Georgi Gerganov
532c1737a1
llama : make general.name optional (#6709)
2024-04-16 23:50:38 +03:00