Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext
2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help
2024-04-22 12:50:10 +03:00
Olivier Chafik
5cf5e7d490
build : generate hex dump of server assets during build (#6661)
...
* `build`: generate hex dumps of server assets on the fly
* build: workaround lack of -n on gnu xxd
* build: don't use xxd in cmake
* build: don't call xxd from build.zig
* build: more idiomatic hexing
* build: don't use xxd in Makefile (od hackery instead)
* build: avoid exceeding max cmd line limit in makefile hex dump
* build: hex dump assets at cmake build time (not config time)
2024-04-21 18:48:53 +01:00
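The build change above embeds the server's web assets as byte arrays generated at cmake build time, instead of relying on xxd. A minimal sketch of the kind of generated header this style of dump produces; the symbol names here are illustrative, not the actual generated identifiers:

```cpp
// Illustrative xxd -i style dump of a server asset such as index.html.
// The real build emits one such header per asset at build (not config) time.
unsigned char index_html[] = { 0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e /* ... */ };
unsigned int  index_html_len = sizeof(index_html);

// The server can then serve the embedded bytes without reading the filesystem:
// res.set_content(reinterpret_cast<const char *>(index_html), index_html_len, "text/html");
```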
Georgi Gerganov
40f74e4d73
llama : add option to render special/control tokens ( #6807 )
...
* make : fix common dep on llama.h
* llama : add option to render special tokens
* readme : add API change notice
ggml-ci
* swift : fix build
2024-04-21 18:36:45 +03:00
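PR #6807 above lets callers choose whether special/control tokens are rendered as text. A hedged sketch, assuming the option is the trailing boolean parameter on llama_token_to_piece described in the API change notice:

```cpp
#include <string>
#include "llama.h"

// Render a token to text, optionally keeping special/control tokens
// (e.g. <|eot_id|>) visible instead of dropping them.
static std::string token_to_text(const llama_model * model, llama_token token, bool render_special) {
    char buf[128];
    const int32_t n = llama_token_to_piece(model, token, buf, sizeof(buf), render_special);
    return n >= 0 ? std::string(buf, n) : std::string();
}
```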
Georgi Gerganov
b9cc76d87e
ggml : fix ggml_backend_cpu_supports_op() for CPY ( #0 )
2024-04-21 16:48:50 +03:00
Wouter
7dbdba5690
llama : add llama-3 chat template ( #6751 )
...
* Added llama-3 chat template
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Added EOS stop sequence according to https://github.com/ggerganov/llama.cpp/pull/6751#issuecomment-2065602862
* Removed adding of BOS token before first message
* Removed bos token from expected output from llama-3
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Added <|end_of_text|> as another stop token
* Reverted the previous change that added the end_of_text stop word for llama 3
---------
Co-authored-by: Wouter Tichelaar <tichelaarw@spar.net>
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 16:03:39 +03:00
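PR #6751 above teaches chat template handling about the Llama 3 conversation format. A minimal sketch of applying a model's built-in template with llama_chat_apply_template; passing tmpl = nullptr is assumed to pick up tokenizer.chat_template from the GGUF, which is where a converted Llama 3 model carries the template this entry adds support for:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Format a short conversation with the model's built-in chat template and
// append the assistant prefix so generation can start immediately.
static std::string format_chat(const llama_model * model) {
    std::vector<llama_chat_message> msgs = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };
    std::vector<char> buf(4096);
    const int32_t n = llama_chat_apply_template(model, /*tmpl =*/ nullptr,
                                                msgs.data(), msgs.size(),
                                                /*add_ass =*/ true,
                                                buf.data(), (int32_t) buf.size());
    if (n < 0 || n > (int32_t) buf.size()) {
        return {};
    }
    return std::string(buf.data(), n);
}
```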
pmysl
c1386c936e
gguf-py : add IQ1_M to GGML_QUANT_SIZES ( #6761 )
2024-04-21 15:49:30 +03:00
Jan Boon
e8d35f47cb
doc : add link to falcon ( #6789 )
2024-04-21 15:35:40 +03:00
Mohammadreza Hendiani
2cca09d509
readme : add Fedora instructions ( #6783 )
...
* added fedora to the list of distros that may need the package (the packages have the same name on Fedora)
* how to add clblast, which is available in the fedora repos
2024-04-21 15:32:05 +03:00
Justine Tunney
89b0bf0d5d
llava : use logger in llava-cli ( #6797 )
...
This change removes printf() logging so llava-cli is shell scriptable.
2024-04-21 15:19:04 +03:00
Pedro Cuenca
b97bc3966e
llama : support Llama 3 HF conversion ( #6745 )
...
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
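Besides the HF conversion, #6745 above introduces llama_token_is_eog(), generalizing the end-of-sequence check so generation also stops on EOT-style tokens such as Llama 3's <|eot_id|>. A hedged sketch of how a sampling loop would use it:

```cpp
#include "llama.h"

// Keep generating until the sampled token is any end-of-generation token,
// not only the classic EOS id.
static bool keep_generating(const llama_model * model, llama_token sampled) {
    return !llama_token_is_eog(model, sampled);
}
```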
Jan Boon
b8109bc013
doc : server tests require llama to be built with curl enabled ( #6788 )
2024-04-20 18:29:50 +02:00
Georgi Gerganov
aed82f6837
common : try to fix Android CI ( #6780 )
...
* common : disable get_math_cpu_count() until Android CI gets fixed
* common : another try
2024-04-20 13:27:12 +03:00
loonerin
0e4802b2ec
ci: add ubuntu latest release and fix missing build number (mac & ubuntu) ( #6748 )
2024-04-19 19:03:35 +02:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
...
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
...
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
...
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn
2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor
2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code
2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up
2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows
2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade ( #6765 )
2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes
2024-04-19 14:04:56 +03:00
Georgi Gerganov
1db66c1dac
Merge branch 'master' into gg/flash-attn
2024-04-19 14:03:55 +03:00
Georgi Gerganov
74d57f9513
llama : simplify llama_build_kv_store
...
ggml-ci
2024-04-19 13:49:57 +03:00
nopperl
9958c81b79
Implement the OLMo architecture ( #6741 )
...
* implement olmo architecture
* remove unused variable
* remove unused moe branch
* remove check for weight
* remove superfluous moe, bias and rope tensors
* clarified comment
* fix clamp_kqv setting
* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Austin
8b1b1f4982
train : add general name ( #6752 )
...
* llama : make general.name optional
* train: Add 'general.name' to model metadata
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
Neo Zhang
bca40e9814
fix wrong parameter in cmd in readme-sycl.md ( #6755 )
...
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
2024-04-19 09:16:31 +08:00
Georgi Gerganov
9ca869876e
batched-bench : add fattn arg
2024-04-18 21:41:32 +03:00
Georgi Gerganov
c16a7c2688
metal : use F32 attention accumulators
2024-04-18 21:20:30 +03:00
slaren
0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id ( #6505 )
...
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
Sigbjørn Skjæret
03c0946d73
convert : support models with multiple chat templates ( #6588 )
...
* Support converting models with multiple chat templates
Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>
Where `tokenizer.chat_templates` is an array of the template names (except `default`); `default` itself is stored in the regular `tokenizer.chat_template`.
* replace filtered characters with underscore
* New script to add/modify/remove metadata
This script creates a copy of a GGUF file and lets you add, modify, or remove metadata in the process.
Most importantly, this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.
* Add files via upload
add new script to project/readme
* flake--
2024-04-18 14:49:01 +03:00
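PR #6588 above stores extra templates under the metadata keys it lists. A minimal sketch of reading one back through the public metadata API (llama_model_meta_val_str); the "tool_use" name in the usage note is only an illustration, not a key guaranteed to exist:

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include "llama.h"

// Fetch a chat template by name: an empty name reads the default
// tokenizer.chat_template, anything else reads tokenizer.chat_template.<name>.
static std::string get_chat_template(const llama_model * model, const std::string & name) {
    const std::string key = name.empty()
        ? std::string("tokenizer.chat_template")
        : "tokenizer.chat_template." + name;
    std::vector<char> buf(8192);
    const int32_t n = llama_model_meta_val_str(model, key.c_str(), buf.data(), buf.size());
    if (n < 0) {
        return {};
    }
    return std::string(buf.data(), std::min((size_t) n, buf.size() - 1));
}

// usage: get_chat_template(model, "");          // default template
//        get_chat_template(model, "tool_use");  // named template (illustrative)
```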
Georgi Gerganov
fa9e8c6689
Merge branch 'master' into gg/flash-attn
2024-04-18 14:39:23 +03:00
Ren Xuancheng
e11b2e6e1e
Qwen2 : assume tied weights if lm_head/output weights are missing ( #6738 )
2024-04-18 14:38:04 +03:00
Georgi Gerganov
105332cc17
metal : add BS=1 kernel for flash attention ( #6508 )
...
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
2024-04-18 14:33:07 +03:00
Georgi Gerganov
260cdb2d08
llama-bench : add -fa,--flash-attn arg
2024-04-18 14:28:19 +03:00
Johannes Gäßler
87968de9a9
fix KQ FP32 precision for parallel_blocks > 1
2024-04-18 13:15:32 +02:00
Johannes Gäßler
2f538b9547
Add __hgt2_mask implementation for CUDA 11
2024-04-18 13:15:32 +02:00
Johannes Gäßler
0bc67dd1c8
Calculate KQ as FP32 if KQV has GGML_PREC_F32
2024-04-18 13:15:32 +02:00
Johannes Gäßler
a5b0e2dea0
store temp KQ in registers
2024-04-18 13:15:32 +02:00
Johannes Gäßler
ef9e1593f3
flush softmax exp below threshold to 0
2024-04-18 13:15:32 +02:00
Johannes Gäßler
6a3b84236d
fix flash_attn_vec_f16 race condition
2024-04-18 13:15:32 +02:00
Johannes Gäßler
34f93bbb39
CUDA: refactor host code, dyn. par. blocks
2024-04-18 13:15:32 +02:00
slaren
c71bfd736e
llama : fix compatibility with old 2 expert models ( #6735 )
2024-04-18 10:04:47 +03:00
Pierrick HYMBERT
5668c79ea0
server: bench: enable flash_attn param
2024-04-17 23:26:29 +02:00
Georgi Gerganov
3b8f1ec4b1
llamafile : tmp disable + build sgemm.o when needed ( #6716 )
...
* build : sgemm.o only when needed
ggml-ci
* llamafile : tmp disable due to MoE bug
ggml-ci
2024-04-17 23:58:26 +03:00
Yaroslav
8dd1ec8b3f
readme : add UI ( #6724 )
...
* Update README.md
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-17 15:47:50 +03:00
Pierrick HYMBERT
405385726e
server: support flash_attn param
2024-04-17 14:05:02 +02:00
Georgi Gerganov
599ce84a71
llama : flash_attn cparam + fix defrag
2024-04-17 12:01:39 +03:00
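The entries above wire flash attention through as a context parameter and expose it on the tools as -fa / --flash-attn. A hedged sketch of enabling it programmatically, assuming the cparam is a boolean flash_attn field as described here:

```cpp
#include "llama.h"

// Create a context with the flash attention code path enabled.
static llama_context * make_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn = true; // the -fa / --flash-attn switch maps to this field
    return llama_new_context_with_model(model, cparams);
}
```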