Georgi Gerganov
ce281b904c
llama : disable FA for AMD
2024-04-24 17:54:32 +03:00
Georgi Gerganov
8937ec5307
Merge branch 'master' into gg/flash-attn
...
ggml-ci
2024-04-24 14:00:32 +03:00
mgroeber9110
3fe847b574
server : do not apply Markdown formatting in code sections ( #6850 )
2024-04-24 13:54:24 +03:00
Kyle Mistele
37246b1031
common : revert showing control tokens by default for server ( #6860 )
...
* fix: revert showing control tokens by default
* feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens
* feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses (see the sketch after this entry)
* common : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 13:15:29 +03:00
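The idea in #6860 is that token-to-text conversion takes a flag controlling whether control tokens are rendered, and the server passes false for chat completion responses. A minimal self-contained C++ sketch of that pattern (toy vocabulary and names, not the real common/common.cpp signatures):

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

using llama_token = int32_t;

// Toy vocabulary: token 2 is a control token (<|im_end|>).
static const std::unordered_set<llama_token> control_tokens = {2};

static std::string piece_of(llama_token t) {
    switch (t) {
        case 0: return "Hello";
        case 1: return " world";
        case 2: return "<|im_end|>";
        default: return "";
    }
}

// The pattern from #6860: a "special" flag decides whether control
// tokens are rendered as text or suppressed.
std::string token_to_piece(llama_token t, bool special) {
    if (!special && control_tokens.count(t)) {
        return ""; // hide control tokens, e.g. in chat completion responses
    }
    return piece_of(t);
}
```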
Johannes Gäßler
28103f4832
Server: fix seed for multiple slots ( #6835 )
...
* Server: add tests for consistent results
* sampling: separate rng per sampling context
2024-04-24 11:08:36 +02:00
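The fix in #6835 is that each sampling context owns its own RNG instead of sharing a global one, so parallel server slots with fixed seeds produce reproducible, mutually independent streams. A hedged C++ sketch of the pattern (names and the sampling stand-in are illustrative):

```cpp
#include <cstdint>
#include <random>

// Each slot's sampling context carries its own generator, so one slot's
// draws cannot perturb another slot's sequence.
struct sampling_context {
    std::mt19937 rng;
    explicit sampling_context(uint32_t seed) : rng(seed) {}

    // Illustrative: sample an index from [0, n) — a stand-in for
    // sampling from a token probability distribution.
    int sample(int n) {
        std::uniform_int_distribution<int> dist(0, n - 1);
        return dist(rng);
    }
};

int main() {
    // Two slots seeded identically yield identical, independent results.
    sampling_context slot0(42);
    sampling_context slot1(42);
    return slot0.sample(32000) == slot1.sample(32000) ? 0 : 1;
}
```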
Georgi Gerganov
c0d1b3e03e
ggml : move 32-bit arm compat into ggml-impl.h ( #6865 )
...
ggml-ci
2024-04-24 12:00:07 +03:00
Tristan Druyen
abd3314064
llama : add phi 3 chat template ( #6857 )
...
* Add phi 3 chat template & tests
* test : fix chat template result
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 11:52:37 +03:00
Junyang Lin
3fec68be4e
convert : add support for codeqwen due to its tokenizer ( #6707 )
...
* add support for codeqwen due to its tokenizer
* override load_hparams
* fix typo
* fix load_params
* convert : fix whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 10:16:21 +03:00
liuwei-git
c8297c6af5
llama : add phi3 support ( #6852 )
...
* add explicit phi3 support
* add explicit phi3 support
* remove unused code
* convert : add BOS token
* llama : match EOT token <|end|>
* llama : minor / style
* llama : tabs -> spaces
* convert : fix lint checks
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 10:00:37 +03:00
Georgi Gerganov
751591d520
server : add help for --flash-attn arg
2024-04-23 18:16:25 +03:00
Georgi Gerganov
d228bf8552
cont
2024-04-23 17:32:11 +03:00
Georgi Gerganov
56657e52e5
llama : fix n_batch requirements
...
ggml-ci
2024-04-23 17:30:37 +03:00
Georgi Gerganov
19e8982f51
llama : prep ALiBi support for BERT models
...
ggml-ci
2024-04-23 17:24:28 +03:00
Georgi Gerganov
78d363b0d4
llama : replace bool need_kq_pos with use_alibi
2024-04-23 17:15:13 +03:00
Georgi Gerganov
3864eea4cb
ggml : add TODOs for F16/F32 mask/pos support in other backends
2024-04-23 10:06:56 +03:00
Georgi Gerganov
c129369702
cuda : try to fix __hgt2_mask
...
ggml-ci
2024-04-23 09:18:55 +03:00
Anas Ahouzi
4e96a812b3
[SYCL] default Windows build instructions without the -DLLAMA_SYCL_F16 flag activated ( #6767 )
...
* Fix FP32/FP16 build instructions
* Fix typo
* Recommended build instruction
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Recommended build instruction
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Recommended build instruction
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Add comments in Intel GPU linux
---------
Co-authored-by: Anas Ahouzi <112881240+aahouzi-intel@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2024-04-23 08:53:18 +08:00
Justine Tunney
192090bae4
llamafile : improve sgemm.cpp ( #6796 )
...
* llamafile : improve sgemm.cpp
- Re-enable by default
- Fix issue described in #6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases
* Address review comments
* Help clang produce fma instructions
* Address review comments
2024-04-22 22:00:36 +03:00
Georgi Gerganov
c70bfd7bcb
cuda : "constexpr dim3" -> "const dim3"
...
ggml-ci
2024-04-22 20:31:23 +03:00
Georgi Gerganov
5408d55506
cuda : uint -> uint32_t
2024-04-22 19:12:06 +03:00
Dave Airlie
e931888d50
ggml : fix calloc argument ordering. ( #6820 )
...
Latest gcc complains here:
/home/airlied/devel/llama.cpp/ggml-alloc.c: In function ‘ggml_gallocr_new_n’:
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
374 | ggml_gallocr_t galloc = (ggml_gallocr_t)calloc(sizeof(struct ggml_gallocr), 1);
| ^~~~~~
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: note: earlier argument should specify number of elements, later size of each element
and a bunch more.
calloc is specified to take nmemb first, then size, so realign the code.
In a couple of places the call was written as `calloc(a * x, 1)`, so I fixed those to use calloc properly (see the sketch below).
2024-04-22 16:05:06 +02:00
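For reference, calloc's contract is number of elements first, element size second. A small self-contained C++ illustration of the before/after shape from #6820 (the struct is a stand-in for the real ggml type):

```cpp
#include <cstdlib>

struct ggml_gallocr_stub { int dummy; }; // stand-in for struct ggml_gallocr

int main() {
    // Before: arguments transposed — gcc 14 warns with
    // -Wcalloc-transposed-args, as quoted in the commit message:
    // void * bad = std::calloc(sizeof(ggml_gallocr_stub), 1);

    // After: nmemb first, then the size of each element.
    ggml_gallocr_stub * good =
        (ggml_gallocr_stub *) std::calloc(1, sizeof(ggml_gallocr_stub));

    std::free(good);
    return 0;
}
```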
Georgi Gerganov
8960fe86ae
llama : fix typo in <|im_end|> token text ( #6745 )
2024-04-22 15:41:11 +03:00
Georgi Gerganov
f725ca90fb
ggml : ggml_soft_max support F16/F32 mask/pos
...
ggml-ci
2024-04-22 14:53:11 +03:00
Pierrick Hymbert
c0956b09ba
ci: fix jobs cancelling each other ( #6781 )
2024-04-22 13:22:54 +02:00
github-actions[bot]
e9b4a1bf68
flake.lock: Update
...
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw=' (2024-04-10)
→ 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv+l0FhvfDMiyCmhoRbNB+0SeInZkbk=' (2024-04-19)
2024-04-22 10:42:43 +00:00
Georgi Gerganov
c11d05fec0
llama : force disable flash attention for incompatible models
2024-04-22 12:50:41 +03:00
Georgi Gerganov
cb76d747d1
ggml : fix num dimensions in ggml_flash_attn_ext
2024-04-22 12:50:26 +03:00
Georgi Gerganov
a39217d428
common : print --flash-attn in help
2024-04-22 12:50:10 +03:00
Olivier Chafik
5cf5e7d490
build : generate hex dump of server assets during build ( #6661 )
...
* `build`: generate hex dumps of server assets on the fly
* build: workaround lack of -n on gnu xxd
* build: don't use xxd in cmake
* build: don't call xxd from build.zig
* build: more idiomatic hexing
* build: don't use xxd in Makefile (od hackery instead)
* build: avoid exceeding max cmd line limit in makefile hex dump
* build: hex dump assets at cmake build time (not config time)
2024-04-21 18:48:53 +01:00
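The artifact generated in #6661 is in the style of `xxd -i` output: the asset's bytes as a C array plus a length symbol, which the server then compiles in and serves. A hedged sketch of what such a generated header looks like (filename and contents here are illustrative):

```cpp
// index.html.hpp — generated at build time; do not edit by hand.
// The array holds the asset byte-for-byte ("<html>\n" is made up here).
unsigned char index_html[] = {
    0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a // "<html>\n"
};
unsigned int index_html_len = sizeof(index_html);
```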
Georgi Gerganov
40f74e4d73
llama : add option to render special/control tokens ( #6807 )
...
* make : fix common dep on llama.h
* llama : add option to render special tokens
* readme : add API change notice
ggml-ci
* swift : fix build
2024-04-21 18:36:45 +03:00
Georgi Gerganov
b9cc76d87e
ggml : fix ggml_backend_cpu_supports_op() for CPY ( #0 )
2024-04-21 16:48:50 +03:00
Wouter
7dbdba5690
llama : add llama-3 chat template ( #6751 )
...
* Added llama-3 chat template
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Added EOS stop sequence according to https://github.com/ggerganov/llama.cpp/pull/6751#issuecomment-2065602862
* Removed adding of BOS token before first message
* Removed bos token from expected output from llama-3
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Added <|end_of_text|> as another stop token
* Reverted last change of adding the end_of_text stop word for llama 3
---------
Co-authored-by: Wouter Tichelaar <tichelaarw@spar.net>
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 16:03:39 +03:00
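The Llama 3 instruct format wraps each message in header tokens and ends it with <|eot_id|>, which per the commit serves as the stop sequence; the template itself does not prepend a BOS token before the first message. A hedged C++ sketch of formatting a conversation in that style (the token layout follows the published Llama 3 format; the helper name is made up):

```cpp
#include <iostream>
#include <string>

// Format one turn in the Llama 3 instruct style.
// <|eot_id|> terminates the turn and acts as a stop token.
std::string llama3_turn(const std::string & role, const std::string & content) {
    return "<|start_header_id|>" + role + "<|end_header_id|>\n\n" + content + "<|eot_id|>";
}

int main() {
    std::string prompt =
        llama3_turn("system", "You are a helpful assistant.") +
        llama3_turn("user",   "Hello!") +
        // Leave the assistant header open so the model generates the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n";
    std::cout << prompt << std::endl;
}
```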
pmysl
c1386c936e
gguf-py : add IQ1_M to GGML_QUANT_SIZES ( #6761 )
2024-04-21 15:49:30 +03:00
Jan Boon
e8d35f47cb
doc : add link to falcon ( #6789 )
2024-04-21 15:35:40 +03:00
Mohammadreza Hendiani
2cca09d509
readme : add Fedora instructions ( #6783 )
...
* added Fedora to the list of distros that may need the package (the packages have the same name on Fedora)
* documented how to add CLBlast, which is available in the Fedora repos
2024-04-21 15:32:05 +03:00
Justine Tunney
89b0bf0d5d
llava : use logger in llava-cli ( #6797 )
...
This change removes printf() logging so llava-cli is shell scriptable.
2024-04-21 15:19:04 +03:00
Pedro Cuenca
b97bc3966e
llama : support Llama 3 HF conversion ( #6745 )
...
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
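#6745 introduces llama_token_is_eog() so generation loops can stop on any end-of-generation token (for Llama 3, either <|end_of_text|> or <|eot_id|>) instead of comparing against a single EOS id. A self-contained sketch of the pattern (toy types and token ids shown for illustration, not the real llama.h API):

```cpp
#include <cstdint>
#include <unordered_set>

using llama_token = int32_t;

// Toy model: just the set of end-of-generation tokens.
struct toy_model {
    std::unordered_set<llama_token> eog_tokens;
};

// The pattern from #6745: one predicate for "is this end-of-generation?"
// replaces `token == eos` checks scattered through the code.
bool token_is_eog(const toy_model & model, llama_token token) {
    return model.eog_tokens.count(token) != 0;
}

int main() {
    toy_model model{{128001 /* <|end_of_text|> */, 128009 /* <|eot_id|> */}};
    llama_token sampled = 128009;
    // Generation loop condition: stop on any EOG token.
    return token_is_eog(model, sampled) ? 0 : 1;
}
```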
Jan Boon
b8109bc013
doc : server tests require llama to be built with curl enabled ( #6788 )
2024-04-20 18:29:50 +02:00
Georgi Gerganov
aed82f6837
common : try to fix Android CI ( #6780 )
...
* common : disable get_math_cpu_count() until Android CI gets fixed
* common : another try
2024-04-20 13:27:12 +03:00
loonerin
0e4802b2ec
ci: add ubuntu latest release and fix missing build number (mac & ubuntu) ( #6748 )
2024-04-19 19:03:35 +02:00
Georgi Gerganov
871fcb6e10
ggml : fix soft_max with bias on CPU
...
ggml-ci
2024-04-19 18:03:56 +03:00
Georgi Gerganov
3badef1fe1
ggml : fix avx512 const correctness
...
ggml-ci
2024-04-19 17:45:08 +03:00
Georgi Gerganov
52945429eb
tests : remove benchmarks
...
ggml-ci
2024-04-19 17:38:28 +03:00
Georgi Gerganov
29f6ad8d95
Merge branch 'master' into gg/flash-attn
2024-04-19 17:30:09 +03:00
Georgi Gerganov
bc346166f9
metal : minor
2024-04-19 17:24:52 +03:00
Georgi Gerganov
1a88565b44
metal : clean-up kernel code
2024-04-19 15:52:49 +03:00
Georgi Gerganov
97eaece7d6
metal : clean-up
2024-04-19 15:30:27 +03:00
Georgi Gerganov
703c6e6528
ggml : fix arm fp16 store on windows
2024-04-19 14:20:41 +03:00
Pierrick Hymbert
637e9a86c2
server: static: upstream upgrade ( #6765 )
2024-04-19 13:19:01 +02:00
Georgi Gerganov
e32b281743
llama : adapt build_olmo to changes
2024-04-19 14:04:56 +03:00