Commit graph

2917 commits

Author SHA1 Message Date
HanishKVC
bdd279c0c9 ChatOn:User Begin+Prefix note update, keep things simple consistent 2024-05-06 11:27:56 +05:30
HanishKVC
84367b9fd1 ChatON: Add template for DeepSeek
Was looking at the tokenized vector, and noticed that the EOS
mentioned by existing chat_apply_template of llama.cpp, is different
from what I noticed in tokenizer_config.json of deepseek llm, so
I have added two entries

* "deepseek-alt" which matches llama.cpp's chat_apply_template and
* "deepseek" which matches that in tokenizer_config.json.

This impacts the assistant suffix and reverse prompt entries.

Because of this: need to look into the other entries which I added
previously at a later time. However, as the default logic should be
picking the EOS from the model file, I assume the reverse-prompt being
out of sync may not matter beyond a limit, potentially.
2024-05-06 11:27:56 +05:30
HanishKVC
f4b54069f6 ChatON: Add template for Gemma 2024-05-06 11:27:56 +05:30
HanishKVC
2a8028fba8 ChatON: Add Zephyr template to meta-json file 2024-05-06 11:27:56 +05:30
HanishKVC
57bd772bfd ChatON: Cleanup logging
Avoid showing on screen the debug messages.

meta-dump can either show on screen or not, based on how LOGXLN
is defined.
2024-05-06 11:27:56 +05:30
HanishKVC
217544e5ff ChatON: Keep compiler happy
Order the functions so that there is no need for separate prototypes

Also use kv_bool wrt boolean entries.

Convert string to c char *
2024-05-06 11:27:56 +05:30
HanishKVC
3f9dfc240c ChatON: Check for the boolean entries in meta-json 2024-05-06 11:27:56 +05:30
HanishKVC
42f6b45547 ChatON: Use the constants defined for the keys 2024-05-06 11:27:56 +05:30
HanishKVC
efb758ba7d ChatON: Rename helpers to kv suffix, updated wrt metaok
rename because they return value of specified key.

[main] update metaok to take template-id, so that one can cross
check that all needed entries are there wrt that template-id in
the chaton-meta-json file
2024-05-06 11:27:56 +05:30
HanishKVC
e8c24c0767 ChatOn:MetaOk: Allows template-id based cross check
For a given template-id, cross-check that all needed entries are
present in the JSON.
2024-05-06 11:27:56 +05:30
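The template-id cross-check described in e8c24c0767 can be illustrated roughly as follows. This is only a sketch in Python, not the ChatON code itself (which lives in the C++ sources): the function names `meta_ok` and `missing_keys` and the key names in `REQUIRED_KEYS` are hypothetical, loosely borrowed from terms used elsewhere in this log, since the log does not list the real chaton-meta-json keys.

```python
# Hypothetical key names; the real chaton-meta-json keys are not listed in this log.
REQUIRED_KEYS = ("global-begin", "global-end", "reverse-prompt")


def missing_keys(meta: dict, template_id: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent for template_id.

    If the template-id itself is absent (or is not an object), every
    required key counts as missing.
    """
    entry = meta.get(template_id)
    if not isinstance(entry, dict):
        return list(required)
    return [key for key in required if key not in entry]


def meta_ok(meta: dict, template_id: str, required=REQUIRED_KEYS) -> bool:
    """True only when template_id exists and carries every required key."""
    return not missing_keys(meta, template_id, required)


# Illustrative data only; not taken from the real sample meta json file.
sample_meta = {
    "llama2": {"global-begin": "", "global-end": "", "reverse-prompt": "</s>"},
    "gemma": {"global-begin": ""},
}
```

Returning the list of missing keys, rather than only a boolean, makes it easier to report which entries a given template-id still lacks.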
HanishKVC
b1055641e9 ChatON: Update the notes a bit 2024-05-06 11:27:56 +05:30
HanishKVC
11b47fbcfc ChatON:MetaJson: Add key constants, check metaJson loaded ifNeeded 2024-05-06 11:27:56 +05:30
HanishKVC
221ccd6462 ChatOn: Add SystemUser-1st-User-Has-Prefix flag support
Llama2 seems to need it, so chaton-meta-json sample file updated
to use same.
2024-05-06 11:27:56 +05:30
HanishKVC
f03dd2439f ChatOn:No global-begin/end in ChatApplyTmplSingle, ChatApplyTmpl
Avoid adding global begin/end markers wrt ChatApplyTmplSingle.

Add ChatApplyTmpl which goes through a vector of messages.
2024-05-06 11:27:56 +05:30
HanishKVC
c4cf0e9075 ChatON:Cleanup: BeginEnd, Debug log
Update the note

Rename global-prefix|suffix to global-begin|end.

Rename chat-apply-template to chat-apply-template-single, because it
handles only a single message.

Add some debug log messages to the helper functions
2024-05-06 11:27:56 +05:30
HanishKVC
d87d27512e ChatOn: update sample meta json a bit
Move [inst] [/inst] wrt llama2 from global to individual role
specific parts.

Avoid an extra \n wrt prefixes of llama3
2024-05-06 11:27:55 +05:30
HanishKVC
cdbe4f06ce Chaton:Sample Meta JSON cleanup 2024-05-06 11:27:55 +05:30
HanishKVC
050d329e7e ChatOn+Main: Initial go at chaton in main interactive flow 2024-05-06 11:27:55 +05:30
HanishKVC
1374a64200 Chaton:Meta: Add chatml meta data to sample meta json file 2024-05-06 11:27:55 +05:30
HanishKVC
093abc29a2 ChatOn: Update sample meta json to be a valid json 2024-05-06 11:27:55 +05:30
HanishKVC
dc56be951d ChatOn:Main: Load and dump any specified chaton meta file 2024-05-06 11:27:55 +05:30
HanishKVC
35f25196a0 ChatOn:Common: Add the needed cmdline arg params and its parsing 2024-05-06 11:27:55 +05:30
HanishKVC
2146a253e8 ChatOn: Capture the idea 2024-05-06 11:27:55 +05:30
kunnis
628b299106
Adding support for the --numa argument for llama-bench. (#7080) 2024-05-05 14:17:47 +02:00
Sigbjørn Skjæret
8f8acc8683
Disable benchmark on forked repo (#7034)
* Disable benchmark on forked repo

* only check owner on schedule event

* check owner on push also

* more readable as multi-line

* ternary won't work

* style++

* test++

* enable actions debug

* test--

* remove debug

* test++

* do debug where we can get logs

* test--

* this is driving me crazy

* correct github.event usage

* remove test condition

* correct github.event usage

* test++

* test--

* event_name is pull_request_target

* test++

* test--

* update ref checks
2024-05-05 13:38:55 +02:00
Lyle Dean
ca36326020
readme : add note that LLaMA 3 is not supported with convert.py (#7065) 2024-05-05 08:21:46 +03:00
DAN™
889bdd7686
command-r : add BPE pre-tokenization (#7063)
* Add BPE pre-tokenization for Command-R/R+.

* Bump transformers convert requirement.

* command-r : add individual digits regex

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-05 08:19:30 +03:00
Brian
6fbd432211
py : logging and flake8 suppression refactoring (#7081)
Set one as executable and add basicConfig()
to another. Also added noqa tag to test scripts.
2024-05-05 08:07:48 +03:00
Xuan Son Nguyen
842500144e
gguf-split: add --no-tensor-first-split (#7072) 2024-05-04 18:56:22 +02:00
Jeximo
cf768b7e71
Tidy Android Instructions README.md (#7016)
* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <slarengh@gmail.com>

* correct grammar

Co-authored-by: slaren <slarengh@gmail.com>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-05-04 18:10:15 +02:00
viric
fcd84a0f5a
Fix Linux /sys cpu path to guess number of cores (#7064) 2024-05-04 15:26:53 +02:00
maor-ps
03fb8a002d
If first token generated from the server is the stop word the server will crash (#7038)
This will reproduce the issue in llama13b
{
 'prompt': 'Q: hello world \nA: ',
 'stop': ['\n'],
 'temperature': 0.0,
 'n_predict': 10,
 'cache_prompt': True,
 'n_probs': 10
}
2024-05-04 11:06:40 +02:00
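The reproduction fields from 03fb8a002d are written as a Python dict; to send them to a server they would need to be serialized as JSON. A minimal sketch follows, assuming the fields go into an HTTP request body. The commit message does not name the endpoint, so this only builds the body and sends nothing; the helper name `build_request_body` is hypothetical.

```python
import json

# Request fields copied from the commit message above.
payload = {
    "prompt": "Q: hello world \nA: ",
    "stop": ["\n"],
    "temperature": 0.0,
    "n_predict": 10,
    "cache_prompt": True,
    "n_probs": 10,
}


def build_request_body(fields: dict) -> str:
    """Serialize the request fields to a JSON string for use as a POST body."""
    return json.dumps(fields)


body = build_request_body(payload)
```

Serializing with `json.dumps` turns the Python literal `True` into JSON `true` and escapes the newline characters, which the single-quoted dict in the commit message would not do if pasted as-is.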
Georgi Gerganov
92139b90af
tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)
* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
2024-05-04 08:32:32 +03:00
Brian
a2ac89d6ef
convert.py : add python logging instead of print() (#6511)
* convert.py: add python logging instead of print()

* convert.py: verbose flag takes priority over dump flag log suppression

* convert.py: named instance logging

* convert.py: use explicit logger id string

* convert.py: convert extra print() to named logger

* convert.py: sys.stderr.write --> logger.error

* *.py: Convert all python scripts to use logging module

* requirements.txt: remove extra line

* flake8: update flake8 ignore and exclude to match ci settings

* gh-actions: add flake8-no-print to flake8 lint step

* pre-commit: add flake8-no-print to flake8 and also update pre-commit version

* convert-hf-to-gguf.py: print() to logger conversion

* *.py: logging basiconfig refactor to use conditional expression

* *.py: removed commented out logging

* fixup! *.py: logging basiconfig refactor to use conditional expression

* constant.py: logger.error then exit should be a raise exception instead

* *.py: Convert logger error and sys.exit() into a raise exception (for atypical error)

* gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar

* verify-checksum-model.py: This is the result of the program, it should be printed to stdout.

* compare-llama-bench.py: add blank line for readability during missing repo response

* reader.py: read_gguf_file() use print() over logging

* convert.py: warning goes to stderr and won't hurt the dump output

* gguf-dump.py: dump_metadata() should print to stdout

* convert-hf-to-gguf.py: print --> logger.debug or ValueError()

* verify-checksum-models.py: use print() for printing table

* *.py: refactor logging.basicConfig()

* gguf-py/gguf/*.py: use __name__ as logger name

Since they will be imported and not run directly.

* python-lint.yml: use .flake8 file instead

* constants.py: logger no longer required

* convert-hf-to-gguf.py: add additional logging

* convert-hf-to-gguf.py: print() --> logger

* *.py: fix flake8 warnings

* revert changes to convert-hf-to-gguf.py for get_name()

* convert-hf-to-gguf-update.py: use triple quoted f-string instead

* *.py: accidentally corrected the wrong line

* *.py: add compilade warning suggestions and style fixes
2024-05-03 22:36:41 +03:00
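The logging pattern that a2ac89d6ef describes (a named logger per module, `logging.basicConfig()` driven by a conditional expression on a verbose flag) can be sketched as below. This is an illustration using only the standard library, not the code from convert.py; the names `configure_logging` and `main` and the `--verbose` flag wiring are assumptions.

```python
import argparse
import logging

# Named logger: in an imported module, __name__ is the module's dotted path,
# so log records can be traced back to where they were emitted.
logger = logging.getLogger(__name__)


def configure_logging(verbose: bool) -> int:
    """Configure the root logger and return the level that was chosen."""
    # Conditional expression selects the level from the verbose flag.
    level = logging.DEBUG if verbose else logging.INFO
    # force=True (Python 3.8+) replaces any handlers configured earlier.
    logging.basicConfig(level=level, force=True)
    return level


def main(argv=None) -> int:
    """Parse a hypothetical --verbose flag and emit two sample messages."""
    parser = argparse.ArgumentParser(description="hypothetical script")
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args(argv)
    level = configure_logging(args.verbose)
    logger.info("starting")
    logger.debug("shown only with --verbose")
    return level
```

By default `basicConfig()` attaches a handler that writes to stderr, so diagnostic messages stay out of stdout; that is consistent with the commit's choice to keep `print()` for actual program results such as the checksum table and metadata dump.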
Daniel Bevenius
433def286e
llama : rename ctx to user_data in progress_callback (#7045)
* llama : rename ctx to user_data in progress_callback

This commit renames the `ctx` parameter to `user_data` in the
`llama_progress_callback` typedef.

The motivation for this is that other callbacks use `user_data` or
`data`, and using `ctx` in this case might be confusing as it could be
confused with `llama_context`.

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-05-03 15:24:30 +02:00
Bartowski
60325fa56f
Remove .attention from skipped tensors to match more accurately (#7051) 2024-05-03 01:49:09 +02:00
alwqx
6ecf3189e0
chore: fix typo in llama.cpp (#7032)
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-05-02 11:56:41 -04:00
Andrew Downing
b0d943de17
Update LOG_IMPL and LOG_TEE_IMPL (#7029)
ROCm clang defines _MSC_VER which results in the wrong implementation of LOG_IMPL and LOG_TEE_IMPL being compiled.

This fixes https://github.com/ggerganov/llama.cpp/issues/6972
2024-05-01 23:31:30 +02:00
l3utterfly
8d608a81b7
main : fix off by one error for context shift (#6921) 2024-05-01 22:27:41 +03:00
Johannes Gäßler
3ea0d36000
Server: add tests for batch size, different seeds (#6950) 2024-05-01 17:52:55 +02:00
Johannes Gäßler
1613ef8d8e
CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (#7019) 2024-05-01 14:46:37 +02:00
slaren
c4ec9c0d3d
ci : exempt confirmed bugs from being tagged as stale (#7014) 2024-05-01 08:13:59 +03:00
Johannes Gäßler
a8f9b07631
perplexity: more statistics, added documentation (#6936)
* perplexity: more statistics, added documentation

* add LLaMA 3 8b scoreboard
2024-04-30 23:36:27 +02:00
Kevin Gibbons
f364eb6fb5
switch to using localizedDescription (#7010) 2024-04-30 17:14:02 +02:00
Georgi Gerganov
77e15bec62
metal : remove deprecated error code (#7008) 2024-04-30 15:52:21 +03:00
Kevin Gibbons
a68a1e7ed0
metal : log more info on error (#6987) 2024-04-30 12:34:50 +03:00
Georgi Gerganov
9c67c2773d
ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API

* ggml : fix GQA support in ggml_flash_attn_ext

* ggml : online attention (CPU)

* metal : initial implementation

* metal : f16 precision

* metal : reduce branches

* metal : specialize for head size

* wip : 8 rows per simd group

* wip : 4 rows per simd group

* wip : template for rows per warp

* metal : parallelize across KV size

* metal : parallel reduce across heads

* metal : efficient flash_attn_f16 implementation

* metal : avoid redundant loads of the attention

* metal : scale and mask in matrix form

* metal : fix comment

* llama : avoid ggml_cast, use F32 query

* metal : add parallel reduce version (disabled)

* metal : move output into local memory + optimize

- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments

* metal : add tests, fix scaling, support C > 32

* metal : improve precision

* ggml : fix f16 mad

* metal : minor

* metal : support Q > 8

* tests : add ATTN tests

* metal : disable buffer allocation logs

* tests : more

* metal : faster inner loop for C == 32

* metal : fix array initialization

* tests : ifdef

* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext

* ggml : fix ggml_soft_max mask requirement

* cuda : fix soft_max to use correct mask size

* cuda : add flash_attn kernel (wip)

* metal : optimize softmax for C > 32

* metal : optimize softmax

* tests : minor fix

* cuda : avoid zeroing fragments

* tests : update dims

* cuda : fix __hisinf() result check

* cuda : avoid warp_reduce for smax

* cuda : use int instead of int64_t

Noticeably improves performance (thanks to Johannes)

* cuda : make loops use the same loop values

Thanks Johannes again for the tip

* cuda : unroll some of the loops

* cuda : avoid __hisinf branches

* cuda : use half2 in softmax

* cuda : switch to 1 warp for bs > 16

* cuda : speed-up reduce part of the kernel

* cuda : unroll Q*K^T loop

* cuda : fix -INF block check

* cuda : simplify softmax

* cuda : fix matrix names

* cuda : minor

* llama : adapt to F16 KQ_pos

* llama : adapt new models to F16 KQ_mask

* ggml : fix F16 store (ARM NEON)

* llama : fix type of KQ_mask and KQ_pos

* ggml : fix CPU soft_max

* tests : add hs=256

* cuda : fix build

* metal : improve perf via smaller int registers

* cuda : adapt soft_max to F16 mask and pos

* CUDA: faster FlashAttention, kernel for bs == 1

* 16 cols for Phi-2

* no vec for hs, no hs==256 ncols==32 for Volta

* adjust kernel selection logic

* 4 warps, 256 stride for all D

* no ncols == 64

* Multiple parallel blocks for batch size 1

* fix compile warnings

* fix excessive KQ_b loads

* fix cmake build

* fix KV cache padding, NaN from INFINITY (#6438)

* llama : flash_attn cparam + fix defrag

* server: support flash_attn param

* server: bench: enable flash_attn param

* CUDA: refactor host code, dyn. par. blocks

* fix flash_attn_vec_f16 race condition

* flush softmax exp below threshold to 0

* store temp KQ in registers

* Calculate KQ as FP32 if KQV has GGML_PREC_F32

* Add __hgt2_mask implementation for CUDA 11

* fix KQ FP32 precision for parallel_blocks > 1

* llama-bench : add -fa,--flash-attn arg

* metal : add BS=1 kernel for flash attention (#6508)

* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel

* metal : use F32 attention accumulators

* batched-bench : add fattn arg

* llama : simplify llama_build_kv_store

ggml-ci

* llama : adapt build_olmo to changes

* ggml : fix arm fp16 store on windows

* metal : clean-up

* metal : clean-up kernel code

* metal : minor

* tests : remove benchmarks

ggml-ci

* ggml : fix avx512 const correctness

ggml-ci

* ggml : fix soft_max with bias on CPU

ggml-ci

* common : print --flash-attn in help

* ggml : fix num dimensions in ggml_flash_attn_ext

* llama : force disable flash attention for incompatible models

* ggml : ggml_soft_max support F16/F32 mask/pos

ggml-ci

* cuda : uint -> uint32_t

* cuda : "constexpr dim3" -> "const dim3"

ggml-ci

* cuda : try to fix __hgt2_mask

ggml-ci

* ggml : add TODO's for F16/F32 mask/pos support in other backends

* llama : replace bool need_kq_pos with use_alibi

* llama : prep ALiBi support for BERT models

ggml-ci

* llama : fix n_batch requirements

ggml-ci

* cont

* server : add help for --flash-attn arg

* llama : disable FA for AMD

* tests : remove TMP_ATTN_BENCH

ggml-ci

* llama : support save/load state with FA enabled

ggml-ci

* ci : add CUDA save-load-state tests

ggml-ci

* llama : llama_kv_cache_clear zeroes data + fix save-load seq

ggml-ci

* llama : fix copy-paste errors, add TODO

* llama : disallow incompatible states

* llama : update llama_state_get_size after v_trans field

* metal : remove tmp log

* llama : add static reminder for llama_state_get_size

* metal : fix max nsg

ggml-ci

* ci : fix arg order

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 12:16:08 +03:00
Georgi Gerganov
952d03dbea
convert : use utf8 encoding (#7000)
* convert : use utf8 encoding

* convert : update instructions and warning message
2024-04-30 11:05:25 +03:00
Olivier Chafik
8843a98c2b
Improve usability of --model-url & related flags (#6930)
* args: default --model to models/ + filename from --model-url or --hf-file (or else legacy models/7B/ggml-model-f16.gguf)

* args: main & server now call gpt_params_handle_model_default

* args: define DEFAULT_MODEL_PATH + update cli docs

* curl: check url of previous download (.json metadata w/ url, etag & lastModified)

* args: fix update to quantize-stats.cpp

* curl: support legacy .etag / .lastModified companion files

* curl: rm legacy .etag file support

* curl: reuse regex across headers callback calls

* curl: unique_ptr to manage lifecycle of curl & outfile

* curl: nit: no need for multiline regex flag

* curl: update failed test (model file collision) + gitignore *.gguf.json
2024-04-30 00:52:50 +01:00
Clint Herron
b8c1476e44
Extending grammar integration tests (#6644)
* Cleaning up integration tests to share code between tests and make it simpler to add new tests.

* Add tests around quantifiers to ensure both matching and non-matching compliance.

* Add slightly more complex grammar with quantifiers to test references with quantifiers.

* Fixing build when C++17 is not present.

* Separating test calls to give more helpful stack traces on failure. Adding verbose messages to give visibility for what is being tested.

* Adding quotes around strings to explicitly show whitespace

* Removing trailing whitespace.

* Implementing suggestions from @ochafik -- grammars and test strings now print and flush before tests to aid in debugging segfaults and whatnot.

* Cleaning up forgotten symbols. Modifying simple test to use test harness. Added comments for more verbose descriptions of what each test is accomplishing.

* Unicode symbol modifications to hopefully make log easier to parse visually.
2024-04-29 14:40:14 -04:00