Commit graph

2401 commits

Author SHA1 Message Date
Francis Couture-Harpin
d0d32dced9 convert-hf : omit output.weight when identical with token_embd.weight
Only for Mamba for now, but it might be relevant for other models eventually.
Most Mamba models actually share these two tensors, albeit implicitly.
2024-03-08 10:07:52 -05:00
Francis Couture-Harpin
1c8ea55843 mamba : add missing spaces
This is purely a formatting change.
2024-03-07 22:29:45 -05:00
Francis Couture-Harpin
17e4d6c96a mamba : rename metadata to be more similar to transformers library
This breaks existing converted-to-GGUF models,
but the metadata names are more "standard".

mamba : support mamba-*-hf models

These models share their token_embd.weight with their output.weight
2024-03-07 21:32:52 -05:00
Francis Couture-Harpin
d8024a486b convert-hf : support new metadata keys for Mamba
For the models available at
https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406
2024-03-07 20:28:42 -05:00
Francis Couture-Harpin
7cd5a1f986 server : fix cache_tokens not getting correctly resized
Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even if it was
removed from the KV cache.

For Mamba, this caused useless prompt reprocessing when the previous
request triggered the above case.
2024-03-07 14:18:21 -05:00
Francis Couture-Harpin
916b586386 Merge branch 'master' into support-mamba-ssm 2024-03-07 14:15:05 -05:00
Georgi Gerganov
6cdabe6526
llama-bench : add embeddings option (#5924)
* llama-bench : add embeddings option

* llama-bench : do not hard code embd default value

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-07 16:32:38 +02:00
Neo Zhang Jianyu
89fb735fcf
Revert "[SYCL] fix error when set main gpu to non-zero (#5901)" (#5918)
This reverts commit ceca1aef07.
2024-03-07 12:14:49 +01:00
Minsoo Cheong
55a2a900ff
server : add /v1/completions endpoint (#5914)
* add-`/v1/completions`-endpoint

* add legacy comment to `/completion` endpoint
2024-03-07 12:42:39 +02:00
Georgi Gerganov
2002bc96bf
server : refactor (#5882)
* server : refactoring (wip)

* server : remove llava/clip objects from build

* server : fix empty prompt handling + all slots idle logic

* server : normalize id vars

* server : code style

* server : simplify model chat template validation

* server : code style

* server : minor

* llama : llama_chat_apply_template support null buf

* server : do not process embedding requests when disabled

* server : reorganize structs and enums + naming fixes

* server : merge oai.hpp in utils.hpp

* server : refactor system prompt update at start

* server : disable cached prompts with self-extend

* server : do not process more than n_batch tokens per iter

* server: tests: embeddings use a real embeddings model (#5908)

* server, tests : bump batch to fit 1 embedding prompt

* server: tests: embeddings fix build type Debug is randomly failing (#5911)

* server: tests: embeddings, use different KV Cache size

* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings

* server: tests: embeddings, no need to wait for server idle as it can timout

* server: refactor: clean up http code (#5912)

* server : avoid n_available var

ggml-ci

* server: refactor: better http codes

* server : simplify json parsing + add comment about t_last

* server : rename server structs

* server : allow to override FQDN in tests

ggml-ci

* server : add comments

---------

Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
2024-03-07 11:41:53 +02:00
Neo Zhang Jianyu
ceca1aef07
[SYCL] fix error when set main gpu to non-zero (#5901)
* fix error when set main gpu to non-zero

* fix delete condition
2024-03-07 16:34:31 +08:00
Jared Van Bortel
e04e04f8fa
ggml : use SYS_get_cpu if SYS_getcpu is not defined (#5906)
Fixes #5694
Fixes ggerganov/whisper.cpp#1894
2024-03-06 15:42:23 -05:00
bobqianic
e25fb4b18f
ggml : use uint8x16_t return type for ggml_vqtbl1q_u8 (#5894)
* use uint8x16_t

* Update ggml-quants.c
2024-03-06 09:35:07 +02:00
Georgi Gerganov
1e35d619a6
convert : remove AWQ remnants (#5768) 2024-03-06 09:13:42 +02:00
Neo Zhang Jianyu
8ced9f7e32
add wait() to make code stable (#5895) 2024-03-06 12:08:32 +08:00
slaren
652ca2bded
compare-llama-bench.py : remove mul_mat_q (#5892) 2024-03-05 22:27:29 +01:00
Francis Couture-Harpin
5544f5211b Merge branch 'master' into support-mamba-ssm 2024-03-05 12:12:01 -05:00
Francis Couture-Harpin
93fd4b8d5b mamba : clarify some comments 2024-03-05 12:05:44 -05:00
Jared Van Bortel
bd836944f8
quants : use MM256_SET_M128I consistently to fix gcc 7 build (#5889) 2024-03-05 11:56:37 -05:00
ExtReMLapin
3de31677d3
grammars : blacklists character control set (#5888)
* Prevent control characters from being served in json string

* Prevent control characters from being served in json string (array)
2024-03-05 18:33:08 +02:00
Georgi Gerganov
82cb31eb93
Revert "grammars : don't allow to output unescaped new line in string (#5885)"
This reverts commit b1a4e994fd.
2024-03-05 15:56:24 +02:00
ExtReMLapin
b1a4e994fd
grammars : don't allow to output unescaped new line in string (#5885)
* Don't allow grammar json array to output unescaped new line in string

* Don't allow new line in json object string
2024-03-05 15:44:29 +02:00
0cc4m
61d1c88e15
Vulkan Improvements (#5835)
* Improve dequant shaders, add fast q4_0 dequant

* Optimize dmmv non-kquants for GCN

Remove unnecessary SPIR-V shader duplication

* Fix q4_0 dequant dispatch sizes

Fix backend free bug

* Optimize dequant shaders for q4_1, q5_0, q5_1 and q8_0

* Add unary and binary op shader templates

* Fix Vulkan check results

* Enable non-contiguous support for simple ops

* Add argsort

Basic q4_0 mmq shader and unit test

* Speed up q4_0 dequant code, enable mmq for q4_0

* Rework matmul pipeline selection

* Add soft_max alibi support

* Add q4_1, q5_0, q5_1 and q8_0 dequant mat mat mul shaders

* Add environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to limit max buffer size

Rename GGML_VULKAN_DISABLE_F16 to GGML_VK_DISABLE_F16 for consistency
2024-03-05 13:33:42 +01:00
Neo Zhang Jianyu
21b0867433
[SYCL] fix mul_mat fault in CI/unit-test (#5862)
* fix mul_mat fault in cpy_f32_f16

* rm unused function

* add wait() for memcpy

* restore ci/run.sh, rename struct defination, fix bug in ggml_sycl_op_mul_mat_sycl

* fix format issue

* llama : fix segfault from unknown model arch name (#5820)

* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : refactor internal quantization functions (#5830)

* scripts : add pod-llama.sh

* ggml : IQ3_S improvements (#5829)

* iq3_s: somewhat faster AVX2 dot product

On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* convert-hf : make model class definitions self-contained (#5825)

* convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)

* ggml : fix IQ3_S AVX implementation (#5834)

ggml-ci

* llama : add abort_callback to interrupt computation (#5409)

* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: tests: passkey challenge /  self-extend with context shift demo (#5832)

* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test

* flake.lock: Update (#5842)

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* server : init http requests thread pool with --parallel if set (#5836)

* ci : schedule slow server tests only on Release or on demand (#5839)

* llama : fix llama_copy_state_data with fragmented KV cache (#5840)

The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.

* gguf-dump : support i-quants (#5841)

Co-authored-by: Black_Fox <radekliska@gmail.com>

* llama : allow for user specified embedding pooling type (#5849)

* allow for user specified pooling type

* llama : use enum types over int

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* readme : add API changes section

* cuda : fix data race in soft max (#5853)

* main : support special tokens as reverse/anti prompt (#5847)

* Support special tokens as reverse/anti prompt.

* Tokenize antiprompts only once.

* main : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : use LLAMA_DEFAULT_SEED (#5855)

* add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)

* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arrange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embbeding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>

* sync : ggml

* add alias for chat template (#5858)

* speculative : implement stochastic speculative sampling (#5625)

* (WIP) Implement stochastic speculative decoding

* sample from residual distribution on draft accept failure

* fix #5657: force greedy sampling with probs when temp is 0

* remove p_accept parameter

* fix style

* remove unused variables

* add srand() in speculative.cpp

* replace use of rand() with mt19937 sampling

* fixes based on review (@JohannesGaessler)

* fix r random generation

* randomly select next sequence to verify + fix bug in memory freeing

* fix bug in active_seqs sync

* fix uniform int distribution initialization

* remove warnings from comparison between int and size_t

* check grammar in `llama_sample_probability_distribution_impl`

* remove malloc code by utilizing vectors

* add PR link to README

* cmake : handle cases where git index is not found in .git (#5844)

* Update CMakeLists.txt

* Update CMakeLists.txt

* ggml : introduce ggml_status (ggml/750)

* using enum as an exit code instead of macros

* update return type from enum to unsigned int

* indentation fix

* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast

* ggml_status to string cast

* GGML_CALL was removed

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* sync : ggml

ggml-ci

* ggml : fix unknown status (#0)

* flake : fix

* llama : fix embeddings (#5796)

* llama : fix embeddings

ggml-ci

* llama : do not use KV cache for non-causal models

ggml-ci

* embeddings : fix llama_batch_init arg

* llama : add pooling switch

* llama : distinguish token vs sequence embeddings

ggml-ci

* llama : assert pooling tensor

* llama : simplify causal mask condition

ggml-ci

* llama : assert input batch with pooling enabled

* readme : update API changes list

* nix: static build (#5814)

* fix speculative decoding build on windows (#5874)

* rebase and rm tailing space

---------

Co-authored-by: LiangtaoJin <liang-tao.jin@intel.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Nindaleth <Nindaleth@users.noreply.github.com>
Co-authored-by: Black_Fox <radekliska@gmail.com>
Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: Minsoo Cheong <54794500+mscheong01@users.noreply.github.com>
Co-authored-by: Dane Madsen <dane_madsen@hotmail.com>
Co-authored-by: hutli <6594598+hutli@users.noreply.github.com>
Co-authored-by: Jeffrey Quesnelle <emozilla@nousresearch.com>
2024-03-05 13:38:35 +05:30
Minsoo Cheong
6a87ac3a52
fix editorconfig check break (#5879) 2024-03-05 11:42:23 +05:30
Jeffrey Quesnelle
29eee40474
fix speculative decoding build on windows (#5874) 2024-03-04 22:23:06 -05:00
hutli
1d41d6f7c2
nix: static build (#5814) 2024-03-04 17:33:08 -08:00
Georgi Gerganov
29ae62d2ae
llama : fix embeddings (#5796)
* llama : fix embeddings

ggml-ci

* llama : do not use KV cache for non-causal models

ggml-ci

* embeddings : fix llama_batch_init arg

* llama : add pooling switch

* llama : distinguish token vs sequence embeddings

ggml-ci

* llama : assert pooling tensor

* llama : simplify causal mask condition

ggml-ci

* llama : assert input batch with pooling enabled

* readme : update API changes list
2024-03-04 22:31:20 +02:00
Georgi Gerganov
e0843afe1b
flake : fix 2024-03-04 21:50:50 +02:00
Georgi Gerganov
a1c6d96ed8 ggml : fix unknown status (#0) 2024-03-04 20:54:23 +02:00
Georgi Gerganov
efd8533ef8 sync : ggml
ggml-ci
2024-03-04 20:54:23 +02:00
Michael Podvitskiy
9fa2627347 ggml : introduce ggml_status (ggml/750)
* using enum as an exit code instead of macros

* update return type from enum to unsigned int

* indentation fix

* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast

* ggml_status to string cast

* GGML_CALL was removed

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-04 20:54:23 +02:00
Dane Madsen
fe52be11e3
cmake : handle cases where git index is not found in .git (#5844)
* Update CMakeLists.txt

* Update CMakeLists.txt
2024-03-04 20:26:55 +02:00
Minsoo Cheong
6d341ab6c5
speculative : implement stochastic speculative sampling (#5625)
* (WIP) Implement stochastic speculative decoding

* sample from residual distribution on draft accept failure

* fix #5657: force greedy sampling with probs when temp is 0

* remove p_accept parameter

* fix style

* remove unused variables

* add srand() in speculative.cpp

* replace use of rand() with mt19937 sampling

* fixes based on review (@JohannesGaessler)

* fix r random generation

* randomly select next sequence to verify + fix bug in memory freeing

* fix bug in active_seqs sync

* fix uniform int distribution initialization

* remove warnings from comparison between int and size_t

* check grammar in `llama_sample_probability_distribution_impl`

* remove malloc code by utilizing vectors

* add PR link to README
2024-03-04 20:24:00 +02:00
Francis Couture-Harpin
2a99d1b243 ggml : implicitly pass src tensors through dst for Mamba-related ops 2024-03-04 10:10:50 -05:00
Xuan Son Nguyen
4ffcdce2ff
add alias for chat template (#5858) 2024-03-04 12:22:08 +01:00
Georgi Gerganov
a0fc62661f
sync : ggml 2024-03-04 10:40:04 +02:00
leejet
7d43c585dc
add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arrange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embbeding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-04 10:39:10 +02:00
DAN™
82f3e668ad
common : use LLAMA_DEFAULT_SEED (#5855) 2024-03-04 10:08:19 +02:00
DAN™
5a51cc1bb4
main : support special tokens as reverse/anti prompt (#5847)
* Support special tokens as reverse/anti prompt.

* Tokenize antiprompts only once.

* main : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-04 09:57:20 +02:00
Francis Couture-Harpin
eefb794bd7 mamba : support state saving and restoring 2024-03-03 13:55:48 -05:00
Francis Couture-Harpin
b83fbc9287 convert : for Mamba, fallback to internal NeoX tokenizer
The resulting models are exactly the same
as if the tokenizer.json and tokenizer_config.json of GPT-NeoX were there.
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
d52dd501f0 ggml : in ggml_ssm_scan, use a threshold for soft_plus
This is how the official Mamba implementation does it,
and it's also what torch.nn.Softplus does.
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
1af1000f10 mamba : more correctly update the "used" field of the KV cache 2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
206e8ee2b2 mamba : stop abusing attention metadata
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.

This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something else than d_conv - 1 columns for the conv_states
 will not require breaking existing converted Mamba models again)

* gguf-py : add new KV metadata key-value pairs for Mamba

* llama : add new metadata key-value pairs for Mamba

* llama : guard against divisions by zero when n_head is 0

* mamba : rename "unlimited" KV cache property to "recurrent"
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
8f605cfe0d mamba : adapt perplexity, batched, and batched-bench examples
* perplexity : limit the max number of sequences

This adapts to what the loaded model can provide.

* llama : add llama_n_max_seq to get the upper limit for seq_ids

Used by the perplexity example.

* batched : pass n_parallel to the model's context params

This should have been there already, but it wasn't.

* batched-bench : reserve sequences to support Mamba

* batched-bench : fix tokens being put in wrong sequences

Generation quality isn't what's measured in there anyway,
but at least using the correct sequences avoids using non-consecutive
token positions.
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
79d636cc7e mamba : dedicate an input tensor for state copy indices
This is cleaner and makes it easier to adapt when/if token positions
(and by extension, inp_K_shift) are no longer integers.
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
34e2fca8eb mamba : make the server and parallel examples work with whole sequences
A seq_id is dedicated to the system prompt in both cases.

* llama : make llama_kv_cache_seq_rm return whether it succeeded or not
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
3dcf79824d mamba : support llama_kv_cache_seq_cp copy chains
* mamba : support shifting and dividing the kv cache pos
2024-03-03 11:40:27 -05:00
Francis Couture-Harpin
9473ec2147 mamba : simultaneous sequence processing
A batch can now contain tokens from multiple sequences.

This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.

However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.

* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba

This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions which use existing operators would need loops which would
add too many nodes to the graph (at least the ones I thought of).

Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.

Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.

* llama : add inp_s_seq as a new input tensor

The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.

The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.

Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
2024-03-03 11:40:27 -05:00