Otherwise, when the "we have to evaluate at least 1 token" special case
was triggered, an extra token was kept in cache_tokens even though it had
been removed from the KV cache.
For Mamba, this caused needless prompt reprocessing whenever the previous
request had triggered that case.
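A minimal sketch of the intended invariant, with illustrative names (prompt_tokens, cache_tokens and n_past here are assumptions for the sketch, not the exact server code): whenever the special case trims the reusable prefix, cache_tokens has to be trimmed to match.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> prompt_tokens = {1, 2, 3, 4};
    std::vector<int> cache_tokens  = {1, 2, 3, 4};   // fully cached prompt

    int n_past = 4;                                  // tokens reusable from the KV cache
    if (n_past == (int) prompt_tokens.size()) {
        n_past -= 1;                                 // "we have to evaluate at least 1 token"
    }

    // keep cache_tokens consistent with what actually remains in the KV cache
    cache_tokens.resize(n_past);

    printf("n_past = %d, cached = %zu\n", n_past, cache_tokens.size());
    return 0;
}
```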
* server : refactoring (wip)
* server : remove llava/clip objects from build
* server : fix empty prompt handling + all slots idle logic
* server : normalize id vars
* server : code style
* server : simplify model chat template validation
* server : code style
* server : minor
* llama : llama_chat_apply_template support null buf
* server : do not process embedding requests when disabled
* server : reorganize structs and enums + naming fixes
* server : merge oai.hpp in utils.hpp
* server : refactor system prompt update at start
* server : disable cached prompts with self-extend
* server : do not process more than n_batch tokens per iter
* server: tests: embeddings use a real embeddings model (#5908)
* server, tests : bump batch to fit 1 embedding prompt
* server: tests: embeddings fix build type Debug is randomly failing (#5911)
* server: tests: embeddings, use different KV Cache size
* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings
* server: tests: embeddings, no need to wait for server idle as it can time out
* server: refactor: clean up http code (#5912)
* server : avoid n_available var
ggml-ci
* server: refactor: better http codes
* server : simplify json parsing + add comment about t_last
* server : rename server structs
* server : allow to override FQDN in tests
ggml-ci
* server : add comments
---------
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
* fix mul_mat fault in cpy_f32_f16
* rm unused function
* add wait() for memcpy
* restore ci/run.sh, rename struct definition, fix bug in ggml_sycl_op_mul_mat_sycl
* fix format issue
* llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name
* llama : make all LLM maps const
This also requires using `std::map::at` instead of its `operator[]`,
which does not exist for const maps (see the sketch after these notes).
* llama : name LLM_ARCH_UNKNOWN to "(unknown)"
This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284
* llama : remove redundant inner const for LLM_TENSOR_NAMES
The extra const won't do anything here as const maps
return const references to values.
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* llama : remove redundant nullptr check in llm_arch_from_string
Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
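A short, self-contained sketch of the C++ behaviour the notes above rely on (the map contents and names here are illustrative, not the actual llama.cpp tables): `operator[]` is unavailable on a const map because it may insert, so lookups go through `at()`, which throws for missing keys; giving LLM_ARCH_UNKNOWN a real entry like "(unknown)" keeps those lookups from throwing.

```cpp
#include <cstdio>
#include <map>
#include <string>

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_UNKNOWN };

// a const map has no operator[] (it could insert), so lookups use at()
static const std::map<llm_arch, std::string> ARCH_NAMES = {
    { LLM_ARCH_LLAMA,   "llama"     },
    { LLM_ARCH_UNKNOWN, "(unknown)" },   // a real entry, so at() never throws for it
};

int main() {
    // ARCH_NAMES[LLM_ARCH_LLAMA];       // would not compile: no operator[] on a const map
    printf("%s\n", ARCH_NAMES.at(LLM_ARCH_LLAMA).c_str());
    printf("%s\n", ARCH_NAMES.at(LLM_ARCH_UNKNOWN).c_str());
    return 0;
}
```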
* llama : refactor internal quantization functions (#5830)
* scripts : add pod-llama.sh
* ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product
On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.
* iq3_s: somewhat faster ARM_NEON dot product
Still dog slow - 10.7 t/s up from 9.9 t/s.
* iq3_s: another small ARM_NEON improvement
10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick
that works best on AVX2.
* iq3_s: minor improvement on Metal
49.4 t/s -> 50.3 t/s
* iq3_s: PPL improvement
E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.
* iq3_s: use new grid everywhere
* Fix ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* convert-hf : make model class definitions self-contained (#5825)
* convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)
* ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
* llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation
* format fix
* a brief explaining comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
* flake.lock: Update (#5842)
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
→ 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
→ 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
→ 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* server : init http requests thread pool with --parallel if set (#5836)
* ci : schedule slow server tests only on Release or on demand (#5839)
* llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.
Existing session files should still work.
* llama : fix llama_kv_cache_cell_max inability to return 1
I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.
* llama : fix state size calculation
Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
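An illustrative sketch of the idea (simplified types, not the actual llama.cpp code): the number of rows to serialize is the index one past the last non-empty cell, and every cell is checked, so a cache using only cell 0 yields 1 rather than 0.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct kv_cell {
    int32_t pos     = -1;
    bool    has_seq = false;
    bool is_empty() const { return pos < 0 && !has_seq; }
};

// index one past the last non-empty cell; checks cell 0 too
static uint32_t kv_cache_cell_max(const std::vector<kv_cell> & cells) {
    for (uint32_t i = (uint32_t) cells.size(); i > 0; --i) {
        if (!cells[i - 1].is_empty()) {
            return i;
        }
    }
    return 0;
}

int main() {
    std::vector<kv_cell> cells(8);
    cells[0].pos     = 0;       // only the first cell holds a token
    cells[0].has_seq = true;
    printf("rows to save = %u\n", kv_cache_cell_max(cells));   // prints 1
    return 0;
}
```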
* gguf-dump : support i-quants (#5841)
Co-authored-by: Black_Fox <radekliska@gmail.com>
* llama : allow for user specified embedding pooling type (#5849)
* allow for user specified pooling type
* llama : use enum types over int
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* readme : add API changes section
* cuda : fix data race in soft max (#5853)
* main : support special tokens as reverse/anti prompt (#5847)
* Support special tokens as reverse/anti prompt.
* Tokenize antiprompts only once.
* main : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : use LLAMA_DEFAULT_SEED (#5855)
* add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm
* cuda: add batch inference support for ggml_pad/ggml_upscale
* add ggml_arange
* add ggml_timestep_embedding
* update ggml_arange/ggml_timestep_embedding tests
* cuda: fix im2col
* add ggml_arange/ggml_timestep_embedding support for the Metal backend
* fix some bugs
* fix some bugs
* Update ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-metal.metal
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* modify according to the review comments
* ggml : fix compile warnings + code style
* ggml : normalize compute_forward calls + fix seg fault in debug
* minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
* sync : ggml
* add alias for chat template (#5858)
* speculative : implement stochastic speculative sampling (#5625)
* (WIP) Implement stochastic speculative decoding
* sample from residual distribution on draft accept failure
* fix #5657: force greedy sampling with probs when temp is 0
* remove p_accept parameter
* fix style
* remove unused variables
* add srand() in speculative.cpp
* replace use of rand() with mt19937 sampling
* fixes based on review (@JohannesGaessler)
* fix r random generation
* randomly select next sequence to verify + fix bug in memory freeing
* fix bug in active_seqs sync
* fix uniform int distribution initialization
* remove warnings from comparison between int and size_t
* check grammar in `llama_sample_probability_distribution_impl`
* remove malloc code by utilizing vectors
* add PR link to README
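For context, a hedged sketch of the accept/reject rule these commits describe (names and shapes are illustrative; this is the generic stochastic speculative sampling scheme, not the exact llama.cpp code): accept the drafted token with probability min(1, p_target/p_draft), otherwise resample from the renormalized residual max(0, p_target - p_draft).

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Accept the drafted token with probability min(1, p_target/p_draft);
// otherwise sample from the renormalized residual max(0, p_target - p_draft).
static int speculative_accept(int drafted,
                              const std::vector<float> & p_target,
                              const std::vector<float> & p_draft,
                              std::mt19937 & rng) {
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    const float ratio = p_draft[drafted] > 0.0f ? p_target[drafted]/p_draft[drafted] : 1.0f;
    if (u(rng) < std::min(1.0f, ratio)) {
        return drafted;                              // draft accepted
    }
    // draft rejected: build and sample the residual distribution
    std::vector<float> residual(p_target.size());
    float sum = 0.0f;
    for (size_t i = 0; i < residual.size(); ++i) {
        residual[i] = std::max(0.0f, p_target[i] - p_draft[i]);
        sum += residual[i];
    }
    if (sum <= 0.0f) {
        return drafted;                              // distributions match; keep the draft
    }
    std::discrete_distribution<int> resample(residual.begin(), residual.end());
    return resample(rng);
}

int main() {
    std::mt19937 rng(42);
    const std::vector<float> p_target = {0.1f, 0.6f, 0.3f};
    const std::vector<float> p_draft  = {0.3f, 0.3f, 0.4f};
    printf("token = %d\n", speculative_accept(1, p_target, p_draft, rng));
    return 0;
}
```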
* cmake : handle cases where git index is not found in .git (#5844)
* Update CMakeLists.txt
* Update CMakeLists.txt
* ggml : introduce ggml_status (ggml/750)
* using enum as an exit code instead of macros
* update return type from enum to unsigned int
* indentation fix
* compound update
ggml_compute_exit_code -> ggml_status
changed ggml_status from a bit-field type to simple codes
ggml_status to string cast
* ggml_status to string cast
* GGML_CALL was removed
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* sync : ggml
ggml-ci
* ggml : fix unknown status (#0)
* flake : fix
* llama : fix embeddings (#5796)
* llama : fix embeddings
ggml-ci
* llama : do not use KV cache for non-causal models
ggml-ci
* embeddings : fix llama_batch_init arg
* llama : add pooling switch
* llama : distinguish token vs sequence embeddings
ggml-ci
* llama : assert pooling tensor
* llama : simplify causal mask condition
ggml-ci
* llama : assert input batch with pooling enabled
* readme : update API changes list
* nix: static build (#5814)
* fix speculative decoding build on windows (#5874)
* rebase and rm trailing space
---------
Co-authored-by: LiangtaoJin <liang-tao.jin@intel.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Nindaleth <Nindaleth@users.noreply.github.com>
Co-authored-by: Black_Fox <radekliska@gmail.com>
Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: Minsoo Cheong <54794500+mscheong01@users.noreply.github.com>
Co-authored-by: Dane Madsen <dane_madsen@hotmail.com>
Co-authored-by: hutli <6594598+hutli@users.noreply.github.com>
Co-authored-by: Jeffrey Quesnelle <emozilla@nousresearch.com>
This breaks existing converted-to-GGUF Mamba models,
but will allow supporting mixed architectures like MambaFormer
without needing to break Mamba models.
This will also allow changing the size of Mamba's states
without having to reconvert models in the future.
(e.g. using something other than d_conv - 1 columns for the conv_states
will not require breaking existing converted Mamba models again)
* gguf-py : add new KV metadata key-value pairs for Mamba
* llama : add new metadata key-value pairs for Mamba
* llama : guard against divisions by zero when n_head is 0
* mamba : rename "unlimited" KV cache property to "recurrent"
* perplexity : limit the max number of sequences
This adapts to what the loaded model can provide.
* llama : add llama_n_max_seq to get the upper limit for seq_ids
Used by the perplexity example.
* batched : pass n_parallel to the model's context params
This should have been there already, but it wasn't.
* batched-bench : reserve sequences to support Mamba
* batched-bench : fix tokens being put in wrong sequences
Generation quality isn't what's measured there anyway,
but using the correct sequences at least avoids relying on
non-consecutive token positions.
A batch can now contain tokens from multiple sequences.
This is necessary for at least the parallel example, the server example,
and the HellaSwag test in the perplexity example.
However, for this to be useful, uses of llama_kv_cache_seq_rm/cp
will need to be changed to work on whole sequences.
* ggml : add ggml_ssm_conv as a new operator for the conv step of Mamba
This operator makes it possible to use and update the correct states
for each token of the batch in the same way as ggml_ssm_scan.
Other solutions using existing operators (at least the ones I thought of)
would need loops, which would add too many nodes to the graph.
Using this operator further reduces the size of the CPU compute buffer
from 140.68 MiB to 103.20 MiB with Mamba 3B with a batch size of 512.
And (at least on CPU), it's a bit faster than before.
Note that "ggml_ssm_conv" is probably not the most appropriate name,
and it could be changed if a better one is found.
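A rough sketch of the per-token conv step this operator covers, for a single channel (d_conv, the weights, and the state layout here are assumptions for the sketch, not the ggml kernel): each channel keeps its last d_conv - 1 inputs as state, so the state can be advanced one token at a time, just like ggml_ssm_scan advances the SSM states.

```cpp
#include <cstdio>
#include <vector>

// single-channel causal conv with a rolling state of d_conv - 1 past inputs
static float conv_step(std::vector<float> & state,        // d_conv - 1 previous inputs, oldest first
                       const std::vector<float> & weight, // d_conv taps for this channel
                       float x) {
    float y = weight.back() * x;                           // newest input gets the last tap
    for (size_t i = 0; i < state.size(); ++i) {
        y += weight[i] * state[i];
    }
    // advance the rolling state by one token
    for (size_t i = 0; i + 1 < state.size(); ++i) {
        state[i] = state[i + 1];
    }
    state.back() = x;
    return y;
}

int main() {
    std::vector<float> state(3, 0.0f);                     // d_conv = 4 -> 3 state columns
    const std::vector<float> weight = {0.1f, 0.2f, 0.3f, 0.4f};
    for (float x : {1.0f, 2.0f, 3.0f}) {
        printf("%.2f\n", conv_step(state, weight, x));
    }
    return 0;
}
```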
* llama : add inp_s_seq as a new input tensor
The most convenient implementation to select the correct state (for Mamba)
for each token is to directly get the correct index from a tensor.
This is why inp_s_seq is storing int32_t and not floats.
The other, less convenient way to select the correct state would be
to have inp_KQ_mask contain 1.0f for each state used by a token
and 0.0f otherwise. This complicates quickly fetching the first used
state of a token, and is also less efficient because a whole row
of the mask would always need to be read for each token.
Using indexes makes it easy to stop searching when there are
no more sequences for a token, and the first sequence assigned
is always very quickly available (it's the first element of each row).
From 290.37 MiB to 140.68 MiB of CPU compute buffer size
with Mamba 3B with a batch size of 512.
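A small illustration of the index-based selection described above (the row layout and the -1 terminator are assumptions for the sketch, not the exact tensor format): each token's row lists the states it uses as int32 indexes, so the first state is immediately available and the scan stops at the terminator instead of reading a whole mask row.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    constexpr int n_tokens = 2;
    constexpr int max_seqs = 4;

    // one row of int32 state indexes per token; -1 terminates the row
    const int32_t s_seq[n_tokens * max_seqs] = {
        0, -1, -1, -1,   // token 0 uses state 0
        2,  3, -1, -1,   // token 1 uses states 2 and 3
    };

    for (int t = 0; t < n_tokens; ++t) {
        printf("token %d uses states:", t);
        for (int j = 0; j < max_seqs; ++j) {
            const int32_t s = s_seq[t*max_seqs + j];
            if (s < 0) {
                break;   // no more sequences for this token
            }
            printf(" %d", s);
        }
        printf("\n");
    }
    return 0;
}
```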
The result tensor of ggml_ssm_scan was previously a big part
of the CPU compute buffer size. To make it smaller,
it does not contain the intermediate ssm states anymore.
Both y and the last ssm state are combined in the result tensor,
because it seems only a single tensor can be returned by an operator
with the way the graph is built.
This is a step towards making this Mamba implementation usable
with the server example (the way the system prompt is kept when clearing
the client slots will need to be changed before this can work, though).
The KV cache size for this kind of model is tied to the maximum number
of sequences kept at any single time.
For now, this number is obtained from n_parallel (plus one,
to have an extra sequence to dedicate to the system prompt),
but there might be a better way to do this that doesn't also
make the main example use 2 cells even when only 1 is really used
(for this specific case, --parallel 0 helps).
Simultaneous sequence processing will probably require changes to
ggml_ssm_scan, and possibly a new operator for the conv step.
* mamba : support llama_kv_cache_seq_cp
This (mis)uses the logic around K shifts, because tokens in a state
can't be shifted anyway, and because inp_K_shift has the right shape and type.
Using ggml_get_rows is a nice way to do copies, but copy chains can't work.
Fortunately, copy chains don't really seem to be used in the examples.
Each KV cell is dedicated to the sequence ID corresponding to its own index.
* mamba : use a state mask
It's cleaner than the previous heuristic of
checking for the pos of the first token in the batch.
inp_KQ_mask could not be re-used for this, because it has the wrong shape
and because it seems more suited to the next step of
simultaneous sequence processing (helping with the problem of
remembering which token belongs to which sequence(s)/state(s)).
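A minimal sketch of how such a state mask can be applied (the shapes and the 0/1 convention are assumptions, not the actual inp_s_mask tensor): multiplying each sequence's state row by its mask entry keeps states whose history is still valid and zeroes the ones that were cleared, without inspecting token positions.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n_seqs  = 3;
    const int d_state = 4;

    // one state row per sequence (illustrative values)
    std::vector<float> states(n_seqs * d_state, 1.0f);

    // 1.0f = keep the state, 0.0f = the sequence was cleared, reset its state
    const float s_mask[n_seqs] = { 1.0f, 0.0f, 1.0f };

    for (int s = 0; s < n_seqs; ++s) {
        for (int i = 0; i < d_state; ++i) {
            states[s*d_state + i] *= s_mask[s];
        }
    }

    for (int s = 0; s < n_seqs; ++s) {
        printf("seq %d state[0] = %.1f\n", s, states[s*d_state]);
    }
    return 0;
}
```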
* llama : replace the usage of n_ctx with kv_self.size in many places
* mamba : use n_tokens directly instead of n_tok