llama.cpp

Author	SHA1	Message	Date
DAN™	50973c77ae	common : use LLAMA_DEFAULT_SEED (#5855)	2024-03-10 15:38:20 +08:00
DAN™	d48c24273b	main : support special tokens as reverse/anti prompt (#5847) * Support special tokens as reverse/anti prompt. * Tokenize antiprompts only once. * main : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 15:38:20 +08:00
slaren	a0a1ca04d8	cuda : fix data race in soft max (#5853)	2024-03-10 15:38:20 +08:00
Georgi Gerganov	5b2daffcde	readme : add API changes section	2024-03-10 15:38:20 +08:00
Douglas Hanley	74a0202ff3	llama : allow for user specified embedding pooling type (#5849) * allow for user specified pooling type * llama : use enum types over int --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 15:38:20 +08:00
Nindaleth	c0a1a5de91	gguf-dump : support i-quants (#5841) Co-authored-by: Black_Fox <radekliska@gmail.com>	2024-03-10 15:38:19 +08:00
compilade	abb8e001e5	llama : fix llama_copy_state_data with fragmented KV cache (#5840) The row size of the saved states was based on kv_self.head while it should be based on llama_kv_cache_cell_max. Existing session files should still work. * llama : fix llama_kv_cache_cell_max inability to return 1 I've also changed its return type to uint32_t, because this function is always used to set the value of uint32_t variables, and because the index already has this type. * llama : fix state size calculation Some bytes in the state were unaccounted for in llama_get_state_size. Since the logits reserve so much space, it did not cause problems.	2024-03-10 15:38:19 +08:00
Pierrick Hymbert	7a5c8bd3b6	ci : schedule slow server tests only on Release or on demand (#5839)	2024-03-10 15:38:19 +08:00
Pierrick Hymbert	909f62ef71	server : init http requests thread pool with --parallel if set (#5836)	2024-03-10 15:38:19 +08:00
Georgi Gerganov	49dab82b48	flake.lock: Update (#5842) Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01) → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29) → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23) → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-03-10 15:38:19 +08:00
Pierrick Hymbert	23f60349e3	server: tests: passkey challenge / self-extend with context shift demo (#5832) * server: tests: add models endpoint scenario * server: /v1/models add some metadata * server: tests: add debug field in context before scenario * server: tests: download model from HF, add batch size * server: tests: add passkey test * server: tests: add group attention params * server: do not truncate prompt tokens if self-extend through group attention is enabled * server: logs: do not truncate log values * server: tests - passkey - first good working value of nga * server: tests: fix server timeout * server: tests: fix passkey, add doc, fix regex content matching, fix timeout * server: tests: fix regex content matching * server: tests: schedule slow tests on master * server: metrics: fix when no prompt processed * server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1 * server: tests: increase timeout for completion * server: tests: keep only the PHI-2 test * server: tests: passkey add a negative test	2024-03-10 15:38:19 +08:00
Michael Podvitskiy	352c2f375f	llama : add abort_callback to interrupt computation (#5409) * using abort_callback from ggml to stop llama computation * format fix * a brief explaining comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 15:38:19 +08:00
Georgi Gerganov	afe9525a70	ggml : fix IQ3_S AVX implementation (#5834) ggml-ci	2024-03-10 15:38:19 +08:00
Jared Van Bortel	dcb8d4439a	convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821)	2024-03-10 15:38:19 +08:00
Jared Van Bortel	5fd9d9e1ad	convert-hf : make model class definitions self-contained (#5825)	2024-03-10 15:38:19 +08:00
Kawrakow	2ee066ad9e	ggml : IQ3_S improvements (#5829) * iq3_s: somewhat faster AVX2 dot product On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using 16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s. PP-512 increases to 28.5 t/s from 23.8 t/s. * iq3_s: somewhat faster ARM_NEON dot product Still dog slow - 10.7 t/s up from 9.9 t/s. * iq3_s: another small ARM_NEON improvement 10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick that works best on AVX2. * iq3_s: minor improvement on Metal 49.4 t/s -> 50.3 t/s * iq3_s: PPL improvement E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653. * iq3_s: use new grid everywhere * Fix ARM_NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-10 15:38:19 +08:00
Georgi Gerganov	28af53f508	scripts : add pod-llama.sh	2024-03-10 15:38:19 +08:00
Xuan Son Nguyen	a6ebb7be75	llama : refactor internal quantization functions (#5830)	2024-03-10 15:38:19 +08:00
compilade	316f837abd	llama : fix segfault from unknown model arch name (#5820) * llama : fix segfault from unknown model arch name * llama : make all LLM maps const This also requires using `std::map::at` instead of its `operator[]` which does not exist for const maps. * llama : name LLM_ARCH_UNKNOWN to "(unknown)" This avoids errors from `std::map::at` when getting the general name of the model architecture. Using "(unknown)" instead of an empty string as per suggestion https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284 * llama : remove redundant inner const for LLM_TENSOR_NAMES The extra const won't do anything here as const maps return const references to values. Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * llama : remove redundant nullptr check in llm_arch_from_string Since LLM_ARCH_NAMES is a const map, no spurious elements with a NULL name are inserted anymore, so this check is dead code. --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-03-10 15:38:19 +08:00
Neo Zhang Jianyu	e789d25713	Support multiple GPUs (split mode) on SYCL backend (#5806) * suport multiple cards: split-mode - layer|row * rm warning * rebase with master, support tow new OPs, close feature for -sm=row, fix for unit test * update news * fix merge error * update according to review comments	2024-03-10 15:38:19 +08:00
crasm	439467b4f6	workflows : remove nocleanup arg for check-requirements.sh (#5826) Reduces peak tmpfs usage and should prevent the check from failing from running out of space. Fixes the 'No space left on device' issue mentioned in #5703.	2024-03-10 15:38:19 +08:00
Tushar	337b5df18c	build(nix): Introduce flake.formatter for `nix fmt` (#5687) * build(nix): Introduce flake.formatter for `nix fmt` * chore: Switch to pkgs.nixfmt-rfc-style	2024-03-10 15:38:19 +08:00
nold	2d16323c7e	convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792)	2024-03-10 15:38:19 +08:00
Sourab Mangrulkar	81ee97614b	llama : add StarCoder2 support (#5795) * Add support for starcoder2 * handle rope type * skip rope freq and rotary embeddings from being serialized * resolve comments * Update llama.cpp * remove redundant changes * handle `rope-theta` * llama : change starcoder2 rope type * address comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 15:38:19 +08:00
Georgi Gerganov	3851d6bebe	server : remove api_like_OAI.py proxy script (#5808)	2024-03-10 15:38:19 +08:00
ddpasa	dba99b0778	ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813)	2024-03-10 15:38:19 +08:00
kunal-vaishnavi	e54be67b1b	gemma : fix bfloat16 -> float16 conversion issue (#5810)	2024-03-10 15:38:19 +08:00
Miwa / Ensan	d134b79c6a	common : fix flag `--logits-all` to `--all-logits` (#5805)	2024-03-10 15:38:19 +08:00
Pierrick Hymbert	cd3dca791b	llama : cleanup unused mmq flags (#5772) * cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q * remove: mul_mat_q in compare llama bench and usage * update llama-bench --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-03-10 15:38:19 +08:00
Douglas Hanley	51a38fc0c8	unicode : switch to multimap based nfd_map (#5799) * switch to multimap based nfd_map due to compile time issues * simplify multimap keys * dont construct new locale every time	2024-03-10 15:38:19 +08:00
Pierrick Hymbert	996a2ff344	server: allow to override threads server pool with --threads-http (#5794)	2024-03-10 15:38:19 +08:00
Eve	a4a260e57d	ci : add Ubuntu 22 Vulkan CI run (#5789)	2024-03-10 15:38:19 +08:00
Georgi Gerganov	644d40a95f	server : fix newlines in help (#5785)	2024-03-10 15:38:19 +08:00
AidanBeltonS	058951722f	[SYCL] Use batched mul_mat pathway (#5591) * Use batched mul_mat pathway * rm extra line * Explicitly state scaled data type --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-03-10 15:38:19 +08:00
Xuan Son Nguyen	3e0058dbfd	Server: normalize naming (#5779) * server: normalize naming * fix spacing	2024-03-10 15:38:19 +08:00
hazelnutcloud	4b08df1f8b	fix broken merge	2024-02-29 18:13:15 +08:00
hsnmkls	e063bce2b6	Merge branch 'ggerganov:master' into master	2024-02-29 16:21:59 +08:00
Marcus Dunn	d5ab29757e	llama : constified `llama_set_state_data`'s `src` (#5774)	2024-02-29 10:17:23 +02:00
Georgi Gerganov	87c91c0766	ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771) ggml-ci	2024-02-28 21:44:21 +02:00
Eve	317709b2a8	make portability_enumeration_ext apple only (#5757)	2024-02-28 20:33:37 +01:00
Georgi Gerganov	08c5ee87e4	llama : remove deprecated API (#5770) ggml-ci	2024-02-28 18:43:38 +02:00
Georgi Gerganov	78aacf3634	awq-py : remove (#5768)	2024-02-28 17:36:53 +02:00
Georgi Gerganov	8c0e8f4e73	sync : ggml	2024-02-28 11:17:32 +02:00
slaren	2774b0c974	add google magika inference example (ggml/748) * add magika inference example * ggml : fix unaligned accesses in custom ops * ggml : fix FP32 GELU for values that exceed the FP16 range * use ggml_pool_1d * add README * Update README.md * pad inputs if the files are too small * cleanup ggml-ci	2024-02-28 11:17:06 +02:00
UEXTM.com	5f70671856	Introduce backend GUIDs (ggml/743) * Introduce backend GUIDs Initial proposed implementation of backend GUIDs (Discussed in https://github.com/ggerganov/ggml/pull/741) Hardcoded CPU backend GUID (for now) Change ggml_backend_is_cpu logic to use GUID * Remove redundant functions Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion * Add spaces to match style Co-authored-by: slaren <slarengh@gmail.com> * Fix brace style to match Co-authored-by: slaren <slarengh@gmail.com> * Add void to () in function signature Co-authored-by: slaren <slarengh@gmail.com> * Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid * add guids to all backends ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-28 11:17:05 +02:00
Xuan Son Nguyen	a693bea1e6	server : hit Ctrl+C twice to exit (#5734) * server: twice ctrl+C to exit * std::atomic_flag * sigint: message * sigint: stderr * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-28 10:55:37 +02:00
compilade	adcb12a9ba	llama : fix non-quantization of expert gating tensors (#5754) This reverts a single line from #5475	2024-02-28 10:52:56 +02:00
Douglas Hanley	177628bfd8	llama : improve BERT tokenization (#5740) * implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-28 10:51:11 +02:00
Daniel Bevenius	6c4416868d	readme : add link to LLaVA 1.6 models (#5758) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-28 10:39:39 +02:00
Jorge A	efc72253f7	server : add "/chat/completions" alias for "/v1/...` (#5722) * Add "/chat/completions" as alias for "/v1/chat/completions" * merge to upstream master * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-28 10:39:15 +02:00


2351 commits