llama.cpp

Author	SHA1	Message	Date
pudepiedj	f2f002d9af	Correct whitespace/nl editor config	2024-03-04 08:02:28 +00:00
pudepiedj	4089657815	Remove extraneous files	2024-03-04 07:54:00 +00:00
pudepiedj	d532d5b1f7	Remove rtf files	2024-03-04 07:44:37 +00:00
pudepiedj	eb3da36e89	Delete rb and vca modules	2024-03-04 07:43:25 +00:00
pudepiedj	f44e9456a2	server update	2024-03-04 07:18:18 +00:00
pudepiedj	96ddeac1c6	Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch	2024-03-03 11:20:12 +00:00
pudepiedj	480089d00d	improve Llamaserver.py	2024-03-03 11:20:10 +00:00
pudepiedj	54bea4428f	Merge branch 'ggerganov:master' into server_branch	2024-03-03 11:19:25 +00:00
Georgi Gerganov	231ae28f07	readme : add API changes section	2024-03-03 12:44:03 +02:00
Douglas Hanley	475df1d6cf	llama : allow for user specified embedding pooling type (#5849 ) * allow for user specified pooling type * llama : use enum types over int --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-03 12:40:27 +02:00
Nindaleth	87c2e8b279	gguf-dump : support i-quants (#5841 ) Co-authored-by: Black_Fox <radekliska@gmail.com>	2024-03-03 10:43:42 +02:00
compilade	de9692a7d2	llama : fix llama_copy_state_data with fragmented KV cache (#5840 ) The row size of the saved states was based on kv_self.head while it should be based on llama_kv_cache_cell_max. Existing session files should still work. * llama : fix llama_kv_cache_cell_max inability to return 1 I've also changed its return type to uint32_t, because this function is always used to set the value of uint32_t variables, and because the index already has this type. * llama : fix state size calculation Some bytes in the state were unaccounted for in llama_get_state_size. Since the logits reserve so much space, it did not cause problems.	2024-03-03 10:41:55 +02:00
Pierrick Hymbert	e6029348e8	ci : schedule slow server tests only on Release or on demand (#5839 )	2024-03-03 10:35:23 +02:00
Pierrick Hymbert	8ef969afce	server : init http requests thread pool with --parallel if set (#5836 )	2024-03-03 09:48:36 +02:00
pudepiedj	265741aa0f	Merge remote-tracking branch 'origin/master' into server_branch	2024-03-03 06:56:31 +00:00
Georgi Gerganov	fa974646e1	flake.lock: Update (#5842 ) Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01) → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29) → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23) → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-03-02 20:11:31 -08:00
pudepiedj	f3bb1e55c6	Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch	2024-03-02 22:10:30 +00:00
pudepiedj	bf366d2d9a	add api key	2024-03-02 22:10:28 +00:00
Pierrick Hymbert	9731134296	server: tests: passkey challenge / self-extend with context shift demo (#5832 ) * server: tests: add models endpoint scenario * server: /v1/models add some metadata * server: tests: add debug field in context before scenario * server: tests: download model from HF, add batch size * server: tests: add passkey test * server: tests: add group attention params * server: do not truncate prompt tokens if self-extend through group attention is enabled * server: logs: do not truncate log values * server: tests - passkey - first good working value of nga * server: tests: fix server timeout * server: tests: fix passkey, add doc, fix regex content matching, fix timeout * server: tests: fix regex content matching * server: tests: schedule slow tests on master * server: metrics: fix when no prompt processed * server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1 * server: tests: increase timeout for completion * server: tests: keep only the PHI-2 test * server: tests: passkey add a negative test	2024-03-02 22:00:14 +01:00
Michael Podvitskiy	4a6e2d6142	llama : add abort_callback to interrupt computation (#5409 ) * using abort_callback from ggml to stop llama computation * format fix * a brief explaining comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-02 21:52:25 +02:00
Georgi Gerganov	494c870326	ggml : fix IQ3_S AVX implementation (#5834 ) ggml-ci	2024-03-02 20:00:49 +02:00
Jared Van Bortel	4d4d2366fc	convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821 )	2024-03-02 12:27:26 -05:00
Jared Van Bortel	c7a0ad8ec9	convert-hf : make model class definitions self-contained (#5825 )	2024-03-02 12:21:47 -05:00
Kawrakow	bbde6eb256	ggml : IQ3_S improvements (#5829 ) * iq3_s: somewhat faster AVX2 dot product On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using 16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s. PP-512 increases to 28.5 t/s from 23.8 t/s. * iq3_s: somewhat faster ARM_NEON dot product Still dog slow - 10.7 t/s up from 9.9 t/s. * iq3_s: another small ARM_NEON improvement 10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick that works best on AVX2. * iq3_s: minor improvement on Metal 49.4 t/s -> 50.3 t/s * iq3_s: PPL improvement E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653. * iq3_s: use new grid everywhere * Fix ARM_NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-02 17:00:51 +02:00
Georgi Gerganov	ef2cd694c4	scripts : add pod-llama.sh	2024-03-02 16:54:20 +02:00
Xuan Son Nguyen	6c32d8c7ad	llama : refactor internal quantization functions (#5830 )	2024-03-02 16:19:09 +02:00
compilade	802da0091b	llama : fix segfault from unknown model arch name (#5820 ) * llama : fix segfault from unknown model arch name * llama : make all LLM maps const This also requires using `std::map::at` instead of its `operator[]` which does not exist for const maps. * llama : name LLM_ARCH_UNKNOWN to "(unknown)" This avoids errors from `std::map::at` when getting the general name of the model architecture. Using "(unknown)" instead of an empty string as per suggestion https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284 * llama : remove redundant inner const for LLM_TENSOR_NAMES The extra const won't do anything here as const maps return const references to values. Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * llama : remove redundant nullptr check in llm_arch_from_string Since LLM_ARCH_NAMES is a const map, no spurious elements with a NULL name are inserted anymore, so this check is dead code. --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-03-02 15:42:56 +02:00
pudepiedj	8bda1c1041	Merge branch 'ggerganov:master' into server_branch	2024-03-02 12:09:07 +00:00
Neo Zhang Jianyu	715641391d	Support multiple GPUs (split mode) on SYCL backend (#5806 ) * suport multiple cards: split-mode - layer|row * rm warning * rebase with master, support tow new OPs, close feature for -sm=row, fix for unit test * update news * fix merge error * update according to review comments	2024-03-02 19:49:30 +08:00
pudepiedj	68814783c5	Merge remote-tracking branch 'origin/master' into server_branch	2024-03-02 10:28:37 +00:00
pudepiedj	5d61ae8d2a	Renaming some vars	2024-03-02 10:24:07 +00:00
crasm	9bf297a02b	workflows : remove nocleanup arg for check-requirements.sh (#5826 ) Reduces peak tmpfs usage and should prevent the check from failing from running out of space. Fixes the 'No space left on device' issue mentioned in #5703.	2024-03-02 00:11:06 -05:00
Tushar	cb5e8f7fc4	build(nix): Introduce flake.formatter for `nix fmt` (#5687 ) * build(nix): Introduce flake.formatter for `nix fmt` * chore: Switch to pkgs.nixfmt-rfc-style	2024-03-01 15:18:26 -08:00
nold	da3b9ba2b7	convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792 )	2024-03-01 16:51:12 -05:00
Sourab Mangrulkar	c29af7e225	llama : add StarCoder2 support (#5795 ) * Add support for starcoder2 * handle rope type * skip rope freq and rotary embeddings from being serialized * resolve comments * Update llama.cpp * remove redundant changes * handle `rope-theta` * llama : change starcoder2 rope type * address comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-01 21:30:46 +02:00
Georgi Gerganov	38d16b1426	server : remove api_like_OAI.py proxy script (#5808 )	2024-03-01 20:00:58 +02:00
pudepiedj	f51554180a	Merge remote-tracking branch 'origin/master' into server_branch	2024-03-01 17:26:01 +00:00
ddpasa	c2224f003b	ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813 )	2024-03-01 18:00:00 +01:00
pudepiedj	b47525df0a	server tweak	2024-03-01 15:53:56 +00:00
kunal-vaishnavi	e743386728	gemma : fix bfloat16 -> float16 conversion issue (#5810 )	2024-03-01 16:08:08 +02:00
Miwa / Ensan	f49a535686	common : fix flag `--logits-all` to `--all-logits` (#5805 )	2024-03-01 15:48:56 +02:00
Pierrick Hymbert	3ab8b3a92e	llama : cleanup unused mmq flags (#5772 ) * cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q * remove: mul_mat_q in compare llama bench and usage * update llama-bench --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-03-01 13:39:06 +02:00
Douglas Hanley	9600d59e01	unicode : switch to multimap based nfd_map (#5799 ) * switch to multimap based nfd_map due to compile time issues * simplify multimap keys * dont construct new locale every time	2024-03-01 11:15:36 +02:00
Pierrick Hymbert	5cb02b4a01	server: allow to override threads server pool with --threads-http (#5794 )	2024-03-01 10:08:08 +01:00
Eve	6ea0f010ff	ci : add Ubuntu 22 Vulkan CI run (#5789 )	2024-03-01 10:54:53 +02:00
Georgi Gerganov	f105471ef6	server : fix newlines in help (#5785 )	2024-03-01 09:59:43 +02:00
AidanBeltonS	38d1521608	[SYCL] Use batched mul_mat pathway (#5591 ) * Use batched mul_mat pathway * rm extra line * Explicitly state scaled data type --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-03-01 13:06:47 +05:30
Xuan Son Nguyen	052051d8ae	Server: normalize naming (#5779 ) * server: normalize naming * fix spacing	2024-02-29 21:42:11 +01:00
pudepiedj	13d0948fdc	server tweak	2024-02-29 18:14:08 +00:00
pudepiedj	71f885f2d0	Llamaserver.py changes	2024-02-29 16:56:51 +00:00

2399 commits