llama.cpp

Author	SHA1	Message	Date
pudepiedj	71f885f2d0	Llamaserver.py changes	2024-02-29 16:56:51 +00:00
pudepiedj	a451708e90	n_ctx change	2024-02-29 12:40:53 +00:00
pudepiedj	ee7f05b52b	Exploring stdout redirection	2024-02-28 12:22:25 +00:00
pudepiedj	b56b9895ed	std::cerr	2024-02-28 12:05:08 +00:00
pudepiedj	dade1cefd4	Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch	2024-02-28 12:03:04 +00:00
pudepiedj	09e087f691	Merge remote-tracking branch 'origin/master' into server_branch	2024-02-28 12:03:01 +00:00
pudepiedj	7516a5b9ee	Merge branch 'ggerganov:master' into server_branch	2024-02-28 12:01:51 +00:00
pudepiedj	9f40bb7983	LOG_VERBOSE sorted	2024-02-28 11:59:45 +00:00
Georgi Gerganov	8c0e8f4e73	sync : ggml	2024-02-28 11:17:32 +02:00
slaren	2774b0c974	add google magika inference example (ggml/748) * add magika inference example * ggml : fix unaligned accesses in custom ops * ggml : fix FP32 GELU for values that exceed the FP16 range * use ggml_pool_1d * add README * Update README.md * pad inputs if the files are too small * cleanup ggml-ci	2024-02-28 11:17:06 +02:00
UEXTM.com	5f70671856	Introduce backend GUIDs (ggml/743) * Introduce backend GUIDs Initial proposed implementation of backend GUIDs (Discussed in https://github.com/ggerganov/ggml/pull/741) Hardcoded CPU backend GUID (for now) Change ggml_backend_is_cpu logic to use GUID * Remove redundant functions Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion * Add spaces to match style Co-authored-by: slaren <slarengh@gmail.com> * Fix brace style to match Co-authored-by: slaren <slarengh@gmail.com> * Add void to () in function signature Co-authored-by: slaren <slarengh@gmail.com> * Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid * add guids to all backends ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-28 11:17:05 +02:00
Xuan Son Nguyen	a693bea1e6	server : hit Ctrl+C twice to exit (#5734) * server: twice ctrl+C to exit * std::atomic_flag * sigint: message * sigint: stderr * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-28 10:55:37 +02:00
compilade	adcb12a9ba	llama : fix non-quantization of expert gating tensors (#5754) This reverts a single line from #5475	2024-02-28 10:52:56 +02:00
Douglas Hanley	177628bfd8	llama : improve BERT tokenization (#5740) * implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-28 10:51:11 +02:00
Daniel Bevenius	6c4416868d	readme : add link to LLaVA 1.6 models (#5758) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-28 10:39:39 +02:00
Jorge A	efc72253f7	server : add "/chat/completions" alias for "/v1/...` (#5722) * Add "/chat/completions" as alias for "/v1/chat/completions" * merge to upstream master * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-28 10:39:15 +02:00
Kawrakow	7c4263d426	ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (#5760) * WIP: make i-quants work for QK_K = 64 * iq2_xs: attempt to fix AVX dot product for QK_K = 64 Tests pass, but I get gibberish. * QK_K = 64 tests pass on ARM_NEON and Metal Sadly, that does not mean it actually works. * Make CUDA compile with QK_K = 64 Tests don't pass, plus we get misaligned access * Q2_K: fixed bug in imatrix quantization for QK_K = 64 * iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-28 10:37:02 +02:00
pudepiedj	ebdc0d3907	Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch	2024-02-27 22:27:12 +00:00
pudepiedj	87d501fc10	Enable log redirection	2024-02-27 22:27:10 +00:00
Kawrakow	cb49e0f8c9	Attempt to fix android build (#5752) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-27 19:16:49 +02:00
pudepiedj	fddedfb950	Merge branch 'ggerganov:master' into server_branch	2024-02-27 15:53:43 +00:00
Kawrakow	0becb22ac0	IQ4_XS: a 4.25 bpw quantization (#5747) * Try IQ4_NL with blocks of 64 - does not look good * iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32 * iq4_xs: CUDA works - 133.2 t/s * iq4_xs: AVX2 dot product * iq4_xs: ARM_NEON dot product * iq4_nl: Metal implementation As usual, Metal / Apple Silicon don't like my quants. * iq3_xs: minor fix * iq4_xs: shrink by using IQ3_S for attn_k and attn_q * iq4_xs: revert using IQ3_S for attn_k and attn_v PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS. * Fix CI * iq4_xs: Added forgotten check for 256 divisibility --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-27 16:34:24 +02:00
pudepiedj	e8c37fd893	Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch	2024-02-27 14:01:24 +00:00
pudepiedj	6e9b6a18fb	extra comments	2024-02-27 14:01:22 +00:00
pudepiedj	ad4f567d8e	Merge branch 'ggerganov:master' into server_branch	2024-02-27 13:39:30 +00:00
Engininja2	c24a2a6e60	cuda : replace remaining shfl_xor with calls to warp_reduce functions (#5744)	2024-02-27 14:22:45 +01:00
Engininja2	1f30b7a9f1	ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)	2024-02-27 14:50:18 +02:00
pudepiedj	d17bba34bc	Merge branch 'server_branch' of https://github.com/pudepiedj/llama.cpp into server_branch	2024-02-27 12:38:18 +00:00
pudepiedj	5854b0b86d	Improved apikey code	2024-02-27 12:38:16 +00:00
Georgi Gerganov	9d533a77d0	llama : fix defrag bugs + add parameter (#5735) * llama : fix defrag bugs + enable by default ggml-ci * llama : add defrag_thold parameter ggml-ci * llama : cont * llama : disable log message ggml-ci * llama : fix graph size check during defrag	2024-02-27 14:35:51 +02:00
le.chang	cbbd1efa06	Makefile: use variables for cublas (#5689) * make: use arch variable for cublas * fix UNAME_M * check opt first --------- Co-authored-by: lindeer <le.chang118@gmail.com>	2024-02-27 03:03:06 +01:00
Xuan Son Nguyen	b11a93df41	fix server hangs on empty prompt (#5733)	2024-02-26 23:15:48 +01:00
pudepiedj	4bd4ac931c	Merge branch 'ggerganov:master' into server_branch	2024-02-26 17:10:40 +00:00
pudepiedj	02702d975d	Server header and README.md	2024-02-26 17:09:04 +00:00
Kawrakow	a33e6a0d2a	Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721) * Adding IQ2_S and IQ2_M as a single cumulative commit * Update examples/quantize/quantize.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-26 18:28:38 +02:00
pudepiedj	465ced3808	Server update	2024-02-26 16:16:23 +00:00
Johannes Gäßler	47bb7b48c7	CUDA: fix DEBUG_CUDA_MALLOC (#5729)	2024-02-26 15:36:38 +01:00
Artem	c4d7f81786	readme : update ui list (#5731) * Add LLMFarm (ui for iOS) to list	2024-02-26 16:15:28 +02:00
AidanBeltonS	e849078c6e	[SYCL] Add support for soft_max ALiBi (#5639) * Add support for bias * Update pre-processor * rm commented code * fix format * fix CI --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-02-26 19:32:11 +05:30
pudepiedj	3c23413b8b	Adjust print_timings	2024-02-26 13:49:35 +00:00
pudepiedj	2768634743	Merge remote-tracking branch 'origin/master' into server_branch	2024-02-26 13:31:50 +00:00
pudepiedj	74d13ef335	Server updates	2024-02-26 12:09:06 +00:00
Georgi Gerganov	67fd33132f	unicode : reuse iterator (#5726)	2024-02-26 14:02:12 +02:00
Pierrick Hymbert	4804215cb8	server: CI fix trailing space (#5728)	2024-02-26 12:41:34 +02:00
Pierrick Hymbert	8a533f0d90	server: CI tests reduce build matrix (#5725)	2024-02-26 09:56:10 +01:00
Georgi Gerganov	269de86ba0	llama : fix Gemma rope type (#5691)	2024-02-26 08:30:17 +02:00
github-actions[bot]	c393733988	flake.lock: Update Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16) → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)	2024-02-25 22:24:22 +00:00
Pierrick Hymbert	e3965cf35a	server: tests - slow inference causes timeout on the CI (#5715) * server: tests - longer inference timeout for CI	2024-02-25 22:48:33 +01:00
Pierrick Hymbert	8b350356b2	server: docs - refresh and tease a little bit more the http server (#5718) * server: docs - refresh and tease a little bit more the http server * Rephrase README.md server doc Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-25 21:46:29 +01:00
Georgi Gerganov	bf08e00643	llama : refactor k-shift implementation + KV defragmentation (#5691) * llama : refactor k-shift implementation ggml-ci * llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add * llama : cont k-shift refactoring + normalize type names ggml-ci * minor : fix MPI builds * llama : reuse n_rot from the build context ggml-ci * llama : revert enum name changes from this PR ggml-ci * llama : update llama_rope_type * llama : add comment about rope values * llama : fix build * passkey : apply kv cache updates explicitly ggml-ci * llama : change name to llama_kv_cache_update() * llama : add llama_kv_cache_seq_pos_max() * passkey : fix llama_kv_cache_seq_pos_max() usage * llama : some llama_kv_cell simplifications * llama : add llama_kv_cache_compress (EXPERIMENTAL) * llama : add alternative KV cache merging (EXPERIMENTAL) * llama : add llama_kv_cache_defrag * llama : comments * llama : remove llama_kv_cache_compress will add in a separate PR ggml-ci * llama : defragment via non-overlapping moves * llama : ggml_graph based defrag implementation ggml-ci * llama : switch the loop order in build_defrag * llama : add comments	2024-02-25 22:12:24 +02:00

2345 commits