llama.cpp

Author	SHA1	Message	Date
digiwombat	7332b41f9f	Simple single-line server log for requests	2023-05-31 15:56:27 -04:00
Randall Fitzgerald	96fa480147	Merge pull request #6 from anon998/fix-multibyte Buffer incomplete multibyte characters + other stuff.	2023-05-31 12:14:43 -04:00
anon	3edaf6bd8b	print timings by default	2023-05-31 12:55:19 -03:00
anon	d58e48663d	default penalize_nl to false + format	2023-05-31 12:44:27 -03:00
anon	40e13805d9	print timings + build info I don't know if llama_free is needed but it was used in main.cpp.	2023-05-31 12:44:24 -03:00
anon	dd30219332	buffer incomplete multi-byte characters	2023-05-31 12:31:27 -03:00
anon	27911d6d68	fix default model alias	2023-05-31 12:31:25 -03:00
anon	aa2bbb2d35	fix parameter type	2023-05-31 11:14:34 -03:00
anon	f1710b90dc	add infinite generation when n_predict is -1	2023-05-31 11:14:34 -03:00
anon	284bc293b1	reserve memory for generated_text	2023-05-31 11:14:34 -03:00
anon	2c08f29691	make api server use only a single thread	2023-05-31 09:04:33 -03:00
anon	c1cbde82a1	print error when server can't bind to the interface	2023-05-31 09:04:16 -03:00
Randall Fitzgerald	9f2424ac47	Merge pull request #5 from anon998/stop-stream Stop generating tokens when the stream is closed.	2023-05-30 22:16:32 -04:00
anon	3a079d5cc8	stop generating when the stream is closed	2023-05-30 23:12:00 -03:00
anon	7a8104fbd2	add missing quote when printing stopping strings	2023-05-30 23:11:32 -03:00
digiwombat	b6f536dfb3	Cull to end of generated_text when encountering a stopping string in case it's a partial token. Will roll this back if it proves to be a problem.	2023-05-30 21:14:24 -04:00
Randall Fitzgerald	9197674a6b	Merge pull request #4 from anon998/logging Add the --verbose flag and request logging.	2023-05-30 20:58:18 -04:00
anon	aa0788b650	add --verbose flag and request logging	2023-05-30 21:45:56 -03:00
anon	7a853dc56d	prevent the server from swallowing exceptions in debug mode So it's easier to catch them inside a debugger.	2023-05-30 21:39:30 -03:00
Randall Fitzgerald	e6de69abfb	Merge pull request #3 from anon998/sse Add streaming via server-sent events. Has some changes that I didn't make, and I decided I prefer "stream" to "streaming"	2023-05-30 20:36:52 -04:00
Randall Fitzgerald	2533878b79	Merge branch 'master' into sse	2023-05-30 20:34:48 -04:00
digiwombat	a25f830fe1	Default streaming to false if it's not set in the request body.	2023-05-30 20:17:18 -04:00
digiwombat	38eaf2b7f7	Removed testing fprintf calls.	2023-05-30 19:48:43 -04:00
digiwombat	3292f057dc	Changed to single API endpoint for streaming and non. next-token endpoint removed. "as_loop" setting changed to "streaming"	2023-05-30 19:44:16 -04:00
anon	d6fff56e22	add streaming via server-sent events Removes /next-token endpoint and adds a "stream" parameter to the /completion one.	2023-05-30 19:33:33 -03:00
digiwombat	03ea8f013a	Fix for the regen issue.	2023-05-30 15:48:55 -04:00
Henri Vasserman	42cf4d8433	Merge branch 'master' into master	2023-05-29 01:05:19 +03:00
digiwombat	33b6957177	Fixed failing to return result on stopping token.	2023-05-28 16:45:05 -04:00
Johannes Gäßler	3b126f654f	LLAMA_DEBUG adds debug symbols (#1617)	2023-05-28 21:01:02 +02:00
digiwombat	6c58f64a3b	--ctx_size flag to --ctx-size to match common.cpp	2023-05-28 14:17:36 -04:00
digiwombat	b38d41ef52	--memory_f32 flag to --memory-f32 to match common.cpp	2023-05-28 13:58:25 -04:00
digiwombat	655899db89	Add ignore_eos option to generation settings.	2023-05-28 13:49:45 -04:00
Kerfuffle	1b78ed2081	Only show -ngl option when relevant + other doc/arg handling updates (#1625) 1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS) 2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible. 3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation 4. Update `main` and `server` examples documentation to use the new style dash separator argument format 5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility. 6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356	2023-05-28 11:48:57 -06:00
Vladimir Zorin	337aea1139	examples : add --alias option to gpt_params to set use friendly model name (#1614)	2023-05-28 20:14:24 +03:00
Howard Su	bb051d9723	opencl : no need to allocate cl_mem on heap (#1612)	2023-05-28 20:13:36 +03:00
Howard Su	ca74884f66	opencl : use strstr to check if fp16 supported (#1611) * Use strstr to check if fp16 supported * Ensure ext_buffer is null terminated	2023-05-28 20:09:56 +03:00
Randall Fitzgerald	2c9ee7a052	Apply suggestions from code review Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee>	2023-05-28 09:34:11 -07:00
Henri Vasserman	74c6f36bf1	Editorconfig suggested fixes delete whitespace	2023-05-28 19:19:34 +03:00
digiwombat	15ddc4903b	Merge remote-tracking branch 'slyecho/server_refactor'	2023-05-28 11:09:32 -04:00
Henri Vasserman	7186d655a1	seed and gen params	2023-05-28 17:03:01 +03:00
digiwombat	7740301db9	Set unspecified generation settings back to default. (Notes below) - If a given set of values coming along doesn't contain top_k for example, but did before, it would have stayed on the old value, I'm pretty sure. This fixes that. - I don't know if this could be done a bit prettier by just setting llama.params = gpt_params(); since I'm not sure how the default constructor would react since there's not one defined.	2023-05-28 09:18:47 -04:00
digiwombat	dda915cac4	Added capturing the stopping word and sending it along with the final JSON. Fixed an fprintf warning Fixed a bug that broke streaming Properly removed thread changing in json (only grabbed batch_size before)	2023-05-28 08:43:38 -04:00
digiwombat	2e5c5ee224	Changed JSON names to match the parameter name rather than the variable name.	2023-05-28 08:12:48 -04:00
digiwombat	23928f2887	Added generation_settings to final json object.	2023-05-28 08:04:05 -04:00
digiwombat	e8efd75492	Initial timeout code and expanded json return on completion. Now passing server params to the help printer so they defaults are ouput. Bad UTF while streaming now returns a replacement character (\uFFFD) Changed some error language very slightly. The JSON now returns extra values, only on `stop` for streaming requests. New JSON Return Values: - tokens_predicted (added to streaming) - seed (just pulls it from params, might return -1) - prompt (Might be useful) - generated_text (Full generated response for streaming requests)	2023-05-28 07:44:31 -04:00
digiwombat	177868e68a	Changed to params/args Seed is now set by the CLI, defaults to -1 if not seed is set Threads and batch size are now properly launch parameters.	2023-05-28 06:29:11 -04:00
Henri Vasserman	549291fe61	keep processed from the beginning this means no limit to the input prompt, it will just get reset again as normal	2023-05-28 12:11:41 +03:00
Randall Fitzgerald	df0e0d094c	Forgot to remove some testing code.	2023-05-28 12:11:14 +03:00
Randall Fitzgerald	f93fe36c5b	Add all generation parameters to server.cpp and allow resetting context sever.cpp left out a few generation parameters and also seems built to assume un-editable chatting with no regens or swipes. I added a simple "reload_ctx" flag that can be passed on generation that will cause the prompt to be reloaded.	2023-05-28 12:11:10 +03:00
Henri Vasserman	51e09944ce	server rewrite Remove unnecessary things and radically rewrite server	2023-05-28 12:10:16 +03:00

661 commits