Commit graph

716 commits

digiwombat
7332b41f9f Simple single-line server log for requests 2023-05-31 15:56:27 -04:00
Randall Fitzgerald
96fa480147
Merge pull request #6 from anon998/fix-multibyte
Buffer incomplete multibyte characters + other stuff.
2023-05-31 12:14:43 -04:00
anon
3edaf6bd8b print timings by default 2023-05-31 12:55:19 -03:00
anon
d58e48663d default penalize_nl to false + format 2023-05-31 12:44:27 -03:00
anon
40e13805d9 print timings + build info
I don't know if llama_free is needed but it was used in main.cpp.
2023-05-31 12:44:24 -03:00
anon
dd30219332 buffer incomplete multi-byte characters 2023-05-31 12:31:27 -03:00
anon
27911d6d68 fix default model alias 2023-05-31 12:31:25 -03:00
anon
aa2bbb2d35 fix parameter type 2023-05-31 11:14:34 -03:00
anon
f1710b90dc add infinite generation when n_predict is -1 2023-05-31 11:14:34 -03:00
anon
284bc293b1 reserve memory for generated_text 2023-05-31 11:14:34 -03:00
anon
2c08f29691 make api server use only a single thread 2023-05-31 09:04:33 -03:00
anon
c1cbde82a1 print error when server can't bind to the interface 2023-05-31 09:04:16 -03:00
Randall Fitzgerald
9f2424ac47
Merge pull request #5 from anon998/stop-stream
Stop generating tokens when the stream is closed.
2023-05-30 22:16:32 -04:00
anon
3a079d5cc8 stop generating when the stream is closed 2023-05-30 23:12:00 -03:00
anon
7a8104fbd2 add missing quote when printing stopping strings 2023-05-30 23:11:32 -03:00
digiwombat
b6f536dfb3 Cull to end of generated_text when encountering a stopping string in case it's a partial token.
Will roll this back if it proves to be a problem.
2023-05-30 21:14:24 -04:00
Randall Fitzgerald
9197674a6b
Merge pull request #4 from anon998/logging
Add the --verbose flag and request logging.
2023-05-30 20:58:18 -04:00
anon
aa0788b650 add --verbose flag and request logging 2023-05-30 21:45:56 -03:00
anon
7a853dc56d prevent the server from swallowing exceptions in debug mode
So it's easier to catch them inside a debugger.
2023-05-30 21:39:30 -03:00
Randall Fitzgerald
e6de69abfb
Merge pull request #3 from anon998/sse
Add streaming via server-sent events.
Has some changes that I didn't make, and I decided I prefer "stream" to "streaming"
2023-05-30 20:36:52 -04:00
Randall Fitzgerald
2533878b79
Merge branch 'master' into sse 2023-05-30 20:34:48 -04:00
digiwombat
a25f830fe1 Default streaming to false if it's not set in the request body. 2023-05-30 20:17:18 -04:00
digiwombat
38eaf2b7f7 Removed testing fprintf calls. 2023-05-30 19:48:43 -04:00
digiwombat
3292f057dc Changed to a single API endpoint for streaming and non-streaming.
The next-token endpoint was removed.
The "as_loop" setting was changed to "streaming".
2023-05-30 19:44:16 -04:00
anon
d6fff56e22 add streaming via server-sent events
Removes /next-token endpoint and adds a "stream" parameter to the
/completion one.
2023-05-30 19:33:33 -03:00
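As a rough sketch of the interface described in d6fff56e22 (not code from this repository): a client POSTs to /completion with "stream" set to true and reads the server-sent events line by line. The server address and the "content"/"stop" fields of the streamed chunks are assumptions for illustration; only the endpoint name and the "stream" parameter come from the commit message.

```python
# Hypothetical client for the streamed /completion endpoint added in d6fff56e22.
# The server address and the "content"/"stop" chunk fields are assumptions; only the
# endpoint name and the "stream" parameter are taken from the commit message.
import json
import requests

def stream_completion(prompt: str, url: str = "http://localhost:8080/completion") -> str:
    resp = requests.post(url, json={"prompt": prompt, "stream": True}, stream=True)
    text = ""
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip SSE comments, keep-alives, and blank lines
        chunk = json.loads(line[len(b"data: "):])
        text += chunk.get("content", "")
        if chunk.get("stop"):
            break
    return text
```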
digiwombat
03ea8f013a Fix for the regen issue. 2023-05-30 15:48:55 -04:00
Henri Vasserman
ffb06a345e
OpenLLaMA 3B support (#1588)
This adds support to llama.cpp to load the model.

Currently missing are changes that are required from convert.py to convert the model correctly. It needs some changes to start reading the JSON configuration for HF models instead of deriving the values by guessing.

Co-authored-by: FNsi <125447286+FNsi@users.noreply.github.com>
2023-05-30 21:24:22 +03:00
Georgi Gerganov
7552ac5863
ggml : sync cgraph import / export API 2023-05-29 19:31:44 +03:00
Georgi Gerganov
5d1830b99d
ggml : fix bug in ggml_alibi 2023-05-29 19:30:49 +03:00
DannyDaemonic
248367605e
Work around for recalculating logits in cached prompts (Fixes #1585) (#1609)
* Work around for recalculating logits in cached prompts
2023-05-29 05:13:40 -07:00
Jiří Podivín
0e730dd23b
Adding git to container package dependencies (#1621)
Git was added to the build packages so version information is available in the Docker image.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-05-28 21:45:50 -07:00
Henri Vasserman
42cf4d8433
Merge branch 'master' into master 2023-05-29 01:05:19 +03:00
digiwombat
33b6957177 Fixed failing to return result on stopping token. 2023-05-28 16:45:05 -04:00
Johannes Gäßler
3b126f654f
LLAMA_DEBUG adds debug symbols (#1617) 2023-05-28 21:01:02 +02:00
digiwombat
6c58f64a3b Renamed the --ctx_size flag to --ctx-size to match common.cpp 2023-05-28 14:17:36 -04:00
digiwombat
b38d41ef52 Renamed the --memory_f32 flag to --memory-f32 to match common.cpp 2023-05-28 13:58:25 -04:00
digiwombat
655899db89 Add ignore_eos option to generation settings. 2023-05-28 13:49:45 -04:00
Kerfuffle
1b78ed2081
Only show -ngl option when relevant + other doc/arg handling updates (#1625)
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
2023-05-28 11:48:57 -06:00
Vladimir Zorin
337aea1139
examples : add --alias option to gpt_params to set a user-friendly model name (#1614) 2023-05-28 20:14:24 +03:00
Howard Su
bb051d9723
opencl : no need to allocate cl_mem on heap (#1612) 2023-05-28 20:13:36 +03:00
Howard Su
ca74884f66
opencl : use strstr to check if fp16 supported (#1611)
* Use strstr to check if fp16 supported

* Ensure ext_buffer is null terminated
2023-05-28 20:09:56 +03:00
Randall Fitzgerald
2c9ee7a052
Apply suggestions from code review
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-05-28 09:34:11 -07:00
Henri Vasserman
74c6f36bf1
EditorConfig suggested fixes
delete whitespace
2023-05-28 19:19:34 +03:00
digiwombat
15ddc4903b Merge remote-tracking branch 'slyecho/server_refactor' 2023-05-28 11:09:32 -04:00
Henri Vasserman
7186d655a1
seed and gen params 2023-05-28 17:03:01 +03:00
digiwombat
7740301db9 Set unspecified generation settings back to default. (Notes below)
- If an incoming request omits a value such as top_k that a previous request had set, it would most likely have stayed on the old value. This fixes that.
- This could perhaps be done more cleanly by just setting llama.params = gpt_params();, but it's unclear how the default constructor would behave since one isn't explicitly defined.
2023-05-28 09:18:47 -04:00
digiwombat
dda915cac4 Added capturing the stopping word and sending it along with the final JSON.
Fixed an fprintf warning
Fixed a bug that broke streaming
Properly removed thread changing in json (only grabbed batch_size before)
2023-05-28 08:43:38 -04:00
digiwombat
2e5c5ee224 Changed JSON names to match the parameter name rather than the variable name. 2023-05-28 08:12:48 -04:00
digiwombat
23928f2887 Added generation_settings to final json object. 2023-05-28 08:04:05 -04:00
digiwombat
e8efd75492 Initial timeout code and expanded JSON return on completion.
Now passing server params to the help printer so their defaults are output.
Bad UTF-8 while streaming now returns a replacement character (\uFFFD).
Changed some error language very slightly.
The JSON now returns extra values, only on `stop` for streaming requests.
New JSON return values:
  - tokens_predicted (added to streaming)
  - seed (just pulls it from params, might return -1)
  - prompt (might be useful)
  - generated_text (full generated response for streaming requests)
2023-05-28 07:44:31 -04:00
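Putting together the fields listed in e8efd75492 and 23928f2887, the final `stop` object for a streaming request might look roughly like the sketch below; the values and the exact shape (including the `stop` flag and the contents of generation_settings) are illustrative assumptions, and only the field names come from the commit messages above.

```python
# Hypothetical example of the final streamed JSON object described in the commits above.
# All values are invented; the "stop" flag and generation_settings contents are assumed.
final_chunk = {
    "stop": True,                      # assumed marker for the last SSE message
    "tokens_predicted": 128,           # added to streaming responses
    "seed": -1,                        # pulled from params; may be -1
    "prompt": "Hello",                 # the prompt that was sent
    "generated_text": "Hello there!",  # full generated response for streaming requests
    "generation_settings": {           # from 23928f2887; keys shown are assumptions
        "top_k": 40,
        "top_p": 0.95,
    },
}
```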