digiwombat
7332b41f9f
Simple single-line server log for requests
2023-05-31 15:56:27 -04:00
Randall Fitzgerald
96fa480147
Merge pull request #6 from anon998/fix-multibyte
...
Buffer incomplete multibyte characters + other stuff.
2023-05-31 12:14:43 -04:00
anon
3edaf6bd8b
print timings by default
2023-05-31 12:55:19 -03:00
anon
d58e48663d
default penalize_nl to false + format
2023-05-31 12:44:27 -03:00
anon
40e13805d9
print timings + build info
...
I don't know if llama_free is needed but it was used in main.cpp.
2023-05-31 12:44:24 -03:00
anon
dd30219332
buffer incomplete multi-byte characters
2023-05-31 12:31:27 -03:00
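The buffering idea in this commit can be illustrated with a short sketch. This is not the code from the commit: `complete_utf8_prefix` is a hypothetical helper that reports how many leading bytes of a buffer form complete UTF-8 sequences, so a streamer could send that prefix and hold the remainder until the next token arrives.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the commit): returns how many leading bytes
// of `text` can be sent safely, i.e. everything before a trailing,
// still-incomplete UTF-8 sequence.
std::size_t complete_utf8_prefix(const std::string &text) {
    const std::size_t n = text.size();
    // A UTF-8 sequence is at most 4 bytes, so only the last 4 bytes matter.
    for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(text[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte: keep scanning back for the lead byte
        }
        std::size_t need = 1;                   // ASCII or unrecognized byte
        if ((c & 0xE0) == 0xC0) need = 2;       // 110xxxxx
        else if ((c & 0xF0) == 0xE0) need = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) need = 4;  // 11110xxx
        // `back` bytes are available starting at the lead byte.
        return need > back ? n - back : n;
    }
    return n; // empty input, or no lead byte within the last 4 bytes
}
```

A caller would send `text.substr(0, k)` and keep `text.substr(k)` in the buffer, where `k` is the returned value.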
anon
27911d6d68
fix default model alias
2023-05-31 12:31:25 -03:00
anon
aa2bbb2d35
fix parameter type
2023-05-31 11:14:34 -03:00
anon
f1710b90dc
add infinite generation when n_predict is -1
2023-05-31 11:14:34 -03:00
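A minimal sketch of the loop condition this commit describes, assuming a token source that returns an empty string at end of stream. The `generate` function and its `next_token` callback are hypothetical and only illustrate treating `n_predict == -1` as "no limit".

```cpp
#include <functional>
#include <string>

// Hypothetical generation loop: n_predict == -1 disables the token budget,
// so generation only stops when the token source signals the end.
std::string generate(int n_predict,
                     const std::function<std::string()> &next_token) {
    std::string out;
    for (int n = 0; n_predict == -1 || n < n_predict; ++n) {
        const std::string tok = next_token();
        if (tok.empty()) {
            break; // assumed end-of-stream marker for this sketch
        }
        out += tok;
    }
    return out;
}
```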
anon
284bc293b1
reserve memory for generated_text
2023-05-31 11:14:34 -03:00
anon
2c08f29691
make api server use only a single thread
2023-05-31 09:04:33 -03:00
anon
c1cbde82a1
print error when server can't bind to the interface
2023-05-31 09:04:16 -03:00
Randall Fitzgerald
9f2424ac47
Merge pull request #5 from anon998/stop-stream
...
Stop generating tokens when the stream is closed.
2023-05-30 22:16:32 -04:00
anon
3a079d5cc8
stop generating when the stream is closed
2023-05-30 23:12:00 -03:00
anon
7a8104fbd2
add missing quote when printing stopping strings
2023-05-30 23:11:32 -03:00
digiwombat
b6f536dfb3
Cull to end of generated_text when encountering a stopping string in case it's a partial token.
...
Will roll this back if it proves to be a problem.
2023-05-30 21:14:24 -04:00
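One reading of this change is that everything from the stopping string onward is dropped from `generated_text`. The sketch below shows that reading only; `cull_at_stop` is a hypothetical helper, not the function used in the commit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: truncates `generated_text` at the earliest stopping
// string and returns the string that matched (empty if none matched).
std::string cull_at_stop(std::string &generated_text,
                         const std::vector<std::string> &stops) {
    std::size_t best = std::string::npos;
    std::string matched;
    for (const auto &s : stops) {
        if (s.empty()) {
            continue; // an empty stop string would match at position 0
        }
        const std::size_t pos = generated_text.find(s);
        if (pos < best) { // npos is the largest value, so misses never win
            best = pos;
            matched = s;
        }
    }
    if (best != std::string::npos) {
        generated_text.erase(best); // drop the stop string and what follows
    }
    return matched;
}
```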
Randall Fitzgerald
9197674a6b
Merge pull request #4 from anon998/logging
...
Add the --verbose flag and request logging.
2023-05-30 20:58:18 -04:00
anon
aa0788b650
add --verbose flag and request logging
2023-05-30 21:45:56 -03:00
anon
7a853dc56d
prevent the server from swallowing exceptions in debug mode
...
So it's easier to catch them inside a debugger.
2023-05-30 21:39:30 -03:00
Randall Fitzgerald
e6de69abfb
Merge pull request #3 from anon998/sse
...
Add streaming via server-sent events.
Has some changes that I didn't make, and I decided I prefer "stream" to "streaming"
2023-05-30 20:36:52 -04:00
Randall Fitzgerald
2533878b79
Merge branch 'master' into sse
2023-05-30 20:34:48 -04:00
digiwombat
a25f830fe1
Default streaming to false if it's not set in the request body.
2023-05-30 20:17:18 -04:00
digiwombat
38eaf2b7f7
Removed testing fprintf calls.
2023-05-30 19:48:43 -04:00
digiwombat
3292f057dc
Changed to single API endpoint for streaming and non.
...
next-token endpoint removed.
"as_loop" setting changed to "streaming"
2023-05-30 19:44:16 -04:00
anon
d6fff56e22
add streaming via server-sent events
...
Removes /next-token endpoint and adds a "stream" parameter to the
/completion one.
2023-05-30 19:33:33 -03:00
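For readers unfamiliar with server-sent events, each event on the wire is one or more `data:` lines followed by a blank line. The sketch below frames a payload that way; `sse_event` is a hypothetical helper, and the JSON payload in the usage note is illustrative rather than the server's actual response shape.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: wraps a payload in server-sent event framing.
// Each payload line becomes its own "data: " line; a blank line ends the event.
std::string sse_event(const std::string &payload) {
    std::string out;
    std::istringstream lines(payload);
    std::string line;
    while (std::getline(lines, line)) {
        out += "data: " + line + "\n";
    }
    if (out.empty()) {
        out = "data: \n"; // an empty payload still produces one data line
    }
    out += "\n"; // blank line terminates the event
    return out;
}
```

For example, `sse_event("{\"content\":\"hi\"}")` yields the line `data: {"content":"hi"}` followed by an empty line.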
digiwombat
03ea8f013a
Fix for the regen issue.
2023-05-30 15:48:55 -04:00
Henri Vasserman
42cf4d8433
Merge branch 'master' into master
2023-05-29 01:05:19 +03:00
digiwombat
33b6957177
Fixed failing to return result on stopping token.
2023-05-28 16:45:05 -04:00
Johannes Gäßler
3b126f654f
LLAMA_DEBUG adds debug symbols ( #1617 )
2023-05-28 21:01:02 +02:00
digiwombat
6c58f64a3b
--ctx_size flag to --ctx-size to match common.cpp
2023-05-28 14:17:36 -04:00
digiwombat
b38d41ef52
--memory_f32 flag to --memory-f32 to match common.cpp
2023-05-28 13:58:25 -04:00
digiwombat
655899db89
Add ignore_eos option to generation settings.
2023-05-28 13:49:45 -04:00
Kerfuffle
1b78ed2081
Only show -ngl option when relevant + other doc/arg handling updates ( #1625 )
...
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and add `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
2023-05-28 11:48:57 -06:00
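One way to accept both the underscore and dash spellings mentioned in item 5 is to normalize long options before comparing them. This is a sketch of that idea under the assumption that only `--long` options need it; `normalize_flag` is a hypothetical helper, and the commit itself may simply compare against both spellings.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: rewrites underscores to dashes in long options only,
// so "--ctx_size" and "--ctx-size" compare equal after normalization.
// Short options and plain values (e.g. file names) are left untouched.
std::string normalize_flag(std::string arg) {
    if (arg.rfind("--", 0) == 0) { // starts with "--"
        std::replace(arg.begin() + 2, arg.end(), '_', '-');
    }
    return arg;
}
```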
Vladimir Zorin
337aea1139
examples : add --alias option to gpt_params to set user-friendly model name ( #1614 )
2023-05-28 20:14:24 +03:00
Howard Su
bb051d9723
opencl : no need to allocate cl_mem on heap ( #1612 )
2023-05-28 20:13:36 +03:00
Howard Su
ca74884f66
opencl : use strstr to check if fp16 supported ( #1611 )
...
* Use strstr to check if fp16 supported
* Ensure ext_buffer is null terminated
2023-05-28 20:09:56 +03:00
Randall Fitzgerald
2c9ee7a052
Apply suggestions from code review
...
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
2023-05-28 09:34:11 -07:00
Henri Vasserman
74c6f36bf1
Editorconfig suggested fixes
...
delete whitespace
2023-05-28 19:19:34 +03:00
digiwombat
15ddc4903b
Merge remote-tracking branch 'slyecho/server_refactor'
2023-05-28 11:09:32 -04:00
Henri Vasserman
7186d655a1
seed and gen params
2023-05-28 17:03:01 +03:00
digiwombat
7740301db9
Set unspecified generation settings back to default. (Notes below)
...
- If a given set of values coming along doesn't contain top_k for example, but did before, it would have stayed on the old value, I'm pretty sure. This fixes that.
- I don't know if this could be done a bit prettier by just setting llama.params = gpt_params(); since I'm not sure how the default constructor would react since there's not one defined.
2023-05-28 09:18:47 -04:00
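The stale-setting problem described in the first note can be sketched with a stand-in struct. `GenSettings`, its default values, and `apply_request` are all hypothetical; the real `gpt_params` has more fields and may behave differently. In standard C++, a struct whose members have default initializers gets an implicit default constructor that applies them, so assigning a freshly constructed value restores the defaults before the request's fields are applied.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the generation settings; the defaults are
// placeholders chosen for this sketch.
struct GenSettings {
    int top_k = 40;
    float top_p = 0.95f;
    float temp = 0.80f;
};

// Resets every setting to its default, then applies only the fields that
// are present in the request body, so values from earlier requests never leak.
void apply_request(GenSettings &settings,
                   const std::map<std::string, double> &body) {
    settings = GenSettings{}; // restore defaults first

    auto it = body.find("top_k");
    if (it != body.end()) settings.top_k = static_cast<int>(it->second);

    it = body.find("top_p");
    if (it != body.end()) settings.top_p = static_cast<float>(it->second);

    it = body.find("temperature");
    if (it != body.end()) settings.temp = static_cast<float>(it->second);
}
```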
digiwombat
dda915cac4
Added capturing the stopping word and sending it along with the final JSON.
...
Fixed an fprintf warning
Fixed a bug that broke streaming
Properly removed thread changing in json (only grabbed batch_size before)
2023-05-28 08:43:38 -04:00
digiwombat
2e5c5ee224
Changed JSON names to match the parameter name rather than the variable name.
2023-05-28 08:12:48 -04:00
digiwombat
23928f2887
Added generation_settings to final json object.
2023-05-28 08:04:05 -04:00
digiwombat
e8efd75492
Initial timeout code and expanded json return on completion.
...
Now passing server params to the help printer so their defaults are output.
Bad UTF while streaming now returns a replacement character (\uFFFD)
Changed some error language very slightly.
The JSON now returns extra values, only on `stop` for streaming requests.
New JSON Return Values:
- tokens_predicted (added to streaming)
- seed (just pulls it from params, might return -1)
- prompt (Might be useful)
- generated_text (Full generated response for streaming requests)
2023-05-28 07:44:31 -04:00
digiwombat
177868e68a
Changed to params/args
...
Seed is now set by the CLI, defaults to -1 if no seed is set
Threads and batch size are now properly launch parameters.
2023-05-28 06:29:11 -04:00
Henri Vasserman
549291fe61
keep processed from the beginning
...
this means no limit to the input prompt,
it will just get reset again as normal
2023-05-28 12:11:41 +03:00
Randall Fitzgerald
df0e0d094c
Forgot to remove some testing code.
2023-05-28 12:11:14 +03:00
Randall Fitzgerald
f93fe36c5b
Add all generation parameters to server.cpp and allow resetting context
...
server.cpp left out a few generation parameters and also seems built to assume un-editable chatting with no regens or swipes. I added a simple "reload_ctx" flag that can be passed on generation that will cause the prompt to be reloaded.
2023-05-28 12:11:10 +03:00
Henri Vasserman
51e09944ce
server rewrite
...
Remove unnecessary things and radically rewrite server
2023-05-28 12:10:16 +03:00