* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure the kv cache head points to a valid slot in llama_decode_internal (see the sketch below)
* Add some comments to prevent dumb people (like me) from getting confused.
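
As a rough illustration of what these kv-cache items mean, here is a simplified Python model of the slot search. The real code is C++ inside llama.cpp and differs in detail; the class and names below are purely illustrative.

```python
# Simplified model of the KV-cache slot search idea (not the actual llama.cpp code).
# The cache has n_ctx cells; find_slot() looks for n_tokens consecutive free cells
# starting at `head`, wrapping head back to 0 instead of searching past n_ctx,
# so head always stays a valid index into the cache.

class KVCacheModel:
    def __init__(self, n_ctx: int):
        self.n_ctx = n_ctx
        self.head = 0
        self.used = [False] * n_ctx  # True = cell holds a live token

    def find_slot(self, n_tokens: int):
        if n_tokens > self.n_ctx:
            return None  # the batch can never fit
        tested = 0
        while tested < self.n_ctx:
            if self.head + n_tokens > self.n_ctx:
                # Would run past the end of the cache: wrap around.
                tested += self.n_ctx - self.head
                self.head = 0
                continue
            busy = next((i for i in range(n_tokens) if self.used[self.head + i]), None)
            if busy is None:
                return self.head  # n_tokens free cells found at head
            # Skip just past the occupied cell and keep looking.
            self.head += busy + 1
            tested += busy + 1
        return None  # cache is full or too fragmented
```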
Popen() needs to be used with 'with', have .wait() called, or be destroyed;
otherwise a zombie child process sticks around until the Popen object is
GC'd.
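
For reference, the pattern described above, using only the standard library:

```python
import subprocess
import sys

cmd = [sys.executable, "-c", "print('hello')"]

# Preferred: the context manager waits for the child and releases its
# resources, so no zombie process is left behind.
with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
    output, _ = proc.communicate()

# Without `with`, call wait() (or communicate()) explicitly; otherwise the
# exited child lingers as a zombie until the Popen object is garbage collected.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
output = proc.stdout.read()
proc.wait()
```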
Fix uploading tensor data to device, including 3D, 4D, and non-contiguous tensors.
Use correct offsets into data that is already in VRAM.
Correct handling of OpenCL events when multiple commands are queued.
* Implement basic chat/completions openai endpoint
-Basic support for openai chat/completions endpoint documented at: https://platform.openai.com/docs/api-reference/chat/create (a response-shape sketch follows this list)
-Tested with example code from openai for chat/completions and chat/completions with stream=True parameter found here: https://cookbook.openai.com/examples/how_to_stream_completions.
-Tested with Mantella, the Skyrim mod that turns all the NPCs into AI-chattable characters, which uses openai's acreate / async completions method: https://github.com/art-from-the-machine/Mantella/blob/main/src/output_manager.py
-Tested default koboldcpp API behavior with the streaming and non-streaming generate endpoints and with the GUI running; everything seems fine.
-Still TODO / evaluate before merging:
(1) implement the rest of the openai chat/completions parameters to the extent possible, mapping them to koboldcpp parameters
(2) determine if there is a way to use kobold's prompt formats for certain models when translating the openai messages format into a prompt string. (Not sure if this is possible or where these formats live in the code)
(3) have chat/completions responses include the actual local model the user is using instead of just koboldcpp. (Not sure if this is possible)
Note: I am a Python noob, so there may be a more elegant way of doing this; at minimum, hopefully I have done some of the grunt work for you to implement on your own.
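
As a rough sketch of the response shapes involved: the field names follow OpenAI's published chat/completions schema, but the helper functions and default model name here are illustrative, not the actual koboldcpp code.

```python
import json
import time
import uuid


def make_chat_completion_response(text: str, model_name: str = "koboldcpp") -> dict:
    # Non-streaming response body, shaped like OpenAI's chat/completions schema.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_name,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }


def make_stream_chunk(token: str, model_name: str = "koboldcpp") -> str:
    # One server-sent-events line for stream=True clients; the stream ends
    # with a final "data: [DONE]" line.
    chunk = {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model_name,
        "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```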
* Fix typographical error on deleted streaming argument
-Mistakenly left code relating to the streaming argument from the main branch in experimental.
* add additional openai chat completions parameters
-support stop parameter mapped to koboldai stop_sequence parameter
-make default max_length / max_tokens parameter consistent with default 80 token length in generate function
-add support for providing name of local model in openai responses
* Revert "add additional openai chat completions parameters"
This reverts commit 443a6f7ff6346f41c78b0a6ff59c063999542327.
* add additional openai chat completions parameters
-support stop parameter mapped to koboldai stop_sequence parameter
-make default max_length / max_tokens parameter consistent with default 80 token length in generate function
-add support for providing name of local model in openai responses
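
A sketch of the kind of mapping these items describe; only the stop / stop_sequence pairing and the 80-token max_tokens / max_length default come from the notes above, and the helper itself is hypothetical rather than the actual koboldcpp code.

```python
def map_openai_to_kobold(req: dict) -> dict:
    # Translate an OpenAI chat/completions request body into kobold-style
    # generate parameters. OpenAI's `stop` may be a string or a list of
    # strings; kobold's stop_sequence takes a list.
    stop = req.get("stop") or []
    if isinstance(stop, str):
        stop = [stop]
    return {
        "max_length": req.get("max_tokens", 80),  # match the 80-token default
        "stop_sequence": stop,
    }
```

The actual local model's name can then be echoed back in the `model` field of the responses instead of the literal string "koboldcpp".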
* add \n after formatting prompts from openaiformat
to conform with the alpaca standard used as default in lite.koboldai.net
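
Roughly what that conversion looks like; the exact template below is an assumption (an Alpaca-style instruct format), and the relevant detail is the trailing newline appended after the final response header.

```python
def messages_to_prompt(messages: list[dict]) -> str:
    # Flatten OpenAI-style chat messages into a single Alpaca-style prompt
    # string. The template is illustrative; lite.koboldai.net's default
    # instruct format is Alpaca-like, with "### Instruction:" / "### Response:".
    parts = []
    for msg in messages:
        role, content = msg.get("role", "user"), msg.get("content", "")
        if role == "system":
            parts.append(content)
        elif role == "assistant":
            parts.append(f"### Response:\n{content}")
        else:  # "user" and anything else
            parts.append(f"### Instruction:\n{content}")
    # End with a response header plus a trailing newline so the model starts
    # its reply on a fresh line.
    return "\n\n".join(parts) + "\n\n### Response:\n"
```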
* tidy up and simplify code, do not set globals for streaming
* oai endpoints must start with v1
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>