* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure kv cache head points to a valid slot in llama_decode internal
* Add some comments to prevent dumb people (like me) from getting confused.
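For context, a minimal sketch of the bounded slot search these commits describe; the struct layout and names here (kv_cell, kv_cache, pos, head) are illustrative, not the actual llama.cpp definitions, but they show how clamping the scan to n_ctx keeps head pointing at a valid slot:

    #include <stdbool.h>
    #include <stdint.h>

    struct kv_cell  { int32_t pos; };      /* pos < 0 marks the cell free */

    struct kv_cache {
        struct kv_cell *cells;
        uint32_t        n_ctx;             /* total number of cells */
        uint32_t        head;              /* next candidate slot   */
    };

    /* find n_tokens contiguous free cells; on success head is left
       pointing at a valid slot, and the scan is bounded by n_ctx */
    static bool kv_find_slot(struct kv_cache *kv, uint32_t n_tokens) {
        if (n_tokens > kv->n_ctx) return false;
        uint32_t tested = 0;
        while (tested < kv->n_ctx) {
            if (kv->head + n_tokens > kv->n_ctx) {   /* wrap around */
                tested  += kv->n_ctx - kv->head;
                kv->head = 0;
                continue;
            }
            uint32_t run = 0;
            while (run < n_tokens && kv->cells[kv->head + run].pos < 0) {
                run++;
            }
            if (run == n_tokens) return true;        /* found a slot */
            kv->head += run + 1;                     /* skip occupied cell */
            tested   += run + 1;
        }
        return false;                                /* cache is full */
    }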
Popen() needs to be used as a context manager ('with'), have .wait()
called, or be destroyed explicitly; otherwise a zombie child sticks
around until the object is GC'd.
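The underlying POSIX mechanism, sketched in C: an exited child stays in the process table as a zombie until the parent reaps it, which is what Popen.wait() (and the context-manager exit) do under the hood.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            _exit(0);              /* child exits immediately...             */
        }
        sleep(1);                  /* ...and is now a zombie: exited, but
                                      still occupying a process-table entry  */
        int status = 0;
        waitpid(pid, &status, 0);  /* reap it; the zombie entry is released  */
        printf("child %d reaped\n", (int) pid);
        return 0;
    }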
Fix uploading tensor data to device, including 3D, 4D, and non-contiguous tensors.
Use correct offsets into data that is already in VRAM.
Correct handling of OpenCL events when multiple commands are queued.
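A hedged sketch of the upload pattern, using ggml's ne*/nb* dims-and-strides convention: each row of a (possibly non-contiguous) 3D tensor is written at its own source offset, one event is collected per enqueued write, and every event is waited on and released before the call returns. The upload_3d helper and its signature are illustrative, not the actual code.

    #include <CL/cl.h>
    #include <stdlib.h>

    /* upload a 3D tensor that is contiguous only within a row */
    static cl_int upload_3d(cl_command_queue queue, cl_mem dst, const char *src,
                            size_t ne0, size_t ne1, size_t ne2,   /* elements       */
                            size_t nb0, size_t nb1, size_t nb2) { /* strides, bytes */
        size_t    row_size = ne0 * nb0;
        size_t    n_rows   = ne1 * ne2;
        cl_event *evs      = malloc(n_rows * sizeof(cl_event));
        cl_int    err      = CL_SUCCESS;
        size_t    i        = 0;

        if (evs == NULL) return CL_OUT_OF_HOST_MEMORY;

        while (i < n_rows) {
            size_t i1 = i % ne1, i2 = i / ne1;
            size_t src_off = i1*nb1 + i2*nb2;   /* host offset uses the strides */
            size_t dst_off = i * row_size;      /* device side is packed        */
            err = clEnqueueWriteBuffer(queue, dst, CL_FALSE, dst_off, row_size,
                                       src + src_off, 0, NULL, &evs[i]);
            if (err != CL_SUCCESS) break;       /* no event was created         */
            i++;
        }
        if (err == CL_SUCCESS && i > 0) {
            err = clWaitForEvents((cl_uint) i, evs);   /* all writes landed */
        }
        for (size_t j = 0; j < i; j++) {
            clReleaseEvent(evs[j]);             /* one release per enqueued write */
        }
        free(evs);
        return err;
    }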
* Implement basic chat/completions openai endpoint
-Basic support for the OpenAI chat/completions endpoint documented at: https://platform.openai.com/docs/api-reference/chat/create
-Tested with OpenAI's example code for chat/completions, both with and without the stream=True parameter, found here: https://cookbook.openai.com/examples/how_to_stream_completions.
-Tested with Mantella, the Skyrim mod that turns all the NPCs into AI-chattable characters, which uses OpenAI's acreate / async completions method: https://github.com/art-from-the-machine/Mantella/blob/main/src/output_manager.py
-Tested default koboldcpp API behavior with the streaming and non-streaming generate endpoints and with the GUI running; everything seems fine.
-Still TODO / evaluate before merging:
(1) implement the rest of the OpenAI chat/completions parameters to the extent possible, mapping them to koboldcpp parameters
(2) determine if there is a way to use kobold's prompt formats for certain models when translating the OpenAI messages format into a prompt string. (Not sure if this is possible or where these formats live in the code)
(3) have chat/completions responses include the actual local model the user is using instead of just koboldcpp (not sure if this is possible)
Note: I am a Python noob, so if there is a more elegant way of doing this, hopefully I have at least done some of the grunt work for you to build on.
* Fix typographical error on deleted streaming argument
-Mistakenly left code relating to the streaming argument from the main branch in experimental.
* add additional openai chat completions parameters
-support stop parameter, mapped to the koboldai stop_sequence parameter
-make the default max_length / max_tokens parameter consistent with the default 80-token length in the generate function
-add support for providing the name of the local model in OpenAI responses
* Revert "add additional openai chat completions parameters"
This reverts commit 443a6f7ff6346f41c78b0a6ff59c063999542327.
* add additional openai chat completions parameters
-support stop parameter, mapped to the koboldai stop_sequence parameter
-make the default max_length / max_tokens parameter consistent with the default 80-token length in the generate function
-add support for providing the name of the local model in OpenAI responses
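A sketch of the resulting name mapping as a lookup table; the stop -> stop_sequence and max_tokens -> max_length pairs come from the commit text above, while the remaining pairs (and the table itself) are illustrative:

    /* hypothetical mapping table between OpenAI and KoboldAI parameter
       names; only the first two pairs are stated in the commit text */
    struct param_map { const char *openai; const char *kobold; };

    static const struct param_map oai_to_kobold[] = {
        { "max_tokens",  "max_length"    },  /* default 80, matching generate */
        { "stop",        "stop_sequence" },
        { "temperature", "temperature"   },  /* illustrative: same name       */
        { "top_p",       "top_p"         },
    };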
* add \n after formatting prompts from openaiformat
to conform with the Alpaca standard used by default in lite.koboldai.net
* tidy up and simplify code, do not set globals for streaming
* oai endpoints must start with v1
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* sync : ggml (conv 1d + 2d updates)
ggml-ci
* ggml : fix UB in q5_0 and q5_1 quantize code
ggml.c:1033:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
ggml.c:1081:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
ggml-ci
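For reference, the failure mode in isolation: left-shifting the signed int literal 1 by 31 is undefined behavior in C, so masks like the q5 high-bit pack must be built in unsigned arithmetic. A minimal illustration (not the exact ggml.c source):

    #include <stdint.h>

    /* UB: the literal 1 is a signed int, so 1 << 31 overflows it */
    uint32_t bit31_ub(void) { return 1  << 31; }

    /* OK: unsigned shift, well defined for every j in [0, 31] */
    uint32_t bit31_ok(void) { return 1u << 31; }

    /* q5-style packing of one bit per element into a 32-bit mask */
    uint32_t pack_high_bits(const uint8_t *b, int n) {
        uint32_t qh = 0;
        for (int j = 0; j < n && j < 32; j++) {
            qh |= (uint32_t)(b[j] & 1) << j;   /* cast before shifting */
        }
        return qh;
    }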
* tests : fix UB in test-quantize-perf
* Added RVV intrinsics support for Q8 quantize row and also improved the existing dot product functions for RISC-V.
RVV intrinsics are added for the following quantize-row functions:
quantize_row_q8_0
quantize_row_q8_1
The following dot product functions have also been optimized by using LMUL = 1/2 instead of LMUL = 1:
ggml_vec_dot_q4_0_q8_0
ggml_vec_dot_q4_1_q8_1
ggml_vec_dot_q5_0_q8_0
ggml_vec_dot_q5_1_q8_1
Vector initialization via a temporary array in Q5 is also replaced by the vid intrinsic.
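A hedged sketch of the quantize_row_q8_0 pattern described above (vector amax reduction, scale, then f32 -> i16 -> i8 narrowing converts). Intrinsic spellings follow the current __riscv_-prefixed intrinsics spec and may differ from the ones used at the time; this is not the committed code:

    #include <riscv_vector.h>
    #include <stdint.h>

    #define QK8_0 32

    /* per block: find amax by vector reduction, then quantize to int8 */
    void quantize_row_q8_0_rvv(const float *x, int8_t *qs, float *d, int nb) {
        size_t vl = __riscv_vsetvl_e32m4(QK8_0);
        for (int i = 0; i < nb; i++) {
            vfloat32m4_t vx   = __riscv_vle32_v_f32m4(x + i*QK8_0, vl);
            vfloat32m4_t vabs = __riscv_vfabs_v_f32m4(vx, vl);
            vfloat32m1_t z    = __riscv_vfmv_v_f_f32m1(0.0f, vl);
            vfloat32m1_t vmax = __riscv_vfredmax_vs_f32m4_f32m1(vabs, z, vl);
            float amax        = __riscv_vfmv_f_s_f32m1_f32(vmax);

            float dd = amax / 127.0f;              /* per-block scale */
            float id = dd != 0.0f ? 1.0f/dd : 0.0f;
            d[i] = dd;

            vfloat32m4_t vs  = __riscv_vfmul_vf_f32m4(vx, id, vl);
            vint16m2_t   v16 = __riscv_vfncvt_x_f_w_i16m2(vs, vl);  /* round+narrow */
            vint8m1_t    v8  = __riscv_vncvt_x_x_w_i8m1(v16, vl);   /* narrow again */
            __riscv_vse8_v_i8m1(qs + i*QK8_0, v8, vl);
        }
    }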
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
* Added RVV intrinsics support for k_quants
This adds RISC-V Vector intrinsics support for the following k_quants functions, for both QK_K = 256 and QK_K = 64:
ggml_vec_dot_q2_K_q8_K
ggml_vec_dot_q3_K_q8_K
ggml_vec_dot_q4_K_q8_K
ggml_vec_dot_q5_K_q8_K
ggml_vec_dot_q6_K_q8_K
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
---------
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>