llama-for-kobold

A hacky little script from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint.

It's not very usable right now because of a fundamental flaw in llama.cpp that causes generation delay to scale linearly with the original prompt length. Nobody knows why or really cares much, so I'm just going to publish whatever I have at this point.

If you care, please contribute to this discussion, which, if resolved, would actually make this viable.
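For a rough idea of what "a simulated Kobold API endpoint" means in practice: a KoboldAI client POSTs a JSON prompt and reads back a JSON completion. Below is a minimal sketch of such a request using only the standard library; the port, route, and field names are assumptions for illustration and may not match what llama_for_kobold.py actually serves.

```python
import json
import urllib.request

# Hypothetical example: ask the local server for a completion the way a
# KoboldAI client would. Route, port, and field names are assumptions.
payload = json.dumps({"prompt": "Once upon a time,", "max_length": 80}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["results"][0]["text"])  # assumed response shape
```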

Considerations

  • Don't want to use pybind11 due to its dependency on MSVC
  • ZERO changes to main.cpp, or as few as possible; do not move its function declarations elsewhere!
  • Leave main.cpp UNTOUCHED; we want to be able to update the repo and pull upstream changes automatically.
  • No dynamic memory allocation! Set up structs with FIXED (known) shapes and sizes for ALL output fields. Python will ALWAYS provide the memory; we just write into it (see the sketch after this list).
  • No external libraries or dependencies. That means no Flask, no pybind, nothing. All You Need Is Python.
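Putting the last two points together, here is a minimal sketch of the intended pattern, with every name invented for illustration (this is not the actual expose.cpp interface): Python declares a ctypes struct whose output buffer has a FIXED size, allocates it, and hands a pointer to the C++ side; the server itself is plain http.server from the standard library.

```python
import ctypes
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative only: a fixed-shape output struct. The C++ side would write
# into the buffer Python owns -- no allocation crosses the boundary.
class GenerationOutput(ctypes.Structure):
    _fields_ = [("status", ctypes.c_int),
                ("text", ctypes.c_char * 16384)]  # FIXED size, known up front

# In the real script the library is loaded with ctypes, roughly like:
# handle = ctypes.CDLL("./llamalib.dll")
# handle.generate.argtypes = [ctypes.c_char_p, ctypes.POINTER(GenerationOutput)]

class KoboldHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        prompt = json.loads(self.rfile.read(length)).get("prompt", "")
        output = GenerationOutput()  # Python provides the memory up front
        # handle.generate(prompt.encode("utf-8"), ctypes.byref(output))
        output.text = b"(echo for illustration) " + prompt.encode("utf-8")[:100]
        reply = {"results": [{"text": output.text.decode("utf-8", "ignore")}]}
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 5001), KoboldHandler).serve_forever()
```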