updated readme, improved simple launcher

Concedo 2023-06-03 17:17:15 +08:00
parent 6f82e17b7a
commit 8bd9a3a48b
2 changed files with 48 additions and 19 deletions

README.md

@@ -2,15 +2,10 @@
A self contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint.
What does it mean? You get llama.cpp with a fancy UI, persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything Kobold and Kobold Lite have to offer. In a tiny package around 10 MB in size, excluding model weights.
What does it mean? You get llama.cpp with a fancy UI, persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything Kobold and Kobold Lite have to offer. In a tiny package around 20 MB in size, excluding model weights.
![Preview](preview.png)
# Highlights
- Now has experimental CLBlast support.
- Now supports RWKV models WITHOUT pytorch or tokenizers! Yep, just GGML!
- Now supports GPT-NeoX / Pythia models
## Usage
- [Download the latest release here](https://github.com/LostRuins/koboldcpp/releases/latest) or clone the repo.
- Windows binaries are provided in the form of **koboldcpp.exe**, which is a pyinstaller wrapper for a few **.dll** files and **koboldcpp.py**. If you feel concerned, you may prefer to rebuild it yourself with the provided makefiles and scripts.
@@ -20,8 +15,9 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin
- You can also run it using the command line `koboldcpp.exe [ggml_model.bin] [port]`. For info, please check `koboldcpp.exe --help`
- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
- Big context still too slow? Try the `--smartcontext` flag to reduce prompt processing frequency. You can also try running on your GPU with CLBlast, using the `--useclblast` flag, for a speedup
- Want even more speedup? Combine `--useclblast` with `--gpulayers` to offload entire layers to the GPU! **Much faster, but uses more VRAM**. Experiment to determine the number of layers to offload (see the example below).
For more information, be sure to run the program with the --help flag.
For more information, be sure to run the program with the `--help` flag.
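For example, several of these flags can be combined in a single launch. A hypothetical invocation from Python (the model filename, port, and CLBlast platform/device indices are placeholders for your own setup):

```python
# Illustrative only: launch koboldcpp with SmartContext, CLBlast,
# and GPU layer offloading enabled. Adjust the model path, port,
# and device numbers for your machine.
import subprocess

subprocess.run([
    "koboldcpp.exe",           # on Linux/OSX, run koboldcpp.py instead
    "ggml_model.bin",          # placeholder model file
    "5001",                    # placeholder port
    "--smartcontext",
    "--useclblast", "0", "0",  # assumed platform/device indices
    "--gpulayers", "24",       # layers to offload; experiment with the count
])
```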
## Compiling on Windows
- If you want to compile your binaries from source on Windows, the easiest way is:
@@ -62,12 +58,13 @@ For more information, be sure to run the program with the --help flag.
- The other files are also under the AGPL v3.0 License unless otherwise stated
## Notes
- Generation delay scales linearly with original prompt length. If OpenBLAS is enabled then prompt ingestion becomes about 2-3x faster. This is automatic on Windows, but will require linking on OSX and Linux.
- Generation delay scales linearly with original prompt length. If OpenBLAS is enabled then prompt ingestion becomes about 2-3x faster. This is automatic on Windows, but will require linking on OSX and Linux. CLBlast speeds this up even further, and `--gpulayers` + `--useclblast` more so.
- I have heard of someone claiming a false AV positive report. The exe is a simple pyinstaller bundle that includes the necessary python scripts and DLLs to run. If this still concerns you, you might wish to rebuild everything from source code using the makefile, and you can rebuild the exe yourself with pyinstaller via `make_pyinstaller.bat`
- Supported GGML models:
- LLAMA (All versions including ggml, ggmf, ggjt, gpt4all). Supports CLBlast and OpenBLAS acceleration for all versions.
- LLAMA (All versions including ggml, ggmf, ggjt v1,v2,v3, openllama, gpt4all). Supports CLBlast and OpenBLAS acceleration for all versions.
- GPT-2 (All versions, including legacy f16, newer format + quantized, cerebras) Supports OpenBLAS acceleration only for newer format.
- GPT-J (All versions including legacy f16, newer format + quantized, pyg.cpp, new pygmalion, janeway etc.) Supports OpenBLAS acceleration only for newer format.
- RWKV (all formats except Q4_1_O).
- GPT-NeoX / Pythia / StableLM / Dolly / RedPajama
- MPT models (ggjt v3)
- Basically every single current and historical GGML format that has ever existed should be supported, except for bloomz.cpp due to lack of demand.

koboldcpp.py

@@ -447,32 +447,59 @@ def show_gui():
pass
# Adjust size
root.geometry("460x320")
root.geometry("480x360")
root.title("KoboldCpp v"+KcppVersion)
root.grid_columnconfigure(0, weight=1)
tk.Label(root, text = "KoboldCpp Easy Launcher",
font = ("Arial", 12)).pack(pady=4)
font = ("Arial", 12)).grid(row=0,column=0)
tk.Label(root, text = "(Note: KoboldCpp only works with GGML model formats!)",
font = ("Arial", 9)).pack()
font = ("Arial", 9)).grid(row=1,column=0)
opts = ["Use OpenBLAS","Use CLBLast GPU #1","Use CLBLast GPU #2","Use CLBLast GPU #3","Use No BLAS","Use OpenBLAS (Old CPU, noavx2)","Failsafe Mode (Old CPU, noavx)"]
runchoice = tk.StringVar()
runchoice.set("Use OpenBLAS")
tk.OptionMenu( root , runchoice , *opts ).pack()
tk.OptionMenu( root , runchoice , *opts ).grid(row=2,column=0)
frm2 = tk.Frame(root)
threads_var=tk.StringVar()
threads_var.set(str(default_threads))
threads_lbl = tk.Label(frm2, text = 'Threads: ', font=('calibre',10, 'bold'))
threads_input = tk.Entry(frm2,textvariable = threads_var, font=('calibre',10,'normal'))
threads_lbl.grid(row=0,column=0)
threads_input.grid(row=0,column=1)
frm2.grid(row=3,column=0,pady=4)
frm1 = tk.Frame(root)
gpu_layers_var=tk.StringVar()
gpu_layers_var.set("0")
gpu_lbl = tk.Label(frm1, text = 'GPU Layers (CLBlast only): ', font=('calibre',10, 'bold'))
gpu_layers_input = tk.Entry(frm1,textvariable = gpu_layers_var, font=('calibre',10,'normal'))
gpu_lbl.grid(row=0,column=0)
gpu_layers_input.grid(row=0,column=1)
frm1.grid(row=4,column=0,pady=4)
stream = tk.IntVar()
smartcontext = tk.IntVar()
launchbrowser = tk.IntVar(value=1)
unbantokens = tk.IntVar()
tk.Checkbutton(root, text='Streaming Mode',variable=stream, onvalue=1, offvalue=0).pack()
tk.Checkbutton(root, text='Use SmartContext',variable=smartcontext, onvalue=1, offvalue=0).pack()
tk.Checkbutton(root, text='Unban Tokens',variable=unbantokens, onvalue=1, offvalue=0).pack()
tk.Checkbutton(root, text='Launch Browser',variable=launchbrowser, onvalue=1, offvalue=0).pack()
highpriority = tk.IntVar()
disablemmap = tk.IntVar()
frm3 = tk.Frame(root)
tk.Checkbutton(frm3, text='Streaming Mode',variable=stream, onvalue=1, offvalue=0).grid(row=0,column=0)
tk.Checkbutton(frm3, text='Use SmartContext',variable=smartcontext, onvalue=1, offvalue=0).grid(row=0,column=1)
tk.Checkbutton(frm3, text='High Priority',variable=highpriority, onvalue=1, offvalue=0).grid(row=1,column=0)
tk.Checkbutton(frm3, text='Disable MMAP',variable=disablemmap, onvalue=1, offvalue=0).grid(row=1,column=1)
tk.Checkbutton(frm3, text='Unban Tokens',variable=unbantokens, onvalue=1, offvalue=0).grid(row=2,column=0)
tk.Checkbutton(frm3, text='Launch Browser',variable=launchbrowser, onvalue=1, offvalue=0).grid(row=2,column=1)
frm3.grid(row=5,column=0,pady=4)
# Create the launch button
tk.Button( root , text = "Launch", font = ("Impact", 18), bg='#54FA9B', command = guilaunch ).pack(pady=10)
tk.Button( root , text = "Launch", font = ("Impact", 18), bg='#54FA9B', command = guilaunch ).grid(row=6,column=0)
tk.Label(root, text = "(Please use the Command Line for more advanced options)",
font = ("Arial", 9)).pack()
font = ("Arial", 9)).grid(row=7,column=0)
root.mainloop()
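The launcher above now lays out widgets with tkinter's `grid()` geometry manager instead of `pack()`, which is what makes the two-column checkbox rows and the labelled entry fields possible. A minimal standalone sketch of the same pattern (the widget names and labels here are illustrative, not taken from koboldcpp):

```python
import tkinter as tk

root = tk.Tk()
root.grid_columnconfigure(0, weight=1)  # let the single column stretch

# A frame is grid-managed inside the root, and its children are
# grid-managed inside the frame: two independent grids.
frm = tk.Frame(root)
tk.Label(frm, text="Threads: ").grid(row=0, column=0)
tk.Entry(frm).grid(row=0, column=1)
frm.grid(row=0, column=0, pady=4)

frm2 = tk.Frame(root)
tk.Checkbutton(frm2, text="Option A").grid(row=0, column=0)
tk.Checkbutton(frm2, text="Option B").grid(row=0, column=1)
frm2.grid(row=1, column=0, pady=4)

root.mainloop()
```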
@@ -482,10 +509,15 @@ def show_gui():
sys.exit()
#load all the vars
args.threads = int(threads_var.get())
args.gpulayers = int(gpu_layers_var.get())
args.stream = (stream.get()==1)
args.smartcontext = (smartcontext.get()==1)
args.launch = (launchbrowser.get()==1)
args.unbantokens = (unbantokens.get()==1)
args.highpriority = (highpriority.get()==1)
args.nommap = (disablemmap.get()==1)
selchoice = runchoice.get()
if selchoice==opts[1]:
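The diff cuts off at the first branch of the dropdown dispatch. Purely as an illustration of the pattern, here is one plausible shape for the rest of the mapping; this is an assumption about the elided code, and the actual branch bodies further down in koboldcpp.py may differ:

```python
# Assumed sketch, not the real koboldcpp code: map the Easy Launcher
# dropdown choice onto the matching command-line flags. `opts` is the
# option list defined in show_gui() above.
def apply_choice(args, selchoice, opts):
    if selchoice == opts[1]:
        args.useclblast = [0, 0]  # CLBlast GPU #1 (assumed device indices)
    elif selchoice == opts[2]:
        args.useclblast = [1, 0]  # CLBlast GPU #2
    elif selchoice == opts[3]:
        args.useclblast = [2, 0]  # CLBlast GPU #3
    elif selchoice == opts[4]:
        args.noblas = True        # "Use No BLAS"
    elif selchoice == opts[5]:
        args.noavx2 = True        # "Use OpenBLAS (Old CPU, noavx2)"
    elif selchoice == opts[6]:
        args.noavx2 = True        # "Failsafe Mode (Old CPU, noavx)"
        args.noblas = True
        args.nommap = True
```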