various debug logging improvements

commit 8775dd99f4 (parent dc3472eb58)
6 changed files with 66 additions and 40 deletions
README.md (30 changes)

@@ -13,14 +13,24 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin
 - To run, execute **koboldcpp.exe** or drag and drop your quantized `ggml_model.bin` file onto the .exe, and then connect with Kobold or Kobold Lite. If you're not on Windows, run the script **KoboldCpp.py** after compiling the libraries.
 - By default, you can connect to http://localhost:5001
 - You can also run it from the command line with `koboldcpp.exe [ggml_model.bin] [port]`. For more info, check `koboldcpp.exe --help`
-- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
 - Big context still too slow? Try the `--smartcontext` flag to reduce prompt processing frequency. You can also try running on your GPU with CLBlast, using the `--useclblast` flag, for a speedup.
 - Want even more speedup? Combine `--useclblast` with `--gpulayers` to offload entire layers to the GPU! **Much faster, but uses more VRAM**. Experiment to determine the number of layers to offload.
+- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
 
 For more information, be sure to run the program with the `--help` flag.
 
+## OSX and Linux
+- You will have to compile your binaries from source. A makefile is provided; simply run `make`.
+- If you want, you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`.
+- Alternatively, you can link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`; for this you will need to obtain and link the OpenCL and CLBlast libraries.
+- For a full-featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`.
+- For Arch Linux: install `cblas`, `openblas` and `clblast`.
+- For Debian: install `libclblast-dev` and `libopenblas-dev`.
+- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`.
+- Note: many OSX users have found that using Accelerate is actually faster than OpenBLAS. To try it, you may wish to run with `--noblas` and compare speeds.
 
 ## Compiling on Windows
-- If you want to compile your binaries from source at Windows, the easiest way is:
+- You're encouraged to use the released .exe, but if you want to compile your binaries from source on Windows, the easiest way is:
 - Use the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla" one, not i686 or other variants; they will conflict with the precompiled libs!
 - Make sure you are using the w64devkit integrated terminal, then run `make` in the KoboldCpp source folder. This will create the .dll files.
 - If you want to generate the .exe file, make sure you have the python module PyInstaller installed with pip (`pip install PyInstaller`).
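A typical invocation combining these flags might look like `koboldcpp.exe mymodel.ggml --useclblast 0 0 --gpulayers 24`; the model filename and layer count here are placeholders, and the two numbers after `--useclblast` select the OpenCL platform and device.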
@@ -34,19 +44,13 @@ For more information, be sure to run the program with the `--help` flag.
 - Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
 - Make the KoboldCPP project using the instructions above.
 
-## OSX and Linux
-- You will have to compile your binaries from source. A makefile is provided, simply run `make`
-- If you want you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
-- Alternatively, if you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
-- For a full featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
-- For Arch Linux: Install `cblas` `openblas` and `clblast`.
-- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
-- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`
-- Note: Many OSX users have found that the using Accelerate is actually faster than OpenBLAS. To try, you may wish to run with `--noblas` and compare speeds.
+## Android (Termux) Alternative method
+- See https://github.com/ggerganov/llama.cpp/pull/1828/files
+
+## CuBLAS?
+- You can attempt a CuBLAS build with LLAMA_CUBLAS=1 or using the provided CMake file (best for Visual Studio users). Note that support for CuBLAS is limited.
 
 ## Considerations
-- ZERO or MINIMAL changes as possible to parent repo files - do not move their function declarations elsewhere! We want to be able to update the repo and pull any changes automatically.
-- No dynamic memory allocation! Setup structs with FIXED (known) shapes and sizes for ALL output fields. Python will ALWAYS provide the memory, we just write to it.
 - For Windows: No installation, single file executable (It Just Works).
 - Since v1.0.6, requires libopenblas; the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
 - Since v1.15, requires CLBlast if enabled; the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.
Remote-Link.cmd (new file, 2 lines)

@@ -0,0 +1,2 @@
+curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-windows-amd64.exe -o cloudflared.exe
+cloudflared.exe tunnel --url localhost:5001
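This new helper script downloads the cloudflared binary and opens a Cloudflare quick tunnel pointed at the default KoboldCpp port; cloudflared should print a temporary public URL that forwards to the local server. It assumes 64-bit Windows and the default port 5001.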
expose.h (2 changes)

@@ -18,7 +18,7 @@ struct load_model_inputs
 const bool unban_tokens;
 const int clblast_info = 0;
 const int blasbatchsize = 512;
-const bool debugmode;
+const int debugmode = 0;
 const int forceversion = 0;
 const int gpulayers = 0;
 };
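Because `debugmode` widens from `bool` to `int` here, the ctypes mirror on the Python side has to change in lockstep (a `c_bool` is one byte and a C `int` is four, so a stale field would make the two struct layouts disagree). A minimal sketch of the matching Python declaration, showing only the field subset visible in this diff:

    import ctypes

    class load_model_inputs(ctypes.Structure):
        # Field subset from this diff only; the real struct has more fields.
        _fields_ = [("unban_tokens", ctypes.c_bool),
                    ("clblast_info", ctypes.c_int),
                    ("blasbatchsize", ctypes.c_int),
                    ("debugmode", ctypes.c_int),  # was ctypes.c_bool before this commit
                    ("forceversion", ctypes.c_int),
                    ("gpulayers", ctypes.c_int)]

The koboldcpp.py hunk further below makes exactly this one-line change.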
gpttype_adapter.cpp

@@ -68,7 +68,7 @@ static int n_batch = 8;
 static bool useSmartContext = false;
 static bool unbanTokens = false;
 static int blasbatchsize = 512;
-static bool debugmode = false;
+static int debugmode = 0; //-1 = hide all, 0 = normal, 1 = showall
 static std::string modelname;
 static std::vector<gpt_vocab::id> last_n_tokens;
 static std::vector<gpt_vocab::id> current_context_tokens;
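Per the comment above, the flag is now tri-state rather than boolean: -1 hides all output, 0 keeps normal progress, 1 shows everything. The hunks below apply that convention consistently, gating verbose dumps on `debugmode==1` and ordinary progress lines on `debugmode!=-1`. A hypothetical Python sketch of the same rule (not code from this commit):

    def should_print(debugmode: int, is_debug_detail: bool) -> bool:
        # -1 = hide all, 0 = normal, 1 = showall (same convention as the C++ flag)
        if debugmode == -1:
            return False            # hide everything, even progress lines
        if is_debug_detail:
            return debugmode == 1   # token dumps, top picks, etc.
        return True                 # normal progress output

    assert should_print(0, False) and not should_print(0, True)
    assert not should_print(-1, False) and should_print(1, True)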
@@ -118,7 +118,7 @@ llama_token sample_token(llama_token_data_array * candidates, std::mt19937 & rng
 std::discrete_distribution<> dist(probs.begin(), probs.end());
 int idx = dist(rng);
 
-if(debugmode)
+if(debugmode==1)
 {
 top_picks.push_back(candidates->data[idx]);
 for (size_t i = 0; (i < candidates->size && i<4); ++i)
@@ -981,9 +981,12 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
printf("Bad format!");
|
printf("Bad format!");
|
||||||
}
|
}
|
||||||
|
|
||||||
printf("\n");
|
if(debugmode!=-1)
|
||||||
|
{
|
||||||
|
printf("\n");
|
||||||
|
}
|
||||||
|
|
||||||
if (debugmode)
|
if (debugmode==1)
|
||||||
{
|
{
|
||||||
std::string outstr = "";
|
std::string outstr = "";
|
||||||
printf("\n[Debug: Dump Input Tokens, format: %d]\n", file_format);
|
printf("\n[Debug: Dump Input Tokens, format: %d]\n", file_format);
|
||||||
|
@@ -1013,7 +1016,7 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
 // predict
 unsigned int embdsize = embd.size();
 //print progress
-if (!startedsampling)
+if (!startedsampling && debugmode!=-1)
 {
 printf("\rProcessing Prompt%s (%d / %d tokens)", (blasmode ? " [BLAS]" : ""), input_consumed, embd_inp.size());
 }
@@ -1229,11 +1232,11 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
 concat_output += tokenizedstr;
 }
 
-if (startedsampling)
+if (startedsampling && debugmode!=-1)
 {
 printf("\rGenerating (%d / %d tokens)", (params.n_predict - remaining_tokens), params.n_predict);
 }
-if(debugmode && top_picks.size()>0)
+if(debugmode==1 && top_picks.size()>0)
 {
 printf(" [");
 bool firstloop = true;
@@ -1263,7 +1266,10 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
 {
 stopper_unused_tokens = remaining_tokens;
 remaining_tokens = 0;
-printf("\n(Stop sequence triggered: <%s>)", matched.c_str());
+if(debugmode!=-1)
+{
+printf("\n(Stop sequence triggered: <%s>)", matched.c_str());
+}
 break;
 }
 }
File diff suppressed because one or more lines are too long
koboldcpp.py (28 changes)

@@ -26,7 +26,7 @@ class load_model_inputs(ctypes.Structure):
("unban_tokens", ctypes.c_bool),
|
("unban_tokens", ctypes.c_bool),
|
||||||
("clblast_info", ctypes.c_int),
|
("clblast_info", ctypes.c_int),
|
||||||
("blasbatchsize", ctypes.c_int),
|
("blasbatchsize", ctypes.c_int),
|
||||||
("debugmode", ctypes.c_bool),
|
("debugmode", ctypes.c_int),
|
||||||
("forceversion", ctypes.c_int),
|
("forceversion", ctypes.c_int),
|
||||||
("gpulayers", ctypes.c_int)]
|
("gpulayers", ctypes.c_int)]
|
||||||
|
|
||||||
|
@@ -224,7 +224,8 @@ maxctx = 2048
 maxlen = 256
 modelbusy = False
 defaultport = 5001
-KcppVersion = "1.31"
+KcppVersion = "1.31.1"
+showdebug = True
 
 class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
     sys_version = ""
@@ -238,6 +239,12 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
     def __call__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
 
+    def log_message(self, format, *args):
+        global showdebug
+        if showdebug:
+            super().log_message(format, *args)
+        pass
 
     async def generate_text(self, newprompt, genparams, basic_api_flag, stream_flag):
 
         def run_blocking():
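`log_message` is the hook that `http.server.BaseHTTPRequestHandler` invokes for every request line, so routing it through the `showdebug` global is the standard way to mute per-request access logs. A standalone sketch of the same idea, using a hypothetical handler that is not part of this commit:

    import http.server

    class QuietHandler(http.server.SimpleHTTPRequestHandler):
        verbose = False  # flip to True to restore the default access log

        def log_message(self, format, *args):
            # Called by BaseHTTPRequestHandler for every request;
            # skipping the super() call suppresses the log line.
            if self.verbose:
                super().log_message(format, *args)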
@@ -281,7 +288,8 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
         else:
             recvtxt = run_blocking()
 
-        utfprint("\nOutput: " + recvtxt)
+        if args.debugmode!=-1:
+            utfprint("\nOutput: " + recvtxt)
 
         res = {"data": {"seqs":[recvtxt]}} if basic_api_flag else {"results": [{"text": recvtxt}]}
@@ -414,7 +422,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
         self.send_response(200)
         self.end_headers()
         self.wfile.write(json.dumps({"success": ("true" if ag else "false")}).encode())
-        print("Generation Aborted")
+        print("\nGeneration Aborted")
         modelbusy = False
         return
@@ -453,7 +461,8 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
utfprint("Body Err: " + str(body))
|
utfprint("Body Err: " + str(body))
|
||||||
return self.send_response(503)
|
return self.send_response(503)
|
||||||
|
|
||||||
utfprint("\nInput: " + json.dumps(genparams))
|
if args.debugmode!=-1:
|
||||||
|
utfprint("\nInput: " + json.dumps(genparams))
|
||||||
|
|
||||||
modelbusy = True
|
modelbusy = True
|
||||||
|
|
||||||
|
@@ -714,10 +723,15 @@ def main(args):
         sys.exit(2)
 
     if args.hordeconfig and args.hordeconfig[0]!="":
-        global friendlymodelname, maxlen
+        global friendlymodelname, maxlen, showdebug
         friendlymodelname = "koboldcpp/"+args.hordeconfig[0]
         if len(args.hordeconfig) > 1:
             maxlen = int(args.hordeconfig[1])
+        if args.debugmode == 0:
+            args.debugmode = -1
+
+    if args.debugmode != 1:
+        showdebug = False
 
     if args.highpriority:
         print("Setting process to Higher Priority - Use Caution")
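Taken together with the argparse change below, the effect is: a horde worker launched without an explicit `--debugmode` is demoted from 0 to -1, which silences the per-request Input/Output prints guarded earlier in this file, and the HTTP access log (`showdebug`) stays enabled only when `--debugmode` was passed.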
@@ -839,7 +853,7 @@ if __name__ == '__main__':
parser.add_argument("--nommap", help="If set, do not use mmap to load newer models", action='store_true')
|
parser.add_argument("--nommap", help="If set, do not use mmap to load newer models", action='store_true')
|
||||||
parser.add_argument("--usemlock", help="For Apple Systems. Force system to keep model in RAM rather than swapping or compressing", action='store_true')
|
parser.add_argument("--usemlock", help="For Apple Systems. Force system to keep model in RAM rather than swapping or compressing", action='store_true')
|
||||||
parser.add_argument("--noavx2", help="Do not use AVX2 instructions, a slower compatibility mode for older devices. Does not work with --clblast.", action='store_true')
|
parser.add_argument("--noavx2", help="Do not use AVX2 instructions, a slower compatibility mode for older devices. Does not work with --clblast.", action='store_true')
|
||||||
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_true')
|
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_const', const=1, default=0)
|
||||||
parser.add_argument("--skiplauncher", help="Doesn't display or use the new GUI launcher.", action='store_true')
|
parser.add_argument("--skiplauncher", help="Doesn't display or use the new GUI launcher.", action='store_true')
|
||||||
parser.add_argument("--hordeconfig", help="Sets the display model name to something else, for easy use on AI Horde. An optional second parameter sets the horde max gen length.",metavar=('[hordename]', '[hordelength]'), nargs='+')
|
parser.add_argument("--hordeconfig", help="Sets the display model name to something else, for easy use on AI Horde. An optional second parameter sets the horde max gen length.",metavar=('[hordename]', '[hordelength]'), nargs='+')
|
||||||
compatgroup = parser.add_mutually_exclusive_group()
|
compatgroup = parser.add_mutually_exclusive_group()
|
||||||
|
|
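The move from `store_true` to `store_const` is what enables the three-state scheme from a single flag: omitted yields 0, passing `--debugmode` yields 1, and -1 is only ever assigned programmatically in `main()`. A minimal self-contained check of that argparse behavior (hypothetical parser, mirroring just this one flag):

    import argparse

    parser = argparse.ArgumentParser()
    # Same pattern as the koboldcpp flag: absent -> 0, present -> 1.
    parser.add_argument("--debugmode", action='store_const', const=1, default=0)

    assert parser.parse_args([]).debugmode == 0               # normal output
    assert parser.parse_args(["--debugmode"]).debugmode == 1  # verbose
    # -1 ("hide all") is never set from the CLI; main() assigns it for
    # horde workers that left the flag at its default.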