various debug logging improvements

Concedo 2023-06-18 15:24:58 +08:00
parent dc3472eb58
commit 8775dd99f4
6 changed files with 66 additions and 40 deletions


@@ -13,14 +13,24 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin
- To run, execute **koboldcpp.exe** or drag and drop your quantized `ggml_model.bin` file onto the .exe, and then connect with Kobold or Kobold Lite. If you're not on windows, then run the script **KoboldCpp.py** after compiling the libraries.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line `koboldcpp.exe [ggml_model.bin] [port]`. For info, please check `koboldcpp.exe --help`
-- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
- Big context still too slow? Try the `--smartcontext` flag to reduce prompt processing frequency. Also, you can try to run with your GPU using CLBlast, with `--useclblast` flag for a speedup
- Want even more speedup? Combine `--useclblast` with `--gpulayers` to offload entire layers to the GPU! **Much faster, but uses more VRAM**. Experiment to determine number of layers to offload.
+- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
For more information, be sure to run the program with the `--help` flag.
+## OSX and Linux
+- You will have to compile your binaries from source. A makefile is provided, simply run `make`
+- If you want you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
+- Alternatively, if you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
+- For a full featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
+- For Arch Linux: Install `cblas` `openblas` and `clblast`.
+- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
+- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`
+- Note: Many OSX users have found that the using Accelerate is actually faster than OpenBLAS. To try, you may wish to run with `--noblas` and compare speeds.
## Compiling on Windows
-- If you want to compile your binaries from source at Windows, the easiest way is:
+- You're encouraged to use the .exe released, but if you want to compile your binaries from source at Windows, the easiest way is:
- Use the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla one", not i686 or other different stuff. If you try they will conflit with the precompiled libs!
- Make sure you are using the w64devkit integrated terminal, then run 'make' at the KoboldCpp source folder. This will create the .dll files.
- If you want to generate the .exe file, make sure you have the python module PyInstaller installed with pip ('pip install PyInstaller').
@@ -34,19 +44,13 @@ For more information, be sure to run the program with the `--help` flag.
- Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
- Make the KoboldCPP project using the instructions above.
-## OSX and Linux
-- You will have to compile your binaries from source. A makefile is provided, simply run `make`
-- If you want you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
-- Alternatively, if you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
-- For a full featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
-- For Arch Linux: Install `cblas` `openblas` and `clblast`.
-- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
-- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`
-- Note: Many OSX users have found that the using Accelerate is actually faster than OpenBLAS. To try, you may wish to run with `--noblas` and compare speeds.
+## Android (Termux) Alternative method
+- See https://github.com/ggerganov/llama.cpp/pull/1828/files
+## CuBLAS?
+- You can attempt a CuBLAS build with LLAMA_CUBLAS=1 or using the provided CMake file (best for visual studio users). Note that support for CuBLAS is limited.
## Considerations
-- ZERO or MINIMAL changes as possible to parent repo files - do not move their function declarations elsewhere! We want to be able to update the repo and pull any changes automatically.
-- No dynamic memory allocation! Setup structs with FIXED (known) shapes and sizes for ALL output fields. Python will ALWAYS provide the memory, we just write to it.
- For Windows: No installation, single file executable, (It Just Works)
- Since v1.0.6, requires libopenblas, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
- Since v1.15, requires CLBlast if enabled, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.

Remote-Link.cmd (new file)

@@ -0,0 +1,2 @@
+curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-windows-amd64.exe -o cloudflared.exe
+cloudflared.exe tunnel --url localhost:5001
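For reference, here is a rough Python equivalent of what this two-line script does: download the cloudflared binary and open a quick tunnel to the default KoboldCpp port. This sketch is hypothetical and not part of the commit.

```python
# Hypothetical sketch mirroring Remote-Link.cmd: fetch the Windows cloudflared
# binary and expose the local KoboldCpp server (default port 5001) via a tunnel.
import subprocess
import urllib.request

CLOUDFLARED_URL = ("https://github.com/cloudflare/cloudflared/releases/"
                   "latest/download/cloudflared-windows-amd64.exe")

urllib.request.urlretrieve(CLOUDFLARED_URL, "cloudflared.exe")
subprocess.run(["cloudflared.exe", "tunnel", "--url", "localhost:5001"])
```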


@@ -18,7 +18,7 @@ struct load_model_inputs
const bool unban_tokens;
const int clblast_info = 0;
const int blasbatchsize = 512;
-const bool debugmode;
+const int debugmode = 0;
const int forceversion = 0;
const int gpulayers = 0;
};


@@ -68,7 +68,7 @@ static int n_batch = 8;
static bool useSmartContext = false;
static bool unbanTokens = false;
static int blasbatchsize = 512;
-static bool debugmode = false;
+static int debugmode = 0; //-1 = hide all, 0 = normal, 1 = showall
static std::string modelname;
static std::vector<gpt_vocab::id> last_n_tokens;
static std::vector<gpt_vocab::id> current_context_tokens;
@@ -118,7 +118,7 @@ llama_token sample_token(llama_token_data_array * candidates, std::mt19937 & rng
std::discrete_distribution<> dist(probs.begin(), probs.end());
int idx = dist(rng);
-if(debugmode)
+if(debugmode==1)
{
top_picks.push_back(candidates->data[idx]);
for (size_t i = 0; (i < candidates->size && i<4); ++i)
@@ -981,9 +981,12 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
printf("Bad format!");
}
-printf("\n");
+if(debugmode!=-1)
+{
+printf("\n");
+}
-if (debugmode)
+if (debugmode==1)
{
std::string outstr = "";
printf("\n[Debug: Dump Input Tokens, format: %d]\n", file_format);
@@ -1013,7 +1016,7 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
// predict
unsigned int embdsize = embd.size();
//print progress
-if (!startedsampling)
+if (!startedsampling && debugmode!=-1)
{
printf("\rProcessing Prompt%s (%d / %d tokens)", (blasmode ? " [BLAS]" : ""), input_consumed, embd_inp.size());
}
@@ -1229,11 +1232,11 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
concat_output += tokenizedstr;
}
-if (startedsampling)
+if (startedsampling && debugmode!=-1)
{
printf("\rGenerating (%d / %d tokens)", (params.n_predict - remaining_tokens), params.n_predict);
}
-if(debugmode && top_picks.size()>0)
+if(debugmode==1 && top_picks.size()>0)
{
printf(" [");
bool firstloop = true;
@@ -1263,7 +1266,10 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
{
stopper_unused_tokens = remaining_tokens;
remaining_tokens = 0;
-printf("\n(Stop sequence triggered: <%s>)", matched.c_str());
+if(debugmode!=-1)
+{
+printf("\n(Stop sequence triggered: <%s>)", matched.c_str());
+}
break;
}
}
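The C++ changes above replace the old boolean with a three-level `debugmode` (-1 = hide all, 0 = normal, 1 = show all) and gate each `printf` on the appropriate level. Below is a minimal sketch of that convention, written in Python rather than the project's C++ and not taken from the repository:

```python
# Illustrative only: the -1 / 0 / 1 debugmode levels used in the diff above.
DEBUG_HIDE_ALL, DEBUG_NORMAL, DEBUG_SHOW_ALL = -1, 0, 1

def log_progress(msg: str, debugmode: int) -> None:
    # Progress lines ("Processing Prompt", "Generating", stop-sequence notices)
    # stay visible unless all output is suppressed.
    if debugmode != DEBUG_HIDE_ALL:
        print(msg)

def log_verbose(msg: str, debugmode: int) -> None:
    # Verbose diagnostics (input token dumps, top picks) appear only in full debug mode.
    if debugmode == DEBUG_SHOW_ALL:
        print(msg)
```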

File diff suppressed because one or more lines are too long


@@ -26,7 +26,7 @@ class load_model_inputs(ctypes.Structure):
("unban_tokens", ctypes.c_bool),
("clblast_info", ctypes.c_int),
("blasbatchsize", ctypes.c_int),
-("debugmode", ctypes.c_bool),
+("debugmode", ctypes.c_int),
("forceversion", ctypes.c_int),
("gpulayers", ctypes.c_int)]
@@ -224,7 +224,8 @@ maxctx = 2048
maxlen = 256
modelbusy = False
defaultport = 5001
-KcppVersion = "1.31"
+KcppVersion = "1.31.1"
+showdebug = True
class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
sys_version = ""
@@ -238,6 +239,12 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
def __call__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
+def log_message(self, format, *args):
+global showdebug
+if showdebug:
+super().log_message(format, *args)
+pass
async def generate_text(self, newprompt, genparams, basic_api_flag, stream_flag):
def run_blocking():
@@ -281,7 +288,8 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
else:
recvtxt = run_blocking()
-utfprint("\nOutput: " + recvtxt)
+if args.debugmode!=-1:
+utfprint("\nOutput: " + recvtxt)
res = {"data": {"seqs":[recvtxt]}} if basic_api_flag else {"results": [{"text": recvtxt}]}
@@ -414,7 +422,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
self.send_response(200)
self.end_headers()
self.wfile.write(json.dumps({"success": ("true" if ag else "false")}).encode())
-print("Generation Aborted")
+print("\nGeneration Aborted")
modelbusy = False
return
@@ -453,7 +461,8 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
utfprint("Body Err: " + str(body))
return self.send_response(503)
-utfprint("\nInput: " + json.dumps(genparams))
+if args.debugmode!=-1:
+utfprint("\nInput: " + json.dumps(genparams))
modelbusy = True
@@ -714,10 +723,15 @@ def main(args):
sys.exit(2)
if args.hordeconfig and args.hordeconfig[0]!="":
-global friendlymodelname, maxlen
+global friendlymodelname, maxlen, showdebug
friendlymodelname = "koboldcpp/"+args.hordeconfig[0]
if len(args.hordeconfig) > 1:
maxlen = int(args.hordeconfig[1])
+if args.debugmode == 0:
+args.debugmode = -1
+if args.debugmode != 1:
+showdebug = False
if args.highpriority:
print("Setting process to Higher Priority - Use Caution")
@@ -839,7 +853,7 @@ if __name__ == '__main__':
parser.add_argument("--nommap", help="If set, do not use mmap to load newer models", action='store_true')
parser.add_argument("--usemlock", help="For Apple Systems. Force system to keep model in RAM rather than swapping or compressing", action='store_true')
parser.add_argument("--noavx2", help="Do not use AVX2 instructions, a slower compatibility mode for older devices. Does not work with --clblast.", action='store_true')
-parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_true')
+parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_const', const=1, default=0)
parser.add_argument("--skiplauncher", help="Doesn't display or use the new GUI launcher.", action='store_true')
parser.add_argument("--hordeconfig", help="Sets the display model name to something else, for easy use on AI Horde. An optional second parameter sets the horde max gen length.",metavar=('[hordename]', '[hordelength]'), nargs='+')
compatgroup = parser.add_mutually_exclusive_group()
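With `store_const`, the parsed `--debugmode` value becomes an integer level rather than a boolean: omitting the flag yields 0, passing it yields 1, and the startup logic above can later demote 0 to -1 to hide output when a hordeconfig is supplied. A standalone sketch of the argparse pattern (illustrative parser, not the project's full argument list):

```python
import argparse

parser = argparse.ArgumentParser()
# Same pattern as the change above: the flag yields an int level, not a bool.
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.",
                    action='store_const', const=1, default=0)

print(parser.parse_args([]).debugmode)               # 0 -> normal output
print(parser.parse_args(["--debugmode"]).debugmode)  # 1 -> show all debug info
# -1 (hide all) is never set directly from the command line; it is assigned
# during startup when a hordeconfig is supplied and debugmode was left at 0.
```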