various debug logging improvements

Concedo 2023-06-18 15:24:58 +08:00
parent dc3472eb58
commit 8775dd99f4
6 changed files with 66 additions and 40 deletions

View file

@@ -13,14 +13,24 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin
- To run, execute **koboldcpp.exe** or drag and drop your quantized `ggml_model.bin` file onto the .exe, and then connect with Kobold or Kobold Lite. If you're not on Windows, then run the script **KoboldCpp.py** after compiling the libraries.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line `koboldcpp.exe [ggml_model.bin] [port]`. For info, please check `koboldcpp.exe --help`
- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
- Big context still too slow? Try the `--smartcontext` flag to reduce prompt processing frequency. Also, you can try running on your GPU with CLBlast via the `--useclblast` flag for a speedup
- Want even more speedup? Combine `--useclblast` with `--gpulayers` to offload entire layers to the GPU! **Much faster, but uses more VRAM**. Experiment to determine the number of layers to offload.
- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
For more information, be sure to run the program with the `--help` flag.
## OSX and Linux
- You will have to compile your binaries from source. A makefile is provided; simply run `make`
- If you want, you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
- Alternatively, you can link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`; for this, you will need to obtain and link the OpenCL and CLBlast libraries.
- For a full-featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
- For Arch Linux: Install `cblas`, `openblas`, and `clblast`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- After all binaries are built, you can run the Python script with the command `koboldcpp.py [ggml_model.bin] [port]`
- Note: Many OSX users have found that using Accelerate is actually faster than OpenBLAS. To try it, you may wish to run with `--noblas` and compare speeds.
## Compiling on Windows
- If you want to compile your binaries from source on Windows, the easiest way is:
- You're encouraged to use the released .exe, but if you want to compile your binaries from source on Windows, the easiest way is:
- Use the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla" one, not the i686 or other variants; if you try them, they will conflict with the precompiled libs!
- Make sure you are using the w64devkit integrated terminal, then run 'make' in the KoboldCpp source folder. This will create the .dll files.
- If you want to generate the .exe file, make sure you have the Python module PyInstaller installed with pip ('pip install PyInstaller').
@@ -34,19 +44,13 @@ For more information, be sure to run the program with the `--help` flag.
- Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
- Make the KoboldCPP project using the instructions above.
## OSX and Linux
- You will have to compile your binaries from source. A makefile is provided; simply run `make`
- If you want, you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
- Alternatively, you can link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`; for this, you will need to obtain and link the OpenCL and CLBlast libraries.
- For a full-featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
- For Arch Linux: Install `cblas`, `openblas`, and `clblast`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- After all binaries are built, you can run the Python script with the command `koboldcpp.py [ggml_model.bin] [port]`
- Note: Many OSX users have found that using Accelerate is actually faster than OpenBLAS. To try it, you may wish to run with `--noblas` and compare speeds.
## Android (Termux) Alternative method
- See https://github.com/ggerganov/llama.cpp/pull/1828/files
## CuBLAS?
- You can attempt a CuBLAS build with `LLAMA_CUBLAS=1` or by using the provided CMake file (best for Visual Studio users). Note that support for CuBLAS is limited.
## Considerations
- ZERO or MINIMAL changes to parent repo files - do not move their function declarations elsewhere! We want to be able to update the repo and pull any changes automatically.
- No dynamic memory allocation! Set up structs with FIXED (known) shapes and sizes for ALL output fields. Python will ALWAYS provide the memory; we just write to it.
- For Windows: No installation, single-file executable (It Just Works)
- Since v1.0.6, requires libopenblas; the prebuilt Windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
- Since v1.15, requires CLBlast if enabled; the prebuilt Windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.

Remote-Link.cmd Normal file
View file

@@ -0,0 +1,2 @@
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-windows-amd64.exe -o cloudflared.exe
cloudflared.exe tunnel --url localhost:5001
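Remote-Link.cmd just fetches cloudflared and opens a quick tunnel to the default KoboldCpp port. For reference, a rough Python equivalent of the same two steps is sketched below; the download URL and the `tunnel --url localhost:5001` invocation come from the script above, while the file handling around them is purely illustrative.

```python
# Rough Python equivalent of Remote-Link.cmd (illustrative sketch only).
import os
import subprocess
import urllib.request

CLOUDFLARED_URL = ("https://github.com/cloudflare/cloudflared/releases/"
                   "latest/download/cloudflared-windows-amd64.exe")
CLOUDFLARED_EXE = "cloudflared.exe"

# Step 1: download the cloudflared binary (the curl -L step above).
if not os.path.exists(CLOUDFLARED_EXE):
    urllib.request.urlretrieve(CLOUDFLARED_URL, CLOUDFLARED_EXE)

# Step 2: open a quick tunnel pointing at the local KoboldCpp server.
subprocess.run([CLOUDFLARED_EXE, "tunnel", "--url", "localhost:5001"])
```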

View file

@@ -18,7 +18,7 @@ struct load_model_inputs
const bool unban_tokens;
const int clblast_info = 0;
const int blasbatchsize = 512;
const bool debugmode;
const int debugmode = 0;
const int forceversion = 0;
const int gpulayers = 0;
};

View file

@@ -68,7 +68,7 @@ static int n_batch = 8;
static bool useSmartContext = false;
static bool unbanTokens = false;
static int blasbatchsize = 512;
static bool debugmode = false;
static int debugmode = 0; //-1 = hide all, 0 = normal, 1 = showall
static std::string modelname;
static std::vector<gpt_vocab::id> last_n_tokens;
static std::vector<gpt_vocab::id> current_context_tokens;
@@ -118,7 +118,7 @@ llama_token sample_token(llama_token_data_array * candidates, std::mt19937 & rng
std::discrete_distribution<> dist(probs.begin(), probs.end());
int idx = dist(rng);
if(debugmode)
if(debugmode==1)
{
top_picks.push_back(candidates->data[idx]);
for (size_t i = 0; (i < candidates->size && i<4); ++i)
@@ -981,9 +981,12 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
printf("Bad format!");
}
if(debugmode!=-1)
{
printf("\n");
}
if (debugmode)
if (debugmode==1)
{
std::string outstr = "";
printf("\n[Debug: Dump Input Tokens, format: %d]\n", file_format);
@@ -1013,7 +1016,7 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
// predict
unsigned int embdsize = embd.size();
//print progress
if (!startedsampling)
if (!startedsampling && debugmode!=-1)
{
printf("\rProcessing Prompt%s (%d / %d tokens)", (blasmode ? " [BLAS]" : ""), input_consumed, embd_inp.size());
}
@@ -1229,11 +1232,11 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
concat_output += tokenizedstr;
}
if (startedsampling)
if (startedsampling && debugmode!=-1)
{
printf("\rGenerating (%d / %d tokens)", (params.n_predict - remaining_tokens), params.n_predict);
}
if(debugmode && top_picks.size()>0)
if(debugmode==1 && top_picks.size()>0)
{
printf(" [");
bool firstloop = true;
@@ -1263,7 +1266,10 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
{
stopper_unused_tokens = remaining_tokens;
remaining_tokens = 0;
if(debugmode!=-1)
{
printf("\n(Stop sequence triggered: <%s>)", matched.c_str());
}
break;
}
}
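Taken together, the changes in this file turn `debugmode` into a three-level verbosity switch: -1 hides even the routine progress lines, 0 keeps the normal output, and 1 additionally prints the token dump and top-pick details. A minimal sketch of the same gating pattern, restated in Python for clarity (the function names here are illustrative, not taken from the codebase):

```python
# Illustrative restatement of the debugmode gating used in the C++ changes above.
# -1 = hide all, 0 = normal output, 1 = show extra debug output.

def print_progress(msg: str, debugmode: int) -> None:
    if debugmode != -1:   # normal progress lines are suppressed only at -1
        print(msg)

def print_debug(msg: str, debugmode: int) -> None:
    if debugmode == 1:    # token dumps and top picks appear only at 1
        print(msg)

print_progress("Processing Prompt (8 / 512 tokens)", debugmode=0)   # shown
print_progress("Processing Prompt (8 / 512 tokens)", debugmode=-1)  # hidden
print_debug("[Debug: Dump Input Tokens]", debugmode=0)              # hidden
print_debug("[Debug: Dump Input Tokens]", debugmode=1)              # shown
```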

File diff suppressed because one or more lines are too long

View file

@@ -26,7 +26,7 @@ class load_model_inputs(ctypes.Structure):
("unban_tokens", ctypes.c_bool),
("clblast_info", ctypes.c_int),
("blasbatchsize", ctypes.c_int),
("debugmode", ctypes.c_bool),
("debugmode", ctypes.c_int),
("forceversion", ctypes.c_int),
("gpulayers", ctypes.c_int)]
@@ -224,7 +224,8 @@ maxctx = 2048
maxlen = 256
modelbusy = False
defaultport = 5001
KcppVersion = "1.31"
KcppVersion = "1.31.1"
showdebug = True
class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
sys_version = ""
@@ -238,6 +239,12 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
def __call__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def log_message(self, format, *args):
global showdebug
if showdebug:
super().log_message(format, *args)
pass
async def generate_text(self, newprompt, genparams, basic_api_flag, stream_flag):
def run_blocking():
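The `log_message` override above is what lets the new `showdebug` flag silence the per-request lines that `http.server` normally writes for every GET/POST. A self-contained sketch of the same idea outside the KoboldCpp handler (the port and handler name below are illustrative):

```python
# Standalone sketch: suppressing http.server request logging, mirroring the
# log_message override added above.
import http.server

showdebug = False   # when False, the usual '"GET / HTTP/1.1" 200' lines are hidden

class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        if showdebug:
            super().log_message(format, *args)

if __name__ == "__main__":
    with http.server.HTTPServer(("localhost", 8000), QuietHandler) as httpd:
        httpd.serve_forever()
```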
@@ -281,6 +288,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
else:
recvtxt = run_blocking()
if args.debugmode!=-1:
utfprint("\nOutput: " + recvtxt)
res = {"data": {"seqs":[recvtxt]}} if basic_api_flag else {"results": [{"text": recvtxt}]}
@@ -414,7 +422,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
self.send_response(200)
self.end_headers()
self.wfile.write(json.dumps({"success": ("true" if ag else "false")}).encode())
print("Generation Aborted")
print("\nGeneration Aborted")
modelbusy = False
return
@@ -453,6 +461,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
utfprint("Body Err: " + str(body))
return self.send_response(503)
if args.debugmode!=-1:
utfprint("\nInput: " + json.dumps(genparams))
modelbusy = True
@@ -714,10 +723,15 @@ def main(args):
sys.exit(2)
if args.hordeconfig and args.hordeconfig[0]!="":
global friendlymodelname, maxlen
global friendlymodelname, maxlen, showdebug
friendlymodelname = "koboldcpp/"+args.hordeconfig[0]
if len(args.hordeconfig) > 1:
maxlen = int(args.hordeconfig[1])
if args.debugmode == 0:
args.debugmode = -1
if args.debugmode != 1:
showdebug = False
if args.highpriority:
print("Setting process to Higher Priority - Use Caution")
@@ -839,7 +853,7 @@ if __name__ == '__main__':
parser.add_argument("--nommap", help="If set, do not use mmap to load newer models", action='store_true')
parser.add_argument("--usemlock", help="For Apple Systems. Force system to keep model in RAM rather than swapping or compressing", action='store_true')
parser.add_argument("--noavx2", help="Do not use AVX2 instructions, a slower compatibility mode for older devices. Does not work with --clblast.", action='store_true')
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_true')
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_const', const=1, default=0)
parser.add_argument("--skiplauncher", help="Doesn't display or use the new GUI launcher.", action='store_true')
parser.add_argument("--hordeconfig", help="Sets the display model name to something else, for easy use on AI Horde. An optional second parameter sets the horde max gen length.",metavar=('[hordename]', '[hordelength]'), nargs='+')
compatgroup = parser.add_mutually_exclusive_group()
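The `--debugmode` flag switches from `store_true` to `store_const` so the parsed value is already an integer: 0 when omitted, 1 when passed, with -1 reserved for the horde adjustment earlier in the commit. A minimal standalone demonstration of that parsing behaviour:

```python
# Standalone demonstration of the new --debugmode parsing.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--debugmode",
                    help="Shows additional debug info in the terminal.",
                    action='store_const', const=1, default=0)

print(parser.parse_args([]).debugmode)                # 0 -> normal output
print(parser.parse_args(["--debugmode"]).debugmode)   # 1 -> show extra debug info
```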