various debug logging improvements

Concedo 2023-06-18 15:24:58 +08:00
parent dc3472eb58
commit 8775dd99f4
6 changed files with 66 additions and 40 deletions

View file

@@ -13,14 +13,24 @@ What does it mean? You get llama.cpp with a fancy UI, persistent stories, editin
- To run, execute **koboldcpp.exe** or drag and drop your quantized `ggml_model.bin` file onto the .exe, and then connect with Kobold or Kobold Lite. If you're not on Windows, then run the script **KoboldCpp.py** after compiling the libraries.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line `koboldcpp.exe [ggml_model.bin] [port]`. For info, please check `koboldcpp.exe --help`
- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
- Big context still too slow? Try the `--smartcontext` flag to reduce prompt processing frequency. Also, you can try running on your GPU with CLBlast via the `--useclblast` flag for a speedup
- Want even more speedup? Combine `--useclblast` with `--gpulayers` to offload entire layers to the GPU! **Much faster, but uses more VRAM**. Experiment to determine the number of layers to offload.
- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
For more information, be sure to run the program with the `--help` flag.
## OSX and Linux
- You will have to compile your binaries from source. A makefile is provided; simply run `make`
- If you want, you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
- Alternatively, you can link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`; for this, you will need to obtain and link the OpenCL and CLBlast libraries.
- For a full-featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
- For Arch Linux: Install `cblas`, `openblas`, and `clblast`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- After all binaries are built, you can run the Python script with the command `koboldcpp.py [ggml_model.bin] [port]`
- Note: Many OSX users have found that using Accelerate is actually faster than OpenBLAS. To try it, you may wish to run with `--noblas` and compare speeds.
## Compiling on Windows
- If you want to compile your binaries from source on Windows, the easiest way is:
- You're encouraged to use the released .exe, but if you want to compile your binaries from source on Windows, the easiest way is:
- Use the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla" one, not the i686 or other variants; if you try them, they will conflict with the precompiled libs!
- Make sure you are using the w64devkit integrated terminal, then run 'make' in the KoboldCpp source folder. This will create the .dll files.
- If you want to generate the .exe file, make sure you have the Python module PyInstaller installed with pip ('pip install PyInstaller').
@@ -34,19 +44,13 @@ For more information, be sure to run the program with the `--help` flag.
- Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
- Make the KoboldCPP project using the instructions above.
## OSX and Linux
- You will have to compile your binaries from source. A makefile is provided; simply run `make`
- If you want, you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
- Alternatively, you can link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`; for this, you will need to obtain and link the OpenCL and CLBlast libraries.
- For a full-featured build, do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1`
- For Arch Linux: Install `cblas`, `openblas`, and `clblast`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- After all binaries are built, you can run the Python script with the command `koboldcpp.py [ggml_model.bin] [port]`
- Note: Many OSX users have found that using Accelerate is actually faster than OpenBLAS. To try it, you may wish to run with `--noblas` and compare speeds.
## Android (Termux) Alternative method
- See https://github.com/ggerganov/llama.cpp/pull/1828/files
## CuBLAS?
- You can attempt a CuBLAS build with `LLAMA_CUBLAS=1` or by using the provided CMake file (best for Visual Studio users). Note that support for CuBLAS is limited.
## Considerations
- ZERO or MINIMAL changes to parent repo files - do not move their function declarations elsewhere! We want to be able to update the repo and pull any changes automatically.
- No dynamic memory allocation! Set up structs with FIXED (known) shapes and sizes for ALL output fields. Python will ALWAYS provide the memory; we just write to it.
- For Windows: No installation, single-file executable (It Just Works)
- Since v1.0.6, requires libopenblas; the prebuilt Windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
- Since v1.15, requires CLBlast if enabled; the prebuilt Windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.

Remote-Link.cmd Normal file
View file

@@ -0,0 +1,2 @@
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-windows-amd64.exe -o cloudflared.exe
cloudflared.exe tunnel --url localhost:5001
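Remote-Link.cmd just fetches cloudflared and opens a quick tunnel to the default KoboldCpp port. For reference, a rough Python equivalent of the same two steps is sketched below; the download URL and the `tunnel --url localhost:5001` invocation come from the script above, while the file handling around them is purely illustrative.

```python
# Rough Python equivalent of Remote-Link.cmd (illustrative sketch only).
import os
import subprocess
import urllib.request

CLOUDFLARED_URL = ("https://github.com/cloudflare/cloudflared/releases/"
                   "latest/download/cloudflared-windows-amd64.exe")
CLOUDFLARED_EXE = "cloudflared.exe"

# Step 1: download the cloudflared binary (the curl -L step above).
if not os.path.exists(CLOUDFLARED_EXE):
    urllib.request.urlretrieve(CLOUDFLARED_URL, CLOUDFLARED_EXE)

# Step 2: open a quick tunnel pointing at the local KoboldCpp server.
subprocess.run([CLOUDFLARED_EXE, "tunnel", "--url", "localhost:5001"])
```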

View file

@@ -18,7 +18,7 @@ struct load_model_inputs
const bool unban_tokens;
const int clblast_info = 0;
const int blasbatchsize = 512;
const bool debugmode;
const int debugmode = 0;
const int forceversion = 0;
const int gpulayers = 0;
};

View file

@@ -68,7 +68,7 @@ static int n_batch = 8;
static bool useSmartContext = false;
static bool unbanTokens = false;
static int blasbatchsize = 512;
static bool debugmode = false;
static int debugmode = 0; //-1 = hide all, 0 = normal, 1 = showall
static std::string modelname;
static std::vector<gpt_vocab::id> last_n_tokens;
static std::vector<gpt_vocab::id> current_context_tokens;
@@ -118,7 +118,7 @@ llama_token sample_token(llama_token_data_array * candidates, std::mt19937 & rng
std::discrete_distribution<> dist(probs.begin(), probs.end());
int idx = dist(rng);
if(debugmode)
if(debugmode==1)
{
top_picks.push_back(candidates->data[idx]);
for (size_t i = 0; (i < candidates->size && i<4); ++i)
@@ -981,9 +981,12 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
printf("Bad format!");
}
if(debugmode!=-1)
{
printf("\n");
}
if (debugmode)
if (debugmode==1)
{
std::string outstr = "";
printf("\n[Debug: Dump Input Tokens, format: %d]\n", file_format);
@@ -1013,7 +1016,7 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
// predict
unsigned int embdsize = embd.size();
//print progress
if (!startedsampling)
if (!startedsampling && debugmode!=-1)
{
printf("\rProcessing Prompt%s (%d / %d tokens)", (blasmode ? " [BLAS]" : ""), input_consumed, embd_inp.size());
}
@@ -1229,11 +1232,11 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
concat_output += tokenizedstr;
}
if (startedsampling)
if (startedsampling && debugmode!=-1)
{
printf("\rGenerating (%d / %d tokens)", (params.n_predict - remaining_tokens), params.n_predict);
}
if(debugmode && top_picks.size()>0)
if(debugmode==1 && top_picks.size()>0)
{
printf(" [");
bool firstloop = true;
@@ -1263,7 +1266,10 @@ generation_outputs gpttype_generate(const generation_inputs inputs, generation_o
{
stopper_unused_tokens = remaining_tokens;
remaining_tokens = 0;
if(debugmode!=-1)
{
printf("\n(Stop sequence triggered: <%s>)", matched.c_str());
}
break;
}
}
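Taken together, the changes in this file turn `debugmode` into a three-level verbosity switch: -1 hides even the routine progress lines, 0 keeps the normal output, and 1 additionally prints the token dump and top-pick details. A minimal sketch of the same gating pattern, restated in Python for clarity (the function names here are illustrative, not taken from the codebase):

```python
# Illustrative restatement of the debugmode gating used in the C++ changes above.
# -1 = hide all, 0 = normal output, 1 = show extra debug output.

def print_progress(msg: str, debugmode: int) -> None:
    if debugmode != -1:   # normal progress lines are suppressed only at -1
        print(msg)

def print_debug(msg: str, debugmode: int) -> None:
    if debugmode == 1:    # token dumps and top picks appear only at 1
        print(msg)

print_progress("Processing Prompt (8 / 512 tokens)", debugmode=0)   # shown
print_progress("Processing Prompt (8 / 512 tokens)", debugmode=-1)  # hidden
print_debug("[Debug: Dump Input Tokens]", debugmode=0)              # hidden
print_debug("[Debug: Dump Input Tokens]", debugmode=1)              # shown
```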

File diff suppressed because one or more lines are too long

View file

@@ -26,7 +26,7 @@ class load_model_inputs(ctypes.Structure):
("unban_tokens", ctypes.c_bool),
("clblast_info", ctypes.c_int),
("blasbatchsize", ctypes.c_int),
("debugmode", ctypes.c_bool),
("debugmode", ctypes.c_int),
("forceversion", ctypes.c_int),
("gpulayers", ctypes.c_int)]
@@ -224,7 +224,8 @@ maxctx = 2048
maxlen = 256
modelbusy = False
defaultport = 5001
KcppVersion = "1.31"
KcppVersion = "1.31.1"
showdebug = True
class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
sys_version = ""
@@ -238,6 +239,12 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
def __call__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def log_message(self, format, *args):
global showdebug
if showdebug:
super().log_message(format, *args)
pass
async def generate_text(self, newprompt, genparams, basic_api_flag, stream_flag):
def run_blocking():
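The `log_message` override above is what lets the new `showdebug` flag silence the per-request lines that `http.server` normally writes for every GET/POST. A self-contained sketch of the same idea outside the KoboldCpp handler (the port and handler name below are illustrative):

```python
# Standalone sketch: suppressing http.server request logging, mirroring the
# log_message override added above.
import http.server

showdebug = False   # when False, the usual '"GET / HTTP/1.1" 200' lines are hidden

class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        if showdebug:
            super().log_message(format, *args)

if __name__ == "__main__":
    with http.server.HTTPServer(("localhost", 8000), QuietHandler) as httpd:
        httpd.serve_forever()
```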
@@ -281,6 +288,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
else:
recvtxt = run_blocking()
if args.debugmode!=-1:
utfprint("\nOutput: " + recvtxt)
res = {"data": {"seqs":[recvtxt]}} if basic_api_flag else {"results": [{"text": recvtxt}]}
@@ -414,7 +422,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
self.send_response(200)
self.end_headers()
self.wfile.write(json.dumps({"success": ("true" if ag else "false")}).encode())
print("Generation Aborted")
print("\nGeneration Aborted")
modelbusy = False
return
@@ -453,6 +461,7 @@ class ServerRequestHandler(http.server.SimpleHTTPRequestHandler):
utfprint("Body Err: " + str(body))
return self.send_response(503)
if args.debugmode!=-1:
utfprint("\nInput: " + json.dumps(genparams))
modelbusy = True
@@ -714,10 +723,15 @@ def main(args):
sys.exit(2)
if args.hordeconfig and args.hordeconfig[0]!="":
global friendlymodelname, maxlen
global friendlymodelname, maxlen, showdebug
friendlymodelname = "koboldcpp/"+args.hordeconfig[0]
if len(args.hordeconfig) > 1:
maxlen = int(args.hordeconfig[1])
if args.debugmode == 0:
args.debugmode = -1
if args.debugmode != 1:
showdebug = False
if args.highpriority:
print("Setting process to Higher Priority - Use Caution")
@@ -839,7 +853,7 @@ if __name__ == '__main__':
parser.add_argument("--nommap", help="If set, do not use mmap to load newer models", action='store_true')
parser.add_argument("--usemlock", help="For Apple Systems. Force system to keep model in RAM rather than swapping or compressing", action='store_true')
parser.add_argument("--noavx2", help="Do not use AVX2 instructions, a slower compatibility mode for older devices. Does not work with --clblast.", action='store_true')
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_true')
parser.add_argument("--debugmode", help="Shows additional debug info in the terminal.", action='store_const', const=1, default=0)
parser.add_argument("--skiplauncher", help="Doesn't display or use the new GUI launcher.", action='store_true')
parser.add_argument("--hordeconfig", help="Sets the display model name to something else, for easy use on AI Horde. An optional second parameter sets the horde max gen length.",metavar=('[hordename]', '[hordelength]'), nargs='+')
compatgroup = parser.add_mutually_exclusive_group()
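The `--debugmode` flag switches from `store_true` to `store_const` so the parsed value is already an integer: 0 when omitted, 1 when passed, with -1 reserved for the horde adjustment earlier in the commit. A minimal standalone demonstration of that parsing behaviour:

```python
# Standalone demonstration of the new --debugmode parsing.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--debugmode",
                    help="Shows additional debug info in the terminal.",
                    action='store_const', const=1, default=0)

print(parser.parse_args([]).debugmode)                # 0 -> normal output
print(parser.parse_args(["--debugmode"]).debugmode)   # 1 -> show extra debug info
```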