Import redpajama.cpp into COSMO.

This is the relevant commit: bfa6466199
2025-08-08 10:50:28 +00:00 · 2023-05-11 06:50:36 -04:00 · 52e471331c
commit 52e471331c
parent 3083335d07
16 changed files with 5913 additions and 17 deletions
--- a/third_party/radpajama/README.cosmo
+++ b/third_party/radpajama/README.cosmo
@@ -1,6 +1,6 @@
 DESCRIPTION

-  ggml is a machine learning library useful for LLM inference on CPUs
+  radpajama is a port of ggml for the open source Red Pajama LLM. It started as a fork of redpajama.cpp from Together Computer.

 LICENSE

@@ -8,22 +8,12 @@ LICENSE

 ORIGIN

-  https://github.com/ggerganov/llama.cpp
-  commit 0b2da20538d01926b77ea237dd1c930c4d20b686
-  Author: Stephan Walter <stephan@walter.name>
-  Date:   Wed Apr 26 20:26:42 2023 +0000
-  ggml : slightly faster AVX2 implementation for Q5 (#1197)
+  github.com/togethercomputer/redpajama.cpp/
+  commit bfa6466199b8ef92185ecb72e2a550e12baf6602
+  Author: Szhangce <czhang@cs.stanford.edu>
+  Date:   Tue May 9 00:50:22 2023 +0200
+  radpajama : Update README.md 

 LOCAL CHANGES

-  - Make it possible for loaded prompts to be cached to disk
-  - Introduce -v and --verbose flags
-  - Reduce batch size from 512 to 32
-  - Allow --n_keep to specify a substring of prompt
-  - Don't print stats / diagnostics unless -v is passed
-  - Reduce --top_p default from 0.95 to 0.70
-  - Change --reverse-prompt to no longer imply --interactive
-  - Permit --reverse-prompt specifying custom EOS if non-interactive
-  - Refactor headers per cosmo convention
-  - Replace code like 'ggjt' with READ32BE("ggjt")
-  - Remove C++ exceptions; use Die() function instead
+  - Updated headers for COSMO build.
--- a/third_party/radpajama/README.md
+++ b/third_party/radpajama/README.md
@@ -0,0 +1,143 @@
+# ggml Support for RedPajama Model
+
+## Acknowledgement
+
+We greatly appreciate the work in the [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp) fork. Our support for the RedPajama model is mainly based on that implementation. We extended the model configuration and fixed a bug that occurred when the use_parallel_residual flag was set to False in the original implementation. We also added chat model support for RedPajama.
+
+## Usage:
+
+### RedPajama Chat model:
+
+- Make the code:
+
+        make redpajama-chat quantize-gptneox
+
+
+- Prepare the RedPajama model (f16 and q4_0) for ggml:
+
+        bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
+
+- Run RedPajama chat model (fp16):
+
+        ./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin \
+        -c 2048 \
+        -b 128 \
+        -n 1 \
+        -t 8 \
+        --instruct \
+        --color \
+        --top_k 30 \
+        --top_p 0.95 \
+        --temp 0.8 \
+        --repeat_last_n 3 \
+        --repeat_penalty 1.1 \
+        --seed 0
+
+    Note that you may need to install torch and transformers to run the above scripts, e.g.:
+        
+        pip install torch==2.0.0
+        pip install transformers==4.28.1
+
+
+- Run RedPajama chat model (q4_0):
+
+        ./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-q4_0.bin \
+        -c 2048 \
+        -b 128 \
+        -n 1 \
+        -t 8 \
+        --instruct \
+        --color \
+        --top_k 30 \
+        --top_p 0.95 \
+        --temp 0.8 \
+        --repeat_last_n 3 \
+        --repeat_penalty 1.1 \
+        --seed 0
+
+- Run other quantized versions of the RedPajama Chat model (make sure the f16 model is prepared before you run this):
+
+  - Build the quantization tool if you have not already:
+
+        make quantize-gptneox
+
+  - Generate the quantized model. The supported types are q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to use q4_1, run the following conversion:
+
+        python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin --quantize-output-type q4_1
+
+  - Then you can chat with the quantized model:
+
+        ./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-q4_1.bin \
+        -c 2048 \
+        -b 128 \
+        -n 1 \
+        -t 8 \
+        --instruct \
+        --color \
+        --top_k 30 \
+        --top_p 0.95 \
+        --temp 0.8 \
+        --repeat_last_n 3 \
+        --repeat_penalty 1.1 \
+        --seed 0
+
+
+
+
+### RedPajama Base/Instruct model:
+
+- Make the code:
+
+        make redpajama quantize-gptneox
+
+
+- Prepare the RedPajama Base/Instruct model (f16 and q4_0) for ggml:
+
+        bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
+
+        # Or 
+
+        bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
+
+- Run other quantized versions of the RedPajama Base/Instruct model (make sure the f16 model is prepared before you run this). The supported types are q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to generate the q8_0 model for RedPajama-Base, run the following conversion:
+
+        python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Base-3B-v1-f16.bin --quantize-output-type q8_0
+
+- Run the RedPajama Base/Instruct model (e.g., RedPajama-Instruct q8_0):
+
+        ./redpajama -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-q8_0.bin \
+        -c 2048 \
+        -b 128 \
+        -n 1 \
+        -t 8 \
+        --color \
+        --top_k 30 \
+        --top_p 0.95 \
+        --temp 0.8 \
+        --repeat_last_n 3 \
+        --repeat_penalty 1.1 \
+        --seed 0 \
+        --n_predict 256 \
+        --verbose-prompt \
+        -p "How to schedule a tour to Anfield:"
+
+
+## Attribution
+
+The following files are covered by an MIT license and were taken from:
+
+https://github.com/byroneverson/gptneox.cpp
+
+Thank you Byron.
+
+```
+common-gptneox.cpp	
+copy-gptneox.cpp	
+gptneox.cpp		
+quantize-gptneox.cpp
+common-gptneox.h	
+gptneox-util.h		
+gptneox.h
+convert_gptneox_to_ggml.py
+quantize-gptneox.py
+```
--- a/third_party/radpajama/common-gptneox.cpp
+++ b/third_party/radpajama/common-gptneox.cpp
@@ -0,0 +1,429 @@
+#include "common-gptneox.h"
+
+#include <cassert>
+#include <cstring>
+#include <fstream>
+#include <string>
+#include <iterator>
+#include <algorithm>
+#include <sstream>
+#include <iostream>
+
+#if defined (_WIN32)
+#include <fcntl.h>
+#include <io.h>
+#pragma comment(lib,"kernel32.lib")
+extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
+extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
+extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
+extern "C" __declspec(dllimport) int __stdcall SetConsoleCP(unsigned int wCodePageID);
+extern "C" __declspec(dllimport) int __stdcall SetConsoleOutputCP(unsigned int wCodePageID);
+extern "C" __declspec(dllimport) int __stdcall WideCharToMultiByte(unsigned int CodePage, unsigned long dwFlags,
+                                                                   const wchar_t * lpWideCharStr, int cchWideChar,
+                                                                   char * lpMultiByteStr, int cbMultiByte,
+                                                                   const char * lpDefaultChar, bool * lpUsedDefaultChar);
+#define CP_UTF8 65001
+#endif
+
+bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
+    // determine sensible default number of threads.
+    // std::thread::hardware_concurrency may not be equal to the number of cores, or may return 0.
+#ifdef __linux__
+    std::ifstream cpuinfo("/proc/cpuinfo");
+    params.n_threads = std::count(std::istream_iterator<std::string>(cpuinfo),
+                                  std::istream_iterator<std::string>(),
+                                  std::string("processor"));
+#endif
+    if (params.n_threads == 0) {
+        params.n_threads = std::max(1, (int32_t) std::thread::hardware_concurrency());
+    }
+
+    bool invalid_param = false;
+    std::string arg;
+    gpt_params default_params;
+
+    for (int i = 1; i < argc; i++) {
+        arg = argv[i];
+
+        if (arg == "-s" || arg == "--seed") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.seed = std::stoi(argv[i]);
+        } else if (arg == "-t" || arg == "--threads") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.n_threads = std::stoi(argv[i]);
+        } else if (arg == "-p" || arg == "--prompt") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.prompt = argv[i];
+        } else if (arg == "--session") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.path_session = argv[i];
+        } else if (arg == "-f" || arg == "--file") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            std::ifstream file(argv[i]);
+            if (!file) {
+                fprintf(stderr, "error: failed to open file '%s'\n", argv[i]);
+                invalid_param = true;
+                break;
+            }
+            std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.prompt));
+            if (params.prompt.back() == '\n') {
+                params.prompt.pop_back();
+            }
+        } else if (arg == "-n" || arg == "--n_predict") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.n_predict = std::stoi(argv[i]);
+        } else if (arg == "--top_k") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.top_k = std::stoi(argv[i]);
+        } else if (arg == "-c" || arg == "--ctx_size") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.n_ctx = std::stoi(argv[i]);
+        } else if (arg == "--memory_f32") {
+            params.memory_f16 = false;
+        } else if (arg == "--top_p") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.top_p = std::stof(argv[i]);
+        } else if (arg == "--temp") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.temp = std::stof(argv[i]);
+        } else if (arg == "--tfs") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.tfs_z = std::stof(argv[i]);
+        } else if (arg == "--typical") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.typical_p = std::stof(argv[i]);
+        } else if (arg == "--repeat_last_n") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.repeat_last_n = std::stoi(argv[i]);
+        } else if (arg == "--repeat_penalty") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.repeat_penalty = std::stof(argv[i]);
+        } else if (arg == "--frequency_penalty") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.frequency_penalty = std::stof(argv[i]);
+        } else if (arg == "--presence_penalty") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.presence_penalty = std::stof(argv[i]);
+        } else if (arg == "--mirostat") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.mirostat = std::stoi(argv[i]);
+        } else if (arg == "--mirostat_lr") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.mirostat_eta = std::stof(argv[i]);
+        } else if (arg == "--mirostat_ent") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.mirostat_tau = std::stof(argv[i]);
+        } else if (arg == "-b" || arg == "--batch_size") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.n_batch = std::stoi(argv[i]);
+            params.n_batch = std::min(512, params.n_batch);
+        } else if (arg == "--keep") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.n_keep = std::stoi(argv[i]);
+        } else if (arg == "-m" || arg == "--model") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.model = argv[i];
+        } else if (arg == "--lora") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.lora_adapter = argv[i];
+            params.use_mmap = false;
+        } else if (arg == "--lora-base") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.lora_base = argv[i];
+        } else if (arg == "-i" || arg == "--interactive") {
+            params.interactive = true;
+        } else if (arg == "--embedding") {
+            params.embedding = true;
+        } else if (arg == "--interactive-first") {
+            params.interactive_first = true;
+        } else if (arg == "-ins" || arg == "--instruct") {
+            params.instruct = true;
+        } else if (arg == "--color") {
+            params.use_color = true;
+        } else if (arg == "--mlock") {
+            params.use_mlock = true;
+        } else if (arg == "--no-mmap") {
+            params.use_mmap = false;
+        } else if (arg == "--mtest") {
+            params.mem_test = true;
+        } else if (arg == "--verbose-prompt") {
+            params.verbose_prompt = true;
+        } else if (arg == "-r" || arg == "--reverse-prompt") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.antiprompt.push_back(argv[i]);
+        } else if (arg == "--perplexity") {
+            params.perplexity = true;
+        } else if (arg == "--ignore-eos") {
+            params.logit_bias[gptneox_token_eos()] = -INFINITY;
+        } else if (arg == "--no-penalize-nl") {
+            params.penalize_nl = false;
+        } else if (arg == "-l" || arg == "--logit-bias") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            std::stringstream ss(argv[i]);
+            gptneox_token key;
+            char sign;
+            std::string value_str;
+            try {
+                if (ss >> key && ss >> sign && std::getline(ss, value_str) && (sign == '+' || sign == '-')) {
+                    params.logit_bias[key] = std::stof(value_str) * ((sign == '-') ? -1.0f : 1.0f);
+                } else {
+                    throw std::exception();
+                }
+            } catch (const std::exception &e) {
+                invalid_param = true;
+                break;
+            }
+        } else if (arg == "--n_parts") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.n_parts = std::stoi(argv[i]);
+        } else if (arg == "-h" || arg == "--help") {
+            gpt_print_usage(argc, argv, default_params);
+            exit(0);
+        } else if (arg == "--random-prompt") {
+            params.random_prompt = true;
+        } else if (arg == "--in-prefix") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            params.input_prefix = argv[i];
+        } else {
+            fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
+            gpt_print_usage(argc, argv, default_params);
+            exit(1);
+        }
+    }
+    if (invalid_param) {
+        fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
+        gpt_print_usage(argc, argv, default_params);
+        exit(1);
+    }
+
+    return true;
+}
+
+void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
+    fprintf(stderr, "usage: %s [options]\n", argv[0]);
+    fprintf(stderr, "\n");
+    fprintf(stderr, "options:\n");
+    fprintf(stderr, "  -h, --help            show this help message and exit\n");
+    fprintf(stderr, "  -i, --interactive     run in interactive mode\n");
+    fprintf(stderr, "  --interactive-first   run in interactive mode and wait for input right away\n");
+    fprintf(stderr, "  -ins, --instruct      run in instruction mode\n");
+    fprintf(stderr, "  -r PROMPT, --reverse-prompt PROMPT\n");
+    fprintf(stderr, "                        run in interactive mode and poll user input upon seeing PROMPT (can be\n");
+    fprintf(stderr, "                        specified more than once for multiple prompts).\n");
+    fprintf(stderr, "  --color               colorise output to distinguish prompt and user input from generations\n");
+    fprintf(stderr, "  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for <= 0)\n");
+    fprintf(stderr, "  -t N, --threads N     number of threads to use during computation (default: %d)\n", params.n_threads);
+    fprintf(stderr, "  -p PROMPT, --prompt PROMPT\n");
+    fprintf(stderr, "                        prompt to start generation with (default: empty)\n");
+    fprintf(stderr, "  --session FNAME       file to cache model state in (may be large!) (default: none)\n");
+    fprintf(stderr, "  --random-prompt       start with a randomized prompt.\n");
+    fprintf(stderr, "  --in-prefix STRING    string to prefix user inputs with (default: empty)\n");
+    fprintf(stderr, "  -f FNAME, --file FNAME\n");
+    fprintf(stderr, "                        prompt file to start generation.\n");
+    fprintf(stderr, "  -n N, --n_predict N   number of tokens to predict (default: %d, -1 = infinity)\n", params.n_predict);
+    fprintf(stderr, "  --top_k N             top-k sampling (default: %d, 0 = disabled)\n", params.top_k);
+    fprintf(stderr, "  --top_p N             top-p sampling (default: %.1f, 1.0 = disabled)\n", (double)params.top_p);
+    fprintf(stderr, "  --tfs N               tail free sampling, parameter z (default: %.1f, 1.0 = disabled)\n", (double)params.tfs_z);
+    fprintf(stderr, "  --typical N           locally typical sampling, parameter p (default: %.1f, 1.0 = disabled)\n", (double)params.typical_p);
+    fprintf(stderr, "  --repeat_last_n N     last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)\n", params.repeat_last_n);
+    fprintf(stderr, "  --repeat_penalty N    penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled)\n", (double)params.repeat_penalty);
+    fprintf(stderr, "  --presence_penalty N  repeat alpha presence penalty (default: %.1f, 0.0 = disabled)\n", (double)params.presence_penalty);
+    fprintf(stderr, "  --frequency_penalty N repeat alpha frequency penalty (default: %.1f, 0.0 = disabled)\n", (double)params.frequency_penalty);
+    fprintf(stderr, "  --mirostat N          use Mirostat sampling.\n");
+    fprintf(stderr, "                        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.\n");
+    fprintf(stderr, "                        (default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)\n", params.mirostat);
+    fprintf(stderr, "  --mirostat_lr N       Mirostat learning rate, parameter eta (default: %.1f)\n", (double)params.mirostat_eta);
+    fprintf(stderr, "  --mirostat_ent N      Mirostat target entropy, parameter tau (default: %.1f)\n", (double)params.mirostat_tau);
+    fprintf(stderr, "  -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS\n");
+    fprintf(stderr, "                        modifies the likelihood of token appearing in the completion,\n");
+    fprintf(stderr, "                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',\n");
+    fprintf(stderr, "                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'\n");
+    fprintf(stderr, "  -c N, --ctx_size N    size of the prompt context (default: %d)\n", params.n_ctx);
+    fprintf(stderr, "  --ignore-eos          ignore end of stream token and continue generating (implies --logit-bias 2-inf)\n");
+    fprintf(stderr, "  --no-penalize-nl      do not penalize newline token\n");
+    fprintf(stderr, "  --memory_f32          use f32 instead of f16 for memory key+value\n");
+    fprintf(stderr, "  --temp N              temperature (default: %.1f)\n", (double)params.temp);
+    fprintf(stderr, "  --n_parts N           number of model parts (default: -1 = determine from dimensions)\n");
+    fprintf(stderr, "  -b N, --batch_size N  batch size for prompt processing (default: %d)\n", params.n_batch);
+    fprintf(stderr, "  --perplexity          compute perplexity over the prompt\n");
+    fprintf(stderr, "  --keep                number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);
+    if (gptneox_mlock_supported()) {
+        fprintf(stderr, "  --mlock               force system to keep model in RAM rather than swapping or compressing\n");
+    }
+    if (gptneox_mmap_supported()) {
+        fprintf(stderr, "  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
+    }
+    fprintf(stderr, "  --mtest               compute maximum memory usage\n");
+    fprintf(stderr, "  --verbose-prompt      print prompt before generation\n");
+    fprintf(stderr, "  --lora FNAME          apply LoRA adapter (implies --no-mmap)\n");
+    fprintf(stderr, "  --lora-base FNAME     optional model to use as a base for the layers modified by the LoRA adapter\n");
+    fprintf(stderr, "  -m FNAME, --model FNAME\n");
+    fprintf(stderr, "                        model path (default: %s)\n", params.model.c_str());
+    fprintf(stderr, "\n");
+}
+
+std::string gpt_random_prompt(std::mt19937 & rng) {
+    const int r = rng() % 10;
+    switch (r) {
+        case 0: return "So";
+        case 1: return "Once upon a time";
+        case 2: return "When";
+        case 3: return "The";
+        case 4: return "After";
+        case 5: return "If";
+        case 6: return "import";
+        case 7: return "He";
+        case 8: return "She";
+        case 9: return "They";
+        default: return "To";
+    }
+
+    return "The";
+}
+
+// TODO: not great allocating this every time
+std::vector<gptneox_token> gptneox_tokenize(struct gptneox_context * ctx, const std::string & text, bool add_bos) {
+    // initialize to the prompt's number of chars, since n_tokens <= n_prompt_chars
+    std::vector<gptneox_token> res(text.size() + (int)add_bos);
+    int n = gptneox_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+    assert(n >= 0);
+    res.resize(n);
+
+    return res;
+}
+
+/* Keep track of current color of output, and emit ANSI code if it changes. */
+void set_console_color(console_state & con_st, console_color_t color) {
+    if (con_st.use_color && con_st.color != color) {
+        switch(color) {
+            case CONSOLE_COLOR_DEFAULT:
+                printf(ANSI_COLOR_RESET);
+                break;
+            case CONSOLE_COLOR_PROMPT:
+                printf(ANSI_COLOR_YELLOW);
+                break;
+            case CONSOLE_COLOR_USER_INPUT:
+                printf(ANSI_BOLD ANSI_COLOR_GREEN);
+                break;
+        }
+        con_st.color = color;
+    }
+}
+
+#if defined (_WIN32)
+void win32_console_init(bool enable_color) {
+    unsigned long dwMode = 0;
+    void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
+    if (!hConOut || hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode)) {
+        hConOut = GetStdHandle((unsigned long)-12); // STD_ERROR_HANDLE (-12)
+        if (hConOut && (hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode))) {
+            hConOut = 0;
+        }
+    }
+    if (hConOut) {
+        // Enable ANSI colors on Windows 10+
+        if (enable_color && !(dwMode & 0x4)) {
+            SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
+        }
+        // Set console output codepage to UTF8
+        SetConsoleOutputCP(CP_UTF8);
+    }
+    void* hConIn = GetStdHandle((unsigned long)-10); // STD_INPUT_HANDLE (-10)
+    if (hConIn && hConIn != (void*)-1 && GetConsoleMode(hConIn, &dwMode)) {
+        // Set console input codepage to UTF16
+        _setmode(_fileno(stdin), _O_WTEXT);
+    }
+}
+
+// Convert a wide Unicode string to an UTF8 string
+void win32_utf8_encode(const std::wstring & wstr, std::string & str) {
+    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
+    std::string strTo(size_needed, 0);
+    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
+    str = strTo;
+}
+#endif
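
The `--logit-bias` option in `gpt_params_parse` accepts `TOKEN_ID(+/-)BIAS` and relies on `std::stof` plus a `try`/`catch` to reject bad input. The README.cosmo text removed by this commit lists "Remove C++ exceptions" as a local change made to the earlier llama.cpp import, so an exception-free variant may be of interest here. The following is a minimal sketch under that assumption: `logit_bias_entry` and `parse_logit_bias` are hypothetical names that do not exist in the imported sources, and the accepted grammar is close to, but not guaranteed identical to, the stream-based parser above.

```cpp
#include <cassert>
#include <cctype>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical result type; not part of common-gptneox.h.
struct logit_bias_entry {
    int   token_id;
    float bias;
};

// Parses "TOKEN_ID(+/-)BIAS", e.g. "15043+1" or "15043-0.5".
// Returns false on malformed input instead of throwing.
bool parse_logit_bias(const std::string & spec, logit_bias_entry & out) {
    const char * s = spec.c_str();
    // Require the token id to start with a digit (no whitespace or sign).
    if (!std::isdigit((unsigned char) s[0])) return false;

    char * end = nullptr;
    const long id = std::strtol(s, &end, 10);
    if (end == s || id < 0 || id > INT_MAX) return false;

    // Exactly one explicit sign character must separate id and magnitude.
    if (*end != '+' && *end != '-') return false;
    const float sign = (*end == '-') ? -1.0f : 1.0f;

    const char * num = end + 1;
    // Reject a second sign or whitespace so "15043--1" is not accepted.
    if (*num == '+' || *num == '-' || std::isspace((unsigned char) *num)) return false;

    char * end2 = nullptr;
    const float magnitude = std::strtof(num, &end2);
    if (end2 == num || *end2 != '\0') return false;

    out.token_id = (int) id;
    out.bias     = sign * magnitude;
    return true;
}
```

A caller could set `invalid_param = true` when the function returns false, which keeps the existing error path unchanged.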
--- a/third_party/radpajama/common-gptneox.h
+++ b/third_party/radpajama/common-gptneox.h
@@ -0,0 +1,108 @@
+// Various helper functions and utilities
+
+#pragma once
+
+#include "gptneox.h"
+
+#include <string>
+#include <vector>
+#include <random>
+#include <thread>
+#include <unordered_map>
+
+//
+// CLI argument parsing
+//
+
+struct gpt_params {
+    int32_t seed          = -1;   // RNG seed
+    int32_t n_threads     = std::min(4, (int32_t) std::thread::hardware_concurrency());
+    int32_t n_predict     = 128;  // new tokens to predict
+    int32_t n_parts       = -1;   // amount of model parts (-1 = determine from model dimensions)
+    int32_t n_ctx         = 512;  // context size
+    int32_t n_batch       = 512;  // batch size for prompt processing (must be >=32 to use BLAS)
+    int32_t n_keep        = 0;    // number of tokens to keep from initial prompt
+
+    // sampling parameters
+    std::unordered_map<gptneox_token, float> logit_bias; // logit bias for specific tokens
+    int32_t top_k             = 40;    // <= 0 to use vocab size
+    float   top_p             = 0.95f; // 1.0 = disabled
+    float   tfs_z             = 1.00f; // 1.0 = disabled
+    float   typical_p         = 1.00f; // 1.0 = disabled
+    float   temp              = 0.80f; // 1.0 = disabled
+    float   repeat_penalty    = 1.10f; // 1.0 = disabled
+    int32_t repeat_last_n     = 64;    // last n tokens to penalize (0 = disable penalty, -1 = context size)
+    float   frequency_penalty = 0.00f; // 0.0 = disabled
+    float   presence_penalty  = 0.00f; // 0.0 = disabled
+    int     mirostat          = 0;     // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
+    float   mirostat_tau      = 5.00f; // target entropy
+    float   mirostat_eta      = 0.10f; // learning rate
+
+    std::string model  = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat/Instruct-3B-v1-f16.bin"; // model path
+    std::string prompt = "";
+    std::string path_session = "";       // path to file for saving/loading model eval state
+    std::string input_prefix = "";       // string to prefix user inputs with
+    std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
+
+    std::string lora_adapter = "";  // lora adapter path
+    std::string lora_base = "";     // base model path for the lora adapter
+
+    bool memory_f16        = true;  // use f16 instead of f32 for memory kv
+    bool random_prompt     = false; // do not randomize prompt if none provided
+    bool use_color         = false; // use color to distinguish generations and inputs
+    bool interactive       = false; // interactive mode
+
+    bool embedding         = false; // get only sentence embedding
+    bool interactive_first = false; // wait for user input immediately
+
+    bool instruct          = false; // instruction mode
+    bool penalize_nl       = true;  // consider newlines as a repeatable token
+    bool perplexity        = false; // compute perplexity over the prompt
+    bool use_mmap          = true;  // use mmap for faster loads
+    bool use_mlock         = false; // use mlock to keep model in memory
+    bool mem_test          = false; // compute maximum memory usage
+    bool verbose_prompt    = false; // print prompt tokens before generation
+};
+
+bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
+
+void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
+
+std::string gpt_random_prompt(std::mt19937 & rng);
+
+//
+// Vocab utils
+//
+
+std::vector<gptneox_token> gptneox_tokenize(struct gptneox_context * ctx, const std::string & text, bool add_bos);
+
+//
+// Console utils
+//
+
+#define ANSI_COLOR_RED     "\x1b[31m"
+#define ANSI_COLOR_GREEN   "\x1b[32m"
+#define ANSI_COLOR_YELLOW  "\x1b[33m"
+#define ANSI_COLOR_BLUE    "\x1b[34m"
+#define ANSI_COLOR_MAGENTA "\x1b[35m"
+#define ANSI_COLOR_CYAN    "\x1b[36m"
+#define ANSI_COLOR_RESET   "\x1b[0m"
+#define ANSI_BOLD          "\x1b[1m"
+
+enum console_color_t {
+    CONSOLE_COLOR_DEFAULT=0,
+    CONSOLE_COLOR_PROMPT,
+    CONSOLE_COLOR_USER_INPUT
+};
+
+struct console_state {
+    bool use_color = false;
+    console_color_t color = CONSOLE_COLOR_DEFAULT;
+};
+
+void set_console_color(console_state & con_st, console_color_t color);
+
+#if defined (_WIN32)
+void win32_console_init(bool enable_color);
+void win32_utf8_encode(const std::wstring & wstr, std::string & str);
+#endif
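
`gpt_params` defaults `n_threads` to `std::min(4, hardware_concurrency())`, while `gpt_params_parse` overwrites it on Linux with the number of `processor` tokens found in `/proc/cpuinfo` and only falls back to `hardware_concurrency()` when the result is 0. The sketch below isolates that selection logic so it can be exercised against any input stream. The function names `count_cpuinfo_processors` and `pick_thread_count` are hypothetical and are not part of the imported code.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <thread>

// Counts whitespace-delimited tokens equal to "processor", the same
// approach gpt_params_parse applies to /proc/cpuinfo.
int count_cpuinfo_processors(std::istream & in) {
    return (int) std::count(std::istream_iterator<std::string>(in),
                            std::istream_iterator<std::string>(),
                            std::string("processor"));
}

// Picks a thread count: cpuinfo count first, then hardware_concurrency(),
// and never less than 1 (hardware_concurrency() may return 0).
int pick_thread_count(std::istream & cpuinfo) {
    int n = count_cpuinfo_processors(cpuinfo);
    if (n <= 0) {
        n = (int) std::thread::hardware_concurrency();
    }
    return std::max(1, n);
}
```

Passing a `std::ifstream` opened on `/proc/cpuinfo` would mirror the Linux branch, while an empty or unopened stream takes the fallback path.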
--- a/third_party/radpajama/copy-gptneox.cpp
+++ b/third_party/radpajama/copy-gptneox.cpp
@@ -0,0 +1,57 @@
+#include "ggml.h"
+#include "gptneox.h"
+
+#include <cstdio>
+#include <map>
+#include <string>
+
+static const std::map<std::string, enum gptneox_ftype> GPTNEOX_FTYPE_MAP = {
+  {"q4_0", GPTNEOX_FTYPE_MOSTLY_Q4_0},
+  {"q4_1", GPTNEOX_FTYPE_MOSTLY_Q4_1},
+  {"q4_2", GPTNEOX_FTYPE_MOSTLY_Q4_2},
+  //{"q4_3", GPTNEOX_FTYPE_MOSTLY_Q4_3},
+  {"q5_0", GPTNEOX_FTYPE_MOSTLY_Q5_0},
+  {"q5_1", GPTNEOX_FTYPE_MOSTLY_Q5_1},
+  {"q8_0", GPTNEOX_FTYPE_MOSTLY_Q8_0},
+};
+
+// usage:
+//  ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
+//
+int main(int argc, char ** argv) {
+    ggml_time_init();
+
+    if (argc < 4) {
+        fprintf(stderr, "usage: %s model-f32.bin model-quant.bin ftype\n", argv[0]);
+        for (auto it = GPTNEOX_FTYPE_MAP.begin(); it != GPTNEOX_FTYPE_MAP.end(); it++) {
+            fprintf(stderr, "  type = \"%s\" or %d\n", it->first.c_str(), it->second);
+        }
+        return 1;
+    }
+
+    // needed to initialize f16 tables
+    {
+        struct ggml_init_params params = { 0, NULL, false };
+        struct ggml_context * ctx = ggml_init(params);
+        ggml_free(ctx);
+    }
+
+    const std::string fname_inp = argv[1];
+    const std::string fname_out = argv[2];
+
+    enum gptneox_ftype ftype;
+    if (argv[3][0] == 'q') {
+        auto it = GPTNEOX_FTYPE_MAP.find(argv[3]);
+        if (it == GPTNEOX_FTYPE_MAP.end()) {
+            fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
+            return 1;
+        }
+        ftype = it->second;
+    } else {
+        ftype = (enum gptneox_ftype)atoi(argv[3]);
+    }
+
+    gptneox_model_copy(fname_inp.c_str(), fname_out.c_str(), ftype);
+
+    return 0;
+}
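The argument handling in main() above accepts the file type either by name ("q4_0") or as a raw integer. A minimal sketch of that lookup follows, assuming a hypothetical helper named parse_ftype that is not part of the imported sources. The numeric ids are copied from enum gptneox_ftype in gptneox.h. Unlike the atoi() call above, which silently maps garbage to 0, this sketch rejects non-numeric input.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Resolve an ftype argument to its numeric id, or -1 if it is not recognized.
// The ids mirror enum gptneox_ftype in gptneox.h; parse_ftype itself is a
// hypothetical helper, not part of the imported sources.
static int parse_ftype(const std::string & arg) {
    static const std::map<std::string, int> names = {
        {"q4_0", 2}, {"q4_1", 3}, {"q4_2", 5},
        {"q5_0", 8}, {"q5_1", 9}, {"q8_0", 7},
    };
    if (!arg.empty() && arg[0] == 'q') {
        const auto it = names.find(arg);
        return it == names.end() ? -1 : it->second;
    }
    // Numeric form: require the whole argument to be a base-10 integer.
    char * end = nullptr;
    const long value = std::strtol(arg.c_str(), &end, 10);
    if (end == arg.c_str() || *end != '\0') {
        return -1;
    }
    return static_cast<int>(value);
}
```

Returning a plain int keeps the sketch free of casts into the enum; a caller would still need to check that a numeric id names a supported file type before using it.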
--- a/third_party/radpajama/gptneox-util.h
+++ b/third_party/radpajama/gptneox-util.h
@ -0,0 +1,433 @@
+// Internal header to be included only by gptneox.cpp.
+// Contains wrappers around OS interfaces.
+
+#ifndef GPTNEOX_UTIL_H
+#define GPTNEOX_UTIL_H
+
+#include <cstdio>
+#include <cstdint>
+#include <cerrno>
+#include <cstring>
+#include <cstdarg>
+#include <cstdlib>
+#include <climits>
+
+#include <string>
+#include <vector>
+
+#ifdef __has_include
+    #if __has_include(<unistd.h>)
+        #include <unistd.h>
+        #if defined(_POSIX_MAPPED_FILES)
+            #include <sys/mman.h>
+        #endif
+        #if defined(_POSIX_MEMLOCK_RANGE)
+            #include <sys/resource.h>
+        #endif
+    #endif
+#endif
+
+#if defined(_WIN32)
+    #define WIN32_LEAN_AND_MEAN
+    #ifndef NOMINMAX
+        #define NOMINMAX
+    #endif
+    #include <windows.h>
+    #include <io.h>
+    #include <stdio.h> // for _fseeki64
+#endif
+
+#define GPTNEOX_ASSERT(x) \
+    do { \
+        if (!(x)) { \
+            fprintf(stderr, "GPTNEOX_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
+            abort(); \
+        } \
+    } while (0)
+
+#ifdef __GNUC__
+#ifdef __MINGW32__
+__attribute__((format(gnu_printf, 1, 2)))
+#else
+__attribute__((format(printf, 1, 2)))
+#endif
+#endif
+static std::string format(const char * fmt, ...) {
+    va_list ap, ap2;
+    va_start(ap, fmt);
+    va_copy(ap2, ap);
+    int size = vsnprintf(NULL, 0, fmt, ap);
+    GPTNEOX_ASSERT(size >= 0 && size < INT_MAX);
+    std::vector<char> buf(size + 1);
+    int size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
+    GPTNEOX_ASSERT(size2 == size);
+    va_end(ap2);
+    va_end(ap);
+    return std::string(buf.data(), size);
+}
+
+struct gptneox_file {
+    // use FILE * so we don't have to re-open the file to mmap
+    FILE * fp;
+    size_t size;
+
+    gptneox_file(const char * fname, const char * mode) {
+        fp = std::fopen(fname, mode);
+        if (fp == NULL) {
+            throw format("failed to open %s: %s", fname, std::strerror(errno));
+        }
+        seek(0, SEEK_END);
+        size = tell();
+        seek(0, SEEK_SET);
+    }
+
+    size_t tell() const {
+#ifdef _WIN32
+        __int64 ret = _ftelli64(fp);
+#else
+        long ret = std::ftell(fp);
+#endif
+        GPTNEOX_ASSERT(ret != -1); // this really shouldn't fail
+        return (size_t) ret;
+    }
+
+    void seek(size_t offset, int whence) {
+#ifdef _WIN32
+        int ret = _fseeki64(fp, (__int64) offset, whence);
+#else
+        int ret = std::fseek(fp, (long) offset, whence);
+#endif
+        GPTNEOX_ASSERT(ret == 0); // same
+    }
+
+    void read_raw(void * ptr, size_t size) {
+        if (size == 0) {
+            return;
+        }
+        errno = 0;
+        std::size_t ret = std::fread(ptr, size, 1, fp);
+        if (ferror(fp)) {
+            throw format("read error: %s", strerror(errno));
+        }
+        if (ret != 1) {
+            throw std::string("unexpectedly reached end of file");
+        }
+    }
+
+    std::uint32_t read_u32() {
+        std::uint32_t ret;
+        read_raw(&ret, sizeof(ret));
+        return ret;
+    }
+
+    std::string read_string(std::uint32_t len) {
+        std::vector<char> chars(len);
+        read_raw(chars.data(), len);
+        return std::string(chars.data(), len);
+    }
+
+    void write_raw(const void * ptr, size_t size) {
+        if (size == 0) {
+            return;
+        }
+        errno = 0;
+        size_t ret = std::fwrite(ptr, size, 1, fp);
+        if (ret != 1) {
+            throw format("write error: %s", strerror(errno));
+        }
+    }
+
+    void write_u32(std::uint32_t val) {
+        write_raw(&val, sizeof(val));
+    }
+
+    ~gptneox_file() {
+        if (fp) {
+            std::fclose(fp);
+        }
+    }
+};
+
+#if defined(_WIN32)
+static std::string gptneox_format_win_err(DWORD err) {
+    LPSTR buf;
+    size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
+                                 NULL, err, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&buf, 0, NULL);
+    if (!size) {
+        return "FormatMessageA failed";
+    }
+    std::string ret(buf, size);
+    LocalFree(buf);
+    return ret;
+}
+#endif
+
+struct gptneox_mmap {
+    void * addr;
+    size_t size;
+
+    gptneox_mmap(const gptneox_mmap &) = delete;
+
+#ifdef _POSIX_MAPPED_FILES
+    static constexpr bool SUPPORTED = true;
+
+    gptneox_mmap(struct gptneox_file * file, bool prefetch = true) {
+        size = file->size;
+        int fd = fileno(file->fp);
+        int flags = MAP_SHARED;
+#ifdef __linux__
+        flags |= MAP_POPULATE;
+#endif
+        addr = mmap(NULL, file->size, PROT_READ, flags, fd, 0);
+        if (addr == MAP_FAILED) {
+            throw format("mmap failed: %s", strerror(errno));
+        }
+
+        if (prefetch) {
+            // Advise the kernel to preload the mapped memory
+            if (madvise(addr, file->size, MADV_WILLNEED)) {
+                fprintf(stderr, "warning: madvise(.., MADV_WILLNEED) failed: %s\n",
+                        strerror(errno));
+            }
+        }
+    }
+
+    ~gptneox_mmap() {
+        munmap(addr, size);
+    }
+#elif defined(_WIN32)
+    static constexpr bool SUPPORTED = true;
+
+    gptneox_mmap(struct gptneox_file * file, bool prefetch = true) {
+        size = file->size;
+
+        HANDLE hFile = (HANDLE) _get_osfhandle(_fileno(file->fp));
+
+        HANDLE hMapping = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
+        DWORD error = GetLastError();
+
+        if (hMapping == NULL) {
+            throw format("CreateFileMappingA failed: %s", gptneox_format_win_err(error).c_str());
+        }
+
+        addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
+        error = GetLastError();
+        CloseHandle(hMapping);
+
+        if (addr == NULL) {
+            throw format("MapViewOfFile failed: %s", gptneox_format_win_err(error).c_str());
+        }
+
+        #if _WIN32_WINNT >= _WIN32_WINNT_WIN8
+        if (prefetch) {
+            // Advise the kernel to preload the mapped memory
+            WIN32_MEMORY_RANGE_ENTRY range;
+            range.VirtualAddress = addr;
+            range.NumberOfBytes = (SIZE_T)size;
+            if (!PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0)) {
+                fprintf(stderr, "warning: PrefetchVirtualMemory failed: %s\n",
+                        gptneox_format_win_err(GetLastError()).c_str());
+            }
+        }
+        #else
+        #pragma message("warning: You are building for pre-Windows 8; prefetch not supported")
+        #endif // _WIN32_WINNT >= _WIN32_WINNT_WIN8
+    }
+
+    ~gptneox_mmap() {
+        if (!UnmapViewOfFile(addr)) {
+            fprintf(stderr, "warning: UnmapViewOfFile failed: %s\n",
+                    gptneox_format_win_err(GetLastError()).c_str());
+        }
+    }
+#else
+    static constexpr bool SUPPORTED = false;
+
+    gptneox_mmap(struct gptneox_file *) {
+        throw std::string("mmap not supported");
+    }
+#endif
+};
+
+// Represents some region of memory being locked using mlock or VirtualLock;
+// will automatically unlock on destruction.
+struct gptneox_mlock {
+    void * addr = NULL;
+    size_t size = 0;
+    bool failed_already = false;
+
+    gptneox_mlock() {}
+    gptneox_mlock(const gptneox_mlock &) = delete;
+
+    ~gptneox_mlock() {
+        if (size) {
+            raw_unlock(addr, size);
+        }
+    }
+
+    void init(void * addr) {
+        GPTNEOX_ASSERT(this->addr == NULL && this->size == 0);
+        this->addr = addr;
+    }
+
+    void grow_to(size_t target_size) {
+        GPTNEOX_ASSERT(addr);
+        if (failed_already) {
+            return;
+        }
+        size_t granularity = lock_granularity();
+        target_size = (target_size + granularity - 1) & ~(granularity - 1);
+        if (target_size > size) {
+            if (raw_lock((uint8_t *) addr + size, target_size - size)) {
+                size = target_size;
+            } else {
+                failed_already = true;
+            }
+        }
+    }
+
+#ifdef _POSIX_MEMLOCK_RANGE
+    static constexpr bool SUPPORTED = true;
+
+    size_t lock_granularity() {
+        return (size_t) sysconf(_SC_PAGESIZE);
+    }
+
+    #ifdef __APPLE__
+        #define MLOCK_SUGGESTION \
+            "Try increasing the sysctl values 'vm.user_wire_limit' and 'vm.global_user_wire_limit' and/or " \
+            "decreasing 'vm.global_no_user_wire_amount'.  Also try increasing RLIMIT_MEMLOCK (ulimit -l).\n"
+    #else
+        #define MLOCK_SUGGESTION \
+            "Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).\n"
+    #endif
+
+    bool raw_lock(const void * addr, size_t size) {
+        if (!mlock(addr, size)) {
+            return true;
+        } else {
+            char* errmsg = std::strerror(errno);
+            bool suggest = (errno == ENOMEM);
+
+            // Check if the resource limit is fine after all
+            struct rlimit lock_limit;
+            if (suggest && getrlimit(RLIMIT_MEMLOCK, &lock_limit))
+                suggest = false;
+            if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size))
+                suggest = false;
+
+            fprintf(stderr, "warning: failed to mlock %zu-byte buffer (after previously locking %zu bytes): %s\n%s",
+                    size, this->size, errmsg, suggest ? MLOCK_SUGGESTION : "");
+            return false;
+        }
+    }
+
+    #undef MLOCK_SUGGESTION
+
+    void raw_unlock(void * addr, size_t size) {
+        if (munlock(addr, size)) {
+            fprintf(stderr, "warning: failed to munlock buffer: %s\n", std::strerror(errno));
+        }
+    }
+#elif defined(_WIN32)
+    static constexpr bool SUPPORTED = true;
+
+    size_t lock_granularity() {
+        SYSTEM_INFO si;
+        GetSystemInfo(&si);
+        return (size_t) si.dwPageSize;
+    }
+
+    bool raw_lock(void * addr, size_t size) {
+        for (int tries = 1; ; tries++) {
+            if (VirtualLock(addr, size)) {
+                return true;
+            }
+            if (tries == 2) {
+                fprintf(stderr, "warning: failed to VirtualLock %zu-byte buffer (after previously locking %zu bytes): %s\n",
+                        size, this->size, gptneox_format_win_err(GetLastError()).c_str());
+                return false;
+            }
+
+            // It failed but this was only the first try; increase the working
+            // set size and try again.
+            SIZE_T min_ws_size, max_ws_size;
+            if (!GetProcessWorkingSetSize(GetCurrentProcess(), &min_ws_size, &max_ws_size)) {
+                fprintf(stderr, "warning: GetProcessWorkingSetSize failed: %s\n",
+                        gptneox_format_win_err(GetLastError()).c_str());
+                return false;
+            }
+            // Per MSDN: "The maximum number of pages that a process can lock
+            // is equal to the number of pages in its minimum working set minus
+            // a small overhead."
+            // Hopefully a megabyte is enough overhead:
+            size_t increment = size + 1048576;
+            // The minimum must be <= the maximum, so we need to increase both:
+            min_ws_size += increment;
+            max_ws_size += increment;
+            if (!SetProcessWorkingSetSize(GetCurrentProcess(), min_ws_size, max_ws_size)) {
+                fprintf(stderr, "warning: SetProcessWorkingSetSize failed: %s\n",
+                        gptneox_format_win_err(GetLastError()).c_str());
+                return false;
+            }
+        }
+    }
+
+    void raw_unlock(void * addr, size_t size) {
+        if (!VirtualUnlock(addr, size)) {
+            fprintf(stderr, "warning: failed to VirtualUnlock buffer: %s\n",
+                    gptneox_format_win_err(GetLastError()).c_str());
+        }
+    }
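+#else
+    static constexpr bool SUPPORTED = false;
+
+    // grow_to() calls lock_granularity() and tests the result of raw_lock(),
+    // so the fallback has to provide both.
+    size_t lock_granularity() {
+        return (size_t) 65536;
+    }
+
+    bool raw_lock(const void * addr, size_t size) {
+        fprintf(stderr, "warning: mlock not supported on this system\n");
+        return false;
+    }
+
+    void raw_unlock(const void * addr, size_t size) {}
+#endif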
+};
+
+// Replacement for std::vector<uint8_t> that doesn't require zero-initialization.
+struct gptneox_buffer {
+    uint8_t * addr = NULL;
+    size_t size = 0;
+
+    void resize(size_t size) {
+        delete[] addr;
+        addr = new uint8_t[size];
+        this->size = size;
+    }
+
+    ~gptneox_buffer() {
+        delete[] addr;
+    }
+};
+
+#ifdef GGML_USE_CUBLAS
+#include "ggml-cuda.h"
+struct gptneox_ctx_buffer {
+    uint8_t * addr = NULL;
+    size_t size = 0;
+
+    void resize(size_t size) {
+        if (addr) {
+            ggml_cuda_host_free(addr);
+        }
+        addr = (uint8_t *) ggml_cuda_host_malloc(size);
+        this->size = size;
+    }
+
+    ~gptneox_ctx_buffer() {
+        if (addr) {
+            ggml_cuda_host_free(addr);
+        }
+    }
+};
+#else
+typedef gptneox_buffer gptneox_ctx_buffer;
+#endif
+
+#endif
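gptneox_mlock::grow_to() above rounds the requested size up to the lock granularity with a mask before calling raw_lock(). A minimal sketch of that rounding step, assuming a hypothetical free function round_up_pow2 (the header inlines the expression instead). The mask trick is only valid when the granularity is a nonzero power of two, which holds for typical page sizes but is an assumption the header does not check.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of granularity.
// Assumes granularity is a nonzero power of two; the assert makes that explicit.
static std::size_t round_up_pow2(std::size_t n, std::size_t granularity) {
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    return (n + granularity - 1) & ~(granularity - 1);
}
```

For a granularity that is not a power of two, a division-based form such as `((n + g - 1) / g) * g` would be needed instead.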
--- a/third_party/radpajama/gptneox.cpp
+++ b/third_party/radpajama/gptneox.cpp
--- a/third_party/radpajama/gptneox.h
+++ b/third_party/radpajama/gptneox.h
@ -0,0 +1,275 @@
+#ifndef GPTNEOX_H
+#define GPTNEOX_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#ifdef GPTNEOX_SHARED
+#    if defined(_WIN32) && !defined(__MINGW32__)
+#        ifdef GPTNEOX_BUILD
+#            define GPTNEOX_API __declspec(dllexport)
+#        else
+#            define GPTNEOX_API __declspec(dllimport)
+#        endif
+#    else
+#        define GPTNEOX_API __attribute__ ((visibility ("default")))
+#    endif
+#else
+#    define GPTNEOX_API
+#endif
+
+#define GPTNEOX_FILE_VERSION 1
+#define GPTNEOX_FILE_MAGIC 0x67676a74 // 'ggjt' in hex
+#define GPTNEOX_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+    //
+    // C interface
+    //
+    // TODO: show sample usage
+    //
+
+    struct gptneox_context;
+
+    typedef int gptneox_token;
+
+    typedef struct gptneox_token_data {
+        gptneox_token id;  // token id
+        float logit; // log-odds of the token
+        float p;     // probability of the token
+    } gptneox_token_data;
+
+    typedef struct gptneox_token_data_array {
+        gptneox_token_data * data;
+        size_t size;
+        bool sorted;
+    } gptneox_token_data_array;
+
+    typedef void (*gptneox_progress_callback)(float progress, void *ctx);
+
+    struct gptneox_context_params {
+        int n_ctx;   // text context
+        int n_parts; // -1 for default
+        int seed;    // RNG seed, 0 for random
+
+        bool f16_kv;     // use fp16 for KV cache
+        bool logits_all; // the gptneox_eval() call computes all logits, not just the last one
+        bool vocab_only; // only load the vocabulary, no weights
+        bool use_mmap;   // use mmap if possible
+        bool use_mlock;  // force system to keep model in RAM
+        bool embedding;  // embedding mode only
+
+        // called with a progress value between 0 and 1, pass NULL to disable
+        gptneox_progress_callback progress_callback;
+        // context pointer passed to the progress callback
+        void * progress_callback_user_data;
+    };
+
+    // model file types
+    enum gptneox_ftype {
+        GPTNEOX_FTYPE_ALL_F32     = 0,
+        GPTNEOX_FTYPE_MOSTLY_F16  = 1,  // except 1d tensors
+        GPTNEOX_FTYPE_MOSTLY_Q4_0 = 2,  // except 1d tensors
+        GPTNEOX_FTYPE_MOSTLY_Q4_1 = 3,  // except 1d tensors
+        GPTNEOX_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
+        GPTNEOX_FTYPE_MOSTLY_Q4_2 = 5,  // except 1d tensors
+        // GPTNEOX_FTYPE_MOSTLY_Q4_3 (6) support has been removed
+        GPTNEOX_FTYPE_MOSTLY_Q8_0 = 7,  // except 1d tensors
+        GPTNEOX_FTYPE_MOSTLY_Q5_0 = 8,  // except 1d tensors
+        GPTNEOX_FTYPE_MOSTLY_Q5_1 = 9,  // except 1d tensors
+    };
+
+    GPTNEOX_API struct gptneox_context_params gptneox_context_default_params();
+
+    GPTNEOX_API bool gptneox_mmap_supported();
+    GPTNEOX_API bool gptneox_mlock_supported();
+
+    // Various functions for loading a ggml GPT-NeoX model.
+    // Allocate (almost) all memory needed for the model.
+    // Return NULL on failure
+    GPTNEOX_API struct gptneox_context * gptneox_init_from_file(
+                             const char * path_model,
+            struct gptneox_context_params   params);
+
+    // Frees all allocated memory
+    GPTNEOX_API void gptneox_free(struct gptneox_context * ctx);
+
+    // TODO: not great API - very likely to change
+    // Returns 0 on success
+    // nthread - how many threads to use. If <=0, will use std::thread::hardware_concurrency(), else the number given
+    GPTNEOX_API int gptneox_model_quantize(
+            const char * fname_inp,
+            const char * fname_out,
+      enum gptneox_ftype   ftype,
+            int          nthread);
+
+    GPTNEOX_API int gptneox_model_copy(
+            const char * fname_inp,
+            const char * fname_out,
+            enum gptneox_ftype   ftype);
+
+    // Apply a LoRA adapter to a loaded model
+    // path_base_model is the path to a higher quality model to use as a base for
+    // the layers modified by the adapter. Can be NULL to use the current loaded model.
+    // The model needs to be reloaded before applying a new adapter, otherwise the adapter
+    // will be applied on top of the previous one
+    // Returns 0 on success
+    GPTNEOX_API int gptneox_apply_lora_from_file(
+            struct gptneox_context * ctx,
+                      const char * path_lora,
+                      const char * path_base_model,
+                             int   n_threads);
+
+    // Returns the number of tokens in the KV cache
+    GPTNEOX_API int gptneox_get_kv_cache_token_count(struct gptneox_context * ctx);
+
+    // Sets the current rng seed.
+    GPTNEOX_API void gptneox_set_rng_seed(struct gptneox_context * ctx, int seed);
+
+    // Returns the size in bytes of the state (rng, logits, embedding and kv_cache)
+    GPTNEOX_API size_t gptneox_get_state_size(struct gptneox_context * ctx);
+
+    // Copies the state to the specified destination address.
+    // Destination needs to have allocated enough memory.
+    // Returns the number of bytes copied
+    GPTNEOX_API size_t gptneox_copy_state_data(struct gptneox_context * ctx, uint8_t * dest);
+
+    // Set the state reading from the specified address
+    // Returns the number of bytes read
+    GPTNEOX_API size_t gptneox_set_state_data(struct gptneox_context * ctx, const uint8_t * src);
+
+    // Save/load session file
+    GPTNEOX_API size_t gptneox_load_session_file(struct gptneox_context * ctx, const char * path_session, gptneox_token * tokens_out, size_t n_token_capacity, size_t * n_token_count_out);
+    GPTNEOX_API size_t gptneox_save_session_file(struct gptneox_context * ctx, const char * path_session, const gptneox_token * tokens, size_t n_token_count);
+
+    // Run GPT-NeoX inference to obtain the logits and probabilities for the next token.
+    // tokens + n_tokens is the provided batch of new tokens to process
+    // n_past is the number of tokens to use from previous eval calls
+    // Returns 0 on success
+    GPTNEOX_API int gptneox_eval(
+            struct gptneox_context * ctx,
+               const gptneox_token * tokens,
+                             int   n_tokens,
+                             int   n_past,
+                             int   n_threads);
+
+    // Convert the provided text into tokens.
+    // The tokens pointer must be large enough to hold the resulting tokens.
+    // Returns the number of tokens on success, no more than n_max_tokens
+    // Returns a negative number on failure - the number of tokens that would have been returned
+    // TODO: not sure if correct
+    GPTNEOX_API int gptneox_tokenize(
+            struct gptneox_context * ctx,
+                      const char * text,
+                     gptneox_token * tokens,
+                             int   n_max_tokens,
+                            bool   add_bos);
+
+    GPTNEOX_API int gptneox_n_vocab(struct gptneox_context * ctx);
+    GPTNEOX_API int gptneox_n_ctx  (struct gptneox_context * ctx);
+    GPTNEOX_API int gptneox_n_embd (struct gptneox_context * ctx);
+
+    // Token logits obtained from the last call to gptneox_eval()
+    // The logits for the last token are stored in the last row
+    // Can be mutated in order to change the probabilities of the next token
+    // Rows: n_tokens
+    // Cols: n_vocab
+    GPTNEOX_API float * gptneox_get_logits(struct gptneox_context * ctx);
+
+    // Get the embeddings for the input
+    // shape: [n_embd] (1-dimensional)
+    GPTNEOX_API float * gptneox_get_embeddings(struct gptneox_context * ctx);
+
+    // Token Id -> String. Uses the vocabulary in the provided context
+    GPTNEOX_API const char * gptneox_token_to_str(struct gptneox_context * ctx, gptneox_token token);
+
+    // String -> Token Id. Uses the vocabulary in the provided context
+    GPTNEOX_API gptneox_token gptneox_str_to_token(struct gptneox_context * ctx, const char * str);
+
+    // Special tokens
+    GPTNEOX_API gptneox_token gptneox_token_bos();
+    GPTNEOX_API gptneox_token gptneox_token_eos();
+    // GPTNEOX_API gptneox_token gptneox_token_nl();
+
+    // TODO: improve the last_n_tokens interface ?
+    GPTNEOX_API gptneox_token gptneox_sample_top_p_top_k(
+       struct gptneox_context * ctx,
+          const gptneox_token * last_n_tokens_data,
+                        int   last_n_tokens_size,
+                        int   top_k,
+                      float   top_p,
+                      float   temp,
+                      float   repeat_penalty);
+
+    // Sampling functions
+
+    /// @details Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
+    GPTNEOX_API void gptneox_sample_repetition_penalty(struct gptneox_context * ctx, gptneox_token_data_array * candidates, gptneox_token * last_tokens, size_t last_tokens_size, float penalty);
+
+    /// @details Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
+    GPTNEOX_API void gptneox_sample_frequency_and_presence_penalties(struct gptneox_context * ctx, gptneox_token_data_array * candidates, gptneox_token * last_tokens, size_t last_tokens_size, float alpha_frequency, float alpha_presence);
+
+    /// @details Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits.
+    GPTNEOX_API void gptneox_sample_softmax(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
+
+    /// @details Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
+    GPTNEOX_API void gptneox_sample_top_k(struct gptneox_context * ctx, gptneox_token_data_array * candidates, int k, size_t min_keep);
+
+    /// @details Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
+    GPTNEOX_API void gptneox_sample_top_p(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float p, size_t min_keep);
+
+    /// @details Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
+    GPTNEOX_API void gptneox_sample_tail_free(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float z, size_t min_keep);
+
+    /// @details Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
+    GPTNEOX_API void gptneox_sample_typical(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float p, size_t min_keep);
+    GPTNEOX_API void gptneox_sample_temperature(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float temp);
+
+    /// @details Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
+    /// @param candidates A vector of `gptneox_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
+    /// @param tau  The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
+    /// @param eta The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
+    /// @param m The number of tokens considered in the estimation of `s_hat`. This is an arbitrary value that is used to calculate `s_hat`, which in turn helps to calculate the value of `k`. In the paper, they use `m = 100`, but you can experiment with different values to see how it affects the performance of the algorithm.
+    /// @param mu Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (`2 * tau`) and is updated in the algorithm based on the error between the target and observed surprisal.
+    GPTNEOX_API gptneox_token gptneox_sample_token_mirostat(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float tau, float eta, int m, float * mu);
+
+    /// @details Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
+    /// @param candidates A vector of `gptneox_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
+    /// @param tau  The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
+    /// @param eta The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
+    /// @param mu Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (`2 * tau`) and is updated in the algorithm based on the error between the target and observed surprisal.
+    GPTNEOX_API gptneox_token gptneox_sample_token_mirostat_v2(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float tau, float eta, float * mu);
+
+    /// @details Selects the token with the highest probability.
+    GPTNEOX_API gptneox_token gptneox_sample_token_greedy(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
+
+    /// @details Randomly selects a token from the candidates based on their probabilities.
+    GPTNEOX_API gptneox_token gptneox_sample_token(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
+
+    // Performance information
+    GPTNEOX_API void gptneox_print_timings(struct gptneox_context * ctx);
+    GPTNEOX_API void gptneox_reset_timings(struct gptneox_context * ctx);
+
+    // Print system information
+    GPTNEOX_API const char * gptneox_print_system_info(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+// Internal API to be implemented by gptneox.cpp and used by tests/benchmarks only
+#ifdef GPTNEOX_API_INTERNAL
+
+#include <vector>
+#include <string>
+struct ggml_tensor;
+
+std::vector<std::pair<std::string, struct ggml_tensor *>>& gptneox_internal_get_tensor_map(struct gptneox_context * ctx);
+
+#endif
+
+#endif // GPTNEOX_H
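The header above declares gptneox_sample_softmax() and gptneox_sample_top_p(), but gptneox.cpp is not shown in this diff. The following is a self-contained sketch of what softmax followed by nucleus (top-p) truncation typically looks like. It is not the library's implementation: token_data is a hypothetical mirror of gptneox_token_data, and softmax_sorted and top_p_filter are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical mirror of gptneox_token_data, for illustration only.
struct token_data {
    int   id;
    float logit;
    float p;
};

// Sort candidates by logit (descending) and fill in p with a numerically
// stable softmax (the maximum logit is subtracted before exponentiation).
static void softmax_sorted(std::vector<token_data> & cand) {
    if (cand.empty()) {
        return;
    }
    std::sort(cand.begin(), cand.end(),
              [](const token_data & a, const token_data & b) { return a.logit > b.logit; });
    const float max_logit = cand.front().logit;
    float sum = 0.0f;
    for (auto & c : cand) {
        c.p = std::exp(c.logit - max_logit);
        sum += c.p;
    }
    for (auto & c : cand) {
        c.p /= sum;
    }
}

// Keep the smallest prefix whose cumulative probability reaches top_p,
// but never fewer than min_keep candidates (or all of them if there are fewer).
static void top_p_filter(std::vector<token_data> & cand, float top_p, std::size_t min_keep) {
    softmax_sorted(cand);
    float cumulative = 0.0f;
    std::size_t keep = cand.size();
    for (std::size_t i = 0; i < cand.size(); ++i) {
        cumulative += cand[i].p;
        if (cumulative >= top_p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    cand.resize(keep);
}
```

The surviving probabilities are left unnormalized here; a sampler drawing from the truncated list would need to renormalize them or run softmax again first.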
--- a/third_party/radpajama/main-redpajama-chat.cpp
+++ b/third_party/radpajama/main-redpajama-chat.cpp
@ -0,0 +1,467 @@
+// Defines sigaction on msys:
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include "common-gptneox.h"
+#include "gptneox.h"
+
+#include <cassert>
+#include <cinttypes>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <ctime>
+#include <fstream>
+#include <iostream>
+#include <string>
+#include <vector>
+#include <algorithm>
+
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
+#include <signal.h>
+#include <unistd.h>
+#elif defined (_WIN32)
+#include <signal.h>
+#endif
+
+static console_state con_st;
+static gptneox_context ** g_ctx;
+
+static bool is_interacting = false;
+
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
+void sigint_handler(int signo) {
+    set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+    printf("\n"); // this also forces stdout to flush
+    if (signo == SIGINT) {
+        if (!is_interacting) {
+            is_interacting=true;
+        } else {
+            gptneox_print_timings(*g_ctx);
+            _exit(130);
+        }
+    }
+}
+#endif
+
+int main(int argc, char ** argv) {
+    gpt_params params;
+    params.model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin";
+    
+    if (gpt_params_parse(argc, argv, params) == false) {
+        return 1;
+    }
+    
+    // save choice to use color for later
+    // (note for later: this is a slightly awkward choice)
+    con_st.use_color = params.use_color;
+    
+#if defined (_WIN32)
+    win32_console_init(params.use_color);
+#endif
+    
+    if (params.perplexity) {
+        printf("\n************\n");
+        printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
+        printf("************\n\n");
+        
+        return 0;
+    }
+    
+    if (params.embedding) {
+        printf("\n************\n");
+        printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
+        printf("************\n\n");
+        
+        return 0;
+    }
+    
+    if (params.n_ctx > 2048) {
+        fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified); "
+                "expect poor results\n", __func__, params.n_ctx);
+    }
+    
+    if (params.seed <= 0) {
+        params.seed = time(NULL);
+    }
+    
+    fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
+    
+    std::mt19937 rng(params.seed);
+    if (params.random_prompt) {
+        params.prompt = gpt_random_prompt(rng);
+    }
+    
+    gptneox_context * ctx;
+    g_ctx = &ctx;
+    
+    // load the model
+    {
+        auto lparams = gptneox_context_default_params();
+        
+        lparams.n_ctx      = params.n_ctx;
+        lparams.n_parts    = params.n_parts;
+        lparams.seed       = params.seed;
+        lparams.f16_kv     = params.memory_f16;
+        lparams.use_mmap   = params.use_mmap;
+        lparams.use_mlock  = params.use_mlock;
+        
+        ctx = gptneox_init_from_file(params.model.c_str(), lparams);
+        
+        if (ctx == NULL) {
+            fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
+            return 1;
+        }
+    }
+    
+    if (!params.lora_adapter.empty()) {
+        int err = gptneox_apply_lora_from_file(ctx,
+                                               params.lora_adapter.c_str(),
+                                               params.lora_base.empty() ? NULL : params.lora_base.c_str(),
+                                               params.n_threads);
+        if (err != 0) {
+            fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
+            return 1;
+        }
+    }
+    
+    // print system information
+    {
+        fprintf(stderr, "\n");
+        fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
+                params.n_threads, std::thread::hardware_concurrency(), gptneox_print_system_info());
+    }
+    
+    // determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters
+    if (params.mem_test) {
+        {
+            const std::vector<gptneox_token> tmp(params.n_batch, 0);
+            gptneox_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
+        }
+        
+        {
+            const std::vector<gptneox_token> tmp = { 0, };
+            gptneox_eval(ctx, tmp.data(), tmp.size(), params.n_predict - 1, params.n_threads);
+        }
+        
+        gptneox_print_timings(ctx);
+        gptneox_free(ctx);
+        
+        return 0;
+    }
+
+    // Always interactive for RedPajama chat model
+    params.interactive = true;
+    
+    if (params.interactive) {
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
+        struct sigaction sigint_action;
+        sigint_action.sa_handler = sigint_handler;
+        sigemptyset (&sigint_action.sa_mask);
+        sigint_action.sa_flags = 0;
+        sigaction(SIGINT, &sigint_action, NULL);
+#elif defined (_WIN32)
+        signal(SIGINT, sigint_handler);
+#endif
+    }
+    fprintf(stderr, "sampling: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n",
+        params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
+    fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", params.n_ctx, params.n_batch, params.n_predict, params.n_keep);
+    fprintf(stderr, "\n\n");
+    
+    // TODO: replace with ring-buffer
+    std::vector<gptneox_token> last_n_tokens = std::vector<gptneox_token>();
+    //std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
+    
+    set_console_color(con_st, CONSOLE_COLOR_PROMPT);
+    
+    if (params.interactive) {
+        printf("== Running in interactive mode. ==\n"
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
+               " - Press Ctrl+C to interject at any time.\n"
+#endif
+               " - Press Return to return control to RedPajama.\n"
+               " - If you want to submit another line, end your input in '\\'.\n\n");
+    }
+    
+    const int32_t top_k          = params.top_k;
+    const float   top_p          = params.top_p;
+    const float   temp           = params.temp;
+    const float   repeat_penalty = params.repeat_penalty;
+    
+    // Chat loop
+    while (true) {
+        is_interacting = true;
+        
+        int n_past = 0;
+        
+        // Get input
+        
+        // potentially set color to indicate we are taking user input
+        set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
+        
+#if defined (_WIN32)
+        // Windows: must reactivate sigint handler after each signal
+        signal(SIGINT, sigint_handler);
+#endif
+
+        if (params.instruct) {
+            printf("\n<human>: ");
+        }
+
+        std::string buffer;
+        if (!params.input_prefix.empty()) {
+            buffer += params.input_prefix;
+            printf("%s", buffer.c_str());
+        }
+
+        std::string line;
+        bool another_line = true;
+        do {
+#if defined(_WIN32)
+            std::wstring wline;
+            if (!std::getline(std::wcin, wline)) {
+                // input stream is bad or EOF received
+                return 0;
+            }
+            win32_utf8_encode(wline, line);
+#else
+            if (!std::getline(std::cin, line)) {
+                // input stream is bad or EOF received
+                return 0;
+            }
+#endif
+            if (line.empty() || line.back() != '\\') {
+                another_line = false;
+            } else {
+                line.pop_back(); // Remove the continue character
+            }
+            buffer += line;
+            if (another_line) {
+                buffer += '\n';
+            }
+        } while (another_line);
+        
+        is_interacting = false;
+        
+        // done taking input, reset color
+        set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+        
+        // Check for input
+        if (buffer.length() <= 0) {
+            continue; // Restart loop for input
+        }
+        
+        // Tokenize prompt with RedPajama special tokens
+
+        auto prompt_embd = ::gptneox_tokenize(ctx, buffer, false);
+        auto embd_inp = std::vector<gptneox_token>();
+
+        // RedPajama: insert the "<human>:" turn marker before the user input (prefix)
+        embd_inp.push_back(gptneox_str_to_token(ctx, "<"));
+        embd_inp.push_back(gptneox_str_to_token(ctx, "human"));
+        embd_inp.push_back(gptneox_str_to_token(ctx, ">:"));
+        
+        embd_inp.insert(embd_inp.end(), prompt_embd.begin(), prompt_embd.end());
+
+        // RedPajama: append a newline and the "<bot>:" turn marker after the user input (postfix)
+        embd_inp.push_back(gptneox_str_to_token(ctx, "\n"));
+        embd_inp.push_back(gptneox_str_to_token(ctx, "<"));
+        embd_inp.push_back(gptneox_str_to_token(ctx, "bot"));
+        embd_inp.push_back(gptneox_str_to_token(ctx, ">:"));
+       
+        
+        // Verbose prompt
+        if (params.verbose_prompt) {
+            fprintf(stderr, "\n");
+            fprintf(stderr, "%s: prompt: '%s'\n", __func__, buffer.c_str());
+            fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
+            for (int i = 0; i < (int) embd_inp.size(); i++) {
+                fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], gptneox_token_to_str(ctx, embd_inp[i]));
+            }
+            fprintf(stderr, "\n");
+        }
+        
+        // How many tokens to generate: check if there's space in context for at least one token (or batch size tokens?)
+        // Use signed ints so that an over-long input yields space <= 0 instead of wrapping around
+        const int inp_size = (int) embd_inp.size();
+        int space = params.n_ctx - inp_size;
+        if (space <= 0) {
+            fprintf(stderr, "%s : input too long\n", __func__);
+            continue;
+        }
+        // Send batches to eval
+        while (n_past < inp_size) {
+            auto remaining = inp_size - n_past;
+            int n_eval = params.n_batch < remaining ? params.n_batch : remaining;
+            if (gptneox_eval(ctx, &embd_inp[n_past], n_eval, n_past, params.n_threads)) {
+                fprintf(stderr, "<bot>: %s : failed to eval\n", __func__);
+                return 1;
+            }
+            n_past += n_eval;
+        }
+        
+        const int n_ctx = gptneox_n_ctx(ctx);
+        const int n_vocab = gptneox_n_vocab(ctx);
+        
+        const float   temp            = params.temp;
+        const int32_t top_k           = params.top_k <= 0 ? gptneox_n_vocab(ctx) : params.top_k;
+        const float   top_p           = params.top_p;
+        const float   tfs_z           = params.tfs_z;
+        const float   typical_p       = params.typical_p;
+        const int32_t repeat_last_n   = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
+        const float   repeat_penalty  = params.repeat_penalty;
+        const float   alpha_presence  = params.presence_penalty;
+        const float   alpha_frequency = params.frequency_penalty;
+        const int     mirostat        = params.mirostat;
+        const float   mirostat_tau    = params.mirostat_tau;
+        const float   mirostat_eta    = params.mirostat_eta;
+        const bool    penalize_nl     = params.penalize_nl;
+        
+        // Eval until space runs out
+        auto out_count = 0;
+        
+        printf("<bot>:");
+        while (space > 0) {
+            // Get token
+            gptneox_token id = 0;
+            
+            {
+                auto logits = gptneox_get_logits(ctx);
+                
+                // Apply params.logit_bias map
+                for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
+                    logits[it->first] += it->second;
+                }
+
+                std::vector<gptneox_token_data> candidates;
+                candidates.reserve(n_vocab);
+                for (gptneox_token token_id = 0; token_id < n_vocab; token_id++) {
+                    candidates.emplace_back(gptneox_token_data{token_id, logits[token_id], 0.0f});
+                }
+
+                gptneox_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
+
+                // Apply penalties
+                gptneox_token nl_token = gptneox_str_to_token(ctx, "\n");
+                float nl_logit = logits[nl_token];
+                auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
+                gptneox_sample_repetition_penalty(ctx, &candidates_p,
+                    last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+                    last_n_repeat, repeat_penalty);
+                gptneox_sample_frequency_and_presence_penalties(ctx, &candidates_p,
+                    last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+                    last_n_repeat, alpha_frequency, alpha_presence);
+                if (!penalize_nl) {
+                    logits[nl_token] = nl_logit;
+                }
+
+                if (temp <= 0) {
+                    // Greedy sampling
+                    id = gptneox_sample_token_greedy(ctx, &candidates_p);
+                } else {
+                    if (mirostat == 1) {
+                        static float mirostat_mu = 2.0f * mirostat_tau;
+                        const int mirostat_m = 100;
+                        gptneox_sample_temperature(ctx, &candidates_p, temp);
+                        id = gptneox_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
+                    } else if (mirostat == 2) {
+                        static float mirostat_mu = 2.0f * mirostat_tau;
+                        gptneox_sample_temperature(ctx, &candidates_p, temp);
+                        id = gptneox_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
+                    } else {
+                        // Temperature sampling
+                        gptneox_sample_top_k(ctx, &candidates_p, top_k, 1);
+                        gptneox_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
+                        gptneox_sample_typical(ctx, &candidates_p, typical_p, 1);
+                        gptneox_sample_top_p(ctx, &candidates_p, top_p, 1);
+                        gptneox_sample_temperature(ctx, &candidates_p, temp);
+                        id = gptneox_sample_token(ctx, &candidates_p);
+                    }
+                }
+            }
+            
+            // Inc out count and dec space
+            out_count += 1;
+            space -= 1;
+            // Repeat tokens update
+            last_n_tokens.push_back(id);
+            if ((int) last_n_tokens.size() > repeat_last_n) { // signed compare; a negative setting was mapped to n_ctx above
+                last_n_tokens.erase(last_n_tokens.begin());
+            }
+            // RedPajama: the bot's turn is done once the three-token window spells "<human>:"
+            //std::cout<<" last_n_tokens.size: "<< last_n_tokens[0] <<" "<< last_n_tokens[1] <<" "<< last_n_tokens[2] << std::endl;
+            if (last_n_tokens.size()==3 && last_n_tokens[0]==gptneox_str_to_token(ctx, "<") 
+            && last_n_tokens[1]==gptneox_str_to_token(ctx, "human") && last_n_tokens[2]==gptneox_str_to_token(ctx, ">:")){
+                space = 0;
+                continue;
+            }
+
+            // Check for eos - end early - check eos before bos in case they are the same
+            if (id == gptneox_token_eos()) {
+                space = 0;
+                continue;
+            }
+            // Check for bos - skip callback if so
+            if (id == gptneox_token_bos()) {
+                continue;
+            }
+            // Convert token to string and display
+            // printf("%s(%d)", gptneox_token_to_str(ctx, id), id);
+            
+            
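+            // Hold back "<" and "human" while they may be the start of a "<human>:" marker.
+            // The fixed indices assume a three-token window (see the size()==3 check above);
+            // until the window is full, print the token directly rather than read out of bounds.
+            if (last_n_tokens.size() < 3) {
+                printf("%s", gptneox_token_to_str(ctx, id));
+            }
+            else if (last_n_tokens[2]==gptneox_str_to_token(ctx, "<")){
+                ;
+            }
+            else if (last_n_tokens[2]==gptneox_str_to_token(ctx, "human")){
+                if (last_n_tokens[1]==gptneox_str_to_token(ctx, "<")){
+                    ;
+                }
+                else{
+                    printf("%s", gptneox_token_to_str(ctx, id));
+                }
+            }
+            else if (last_n_tokens[1]==gptneox_str_to_token(ctx, "<")){
+                printf("<");
+                printf("%s", gptneox_token_to_str(ctx, id));
+            }
+            else{
+                printf("%s", gptneox_token_to_str(ctx, id));
+            }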
+            fflush(stdout);
+            // Check if we need to run another eval
+            if (space > 0) {
+                // Send generated token back into model for next generation
+                if (gptneox_eval(ctx, &id, 1, n_past, params.n_threads)) {
+                    fprintf(stderr, "%s : failed to eval\n", __func__);
+                    return 1;
+                }
+                // Increment past count
+                n_past += 1;
+            }
+            // Check for user interrupt
+            if (is_interacting) { space = 0; }
+        }
+        printf("\n"); 
+        //printf("\n %d", space);
+        fflush(stdout);
+    }
+    
+#if defined (_WIN32)
+    signal(SIGINT, SIG_DFL);
+#endif
+
+    gptneox_print_timings(ctx);
+    gptneox_free(ctx);
+
+    set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+
+    return 0;
+}
+    
+    
+    
+    
+    
+
+
--- a/third_party/radpajama/main-redpajama.cpp
+++ b/third_party/radpajama/main-redpajama.cpp
@ -0,0 +1,622 @@
+// Defines sigaction on msys:
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include "common-gptneox.h"
+#include "gptneox.h"
+
+#include <cassert>
+#include <cinttypes>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <ctime>
+#include <fstream>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
+#include <signal.h>
+#include <unistd.h>
+#elif defined (_WIN32)
+#include <signal.h>
+#endif
+
+static console_state con_st;
+static gptneox_context ** g_ctx;
+
+static bool is_interacting = false;
+
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
+void sigint_handler(int signo) {
+    set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+    printf("\n"); // this also forces a flush of stdout.
+    if (signo == SIGINT) {
+        if (!is_interacting) {
+            is_interacting=true;
+        } else {
+            gptneox_print_timings(*g_ctx);
+            _exit(130);
+        }
+    }
+}
+#endif
+
+int main(int argc, char ** argv) {
+    gpt_params params;
+    params.model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-f16.bin";
+
+    if (gpt_params_parse(argc, argv, params) == false) {
+        return 1;
+    }
+
+    // save choice to use color for later
+    // (note for later: this is a slightly awkward choice)
+    con_st.use_color = params.use_color;
+
+#if defined (_WIN32)
+    win32_console_init(params.use_color);
+#endif
+
+    if (params.perplexity) {
+        printf("\n************\n");
+        printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
+        printf("************\n\n");
+
+        return 0;
+    }
+
+    if (params.embedding) {
+        printf("\n************\n");
+        printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
+        printf("************\n\n");
+
+        return 0;
+    }
+
+    if (params.n_ctx > 2048) {
+        fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified); "
+                "expect poor results\n", __func__, params.n_ctx);
+    }
+
+    if (params.seed < 0) {
+        params.seed = time(NULL);
+    }
+
+    fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
+
+    std::mt19937 rng(params.seed);
+    if (params.random_prompt) {
+        params.prompt = gpt_random_prompt(rng);
+    }
+
+//    params.prompt = R"(// this function checks if the number n is prime
+//bool is_prime(int n) {)";
+
+    gptneox_context * ctx;
+    g_ctx = &ctx;
+
+    // load the model
+    {
+        auto lparams = gptneox_context_default_params();
+
+        lparams.n_ctx      = params.n_ctx;
+        lparams.n_parts    = params.n_parts;
+        lparams.seed       = params.seed;
+        lparams.f16_kv     = params.memory_f16;
+        lparams.use_mmap   = params.use_mmap;
+        lparams.use_mlock  = params.use_mlock;
+
+        ctx = gptneox_init_from_file(params.model.c_str(), lparams);
+
+        if (ctx == NULL) {
+            fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
+            return 1;
+        }
+    }
+
+    if (!params.lora_adapter.empty()) {
+        int err = gptneox_apply_lora_from_file(ctx,
+                                             params.lora_adapter.c_str(),
+                                             params.lora_base.empty() ? NULL : params.lora_base.c_str(),
+                                             params.n_threads);
+        if (err != 0) {
+            fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
+            return 1;
+        }
+    }
+
+    // print system information
+    {
+        fprintf(stderr, "\n");
+        fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
+                params.n_threads, std::thread::hardware_concurrency(), gptneox_print_system_info());
+    }
+
+    // determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters
+    // uncomment the "used_mem" line in llama.cpp to see the results
+    if (params.mem_test) {
+        {
+            const std::vector<gptneox_token> tmp(params.n_batch, 0);
+            gptneox_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
+        }
+
+        {
+            const std::vector<gptneox_token> tmp = { 0, };
+            gptneox_eval(ctx, tmp.data(), tmp.size(), params.n_predict - 1, params.n_threads);
+        }
+
+        gptneox_print_timings(ctx);
+        gptneox_free(ctx);
+
+        return 0;
+    }
+    
+    std::string path_session = params.path_session;
+    std::vector<gptneox_token> session_tokens;
+
+    if (!path_session.empty()) {
+        fprintf(stderr, "%s: attempting to load saved session from %s..\n", __func__, path_session.c_str());
+
+        // REVIEW - fopen to check for existing session
+        FILE * fp = std::fopen(path_session.c_str(), "rb");
+        if (fp != NULL) {
+            std::fclose(fp);
+
+            session_tokens.resize(params.n_ctx);
+            size_t n_token_count_out = 0;
+            const size_t n_session_bytes = gptneox_load_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size(), &n_token_count_out);
+            session_tokens.resize(n_token_count_out);
+
+            if (n_session_bytes > 0) {
+                fprintf(stderr, "%s: loaded %zu bytes of session data!\n", __func__, n_session_bytes);
+            } else {
+                fprintf(stderr, "%s: could not load session file, will recreate\n", __func__);
+            }
+        } else {
+            fprintf(stderr, "%s: session file does not exist, will create\n", __func__);
+        }
+    }
+
+    // tokenize the prompt
+    auto embd_inp = ::gptneox_tokenize(ctx, params.prompt, false); //true);
+
+    const int n_ctx = gptneox_n_ctx(ctx);
+
+    if ((int) embd_inp.size() > n_ctx - 4) {
+        fprintf(stderr, "%s: error: prompt is too long (%d tokens, max %d)\n", __func__, (int) embd_inp.size(), n_ctx - 4);
+        return 1;
+    }
+    
+    // debug message about similarity of saved session, if applicable
+    size_t n_matching_session_tokens = 0;
+    if (session_tokens.size()) {
+        for (gptneox_token id : session_tokens) {
+            if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
+                break;
+            }
+            n_matching_session_tokens++;
+        }
+        if (n_matching_session_tokens >= embd_inp.size()) {
+            fprintf(stderr, "%s: session file has exact match for prompt!\n", __func__);
+        } else if (n_matching_session_tokens < (embd_inp.size() / 2)) {
+            fprintf(stderr, "%s: warning: session file has low similarity to prompt (%zu / %zu tokens); will mostly be reevaluated\n",
+                __func__, n_matching_session_tokens, embd_inp.size());
+        } else {
+            fprintf(stderr, "%s: session file matches %zu / %zu tokens of prompt\n",
+                __func__, n_matching_session_tokens, embd_inp.size());
+        }
+    }
+
+    // number of tokens to keep when resetting context
+    if (params.n_keep < 0 || params.n_keep > (int)embd_inp.size() || params.instruct) {
+        params.n_keep = (int)embd_inp.size();
+    }
+
+    // in instruct mode, we inject a prefix and a suffix to each input by the user
+    if (params.instruct) {
+        params.interactive_first = true;
+        params.antiprompt.push_back("<|prompter|>");
+    }
+
+    // enable interactive mode if reverse prompt or interactive start is specified
+    if (params.antiprompt.size() != 0 || params.interactive_first) {
+        params.interactive = true;
+    }
+
+    // determine newline token
+    auto gptneox_token_newline = ::gptneox_tokenize(ctx, "\n", false);
+
+    if (params.verbose_prompt) {
+        fprintf(stderr, "\n");
+        fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
+        fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
+        for (int i = 0; i < (int) embd_inp.size(); i++) {
+            fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], gptneox_token_to_str(ctx, embd_inp[i]));
+        }
+        if (params.n_keep > 0) {
+            fprintf(stderr, "%s: static prompt based on n_keep: '", __func__);
+            for (int i = 0; i < params.n_keep; i++) {
+                fprintf(stderr, "%s", gptneox_token_to_str(ctx, embd_inp[i]));
+            }
+            fprintf(stderr, "'\n");
+        }
+        fprintf(stderr, "\n");
+    }
+
+    if (params.interactive) {
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
+        struct sigaction sigint_action;
+        sigint_action.sa_handler = sigint_handler;
+        sigemptyset (&sigint_action.sa_mask);
+        sigint_action.sa_flags = 0;
+        sigaction(SIGINT, &sigint_action, NULL);
+#elif defined (_WIN32)
+        signal(SIGINT, sigint_handler);
+#endif
+
+        fprintf(stderr, "%s: interactive mode on.\n", __func__);
+
+        if (params.antiprompt.size()) {
+            for (auto antiprompt : params.antiprompt) {
+                fprintf(stderr, "Reverse prompt: '%s'\n", antiprompt.c_str());
+            }
+        }
+
+        if (!params.input_prefix.empty()) {
+            fprintf(stderr, "Input prefix: '%s'\n", params.input_prefix.c_str());
+        }
+    }
+    fprintf(stderr, "sampling: repeat_last_n = %d, repeat_penalty = %f, presence_penalty = %f, frequency_penalty = %f, top_k = %d, tfs_z = %f, top_p = %f, typical_p = %f, temp = %f, mirostat = %d, mirostat_lr = %f, mirostat_ent = %f\n",
+            params.repeat_last_n, params.repeat_penalty, params.presence_penalty, params.frequency_penalty, params.top_k, params.tfs_z, params.top_p, params.typical_p, params.temp, params.mirostat, params.mirostat_eta, params.mirostat_tau);
+    fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", n_ctx, params.n_batch, params.n_predict, params.n_keep);
+    fprintf(stderr, "\n\n");
+
+    // TODO: replace with ring-buffer
+    std::vector<gptneox_token> last_n_tokens(n_ctx);
+    std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
+
+    if (params.interactive) {
+        fprintf(stderr, "== Running in interactive mode. ==\n"
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
+               " - Press Ctrl+C to interject at any time.\n"
+#endif
+               " - Press Return to return control to RedPajama.\n"
+               " - If you want to submit another line, end your input in '\\'.\n\n");
+        is_interacting = params.interactive_first;
+    }
+
+    bool is_antiprompt = false;
+    bool input_noecho  = false;
+    
+    // HACK - because session saving incurs a non-negligible delay, for now skip re-saving session
+    // if we loaded a session with at least 75% similarity. It's currently just used to speed up the
+    // initial prompt so it doesn't need to be an exact match.
+    bool need_to_save_session = !path_session.empty() && n_matching_session_tokens < (embd_inp.size() * 3 / 4);
+
+
+    int n_past     = 0;
+    int n_remain   = params.n_predict;
+    int n_consumed = 0;
+    int n_session_consumed = 0;
+
+    // the first thing we will do is to output the prompt, so set color accordingly
+    set_console_color(con_st, CONSOLE_COLOR_PROMPT);
+
+    std::vector<gptneox_token> embd;
+
+    while (n_remain != 0 || params.interactive) {
+        // predict
+        if (embd.size() > 0) {
+            // infinite text generation via context swapping
+            // if we run out of context:
+            // - take the n_keep first tokens from the original prompt (via n_past)
+            // - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
+            if (n_past + (int) embd.size() > n_ctx) {
+                const int n_left = n_past - params.n_keep;
+
+                n_past = params.n_keep;
+
+                // insert n_left/2 tokens at the start of embd from last_n_tokens
+                embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
+                
+                // REVIEW - stop saving session if we run out of context
+                path_session = "";
+
+                //printf("\n---\n");
+                //printf("resetting: '");
+                //for (int i = 0; i < (int) embd.size(); i++) {
+                //    printf("%s", gptneox_token_to_str(ctx, embd[i]));
+                //}
+                //printf("'\n");
+                //printf("\n---\n");
+            }
+            
+            // try to reuse a matching prefix from the loaded session instead of re-eval (via n_past)
+            // REVIEW
+            if (n_session_consumed < (int) session_tokens.size()) {
+                size_t i = 0;
+                for ( ; i < embd.size(); i++) {
+                    if (embd[i] != session_tokens[n_session_consumed]) {
+                        session_tokens.resize(n_session_consumed);
+                        break;
+                    }
+
+                    n_past++;
+                    n_session_consumed++;
+
+                    if (n_session_consumed >= (int) session_tokens.size()) {
+                        break;
+                    }
+                }
+                if (i > 0) {
+                    embd.erase(embd.begin(), embd.begin() + i);
+                }
+            }
+
+            // evaluate tokens in batches
+            // embd is typically prepared beforehand to fit within a batch, but not always
+            for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
+                int n_eval = (int) embd.size() - i;
+                if (n_eval > params.n_batch) {
+                    n_eval = params.n_batch;
+                }
+                if (gptneox_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
+                    fprintf(stderr, "%s : failed to eval\n", __func__);
+                    return 1;
+                }
+                n_past += n_eval;
+            }
+            
+            if (embd.size() > 0 && !path_session.empty()) {
+                session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
+                n_session_consumed = session_tokens.size();
+            }
+        }
+
+        embd.clear();
+
+        if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
+            // out of user input, sample next token
+            const float   temp            = params.temp;
+            const int32_t top_k           = params.top_k <= 0 ? gptneox_n_vocab(ctx) : params.top_k;
+            const float   top_p           = params.top_p;
+            const float   tfs_z           = params.tfs_z;
+            const float   typical_p       = params.typical_p;
+            const int32_t repeat_last_n   = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
+            const float   repeat_penalty  = params.repeat_penalty;
+            const float   alpha_presence  = params.presence_penalty;
+            const float   alpha_frequency = params.frequency_penalty;
+            const int     mirostat        = params.mirostat;
+            const float   mirostat_tau    = params.mirostat_tau;
+            const float   mirostat_eta    = params.mirostat_eta;
+            const bool    penalize_nl     = params.penalize_nl;
+
+            // optionally save the session on first sample (for faster prompt loading next time)
+            if (!path_session.empty() && need_to_save_session) {
+                need_to_save_session = false;
+                gptneox_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
+            }
+
+            gptneox_token id = 0;
+
+            {
+                auto logits = gptneox_get_logits(ctx);
+                auto n_vocab = gptneox_n_vocab(ctx);
+                
+                // Apply params.logit_bias map
+                for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
+                    logits[it->first] += it->second;
+                }
+
+                std::vector<gptneox_token_data> candidates;
+                candidates.reserve(n_vocab);
+                for (gptneox_token token_id = 0; token_id < n_vocab; token_id++) {
+                    candidates.emplace_back(gptneox_token_data{token_id, logits[token_id], 0.0f});
+                }
+
+                gptneox_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
+
+                // Apply penalties
+                gptneox_token nl_token = gptneox_str_to_token(ctx, "\n");
+                float nl_logit = logits[nl_token];
+                auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
+                gptneox_sample_repetition_penalty(ctx, &candidates_p,
+                    last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+                    last_n_repeat, repeat_penalty);
+                gptneox_sample_frequency_and_presence_penalties(ctx, &candidates_p,
+                    last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
+                    last_n_repeat, alpha_frequency, alpha_presence);
+                if (!penalize_nl) {
+                    logits[nl_token] = nl_logit;
+                }
+
+                if (temp <= 0) {
+                    // Greedy sampling
+                    id = gptneox_sample_token_greedy(ctx, &candidates_p);
+                } else {
+                    if (mirostat == 1) {
+                        static float mirostat_mu = 2.0f * mirostat_tau;
+                        const int mirostat_m = 100;
+                        gptneox_sample_temperature(ctx, &candidates_p, temp);
+                        id = gptneox_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
+                    } else if (mirostat == 2) {
+                        static float mirostat_mu = 2.0f * mirostat_tau;
+                        gptneox_sample_temperature(ctx, &candidates_p, temp);
+                        id = gptneox_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
+                    } else {
+                        // Temperature sampling
+                        gptneox_sample_top_k(ctx, &candidates_p, top_k, 1);
+                        gptneox_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
+                        gptneox_sample_typical(ctx, &candidates_p, typical_p, 1);
+                        gptneox_sample_top_p(ctx, &candidates_p, top_p, 1);
+                        gptneox_sample_temperature(ctx, &candidates_p, temp);
+                        id = gptneox_sample_token(ctx, &candidates_p);
+                    }
+                }
+                // printf("`%d`", candidates_p.size);
+
+                last_n_tokens.erase(last_n_tokens.begin());
+                last_n_tokens.push_back(id);
+            }
+
+            // replace end of text token with newline token when in interactive mode
+            if (id == gptneox_token_eos() && params.interactive && !params.instruct) {
+                id = gptneox_token_newline.front();
+                if (params.antiprompt.size() != 0) {
+                    // tokenize and inject first reverse prompt
+                    const auto first_antiprompt = ::gptneox_tokenize(ctx, params.antiprompt.front(), false);
+                    embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
+                }
+            }
+
+            // add it to the context
+            embd.push_back(id);
+
+            // echo this to console
+            input_noecho = false;
+
+            // decrement remaining sampling budget
+            --n_remain;
+        } else {
+            // some user input remains from prompt or interaction, forward it to processing
+            while ((int) embd_inp.size() > n_consumed) {
+                embd.push_back(embd_inp[n_consumed]);
+                last_n_tokens.erase(last_n_tokens.begin());
+                last_n_tokens.push_back(embd_inp[n_consumed]);
+                ++n_consumed;
+                if ((int) embd.size() >= params.n_batch) {
+                    break;
+                }
+            }
+        }
+
+        // display text
+        if (!input_noecho) {
+            for (auto id : embd) {
+                printf("%s", gptneox_token_to_str(ctx, id));
+            }
+            fflush(stdout);
+        }
+        // reset color to default if there is no pending user input
+        if (!input_noecho && (int)embd_inp.size() == n_consumed) {
+            set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+        }
+
+        // in interactive mode, and not currently processing queued inputs;
+        // check if we should prompt the user for more
+        if (params.interactive && (int) embd_inp.size() <= n_consumed) {
+
+            // check for reverse prompt
+            if (params.antiprompt.size()) {
+                std::string last_output;
+                for (auto id : last_n_tokens) {
+                    last_output += gptneox_token_to_str(ctx, id);
+                }
+
+                is_antiprompt = false;
+                // Check if each of the reverse prompts appears at the end of the output.
+                for (std::string & antiprompt : params.antiprompt) {
+                    if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
+                        is_interacting = true;
+                        is_antiprompt = true;
+                        set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
+                        fflush(stdout);
+                        break;
+                    }
+                }
+            }
+
+            if (n_past > 0 && is_interacting) {
+                // potentially set color to indicate we are taking user input
+                set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
+
+#if defined (_WIN32)
+                // Windows: must reactivate sigint handler after each signal
+                signal(SIGINT, sigint_handler);
+#endif
+
+                if (params.instruct) {
+                    printf("\n> ");
+                }
+
+                std::string buffer;
+                if (!params.input_prefix.empty()) {
+                    buffer += params.input_prefix;
+                    printf("%s", buffer.c_str());
+                }
+
+                std::string line;
+                bool another_line = true;
+                do {
+#if defined(_WIN32)
+                    std::wstring wline;
+                    if (!std::getline(std::wcin, wline)) {
+                        // input stream is bad or EOF received
+                        return 0;
+                    }
+                    win32_utf8_encode(wline, line);
+#else
+                    if (!std::getline(std::cin, line)) {
+                        // input stream is bad or EOF received
+                        return 0;
+                    }
+#endif
+                    if (line.empty() || line.back() != '\\') {
+                        another_line = false;
+                    } else {
+                        line.pop_back(); // Remove the continue character
+                    }
+                    buffer += line + '\n'; // Append the line to the result
+                } while (another_line);
+
+                // done taking input, reset color
+                set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+
+                // Add tokens to embd only if the input buffer is non-empty
+                // Entering an empty line lets the user pass control back
+                if (buffer.length() > 1) {
+
+                    auto line_inp = ::gptneox_tokenize(ctx, buffer, false);
+                    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
+                    n_remain -= line_inp.size();
+                }
+
+                input_noecho = true; // do not echo this again
+            }
+
+            if (n_past > 0) {
+                is_interacting = false;
+            }
+        }
+
+        // end of text token
+        if (!embd.empty() && embd.back() == gptneox_token_eos()) {
+            if (params.instruct) {
+                is_interacting = true;
+            } else {
+                fprintf(stderr, " [end of text]\n");
+                break;
+            }
+        }
+
+        // In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
+        if (params.interactive && n_remain <= 0 && params.n_predict != -1) {
+            n_remain = params.n_predict;
+            is_interacting = true;
+        }
+    }
+
+#if defined (_WIN32)
+    signal(SIGINT, SIG_DFL);
+#endif
+    printf("\n\n");
+    gptneox_print_timings(ctx);
+    gptneox_free(ctx);
+
+    set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
+
+    return 0;
+}
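The default (non-mirostat) branch of the sampling code above narrows the candidate list and then draws a token. A minimal Python sketch of that idea follows, under loud assumptions: `sample_token` and `softmax` are hypothetical helpers, not part of the gptneox API, and the tail-free and typical-sampling filters as well as the repetition penalties are left out for brevity.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, top_k, top_p, temp, rng):
    """Hypothetical helper: top-k, then top-p, then temperature sampling."""
    # Rank token ids by logit, best first, and keep at most top_k of them.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    order = order[:max(1, top_k)]

    # Nucleus (top-p) cut: keep the shortest prefix whose mass reaches top_p.
    probs = softmax([logits[i] for i in order])
    kept, mass = [], 0.0
    for tok, p in zip(order, probs):
        kept.append(tok)
        mass += p
        if mass >= top_p:
            break

    # temp <= 0 means greedy sampling, mirroring the branch in the C++ code.
    if temp <= 0:
        return kept[0]

    # Scale the surviving logits by the temperature and draw one token.
    probs = softmax([logits[i] / temp for i in kept])
    return rng.choices(kept, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_token([0.1, 2.0, -1.0, 1.5], top_k=2, top_p=0.95, temp=0.8, rng=rng))
```

Passing the random generator in explicitly keeps the sketch reproducible; the C++ code instead keeps its RNG state inside the context.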
--- a/third_party/radpajama/quantize-gptneox.cpp
+++ b/third_party/radpajama/quantize-gptneox.cpp
@ -0,0 +1,82 @@
+#include "ggml.h"
+#include "gptneox.h"
+
+#include <cstdio>
+#include <map>
+#include <string>
+
+static const std::map<std::string, enum gptneox_ftype> GPTNEOX_FTYPE_MAP = {
+  {"q4_0", GPTNEOX_FTYPE_MOSTLY_Q4_0},
+  {"q4_1", GPTNEOX_FTYPE_MOSTLY_Q4_1},
+  {"q4_2", GPTNEOX_FTYPE_MOSTLY_Q4_2},
+  //{"q4_3", GPTNEOX_FTYPE_MOSTLY_Q4_3},
+  {"q5_0", GPTNEOX_FTYPE_MOSTLY_Q5_0},
+  {"q5_1", GPTNEOX_FTYPE_MOSTLY_Q5_1},
+  {"q8_0", GPTNEOX_FTYPE_MOSTLY_Q8_0},
+};
+
+// usage:
+//  ./quantize-gptneox model-f32.bin model-quant.bin type [nthread]
+//
+int main(int argc, char ** argv) {
+    ggml_time_init();
+
+    if (argc < 4) {
+        fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]);
+        for (auto it = GPTNEOX_FTYPE_MAP.begin(); it != GPTNEOX_FTYPE_MAP.end(); it++) {
+            fprintf(stderr, "  type = \"%s\" or %d\n", it->first.c_str(), it->second);
+        }
+        return 1;
+    }
+
+    // needed to initialize f16 tables
+    {
+        struct ggml_init_params params = { 0, NULL, false };
+        struct ggml_context * ctx = ggml_init(params);
+        ggml_free(ctx);
+    }
+
+    const std::string fname_inp = argv[1];
+    const std::string fname_out = argv[2];
+
+    enum gptneox_ftype ftype;
+    if (argv[3][0] == 'q') {
+        auto it = GPTNEOX_FTYPE_MAP.find(argv[3]);
+        if (it == GPTNEOX_FTYPE_MAP.end()) {
+            fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
+            return 1;
+        }
+        ftype = it->second;
+    } else {
+        ftype = (enum gptneox_ftype)atoi(argv[3]);
+    }
+
+    int nthread = argc > 4 ? atoi(argv[4]) : 0;
+
+    const int64_t t_main_start_us = ggml_time_us();
+
+    int64_t t_quantize_us = 0;
+
+    // quantize the model
+    {
+        const int64_t t_start_us = ggml_time_us();
+
+        if (gptneox_model_quantize(fname_inp.c_str(), fname_out.c_str(), ftype, nthread)) {
+            fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
+            return 1;
+        }
+
+        t_quantize_us = ggml_time_us() - t_start_us;
+    }
+
+    // report timing
+    {
+        const int64_t t_main_end_us = ggml_time_us();
+
+        printf("\n");
+        printf("%s: quantize time = %8.2f ms\n", __func__, t_quantize_us/1000.0);
+        printf("%s:    total time = %8.2f ms\n", __func__, (t_main_end_us - t_main_start_us)/1000.0);
+    }
+
+    return 0;
+}
--- a/third_party/radpajama/scripts/convert_gptneox_to_ggml.py
+++ b/third_party/radpajama/scripts/convert_gptneox_to_ggml.py
@ -0,0 +1,144 @@
+# Convert Hugging Face fine-tuned gpt-neox-like models to ggml format
+
+import io
+import os
+import sys
+import struct
+import json
+import code
+import torch
+import numpy as np
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
+def bytes_to_unicode():
+    """
+    Returns a dict mapping utf-8 bytes to unicode strings.
+    The reversible bpe codes work on unicode strings.
+    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
+    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
+    This is a significant percentage of your normal, say, 32K bpe vocab.
+    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
+    It also avoids mapping to whitespace/control characters, which the bpe code barfs on.
+    """
+    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
+    cs = bs[:]
+    n = 0
+    for b in range(2**8):
+        if b not in bs:
+            bs.append(b)
+            cs.append(2**8+n)
+            n += 1
+    cs = [chr(n) for n in cs]
+    return dict(zip(bs, cs))
+
+if len(sys.argv) < 3:
+    print("Usage: python convert_gptneox_to_ggml.py model_name dir-output [use-f32]")
+    print("  model_name: name of the model to convert. Example: 'togethercomputer/RedPajama-INCITE-Base-3B-v1'")
+    print("  dir-output: directory where the output file will be written")
+    print("  use-f32:    if present, use float32 instead of float16")
+    sys.exit(1)
+
+model_name = sys.argv[1]
+dir_out = sys.argv[2]
+model_cache_dir = dir_out + "-cache"
+
+# make sure the output directory exists
+os.makedirs(dir_out, exist_ok=True)
+
+# possible data types
+#   ftype == 0 -> float32
+#   ftype == 1 -> float16
+#
+# map from ftype to string
+ftype_str = ["f32", "f16"]
+ftype = 1
+if len(sys.argv) > 3:
+    ftype = 0
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+print("Loading model: ", model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if ftype == 1 else torch.float32, 
+                                             cache_dir=model_cache_dir)
+model.eval()
+for p in model.parameters():
+    p.requires_grad = False
+hparams = model.config.to_dict()
+print("Model loaded: ", model_name)
+
+fn_bin = f"/ggml-{model_name.split('/')[-1]}-{ftype_str[ftype]}.bin"
+fn_out = dir_out + fn_bin
+fout = open(fn_out, "wb")
+
+ggml_file_magic = 0x67676d66 # 0x67676d6c is unversioned
+ggml_file_version = 0x00000001 # v1
+
+hparams["multiple_of"] = 1
+fout.write(struct.pack("i", ggml_file_magic)) # magic: ggmf in hex
+fout.write(struct.pack("i", ggml_file_version))
+fout.write(struct.pack("i", hparams["vocab_size"]))
+fout.write(struct.pack("i", hparams["max_position_embeddings"]))
+fout.write(struct.pack("i", hparams["hidden_size"]))
+fout.write(struct.pack("i", hparams["num_attention_heads"]))
+fout.write(struct.pack("i", hparams["num_hidden_layers"]))
+fout.write(struct.pack("i", int((hparams["hidden_size"] / hparams["num_attention_heads"]
+                             ) * hparams["rotary_pct"]))) # rotary_dim
+fout.write(struct.pack("i", int(hparams["use_parallel_residual"])))
+fout.write(struct.pack("i", ftype))
+
+# Is this correct??
+dot_token = tokenizer.encode(".")[0]
+for i in range(hparams["vocab_size"]):
+    text = tokenizer.decode([i]).encode('utf-8')
+    fout.write(struct.pack("i", len(text)))
+    fout.write(text)
+
+list_vars = model.state_dict()
+
+print(hparams)
+
+for name in list_vars.keys():
+    if name.startswith('gpt_neox.layers.'):
+        if 'attention.masked_bias' in name or \
+            'attention.rotary_emb.inv_freq' in name or \
+            'attention.bias' in name:
+            continue
+    # No gradients for these
+    list_vars[name].requires_grad = False
+    src = name
+    nn = name
+
+    print(src, ' -> ', name)
+    data = list_vars[src].squeeze().numpy()
+    data = data.astype(np.float32)
+
+    n_dims = len(data.shape)
+    print(name, n_dims, data.shape)
+
+    # default type is fp32
+    ftype_cur = 0
+    if ftype == 1 and n_dims > 1:
+        print("  Converting to float16", data.shape, data[:3, :3].tolist())
+        data = data.astype(np.float16)
+        ftype_cur = 1
+    else:
+        print("  Converting to float32", data.shape,
+              data[:3, :3].tolist() if n_dims > 1 else data[:3].tolist())
+        data = data.astype(np.float32)
+
+    # header
+    name_bytes = name.encode('utf-8')  # avoid shadowing the built-in str
+    fout.write(struct.pack("iii", n_dims, len(name_bytes), ftype_cur))
+    for i in range(n_dims):
+        fout.write(struct.pack("i", data.shape[n_dims - 1 - i]))
+    print(name_bytes)
+    fout.write(name_bytes)
+
+    # data
+    data.tofile(fout)
+
+fout.close()
+
+print("Done. Output file: " + fn_out)
+print("")
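The converter above writes ten native `int` header fields followed by one length-prefixed entry per vocabulary token. As a rough check of that layout, the sketch below reads both back. `read_header` and `read_vocab` are hypothetical helper names, the field names simply follow the order of the `struct.pack` calls in the script, and a 4-byte native `int` is assumed.

```python
import io
import struct

GGMF_MAGIC = 0x67676D66  # "ggmf", the versioned magic written by the converter

# Field order mirrors the struct.pack("i", ...) calls in the script above.
HEADER_FIELDS = (
    "magic", "version", "vocab_size", "max_position_embeddings",
    "hidden_size", "num_attention_heads", "num_hidden_layers",
    "rotary_dim", "use_parallel_residual", "ftype",
)


def read_header(f):
    """Hypothetical helper: read the fixed-size header into a dict."""
    size = struct.calcsize("i") * len(HEADER_FIELDS)
    raw = f.read(size)
    if len(raw) != size:
        raise ValueError("file too short for a ggmf header")
    header = dict(zip(HEADER_FIELDS, struct.unpack(f"{len(HEADER_FIELDS)}i", raw)))
    if header["magic"] != GGMF_MAGIC:
        raise ValueError(f"unexpected magic: {header['magic']:#x}")
    return header


def read_vocab(f, vocab_size):
    """Hypothetical helper: read vocab_size length-prefixed token strings."""
    vocab = []
    for _ in range(vocab_size):
        (length,) = struct.unpack("i", f.read(struct.calcsize("i")))
        vocab.append(f.read(length))
    return vocab


if __name__ == "__main__":
    # Build a tiny in-memory file with made-up values and read it back.
    buf = io.BytesIO()
    for value in (GGMF_MAGIC, 1, 2, 2048, 2560, 32, 32, 80, 0, 1):
        buf.write(struct.pack("i", value))
    for token in (b"hi", b"."):
        buf.write(struct.pack("i", len(token)))
        buf.write(token)
    buf.seek(0)
    header = read_header(buf)
    print(header["vocab_size"], read_vocab(buf, header["vocab_size"]))
```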
--- a/third_party/radpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
+++ b/third_party/radpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
@ -0,0 +1,21 @@
+#!/bin/bash
+
+# cd to scripts dir
+cd "$(dirname "$0")"
+
+# download model to models dir
+echo "Downloading model"
+python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Base-3B-v1 ../models/pythia
+
+# remove temp cache dir
+echo "Removing temp cache dir"
+rm -r ../models/pythia-cache
+
+# quantize model
+echo "Quantizing model (q4_0)"
+cd ../../..
+python ./third_party/radpajama/scripts/quantize-gptneox.py ./third_party/radpajama/models/pythia/ggml-RedPajama-INCITE-Base-3B-v1-f16.bin
+
+
+# done!
+echo "Done."
--- a/third_party/radpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
+++ b/third_party/radpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
@ -0,0 +1,21 @@
+#!/bin/bash
+
+# cd to scripts dir
+cd "$(dirname "$0")"
+
+# download model to models dir
+echo "Downloading model"
+python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Chat-3B-v1 ../models/pythia
+
+# remove temp cache dir
+echo "Removing temp cache dir"
+rm -r ../models/pythia-cache
+
+# quantize model
+echo "Quantizing model (q4_0)"
+cd ../../..
+python ./third_party/radpajama/scripts/quantize-gptneox.py ./third_party/radpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin
+
+
+# done!
+echo "Done."
--- a/third_party/radpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
+++ b/third_party/radpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
@ -0,0 +1,21 @@
+#!/bin/bash
+
+# cd to scripts dir
+cd "$(dirname "$0")"
+
+# download model to models dir
+echo "Downloading model"
+python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Instruct-3B-v1 ../models/pythia
+
+# remove temp cache dir
+echo "Removing temp cache dir"
+rm -r ../models/pythia-cache
+
+# quantize model
+echo "Quantizing model (q4_0)"
+cd ../../..
+python ./third_party/radpajama/scripts/quantize-gptneox.py ./third_party/radpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-f16.bin
+
+
+# done!
+echo "Done."
--- a/third_party/radpajama/scripts/quantize-gptneox.py
+++ b/third_party/radpajama/scripts/quantize-gptneox.py
@ -0,0 +1,141 @@
+#!/usr/bin/env python3
+
+"""Script to execute the "quantize" script on a given set of models."""
+
+import subprocess
+import argparse
+import glob
+import sys
+import os
+
+
+def main():
+    """Update the quantize binary name depending on the platform and parse
+    the command line arguments and execute the script.
+    """
+
+    if "linux" in sys.platform or "darwin" in sys.platform:
+        quantize_script_binary = "quantize-gptneox"
+
+    elif "win32" in sys.platform or "cygwin" in sys.platform:
+        quantize_script_binary = "quantize-gptneox.exe"
+
+    else:
+        print("WARNING: Unknown platform. Assuming a UNIX-like OS.\n")
+        quantize_script_binary = "quantize-gptneox"
+
+    parser = argparse.ArgumentParser(
+        prog='python3 quantize-gptneox.py',
+        description='This script quantizes the given models by applying the '
+        f'"{quantize_script_binary}" script on them.'
+    )
+    parser.add_argument('model_path')
+    #parser.add_argument(
+    #    'models', nargs='+', choices=('7B', '13B', '30B', '65B'),
+    #    help='The models to quantize.'
+    #)
+    parser.add_argument(
+        '-r', '--remove-16', action='store_true', dest='remove_f16',
+        help='Remove the f16 model after quantizing it.'
+    )
+    #parser.add_argument(
+    #    '-m', '--models-path', dest='models_path',
+    #    default=os.path.join(os.getcwd(), "models"),
+    #    help='Specify the directory where the models are located.'
+    #)
+    parser.add_argument(
+        '-q', '--quantize-script-path', dest='quantize_script_path',
+        default=os.path.join(os.getcwd(), quantize_script_binary),
+        help='Specify the path to the "quantize" script.'
+    )
+
+    parser.add_argument(
+        '--quantize-output-type', dest='quantize_output_type', type=str,
+        default='q4_0',
+        help='Specify the quantization output type (e.g. q4_0).'
+    )
+
+
+    # TODO: Revise this code
+    # parser.add_argument(
+    #     '-t', '--threads', dest='threads', type='int',
+    #     default=os.cpu_count(),
+    #     help='Specify the number of threads to use to quantize many models at '
+    #     'once. Defaults to os.cpu_count().'
+    # )
+
+    args = parser.parse_args()
+    args.model_path = os.path.abspath(args.model_path)
+    #args.models_path = os.path.abspath(args.models_path)
+
+    if not os.path.isfile(args.quantize_script_path):
+        print(
+            f'The "{quantize_script_binary}" script was not found in the '
+            "current location.\nIf you want to use it from another location, "
+            "set the --quantize-script-path argument from the command line."
+        )
+        sys.exit(1)
+
+    #for model in args.models:
+    # The model is separated in various parts
+    # (ggml-model-f16.bin, ggml-model-f16.bin.0, ggml-model-f16.bin.1...)
+    #f16_model_path_base = os.path.join(
+    #    args.models_path, model, "ggml-model-f16.bin"
+    #)
+    f16_model_path_base = args.model_path
+
+    if not os.path.isfile(f16_model_path_base):
+        print(f'The file {f16_model_path_base} was not found')
+        sys.exit(1)
+
+    # glob.glob() already returns full paths here (the pattern is
+    # absolute), so the matches are used as-is rather than re-joined
+    # onto the base path.
+    f16_model_parts_paths = glob.glob(f"{f16_model_path_base}*")
+
+    for f16_model_part_path in f16_model_parts_paths:
+        if not os.path.isfile(f16_model_part_path):
+            print(
+                f"The f16 model {os.path.basename(f16_model_part_path)} "
+                f"was not found in {os.path.dirname(f16_model_part_path)}"
+                f"{os.path.sep}. Pass the full path to the f16 model as "
+                "the model_path argument."
+            )
+            sys.exit(1)
+
+        __run_quantize_script(
+            args.quantize_script_path, f16_model_part_path, args.quantize_output_type
+        )
+
+        if args.remove_f16:
+            os.remove(f16_model_part_path)
+
+
+# This was extracted to a top-level function for parallelization, if
+# implemented. See https://github.com/ggerganov/llama.cpp/pull/222/commits/f8db3d6cd91bf1a1342db9d29e3092bc12dd783c#r1140496406
+
+def __run_quantize_script(script_path, f16_model_part_path, quantize_output_type):
+    """Run the quantize script specifying the path to it and the path to the
+    f16 model to quantize.
+    """
+
+    new_quantized_model_path = f16_model_part_path.replace("f16", quantize_output_type)
+    subprocess.run(
+        [script_path, f16_model_part_path, new_quantized_model_path, quantize_output_type],
+        check=True
+    )
+
+
+if __name__ == "__main__":
+    try:
+        main()
+
+    except subprocess.CalledProcessError:
+        print("\nAn error occurred while trying to quantize the models.")
+        sys.exit(1)
+
+    except KeyboardInterrupt:
+        sys.exit(0)
+
+    else:
+        print("\nSuccessfully quantized all models.")
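`__run_quantize_script` derives the output name with `str.replace("f16", ...)`, which rewrites every occurrence of "f16" in the path, including directory names. One possible alternative, sketched below, touches only the file name's trailing `-f16` marker. `quantized_path` is a hypothetical helper, and it assumes single-file models named like `ggml-<model>-f16.bin`, as the converter in this commit writes them.

```python
import os


def quantized_path(f16_path, output_type):
    """Hypothetical helper: swap only the trailing "-f16" of the file name."""
    directory, filename = os.path.split(f16_path)
    stem, ext = os.path.splitext(filename)
    if stem.endswith("-f16"):
        # Replace just the suffix; directories named "f16" stay untouched.
        stem = stem[: -len("f16")] + output_type
    else:
        # No recognizable marker: append the type rather than guess.
        stem = f"{stem}-{output_type}"
    return os.path.join(directory, stem + ext)


if __name__ == "__main__":
    print(quantized_path(os.path.join("models", "f16-run", "ggml-x-f16.bin"), "q4_0"))
```

With plain `str.replace`, the same input would also rename the `f16-run` directory component, so the quantize binary would be asked to write into a directory that may not exist.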