Import redpajama.cpp into COSMO.

This is the relevant commit: bfa6466199
This commit is contained in:
Ariel Nunez 2023-05-11 06:50:36 -04:00
parent 3083335d07
commit 52e471331c
16 changed files with 5913 additions and 17 deletions

View file

@ -1,6 +1,6 @@
DESCRIPTION
ggml is a machine learning library useful for LLM inference on CPUs
radpajama is a port of ggml for the open source Red Pajama LLM. It started as a fork of redpajama.cpp from Together Computer.
LICENSE
@ -8,22 +8,12 @@ LICENSE
ORIGIN
https://github.com/ggerganov/llama.cpp
commit 0b2da20538d01926b77ea237dd1c930c4d20b686
Author: Stephan Walter <stephan@walter.name>
Date: Wed Apr 26 20:26:42 2023 +0000
ggml : slightly faster AVX2 implementation for Q5 (#1197)
github.com/togethercomputer/redpajama.cpp/
commit bfa6466199b8ef92185ecb72e2a550e12baf6602
Author: Szhangce <czhang@cs.stanford.edu>
Date: Tue May 9 00:50:22 2023 +0200
radpajama : Update README.md
LOCAL CHANGES
- Make it possible for loaded prompts to be cached to disk
- Introduce -v and --verbose flags
- Reduce batch size from 512 to 32
- Allow --n_keep to specify a substring of prompt
- Don't print stats / diagnostics unless -v is passed
- Reduce --top_p default from 0.95 to 0.70
- Change --reverse-prompt to no longer imply --interactive
- Permit --reverse-prompt specifying custom EOS if non-interactive
- Refactor headers per cosmo convention
- Replace code like 'ggjt' with READ32BE("ggjt")
- Remove C++ exceptions; use Die() function instead
- Updated headers for COSMO build.

143
third_party/radpajama/README.md vendored Normal file
View file

@ -0,0 +1,143 @@
# gglm Support for RedPajama Model
## Ackonwledgement
We highly appreciate the great effort from the fork of [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp). Our support of the RedPajama Model is mainly based on this implementation. We extend the model configure and fixed a bug when setting use_parallel_residual flag to False in their original implementation. We also extend the chat model for RedPajama.
## Usage:
### RedPajama Chat model:
- Make the code:
make redpajama-chat quantize-gptneox
- Prepare the RedPajama model (f16 and q4_0) for gglm:
bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
- Run RedPajama chat model (fp16):
./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin \
-c 2048 \
-b 128 \
-n 1 \
-t 8 \
--instruct \
--color \
--top_k 30 \
--top_p 0.95 \
--temp 0.8 \
--repeat_last_n 3 \
--repeat_penalty 1.1 \
--seed 0
Note that you may need to install torch and transformers to run the above scripts, e.g.:
pip install torch==2.0.0
pip install transformers==4.28.1
- Run RedPajama chat model (q4_0):
./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-q4_0.bin \
-c 2048 \
-b 128 \
-n 1 \
-t 8 \
--instruct \
--color \
--top_k 30 \
--top_p 0.95 \
--temp 0.8 \
--repeat_last_n 3 \
--repeat_penalty 1.1 \
--seed 0
- Run other quantized version of RedPajama Chat model (Make sure you get the f16 model prepared before you run this):
- Make the code to quantize the model if you have not:
make quantize-gptneox
- Generate the quantized model, the supported types include: q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to run q4_1, you need to do the following convertion:
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin --quantize-output-type q4_1
- Then you can chat with the quantized model:
./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-q4_1.bin \
-c 2048 \
-b 128 \
-n 1 \
-t 8 \
--instruct \
--color \
--top_k 30 \
--top_p 0.95 \
--temp 0.8 \
--repeat_last_n 3 \
--repeat_penalty 1.1 \
--seed 0
### RedPajama Base/Instruct model:
- Make the code:
make redpajama quantize-gptneox
- Prepare the RedPajama Base/Instruct model (f16 and q4_0) for gglm:
bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
# Or
bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
- Run other quantize version of RedPajama Base/Instruct model (Make sure you get the f16 model prepared before you run this). Then you can generate the quantized model, the supported types include: q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to run q4_1, you need to do the following convertion, e.g for RedPajama-Base q8_0:
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Base-3B-v1-f16.bin --quantize-output-type q8_0
- Run RedPajama Base/Instruct model (e.g., RedPajama-Instruct q8_0) :
./redpajama -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-q8_0.bin \
-c 2048 \
-b 128 \
-n 1 \
-t 8 \
--color \
--top_k 30 \
--top_p 0.95 \
--temp 0.8 \
--repeat_last_n 3 \
--repeat_penalty 1.1 \
--seed 0 \
--n_predict 256 \
--verbose-prompt \
-p "How to schedule a tour to Anfield:"
## Attribution
The following files are covered by a MIT license and were taken from:
https://github.com/byroneverson/gptneox.cpp
Thank you Byron.
```
common-gptneox.cpp
copy-gptneox.cpp
gptneox.cpp
quantize-gptneox.cpp
common-gptneox.h
gptneox-util.h
gptneox.h
convert_gptneox_to_ggml.py
quantize-gptneox.py
```

429
third_party/radpajama/common-gptneox.cpp vendored Normal file
View file

@ -0,0 +1,429 @@
#include "common-gptneox.h"
#include <cassert>
#include <cstring>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
#include <sstream>
#include <iostream>
#if defined (_WIN32)
#include <fcntl.h>
#include <io.h>
#pragma comment(lib,"kernel32.lib")
extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
extern "C" __declspec(dllimport) int __stdcall SetConsoleCP(unsigned int wCodePageID);
extern "C" __declspec(dllimport) int __stdcall SetConsoleOutputCP(unsigned int wCodePageID);
extern "C" __declspec(dllimport) int __stdcall WideCharToMultiByte(unsigned int CodePage, unsigned long dwFlags,
const wchar_t * lpWideCharStr, int cchWideChar,
char * lpMultiByteStr, int cbMultiByte,
const char * lpDefaultChar, bool * lpUsedDefaultChar);
#define CP_UTF8 65001
#endif
bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
// determine sensible default number of threads.
// std::thread::hardware_concurrency may not be equal to the number of cores, or may return 0.
#ifdef __linux__
std::ifstream cpuinfo("/proc/cpuinfo");
params.n_threads = std::count(std::istream_iterator<std::string>(cpuinfo),
std::istream_iterator<std::string>(),
std::string("processor"));
#endif
if (params.n_threads == 0) {
params.n_threads = std::max(1, (int32_t) std::thread::hardware_concurrency());
}
bool invalid_param = false;
std::string arg;
gpt_params default_params;
for (int i = 1; i < argc; i++) {
arg = argv[i];
if (arg == "-s" || arg == "--seed") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.seed = std::stoi(argv[i]);
} else if (arg == "-t" || arg == "--threads") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_threads = std::stoi(argv[i]);
} else if (arg == "-p" || arg == "--prompt") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.prompt = argv[i];
} else if (arg == "--session") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.path_session = argv[i];
} else if (arg == "-f" || arg == "--file") {
if (++i >= argc) {
invalid_param = true;
break;
}
std::ifstream file(argv[i]);
if (!file) {
fprintf(stderr, "error: failed to open file '%s'\n", argv[i]);
invalid_param = true;
break;
}
std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.prompt));
if (params.prompt.back() == '\n') {
params.prompt.pop_back();
}
} else if (arg == "-n" || arg == "--n_predict") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_predict = std::stoi(argv[i]);
} else if (arg == "--top_k") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.top_k = std::stoi(argv[i]);
} else if (arg == "-c" || arg == "--ctx_size") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_ctx = std::stoi(argv[i]);
} else if (arg == "--memory_f32") {
params.memory_f16 = false;
} else if (arg == "--top_p") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.top_p = std::stof(argv[i]);
} else if (arg == "--temp") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.temp = std::stof(argv[i]);
} else if (arg == "--tfs") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.tfs_z = std::stof(argv[i]);
} else if (arg == "--typical") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.typical_p = std::stof(argv[i]);
} else if (arg == "--repeat_last_n") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.repeat_last_n = std::stoi(argv[i]);
} else if (arg == "--repeat_penalty") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.repeat_penalty = std::stof(argv[i]);
} else if (arg == "--frequency_penalty") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.frequency_penalty = std::stof(argv[i]);
} else if (arg == "--presence_penalty") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.presence_penalty = std::stof(argv[i]);
} else if (arg == "--mirostat") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.mirostat = std::stoi(argv[i]);
} else if (arg == "--mirostat_lr") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.mirostat_eta = std::stof(argv[i]);
} else if (arg == "--mirostat_ent") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.mirostat_tau = std::stof(argv[i]);
} else if (arg == "-b" || arg == "--batch_size") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_batch = std::stoi(argv[i]);
params.n_batch = std::min(512, params.n_batch);
} else if (arg == "--keep") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_keep = std::stoi(argv[i]);
} else if (arg == "-m" || arg == "--model") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.model = argv[i];
} else if (arg == "--lora") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.lora_adapter = argv[i];
params.use_mmap = false;
} else if (arg == "--lora-base") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.lora_base = argv[i];
} else if (arg == "-i" || arg == "--interactive") {
params.interactive = true;
} else if (arg == "--embedding") {
params.embedding = true;
} else if (arg == "--interactive-first") {
params.interactive_first = true;
} else if (arg == "-ins" || arg == "--instruct") {
params.instruct = true;
} else if (arg == "--color") {
params.use_color = true;
} else if (arg == "--mlock") {
params.use_mlock = true;
} else if (arg == "--no-mmap") {
params.use_mmap = false;
} else if (arg == "--mtest") {
params.mem_test = true;
} else if (arg == "--verbose-prompt") {
params.verbose_prompt = true;
} else if (arg == "-r" || arg == "--reverse-prompt") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.antiprompt.push_back(argv[i]);
} else if (arg == "--perplexity") {
params.perplexity = true;
} else if (arg == "--ignore-eos") {
params.logit_bias[gptneox_token_eos()] = -INFINITY;
} else if (arg == "--no-penalize-nl") {
params.penalize_nl = false;
} else if (arg == "-l" || arg == "--logit-bias") {
if (++i >= argc) {
invalid_param = true;
break;
}
std::stringstream ss(argv[i]);
gptneox_token key;
char sign;
std::string value_str;
try {
if (ss >> key && ss >> sign && std::getline(ss, value_str) && (sign == '+' || sign == '-')) {
params.logit_bias[key] = std::stof(value_str) * ((sign == '-') ? -1.0f : 1.0f);
} else {
throw std::exception();
}
} catch (const std::exception &e) {
invalid_param = true;
break;
}
} else if (arg == "--n_parts") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_parts = std::stoi(argv[i]);
} else if (arg == "-h" || arg == "--help") {
gpt_print_usage(argc, argv, default_params);
exit(0);
} else if (arg == "--random-prompt") {
params.random_prompt = true;
} else if (arg == "--in-prefix") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.input_prefix = argv[i];
} else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
gpt_print_usage(argc, argv, default_params);
exit(1);
}
}
if (invalid_param) {
fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
gpt_print_usage(argc, argv, default_params);
exit(1);
}
return true;
}
void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stderr, "usage: %s [options]\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "options:\n");
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -i, --interactive run in interactive mode\n");
fprintf(stderr, " --interactive-first run in interactive mode and wait for input right away\n");
fprintf(stderr, " -ins, --instruct run in instruction mode\n");
fprintf(stderr, " -r PROMPT, --reverse-prompt PROMPT\n");
fprintf(stderr, " run in interactive mode and poll user input upon seeing PROMPT (can be\n");
fprintf(stderr, " specified more than once for multiple prompts).\n");
fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for <= 0)\n");
fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
fprintf(stderr, " -p PROMPT, --prompt PROMPT\n");
fprintf(stderr, " prompt to start generation with (default: empty)\n");
fprintf(stderr, " --session FNAME file to cache model state in (may be large!) (default: none)\n");
fprintf(stderr, " --random-prompt start with a randomized prompt.\n");
fprintf(stderr, " --in-prefix STRING string to prefix user inputs with (default: empty)\n");
fprintf(stderr, " -f FNAME, --file FNAME\n");
fprintf(stderr, " prompt file to start generation.\n");
fprintf(stderr, " -n N, --n_predict N number of tokens to predict (default: %d, -1 = infinity)\n", params.n_predict);
fprintf(stderr, " --top_k N top-k sampling (default: %d, 0 = disabled)\n", params.top_k);
fprintf(stderr, " --top_p N top-p sampling (default: %.1f, 1.0 = disabled)\n", (double)params.top_p);
fprintf(stderr, " --tfs N tail free sampling, parameter z (default: %.1f, 1.0 = disabled)\n", (double)params.tfs_z);
fprintf(stderr, " --typical N locally typical sampling, parameter p (default: %.1f, 1.0 = disabled)\n", (double)params.typical_p);
fprintf(stderr, " --repeat_last_n N last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)\n", params.repeat_last_n);
fprintf(stderr, " --repeat_penalty N penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled)\n", (double)params.repeat_penalty);
fprintf(stderr, " --presence_penalty N repeat alpha presence penalty (default: %.1f, 0.0 = disabled)\n", (double)params.presence_penalty);
fprintf(stderr, " --frequency_penalty N repeat alpha frequency penalty (default: %.1f, 0.0 = disabled)\n", (double)params.frequency_penalty);
fprintf(stderr, " --mirostat N use Mirostat sampling.\n");
fprintf(stderr, " Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.\n");
fprintf(stderr, " (default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)\n", params.mirostat);
fprintf(stderr, " --mirostat_lr N Mirostat learning rate, parameter eta (default: %.1f)\n", (double)params.mirostat_eta);
fprintf(stderr, " --mirostat_ent N Mirostat target entropy, parameter tau (default: %.1f)\n", (double)params.mirostat_tau);
fprintf(stderr, " -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS\n");
fprintf(stderr, " modifies the likelihood of token appearing in the completion,\n");
fprintf(stderr, " i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',\n");
fprintf(stderr, " or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'\n");
fprintf(stderr, " -c N, --ctx_size N size of the prompt context (default: %d)\n", params.n_ctx);
fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating (implies --logit-bias 2-inf)\n");
fprintf(stderr, " --no-penalize-nl do not penalize newline token\n");
fprintf(stderr, " --memory_f32 use f32 instead of f16 for memory key+value\n");
fprintf(stderr, " --temp N temperature (default: %.1f)\n", (double)params.temp);
fprintf(stderr, " --n_parts N number of model parts (default: -1 = determine from dimensions)\n");
fprintf(stderr, " -b N, --batch_size N batch size for prompt processing (default: %d)\n", params.n_batch);
fprintf(stderr, " --perplexity compute perplexity over the prompt\n");
fprintf(stderr, " --keep number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);
if (gptneox_mlock_supported()) {
fprintf(stderr, " --mlock force system to keep model in RAM rather than swapping or compressing\n");
}
if (gptneox_mmap_supported()) {
fprintf(stderr, " --no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
}
fprintf(stderr, " --mtest compute maximum memory usage\n");
fprintf(stderr, " --verbose-prompt print prompt before generation\n");
fprintf(stderr, " --lora FNAME apply LoRA adapter (implies --no-mmap)\n");
fprintf(stderr, " --lora-base FNAME optional model to use as a base for the layers modified by the LoRA adapter\n");
fprintf(stderr, " -m FNAME, --model FNAME\n");
fprintf(stderr, " model path (default: %s)\n", params.model.c_str());
fprintf(stderr, "\n");
}
std::string gpt_random_prompt(std::mt19937 & rng) {
const int r = rng() % 10;
switch (r) {
case 0: return "So";
case 1: return "Once upon a time";
case 2: return "When";
case 3: return "The";
case 4: return "After";
case 5: return "If";
case 6: return "import";
case 7: return "He";
case 8: return "She";
case 9: return "They";
default: return "To";
}
return "The";
}
// TODO: not great allocating this every time
std::vector<gptneox_token> gptneox_tokenize(struct gptneox_context * ctx, const std::string & text, bool add_bos) {
// initialize to prompt numer of chars, since n_tokens <= n_prompt_chars
std::vector<gptneox_token> res(text.size() + (int)add_bos);
int n = gptneox_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
assert(n >= 0);
res.resize(n);
return res;
}
/* Keep track of current color of output, and emit ANSI code if it changes. */
void set_console_color(console_state & con_st, console_color_t color) {
if (con_st.use_color && con_st.color != color) {
switch(color) {
case CONSOLE_COLOR_DEFAULT:
printf(ANSI_COLOR_RESET);
break;
case CONSOLE_COLOR_PROMPT:
printf(ANSI_COLOR_YELLOW);
break;
case CONSOLE_COLOR_USER_INPUT:
printf(ANSI_BOLD ANSI_COLOR_GREEN);
break;
}
con_st.color = color;
}
}
#if defined (_WIN32)
void win32_console_init(bool enable_color) {
unsigned long dwMode = 0;
void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
if (!hConOut || hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode)) {
hConOut = GetStdHandle((unsigned long)-12); // STD_ERROR_HANDLE (-12)
if (hConOut && (hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode))) {
hConOut = 0;
}
}
if (hConOut) {
// Enable ANSI colors on Windows 10+
if (enable_color && !(dwMode & 0x4)) {
SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
}
// Set console output codepage to UTF8
SetConsoleOutputCP(CP_UTF8);
}
void* hConIn = GetStdHandle((unsigned long)-10); // STD_INPUT_HANDLE (-10)
if (hConIn && hConIn != (void*)-1 && GetConsoleMode(hConIn, &dwMode)) {
// Set console input codepage to UTF16
_setmode(_fileno(stdin), _O_WTEXT);
}
}
// Convert a wide Unicode string to an UTF8 string
void win32_utf8_encode(const std::wstring & wstr, std::string & str) {
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
str = strTo;
}
#endif

108
third_party/radpajama/common-gptneox.h vendored Normal file
View file

@ -0,0 +1,108 @@
// Various helper functions and utilities
#pragma once
#include "gptneox.h"
#include <string>
#include <vector>
#include <random>
#include <thread>
#include <unordered_map>
//
// CLI argument parsing
//
struct gpt_params {
int32_t seed = -1; // RNG seed
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_predict = 128; // new tokens to predict
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
int32_t n_ctx = 512; // context size
int32_t n_batch = 512; // batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_keep = 0; // number of tokens to keep from initial prompt
// sampling parameters
std::unordered_map<gptneox_token, float> logit_bias; // logit bias for specific tokens
int32_t top_k = 40; // <= 0 to use vocab size
float top_p = 0.95f; // 1.0 = disabled
float tfs_z = 1.00f; // 1.0 = disabled
float typical_p = 1.00f; // 1.0 = disabled
float temp = 0.80f; // 1.0 = disabled
float repeat_penalty = 1.10f; // 1.0 = disabled
int32_t repeat_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
float frequency_penalty = 0.00f; // 0.0 = disabled
float presence_penalty = 0.00f; // 0.0 = disabled
int mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
float mirostat_tau = 5.00f; // target entropy
float mirostat_eta = 0.10f; // learning rate
std::string model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat/Instruct-3B-v1-f16.bin"; // model path
std::string prompt = "";
std::string path_session = ""; // path to file for saving/loading model eval state
std::string input_prefix = ""; // string to prefix user inputs with
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
std::string lora_adapter = ""; // lora adapter path
std::string lora_base = ""; // base model path for the lora adapter
bool memory_f16 = true; // use f16 instead of f32 for memory kv
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool embedding = false; // get only sentence embedding
bool interactive_first = false; // wait for user input immediately
bool instruct = false; // instruction mode
bool penalize_nl = true; // consider newlines as a repeatable token
bool perplexity = false; // compute perplexity over the prompt
bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
bool mem_test = false; // compute maximum memory usage
bool verbose_prompt = false; // print prompt tokens before generation
};
bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
std::string gpt_random_prompt(std::mt19937 & rng);
//
// Vocab utils
//
std::vector<gptneox_token> gptneox_tokenize(struct gptneox_context * ctx, const std::string & text, bool add_bos);
//
// Console utils
//
#define ANSI_COLOR_RED "\x1b[31m"
#define ANSI_COLOR_GREEN "\x1b[32m"
#define ANSI_COLOR_YELLOW "\x1b[33m"
#define ANSI_COLOR_BLUE "\x1b[34m"
#define ANSI_COLOR_MAGENTA "\x1b[35m"
#define ANSI_COLOR_CYAN "\x1b[36m"
#define ANSI_COLOR_RESET "\x1b[0m"
#define ANSI_BOLD "\x1b[1m"
enum console_color_t {
CONSOLE_COLOR_DEFAULT=0,
CONSOLE_COLOR_PROMPT,
CONSOLE_COLOR_USER_INPUT
};
struct console_state {
bool use_color = false;
console_color_t color = CONSOLE_COLOR_DEFAULT;
};
void set_console_color(console_state & con_st, console_color_t color);
#if defined (_WIN32)
void win32_console_init(bool enable_color);
void win32_utf8_encode(const std::wstring & wstr, std::string & str);
#endif

57
third_party/radpajama/copy-gptneox.cpp vendored Normal file
View file

@ -0,0 +1,57 @@
#include "ggml.h"
#include "gptneox.h"
#include <cstdio>
#include <map>
#include <string>
static const std::map<std::string, enum gptneox_ftype> GPTNEOX_FTYPE_MAP = {
{"q4_0", GPTNEOX_FTYPE_MOSTLY_Q4_0},
{"q4_1", GPTNEOX_FTYPE_MOSTLY_Q4_1},
{"q4_2", GPTNEOX_FTYPE_MOSTLY_Q4_2},
//{"q4_3", GPTNEOX_FTYPE_MOSTLY_Q4_3},
{"q5_0", GPTNEOX_FTYPE_MOSTLY_Q5_0},
{"q5_1", GPTNEOX_FTYPE_MOSTLY_Q5_1},
{"q8_0", GPTNEOX_FTYPE_MOSTLY_Q8_0},
};
// usage:
// ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
//
int main(int argc, char ** argv) {
ggml_time_init();
if (argc < 4) {
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin ftype\n", argv[0]);
for (auto it = GPTNEOX_FTYPE_MAP.begin(); it != GPTNEOX_FTYPE_MAP.end(); it++) {
fprintf(stderr, " type = \"%s\" or %d\n", it->first.c_str(), it->second);
}
return 1;
}
// needed to initialize f16 tables
{
struct ggml_init_params params = { 0, NULL, false };
struct ggml_context * ctx = ggml_init(params);
ggml_free(ctx);
}
const std::string fname_inp = argv[1];
const std::string fname_out = argv[2];
enum gptneox_ftype ftype;
if (argv[3][0] == 'q') {
auto it = GPTNEOX_FTYPE_MAP.find(argv[3]);
if (it == GPTNEOX_FTYPE_MAP.end()) {
fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
return 1;
}
ftype = it->second;
} else {
ftype = (enum gptneox_ftype)atoi(argv[3]);
}
gptneox_model_copy(fname_inp.c_str(), fname_out.c_str(), ftype);
return 0;
}

433
third_party/radpajama/gptneox-util.h vendored Normal file
View file

@ -0,0 +1,433 @@
// Internal header to be included only by llama.cpp.
// Contains wrappers around OS interfaces.
#ifndef GPTNEOX_UTIL_H
#define GPTNEOX_UTIL_H
#include <cstdio>
#include <cstdint>
#include <cerrno>
#include <cstring>
#include <cstdarg>
#include <cstdlib>
#include <climits>
#include <string>
#include <vector>
#ifdef __has_include
#if __has_include(<unistd.h>)
#include <unistd.h>
#if defined(_POSIX_MAPPED_FILES)
#include <sys/mman.h>
#endif
#if defined(_POSIX_MEMLOCK_RANGE)
#include <sys/resource.h>
#endif
#endif
#endif
#if defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
#ifndef NOMINMAX
#define NOMINMAX
#endif
#include <windows.h>
#include <io.h>
#include <stdio.h> // for _fseeki64
#endif
#define GPTNEOX_ASSERT(x) \
do { \
if (!(x)) { \
fprintf(stderr, "GPTNEOX_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
abort(); \
} \
} while (0)
#ifdef __GNUC__
#ifdef __MINGW32__
__attribute__((format(gnu_printf, 1, 2)))
#else
__attribute__((format(printf, 1, 2)))
#endif
#endif
static std::string format(const char * fmt, ...) {
va_list ap, ap2;
va_start(ap, fmt);
va_copy(ap2, ap);
int size = vsnprintf(NULL, 0, fmt, ap);
GPTNEOX_ASSERT(size >= 0 && size < INT_MAX);
std::vector<char> buf(size + 1);
int size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
GPTNEOX_ASSERT(size2 == size);
va_end(ap2);
va_end(ap);
return std::string(buf.data(), size);
}
struct gptneox_file {
// use FILE * so we don't have to re-open the file to mmap
FILE * fp;
size_t size;
gptneox_file(const char * fname, const char * mode) {
fp = std::fopen(fname, mode);
if (fp == NULL) {
throw format("failed to open %s: %s", fname, std::strerror(errno));
}
seek(0, SEEK_END);
size = tell();
seek(0, SEEK_SET);
}
size_t tell() const {
#ifdef _WIN32
__int64 ret = _ftelli64(fp);
#else
long ret = std::ftell(fp);
#endif
GPTNEOX_ASSERT(ret != -1); // this really shouldn't fail
return (size_t) ret;
}
void seek(size_t offset, int whence) {
#ifdef _WIN32
int ret = _fseeki64(fp, (__int64) offset, whence);
#else
int ret = std::fseek(fp, (long) offset, whence);
#endif
GPTNEOX_ASSERT(ret == 0); // same
}
void read_raw(void * ptr, size_t size) {
if (size == 0) {
return;
}
errno = 0;
std::size_t ret = std::fread(ptr, size, 1, fp);
if (ferror(fp)) {
throw format("read error: %s", strerror(errno));
}
if (ret != 1) {
throw std::string("unexpectedly reached end of file");
}
}
std::uint32_t read_u32() {
std::uint32_t ret;
read_raw(&ret, sizeof(ret));
return ret;
}
std::string read_string(std::uint32_t len) {
std::vector<char> chars(len);
read_raw(chars.data(), len);
return std::string(chars.data(), len);
}
void write_raw(const void * ptr, size_t size) {
if (size == 0) {
return;
}
errno = 0;
size_t ret = std::fwrite(ptr, size, 1, fp);
if (ret != 1) {
throw format("write error: %s", strerror(errno));
}
}
void write_u32(std::uint32_t val) {
write_raw(&val, sizeof(val));
}
~gptneox_file() {
if (fp) {
std::fclose(fp);
}
}
};
#if defined(_WIN32)
static std::string gptneox_format_win_err(DWORD err) {
LPSTR buf;
size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
NULL, err, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&buf, 0, NULL);
if (!size) {
return "FormatMessageA failed";
}
std::string ret(buf, size);
LocalFree(buf);
return ret;
}
#endif
struct gptneox_mmap {
void * addr;
size_t size;
gptneox_mmap(const gptneox_mmap &) = delete;
#ifdef _POSIX_MAPPED_FILES
static constexpr bool SUPPORTED = true;
gptneox_mmap(struct gptneox_file * file, bool prefetch = true) {
size = file->size;
int fd = fileno(file->fp);
int flags = MAP_SHARED;
#ifdef __linux__
flags |= MAP_POPULATE;
#endif
addr = mmap(NULL, file->size, PROT_READ, flags, fd, 0);
if (addr == MAP_FAILED) {
throw format("mmap failed: %s", strerror(errno));
}
if (prefetch) {
// Advise the kernel to preload the mapped memory
if (madvise(addr, file->size, MADV_WILLNEED)) {
fprintf(stderr, "warning: madvise(.., MADV_WILLNEED) failed: %s\n",
strerror(errno));
}
}
}
~gptneox_mmap() {
munmap(addr, size);
}
#elif defined(_WIN32)
static constexpr bool SUPPORTED = true;
gptneox_mmap(struct gptneox_file * file, bool prefetch = true) {
size = file->size;
HANDLE hFile = (HANDLE) _get_osfhandle(_fileno(file->fp));
HANDLE hMapping = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
DWORD error = GetLastError();
if (hMapping == NULL) {
throw format("CreateFileMappingA failed: %s", gptneox_format_win_err(error).c_str());
}
addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
error = GetLastError();
CloseHandle(hMapping);
if (addr == NULL) {
throw format("MapViewOfFile failed: %s", gptneox_format_win_err(error).c_str());
}
#if _WIN32_WINNT >= _WIN32_WINNT_WIN8
if (prefetch) {
// Advise the kernel to preload the mapped memory
WIN32_MEMORY_RANGE_ENTRY range;
range.VirtualAddress = addr;
range.NumberOfBytes = (SIZE_T)size;
if (!PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0)) {
fprintf(stderr, "warning: PrefetchVirtualMemory failed: %s\n",
gptneox_format_win_err(GetLastError()).c_str());
}
}
#else
#pragma message("warning: You are building for pre-Windows 8; prefetch not supported")
#endif // _WIN32_WINNT >= _WIN32_WINNT_WIN8
}
~gptneox_mmap() {
if (!UnmapViewOfFile(addr)) {
fprintf(stderr, "warning: UnmapViewOfFile failed: %s\n",
gptneox_format_win_err(GetLastError()).c_str());
}
}
#else
static constexpr bool SUPPORTED = false;
gptneox_mmap(struct gptneox_file *) {
throw std::string("mmap not supported");
}
#endif
};
// Represents some region of memory being locked using mlock or VirtualLock;
// will automatically unlock on destruction.
struct gptneox_mlock {
void * addr = NULL;
size_t size = 0;
bool failed_already = false;
gptneox_mlock() {}
gptneox_mlock(const gptneox_mlock &) = delete;
~gptneox_mlock() {
if (size) {
raw_unlock(addr, size);
}
}
void init(void * addr) {
GPTNEOX_ASSERT(this->addr == NULL && this->size == 0);
this->addr = addr;
}
void grow_to(size_t target_size) {
GPTNEOX_ASSERT(addr);
if (failed_already) {
return;
}
size_t granularity = lock_granularity();
target_size = (target_size + granularity - 1) & ~(granularity - 1);
if (target_size > size) {
if (raw_lock((uint8_t *) addr + size, target_size - size)) {
size = target_size;
} else {
failed_already = true;
}
}
}
#ifdef _POSIX_MEMLOCK_RANGE
static constexpr bool SUPPORTED = true;
size_t lock_granularity() {
return (size_t) sysconf(_SC_PAGESIZE);
}
#ifdef __APPLE__
#define MLOCK_SUGGESTION \
"Try increasing the sysctl values 'vm.user_wire_limit' and 'vm.global_user_wire_limit' and/or " \
"decreasing 'vm.global_no_user_wire_amount'. Also try increasing RLIMIT_MLOCK (ulimit -l).\n"
#else
#define MLOCK_SUGGESTION \
"Try increasing RLIMIT_MLOCK ('ulimit -l' as root).\n"
#endif
bool raw_lock(const void * addr, size_t size) {
if (!mlock(addr, size)) {
return true;
} else {
char* errmsg = std::strerror(errno);
bool suggest = (errno == ENOMEM);
// Check if the resource limit is fine after all
struct rlimit lock_limit;
if (suggest && getrlimit(RLIMIT_MEMLOCK, &lock_limit))
suggest = false;
if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size))
suggest = false;
fprintf(stderr, "warning: failed to mlock %zu-byte buffer (after previously locking %zu bytes): %s\n%s",
size, this->size, errmsg, suggest ? MLOCK_SUGGESTION : "");
return false;
}
}
#undef MLOCK_SUGGESTION
void raw_unlock(void * addr, size_t size) {
if (munlock(addr, size)) {
fprintf(stderr, "warning: failed to munlock buffer: %s\n", std::strerror(errno));
}
}
#elif defined(_WIN32)
static constexpr bool SUPPORTED = true;
size_t lock_granularity() {
SYSTEM_INFO si;
GetSystemInfo(&si);
return (size_t) si.dwPageSize;
}
bool raw_lock(void * addr, size_t size) {
for (int tries = 1; ; tries++) {
if (VirtualLock(addr, size)) {
return true;
}
if (tries == 2) {
fprintf(stderr, "warning: failed to VirtualLock %zu-byte buffer (after previously locking %zu bytes): %s\n",
size, this->size, gptneox_format_win_err(GetLastError()).c_str());
return false;
}
// It failed but this was only the first try; increase the working
// set size and try again.
SIZE_T min_ws_size, max_ws_size;
if (!GetProcessWorkingSetSize(GetCurrentProcess(), &min_ws_size, &max_ws_size)) {
fprintf(stderr, "warning: GetProcessWorkingSetSize failed: %s\n",
gptneox_format_win_err(GetLastError()).c_str());
return false;
}
// Per MSDN: "The maximum number of pages that a process can lock
// is equal to the number of pages in its minimum working set minus
// a small overhead."
// Hopefully a megabyte is enough overhead:
size_t increment = size + 1048576;
// The minimum must be <= the maximum, so we need to increase both:
min_ws_size += increment;
max_ws_size += increment;
if (!SetProcessWorkingSetSize(GetCurrentProcess(), min_ws_size, max_ws_size)) {
fprintf(stderr, "warning: SetProcessWorkingSetSize failed: %s\n",
gptneox_format_win_err(GetLastError()).c_str());
return false;
}
}
}
void raw_unlock(void * addr, size_t size) {
if (!VirtualUnlock(addr, size)) {
fprintf(stderr, "warning: failed to VirtualUnlock buffer: %s\n",
gptneox_format_win_err(GetLastError()).c_str());
}
}
#else
static constexpr bool SUPPORTED = false;
void raw_lock(const void * addr, size_t size) {
fprintf(stderr, "warning: mlock not supported on this system\n");
}
void raw_unlock(const void * addr, size_t size) {}
#endif
};
// Replacement for std::vector<uint8_t> that doesn't require zero-initialization.
struct gptneox_buffer {
uint8_t * addr = NULL;
size_t size = 0;
void resize(size_t size) {
delete[] addr;
addr = new uint8_t[size];
this->size = size;
}
~gptneox_buffer() {
delete[] addr;
}
};
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
struct gptneox_ctx_buffer {
uint8_t * addr = NULL;
size_t size = 0;
void resize(size_t size) {
if (addr) {
ggml_cuda_host_free(addr);
}
addr = (uint8_t *) ggml_cuda_host_malloc(size);
this->size = size;
}
~gptneox_ctx_buffer() {
if (addr) {
ggml_cuda_host_free(addr);
}
}
};
#else
typedef gptneox_buffer gptneox_ctx_buffer;
#endif
#endif

2942
third_party/radpajama/gptneox.cpp vendored Normal file

File diff suppressed because it is too large Load diff

275
third_party/radpajama/gptneox.h vendored Normal file
View file

@ -0,0 +1,275 @@
#ifndef GPTNEOX_H
#define GPTNEOX_H
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#ifdef GPTNEOX_SHARED
# if defined(_WIN32) && !defined(__MINGW32__)
# ifdef GPTNEOX_BUILD
# define GPTNEOX_API __declspec(dllexport)
# else
# define GPTNEOX_API __declspec(dllimport)
# endif
# else
# define GPTNEOX_API __attribute__ ((visibility ("default")))
# endif
#else
# define GPTNEOX_API
#endif
#define GPTNEOX_FILE_VERSION 1
#define GPTNEOX_FILE_MAGIC 0x67676a74 // 'ggjt' in hex
#define GPTNEOX_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
#ifdef __cplusplus
extern "C" {
#endif
//
// C interface
//
// TODO: show sample usage
//
struct gptneox_context;
typedef int gptneox_token;
typedef struct gptneox_token_data {
gptneox_token id; // token id
float logit; // log-odds of the token
float p; // probability of the token
} gptneox_token_data;
typedef struct gptneox_token_data_array {
gptneox_token_data * data;
size_t size;
bool sorted;
} gptneox_token_data_array;
typedef void (*gptneox_progress_callback)(float progress, void *ctx);
struct gptneox_context_params {
int n_ctx; // text context
int n_parts; // -1 for default
int seed; // RNG seed, 0 for random
bool f16_kv; // use fp16 for KV cache
bool logits_all; // the gptneox_eval() call computes all logits, not just the last one
bool vocab_only; // only load the vocabulary, no weights
bool use_mmap; // use mmap if possible
bool use_mlock; // force system to keep model in RAM
bool embedding; // embedding mode only
// called with a progress value between 0 and 1, pass NULL to disable
gptneox_progress_callback progress_callback;
// context pointer passed to the progress callback
void * progress_callback_user_data;
};
// model file types
enum gptneox_ftype {
GPTNEOX_FTYPE_ALL_F32 = 0,
GPTNEOX_FTYPE_MOSTLY_F16 = 1, // except 1d tensors
GPTNEOX_FTYPE_MOSTLY_Q4_0 = 2, // except 1d tensors
GPTNEOX_FTYPE_MOSTLY_Q4_1 = 3, // except 1d tensors
GPTNEOX_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
GPTNEOX_FTYPE_MOSTLY_Q4_2 = 5, // except 1d tensors
// GPTNEOX_FTYPE_MOSTLY_Q4_3 (6) support has been removed
GPTNEOX_FTYPE_MOSTLY_Q8_0 = 7, // except 1d tensors
GPTNEOX_FTYPE_MOSTLY_Q5_0 = 8, // except 1d tensors
GPTNEOX_FTYPE_MOSTLY_Q5_1 = 9, // except 1d tensors
};
GPTNEOX_API struct gptneox_context_params gptneox_context_default_params();
GPTNEOX_API bool gptneox_mmap_supported();
GPTNEOX_API bool gptneox_mlock_supported();
// Various functions for loading a ggml llama model.
// Allocate (almost) all memory needed for the model.
// Return NULL on failure
GPTNEOX_API struct gptneox_context * gptneox_init_from_file(
const char * path_model,
struct gptneox_context_params params);
// Frees all allocated memory
GPTNEOX_API void gptneox_free(struct gptneox_context * ctx);
// TODO: not great API - very likely to change
// Returns 0 on success
// nthread - how many threads to use. If <=0, will use std::thread::hardware_concurrency(), else the number given
GPTNEOX_API int gptneox_model_quantize(
const char * fname_inp,
const char * fname_out,
enum gptneox_ftype ftype,
int nthread);
GPTNEOX_API int gptneox_model_copy(
const char * fname_inp,
const char * fname_out,
enum gptneox_ftype ftype);
// Apply a LoRA adapter to a loaded model
// path_base_model is the path to a higher quality model to use as a base for
// the layers modified by the adapter. Can be NULL to use the current loaded model.
// The model needs to be reloaded before applying a new adapter, otherwise the adapter
// will be applied on top of the previous one
// Returns 0 on success
GPTNEOX_API int gptneox_apply_lora_from_file(
struct gptneox_context * ctx,
const char * path_lora,
const char * path_base_model,
int n_threads);
// Returns the number of tokens in the KV cache
GPTNEOX_API int gptneox_get_kv_cache_token_count(struct gptneox_context * ctx);
// Sets the current rng seed.
GPTNEOX_API void gptneox_set_rng_seed(struct gptneox_context * ctx, int seed);
// Returns the size in bytes of the state (rng, logits, embedding and kv_cache)
GPTNEOX_API size_t gptneox_get_state_size(struct gptneox_context * ctx);
// Copies the state to the specified destination address.
// Destination needs to have allocated enough memory.
// Returns the number of bytes copied
GPTNEOX_API size_t gptneox_copy_state_data(struct gptneox_context * ctx, uint8_t * dest);
// Set the state reading from the specified address
// Returns the number of bytes read
GPTNEOX_API size_t gptneox_set_state_data(struct gptneox_context * ctx, const uint8_t * src);
// Save/load session file
GPTNEOX_API size_t gptneox_load_session_file(struct gptneox_context * ctx, const char * path_session, gptneox_token * tokens_out, size_t n_token_capacity, size_t * n_token_count_out);
GPTNEOX_API size_t gptneox_save_session_file(struct gptneox_context * ctx, const char * path_session, const gptneox_token * tokens, size_t n_token_count);
// Run the llama inference to obtain the logits and probabilities for the next token.
// tokens + n_tokens is the provided batch of new tokens to process
// n_past is the number of tokens to use from previous eval calls
// Returns 0 on success
GPTNEOX_API int gptneox_eval(
struct gptneox_context * ctx,
const gptneox_token * tokens,
int n_tokens,
int n_past,
int n_threads);
// Convert the provided text into tokens.
// The tokens pointer must be large enough to hold the resulting tokens.
// Returns the number of tokens on success, no more than n_max_tokens
// Returns a negative number on failure - the number of tokens that would have been returned
// TODO: not sure if correct
GPTNEOX_API int gptneox_tokenize(
struct gptneox_context * ctx,
const char * text,
gptneox_token * tokens,
int n_max_tokens,
bool add_bos);
GPTNEOX_API int gptneox_n_vocab(struct gptneox_context * ctx);
GPTNEOX_API int gptneox_n_ctx (struct gptneox_context * ctx);
GPTNEOX_API int gptneox_n_embd (struct gptneox_context * ctx);
// Token logits obtained from the last call to gptneox_eval()
// The logits for the last token are stored in the last row
// Can be mutated in order to change the probabilities of the next token
// Rows: n_tokens
// Cols: n_vocab
GPTNEOX_API float * gptneox_get_logits(struct gptneox_context * ctx);
// Get the embeddings for the input
// shape: [n_embd] (1-dimensional)
GPTNEOX_API float * gptneox_get_embeddings(struct gptneox_context * ctx);
// Token Id -> String. Uses the vocabulary in the provided context
GPTNEOX_API const char * gptneox_token_to_str(struct gptneox_context * ctx, gptneox_token token);
// String -> Token Id. Uses the vocabulary in the provided context
GPTNEOX_API gptneox_token gptneox_str_to_token(struct gptneox_context * ctx, const char * str);
// Special tokens
GPTNEOX_API gptneox_token gptneox_token_bos();
GPTNEOX_API gptneox_token gptneox_token_eos();
// GPTNEOX_API gptneox_token gptneox_token_nl();
// TODO: improve the last_n_tokens interface ?
GPTNEOX_API gptneox_token gptneox_sample_top_p_top_k(
struct gptneox_context * ctx,
const gptneox_token * last_n_tokens_data,
int last_n_tokens_size,
int top_k,
float top_p,
float temp,
float repeat_penalty);
// Sampling functions
/// @details Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
GPTNEOX_API void gptneox_sample_repetition_penalty(struct gptneox_context * ctx, gptneox_token_data_array * candidates, gptneox_token * last_tokens, size_t last_tokens_size, float penalty);
/// @details Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
GPTNEOX_API void gptneox_sample_frequency_and_presence_penalties(struct gptneox_context * ctx, gptneox_token_data_array * candidates, gptneox_token * last_tokens, size_t last_tokens_size, float alpha_frequency, float alpha_presence);
/// @details Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits.
GPTNEOX_API void gptneox_sample_softmax(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
/// @details Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
GPTNEOX_API void gptneox_sample_top_k(struct gptneox_context * ctx, gptneox_token_data_array * candidates, int k, size_t min_keep);
/// @details Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
GPTNEOX_API void gptneox_sample_top_p(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float p, size_t min_keep);
/// @details Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
GPTNEOX_API void gptneox_sample_tail_free(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float z, size_t min_keep);
/// @details Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
GPTNEOX_API void gptneox_sample_typical(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float p, size_t min_keep);
GPTNEOX_API void gptneox_sample_temperature(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float temp);
/// @details Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
/// @param candidates A vector of `gptneox_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
/// @param tau The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
/// @param eta The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
/// @param m The number of tokens considered in the estimation of `s_hat`. This is an arbitrary value that is used to calculate `s_hat`, which in turn helps to calculate the value of `k`. In the paper, they use `m = 100`, but you can experiment with different values to see how it affects the performance of the algorithm.
/// @param mu Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (`2 * tau`) and is updated in the algorithm based on the error between the target and observed surprisal.
GPTNEOX_API gptneox_token gptneox_sample_token_mirostat(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float tau, float eta, int m, float * mu);
/// @details Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
/// @param candidates A vector of `gptneox_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
/// @param tau The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
/// @param eta The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
/// @param mu Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (`2 * tau`) and is updated in the algorithm based on the error between the target and observed surprisal.
GPTNEOX_API gptneox_token gptneox_sample_token_mirostat_v2(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float tau, float eta, float * mu);
/// @details Selects the token with the highest probability.
GPTNEOX_API gptneox_token gptneox_sample_token_greedy(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
/// @details Randomly selects a token from the candidates based on their probabilities.
GPTNEOX_API gptneox_token gptneox_sample_token(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
// Performance information
GPTNEOX_API void gptneox_print_timings(struct gptneox_context * ctx);
GPTNEOX_API void gptneox_reset_timings(struct gptneox_context * ctx);
// Print system information
GPTNEOX_API const char * gptneox_print_system_info(void);
#ifdef __cplusplus
}
#endif
// Internal API to be implemented by llama.cpp and used by tests/benchmarks only
#ifdef GPTNEOX_API_INTERNAL
#include <vector>
#include <string>
struct ggml_tensor;
std::vector<std::pair<std::string, struct ggml_tensor *>>& gptneox_internal_get_tensor_map(struct gptneox_context * ctx);
#endif
#endif // GPTNEOX_H

View file

@ -0,0 +1,467 @@
// Defines sigaction on msys:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include "common-gptneox.h"
#include "gptneox.h"
#include <cassert>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
#include <signal.h>
#include <unistd.h>
#elif defined (_WIN32)
#include <signal.h>
#endif
static console_state con_st;
static gptneox_context ** g_ctx;
static bool is_interacting = false;
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
void sigint_handler(int signo) {
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
printf("\n"); // this also force flush stdout.
if (signo == SIGINT) {
if (!is_interacting) {
is_interacting=true;
} else {
gptneox_print_timings(*g_ctx);
_exit(130);
}
}
}
#endif
int main(int argc, char ** argv) {
gpt_params params;
params.model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin";
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
// save choice to use color for later
// (note for later: this is a slightly awkward choice)
con_st.use_color = params.use_color;
#if defined (_WIN32)
win32_console_init(params.use_color);
#endif
if (params.perplexity) {
printf("\n************\n");
printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
printf("************\n\n");
return 0;
}
if (params.embedding) {
printf("\n************\n");
printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
printf("************\n\n");
return 0;
}
if (params.n_ctx > 2048) {
fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
"expect poor results\n", __func__, params.n_ctx);
}
if (params.seed <= 0) {
params.seed = time(NULL);
}
fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = gpt_random_prompt(rng);
}
gptneox_context * ctx;
g_ctx = &ctx;
// load the model
{
auto lparams = gptneox_context_default_params();
lparams.n_ctx = params.n_ctx;
lparams.n_parts = params.n_parts;
lparams.seed = params.seed;
lparams.f16_kv = params.memory_f16;
lparams.use_mmap = params.use_mmap;
lparams.use_mlock = params.use_mlock;
ctx = gptneox_init_from_file(params.model.c_str(), lparams);
if (ctx == NULL) {
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
return 1;
}
}
if (!params.lora_adapter.empty()) {
int err = gptneox_apply_lora_from_file(ctx,
params.lora_adapter.c_str(),
params.lora_base.empty() ? NULL : params.lora_base.c_str(),
params.n_threads);
if (err != 0) {
fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
return 1;
}
}
// print system information
{
fprintf(stderr, "\n");
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
params.n_threads, std::thread::hardware_concurrency(), gptneox_print_system_info());
}
// determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters
if (params.mem_test) {
{
const std::vector<gptneox_token> tmp(params.n_batch, 0);
gptneox_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
}
{
const std::vector<gptneox_token> tmp = { 0, };
gptneox_eval(ctx, tmp.data(), tmp.size(), params.n_predict - 1, params.n_threads);
}
gptneox_print_timings(ctx);
gptneox_free(ctx);
return 0;
}
// Always interactive for RedPajama chat model
params.interactive = true;
if (params.interactive) {
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
struct sigaction sigint_action;
sigint_action.sa_handler = sigint_handler;
sigemptyset (&sigint_action.sa_mask);
sigint_action.sa_flags = 0;
sigaction(SIGINT, &sigint_action, NULL);
#elif defined (_WIN32)
signal(SIGINT, sigint_handler);
#endif
}
fprintf(stderr, "sampling: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n",
params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", params.n_ctx, params.n_batch, params.n_predict, params.n_keep);
fprintf(stderr, "\n\n");
// TODO: replace with ring-buffer
std::vector<gptneox_token> last_n_tokens = std::vector<gptneox_token>();
//std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
set_console_color(con_st, CONSOLE_COLOR_PROMPT);
if (params.interactive) {
printf("== Running in interactive mode. ==\n"
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
" - Press Ctrl+C to interject at any time.\n"
#endif
" - Press Return to return control to RedPajama.\n"
" - If you want to submit another line, end your input in '\\'.\n\n");
}
const int32_t top_k = params.top_k;
const float top_p = params.top_p;
const float temp = params.temp;
const float repeat_penalty = params.repeat_penalty;
// Chat loop
while (true) {
is_interacting = true;
int n_past = 0;
// Get input
// potentially set color to indicate we are taking user input
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
#if defined (_WIN32)
// Windows: must reactivate sigint handler after each signal
signal(SIGINT, sigint_handler);
#endif
if (params.instruct) {
printf("\n<human>: ");
}
std::string buffer;
if (!params.input_prefix.empty()) {
buffer += params.input_prefix;
printf("%s", buffer.c_str());
}
std::string line;
bool another_line = true;
do {
#if defined(_WIN32)
std::wstring wline;
if (!std::getline(std::wcin, wline)) {
// input stream is bad or EOF received
return 0;
}
win32_utf8_encode(wline, line);
#else
if (!std::getline(std::cin, line)) {
// input stream is bad or EOF received
return 0;
}
#endif
if (line.empty() || line.back() != '\\') {
another_line = false;
} else {
line.pop_back(); // Remove the continue character
}
buffer += line;
if (another_line) {
buffer += '\n';
}
} while (another_line);
is_interacting = false;
// done taking input, reset color
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
// Check for input
if (buffer.length() <= 0) {
continue; // Restart loop for input
}
// Tokenize prompt with RedPajama special tokens
auto prompt_embd = ::gptneox_tokenize(ctx, buffer, false);
auto embd_inp = std::vector<gptneox_token>();
// Redpajama: insert special tokens for OA. (prefix)
embd_inp.push_back(gptneox_str_to_token(ctx, "<"));
embd_inp.push_back(gptneox_str_to_token(ctx, "human"));
embd_inp.push_back(gptneox_str_to_token(ctx, ">:"));
embd_inp.insert(embd_inp.end(), prompt_embd.begin(), prompt_embd.end());
// Redpajama: insert special tokens for OA. (postfix)
embd_inp.push_back(gptneox_str_to_token(ctx, "\n"));
embd_inp.push_back(gptneox_str_to_token(ctx, "<"));
embd_inp.push_back(gptneox_str_to_token(ctx, "bot"));
embd_inp.push_back(gptneox_str_to_token(ctx, ">:"));
// Verbose prompt
if (params.verbose_prompt) {
fprintf(stderr, "\n");
fprintf(stderr, "%s: prompt: '%s'\n", __func__, buffer.c_str());
fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
for (int i = 0; i < (int) embd_inp.size(); i++) {
fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], gptneox_token_to_str(ctx, embd_inp[i]));
}
fprintf(stderr, "\n");
}
// How many tokens to generate - check if theres space in context for atleast one token (or batch size tokens?)
auto inp_size = embd_inp.size();
auto space = params.n_ctx - inp_size;
if(space <= 0) {
fprintf(stderr, "%s : input too long\n", __func__);
continue;
}
// Send batches to eval
while (n_past < inp_size) {
auto remaining = inp_size - n_past;
int n_eval = params.n_batch < remaining ? params.n_batch : remaining;
if (gptneox_eval(ctx, &embd_inp[n_past], n_eval, n_past, params.n_threads)) {
fprintf(stderr, "<bot>: %s : failed to eval\n", __func__);
return 1;
}
n_past += n_eval;
}
const int n_ctx = gptneox_n_ctx(ctx);
const int n_vocab = gptneox_n_vocab(ctx);
const float temp = params.temp;
const int32_t top_k = params.top_k <= 0 ? gptneox_n_vocab(ctx) : params.top_k;
const float top_p = params.top_p;
const float tfs_z = params.tfs_z;
const float typical_p = params.typical_p;
const int32_t repeat_last_n = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
const float repeat_penalty = params.repeat_penalty;
const float alpha_presence = params.presence_penalty;
const float alpha_frequency = params.frequency_penalty;
const int mirostat = params.mirostat;
const float mirostat_tau = params.mirostat_tau;
const float mirostat_eta = params.mirostat_eta;
const bool penalize_nl = params.penalize_nl;
// Eval until space runs out
auto out_count = 0;
printf("<bot>:");
while (space > 0) {
// Get token
gptneox_token id = 0;
{
auto logits = gptneox_get_logits(ctx);
// Apply params.logit_bias map
for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
logits[it->first] += it->second;
}
std::vector<gptneox_token_data> candidates;
candidates.reserve(n_vocab);
for (gptneox_token token_id = 0; token_id < n_vocab; token_id++) {
candidates.emplace_back(gptneox_token_data{token_id, logits[token_id], 0.0f});
}
gptneox_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
// Apply penalties
gptneox_token nl_token = gptneox_str_to_token(ctx, "\n");
float nl_logit = logits[nl_token];
auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
gptneox_sample_repetition_penalty(ctx, &candidates_p,
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
last_n_repeat, repeat_penalty);
gptneox_sample_frequency_and_presence_penalties(ctx, &candidates_p,
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
last_n_repeat, alpha_frequency, alpha_presence);
if (!penalize_nl) {
logits[nl_token] = nl_logit;
}
if (temp <= 0) {
// Greedy sampling
id = gptneox_sample_token_greedy(ctx, &candidates_p);
} else {
if (mirostat == 1) {
static float mirostat_mu = 2.0f * mirostat_tau;
const int mirostat_m = 100;
gptneox_sample_temperature(ctx, &candidates_p, temp);
id = gptneox_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
} else if (mirostat == 2) {
static float mirostat_mu = 2.0f * mirostat_tau;
gptneox_sample_temperature(ctx, &candidates_p, temp);
id = gptneox_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
} else {
// Temperature sampling
gptneox_sample_top_k(ctx, &candidates_p, top_k, 1);
gptneox_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
gptneox_sample_typical(ctx, &candidates_p, typical_p, 1);
gptneox_sample_top_p(ctx, &candidates_p, top_p, 1);
gptneox_sample_temperature(ctx, &candidates_p, temp);
id = gptneox_sample_token(ctx, &candidates_p);
}
}
}
// Inc out count and dec space
out_count += 1;
space -= 1;
// Repeat tokens update
last_n_tokens.push_back(id);
if (last_n_tokens.size() > params.repeat_last_n) {
last_n_tokens.erase(last_n_tokens.begin());
}
// Redpajama: check if the interactive is done.
//std::cout<<" last_n_tokens.size: "<< last_n_tokens[0] <<" "<< last_n_tokens[1] <<" "<< last_n_tokens[2] << std::endl;
if (last_n_tokens.size()==3 && last_n_tokens[0]==gptneox_str_to_token(ctx, "<")
&& last_n_tokens[1]==gptneox_str_to_token(ctx, "human") && last_n_tokens[2]==gptneox_str_to_token(ctx, ">:")){
space = 0;
continue;
}
// Check for eos - end early - check eos before bos in case they are the same
if (id == gptneox_token_eos()) {
space = 0;
continue;
}
// Check for bos - skip callback if so
if (id == gptneox_token_bos()) {
continue;
}
// Convert token to string and display
// printf("%s(%d)", gptneox_token_to_str(ctx, id), id);
if (last_n_tokens[2]==gptneox_str_to_token(ctx, "<")){
;
}
else if (last_n_tokens[2]==gptneox_str_to_token(ctx, "human")){
if (last_n_tokens[1]==gptneox_str_to_token(ctx, "<")){
;
}
else{
printf("%s", gptneox_token_to_str(ctx, id));
}
}
else if (last_n_tokens[1]==gptneox_str_to_token(ctx, "<")){
printf("<");
printf("%s", gptneox_token_to_str(ctx, id));
}
else{
printf("%s", gptneox_token_to_str(ctx, id));
}
fflush(stdout);
// Check if we need to run another eval
if (space > 0) {
// Send generated token back into model for next generation
if (gptneox_eval(ctx, &id, 1, n_past, params.n_threads)) {
fprintf(stderr, "%s : failed to eval\n", __func__);
return 1;
}
// Increment past count
n_past += 1;
}
// Check for user interrupt
if (is_interacting) { space = 0; }
}
printf("\n");
//printf("\n %d", space);
fflush(stdout);
}
#if defined (_WIN32)
signal(SIGINT, SIG_DFL);
#endif
gptneox_print_timings(ctx);
gptneox_free(ctx);
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
return 0;
}

622
third_party/radpajama/main-redpajama.cpp vendored Normal file
View file

@ -0,0 +1,622 @@
// Defines sigaction on msys:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include "common-gptneox.h"
#include "gptneox.h"
#include <cassert>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
#include <signal.h>
#include <unistd.h>
#elif defined (_WIN32)
#include <signal.h>
#endif
static console_state con_st;
static gptneox_context ** g_ctx;
static bool is_interacting = false;
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
void sigint_handler(int signo) {
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
printf("\n"); // this also force flush stdout.
if (signo == SIGINT) {
if (!is_interacting) {
is_interacting=true;
} else {
gptneox_print_timings(*g_ctx);
_exit(130);
}
}
}
#endif
int main(int argc, char ** argv) {
gpt_params params;
params.model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-f16.bin";
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
}
// save choice to use color for later
// (note for later: this is a slightly awkward choice)
con_st.use_color = params.use_color;
#if defined (_WIN32)
win32_console_init(params.use_color);
#endif
if (params.perplexity) {
printf("\n************\n");
printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
printf("************\n\n");
return 0;
}
if (params.embedding) {
printf("\n************\n");
printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
printf("************\n\n");
return 0;
}
if (params.n_ctx > 2048) {
fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
"expect poor results\n", __func__, params.n_ctx);
}
if (params.seed < 0) {
params.seed = time(NULL);
}
fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = gpt_random_prompt(rng);
}
// params.prompt = R"(// this function checks if the number n is prime
//bool is_prime(int n) {)";
gptneox_context * ctx;
g_ctx = &ctx;
// load the model
{
auto lparams = gptneox_context_default_params();
lparams.n_ctx = params.n_ctx;
lparams.n_parts = params.n_parts;
lparams.seed = params.seed;
lparams.f16_kv = params.memory_f16;
lparams.use_mmap = params.use_mmap;
lparams.use_mlock = params.use_mlock;
ctx = gptneox_init_from_file(params.model.c_str(), lparams);
if (ctx == NULL) {
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
return 1;
}
}
if (!params.lora_adapter.empty()) {
int err = gptneox_apply_lora_from_file(ctx,
params.lora_adapter.c_str(),
params.lora_base.empty() ? NULL : params.lora_base.c_str(),
params.n_threads);
if (err != 0) {
fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
return 1;
}
}
// print system information
{
fprintf(stderr, "\n");
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
params.n_threads, std::thread::hardware_concurrency(), gptneox_print_system_info());
}
// determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters
// uncomment the "used_mem" line in llama.cpp to see the results
if (params.mem_test) {
{
const std::vector<gptneox_token> tmp(params.n_batch, 0);
gptneox_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
}
{
const std::vector<gptneox_token> tmp = { 0, };
gptneox_eval(ctx, tmp.data(), tmp.size(), params.n_predict - 1, params.n_threads);
}
gptneox_print_timings(ctx);
gptneox_free(ctx);
return 0;
}
std::string path_session = params.path_session;
std::vector<gptneox_token> session_tokens;
if (!path_session.empty()) {
fprintf(stderr, "%s: attempting to load saved session from %s..\n", __func__, path_session.c_str());
// REVIEW - fopen to check for existing session
FILE * fp = std::fopen(path_session.c_str(), "rb");
if (fp != NULL) {
std::fclose(fp);
session_tokens.resize(params.n_ctx);
size_t n_token_count_out = 0;
const size_t n_session_bytes = gptneox_load_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.capacity(), &n_token_count_out);
session_tokens.resize(n_token_count_out);
if (n_session_bytes > 0) {
fprintf(stderr, "%s: loaded %zu bytes of session data!\n", __func__, n_session_bytes);
} else {
fprintf(stderr, "%s: could not load session file, will recreate\n", __func__);
}
} else {
fprintf(stderr, "%s: session file does not exist, will create\n", __func__);
}
}
// tokenize the prompt
auto embd_inp = ::gptneox_tokenize(ctx, params.prompt, false); //true);
const int n_ctx = gptneox_n_ctx(ctx);
if ((int) embd_inp.size() > n_ctx - 4) {
fprintf(stderr, "%s: error: prompt is too long (%d tokens, max %d)\n", __func__, (int) embd_inp.size(), n_ctx - 4);
return 1;
}
// debug message about similarity of saved session, if applicable
size_t n_matching_session_tokens = 0;
if (session_tokens.size()) {
for (gptneox_token id : session_tokens) {
if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
break;
}
n_matching_session_tokens++;
}
if (n_matching_session_tokens >= embd_inp.size()) {
fprintf(stderr, "%s: session file has exact match for prompt!\n", __func__);
} else if (n_matching_session_tokens < (embd_inp.size() / 2)) {
fprintf(stderr, "%s: warning: session file has low similarity to prompt (%zu / %zu tokens); will mostly be reevaluated\n",
__func__, n_matching_session_tokens, embd_inp.size());
} else {
fprintf(stderr, "%s: session file matches %zu / %zu tokens of prompt\n",
__func__, n_matching_session_tokens, embd_inp.size());
}
}
// number of tokens to keep when resetting context
if (params.n_keep < 0 || params.n_keep > (int)embd_inp.size() || params.instruct) {
params.n_keep = (int)embd_inp.size();
}
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_first = true;
params.antiprompt.push_back("<|prompter|>");
}
// enable interactive mode if reverse prompt or interactive start is specified
if (params.antiprompt.size() != 0 || params.interactive_first) {
params.interactive = true;
}
// determine newline token
auto gptneox_token_newline = ::gptneox_tokenize(ctx, "\n", false);
if (params.verbose_prompt) {
fprintf(stderr, "\n");
fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
for (int i = 0; i < (int) embd_inp.size(); i++) {
fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], gptneox_token_to_str(ctx, embd_inp[i]));
}
if (params.n_keep > 0) {
fprintf(stderr, "%s: static prompt based on n_keep: '", __func__);
for (int i = 0; i < params.n_keep; i++) {
fprintf(stderr, "%s", gptneox_token_to_str(ctx, embd_inp[i]));
}
fprintf(stderr, "'\n");
}
fprintf(stderr, "\n");
}
if (params.interactive) {
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
struct sigaction sigint_action;
sigint_action.sa_handler = sigint_handler;
sigemptyset (&sigint_action.sa_mask);
sigint_action.sa_flags = 0;
sigaction(SIGINT, &sigint_action, NULL);
#elif defined (_WIN32)
signal(SIGINT, sigint_handler);
#endif
fprintf(stderr, "%s: interactive mode on.\n", __func__);
if (params.antiprompt.size()) {
for (auto antiprompt : params.antiprompt) {
fprintf(stderr, "Reverse prompt: '%s'\n", antiprompt.c_str());
}
}
if (!params.input_prefix.empty()) {
fprintf(stderr, "Input prefix: '%s'\n", params.input_prefix.c_str());
}
}
fprintf(stderr, "sampling: repeat_last_n = %d, repeat_penalty = %f, presence_penalty = %f, frequency_penalty = %f, top_k = %d, tfs_z = %f, top_p = %f, typical_p = %f, temp = %f, mirostat = %d, mirostat_lr = %f, mirostat_ent = %f\n",
params.repeat_last_n, params.repeat_penalty, params.presence_penalty, params.frequency_penalty, params.top_k, params.tfs_z, params.top_p, params.typical_p, params.temp, params.mirostat, params.mirostat_eta, params.mirostat_tau);
fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", n_ctx, params.n_batch, params.n_predict, params.n_keep);
fprintf(stderr, "\n\n");
// TODO: replace with ring-buffer
std::vector<gptneox_token> last_n_tokens(n_ctx);
std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
if (params.interactive) {
fprintf(stderr, "== Running in interactive mode. ==\n"
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
" - Press Ctrl+C to interject at any time.\n"
#endif
" - Press Return to return control to RedPajama.\n"
" - If you want to submit another line, end your input in '\\'.\n\n");
is_interacting = params.interactive_first;
}
bool is_antiprompt = false;
bool input_noecho = false;
// HACK - because session saving incurs a non-negligible delay, for now skip re-saving session
// if we loaded a session with at least 75% similarity. It's currently just used to speed up the
// initial prompt so it doesn't need to be an exact match.
bool need_to_save_session = !path_session.empty() && n_matching_session_tokens < (embd_inp.size() * 3 / 4);
int n_past = 0;
int n_remain = params.n_predict;
int n_consumed = 0;
int n_session_consumed = 0;
// the first thing we will do is to output the prompt, so set color accordingly
set_console_color(con_st, CONSOLE_COLOR_PROMPT);
std::vector<gptneox_token> embd;
while (n_remain != 0 || params.interactive) {
// predict
if (embd.size() > 0) {
// infinite text generation via context swapping
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
if (n_past + (int) embd.size() > n_ctx) {
const int n_left = n_past - params.n_keep;
n_past = params.n_keep;
// insert n_left/2 tokens at the start of embd from last_n_tokens
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
// REVIEW - stop saving session if we run out of context
path_session = "";
//printf("\n---\n");
//printf("resetting: '");
//for (int i = 0; i < (int) embd.size(); i++) {
// printf("%s", gptneox_token_to_str(ctx, embd[i]));
//}
//printf("'\n");
//printf("\n---\n");
}
// try to reuse a matching prefix from the loaded session instead of re-eval (via n_past)
// REVIEW
if (n_session_consumed < (int) session_tokens.size()) {
size_t i = 0;
for ( ; i < embd.size(); i++) {
if (embd[i] != session_tokens[n_session_consumed]) {
session_tokens.resize(n_session_consumed);
break;
}
n_past++;
n_session_consumed++;
if (n_session_consumed >= (int) session_tokens.size()) {
break;
}
}
if (i > 0) {
embd.erase(embd.begin(), embd.begin() + i);
}
}
// evaluate tokens in batches
// embd is typically prepared beforehand to fit within a batch, but not always
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
int n_eval = (int) embd.size() - i;
if (n_eval > params.n_batch) {
n_eval = params.n_batch;
}
if (gptneox_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
fprintf(stderr, "%s : failed to eval\n", __func__);
return 1;
}
n_past += n_eval;
}
if (embd.size() > 0 && !path_session.empty()) {
session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
n_session_consumed = session_tokens.size();
}
}
embd.clear();
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
// out of user input, sample next token
const float temp = params.temp;
const int32_t top_k = params.top_k <= 0 ? gptneox_n_vocab(ctx) : params.top_k;
const float top_p = params.top_p;
const float tfs_z = params.tfs_z;
const float typical_p = params.typical_p;
const int32_t repeat_last_n = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
const float repeat_penalty = params.repeat_penalty;
const float alpha_presence = params.presence_penalty;
const float alpha_frequency = params.frequency_penalty;
const int mirostat = params.mirostat;
const float mirostat_tau = params.mirostat_tau;
const float mirostat_eta = params.mirostat_eta;
const bool penalize_nl = params.penalize_nl;
// optionally save the session on first sample (for faster prompt loading next time)
if (!path_session.empty() && need_to_save_session) {
need_to_save_session = false;
gptneox_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
}
gptneox_token id = 0;
{
auto logits = gptneox_get_logits(ctx);
auto n_vocab = gptneox_n_vocab(ctx);
// Apply params.logit_bias map
for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
logits[it->first] += it->second;
}
std::vector<gptneox_token_data> candidates;
candidates.reserve(n_vocab);
for (gptneox_token token_id = 0; token_id < n_vocab; token_id++) {
candidates.emplace_back(gptneox_token_data{token_id, logits[token_id], 0.0f});
}
gptneox_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
// Apply penalties
gptneox_token nl_token = gptneox_str_to_token(ctx, "\n");
float nl_logit = logits[nl_token];
auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
gptneox_sample_repetition_penalty(ctx, &candidates_p,
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
last_n_repeat, repeat_penalty);
gptneox_sample_frequency_and_presence_penalties(ctx, &candidates_p,
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
last_n_repeat, alpha_frequency, alpha_presence);
if (!penalize_nl) {
logits[nl_token] = nl_logit;
}
if (temp <= 0) {
// Greedy sampling
id = gptneox_sample_token_greedy(ctx, &candidates_p);
} else {
if (mirostat == 1) {
static float mirostat_mu = 2.0f * mirostat_tau;
const int mirostat_m = 100;
gptneox_sample_temperature(ctx, &candidates_p, temp);
id = gptneox_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
} else if (mirostat == 2) {
static float mirostat_mu = 2.0f * mirostat_tau;
gptneox_sample_temperature(ctx, &candidates_p, temp);
id = gptneox_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
} else {
// Temperature sampling
gptneox_sample_top_k(ctx, &candidates_p, top_k, 1);
gptneox_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
gptneox_sample_typical(ctx, &candidates_p, typical_p, 1);
gptneox_sample_top_p(ctx, &candidates_p, top_p, 1);
gptneox_sample_temperature(ctx, &candidates_p, temp);
id = gptneox_sample_token(ctx, &candidates_p);
}
}
// printf("`%d`", candidates_p.size);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(id);
}
// replace end of text token with newline token when in interactive mode
if (id == gptneox_token_eos() && params.interactive && !params.instruct) {
id = gptneox_token_newline.front();
if (params.antiprompt.size() != 0) {
// tokenize and inject first reverse prompt
const auto first_antiprompt = ::gptneox_tokenize(ctx, params.antiprompt.front(), false);
embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
}
}
// add it to the context
embd.push_back(id);
// echo this to console
input_noecho = false;
// decrement remaining sampling budget
--n_remain;
} else {
// some user input remains from prompt or interaction, forward it to processing
while ((int) embd_inp.size() > n_consumed) {
embd.push_back(embd_inp[n_consumed]);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(embd_inp[n_consumed]);
++n_consumed;
if ((int) embd.size() >= params.n_batch) {
break;
}
}
}
// display text
if (!input_noecho) {
for (auto id : embd) {
printf("%s", gptneox_token_to_str(ctx, id));
}
fflush(stdout);
}
// reset color to default if we there is no pending user input
if (!input_noecho && (int)embd_inp.size() == n_consumed) {
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
}
// in interactive mode, and not currently processing queued inputs;
// check if we should prompt the user for more
if (params.interactive && (int) embd_inp.size() <= n_consumed) {
// check for reverse prompt
if (params.antiprompt.size()) {
std::string last_output;
for (auto id : last_n_tokens) {
last_output += gptneox_token_to_str(ctx, id);
}
is_antiprompt = false;
// Check if each of the reverse prompts appears at the end of the output.
for (std::string & antiprompt : params.antiprompt) {
if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
is_interacting = true;
is_antiprompt = true;
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
fflush(stdout);
break;
}
}
}
if (n_past > 0 && is_interacting) {
// potentially set color to indicate we are taking user input
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
#if defined (_WIN32)
// Windows: must reactivate sigint handler after each signal
signal(SIGINT, sigint_handler);
#endif
if (params.instruct) {
printf("\n> ");
}
std::string buffer;
if (!params.input_prefix.empty()) {
buffer += params.input_prefix;
printf("%s", buffer.c_str());
}
std::string line;
bool another_line = true;
do {
#if defined(_WIN32)
std::wstring wline;
if (!std::getline(std::wcin, wline)) {
// input stream is bad or EOF received
return 0;
}
win32_utf8_encode(wline, line);
#else
if (!std::getline(std::cin, line)) {
// input stream is bad or EOF received
return 0;
}
#endif
if (line.empty() || line.back() != '\\') {
another_line = false;
} else {
line.pop_back(); // Remove the continue character
}
buffer += line + '\n'; // Append the line to the result
} while (another_line);
// done taking input, reset color
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
// Add tokens to embd only if the input buffer is non-empty
// Entering a empty line lets the user pass control back
if (buffer.length() > 1) {
auto line_inp = ::gptneox_tokenize(ctx, buffer, false);
embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
n_remain -= line_inp.size();
}
input_noecho = true; // do not echo this again
}
if (n_past > 0) {
is_interacting = false;
}
}
// end of text token
if (!embd.empty() && embd.back() == gptneox_token_eos()) {
if (params.instruct) {
is_interacting = true;
} else {
fprintf(stderr, " [end of text]\n");
break;
}
}
// In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
if (params.interactive && n_remain <= 0 && params.n_predict != -1) {
n_remain = params.n_predict;
is_interacting = true;
}
}
#if defined (_WIN32)
signal(SIGINT, SIG_DFL);
#endif
printf("\n\n");
gptneox_print_timings(ctx);
gptneox_free(ctx);
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
return 0;
}

View file

@ -0,0 +1,82 @@
#include "ggml.h"
#include "gptneox.h"
#include <cstdio>
#include <map>
#include <string>
static const std::map<std::string, enum gptneox_ftype> GPTNEOX_FTYPE_MAP = {
{"q4_0", GPTNEOX_FTYPE_MOSTLY_Q4_0},
{"q4_1", GPTNEOX_FTYPE_MOSTLY_Q4_1},
{"q4_2", GPTNEOX_FTYPE_MOSTLY_Q4_2},
//{"q4_3", GPTNEOX_FTYPE_MOSTLY_Q4_3},
{"q5_0", GPTNEOX_FTYPE_MOSTLY_Q5_0},
{"q5_1", GPTNEOX_FTYPE_MOSTLY_Q5_1},
{"q8_0", GPTNEOX_FTYPE_MOSTLY_Q8_0},
};
// usage:
// ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
//
int main(int argc, char ** argv) {
ggml_time_init();
if (argc < 4) {
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]);
for (auto it = GPTNEOX_FTYPE_MAP.begin(); it != GPTNEOX_FTYPE_MAP.end(); it++) {
fprintf(stderr, " type = \"%s\" or %d\n", it->first.c_str(), it->second);
}
return 1;
}
// needed to initialize f16 tables
{
struct ggml_init_params params = { 0, NULL, false };
struct ggml_context * ctx = ggml_init(params);
ggml_free(ctx);
}
const std::string fname_inp = argv[1];
const std::string fname_out = argv[2];
enum gptneox_ftype ftype;
if (argv[3][0] == 'q') {
auto it = GPTNEOX_FTYPE_MAP.find(argv[3]);
if (it == GPTNEOX_FTYPE_MAP.end()) {
fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
return 1;
}
ftype = it->second;
} else {
ftype = (enum gptneox_ftype)atoi(argv[3]);
}
int nthread = argc > 4 ? atoi(argv[4]) : 0;
const int64_t t_main_start_us = ggml_time_us();
int64_t t_quantize_us = 0;
// load the model
{
const int64_t t_start_us = ggml_time_us();
if (gptneox_model_quantize(fname_inp.c_str(), fname_out.c_str(), ftype, nthread)) {
fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
return 1;
}
t_quantize_us = ggml_time_us() - t_start_us;
}
// report timing
{
const int64_t t_main_end_us = ggml_time_us();
printf("\n");
printf("%s: quantize time = %8.2f ms\n", __func__, t_quantize_us/1000.0);
printf("%s: total time = %8.2f ms\n", __func__, (t_main_end_us - t_main_start_us)/1000.0);
}
return 0;
}

View file

@ -0,0 +1,144 @@
# Convert Hugging Face fine-tuned gpt-neox-like models to ggml format
import io
import os
import sys
import struct
import json
import code
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
cs = [chr(n) for n in cs]
return dict(zip(bs, cs))
if len(sys.argv) < 3:
print("Usage: python convert-hf-to-ggml.py model_name dir-output [use-f32]")
print(" model_name: name of the model to convert. Example: 'bigscience/bloomz-560m'")
print(" dir-output: directory where the output file will be written")
print(" use-f32: if present, use float32 instead of float16")
sys.exit(1)
model_name = sys.argv[1]
dir_out = sys.argv[2]
model_cache_dir = dir_out + "-cache"
# make sure the output directory exists
os.makedirs(dir_out, exist_ok=True)
# possible data types
# ftype == 0 -> float32
# ftype == 1 -> float16
#
# map from ftype to string
ftype_str = ["f32", "f16"]
ftype = 1
if len(sys.argv) > 3:
ftype = 0
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Loading model: ", model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if ftype == 1 else torch.float32,
cache_dir=model_cache_dir)
model.eval()
for p in model.parameters():
p.requires_grad = False
hparams = model.config.to_dict()
print("Model loaded: ", model_name)
fn_bin = f"/ggml-{model_name.split('/')[-1]}-{ftype_str[ftype]}.bin"
fn_out = dir_out + fn_bin
fout = open(fn_out, "wb")
ggml_file_magic = 0x67676d66 # 0x67676d6c is unversioned
ggml_file_version = 0x00000001 # v1
hparams["multiple_of"] = 1
fout.write(struct.pack("i", ggml_file_magic)) # magic: ggmf in hex
fout.write(struct.pack("i", ggml_file_version))
fout.write(struct.pack("i", hparams["vocab_size"]))
fout.write(struct.pack("i", hparams["max_position_embeddings"]))
fout.write(struct.pack("i", hparams["hidden_size"]))
fout.write(struct.pack("i", hparams["num_attention_heads"]))
fout.write(struct.pack("i", hparams["num_hidden_layers"]))
fout.write(struct.pack("i", int((hparams["hidden_size"] / hparams["num_attention_heads"]
) * hparams["rotary_pct"]))) # rotary_dim
fout.write(struct.pack("i", int(hparams["use_parallel_residual"])))
fout.write(struct.pack("i", ftype))
# Is this correct??
dot_token = tokenizer.encode(".")[0]
for i in range(hparams["vocab_size"]):
text = tokenizer.decode([i]).encode('utf-8')
fout.write(struct.pack("i", len(text)))
fout.write(text)
list_vars = model.state_dict()
print(hparams)
for name in list_vars.keys():
if name.startswith('gpt_neox.layers.'):
if 'attention.masked_bias' in name or \
'attention.rotary_emb.inv_freq' in name or \
'attention.bias' in name:
continue
# No gradients for these
list_vars[name].requires_grad = False
src = name
nn = name
print(src, ' -> ', name)
data = list_vars[src].squeeze().numpy()
data = data.astype(np.float32)
n_dims = len(data.shape)
print(name, n_dims, data.shape)
# default type is fp32
ftype_cur = 0
if ftype == 1 and n_dims > 1:
print(" Converting to float16", data.shape, data[:3, :3].tolist())
data = data.astype(np.float16)
ftype_cur = 1
else:
print(" Converting to float32", data.shape,
data[:3, :3].tolist() if n_dims > 1 else data[:3].tolist())
data = data.astype(np.float32)
# header
str = name.encode('utf-8')
fout.write(struct.pack("iii", n_dims, len(str), ftype_cur))
for i in range(n_dims):
fout.write(struct.pack("i", data.shape[n_dims - 1 - i]))
print(str)
fout.write(str)
# data
data.tofile(fout)
fout.close()
print("Done. Output file: " + fn_out)
print("")

View file

@ -0,0 +1,21 @@
#!/bin/bash
# cd to scripts dir
cd `dirname $0`
# download model to models dir
echo "Downloading model"
python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Base-3B-v1 ../models/pythia
# remove temp cache dir
echo "Removing temp cache dir"
rm -r ../models/pythia-cache
# quantize model
echo "Quantizing model (q4_0)"
cd ../../..
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Base-3B-v1-f16.bin
# done!
echo "Done."

View file

@ -0,0 +1,21 @@
#!/bin/bash
# cd to scripts dir
cd `dirname $0`
# download model to models dir
echo "Downloading model"
python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Chat-3B-v1 ../models/pythia
# remove temp cache dir
echo "Removing temp cache dir"
rm -r ../models/pythia-cache
# quantize model
echo "Quantizing model (q4_0)"
cd ../../..
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin
# done!
echo "Done."

View file

@ -0,0 +1,21 @@
#!/bin/bash
# cd to scripts dir
cd `dirname $0`
# download model to models dir
echo "Downloading model"
python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Instruct-3B-v1 ../models/pythia
# remove temp cache dir
echo "Removing temp cache dir"
rm -r ../models/pythia-cache
# quantize model
echo "Quantizing model (q4_0)"
cd ../../..
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-f16.bin
# done!
echo "Done."

View file

@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""Script to execute the "quantize" script on a given set of models."""
import subprocess
import argparse
import glob
import sys
import os
def main():
"""Update the quantize binary name depending on the platform and parse
the command line arguments and execute the script.
"""
if "linux" in sys.platform or "darwin" in sys.platform:
quantize_script_binary = "quantize-gptneox"
elif "win32" in sys.platform or "cygwin" in sys.platform:
quantize_script_binary = "quantize-gptneox.exe"
else:
print("WARNING: Unknown platform. Assuming a UNIX-like OS.\n")
quantize_script_binary = "quantize-gptneox"
parser = argparse.ArgumentParser(
prog='python3 quantize-gptneox.py',
description='This script quantizes the given models by applying the '
f'"{quantize_script_binary}" script on them.'
)
parser.add_argument('model_path')
#parser.add_argument(
# 'models', nargs='+', choices=('7B', '13B', '30B', '65B'),
# help='The models to quantize.'
#)
parser.add_argument(
'-r', '--remove-16', action='store_true', dest='remove_f16',
help='Remove the f16 model after quantizing it.'
)
#parser.add_argument(
# '-m', '--models-path', dest='models_path',
# default=os.path.join(os.getcwd(), "models"),
# help='Specify the directory where the models are located.'
#)
parser.add_argument(
'-q', '--quantize-script-path', dest='quantize_script_path',
default=os.path.join(os.getcwd(), quantize_script_binary),
help='Specify the path to the "quantize" script.'
)
parser.add_argument(
'--quantize-output-type', dest='quantize_output_type', type=str,
default='q4_0',
help='Specify the path to the "quantize" script.'
)
# TODO: Revise this code
# parser.add_argument(
# '-t', '--threads', dest='threads', type='int',
# default=os.cpu_count(),
# help='Specify the number of threads to use to quantize many models at '
# 'once. Defaults to os.cpu_count().'
# )
args = parser.parse_args()
args.model_path = os.path.abspath(args.model_path)
#args.models_path = os.path.abspath(args.models_path)
if not os.path.isfile(args.quantize_script_path):
print(
f'The "{quantize_script_binary}" script was not found in the '
"current location.\nIf you want to use it from another location, "
"set the --quantize-script-path argument from the command line."
)
sys.exit(1)
#for model in args.models:
# The model is separated in various parts
# (ggml-model-f16.bin, ggml-model-f16.bin.0, ggml-model-f16.bin.1...)
#f16_model_path_base = os.path.join(
# args.models_path, model, "ggml-model-f16.bin"
#)
f16_model_path_base = args.model_path
if not os.path.isfile(f16_model_path_base):
print(f'The file %s was not found' % f16_model_path_base)
sys.exit(1)
f16_model_parts_paths = map(
lambda filename: os.path.join(f16_model_path_base, filename),
glob.glob(f"{f16_model_path_base}*")
)
for f16_model_part_path in f16_model_parts_paths:
if not os.path.isfile(f16_model_part_path):
print(
f"The f16 model {os.path.basename(f16_model_part_path)} "
f"was not found in {args.models_path}{os.path.sep}"
". If you want to use it from another location, set the "
"--models-path argument from the command line."
)
sys.exit(1)
__run_quantize_script(
args.quantize_script_path, f16_model_part_path, args.quantize_output_type
)
if args.remove_f16:
os.remove(f16_model_part_path)
# This was extracted to a top-level function for parallelization, if
# implemented. See https://github.com/ggerganov/llama.cpp/pull/222/commits/f8db3d6cd91bf1a1342db9d29e3092bc12dd783c#r1140496406
def __run_quantize_script(script_path, f16_model_part_path, quantize_output_type):
"""Run the quantize script specifying the path to it and the path to the
f16 model to quantize.
"""
new_quantized_model_path = f16_model_part_path.replace("f16", quantize_output_type)
subprocess.run(
[script_path, f16_model_part_path, new_quantized_model_path, quantize_output_type],
check=True
)
if __name__ == "__main__":
try:
main()
except subprocess.CalledProcessError:
print("\nAn error ocurred while trying to quantize the models.")
sys.exit(1)
except KeyboardInterrupt:
sys.exit(0)
else:
print("\nSuccesfully quantized all models.")