mirror of
https://github.com/jart/cosmopolitan.git
synced 2025-08-08 10:50:28 +00:00
Import redpajama.cpp into COSMO.
This is the relevant commit: bfa6466199
This commit is contained in:
parent
3083335d07
commit
52e471331c
16 changed files with 5913 additions and 17 deletions
24
third_party/radpajama/README.cosmo
vendored
24
third_party/radpajama/README.cosmo
vendored
|
@ -1,6 +1,6 @@
|
|||
DESCRIPTION
|
||||
|
||||
ggml is a machine learning library useful for LLM inference on CPUs
|
||||
radpajama is a port of ggml for the open source Red Pajama LLM. It started as a fork of redpajama.cpp from Together Computer.
|
||||
|
||||
LICENSE
|
||||
|
||||
|
@ -8,22 +8,12 @@ LICENSE
|
|||
|
||||
ORIGIN
|
||||
|
||||
https://github.com/ggerganov/llama.cpp
|
||||
commit 0b2da20538d01926b77ea237dd1c930c4d20b686
|
||||
Author: Stephan Walter <stephan@walter.name>
|
||||
Date: Wed Apr 26 20:26:42 2023 +0000
|
||||
ggml : slightly faster AVX2 implementation for Q5 (#1197)
|
||||
github.com/togethercomputer/redpajama.cpp/
|
||||
commit bfa6466199b8ef92185ecb72e2a550e12baf6602
|
||||
Author: Szhangce <czhang@cs.stanford.edu>
|
||||
Date: Tue May 9 00:50:22 2023 +0200
|
||||
radpajama : Update README.md
|
||||
|
||||
LOCAL CHANGES
|
||||
|
||||
- Make it possible for loaded prompts to be cached to disk
|
||||
- Introduce -v and --verbose flags
|
||||
- Reduce batch size from 512 to 32
|
||||
- Allow --n_keep to specify a substring of prompt
|
||||
- Don't print stats / diagnostics unless -v is passed
|
||||
- Reduce --top_p default from 0.95 to 0.70
|
||||
- Change --reverse-prompt to no longer imply --interactive
|
||||
- Permit --reverse-prompt specifying custom EOS if non-interactive
|
||||
- Refactor headers per cosmo convention
|
||||
- Replace code like 'ggjt' with READ32BE("ggjt")
|
||||
- Remove C++ exceptions; use Die() function instead
|
||||
- Updated headers for COSMO build.
|
||||
|
|
143
third_party/radpajama/README.md
vendored
Normal file
143
third_party/radpajama/README.md
vendored
Normal file
|
@ -0,0 +1,143 @@
|
|||
# gglm Support for RedPajama Model
|
||||
|
||||
## Ackonwledgement
|
||||
|
||||
We highly appreciate the great effort from the fork of [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp). Our support of the RedPajama Model is mainly based on this implementation. We extend the model configure and fixed a bug when setting use_parallel_residual flag to False in their original implementation. We also extend the chat model for RedPajama.
|
||||
|
||||
## Usage:
|
||||
|
||||
### RedPajama Chat model:
|
||||
|
||||
- Make the code:
|
||||
|
||||
make redpajama-chat quantize-gptneox
|
||||
|
||||
|
||||
- Prepare the RedPajama model (f16 and q4_0) for gglm:
|
||||
|
||||
bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
|
||||
|
||||
- Run RedPajama chat model (fp16):
|
||||
|
||||
./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin \
|
||||
-c 2048 \
|
||||
-b 128 \
|
||||
-n 1 \
|
||||
-t 8 \
|
||||
--instruct \
|
||||
--color \
|
||||
--top_k 30 \
|
||||
--top_p 0.95 \
|
||||
--temp 0.8 \
|
||||
--repeat_last_n 3 \
|
||||
--repeat_penalty 1.1 \
|
||||
--seed 0
|
||||
|
||||
Note that you may need to install torch and transformers to run the above scripts, e.g.:
|
||||
|
||||
pip install torch==2.0.0
|
||||
pip install transformers==4.28.1
|
||||
|
||||
|
||||
- Run RedPajama chat model (q4_0):
|
||||
|
||||
./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-q4_0.bin \
|
||||
-c 2048 \
|
||||
-b 128 \
|
||||
-n 1 \
|
||||
-t 8 \
|
||||
--instruct \
|
||||
--color \
|
||||
--top_k 30 \
|
||||
--top_p 0.95 \
|
||||
--temp 0.8 \
|
||||
--repeat_last_n 3 \
|
||||
--repeat_penalty 1.1 \
|
||||
--seed 0
|
||||
|
||||
- Run other quantized version of RedPajama Chat model (Make sure you get the f16 model prepared before you run this):
|
||||
|
||||
- Make the code to quantize the model if you have not:
|
||||
|
||||
make quantize-gptneox
|
||||
|
||||
- Generate the quantized model, the supported types include: q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to run q4_1, you need to do the following convertion:
|
||||
|
||||
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin --quantize-output-type q4_1
|
||||
|
||||
- Then you can chat with the quantized model:
|
||||
|
||||
./redpajama-chat -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-q4_1.bin \
|
||||
-c 2048 \
|
||||
-b 128 \
|
||||
-n 1 \
|
||||
-t 8 \
|
||||
--instruct \
|
||||
--color \
|
||||
--top_k 30 \
|
||||
--top_p 0.95 \
|
||||
--temp 0.8 \
|
||||
--repeat_last_n 3 \
|
||||
--repeat_penalty 1.1 \
|
||||
--seed 0
|
||||
|
||||
|
||||
|
||||
|
||||
### RedPajama Base/Instruct model:
|
||||
|
||||
- Make the code:
|
||||
|
||||
make redpajama quantize-gptneox
|
||||
|
||||
|
||||
- Prepare the RedPajama Base/Instruct model (f16 and q4_0) for gglm:
|
||||
|
||||
bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
|
||||
|
||||
# Or
|
||||
|
||||
bash ./examples/redpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
|
||||
|
||||
- Run other quantize version of RedPajama Base/Instruct model (Make sure you get the f16 model prepared before you run this). Then you can generate the quantized model, the supported types include: q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to run q4_1, you need to do the following convertion, e.g for RedPajama-Base q8_0:
|
||||
|
||||
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Base-3B-v1-f16.bin --quantize-output-type q8_0
|
||||
|
||||
- Run RedPajama Base/Instruct model (e.g., RedPajama-Instruct q8_0) :
|
||||
|
||||
./redpajama -m ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-q8_0.bin \
|
||||
-c 2048 \
|
||||
-b 128 \
|
||||
-n 1 \
|
||||
-t 8 \
|
||||
--color \
|
||||
--top_k 30 \
|
||||
--top_p 0.95 \
|
||||
--temp 0.8 \
|
||||
--repeat_last_n 3 \
|
||||
--repeat_penalty 1.1 \
|
||||
--seed 0 \
|
||||
--n_predict 256 \
|
||||
--verbose-prompt \
|
||||
-p "How to schedule a tour to Anfield:"
|
||||
|
||||
|
||||
## Attribution
|
||||
|
||||
The following files are covered by a MIT license and were taken from:
|
||||
|
||||
https://github.com/byroneverson/gptneox.cpp
|
||||
|
||||
Thank you Byron.
|
||||
|
||||
```
|
||||
common-gptneox.cpp
|
||||
copy-gptneox.cpp
|
||||
gptneox.cpp
|
||||
quantize-gptneox.cpp
|
||||
common-gptneox.h
|
||||
gptneox-util.h
|
||||
gptneox.h
|
||||
convert_gptneox_to_ggml.py
|
||||
quantize-gptneox.py
|
||||
```
|
429
third_party/radpajama/common-gptneox.cpp
vendored
Normal file
429
third_party/radpajama/common-gptneox.cpp
vendored
Normal file
|
@ -0,0 +1,429 @@
|
|||
#include "common-gptneox.h"
|
||||
|
||||
#include <cassert>
|
||||
#include <cstring>
|
||||
#include <fstream>
|
||||
#include <string>
|
||||
#include <iterator>
|
||||
#include <algorithm>
|
||||
#include <sstream>
|
||||
#include <iostream>
|
||||
|
||||
#if defined (_WIN32)
|
||||
#include <fcntl.h>
|
||||
#include <io.h>
|
||||
#pragma comment(lib,"kernel32.lib")
|
||||
extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
|
||||
extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
|
||||
extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
|
||||
extern "C" __declspec(dllimport) int __stdcall SetConsoleCP(unsigned int wCodePageID);
|
||||
extern "C" __declspec(dllimport) int __stdcall SetConsoleOutputCP(unsigned int wCodePageID);
|
||||
extern "C" __declspec(dllimport) int __stdcall WideCharToMultiByte(unsigned int CodePage, unsigned long dwFlags,
|
||||
const wchar_t * lpWideCharStr, int cchWideChar,
|
||||
char * lpMultiByteStr, int cbMultiByte,
|
||||
const char * lpDefaultChar, bool * lpUsedDefaultChar);
|
||||
#define CP_UTF8 65001
|
||||
#endif
|
||||
|
||||
bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
|
||||
// determine sensible default number of threads.
|
||||
// std::thread::hardware_concurrency may not be equal to the number of cores, or may return 0.
|
||||
#ifdef __linux__
|
||||
std::ifstream cpuinfo("/proc/cpuinfo");
|
||||
params.n_threads = std::count(std::istream_iterator<std::string>(cpuinfo),
|
||||
std::istream_iterator<std::string>(),
|
||||
std::string("processor"));
|
||||
#endif
|
||||
if (params.n_threads == 0) {
|
||||
params.n_threads = std::max(1, (int32_t) std::thread::hardware_concurrency());
|
||||
}
|
||||
|
||||
bool invalid_param = false;
|
||||
std::string arg;
|
||||
gpt_params default_params;
|
||||
|
||||
for (int i = 1; i < argc; i++) {
|
||||
arg = argv[i];
|
||||
|
||||
if (arg == "-s" || arg == "--seed") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.seed = std::stoi(argv[i]);
|
||||
} else if (arg == "-t" || arg == "--threads") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.n_threads = std::stoi(argv[i]);
|
||||
} else if (arg == "-p" || arg == "--prompt") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.prompt = argv[i];
|
||||
} else if (arg == "--session") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.path_session = argv[i];
|
||||
} else if (arg == "-f" || arg == "--file") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
std::ifstream file(argv[i]);
|
||||
if (!file) {
|
||||
fprintf(stderr, "error: failed to open file '%s'\n", argv[i]);
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.prompt));
|
||||
if (params.prompt.back() == '\n') {
|
||||
params.prompt.pop_back();
|
||||
}
|
||||
} else if (arg == "-n" || arg == "--n_predict") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.n_predict = std::stoi(argv[i]);
|
||||
} else if (arg == "--top_k") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.top_k = std::stoi(argv[i]);
|
||||
} else if (arg == "-c" || arg == "--ctx_size") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.n_ctx = std::stoi(argv[i]);
|
||||
} else if (arg == "--memory_f32") {
|
||||
params.memory_f16 = false;
|
||||
} else if (arg == "--top_p") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.top_p = std::stof(argv[i]);
|
||||
} else if (arg == "--temp") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.temp = std::stof(argv[i]);
|
||||
} else if (arg == "--tfs") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.tfs_z = std::stof(argv[i]);
|
||||
} else if (arg == "--typical") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.typical_p = std::stof(argv[i]);
|
||||
} else if (arg == "--repeat_last_n") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.repeat_last_n = std::stoi(argv[i]);
|
||||
} else if (arg == "--repeat_penalty") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.repeat_penalty = std::stof(argv[i]);
|
||||
} else if (arg == "--frequency_penalty") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.frequency_penalty = std::stof(argv[i]);
|
||||
} else if (arg == "--presence_penalty") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.presence_penalty = std::stof(argv[i]);
|
||||
} else if (arg == "--mirostat") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.mirostat = std::stoi(argv[i]);
|
||||
} else if (arg == "--mirostat_lr") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.mirostat_eta = std::stof(argv[i]);
|
||||
} else if (arg == "--mirostat_ent") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.mirostat_tau = std::stof(argv[i]);
|
||||
} else if (arg == "-b" || arg == "--batch_size") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.n_batch = std::stoi(argv[i]);
|
||||
params.n_batch = std::min(512, params.n_batch);
|
||||
} else if (arg == "--keep") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.n_keep = std::stoi(argv[i]);
|
||||
} else if (arg == "-m" || arg == "--model") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.model = argv[i];
|
||||
} else if (arg == "--lora") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.lora_adapter = argv[i];
|
||||
params.use_mmap = false;
|
||||
} else if (arg == "--lora-base") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.lora_base = argv[i];
|
||||
} else if (arg == "-i" || arg == "--interactive") {
|
||||
params.interactive = true;
|
||||
} else if (arg == "--embedding") {
|
||||
params.embedding = true;
|
||||
} else if (arg == "--interactive-first") {
|
||||
params.interactive_first = true;
|
||||
} else if (arg == "-ins" || arg == "--instruct") {
|
||||
params.instruct = true;
|
||||
} else if (arg == "--color") {
|
||||
params.use_color = true;
|
||||
} else if (arg == "--mlock") {
|
||||
params.use_mlock = true;
|
||||
} else if (arg == "--no-mmap") {
|
||||
params.use_mmap = false;
|
||||
} else if (arg == "--mtest") {
|
||||
params.mem_test = true;
|
||||
} else if (arg == "--verbose-prompt") {
|
||||
params.verbose_prompt = true;
|
||||
} else if (arg == "-r" || arg == "--reverse-prompt") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.antiprompt.push_back(argv[i]);
|
||||
} else if (arg == "--perplexity") {
|
||||
params.perplexity = true;
|
||||
} else if (arg == "--ignore-eos") {
|
||||
params.logit_bias[gptneox_token_eos()] = -INFINITY;
|
||||
} else if (arg == "--no-penalize-nl") {
|
||||
params.penalize_nl = false;
|
||||
} else if (arg == "-l" || arg == "--logit-bias") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
std::stringstream ss(argv[i]);
|
||||
gptneox_token key;
|
||||
char sign;
|
||||
std::string value_str;
|
||||
try {
|
||||
if (ss >> key && ss >> sign && std::getline(ss, value_str) && (sign == '+' || sign == '-')) {
|
||||
params.logit_bias[key] = std::stof(value_str) * ((sign == '-') ? -1.0f : 1.0f);
|
||||
} else {
|
||||
throw std::exception();
|
||||
}
|
||||
} catch (const std::exception &e) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
} else if (arg == "--n_parts") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.n_parts = std::stoi(argv[i]);
|
||||
} else if (arg == "-h" || arg == "--help") {
|
||||
gpt_print_usage(argc, argv, default_params);
|
||||
exit(0);
|
||||
} else if (arg == "--random-prompt") {
|
||||
params.random_prompt = true;
|
||||
} else if (arg == "--in-prefix") {
|
||||
if (++i >= argc) {
|
||||
invalid_param = true;
|
||||
break;
|
||||
}
|
||||
params.input_prefix = argv[i];
|
||||
} else {
|
||||
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
|
||||
gpt_print_usage(argc, argv, default_params);
|
||||
exit(1);
|
||||
}
|
||||
}
|
||||
if (invalid_param) {
|
||||
fprintf(stderr, "error: invalid parameter for argument: %s\n", arg.c_str());
|
||||
gpt_print_usage(argc, argv, default_params);
|
||||
exit(1);
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
|
||||
fprintf(stderr, "usage: %s [options]\n", argv[0]);
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "options:\n");
|
||||
fprintf(stderr, " -h, --help show this help message and exit\n");
|
||||
fprintf(stderr, " -i, --interactive run in interactive mode\n");
|
||||
fprintf(stderr, " --interactive-first run in interactive mode and wait for input right away\n");
|
||||
fprintf(stderr, " -ins, --instruct run in instruction mode\n");
|
||||
fprintf(stderr, " -r PROMPT, --reverse-prompt PROMPT\n");
|
||||
fprintf(stderr, " run in interactive mode and poll user input upon seeing PROMPT (can be\n");
|
||||
fprintf(stderr, " specified more than once for multiple prompts).\n");
|
||||
fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n");
|
||||
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for <= 0)\n");
|
||||
fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
|
||||
fprintf(stderr, " -p PROMPT, --prompt PROMPT\n");
|
||||
fprintf(stderr, " prompt to start generation with (default: empty)\n");
|
||||
fprintf(stderr, " --session FNAME file to cache model state in (may be large!) (default: none)\n");
|
||||
fprintf(stderr, " --random-prompt start with a randomized prompt.\n");
|
||||
fprintf(stderr, " --in-prefix STRING string to prefix user inputs with (default: empty)\n");
|
||||
fprintf(stderr, " -f FNAME, --file FNAME\n");
|
||||
fprintf(stderr, " prompt file to start generation.\n");
|
||||
fprintf(stderr, " -n N, --n_predict N number of tokens to predict (default: %d, -1 = infinity)\n", params.n_predict);
|
||||
fprintf(stderr, " --top_k N top-k sampling (default: %d, 0 = disabled)\n", params.top_k);
|
||||
fprintf(stderr, " --top_p N top-p sampling (default: %.1f, 1.0 = disabled)\n", (double)params.top_p);
|
||||
fprintf(stderr, " --tfs N tail free sampling, parameter z (default: %.1f, 1.0 = disabled)\n", (double)params.tfs_z);
|
||||
fprintf(stderr, " --typical N locally typical sampling, parameter p (default: %.1f, 1.0 = disabled)\n", (double)params.typical_p);
|
||||
fprintf(stderr, " --repeat_last_n N last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)\n", params.repeat_last_n);
|
||||
fprintf(stderr, " --repeat_penalty N penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled)\n", (double)params.repeat_penalty);
|
||||
fprintf(stderr, " --presence_penalty N repeat alpha presence penalty (default: %.1f, 0.0 = disabled)\n", (double)params.presence_penalty);
|
||||
fprintf(stderr, " --frequency_penalty N repeat alpha frequency penalty (default: %.1f, 0.0 = disabled)\n", (double)params.frequency_penalty);
|
||||
fprintf(stderr, " --mirostat N use Mirostat sampling.\n");
|
||||
fprintf(stderr, " Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.\n");
|
||||
fprintf(stderr, " (default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)\n", params.mirostat);
|
||||
fprintf(stderr, " --mirostat_lr N Mirostat learning rate, parameter eta (default: %.1f)\n", (double)params.mirostat_eta);
|
||||
fprintf(stderr, " --mirostat_ent N Mirostat target entropy, parameter tau (default: %.1f)\n", (double)params.mirostat_tau);
|
||||
fprintf(stderr, " -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS\n");
|
||||
fprintf(stderr, " modifies the likelihood of token appearing in the completion,\n");
|
||||
fprintf(stderr, " i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',\n");
|
||||
fprintf(stderr, " or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'\n");
|
||||
fprintf(stderr, " -c N, --ctx_size N size of the prompt context (default: %d)\n", params.n_ctx);
|
||||
fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating (implies --logit-bias 2-inf)\n");
|
||||
fprintf(stderr, " --no-penalize-nl do not penalize newline token\n");
|
||||
fprintf(stderr, " --memory_f32 use f32 instead of f16 for memory key+value\n");
|
||||
fprintf(stderr, " --temp N temperature (default: %.1f)\n", (double)params.temp);
|
||||
fprintf(stderr, " --n_parts N number of model parts (default: -1 = determine from dimensions)\n");
|
||||
fprintf(stderr, " -b N, --batch_size N batch size for prompt processing (default: %d)\n", params.n_batch);
|
||||
fprintf(stderr, " --perplexity compute perplexity over the prompt\n");
|
||||
fprintf(stderr, " --keep number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);
|
||||
if (gptneox_mlock_supported()) {
|
||||
fprintf(stderr, " --mlock force system to keep model in RAM rather than swapping or compressing\n");
|
||||
}
|
||||
if (gptneox_mmap_supported()) {
|
||||
fprintf(stderr, " --no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
|
||||
}
|
||||
fprintf(stderr, " --mtest compute maximum memory usage\n");
|
||||
fprintf(stderr, " --verbose-prompt print prompt before generation\n");
|
||||
fprintf(stderr, " --lora FNAME apply LoRA adapter (implies --no-mmap)\n");
|
||||
fprintf(stderr, " --lora-base FNAME optional model to use as a base for the layers modified by the LoRA adapter\n");
|
||||
fprintf(stderr, " -m FNAME, --model FNAME\n");
|
||||
fprintf(stderr, " model path (default: %s)\n", params.model.c_str());
|
||||
fprintf(stderr, "\n");
|
||||
}
|
||||
|
||||
std::string gpt_random_prompt(std::mt19937 & rng) {
|
||||
const int r = rng() % 10;
|
||||
switch (r) {
|
||||
case 0: return "So";
|
||||
case 1: return "Once upon a time";
|
||||
case 2: return "When";
|
||||
case 3: return "The";
|
||||
case 4: return "After";
|
||||
case 5: return "If";
|
||||
case 6: return "import";
|
||||
case 7: return "He";
|
||||
case 8: return "She";
|
||||
case 9: return "They";
|
||||
default: return "To";
|
||||
}
|
||||
|
||||
return "The";
|
||||
}
|
||||
|
||||
// TODO: not great allocating this every time
|
||||
std::vector<gptneox_token> gptneox_tokenize(struct gptneox_context * ctx, const std::string & text, bool add_bos) {
|
||||
// initialize to prompt numer of chars, since n_tokens <= n_prompt_chars
|
||||
std::vector<gptneox_token> res(text.size() + (int)add_bos);
|
||||
int n = gptneox_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
|
||||
assert(n >= 0);
|
||||
res.resize(n);
|
||||
|
||||
return res;
|
||||
}
|
||||
|
||||
/* Keep track of current color of output, and emit ANSI code if it changes. */
|
||||
void set_console_color(console_state & con_st, console_color_t color) {
|
||||
if (con_st.use_color && con_st.color != color) {
|
||||
switch(color) {
|
||||
case CONSOLE_COLOR_DEFAULT:
|
||||
printf(ANSI_COLOR_RESET);
|
||||
break;
|
||||
case CONSOLE_COLOR_PROMPT:
|
||||
printf(ANSI_COLOR_YELLOW);
|
||||
break;
|
||||
case CONSOLE_COLOR_USER_INPUT:
|
||||
printf(ANSI_BOLD ANSI_COLOR_GREEN);
|
||||
break;
|
||||
}
|
||||
con_st.color = color;
|
||||
}
|
||||
}
|
||||
|
||||
#if defined (_WIN32)
|
||||
void win32_console_init(bool enable_color) {
|
||||
unsigned long dwMode = 0;
|
||||
void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
|
||||
if (!hConOut || hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode)) {
|
||||
hConOut = GetStdHandle((unsigned long)-12); // STD_ERROR_HANDLE (-12)
|
||||
if (hConOut && (hConOut == (void*)-1 || !GetConsoleMode(hConOut, &dwMode))) {
|
||||
hConOut = 0;
|
||||
}
|
||||
}
|
||||
if (hConOut) {
|
||||
// Enable ANSI colors on Windows 10+
|
||||
if (enable_color && !(dwMode & 0x4)) {
|
||||
SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
|
||||
}
|
||||
// Set console output codepage to UTF8
|
||||
SetConsoleOutputCP(CP_UTF8);
|
||||
}
|
||||
void* hConIn = GetStdHandle((unsigned long)-10); // STD_INPUT_HANDLE (-10)
|
||||
if (hConIn && hConIn != (void*)-1 && GetConsoleMode(hConIn, &dwMode)) {
|
||||
// Set console input codepage to UTF16
|
||||
_setmode(_fileno(stdin), _O_WTEXT);
|
||||
}
|
||||
}
|
||||
|
||||
// Convert a wide Unicode string to an UTF8 string
|
||||
void win32_utf8_encode(const std::wstring & wstr, std::string & str) {
|
||||
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
|
||||
std::string strTo(size_needed, 0);
|
||||
WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
|
||||
str = strTo;
|
||||
}
|
||||
#endif
|
108
third_party/radpajama/common-gptneox.h
vendored
Normal file
108
third_party/radpajama/common-gptneox.h
vendored
Normal file
|
@ -0,0 +1,108 @@
|
|||
// Various helper functions and utilities
|
||||
|
||||
#pragma once
|
||||
|
||||
#include "gptneox.h"
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
#include <random>
|
||||
#include <thread>
|
||||
#include <unordered_map>
|
||||
|
||||
//
|
||||
// CLI argument parsing
|
||||
//
|
||||
|
||||
struct gpt_params {
|
||||
int32_t seed = -1; // RNG seed
|
||||
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
|
||||
int32_t n_predict = 128; // new tokens to predict
|
||||
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
|
||||
int32_t n_ctx = 512; // context size
|
||||
int32_t n_batch = 512; // batch size for prompt processing (must be >=32 to use BLAS)
|
||||
int32_t n_keep = 0; // number of tokens to keep from initial prompt
|
||||
|
||||
// sampling parameters
|
||||
std::unordered_map<gptneox_token, float> logit_bias; // logit bias for specific tokens
|
||||
int32_t top_k = 40; // <= 0 to use vocab size
|
||||
float top_p = 0.95f; // 1.0 = disabled
|
||||
float tfs_z = 1.00f; // 1.0 = disabled
|
||||
float typical_p = 1.00f; // 1.0 = disabled
|
||||
float temp = 0.80f; // 1.0 = disabled
|
||||
float repeat_penalty = 1.10f; // 1.0 = disabled
|
||||
int32_t repeat_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
|
||||
float frequency_penalty = 0.00f; // 0.0 = disabled
|
||||
float presence_penalty = 0.00f; // 0.0 = disabled
|
||||
int mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
|
||||
float mirostat_tau = 5.00f; // target entropy
|
||||
float mirostat_eta = 0.10f; // learning rate
|
||||
|
||||
std::string model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat/Instruct-3B-v1-f16.bin"; // model path
|
||||
std::string prompt = "";
|
||||
std::string path_session = ""; // path to file for saving/loading model eval state
|
||||
std::string input_prefix = ""; // string to prefix user inputs with
|
||||
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
|
||||
|
||||
std::string lora_adapter = ""; // lora adapter path
|
||||
std::string lora_base = ""; // base model path for the lora adapter
|
||||
|
||||
bool memory_f16 = true; // use f16 instead of f32 for memory kv
|
||||
bool random_prompt = false; // do not randomize prompt if none provided
|
||||
bool use_color = false; // use color to distinguish generations and inputs
|
||||
bool interactive = false; // interactive mode
|
||||
|
||||
bool embedding = false; // get only sentence embedding
|
||||
bool interactive_first = false; // wait for user input immediately
|
||||
|
||||
bool instruct = false; // instruction mode
|
||||
bool penalize_nl = true; // consider newlines as a repeatable token
|
||||
bool perplexity = false; // compute perplexity over the prompt
|
||||
bool use_mmap = true; // use mmap for faster loads
|
||||
bool use_mlock = false; // use mlock to keep model in memory
|
||||
bool mem_test = false; // compute maximum memory usage
|
||||
bool verbose_prompt = false; // print prompt tokens before generation
|
||||
};
|
||||
|
||||
bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
|
||||
|
||||
void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
|
||||
|
||||
std::string gpt_random_prompt(std::mt19937 & rng);
|
||||
|
||||
//
|
||||
// Vocab utils
|
||||
//
|
||||
|
||||
std::vector<gptneox_token> gptneox_tokenize(struct gptneox_context * ctx, const std::string & text, bool add_bos);
|
||||
|
||||
//
|
||||
// Console utils
|
||||
//
|
||||
|
||||
#define ANSI_COLOR_RED "\x1b[31m"
|
||||
#define ANSI_COLOR_GREEN "\x1b[32m"
|
||||
#define ANSI_COLOR_YELLOW "\x1b[33m"
|
||||
#define ANSI_COLOR_BLUE "\x1b[34m"
|
||||
#define ANSI_COLOR_MAGENTA "\x1b[35m"
|
||||
#define ANSI_COLOR_CYAN "\x1b[36m"
|
||||
#define ANSI_COLOR_RESET "\x1b[0m"
|
||||
#define ANSI_BOLD "\x1b[1m"
|
||||
|
||||
enum console_color_t {
|
||||
CONSOLE_COLOR_DEFAULT=0,
|
||||
CONSOLE_COLOR_PROMPT,
|
||||
CONSOLE_COLOR_USER_INPUT
|
||||
};
|
||||
|
||||
struct console_state {
|
||||
bool use_color = false;
|
||||
console_color_t color = CONSOLE_COLOR_DEFAULT;
|
||||
};
|
||||
|
||||
void set_console_color(console_state & con_st, console_color_t color);
|
||||
|
||||
#if defined (_WIN32)
|
||||
void win32_console_init(bool enable_color);
|
||||
void win32_utf8_encode(const std::wstring & wstr, std::string & str);
|
||||
#endif
|
57
third_party/radpajama/copy-gptneox.cpp
vendored
Normal file
57
third_party/radpajama/copy-gptneox.cpp
vendored
Normal file
|
@ -0,0 +1,57 @@
|
|||
#include "ggml.h"
|
||||
#include "gptneox.h"
|
||||
|
||||
#include <cstdio>
|
||||
#include <map>
|
||||
#include <string>
|
||||
|
||||
static const std::map<std::string, enum gptneox_ftype> GPTNEOX_FTYPE_MAP = {
|
||||
{"q4_0", GPTNEOX_FTYPE_MOSTLY_Q4_0},
|
||||
{"q4_1", GPTNEOX_FTYPE_MOSTLY_Q4_1},
|
||||
{"q4_2", GPTNEOX_FTYPE_MOSTLY_Q4_2},
|
||||
//{"q4_3", GPTNEOX_FTYPE_MOSTLY_Q4_3},
|
||||
{"q5_0", GPTNEOX_FTYPE_MOSTLY_Q5_0},
|
||||
{"q5_1", GPTNEOX_FTYPE_MOSTLY_Q5_1},
|
||||
{"q8_0", GPTNEOX_FTYPE_MOSTLY_Q8_0},
|
||||
};
|
||||
|
||||
// usage:
|
||||
// ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
|
||||
//
|
||||
int main(int argc, char ** argv) {
|
||||
ggml_time_init();
|
||||
|
||||
if (argc < 4) {
|
||||
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin ftype\n", argv[0]);
|
||||
for (auto it = GPTNEOX_FTYPE_MAP.begin(); it != GPTNEOX_FTYPE_MAP.end(); it++) {
|
||||
fprintf(stderr, " type = \"%s\" or %d\n", it->first.c_str(), it->second);
|
||||
}
|
||||
return 1;
|
||||
}
|
||||
|
||||
// needed to initialize f16 tables
|
||||
{
|
||||
struct ggml_init_params params = { 0, NULL, false };
|
||||
struct ggml_context * ctx = ggml_init(params);
|
||||
ggml_free(ctx);
|
||||
}
|
||||
|
||||
const std::string fname_inp = argv[1];
|
||||
const std::string fname_out = argv[2];
|
||||
|
||||
enum gptneox_ftype ftype;
|
||||
if (argv[3][0] == 'q') {
|
||||
auto it = GPTNEOX_FTYPE_MAP.find(argv[3]);
|
||||
if (it == GPTNEOX_FTYPE_MAP.end()) {
|
||||
fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
|
||||
return 1;
|
||||
}
|
||||
ftype = it->second;
|
||||
} else {
|
||||
ftype = (enum gptneox_ftype)atoi(argv[3]);
|
||||
}
|
||||
|
||||
gptneox_model_copy(fname_inp.c_str(), fname_out.c_str(), ftype);
|
||||
|
||||
return 0;
|
||||
}
|
433
third_party/radpajama/gptneox-util.h
vendored
Normal file
433
third_party/radpajama/gptneox-util.h
vendored
Normal file
|
@ -0,0 +1,433 @@
|
|||
// Internal header to be included only by llama.cpp.
|
||||
// Contains wrappers around OS interfaces.
|
||||
|
||||
#ifndef GPTNEOX_UTIL_H
|
||||
#define GPTNEOX_UTIL_H
|
||||
|
||||
#include <cstdio>
|
||||
#include <cstdint>
|
||||
#include <cerrno>
|
||||
#include <cstring>
|
||||
#include <cstdarg>
|
||||
#include <cstdlib>
|
||||
#include <climits>
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
#ifdef __has_include
|
||||
#if __has_include(<unistd.h>)
|
||||
#include <unistd.h>
|
||||
#if defined(_POSIX_MAPPED_FILES)
|
||||
#include <sys/mman.h>
|
||||
#endif
|
||||
#if defined(_POSIX_MEMLOCK_RANGE)
|
||||
#include <sys/resource.h>
|
||||
#endif
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#if defined(_WIN32)
|
||||
#define WIN32_LEAN_AND_MEAN
|
||||
#ifndef NOMINMAX
|
||||
#define NOMINMAX
|
||||
#endif
|
||||
#include <windows.h>
|
||||
#include <io.h>
|
||||
#include <stdio.h> // for _fseeki64
|
||||
#endif
|
||||
|
||||
#define GPTNEOX_ASSERT(x) \
|
||||
do { \
|
||||
if (!(x)) { \
|
||||
fprintf(stderr, "GPTNEOX_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
|
||||
abort(); \
|
||||
} \
|
||||
} while (0)
|
||||
|
||||
#ifdef __GNUC__
|
||||
#ifdef __MINGW32__
|
||||
__attribute__((format(gnu_printf, 1, 2)))
|
||||
#else
|
||||
__attribute__((format(printf, 1, 2)))
|
||||
#endif
|
||||
#endif
|
||||
static std::string format(const char * fmt, ...) {
|
||||
va_list ap, ap2;
|
||||
va_start(ap, fmt);
|
||||
va_copy(ap2, ap);
|
||||
int size = vsnprintf(NULL, 0, fmt, ap);
|
||||
GPTNEOX_ASSERT(size >= 0 && size < INT_MAX);
|
||||
std::vector<char> buf(size + 1);
|
||||
int size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
|
||||
GPTNEOX_ASSERT(size2 == size);
|
||||
va_end(ap2);
|
||||
va_end(ap);
|
||||
return std::string(buf.data(), size);
|
||||
}
|
||||
|
||||
struct gptneox_file {
|
||||
// use FILE * so we don't have to re-open the file to mmap
|
||||
FILE * fp;
|
||||
size_t size;
|
||||
|
||||
gptneox_file(const char * fname, const char * mode) {
|
||||
fp = std::fopen(fname, mode);
|
||||
if (fp == NULL) {
|
||||
throw format("failed to open %s: %s", fname, std::strerror(errno));
|
||||
}
|
||||
seek(0, SEEK_END);
|
||||
size = tell();
|
||||
seek(0, SEEK_SET);
|
||||
}
|
||||
|
||||
size_t tell() const {
|
||||
#ifdef _WIN32
|
||||
__int64 ret = _ftelli64(fp);
|
||||
#else
|
||||
long ret = std::ftell(fp);
|
||||
#endif
|
||||
GPTNEOX_ASSERT(ret != -1); // this really shouldn't fail
|
||||
return (size_t) ret;
|
||||
}
|
||||
|
||||
void seek(size_t offset, int whence) {
|
||||
#ifdef _WIN32
|
||||
int ret = _fseeki64(fp, (__int64) offset, whence);
|
||||
#else
|
||||
int ret = std::fseek(fp, (long) offset, whence);
|
||||
#endif
|
||||
GPTNEOX_ASSERT(ret == 0); // same
|
||||
}
|
||||
|
||||
void read_raw(void * ptr, size_t size) {
|
||||
if (size == 0) {
|
||||
return;
|
||||
}
|
||||
errno = 0;
|
||||
std::size_t ret = std::fread(ptr, size, 1, fp);
|
||||
if (ferror(fp)) {
|
||||
throw format("read error: %s", strerror(errno));
|
||||
}
|
||||
if (ret != 1) {
|
||||
throw std::string("unexpectedly reached end of file");
|
||||
}
|
||||
}
|
||||
|
||||
std::uint32_t read_u32() {
|
||||
std::uint32_t ret;
|
||||
read_raw(&ret, sizeof(ret));
|
||||
return ret;
|
||||
}
|
||||
|
||||
std::string read_string(std::uint32_t len) {
|
||||
std::vector<char> chars(len);
|
||||
read_raw(chars.data(), len);
|
||||
return std::string(chars.data(), len);
|
||||
}
|
||||
|
||||
void write_raw(const void * ptr, size_t size) {
|
||||
if (size == 0) {
|
||||
return;
|
||||
}
|
||||
errno = 0;
|
||||
size_t ret = std::fwrite(ptr, size, 1, fp);
|
||||
if (ret != 1) {
|
||||
throw format("write error: %s", strerror(errno));
|
||||
}
|
||||
}
|
||||
|
||||
void write_u32(std::uint32_t val) {
|
||||
write_raw(&val, sizeof(val));
|
||||
}
|
||||
|
||||
~gptneox_file() {
|
||||
if (fp) {
|
||||
std::fclose(fp);
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
#if defined(_WIN32)
|
||||
static std::string gptneox_format_win_err(DWORD err) {
|
||||
LPSTR buf;
|
||||
size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
|
||||
NULL, err, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&buf, 0, NULL);
|
||||
if (!size) {
|
||||
return "FormatMessageA failed";
|
||||
}
|
||||
std::string ret(buf, size);
|
||||
LocalFree(buf);
|
||||
return ret;
|
||||
}
|
||||
#endif
|
||||
|
||||
struct gptneox_mmap {
|
||||
void * addr;
|
||||
size_t size;
|
||||
|
||||
gptneox_mmap(const gptneox_mmap &) = delete;
|
||||
|
||||
#ifdef _POSIX_MAPPED_FILES
|
||||
static constexpr bool SUPPORTED = true;
|
||||
|
||||
gptneox_mmap(struct gptneox_file * file, bool prefetch = true) {
|
||||
size = file->size;
|
||||
int fd = fileno(file->fp);
|
||||
int flags = MAP_SHARED;
|
||||
#ifdef __linux__
|
||||
flags |= MAP_POPULATE;
|
||||
#endif
|
||||
addr = mmap(NULL, file->size, PROT_READ, flags, fd, 0);
|
||||
if (addr == MAP_FAILED) {
|
||||
throw format("mmap failed: %s", strerror(errno));
|
||||
}
|
||||
|
||||
if (prefetch) {
|
||||
// Advise the kernel to preload the mapped memory
|
||||
if (madvise(addr, file->size, MADV_WILLNEED)) {
|
||||
fprintf(stderr, "warning: madvise(.., MADV_WILLNEED) failed: %s\n",
|
||||
strerror(errno));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
~gptneox_mmap() {
|
||||
munmap(addr, size);
|
||||
}
|
||||
#elif defined(_WIN32)
|
||||
static constexpr bool SUPPORTED = true;
|
||||
|
||||
gptneox_mmap(struct gptneox_file * file, bool prefetch = true) {
|
||||
size = file->size;
|
||||
|
||||
HANDLE hFile = (HANDLE) _get_osfhandle(_fileno(file->fp));
|
||||
|
||||
HANDLE hMapping = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
|
||||
DWORD error = GetLastError();
|
||||
|
||||
if (hMapping == NULL) {
|
||||
throw format("CreateFileMappingA failed: %s", gptneox_format_win_err(error).c_str());
|
||||
}
|
||||
|
||||
addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
|
||||
error = GetLastError();
|
||||
CloseHandle(hMapping);
|
||||
|
||||
if (addr == NULL) {
|
||||
throw format("MapViewOfFile failed: %s", gptneox_format_win_err(error).c_str());
|
||||
}
|
||||
|
||||
#if _WIN32_WINNT >= _WIN32_WINNT_WIN8
|
||||
if (prefetch) {
|
||||
// Advise the kernel to preload the mapped memory
|
||||
WIN32_MEMORY_RANGE_ENTRY range;
|
||||
range.VirtualAddress = addr;
|
||||
range.NumberOfBytes = (SIZE_T)size;
|
||||
if (!PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0)) {
|
||||
fprintf(stderr, "warning: PrefetchVirtualMemory failed: %s\n",
|
||||
gptneox_format_win_err(GetLastError()).c_str());
|
||||
}
|
||||
}
|
||||
#else
|
||||
#pragma message("warning: You are building for pre-Windows 8; prefetch not supported")
|
||||
#endif // _WIN32_WINNT >= _WIN32_WINNT_WIN8
|
||||
}
|
||||
|
||||
~gptneox_mmap() {
|
||||
if (!UnmapViewOfFile(addr)) {
|
||||
fprintf(stderr, "warning: UnmapViewOfFile failed: %s\n",
|
||||
gptneox_format_win_err(GetLastError()).c_str());
|
||||
}
|
||||
}
|
||||
#else
|
||||
static constexpr bool SUPPORTED = false;
|
||||
|
||||
gptneox_mmap(struct gptneox_file *) {
|
||||
throw std::string("mmap not supported");
|
||||
}
|
||||
#endif
|
||||
};
|
||||
|
||||
// Represents some region of memory being locked using mlock or VirtualLock;
|
||||
// will automatically unlock on destruction.
|
||||
struct gptneox_mlock {
|
||||
void * addr = NULL;
|
||||
size_t size = 0;
|
||||
bool failed_already = false;
|
||||
|
||||
gptneox_mlock() {}
|
||||
gptneox_mlock(const gptneox_mlock &) = delete;
|
||||
|
||||
~gptneox_mlock() {
|
||||
if (size) {
|
||||
raw_unlock(addr, size);
|
||||
}
|
||||
}
|
||||
|
||||
void init(void * addr) {
|
||||
GPTNEOX_ASSERT(this->addr == NULL && this->size == 0);
|
||||
this->addr = addr;
|
||||
}
|
||||
|
||||
void grow_to(size_t target_size) {
|
||||
GPTNEOX_ASSERT(addr);
|
||||
if (failed_already) {
|
||||
return;
|
||||
}
|
||||
size_t granularity = lock_granularity();
|
||||
target_size = (target_size + granularity - 1) & ~(granularity - 1);
|
||||
if (target_size > size) {
|
||||
if (raw_lock((uint8_t *) addr + size, target_size - size)) {
|
||||
size = target_size;
|
||||
} else {
|
||||
failed_already = true;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#ifdef _POSIX_MEMLOCK_RANGE
|
||||
static constexpr bool SUPPORTED = true;
|
||||
|
||||
size_t lock_granularity() {
|
||||
return (size_t) sysconf(_SC_PAGESIZE);
|
||||
}
|
||||
|
||||
#ifdef __APPLE__
|
||||
#define MLOCK_SUGGESTION \
|
||||
"Try increasing the sysctl values 'vm.user_wire_limit' and 'vm.global_user_wire_limit' and/or " \
|
||||
"decreasing 'vm.global_no_user_wire_amount'. Also try increasing RLIMIT_MLOCK (ulimit -l).\n"
|
||||
#else
|
||||
#define MLOCK_SUGGESTION \
|
||||
"Try increasing RLIMIT_MLOCK ('ulimit -l' as root).\n"
|
||||
#endif
|
||||
|
||||
bool raw_lock(const void * addr, size_t size) {
|
||||
if (!mlock(addr, size)) {
|
||||
return true;
|
||||
} else {
|
||||
char* errmsg = std::strerror(errno);
|
||||
bool suggest = (errno == ENOMEM);
|
||||
|
||||
// Check if the resource limit is fine after all
|
||||
struct rlimit lock_limit;
|
||||
if (suggest && getrlimit(RLIMIT_MEMLOCK, &lock_limit))
|
||||
suggest = false;
|
||||
if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size))
|
||||
suggest = false;
|
||||
|
||||
fprintf(stderr, "warning: failed to mlock %zu-byte buffer (after previously locking %zu bytes): %s\n%s",
|
||||
size, this->size, errmsg, suggest ? MLOCK_SUGGESTION : "");
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
#undef MLOCK_SUGGESTION
|
||||
|
||||
void raw_unlock(void * addr, size_t size) {
|
||||
if (munlock(addr, size)) {
|
||||
fprintf(stderr, "warning: failed to munlock buffer: %s\n", std::strerror(errno));
|
||||
}
|
||||
}
|
||||
#elif defined(_WIN32)
|
||||
static constexpr bool SUPPORTED = true;
|
||||
|
||||
size_t lock_granularity() {
|
||||
SYSTEM_INFO si;
|
||||
GetSystemInfo(&si);
|
||||
return (size_t) si.dwPageSize;
|
||||
}
|
||||
|
||||
bool raw_lock(void * addr, size_t size) {
|
||||
for (int tries = 1; ; tries++) {
|
||||
if (VirtualLock(addr, size)) {
|
||||
return true;
|
||||
}
|
||||
if (tries == 2) {
|
||||
fprintf(stderr, "warning: failed to VirtualLock %zu-byte buffer (after previously locking %zu bytes): %s\n",
|
||||
size, this->size, gptneox_format_win_err(GetLastError()).c_str());
|
||||
return false;
|
||||
}
|
||||
|
||||
// It failed but this was only the first try; increase the working
|
||||
// set size and try again.
|
||||
SIZE_T min_ws_size, max_ws_size;
|
||||
if (!GetProcessWorkingSetSize(GetCurrentProcess(), &min_ws_size, &max_ws_size)) {
|
||||
fprintf(stderr, "warning: GetProcessWorkingSetSize failed: %s\n",
|
||||
gptneox_format_win_err(GetLastError()).c_str());
|
||||
return false;
|
||||
}
|
||||
// Per MSDN: "The maximum number of pages that a process can lock
|
||||
// is equal to the number of pages in its minimum working set minus
|
||||
// a small overhead."
|
||||
// Hopefully a megabyte is enough overhead:
|
||||
size_t increment = size + 1048576;
|
||||
// The minimum must be <= the maximum, so we need to increase both:
|
||||
min_ws_size += increment;
|
||||
max_ws_size += increment;
|
||||
if (!SetProcessWorkingSetSize(GetCurrentProcess(), min_ws_size, max_ws_size)) {
|
||||
fprintf(stderr, "warning: SetProcessWorkingSetSize failed: %s\n",
|
||||
gptneox_format_win_err(GetLastError()).c_str());
|
||||
return false;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void raw_unlock(void * addr, size_t size) {
|
||||
if (!VirtualUnlock(addr, size)) {
|
||||
fprintf(stderr, "warning: failed to VirtualUnlock buffer: %s\n",
|
||||
gptneox_format_win_err(GetLastError()).c_str());
|
||||
}
|
||||
}
|
||||
#else
|
||||
static constexpr bool SUPPORTED = false;
|
||||
|
||||
void raw_lock(const void * addr, size_t size) {
|
||||
fprintf(stderr, "warning: mlock not supported on this system\n");
|
||||
}
|
||||
|
||||
void raw_unlock(const void * addr, size_t size) {}
|
||||
#endif
|
||||
};
|
||||
|
||||
// Replacement for std::vector<uint8_t> that doesn't require zero-initialization.
|
||||
struct gptneox_buffer {
|
||||
uint8_t * addr = NULL;
|
||||
size_t size = 0;
|
||||
|
||||
void resize(size_t size) {
|
||||
delete[] addr;
|
||||
addr = new uint8_t[size];
|
||||
this->size = size;
|
||||
}
|
||||
|
||||
~gptneox_buffer() {
|
||||
delete[] addr;
|
||||
}
|
||||
};
|
||||
|
||||
#ifdef GGML_USE_CUBLAS
|
||||
#include "ggml-cuda.h"
|
||||
struct gptneox_ctx_buffer {
|
||||
uint8_t * addr = NULL;
|
||||
size_t size = 0;
|
||||
|
||||
void resize(size_t size) {
|
||||
if (addr) {
|
||||
ggml_cuda_host_free(addr);
|
||||
}
|
||||
addr = (uint8_t *) ggml_cuda_host_malloc(size);
|
||||
this->size = size;
|
||||
}
|
||||
|
||||
~gptneox_ctx_buffer() {
|
||||
if (addr) {
|
||||
ggml_cuda_host_free(addr);
|
||||
}
|
||||
}
|
||||
};
|
||||
#else
|
||||
typedef gptneox_buffer gptneox_ctx_buffer;
|
||||
#endif
|
||||
|
||||
#endif
|
2942
third_party/radpajama/gptneox.cpp
vendored
Normal file
2942
third_party/radpajama/gptneox.cpp
vendored
Normal file
File diff suppressed because it is too large
Load diff
275
third_party/radpajama/gptneox.h
vendored
Normal file
275
third_party/radpajama/gptneox.h
vendored
Normal file
|
@ -0,0 +1,275 @@
|
|||
#ifndef GPTNEOX_H
|
||||
#define GPTNEOX_H
|
||||
|
||||
#include <stddef.h>
|
||||
#include <stdint.h>
|
||||
#include <stdbool.h>
|
||||
|
||||
#ifdef GPTNEOX_SHARED
|
||||
# if defined(_WIN32) && !defined(__MINGW32__)
|
||||
# ifdef GPTNEOX_BUILD
|
||||
# define GPTNEOX_API __declspec(dllexport)
|
||||
# else
|
||||
# define GPTNEOX_API __declspec(dllimport)
|
||||
# endif
|
||||
# else
|
||||
# define GPTNEOX_API __attribute__ ((visibility ("default")))
|
||||
# endif
|
||||
#else
|
||||
# define GPTNEOX_API
|
||||
#endif
|
||||
|
||||
#define GPTNEOX_FILE_VERSION 1
|
||||
#define GPTNEOX_FILE_MAGIC 0x67676a74 // 'ggjt' in hex
|
||||
#define GPTNEOX_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
|
||||
|
||||
#ifdef __cplusplus
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
//
|
||||
// C interface
|
||||
//
|
||||
// TODO: show sample usage
|
||||
//
|
||||
|
||||
struct gptneox_context;
|
||||
|
||||
typedef int gptneox_token;
|
||||
|
||||
typedef struct gptneox_token_data {
|
||||
gptneox_token id; // token id
|
||||
float logit; // log-odds of the token
|
||||
float p; // probability of the token
|
||||
} gptneox_token_data;
|
||||
|
||||
typedef struct gptneox_token_data_array {
|
||||
gptneox_token_data * data;
|
||||
size_t size;
|
||||
bool sorted;
|
||||
} gptneox_token_data_array;
|
||||
|
||||
typedef void (*gptneox_progress_callback)(float progress, void *ctx);
|
||||
|
||||
struct gptneox_context_params {
|
||||
int n_ctx; // text context
|
||||
int n_parts; // -1 for default
|
||||
int seed; // RNG seed, 0 for random
|
||||
|
||||
bool f16_kv; // use fp16 for KV cache
|
||||
bool logits_all; // the gptneox_eval() call computes all logits, not just the last one
|
||||
bool vocab_only; // only load the vocabulary, no weights
|
||||
bool use_mmap; // use mmap if possible
|
||||
bool use_mlock; // force system to keep model in RAM
|
||||
bool embedding; // embedding mode only
|
||||
|
||||
// called with a progress value between 0 and 1, pass NULL to disable
|
||||
gptneox_progress_callback progress_callback;
|
||||
// context pointer passed to the progress callback
|
||||
void * progress_callback_user_data;
|
||||
};
|
||||
|
||||
// model file types
|
||||
enum gptneox_ftype {
|
||||
GPTNEOX_FTYPE_ALL_F32 = 0,
|
||||
GPTNEOX_FTYPE_MOSTLY_F16 = 1, // except 1d tensors
|
||||
GPTNEOX_FTYPE_MOSTLY_Q4_0 = 2, // except 1d tensors
|
||||
GPTNEOX_FTYPE_MOSTLY_Q4_1 = 3, // except 1d tensors
|
||||
GPTNEOX_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
|
||||
GPTNEOX_FTYPE_MOSTLY_Q4_2 = 5, // except 1d tensors
|
||||
// GPTNEOX_FTYPE_MOSTLY_Q4_3 (6) support has been removed
|
||||
GPTNEOX_FTYPE_MOSTLY_Q8_0 = 7, // except 1d tensors
|
||||
GPTNEOX_FTYPE_MOSTLY_Q5_0 = 8, // except 1d tensors
|
||||
GPTNEOX_FTYPE_MOSTLY_Q5_1 = 9, // except 1d tensors
|
||||
};
|
||||
|
||||
GPTNEOX_API struct gptneox_context_params gptneox_context_default_params();
|
||||
|
||||
GPTNEOX_API bool gptneox_mmap_supported();
|
||||
GPTNEOX_API bool gptneox_mlock_supported();
|
||||
|
||||
// Various functions for loading a ggml llama model.
|
||||
// Allocate (almost) all memory needed for the model.
|
||||
// Return NULL on failure
|
||||
GPTNEOX_API struct gptneox_context * gptneox_init_from_file(
|
||||
const char * path_model,
|
||||
struct gptneox_context_params params);
|
||||
|
||||
// Frees all allocated memory
|
||||
GPTNEOX_API void gptneox_free(struct gptneox_context * ctx);
|
||||
|
||||
// TODO: not great API - very likely to change
|
||||
// Returns 0 on success
|
||||
// nthread - how many threads to use. If <=0, will use std::thread::hardware_concurrency(), else the number given
|
||||
GPTNEOX_API int gptneox_model_quantize(
|
||||
const char * fname_inp,
|
||||
const char * fname_out,
|
||||
enum gptneox_ftype ftype,
|
||||
int nthread);
|
||||
|
||||
GPTNEOX_API int gptneox_model_copy(
|
||||
const char * fname_inp,
|
||||
const char * fname_out,
|
||||
enum gptneox_ftype ftype);
|
||||
|
||||
// Apply a LoRA adapter to a loaded model
|
||||
// path_base_model is the path to a higher quality model to use as a base for
|
||||
// the layers modified by the adapter. Can be NULL to use the current loaded model.
|
||||
// The model needs to be reloaded before applying a new adapter, otherwise the adapter
|
||||
// will be applied on top of the previous one
|
||||
// Returns 0 on success
|
||||
GPTNEOX_API int gptneox_apply_lora_from_file(
|
||||
struct gptneox_context * ctx,
|
||||
const char * path_lora,
|
||||
const char * path_base_model,
|
||||
int n_threads);
|
||||
|
||||
// Returns the number of tokens in the KV cache
|
||||
GPTNEOX_API int gptneox_get_kv_cache_token_count(struct gptneox_context * ctx);
|
||||
|
||||
// Sets the current rng seed.
|
||||
GPTNEOX_API void gptneox_set_rng_seed(struct gptneox_context * ctx, int seed);
|
||||
|
||||
// Returns the size in bytes of the state (rng, logits, embedding and kv_cache)
|
||||
GPTNEOX_API size_t gptneox_get_state_size(struct gptneox_context * ctx);
|
||||
|
||||
// Copies the state to the specified destination address.
|
||||
// Destination needs to have allocated enough memory.
|
||||
// Returns the number of bytes copied
|
||||
GPTNEOX_API size_t gptneox_copy_state_data(struct gptneox_context * ctx, uint8_t * dest);
|
||||
|
||||
// Set the state reading from the specified address
|
||||
// Returns the number of bytes read
|
||||
GPTNEOX_API size_t gptneox_set_state_data(struct gptneox_context * ctx, const uint8_t * src);
|
||||
|
||||
// Save/load session file
|
||||
GPTNEOX_API size_t gptneox_load_session_file(struct gptneox_context * ctx, const char * path_session, gptneox_token * tokens_out, size_t n_token_capacity, size_t * n_token_count_out);
|
||||
GPTNEOX_API size_t gptneox_save_session_file(struct gptneox_context * ctx, const char * path_session, const gptneox_token * tokens, size_t n_token_count);
|
||||
|
||||
// Run the llama inference to obtain the logits and probabilities for the next token.
|
||||
// tokens + n_tokens is the provided batch of new tokens to process
|
||||
// n_past is the number of tokens to use from previous eval calls
|
||||
// Returns 0 on success
|
||||
GPTNEOX_API int gptneox_eval(
|
||||
struct gptneox_context * ctx,
|
||||
const gptneox_token * tokens,
|
||||
int n_tokens,
|
||||
int n_past,
|
||||
int n_threads);
|
||||
|
||||
// Convert the provided text into tokens.
|
||||
// The tokens pointer must be large enough to hold the resulting tokens.
|
||||
// Returns the number of tokens on success, no more than n_max_tokens
|
||||
// Returns a negative number on failure - the number of tokens that would have been returned
|
||||
// TODO: not sure if correct
|
||||
GPTNEOX_API int gptneox_tokenize(
|
||||
struct gptneox_context * ctx,
|
||||
const char * text,
|
||||
gptneox_token * tokens,
|
||||
int n_max_tokens,
|
||||
bool add_bos);
|
||||
|
||||
GPTNEOX_API int gptneox_n_vocab(struct gptneox_context * ctx);
|
||||
GPTNEOX_API int gptneox_n_ctx (struct gptneox_context * ctx);
|
||||
GPTNEOX_API int gptneox_n_embd (struct gptneox_context * ctx);
|
||||
|
||||
// Token logits obtained from the last call to gptneox_eval()
|
||||
// The logits for the last token are stored in the last row
|
||||
// Can be mutated in order to change the probabilities of the next token
|
||||
// Rows: n_tokens
|
||||
// Cols: n_vocab
|
||||
GPTNEOX_API float * gptneox_get_logits(struct gptneox_context * ctx);
|
||||
|
||||
// Get the embeddings for the input
|
||||
// shape: [n_embd] (1-dimensional)
|
||||
GPTNEOX_API float * gptneox_get_embeddings(struct gptneox_context * ctx);
|
||||
|
||||
// Token Id -> String. Uses the vocabulary in the provided context
|
||||
GPTNEOX_API const char * gptneox_token_to_str(struct gptneox_context * ctx, gptneox_token token);
|
||||
|
||||
// String -> Token Id. Uses the vocabulary in the provided context
|
||||
GPTNEOX_API gptneox_token gptneox_str_to_token(struct gptneox_context * ctx, const char * str);
|
||||
|
||||
// Special tokens
|
||||
GPTNEOX_API gptneox_token gptneox_token_bos();
|
||||
GPTNEOX_API gptneox_token gptneox_token_eos();
|
||||
// GPTNEOX_API gptneox_token gptneox_token_nl();
|
||||
|
||||
// TODO: improve the last_n_tokens interface ?
|
||||
GPTNEOX_API gptneox_token gptneox_sample_top_p_top_k(
|
||||
struct gptneox_context * ctx,
|
||||
const gptneox_token * last_n_tokens_data,
|
||||
int last_n_tokens_size,
|
||||
int top_k,
|
||||
float top_p,
|
||||
float temp,
|
||||
float repeat_penalty);
|
||||
|
||||
// Sampling functions
|
||||
|
||||
/// @details Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
|
||||
GPTNEOX_API void gptneox_sample_repetition_penalty(struct gptneox_context * ctx, gptneox_token_data_array * candidates, gptneox_token * last_tokens, size_t last_tokens_size, float penalty);
|
||||
|
||||
/// @details Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
|
||||
GPTNEOX_API void gptneox_sample_frequency_and_presence_penalties(struct gptneox_context * ctx, gptneox_token_data_array * candidates, gptneox_token * last_tokens, size_t last_tokens_size, float alpha_frequency, float alpha_presence);
|
||||
|
||||
/// @details Sorts candidate tokens by their logits in descending order and calculate probabilities based on logits.
|
||||
GPTNEOX_API void gptneox_sample_softmax(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
|
||||
|
||||
/// @details Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
|
||||
GPTNEOX_API void gptneox_sample_top_k(struct gptneox_context * ctx, gptneox_token_data_array * candidates, int k, size_t min_keep);
|
||||
|
||||
/// @details Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
|
||||
GPTNEOX_API void gptneox_sample_top_p(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float p, size_t min_keep);
|
||||
|
||||
/// @details Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
|
||||
GPTNEOX_API void gptneox_sample_tail_free(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float z, size_t min_keep);
|
||||
|
||||
/// @details Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
|
||||
GPTNEOX_API void gptneox_sample_typical(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float p, size_t min_keep);
|
||||
GPTNEOX_API void gptneox_sample_temperature(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float temp);
|
||||
|
||||
/// @details Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
|
||||
/// @param candidates A vector of `gptneox_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
|
||||
/// @param tau The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
|
||||
/// @param eta The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
|
||||
/// @param m The number of tokens considered in the estimation of `s_hat`. This is an arbitrary value that is used to calculate `s_hat`, which in turn helps to calculate the value of `k`. In the paper, they use `m = 100`, but you can experiment with different values to see how it affects the performance of the algorithm.
|
||||
/// @param mu Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (`2 * tau`) and is updated in the algorithm based on the error between the target and observed surprisal.
|
||||
GPTNEOX_API gptneox_token gptneox_sample_token_mirostat(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float tau, float eta, int m, float * mu);
|
||||
|
||||
/// @details Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
|
||||
/// @param candidates A vector of `gptneox_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
|
||||
/// @param tau The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
|
||||
/// @param eta The learning rate used to update `mu` based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause `mu` to be updated more quickly, while a smaller learning rate will result in slower updates.
|
||||
/// @param mu Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (`2 * tau`) and is updated in the algorithm based on the error between the target and observed surprisal.
|
||||
GPTNEOX_API gptneox_token gptneox_sample_token_mirostat_v2(struct gptneox_context * ctx, gptneox_token_data_array * candidates, float tau, float eta, float * mu);
|
||||
|
||||
/// @details Selects the token with the highest probability.
|
||||
GPTNEOX_API gptneox_token gptneox_sample_token_greedy(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
|
||||
|
||||
/// @details Randomly selects a token from the candidates based on their probabilities.
|
||||
GPTNEOX_API gptneox_token gptneox_sample_token(struct gptneox_context * ctx, gptneox_token_data_array * candidates);
|
||||
|
||||
// Performance information
|
||||
GPTNEOX_API void gptneox_print_timings(struct gptneox_context * ctx);
|
||||
GPTNEOX_API void gptneox_reset_timings(struct gptneox_context * ctx);
|
||||
|
||||
// Print system information
|
||||
GPTNEOX_API const char * gptneox_print_system_info(void);
|
||||
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
||||
|
||||
// Internal API to be implemented by llama.cpp and used by tests/benchmarks only
|
||||
#ifdef GPTNEOX_API_INTERNAL
|
||||
|
||||
#include <vector>
|
||||
#include <string>
|
||||
struct ggml_tensor;
|
||||
|
||||
std::vector<std::pair<std::string, struct ggml_tensor *>>& gptneox_internal_get_tensor_map(struct gptneox_context * ctx);
|
||||
|
||||
#endif
|
||||
|
||||
#endif // GPTNEOX_H
|
467
third_party/radpajama/main-redpajama-chat.cpp
vendored
Normal file
467
third_party/radpajama/main-redpajama-chat.cpp
vendored
Normal file
|
@ -0,0 +1,467 @@
|
|||
// Defines sigaction on msys:
|
||||
#ifndef _GNU_SOURCE
|
||||
#define _GNU_SOURCE
|
||||
#endif
|
||||
|
||||
#include "common-gptneox.h"
|
||||
#include "gptneox.h"
|
||||
|
||||
#include <cassert>
|
||||
#include <cinttypes>
|
||||
#include <cmath>
|
||||
#include <cstdio>
|
||||
#include <cstring>
|
||||
#include <ctime>
|
||||
#include <fstream>
|
||||
#include <iostream>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
#include <algorithm>
|
||||
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
|
||||
#include <signal.h>
|
||||
#include <unistd.h>
|
||||
#elif defined (_WIN32)
|
||||
#include <signal.h>
|
||||
#endif
|
||||
|
||||
static console_state con_st;
|
||||
static gptneox_context ** g_ctx;
|
||||
|
||||
static bool is_interacting = false;
|
||||
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
|
||||
void sigint_handler(int signo) {
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
printf("\n"); // this also force flush stdout.
|
||||
if (signo == SIGINT) {
|
||||
if (!is_interacting) {
|
||||
is_interacting=true;
|
||||
} else {
|
||||
gptneox_print_timings(*g_ctx);
|
||||
_exit(130);
|
||||
}
|
||||
}
|
||||
}
|
||||
#endif
|
||||
|
||||
int main(int argc, char ** argv) {
|
||||
gpt_params params;
|
||||
params.model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin";
|
||||
|
||||
if (gpt_params_parse(argc, argv, params) == false) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
// save choice to use color for later
|
||||
// (note for later: this is a slightly awkward choice)
|
||||
con_st.use_color = params.use_color;
|
||||
|
||||
#if defined (_WIN32)
|
||||
win32_console_init(params.use_color);
|
||||
#endif
|
||||
|
||||
if (params.perplexity) {
|
||||
printf("\n************\n");
|
||||
printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
|
||||
printf("************\n\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (params.embedding) {
|
||||
printf("\n************\n");
|
||||
printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
|
||||
printf("************\n\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (params.n_ctx > 2048) {
|
||||
fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
|
||||
"expect poor results\n", __func__, params.n_ctx);
|
||||
}
|
||||
|
||||
if (params.seed <= 0) {
|
||||
params.seed = time(NULL);
|
||||
}
|
||||
|
||||
fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
|
||||
|
||||
std::mt19937 rng(params.seed);
|
||||
if (params.random_prompt) {
|
||||
params.prompt = gpt_random_prompt(rng);
|
||||
}
|
||||
|
||||
gptneox_context * ctx;
|
||||
g_ctx = &ctx;
|
||||
|
||||
// load the model
|
||||
{
|
||||
auto lparams = gptneox_context_default_params();
|
||||
|
||||
lparams.n_ctx = params.n_ctx;
|
||||
lparams.n_parts = params.n_parts;
|
||||
lparams.seed = params.seed;
|
||||
lparams.f16_kv = params.memory_f16;
|
||||
lparams.use_mmap = params.use_mmap;
|
||||
lparams.use_mlock = params.use_mlock;
|
||||
|
||||
ctx = gptneox_init_from_file(params.model.c_str(), lparams);
|
||||
|
||||
if (ctx == NULL) {
|
||||
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
|
||||
if (!params.lora_adapter.empty()) {
|
||||
int err = gptneox_apply_lora_from_file(ctx,
|
||||
params.lora_adapter.c_str(),
|
||||
params.lora_base.empty() ? NULL : params.lora_base.c_str(),
|
||||
params.n_threads);
|
||||
if (err != 0) {
|
||||
fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
|
||||
// print system information
|
||||
{
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
|
||||
params.n_threads, std::thread::hardware_concurrency(), gptneox_print_system_info());
|
||||
}
|
||||
|
||||
// determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters
|
||||
if (params.mem_test) {
|
||||
{
|
||||
const std::vector<gptneox_token> tmp(params.n_batch, 0);
|
||||
gptneox_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
|
||||
}
|
||||
|
||||
{
|
||||
const std::vector<gptneox_token> tmp = { 0, };
|
||||
gptneox_eval(ctx, tmp.data(), tmp.size(), params.n_predict - 1, params.n_threads);
|
||||
}
|
||||
|
||||
gptneox_print_timings(ctx);
|
||||
gptneox_free(ctx);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
// Always interactive for RedPajama chat model
|
||||
params.interactive = true;
|
||||
|
||||
if (params.interactive) {
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
|
||||
struct sigaction sigint_action;
|
||||
sigint_action.sa_handler = sigint_handler;
|
||||
sigemptyset (&sigint_action.sa_mask);
|
||||
sigint_action.sa_flags = 0;
|
||||
sigaction(SIGINT, &sigint_action, NULL);
|
||||
#elif defined (_WIN32)
|
||||
signal(SIGINT, sigint_handler);
|
||||
#endif
|
||||
}
|
||||
fprintf(stderr, "sampling: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n",
|
||||
params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
|
||||
fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", params.n_ctx, params.n_batch, params.n_predict, params.n_keep);
|
||||
fprintf(stderr, "\n\n");
|
||||
|
||||
// TODO: replace with ring-buffer
|
||||
std::vector<gptneox_token> last_n_tokens = std::vector<gptneox_token>();
|
||||
//std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
|
||||
|
||||
set_console_color(con_st, CONSOLE_COLOR_PROMPT);
|
||||
|
||||
if (params.interactive) {
|
||||
printf("== Running in interactive mode. ==\n"
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
|
||||
" - Press Ctrl+C to interject at any time.\n"
|
||||
#endif
|
||||
" - Press Return to return control to RedPajama.\n"
|
||||
" - If you want to submit another line, end your input in '\\'.\n\n");
|
||||
}
|
||||
|
||||
const int32_t top_k = params.top_k;
|
||||
const float top_p = params.top_p;
|
||||
const float temp = params.temp;
|
||||
const float repeat_penalty = params.repeat_penalty;
|
||||
|
||||
// Chat loop
|
||||
while (true) {
|
||||
is_interacting = true;
|
||||
|
||||
int n_past = 0;
|
||||
|
||||
// Get input
|
||||
|
||||
// potentially set color to indicate we are taking user input
|
||||
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
|
||||
|
||||
#if defined (_WIN32)
|
||||
// Windows: must reactivate sigint handler after each signal
|
||||
signal(SIGINT, sigint_handler);
|
||||
#endif
|
||||
|
||||
if (params.instruct) {
|
||||
printf("\n<human>: ");
|
||||
}
|
||||
|
||||
std::string buffer;
|
||||
if (!params.input_prefix.empty()) {
|
||||
buffer += params.input_prefix;
|
||||
printf("%s", buffer.c_str());
|
||||
}
|
||||
|
||||
std::string line;
|
||||
bool another_line = true;
|
||||
do {
|
||||
#if defined(_WIN32)
|
||||
std::wstring wline;
|
||||
if (!std::getline(std::wcin, wline)) {
|
||||
// input stream is bad or EOF received
|
||||
return 0;
|
||||
}
|
||||
win32_utf8_encode(wline, line);
|
||||
#else
|
||||
if (!std::getline(std::cin, line)) {
|
||||
// input stream is bad or EOF received
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
if (line.empty() || line.back() != '\\') {
|
||||
another_line = false;
|
||||
} else {
|
||||
line.pop_back(); // Remove the continue character
|
||||
}
|
||||
buffer += line;
|
||||
if (another_line) {
|
||||
buffer += '\n';
|
||||
}
|
||||
} while (another_line);
|
||||
|
||||
is_interacting = false;
|
||||
|
||||
// done taking input, reset color
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
|
||||
// Check for input
|
||||
if (buffer.length() <= 0) {
|
||||
continue; // Restart loop for input
|
||||
}
|
||||
|
||||
// Tokenize prompt with RedPajama special tokens
|
||||
|
||||
auto prompt_embd = ::gptneox_tokenize(ctx, buffer, false);
|
||||
auto embd_inp = std::vector<gptneox_token>();
|
||||
|
||||
// Redpajama: insert special tokens for OA. (prefix)
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, "<"));
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, "human"));
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, ">:"));
|
||||
|
||||
embd_inp.insert(embd_inp.end(), prompt_embd.begin(), prompt_embd.end());
|
||||
|
||||
// Redpajama: insert special tokens for OA. (postfix)
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, "\n"));
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, "<"));
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, "bot"));
|
||||
embd_inp.push_back(gptneox_str_to_token(ctx, ">:"));
|
||||
|
||||
|
||||
// Verbose prompt
|
||||
if (params.verbose_prompt) {
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "%s: prompt: '%s'\n", __func__, buffer.c_str());
|
||||
fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
|
||||
for (int i = 0; i < (int) embd_inp.size(); i++) {
|
||||
fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], gptneox_token_to_str(ctx, embd_inp[i]));
|
||||
}
|
||||
fprintf(stderr, "\n");
|
||||
}
|
||||
|
||||
// How many tokens to generate - check if theres space in context for atleast one token (or batch size tokens?)
|
||||
auto inp_size = embd_inp.size();
|
||||
auto space = params.n_ctx - inp_size;
|
||||
if(space <= 0) {
|
||||
fprintf(stderr, "%s : input too long\n", __func__);
|
||||
continue;
|
||||
}
|
||||
// Send batches to eval
|
||||
while (n_past < inp_size) {
|
||||
auto remaining = inp_size - n_past;
|
||||
int n_eval = params.n_batch < remaining ? params.n_batch : remaining;
|
||||
if (gptneox_eval(ctx, &embd_inp[n_past], n_eval, n_past, params.n_threads)) {
|
||||
fprintf(stderr, "<bot>: %s : failed to eval\n", __func__);
|
||||
return 1;
|
||||
}
|
||||
n_past += n_eval;
|
||||
}
|
||||
|
||||
const int n_ctx = gptneox_n_ctx(ctx);
|
||||
const int n_vocab = gptneox_n_vocab(ctx);
|
||||
|
||||
const float temp = params.temp;
|
||||
const int32_t top_k = params.top_k <= 0 ? gptneox_n_vocab(ctx) : params.top_k;
|
||||
const float top_p = params.top_p;
|
||||
const float tfs_z = params.tfs_z;
|
||||
const float typical_p = params.typical_p;
|
||||
const int32_t repeat_last_n = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
|
||||
const float repeat_penalty = params.repeat_penalty;
|
||||
const float alpha_presence = params.presence_penalty;
|
||||
const float alpha_frequency = params.frequency_penalty;
|
||||
const int mirostat = params.mirostat;
|
||||
const float mirostat_tau = params.mirostat_tau;
|
||||
const float mirostat_eta = params.mirostat_eta;
|
||||
const bool penalize_nl = params.penalize_nl;
|
||||
|
||||
// Eval until space runs out
|
||||
auto out_count = 0;
|
||||
|
||||
printf("<bot>:");
|
||||
while (space > 0) {
|
||||
// Get token
|
||||
gptneox_token id = 0;
|
||||
|
||||
{
|
||||
auto logits = gptneox_get_logits(ctx);
|
||||
|
||||
// Apply params.logit_bias map
|
||||
for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
|
||||
logits[it->first] += it->second;
|
||||
}
|
||||
|
||||
std::vector<gptneox_token_data> candidates;
|
||||
candidates.reserve(n_vocab);
|
||||
for (gptneox_token token_id = 0; token_id < n_vocab; token_id++) {
|
||||
candidates.emplace_back(gptneox_token_data{token_id, logits[token_id], 0.0f});
|
||||
}
|
||||
|
||||
gptneox_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
|
||||
|
||||
// Apply penalties
|
||||
gptneox_token nl_token = gptneox_str_to_token(ctx, "\n");
|
||||
float nl_logit = logits[nl_token];
|
||||
auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
|
||||
gptneox_sample_repetition_penalty(ctx, &candidates_p,
|
||||
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
|
||||
last_n_repeat, repeat_penalty);
|
||||
gptneox_sample_frequency_and_presence_penalties(ctx, &candidates_p,
|
||||
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
|
||||
last_n_repeat, alpha_frequency, alpha_presence);
|
||||
if (!penalize_nl) {
|
||||
logits[nl_token] = nl_logit;
|
||||
}
|
||||
|
||||
if (temp <= 0) {
|
||||
// Greedy sampling
|
||||
id = gptneox_sample_token_greedy(ctx, &candidates_p);
|
||||
} else {
|
||||
if (mirostat == 1) {
|
||||
static float mirostat_mu = 2.0f * mirostat_tau;
|
||||
const int mirostat_m = 100;
|
||||
gptneox_sample_temperature(ctx, &candidates_p, temp);
|
||||
id = gptneox_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
|
||||
} else if (mirostat == 2) {
|
||||
static float mirostat_mu = 2.0f * mirostat_tau;
|
||||
gptneox_sample_temperature(ctx, &candidates_p, temp);
|
||||
id = gptneox_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
|
||||
} else {
|
||||
// Temperature sampling
|
||||
gptneox_sample_top_k(ctx, &candidates_p, top_k, 1);
|
||||
gptneox_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
|
||||
gptneox_sample_typical(ctx, &candidates_p, typical_p, 1);
|
||||
gptneox_sample_top_p(ctx, &candidates_p, top_p, 1);
|
||||
gptneox_sample_temperature(ctx, &candidates_p, temp);
|
||||
id = gptneox_sample_token(ctx, &candidates_p);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Inc out count and dec space
|
||||
out_count += 1;
|
||||
space -= 1;
|
||||
// Repeat tokens update
|
||||
last_n_tokens.push_back(id);
|
||||
if (last_n_tokens.size() > params.repeat_last_n) {
|
||||
last_n_tokens.erase(last_n_tokens.begin());
|
||||
}
|
||||
// Redpajama: check if the interactive is done.
|
||||
//std::cout<<" last_n_tokens.size: "<< last_n_tokens[0] <<" "<< last_n_tokens[1] <<" "<< last_n_tokens[2] << std::endl;
|
||||
if (last_n_tokens.size()==3 && last_n_tokens[0]==gptneox_str_to_token(ctx, "<")
|
||||
&& last_n_tokens[1]==gptneox_str_to_token(ctx, "human") && last_n_tokens[2]==gptneox_str_to_token(ctx, ">:")){
|
||||
space = 0;
|
||||
continue;
|
||||
}
|
||||
|
||||
// Check for eos - end early - check eos before bos in case they are the same
|
||||
if (id == gptneox_token_eos()) {
|
||||
space = 0;
|
||||
continue;
|
||||
}
|
||||
// Check for bos - skip callback if so
|
||||
if (id == gptneox_token_bos()) {
|
||||
continue;
|
||||
}
|
||||
// Convert token to string and display
|
||||
// printf("%s(%d)", gptneox_token_to_str(ctx, id), id);
|
||||
|
||||
|
||||
if (last_n_tokens[2]==gptneox_str_to_token(ctx, "<")){
|
||||
;
|
||||
}
|
||||
else if (last_n_tokens[2]==gptneox_str_to_token(ctx, "human")){
|
||||
if (last_n_tokens[1]==gptneox_str_to_token(ctx, "<")){
|
||||
;
|
||||
}
|
||||
else{
|
||||
printf("%s", gptneox_token_to_str(ctx, id));
|
||||
}
|
||||
}
|
||||
else if (last_n_tokens[1]==gptneox_str_to_token(ctx, "<")){
|
||||
printf("<");
|
||||
printf("%s", gptneox_token_to_str(ctx, id));
|
||||
}
|
||||
else{
|
||||
printf("%s", gptneox_token_to_str(ctx, id));
|
||||
}
|
||||
fflush(stdout);
|
||||
// Check if we need to run another eval
|
||||
if (space > 0) {
|
||||
// Send generated token back into model for next generation
|
||||
if (gptneox_eval(ctx, &id, 1, n_past, params.n_threads)) {
|
||||
fprintf(stderr, "%s : failed to eval\n", __func__);
|
||||
return 1;
|
||||
}
|
||||
// Increment past count
|
||||
n_past += 1;
|
||||
}
|
||||
// Check for user interrupt
|
||||
if (is_interacting) { space = 0; }
|
||||
}
|
||||
printf("\n");
|
||||
//printf("\n %d", space);
|
||||
fflush(stdout);
|
||||
}
|
||||
|
||||
#if defined (_WIN32)
|
||||
signal(SIGINT, SIG_DFL);
|
||||
#endif
|
||||
|
||||
gptneox_print_timings(ctx);
|
||||
gptneox_free(ctx);
|
||||
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
622
third_party/radpajama/main-redpajama.cpp
vendored
Normal file
622
third_party/radpajama/main-redpajama.cpp
vendored
Normal file
|
@ -0,0 +1,622 @@
|
|||
// Defines sigaction on msys:
|
||||
#ifndef _GNU_SOURCE
|
||||
#define _GNU_SOURCE
|
||||
#endif
|
||||
|
||||
#include "common-gptneox.h"
|
||||
#include "gptneox.h"
|
||||
|
||||
#include <cassert>
|
||||
#include <cinttypes>
|
||||
#include <cmath>
|
||||
#include <cstdio>
|
||||
#include <cstring>
|
||||
#include <ctime>
|
||||
#include <fstream>
|
||||
#include <iostream>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
|
||||
#include <signal.h>
|
||||
#include <unistd.h>
|
||||
#elif defined (_WIN32)
|
||||
#include <signal.h>
|
||||
#endif
|
||||
|
||||
static console_state con_st;
|
||||
static gptneox_context ** g_ctx;
|
||||
|
||||
static bool is_interacting = false;
|
||||
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
|
||||
void sigint_handler(int signo) {
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
printf("\n"); // this also force flush stdout.
|
||||
if (signo == SIGINT) {
|
||||
if (!is_interacting) {
|
||||
is_interacting=true;
|
||||
} else {
|
||||
gptneox_print_timings(*g_ctx);
|
||||
_exit(130);
|
||||
}
|
||||
}
|
||||
}
|
||||
#endif
|
||||
|
||||
int main(int argc, char ** argv) {
|
||||
gpt_params params;
|
||||
params.model = "./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-f16.bin";
|
||||
|
||||
if (gpt_params_parse(argc, argv, params) == false) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
// save choice to use color for later
|
||||
// (note for later: this is a slightly awkward choice)
|
||||
con_st.use_color = params.use_color;
|
||||
|
||||
#if defined (_WIN32)
|
||||
win32_console_init(params.use_color);
|
||||
#endif
|
||||
|
||||
if (params.perplexity) {
|
||||
printf("\n************\n");
|
||||
printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
|
||||
printf("************\n\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (params.embedding) {
|
||||
printf("\n************\n");
|
||||
printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
|
||||
printf("************\n\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (params.n_ctx > 2048) {
|
||||
fprintf(stderr, "%s: warning: model does not support context sizes greater than 2048 tokens (%d specified);"
|
||||
"expect poor results\n", __func__, params.n_ctx);
|
||||
}
|
||||
|
||||
if (params.seed < 0) {
|
||||
params.seed = time(NULL);
|
||||
}
|
||||
|
||||
fprintf(stderr, "%s: seed = %d\n", __func__, params.seed);
|
||||
|
||||
std::mt19937 rng(params.seed);
|
||||
if (params.random_prompt) {
|
||||
params.prompt = gpt_random_prompt(rng);
|
||||
}
|
||||
|
||||
// params.prompt = R"(// this function checks if the number n is prime
|
||||
//bool is_prime(int n) {)";
|
||||
|
||||
gptneox_context * ctx;
|
||||
g_ctx = &ctx;
|
||||
|
||||
// load the model
|
||||
{
|
||||
auto lparams = gptneox_context_default_params();
|
||||
|
||||
lparams.n_ctx = params.n_ctx;
|
||||
lparams.n_parts = params.n_parts;
|
||||
lparams.seed = params.seed;
|
||||
lparams.f16_kv = params.memory_f16;
|
||||
lparams.use_mmap = params.use_mmap;
|
||||
lparams.use_mlock = params.use_mlock;
|
||||
|
||||
ctx = gptneox_init_from_file(params.model.c_str(), lparams);
|
||||
|
||||
if (ctx == NULL) {
|
||||
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
|
||||
if (!params.lora_adapter.empty()) {
|
||||
int err = gptneox_apply_lora_from_file(ctx,
|
||||
params.lora_adapter.c_str(),
|
||||
params.lora_base.empty() ? NULL : params.lora_base.c_str(),
|
||||
params.n_threads);
|
||||
if (err != 0) {
|
||||
fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
|
||||
// print system information
|
||||
{
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "system_info: n_threads = %d / %d | %s\n",
|
||||
params.n_threads, std::thread::hardware_concurrency(), gptneox_print_system_info());
|
||||
}
|
||||
|
||||
// determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters
|
||||
// uncomment the "used_mem" line in llama.cpp to see the results
|
||||
if (params.mem_test) {
|
||||
{
|
||||
const std::vector<gptneox_token> tmp(params.n_batch, 0);
|
||||
gptneox_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
|
||||
}
|
||||
|
||||
{
|
||||
const std::vector<gptneox_token> tmp = { 0, };
|
||||
gptneox_eval(ctx, tmp.data(), tmp.size(), params.n_predict - 1, params.n_threads);
|
||||
}
|
||||
|
||||
gptneox_print_timings(ctx);
|
||||
gptneox_free(ctx);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
std::string path_session = params.path_session;
|
||||
std::vector<gptneox_token> session_tokens;
|
||||
|
||||
if (!path_session.empty()) {
|
||||
fprintf(stderr, "%s: attempting to load saved session from %s..\n", __func__, path_session.c_str());
|
||||
|
||||
// REVIEW - fopen to check for existing session
|
||||
FILE * fp = std::fopen(path_session.c_str(), "rb");
|
||||
if (fp != NULL) {
|
||||
std::fclose(fp);
|
||||
|
||||
session_tokens.resize(params.n_ctx);
|
||||
size_t n_token_count_out = 0;
|
||||
const size_t n_session_bytes = gptneox_load_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.capacity(), &n_token_count_out);
|
||||
session_tokens.resize(n_token_count_out);
|
||||
|
||||
if (n_session_bytes > 0) {
|
||||
fprintf(stderr, "%s: loaded %zu bytes of session data!\n", __func__, n_session_bytes);
|
||||
} else {
|
||||
fprintf(stderr, "%s: could not load session file, will recreate\n", __func__);
|
||||
}
|
||||
} else {
|
||||
fprintf(stderr, "%s: session file does not exist, will create\n", __func__);
|
||||
}
|
||||
}
|
||||
|
||||
// tokenize the prompt
|
||||
auto embd_inp = ::gptneox_tokenize(ctx, params.prompt, false); //true);
|
||||
|
||||
const int n_ctx = gptneox_n_ctx(ctx);
|
||||
|
||||
if ((int) embd_inp.size() > n_ctx - 4) {
|
||||
fprintf(stderr, "%s: error: prompt is too long (%d tokens, max %d)\n", __func__, (int) embd_inp.size(), n_ctx - 4);
|
||||
return 1;
|
||||
}
|
||||
|
||||
// debug message about similarity of saved session, if applicable
|
||||
size_t n_matching_session_tokens = 0;
|
||||
if (session_tokens.size()) {
|
||||
for (gptneox_token id : session_tokens) {
|
||||
if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
|
||||
break;
|
||||
}
|
||||
n_matching_session_tokens++;
|
||||
}
|
||||
if (n_matching_session_tokens >= embd_inp.size()) {
|
||||
fprintf(stderr, "%s: session file has exact match for prompt!\n", __func__);
|
||||
} else if (n_matching_session_tokens < (embd_inp.size() / 2)) {
|
||||
fprintf(stderr, "%s: warning: session file has low similarity to prompt (%zu / %zu tokens); will mostly be reevaluated\n",
|
||||
__func__, n_matching_session_tokens, embd_inp.size());
|
||||
} else {
|
||||
fprintf(stderr, "%s: session file matches %zu / %zu tokens of prompt\n",
|
||||
__func__, n_matching_session_tokens, embd_inp.size());
|
||||
}
|
||||
}
|
||||
|
||||
// number of tokens to keep when resetting context
|
||||
if (params.n_keep < 0 || params.n_keep > (int)embd_inp.size() || params.instruct) {
|
||||
params.n_keep = (int)embd_inp.size();
|
||||
}
|
||||
|
||||
// in instruct mode, we inject a prefix and a suffix to each input by the user
|
||||
if (params.instruct) {
|
||||
params.interactive_first = true;
|
||||
params.antiprompt.push_back("<|prompter|>");
|
||||
}
|
||||
|
||||
// enable interactive mode if reverse prompt or interactive start is specified
|
||||
if (params.antiprompt.size() != 0 || params.interactive_first) {
|
||||
params.interactive = true;
|
||||
}
|
||||
|
||||
// determine newline token
|
||||
auto gptneox_token_newline = ::gptneox_tokenize(ctx, "\n", false);
|
||||
|
||||
if (params.verbose_prompt) {
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
|
||||
fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
|
||||
for (int i = 0; i < (int) embd_inp.size(); i++) {
|
||||
fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], gptneox_token_to_str(ctx, embd_inp[i]));
|
||||
}
|
||||
if (params.n_keep > 0) {
|
||||
fprintf(stderr, "%s: static prompt based on n_keep: '", __func__);
|
||||
for (int i = 0; i < params.n_keep; i++) {
|
||||
fprintf(stderr, "%s", gptneox_token_to_str(ctx, embd_inp[i]));
|
||||
}
|
||||
fprintf(stderr, "'\n");
|
||||
}
|
||||
fprintf(stderr, "\n");
|
||||
}
|
||||
|
||||
if (params.interactive) {
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
|
||||
struct sigaction sigint_action;
|
||||
sigint_action.sa_handler = sigint_handler;
|
||||
sigemptyset (&sigint_action.sa_mask);
|
||||
sigint_action.sa_flags = 0;
|
||||
sigaction(SIGINT, &sigint_action, NULL);
|
||||
#elif defined (_WIN32)
|
||||
signal(SIGINT, sigint_handler);
|
||||
#endif
|
||||
|
||||
fprintf(stderr, "%s: interactive mode on.\n", __func__);
|
||||
|
||||
if (params.antiprompt.size()) {
|
||||
for (auto antiprompt : params.antiprompt) {
|
||||
fprintf(stderr, "Reverse prompt: '%s'\n", antiprompt.c_str());
|
||||
}
|
||||
}
|
||||
|
||||
if (!params.input_prefix.empty()) {
|
||||
fprintf(stderr, "Input prefix: '%s'\n", params.input_prefix.c_str());
|
||||
}
|
||||
}
|
||||
fprintf(stderr, "sampling: repeat_last_n = %d, repeat_penalty = %f, presence_penalty = %f, frequency_penalty = %f, top_k = %d, tfs_z = %f, top_p = %f, typical_p = %f, temp = %f, mirostat = %d, mirostat_lr = %f, mirostat_ent = %f\n",
|
||||
params.repeat_last_n, params.repeat_penalty, params.presence_penalty, params.frequency_penalty, params.top_k, params.tfs_z, params.top_p, params.typical_p, params.temp, params.mirostat, params.mirostat_eta, params.mirostat_tau);
|
||||
fprintf(stderr, "generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", n_ctx, params.n_batch, params.n_predict, params.n_keep);
|
||||
fprintf(stderr, "\n\n");
|
||||
|
||||
// TODO: replace with ring-buffer
|
||||
std::vector<gptneox_token> last_n_tokens(n_ctx);
|
||||
std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
|
||||
|
||||
if (params.interactive) {
|
||||
fprintf(stderr, "== Running in interactive mode. ==\n"
|
||||
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
|
||||
" - Press Ctrl+C to interject at any time.\n"
|
||||
#endif
|
||||
" - Press Return to return control to RedPajama.\n"
|
||||
" - If you want to submit another line, end your input in '\\'.\n\n");
|
||||
is_interacting = params.interactive_first;
|
||||
}
|
||||
|
||||
bool is_antiprompt = false;
|
||||
bool input_noecho = false;
|
||||
|
||||
// HACK - because session saving incurs a non-negligible delay, for now skip re-saving session
|
||||
// if we loaded a session with at least 75% similarity. It's currently just used to speed up the
|
||||
// initial prompt so it doesn't need to be an exact match.
|
||||
bool need_to_save_session = !path_session.empty() && n_matching_session_tokens < (embd_inp.size() * 3 / 4);
|
||||
|
||||
|
||||
int n_past = 0;
|
||||
int n_remain = params.n_predict;
|
||||
int n_consumed = 0;
|
||||
int n_session_consumed = 0;
|
||||
|
||||
// the first thing we will do is to output the prompt, so set color accordingly
|
||||
set_console_color(con_st, CONSOLE_COLOR_PROMPT);
|
||||
|
||||
std::vector<gptneox_token> embd;
|
||||
|
||||
while (n_remain != 0 || params.interactive) {
|
||||
// predict
|
||||
if (embd.size() > 0) {
|
||||
// infinite text generation via context swapping
|
||||
// if we run out of context:
|
||||
// - take the n_keep first tokens from the original prompt (via n_past)
|
||||
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
|
||||
if (n_past + (int) embd.size() > n_ctx) {
|
||||
const int n_left = n_past - params.n_keep;
|
||||
|
||||
n_past = params.n_keep;
|
||||
|
||||
// insert n_left/2 tokens at the start of embd from last_n_tokens
|
||||
embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
|
||||
|
||||
// REVIEW - stop saving session if we run out of context
|
||||
path_session = "";
|
||||
|
||||
//printf("\n---\n");
|
||||
//printf("resetting: '");
|
||||
//for (int i = 0; i < (int) embd.size(); i++) {
|
||||
// printf("%s", gptneox_token_to_str(ctx, embd[i]));
|
||||
//}
|
||||
//printf("'\n");
|
||||
//printf("\n---\n");
|
||||
}
|
||||
|
||||
// try to reuse a matching prefix from the loaded session instead of re-eval (via n_past)
|
||||
// REVIEW
|
||||
if (n_session_consumed < (int) session_tokens.size()) {
|
||||
size_t i = 0;
|
||||
for ( ; i < embd.size(); i++) {
|
||||
if (embd[i] != session_tokens[n_session_consumed]) {
|
||||
session_tokens.resize(n_session_consumed);
|
||||
break;
|
||||
}
|
||||
|
||||
n_past++;
|
||||
n_session_consumed++;
|
||||
|
||||
if (n_session_consumed >= (int) session_tokens.size()) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (i > 0) {
|
||||
embd.erase(embd.begin(), embd.begin() + i);
|
||||
}
|
||||
}
|
||||
|
||||
// evaluate tokens in batches
|
||||
// embd is typically prepared beforehand to fit within a batch, but not always
|
||||
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
|
||||
int n_eval = (int) embd.size() - i;
|
||||
if (n_eval > params.n_batch) {
|
||||
n_eval = params.n_batch;
|
||||
}
|
||||
if (gptneox_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
|
||||
fprintf(stderr, "%s : failed to eval\n", __func__);
|
||||
return 1;
|
||||
}
|
||||
n_past += n_eval;
|
||||
}
|
||||
|
||||
if (embd.size() > 0 && !path_session.empty()) {
|
||||
session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
|
||||
n_session_consumed = session_tokens.size();
|
||||
}
|
||||
}
|
||||
|
||||
embd.clear();
|
||||
|
||||
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
|
||||
// out of user input, sample next token
|
||||
const float temp = params.temp;
|
||||
const int32_t top_k = params.top_k <= 0 ? gptneox_n_vocab(ctx) : params.top_k;
|
||||
const float top_p = params.top_p;
|
||||
const float tfs_z = params.tfs_z;
|
||||
const float typical_p = params.typical_p;
|
||||
const int32_t repeat_last_n = params.repeat_last_n < 0 ? n_ctx : params.repeat_last_n;
|
||||
const float repeat_penalty = params.repeat_penalty;
|
||||
const float alpha_presence = params.presence_penalty;
|
||||
const float alpha_frequency = params.frequency_penalty;
|
||||
const int mirostat = params.mirostat;
|
||||
const float mirostat_tau = params.mirostat_tau;
|
||||
const float mirostat_eta = params.mirostat_eta;
|
||||
const bool penalize_nl = params.penalize_nl;
|
||||
|
||||
// optionally save the session on first sample (for faster prompt loading next time)
|
||||
if (!path_session.empty() && need_to_save_session) {
|
||||
need_to_save_session = false;
|
||||
gptneox_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
|
||||
}
|
||||
|
||||
gptneox_token id = 0;
|
||||
|
||||
{
|
||||
auto logits = gptneox_get_logits(ctx);
|
||||
auto n_vocab = gptneox_n_vocab(ctx);
|
||||
|
||||
// Apply params.logit_bias map
|
||||
for (auto it = params.logit_bias.begin(); it != params.logit_bias.end(); it++) {
|
||||
logits[it->first] += it->second;
|
||||
}
|
||||
|
||||
std::vector<gptneox_token_data> candidates;
|
||||
candidates.reserve(n_vocab);
|
||||
for (gptneox_token token_id = 0; token_id < n_vocab; token_id++) {
|
||||
candidates.emplace_back(gptneox_token_data{token_id, logits[token_id], 0.0f});
|
||||
}
|
||||
|
||||
gptneox_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
|
||||
|
||||
// Apply penalties
|
||||
gptneox_token nl_token = gptneox_str_to_token(ctx, "\n");
|
||||
float nl_logit = logits[nl_token];
|
||||
auto last_n_repeat = std::min(std::min((int)last_n_tokens.size(), repeat_last_n), n_ctx);
|
||||
gptneox_sample_repetition_penalty(ctx, &candidates_p,
|
||||
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
|
||||
last_n_repeat, repeat_penalty);
|
||||
gptneox_sample_frequency_and_presence_penalties(ctx, &candidates_p,
|
||||
last_n_tokens.data() + last_n_tokens.size() - last_n_repeat,
|
||||
last_n_repeat, alpha_frequency, alpha_presence);
|
||||
if (!penalize_nl) {
|
||||
logits[nl_token] = nl_logit;
|
||||
}
|
||||
|
||||
if (temp <= 0) {
|
||||
// Greedy sampling
|
||||
id = gptneox_sample_token_greedy(ctx, &candidates_p);
|
||||
} else {
|
||||
if (mirostat == 1) {
|
||||
static float mirostat_mu = 2.0f * mirostat_tau;
|
||||
const int mirostat_m = 100;
|
||||
gptneox_sample_temperature(ctx, &candidates_p, temp);
|
||||
id = gptneox_sample_token_mirostat(ctx, &candidates_p, mirostat_tau, mirostat_eta, mirostat_m, &mirostat_mu);
|
||||
} else if (mirostat == 2) {
|
||||
static float mirostat_mu = 2.0f * mirostat_tau;
|
||||
gptneox_sample_temperature(ctx, &candidates_p, temp);
|
||||
id = gptneox_sample_token_mirostat_v2(ctx, &candidates_p, mirostat_tau, mirostat_eta, &mirostat_mu);
|
||||
} else {
|
||||
// Temperature sampling
|
||||
gptneox_sample_top_k(ctx, &candidates_p, top_k, 1);
|
||||
gptneox_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
|
||||
gptneox_sample_typical(ctx, &candidates_p, typical_p, 1);
|
||||
gptneox_sample_top_p(ctx, &candidates_p, top_p, 1);
|
||||
gptneox_sample_temperature(ctx, &candidates_p, temp);
|
||||
id = gptneox_sample_token(ctx, &candidates_p);
|
||||
}
|
||||
}
|
||||
// printf("`%d`", candidates_p.size);
|
||||
|
||||
last_n_tokens.erase(last_n_tokens.begin());
|
||||
last_n_tokens.push_back(id);
|
||||
}
|
||||
|
||||
// replace end of text token with newline token when in interactive mode
|
||||
if (id == gptneox_token_eos() && params.interactive && !params.instruct) {
|
||||
id = gptneox_token_newline.front();
|
||||
if (params.antiprompt.size() != 0) {
|
||||
// tokenize and inject first reverse prompt
|
||||
const auto first_antiprompt = ::gptneox_tokenize(ctx, params.antiprompt.front(), false);
|
||||
embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
|
||||
}
|
||||
}
|
||||
|
||||
// add it to the context
|
||||
embd.push_back(id);
|
||||
|
||||
// echo this to console
|
||||
input_noecho = false;
|
||||
|
||||
// decrement remaining sampling budget
|
||||
--n_remain;
|
||||
} else {
|
||||
// some user input remains from prompt or interaction, forward it to processing
|
||||
while ((int) embd_inp.size() > n_consumed) {
|
||||
embd.push_back(embd_inp[n_consumed]);
|
||||
last_n_tokens.erase(last_n_tokens.begin());
|
||||
last_n_tokens.push_back(embd_inp[n_consumed]);
|
||||
++n_consumed;
|
||||
if ((int) embd.size() >= params.n_batch) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// display text
|
||||
if (!input_noecho) {
|
||||
for (auto id : embd) {
|
||||
printf("%s", gptneox_token_to_str(ctx, id));
|
||||
}
|
||||
fflush(stdout);
|
||||
}
|
||||
// reset color to default if we there is no pending user input
|
||||
if (!input_noecho && (int)embd_inp.size() == n_consumed) {
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
}
|
||||
|
||||
// in interactive mode, and not currently processing queued inputs;
|
||||
// check if we should prompt the user for more
|
||||
if (params.interactive && (int) embd_inp.size() <= n_consumed) {
|
||||
|
||||
// check for reverse prompt
|
||||
if (params.antiprompt.size()) {
|
||||
std::string last_output;
|
||||
for (auto id : last_n_tokens) {
|
||||
last_output += gptneox_token_to_str(ctx, id);
|
||||
}
|
||||
|
||||
is_antiprompt = false;
|
||||
// Check if each of the reverse prompts appears at the end of the output.
|
||||
for (std::string & antiprompt : params.antiprompt) {
|
||||
if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
|
||||
is_interacting = true;
|
||||
is_antiprompt = true;
|
||||
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
|
||||
fflush(stdout);
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (n_past > 0 && is_interacting) {
|
||||
// potentially set color to indicate we are taking user input
|
||||
set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
|
||||
|
||||
#if defined (_WIN32)
|
||||
// Windows: must reactivate sigint handler after each signal
|
||||
signal(SIGINT, sigint_handler);
|
||||
#endif
|
||||
|
||||
if (params.instruct) {
|
||||
printf("\n> ");
|
||||
}
|
||||
|
||||
std::string buffer;
|
||||
if (!params.input_prefix.empty()) {
|
||||
buffer += params.input_prefix;
|
||||
printf("%s", buffer.c_str());
|
||||
}
|
||||
|
||||
std::string line;
|
||||
bool another_line = true;
|
||||
do {
|
||||
#if defined(_WIN32)
|
||||
std::wstring wline;
|
||||
if (!std::getline(std::wcin, wline)) {
|
||||
// input stream is bad or EOF received
|
||||
return 0;
|
||||
}
|
||||
win32_utf8_encode(wline, line);
|
||||
#else
|
||||
if (!std::getline(std::cin, line)) {
|
||||
// input stream is bad or EOF received
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
if (line.empty() || line.back() != '\\') {
|
||||
another_line = false;
|
||||
} else {
|
||||
line.pop_back(); // Remove the continue character
|
||||
}
|
||||
buffer += line + '\n'; // Append the line to the result
|
||||
} while (another_line);
|
||||
|
||||
// done taking input, reset color
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
|
||||
// Add tokens to embd only if the input buffer is non-empty
|
||||
// Entering a empty line lets the user pass control back
|
||||
if (buffer.length() > 1) {
|
||||
|
||||
auto line_inp = ::gptneox_tokenize(ctx, buffer, false);
|
||||
embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
|
||||
n_remain -= line_inp.size();
|
||||
}
|
||||
|
||||
input_noecho = true; // do not echo this again
|
||||
}
|
||||
|
||||
if (n_past > 0) {
|
||||
is_interacting = false;
|
||||
}
|
||||
}
|
||||
|
||||
// end of text token
|
||||
if (!embd.empty() && embd.back() == gptneox_token_eos()) {
|
||||
if (params.instruct) {
|
||||
is_interacting = true;
|
||||
} else {
|
||||
fprintf(stderr, " [end of text]\n");
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
|
||||
if (params.interactive && n_remain <= 0 && params.n_predict != -1) {
|
||||
n_remain = params.n_predict;
|
||||
is_interacting = true;
|
||||
}
|
||||
}
|
||||
|
||||
#if defined (_WIN32)
|
||||
signal(SIGINT, SIG_DFL);
|
||||
#endif
|
||||
printf("\n\n");
|
||||
gptneox_print_timings(ctx);
|
||||
gptneox_free(ctx);
|
||||
|
||||
set_console_color(con_st, CONSOLE_COLOR_DEFAULT);
|
||||
|
||||
return 0;
|
||||
}
|
82
third_party/radpajama/quantize-gptneox.cpp
vendored
Normal file
82
third_party/radpajama/quantize-gptneox.cpp
vendored
Normal file
|
@ -0,0 +1,82 @@
|
|||
#include "ggml.h"
|
||||
#include "gptneox.h"
|
||||
|
||||
#include <cstdio>
|
||||
#include <map>
|
||||
#include <string>
|
||||
|
||||
static const std::map<std::string, enum gptneox_ftype> GPTNEOX_FTYPE_MAP = {
|
||||
{"q4_0", GPTNEOX_FTYPE_MOSTLY_Q4_0},
|
||||
{"q4_1", GPTNEOX_FTYPE_MOSTLY_Q4_1},
|
||||
{"q4_2", GPTNEOX_FTYPE_MOSTLY_Q4_2},
|
||||
//{"q4_3", GPTNEOX_FTYPE_MOSTLY_Q4_3},
|
||||
{"q5_0", GPTNEOX_FTYPE_MOSTLY_Q5_0},
|
||||
{"q5_1", GPTNEOX_FTYPE_MOSTLY_Q5_1},
|
||||
{"q8_0", GPTNEOX_FTYPE_MOSTLY_Q8_0},
|
||||
};
|
||||
|
||||
// usage:
|
||||
// ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
|
||||
//
|
||||
int main(int argc, char ** argv) {
|
||||
ggml_time_init();
|
||||
|
||||
if (argc < 4) {
|
||||
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]);
|
||||
for (auto it = GPTNEOX_FTYPE_MAP.begin(); it != GPTNEOX_FTYPE_MAP.end(); it++) {
|
||||
fprintf(stderr, " type = \"%s\" or %d\n", it->first.c_str(), it->second);
|
||||
}
|
||||
return 1;
|
||||
}
|
||||
|
||||
// needed to initialize f16 tables
|
||||
{
|
||||
struct ggml_init_params params = { 0, NULL, false };
|
||||
struct ggml_context * ctx = ggml_init(params);
|
||||
ggml_free(ctx);
|
||||
}
|
||||
|
||||
const std::string fname_inp = argv[1];
|
||||
const std::string fname_out = argv[2];
|
||||
|
||||
enum gptneox_ftype ftype;
|
||||
if (argv[3][0] == 'q') {
|
||||
auto it = GPTNEOX_FTYPE_MAP.find(argv[3]);
|
||||
if (it == GPTNEOX_FTYPE_MAP.end()) {
|
||||
fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
|
||||
return 1;
|
||||
}
|
||||
ftype = it->second;
|
||||
} else {
|
||||
ftype = (enum gptneox_ftype)atoi(argv[3]);
|
||||
}
|
||||
|
||||
int nthread = argc > 4 ? atoi(argv[4]) : 0;
|
||||
|
||||
const int64_t t_main_start_us = ggml_time_us();
|
||||
|
||||
int64_t t_quantize_us = 0;
|
||||
|
||||
// load the model
|
||||
{
|
||||
const int64_t t_start_us = ggml_time_us();
|
||||
|
||||
if (gptneox_model_quantize(fname_inp.c_str(), fname_out.c_str(), ftype, nthread)) {
|
||||
fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
|
||||
return 1;
|
||||
}
|
||||
|
||||
t_quantize_us = ggml_time_us() - t_start_us;
|
||||
}
|
||||
|
||||
// report timing
|
||||
{
|
||||
const int64_t t_main_end_us = ggml_time_us();
|
||||
|
||||
printf("\n");
|
||||
printf("%s: quantize time = %8.2f ms\n", __func__, t_quantize_us/1000.0);
|
||||
printf("%s: total time = %8.2f ms\n", __func__, (t_main_end_us - t_main_start_us)/1000.0);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
144
third_party/radpajama/scripts/convert_gptneox_to_ggml.py
vendored
Normal file
144
third_party/radpajama/scripts/convert_gptneox_to_ggml.py
vendored
Normal file
|
@ -0,0 +1,144 @@
|
|||
# Convert Hugging Face fine-tuned gpt-neox-like models to ggml format
|
||||
|
||||
import io
|
||||
import os
|
||||
import sys
|
||||
import struct
|
||||
import json
|
||||
import code
|
||||
import torch
|
||||
import numpy as np
|
||||
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
|
||||
def bytes_to_unicode():
|
||||
"""
|
||||
Returns list of utf-8 byte and a corresponding list of unicode strings.
|
||||
The reversible bpe codes work on unicode strings.
|
||||
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
|
||||
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
|
||||
This is a significant percentage of your normal, say, 32K bpe vocab.
|
||||
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
|
||||
And avoids mapping to whitespace/control characters the bpe code barfs on.
|
||||
"""
|
||||
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
|
||||
cs = bs[:]
|
||||
n = 0
|
||||
for b in range(2**8):
|
||||
if b not in bs:
|
||||
bs.append(b)
|
||||
cs.append(2**8+n)
|
||||
n += 1
|
||||
cs = [chr(n) for n in cs]
|
||||
return dict(zip(bs, cs))
|
||||
|
||||
if len(sys.argv) < 3:
|
||||
print("Usage: python convert-hf-to-ggml.py model_name dir-output [use-f32]")
|
||||
print(" model_name: name of the model to convert. Example: 'bigscience/bloomz-560m'")
|
||||
print(" dir-output: directory where the output file will be written")
|
||||
print(" use-f32: if present, use float32 instead of float16")
|
||||
sys.exit(1)
|
||||
|
||||
model_name = sys.argv[1]
|
||||
dir_out = sys.argv[2]
|
||||
model_cache_dir = dir_out + "-cache"
|
||||
|
||||
# make sure the output directory exists
|
||||
os.makedirs(dir_out, exist_ok=True)
|
||||
|
||||
# possible data types
|
||||
# ftype == 0 -> float32
|
||||
# ftype == 1 -> float16
|
||||
#
|
||||
# map from ftype to string
|
||||
ftype_str = ["f32", "f16"]
|
||||
ftype = 1
|
||||
if len(sys.argv) > 3:
|
||||
ftype = 0
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
print("Loading model: ", model_name)
|
||||
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if ftype == 1 else torch.float32,
|
||||
cache_dir=model_cache_dir)
|
||||
model.eval()
|
||||
for p in model.parameters():
|
||||
p.requires_grad = False
|
||||
hparams = model.config.to_dict()
|
||||
print("Model loaded: ", model_name)
|
||||
|
||||
fn_bin = f"/ggml-{model_name.split('/')[-1]}-{ftype_str[ftype]}.bin"
|
||||
fn_out = dir_out + fn_bin
|
||||
fout = open(fn_out, "wb")
|
||||
|
||||
ggml_file_magic = 0x67676d66 # 0x67676d6c is unversioned
|
||||
ggml_file_version = 0x00000001 # v1
|
||||
|
||||
hparams["multiple_of"] = 1
|
||||
fout.write(struct.pack("i", ggml_file_magic)) # magic: ggmf in hex
|
||||
fout.write(struct.pack("i", ggml_file_version))
|
||||
fout.write(struct.pack("i", hparams["vocab_size"]))
|
||||
fout.write(struct.pack("i", hparams["max_position_embeddings"]))
|
||||
fout.write(struct.pack("i", hparams["hidden_size"]))
|
||||
fout.write(struct.pack("i", hparams["num_attention_heads"]))
|
||||
fout.write(struct.pack("i", hparams["num_hidden_layers"]))
|
||||
fout.write(struct.pack("i", int((hparams["hidden_size"] / hparams["num_attention_heads"]
|
||||
) * hparams["rotary_pct"]))) # rotary_dim
|
||||
fout.write(struct.pack("i", int(hparams["use_parallel_residual"])))
|
||||
fout.write(struct.pack("i", ftype))
|
||||
|
||||
# Is this correct??
|
||||
dot_token = tokenizer.encode(".")[0]
|
||||
for i in range(hparams["vocab_size"]):
|
||||
text = tokenizer.decode([i]).encode('utf-8')
|
||||
fout.write(struct.pack("i", len(text)))
|
||||
fout.write(text)
|
||||
|
||||
list_vars = model.state_dict()
|
||||
|
||||
print(hparams)
|
||||
|
||||
for name in list_vars.keys():
|
||||
if name.startswith('gpt_neox.layers.'):
|
||||
if 'attention.masked_bias' in name or \
|
||||
'attention.rotary_emb.inv_freq' in name or \
|
||||
'attention.bias' in name:
|
||||
continue
|
||||
# No gradients for these
|
||||
list_vars[name].requires_grad = False
|
||||
src = name
|
||||
nn = name
|
||||
|
||||
print(src, ' -> ', name)
|
||||
data = list_vars[src].squeeze().numpy()
|
||||
data = data.astype(np.float32)
|
||||
|
||||
n_dims = len(data.shape)
|
||||
print(name, n_dims, data.shape)
|
||||
|
||||
# default type is fp32
|
||||
ftype_cur = 0
|
||||
if ftype == 1 and n_dims > 1:
|
||||
print(" Converting to float16", data.shape, data[:3, :3].tolist())
|
||||
data = data.astype(np.float16)
|
||||
ftype_cur = 1
|
||||
else:
|
||||
print(" Converting to float32", data.shape,
|
||||
data[:3, :3].tolist() if n_dims > 1 else data[:3].tolist())
|
||||
data = data.astype(np.float32)
|
||||
|
||||
# header
|
||||
str = name.encode('utf-8')
|
||||
fout.write(struct.pack("iii", n_dims, len(str), ftype_cur))
|
||||
for i in range(n_dims):
|
||||
fout.write(struct.pack("i", data.shape[n_dims - 1 - i]))
|
||||
print(str)
|
||||
fout.write(str)
|
||||
|
||||
# data
|
||||
data.tofile(fout)
|
||||
|
||||
fout.close()
|
||||
|
||||
print("Done. Output file: " + fn_out)
|
||||
print("")
|
21
third_party/radpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
vendored
Normal file
21
third_party/radpajama/scripts/install-RedPajama-INCITE-Base-3B-v1.sh
vendored
Normal file
|
@ -0,0 +1,21 @@
|
|||
#!/bin/bash
|
||||
|
||||
# cd to scripts dir
|
||||
cd `dirname $0`
|
||||
|
||||
# download model to models dir
|
||||
echo "Downloading model"
|
||||
python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Base-3B-v1 ../models/pythia
|
||||
|
||||
# remove temp cache dir
|
||||
echo "Removing temp cache dir"
|
||||
rm -r ../models/pythia-cache
|
||||
|
||||
# quantize model
|
||||
echo "Quantizing model (q4_0)"
|
||||
cd ../../..
|
||||
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Base-3B-v1-f16.bin
|
||||
|
||||
|
||||
# done!
|
||||
echo "Done."
|
21
third_party/radpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
vendored
Normal file
21
third_party/radpajama/scripts/install-RedPajama-INCITE-Chat-3B-v1.sh
vendored
Normal file
|
@ -0,0 +1,21 @@
|
|||
#!/bin/bash
|
||||
|
||||
# cd to scripts dir
|
||||
cd `dirname $0`
|
||||
|
||||
# download model to models dir
|
||||
echo "Downloading model"
|
||||
python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Chat-3B-v1 ../models/pythia
|
||||
|
||||
# remove temp cache dir
|
||||
echo "Removing temp cache dir"
|
||||
rm -r ../models/pythia-cache
|
||||
|
||||
# quantize model
|
||||
echo "Quantizing model (q4_0)"
|
||||
cd ../../..
|
||||
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Chat-3B-v1-f16.bin
|
||||
|
||||
|
||||
# done!
|
||||
echo "Done."
|
21
third_party/radpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
vendored
Normal file
21
third_party/radpajama/scripts/install-RedPajama-INCITE-Instruct-3B-v1.sh
vendored
Normal file
|
@ -0,0 +1,21 @@
|
|||
#!/bin/bash
|
||||
|
||||
# cd to scripts dir
|
||||
cd `dirname $0`
|
||||
|
||||
# download model to models dir
|
||||
echo "Downloading model"
|
||||
python ./convert_gptneox_to_ggml.py togethercomputer/RedPajama-INCITE-Instruct-3B-v1 ../models/pythia
|
||||
|
||||
# remove temp cache dir
|
||||
echo "Removing temp cache dir"
|
||||
rm -r ../models/pythia-cache
|
||||
|
||||
# quantize model
|
||||
echo "Quantizing model (q4_0)"
|
||||
cd ../../..
|
||||
python ./examples/redpajama/scripts/quantize-gptneox.py ./examples/redpajama/models/pythia/ggml-RedPajama-INCITE-Instruct-3B-v1-f16.bin
|
||||
|
||||
|
||||
# done!
|
||||
echo "Done."
|
141
third_party/radpajama/scripts/quantize-gptneox.py
vendored
Normal file
141
third_party/radpajama/scripts/quantize-gptneox.py
vendored
Normal file
|
@ -0,0 +1,141 @@
|
|||
#!/usr/bin/env python3
|
||||
|
||||
"""Script to execute the "quantize" script on a given set of models."""
|
||||
|
||||
import subprocess
|
||||
import argparse
|
||||
import glob
|
||||
import sys
|
||||
import os
|
||||
|
||||
|
||||
def main():
|
||||
"""Update the quantize binary name depending on the platform and parse
|
||||
the command line arguments and execute the script.
|
||||
"""
|
||||
|
||||
if "linux" in sys.platform or "darwin" in sys.platform:
|
||||
quantize_script_binary = "quantize-gptneox"
|
||||
|
||||
elif "win32" in sys.platform or "cygwin" in sys.platform:
|
||||
quantize_script_binary = "quantize-gptneox.exe"
|
||||
|
||||
else:
|
||||
print("WARNING: Unknown platform. Assuming a UNIX-like OS.\n")
|
||||
quantize_script_binary = "quantize-gptneox"
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
prog='python3 quantize-gptneox.py',
|
||||
description='This script quantizes the given models by applying the '
|
||||
f'"{quantize_script_binary}" script on them.'
|
||||
)
|
||||
parser.add_argument('model_path')
|
||||
#parser.add_argument(
|
||||
# 'models', nargs='+', choices=('7B', '13B', '30B', '65B'),
|
||||
# help='The models to quantize.'
|
||||
#)
|
||||
parser.add_argument(
|
||||
'-r', '--remove-16', action='store_true', dest='remove_f16',
|
||||
help='Remove the f16 model after quantizing it.'
|
||||
)
|
||||
#parser.add_argument(
|
||||
# '-m', '--models-path', dest='models_path',
|
||||
# default=os.path.join(os.getcwd(), "models"),
|
||||
# help='Specify the directory where the models are located.'
|
||||
#)
|
||||
parser.add_argument(
|
||||
'-q', '--quantize-script-path', dest='quantize_script_path',
|
||||
default=os.path.join(os.getcwd(), quantize_script_binary),
|
||||
help='Specify the path to the "quantize" script.'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--quantize-output-type', dest='quantize_output_type', type=str,
|
||||
default='q4_0',
|
||||
help='Specify the path to the "quantize" script.'
|
||||
)
|
||||
|
||||
|
||||
# TODO: Revise this code
|
||||
# parser.add_argument(
|
||||
# '-t', '--threads', dest='threads', type='int',
|
||||
# default=os.cpu_count(),
|
||||
# help='Specify the number of threads to use to quantize many models at '
|
||||
# 'once. Defaults to os.cpu_count().'
|
||||
# )
|
||||
|
||||
args = parser.parse_args()
|
||||
args.model_path = os.path.abspath(args.model_path)
|
||||
#args.models_path = os.path.abspath(args.models_path)
|
||||
|
||||
if not os.path.isfile(args.quantize_script_path):
|
||||
print(
|
||||
f'The "{quantize_script_binary}" script was not found in the '
|
||||
"current location.\nIf you want to use it from another location, "
|
||||
"set the --quantize-script-path argument from the command line."
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
#for model in args.models:
|
||||
# The model is separated in various parts
|
||||
# (ggml-model-f16.bin, ggml-model-f16.bin.0, ggml-model-f16.bin.1...)
|
||||
#f16_model_path_base = os.path.join(
|
||||
# args.models_path, model, "ggml-model-f16.bin"
|
||||
#)
|
||||
f16_model_path_base = args.model_path
|
||||
|
||||
if not os.path.isfile(f16_model_path_base):
|
||||
print(f'The file %s was not found' % f16_model_path_base)
|
||||
sys.exit(1)
|
||||
|
||||
f16_model_parts_paths = map(
|
||||
lambda filename: os.path.join(f16_model_path_base, filename),
|
||||
glob.glob(f"{f16_model_path_base}*")
|
||||
)
|
||||
|
||||
for f16_model_part_path in f16_model_parts_paths:
|
||||
if not os.path.isfile(f16_model_part_path):
|
||||
print(
|
||||
f"The f16 model {os.path.basename(f16_model_part_path)} "
|
||||
f"was not found in {args.models_path}{os.path.sep}"
|
||||
". If you want to use it from another location, set the "
|
||||
"--models-path argument from the command line."
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
__run_quantize_script(
|
||||
args.quantize_script_path, f16_model_part_path, args.quantize_output_type
|
||||
)
|
||||
|
||||
if args.remove_f16:
|
||||
os.remove(f16_model_part_path)
|
||||
|
||||
|
||||
# This was extracted to a top-level function for parallelization, if
|
||||
# implemented. See https://github.com/ggerganov/llama.cpp/pull/222/commits/f8db3d6cd91bf1a1342db9d29e3092bc12dd783c#r1140496406
|
||||
|
||||
def __run_quantize_script(script_path, f16_model_part_path, quantize_output_type):
|
||||
"""Run the quantize script specifying the path to it and the path to the
|
||||
f16 model to quantize.
|
||||
"""
|
||||
|
||||
new_quantized_model_path = f16_model_part_path.replace("f16", quantize_output_type)
|
||||
subprocess.run(
|
||||
[script_path, f16_model_part_path, new_quantized_model_path, quantize_output_type],
|
||||
check=True
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
main()
|
||||
|
||||
except subprocess.CalledProcessError:
|
||||
print("\nAn error ocurred while trying to quantize the models.")
|
||||
sys.exit(1)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
sys.exit(0)
|
||||
|
||||
else:
|
||||
print("\nSuccesfully quantized all models.")
|
Loading…
Add table
Add a link
Reference in a new issue