Online GPU slicing (#11)
* move gpu slicing python code into a module
* remove dead code in exporting gpu split
* streamline solver and export with one entrypoint
* new powerinfer.py module
* wip: invoke Python to generate gpu split on the fly
* wip: load gpu split on demand
* wip: new gpu split file format
* wip: generate and load new gpu idx format
* wip: generate and load gpu index on the fly
* minor: calculate total VRAM offloading via FFN splitting
* add option to disable gpu index
* bugfix
* wip: bug fix for segmentation fault
* bugfix
* bugfix and testing
* temporary fix for neuron factor in solving
* fix: generated gpu idx path
* Update README about gpu index
parent ded0613bd4
commit bb486b88e1
16 changed files with 419 additions and 481 deletions
27
README.md
|
@ -71,6 +71,7 @@ And new features coming soon:
|
||||||
```bash
|
```bash
|
||||||
git clone https://github.com/SJTU-IPADS/PowerInfer
|
git clone https://github.com/SJTU-IPADS/PowerInfer
|
||||||
cd PowerInfer
|
cd PowerInfer
|
||||||
|
pip install -r requirements.txt # install Python helpers' dependencies
|
||||||
```
|
```
|
||||||
### Build
|
### Build
|
||||||
In order to build PowerInfer you have two different options. These commands are supposed to be run from the root directory of the project.
|
In order to build PowerInfer you have two different options. These commands are supposed to be run from the root directory of the project.
|
||||||
|
@ -89,7 +90,8 @@ cmake --build build --config Release
|
||||||
|
|
||||||
## Model Weights
|
## Model Weights
|
||||||
|
|
||||||
PowerInfer models are stored in a special format called *PowerInfer GGUF* based on GGUF format, consisting of both LLM weights and predictor weights. You can download PowerInfer GGUF weights from Hugging Face or convert them from the original model weights and predictor weights.
|
PowerInfer models are stored in a special format called *PowerInfer GGUF* based on GGUF format, consisting of both LLM weights and predictor weights.
|
||||||
|
You can obtain PowerInfer GGUF weights (`*.powerinfer.gguf`), together with the profiled model activation statistics used for 'hot'-neuron offloading (under `activation/`), from each Hugging Face model repo listed in the "PowerInfer GGUF Format" column. You can also convert them from the original model weights and predictor weights.
|
||||||
|
|
||||||
| Base Model | PowerInfer GGUF Format | Original Model | Predictor |
|
| Base Model | PowerInfer GGUF Format | Original Model | Predictor |
|
||||||
|------------|------------------|----------------|---------------------|
|
|------------|------------------|----------------|---------------------|
|
||||||
|
@ -102,14 +104,16 @@ PowerInfer models are stored in a special format called *PowerInfer GGUF* based
|
||||||
|
|
||||||
For CPU-only and CPU-GPU hybrid inference with all available VRAM, you can use the following instructions to run PowerInfer:
|
For CPU-only and CPU-GPU hybrid inference with all available VRAM, you can use the following instructions to run PowerInfer:
|
||||||
```bash
|
```bash
|
||||||
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
|
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
|
||||||
```
|
# ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
|
||||||
If you want to limit the VRAM usage of GPU:
|
|
||||||
```bash
|
|
||||||
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As for now, it requires an offline-generated "GPU index" file to split FFNs on GPU. And we found these files are hard to maintain and distribute. We will ship automatic FFN split based on VRAM capacity via [#11](https://github.com/SJTU-IPADS/PowerInfer/pull/11) very soon.
|
If you want to limit the VRAM usage of GPU:
|
||||||
|
```bash
|
||||||
|
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
|
||||||
|
# ./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8
|
||||||
|
```
|
||||||
|
Under CPU-GPU hybrid inference, PowerInfer will automatically offload all dense activation blocks to GPU and split FFN on GPU if possible.
|
||||||
|
|
||||||
## Evaluation
|
## Evaluation
|
||||||
|
|
||||||
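The two invocations above differ only in the optional `--vram-budget` flag. As a minimal sketch, a launcher script could assemble the argument list as follows. `build_main_command` and its parameter names are hypothetical helpers introduced for illustration; only the flags themselves (`-m`, `-n`, `-t`, `-p`, `--vram-budget`) come from the README.

```python
from typing import List, Optional


def build_main_command(model_path: str, n_tokens: int, n_threads: int, prompt: str,
                       vram_budget_gb: Optional[int] = None,
                       binary: str = "./build/bin/main") -> List[str]:
    """Assemble an argv list for PowerInfer's `main` binary (hypothetical helper)."""
    cmd = [binary, "-m", model_path, "-n", str(n_tokens),
           "-t", str(n_threads), "-p", prompt]
    if vram_budget_gb is not None:
        # Only pass --vram-budget when the caller wants to cap GPU memory usage;
        # without it, the README says all available VRAM is used.
        cmd += ["--vram-budget", str(vram_budget_gb)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Passing a list instead of a shell string avoids quoting problems with prompts that contain spaces.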
|
@ -119,6 +123,13 @@ As for now, it requires an offline-generated "GPU index" file to split FFNs on G
|
||||||
|
|
||||||
PowerInfer achieves up to 11x and 8x speedup for FP16 and INT4 models!
|
PowerInfer achieves up to 11x and 8x speedup for FP16 and INT4 models!
|
||||||
|
|
||||||
|
## FAQs
|
||||||
|
1. What if I encounter `CUDA_ERROR_OUT_OF_MEMORY`?
|
||||||
|
- Try running with the `--reset-gpu-index` argument to rebuild the GPU index for this model and avoid a stale cache.
|
||||||
|
- With the current implementation, the estimated model offloading may not be as accurate as expected. Try `--vram-budget` with a slightly lower value, or `--disable-gpu-index` to disable FFN offloading.
|
||||||
|
2. What if...
|
||||||
|
- Issues are welcome! Please feel free to open one and attach your running environment and running parameters. We will do our best to help you.
|
||||||
|
|
||||||
## TODOs
|
## TODOs
|
||||||
We will release the code and data in the following order, please stay tuned!
|
We will release the code and data in the following order, please stay tuned!
|
||||||
|
|
||||||
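The FAQ's advice for `CUDA_ERROR_OUT_OF_MEMORY` amounts to a short ladder of progressively more conservative retries. The sketch below only generates the argument lists and does not run anything. `oom_retry_plans` and `step_gb` are hypothetical names, and the numeric format accepted by `--vram-budget` is an assumption based on the integer shown in the README example.

```python
from typing import Iterator, List


def oom_retry_plans(base_args: List[str], vram_budget_gb: int,
                    step_gb: int = 1) -> Iterator[List[str]]:
    """Yield progressively more conservative argument lists (hypothetical helper)."""
    # 1. Rebuild the cached GPU index in case it is stale.
    yield base_args + ["--vram-budget", str(vram_budget_gb), "--reset-gpu-index"]
    # 2. Retry with a slightly lower VRAM budget.
    lowered = max(vram_budget_gb - step_gb, 0)
    yield base_args + ["--vram-budget", str(lowered)]
    # 3. Last resort: skip the GPU index, which disables FFN offloading.
    yield base_args + ["--disable-gpu-index"]
```

A caller would try each plan in order and stop at the first one that runs without an out-of-memory error.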
|
@ -130,7 +141,7 @@ We will release the code and data in the following order, please stay tuned!
|
||||||
- [ ] Support Metal for Mac
|
- [ ] Support Metal for Mac
|
||||||
- [ ] Release code for OPT models
|
- [ ] Release code for OPT models
|
||||||
- [ ] Release predictor training code
|
- [ ] Release predictor training code
|
||||||
- [ ] Support online split for FFN network
|
- [x] Support online split for FFN network
|
||||||
- [ ] Support Multi-GPU
|
- [ ] Support Multi-GPU
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -471,12 +471,10 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
params.lora_base = argv[i];
|
params.lora_base = argv[i];
|
||||||
} else if (arg == "--gpu-index") {
|
} else if (arg == "--reset-gpu-index") {
|
||||||
if (++i >= argc) {
|
params.reset_gpu_index = true;
|
||||||
invalid_param = true;
|
} else if (arg == "--disable-gpu-index") {
|
||||||
break;
|
params.disale_gpu_index = true;
|
||||||
}
|
|
||||||
params.gpu_index = argv[i];
|
|
||||||
} else if (arg == "--mmproj") {
|
} else if (arg == "--mmproj") {
|
||||||
if (++i >= argc) {
|
if (++i >= argc) {
|
||||||
invalid_param = true;
|
invalid_param = true;
|
||||||
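The hunk above replaces `--gpu-index`, which consumed the next argument as a file path, with two boolean switches that take no value. As an illustration of that parsing behavior only (not the project's C++ parser), the same interface can be sketched with Python's standard `argparse`. The `make_parser` helper and the `main-sketch` program name are hypothetical.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two boolean GPU-index switches."""
    parser = argparse.ArgumentParser(prog="main-sketch")
    # Boolean switches: their presence sets the field to True, no value follows.
    parser.add_argument("--reset-gpu-index", action="store_true",
                        help="refresh the GPU index file")
    parser.add_argument("--disable-gpu-index", action="store_true",
                        help="disable loading the GPU index and splitting the FFN")
    return parser


args = make_parser().parse_args(["--reset-gpu-index"])
print(args.reset_gpu_index, args.disable_gpu_index)  # True False
```

Both fields default to `False`, matching the defaults of the corresponding `gpt_params` members in this commit.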
|
@ -910,6 +908,8 @@ struct llama_model_params llama_model_params_from_gpt_params(const gpt_params &
|
||||||
mparams.tensor_split = params.tensor_split;
|
mparams.tensor_split = params.tensor_split;
|
||||||
mparams.use_mmap = params.use_mmap;
|
mparams.use_mmap = params.use_mmap;
|
||||||
mparams.use_mlock = params.use_mlock;
|
mparams.use_mlock = params.use_mlock;
|
||||||
|
mparams.reset_gpu_index = params.reset_gpu_index;
|
||||||
|
mparams.disable_gpu_index = params.disale_gpu_index;
|
||||||
|
|
||||||
return mparams;
|
return mparams;
|
||||||
}
|
}
|
||||||
|
@ -968,24 +968,6 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
|
||||||
return std::make_tuple(nullptr, nullptr);
|
return std::make_tuple(nullptr, nullptr);
|
||||||
}
|
}
|
||||||
|
|
||||||
if (llama_use_sparse_inference(model)) {
|
|
||||||
fprintf(stderr, "%s: postprocessing PowerInfer model '%s'\n", __func__, params.model.c_str());
|
|
||||||
if (!params.gpu_index.empty()) {
|
|
||||||
int err = llama_model_apply_gpu_idx_from_file(model, params.gpu_index.c_str(), true);
|
|
||||||
if (err != 0) {
|
|
||||||
fprintf(stderr, "%s: error: failed to apply mlp adapter\n", __func__);
|
|
||||||
llama_free_model(model);
|
|
||||||
return std::make_tuple(nullptr, nullptr);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (llama_model_apply_augmentation(model) != 0) {
|
|
||||||
fprintf(stderr, "%s: error: failed to apply augmentation\n", __func__);
|
|
||||||
llama_free_model(model);
|
|
||||||
return std::make_tuple(nullptr, nullptr);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
auto cparams = llama_context_params_from_gpt_params(params);
|
auto cparams = llama_context_params_from_gpt_params(params);
|
||||||
llama_context * lctx = llama_new_context_with_model(model, cparams);
|
llama_context * lctx = llama_new_context_with_model(model, cparams);
|
||||||
if (lctx == NULL) {
|
if (lctx == NULL) {
|
||||||
|
@ -1357,7 +1339,8 @@ void dump_non_result_info_yaml(FILE * stream, const gpt_params & params, const l
|
||||||
fprintf(stream, " - %s: %f\n", std::get<0>(la).c_str(), std::get<1>(la));
|
fprintf(stream, " - %s: %f\n", std::get<0>(la).c_str(), std::get<1>(la));
|
||||||
}
|
}
|
||||||
fprintf(stream, "lora_base: %s\n", params.lora_base.c_str());
|
fprintf(stream, "lora_base: %s\n", params.lora_base.c_str());
|
||||||
fprintf(stream, "gpu_index: %s\n", params.gpu_index.c_str());
|
fprintf(stream, "reset_gpu_index: %s\n", params.reset_gpu_index ? "true" : "false");
|
||||||
|
fprintf(stream, "disable_gpu_index: %s\n", params.disale_gpu_index? "true": "false");
|
||||||
fprintf(stream, "main_gpu: %d # default: 0\n", params.main_gpu);
|
fprintf(stream, "main_gpu: %d # default: 0\n", params.main_gpu);
|
||||||
fprintf(stream, "memory_f32: %s # default: false\n", !params.memory_f16 ? "true" : "false");
|
fprintf(stream, "memory_f32: %s # default: false\n", !params.memory_f16 ? "true" : "false");
|
||||||
fprintf(stream, "mirostat: %d # default: 0 (disabled)\n", sparams.mirostat);
|
fprintf(stream, "mirostat: %d # default: 0 (disabled)\n", sparams.mirostat);
|
||||||
|
|
|
@ -91,7 +91,8 @@ struct gpt_params {
|
||||||
std::vector<std::tuple<std::string, float>> lora_adapter; // lora adapter path with user defined scale
|
std::vector<std::tuple<std::string, float>> lora_adapter; // lora adapter path with user defined scale
|
||||||
std::string lora_base = ""; // base model path for the lora adapter
|
std::string lora_base = ""; // base model path for the lora adapter
|
||||||
|
|
||||||
std::string gpu_index = ""; // sparse activation mlp adapter path
|
bool reset_gpu_index = false; // refresh the gpu index file
|
||||||
|
bool disale_gpu_index = false; // disable loading gpu index and splitting ffn
|
||||||
|
|
||||||
int ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
|
int ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
|
||||||
int ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
|
int ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
|
||||||
|
|
|
@ -48,12 +48,11 @@ int main(int argc, char ** argv) {
|
||||||
params.n_threads = std::atoi(argv[6]);
|
params.n_threads = std::atoi(argv[6]);
|
||||||
}
|
}
|
||||||
|
|
||||||
if (argc >= 8) {
|
// For testing purposes, we always reset the GPU index
|
||||||
params.gpu_index = argv[7];
|
params.reset_gpu_index = true;
|
||||||
}
|
|
||||||
|
|
||||||
printf("params: model = %s, prompt = %s, n_parallel = %d, n_len = %d, n_gpu_layers = %d, n_threads = %d, gpu_index = %s\n",
|
printf("params: model = %s, prompt = %s, n_parallel = %d, n_len = %d, n_gpu_layers = %d, n_threads = %d, reset_gpu_index = true\n",
|
||||||
params.model.c_str(), params.prompt.c_str(), n_parallel, n_len, n_gpu_layers, params.n_threads, params.gpu_index.c_str());
|
params.model.c_str(), params.prompt.c_str(), n_parallel, n_len, n_gpu_layers, params.n_threads);
|
||||||
|
|
||||||
if (params.prompt.empty()) {
|
if (params.prompt.empty()) {
|
||||||
params.prompt = "Hello my name is";
|
params.prompt = "Hello my name is";
|
||||||
|
@ -76,21 +75,6 @@ int main(int argc, char ** argv) {
|
||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (!params.gpu_index.empty()) {
|
|
||||||
int err = llama_model_apply_gpu_idx_from_file(model, params.gpu_index.c_str(), true);
|
|
||||||
if (err != 0) {
|
|
||||||
fprintf(stderr, "%s: error: failed to apply mlp adapter\n", __func__);
|
|
||||||
llama_free_model(model);
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (llama_model_apply_augmentation(model) != 0) {
|
|
||||||
fprintf(stderr, "%s: error: failed to apply model augmentation\n", __func__);
|
|
||||||
llama_free_model(model);
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
|
|
||||||
// tokenize the prompt
|
// tokenize the prompt
|
||||||
|
|
||||||
std::vector<llama_token> tokens_list;
|
std::vector<llama_token> tokens_list;
|
||||||
|
|
8
ggml.c
|
@ -17497,7 +17497,7 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
|
||||||
}
|
}
|
||||||
|
|
||||||
const int n_threads = cplan->n_threads;
|
const int n_threads = cplan->n_threads;
|
||||||
#ifdef LLAMA_CUBLAS
|
#ifdef GGML_USE_CUBLAS
|
||||||
struct ggml_compute_state_shared state_shared = {
|
struct ggml_compute_state_shared state_shared = {
|
||||||
/*.cgraph =*/ cgraph,
|
/*.cgraph =*/ cgraph,
|
||||||
/*.cgraph_plan =*/ cplan,
|
/*.cgraph_plan =*/ cplan,
|
||||||
|
@ -17534,7 +17534,7 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
|
||||||
.ith = j,
|
.ith = j,
|
||||||
.shared = &state_shared,
|
.shared = &state_shared,
|
||||||
};
|
};
|
||||||
#ifdef LLAMA_CUBLAS
|
#ifdef GGML_USE_CUBLAS
|
||||||
const int rc = ggml_thread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread_hybrid, &workers[j]);
|
const int rc = ggml_thread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread_hybrid, &workers[j]);
|
||||||
#else
|
#else
|
||||||
const int rc = ggml_thread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
|
const int rc = ggml_thread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
|
||||||
|
@ -17551,7 +17551,8 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
|
||||||
const int64_t perf_start_time_us = ggml_perf_time_us();
|
const int64_t perf_start_time_us = ggml_perf_time_us();
|
||||||
|
|
||||||
// this is a work thread too
|
// this is a work thread too
|
||||||
#ifdef LLAMA_CUBLAS
|
|
||||||
|
#ifdef GGML_USE_CUBLAS
|
||||||
int compute_status = (size_t) ggml_graph_compute_thread_hybrid(&workers[0]);
|
int compute_status = (size_t) ggml_graph_compute_thread_hybrid(&workers[0]);
|
||||||
#else
|
#else
|
||||||
int compute_status = (size_t) ggml_graph_compute_thread(&workers[0]);
|
int compute_status = (size_t) ggml_graph_compute_thread(&workers[0]);
|
||||||
|
@ -19590,7 +19591,6 @@ struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_p
|
||||||
sparse_deriv = GGML_DENSE_INFERENCE;
|
sparse_deriv = GGML_DENSE_INFERENCE;
|
||||||
} else if (strncmp(magic, GGUF_POWERINFER_MAGIC, sizeof(magic)) == 0) {
|
} else if (strncmp(magic, GGUF_POWERINFER_MAGIC, sizeof(magic)) == 0) {
|
||||||
sparse_deriv = GGML_SPARSE_INFERENCE;
|
sparse_deriv = GGML_SPARSE_INFERENCE;
|
||||||
fprintf(stderr, "%s: PowerInfer derived model detected. Sparse inference will be used.\n", __func__);
|
|
||||||
} else {
|
} else {
|
||||||
fprintf(stderr, "%s: invalid magic characters %s.\n", __func__, magic);
|
fprintf(stderr, "%s: invalid magic characters %s.\n", __func__, magic);
|
||||||
fclose(file);
|
fclose(file);
|
||||||
|
|
|
@ -74,6 +74,9 @@ class Keys:
|
||||||
class PowerInfer:
|
class PowerInfer:
|
||||||
SPARSE_THRESHOLD = "powerinfer.sparse_threshold"
|
SPARSE_THRESHOLD = "powerinfer.sparse_threshold"
|
||||||
|
|
||||||
|
class Split:
|
||||||
|
VRAM_CAPACITY = "split.vram_capacity"
|
||||||
|
|
||||||
|
|
||||||
#
|
#
|
||||||
# recommended mapping of model tensor names for storage in gguf
|
# recommended mapping of model tensor names for storage in gguf
|
||||||
|
@ -385,6 +388,9 @@ class GGMLQuantizationType(IntEnum):
|
||||||
Q5_K = 13
|
Q5_K = 13
|
||||||
Q6_K = 14
|
Q6_K = 14
|
||||||
Q8_K = 15
|
Q8_K = 15
|
||||||
|
I8 = 16,
|
||||||
|
I16 = 17
|
||||||
|
I32 = 18,
|
||||||
|
|
||||||
|
|
||||||
class GGUFEndian(IntEnum):
|
class GGUFEndian(IntEnum):
|
||||||
|
|
323
llama.cpp
|
@ -61,6 +61,7 @@
|
||||||
#include <cstdio>
|
#include <cstdio>
|
||||||
#include <cstring>
|
#include <cstring>
|
||||||
#include <ctime>
|
#include <ctime>
|
||||||
|
#include <libgen.h>
|
||||||
#include <forward_list>
|
#include <forward_list>
|
||||||
#include <fstream>
|
#include <fstream>
|
||||||
#include <functional>
|
#include <functional>
|
||||||
|
@ -216,6 +217,8 @@ static std::map<llm_arch, std::string> LLM_ARCH_NAMES = {
|
||||||
{ LLM_ARCH_REFACT, "refact" },
|
{ LLM_ARCH_REFACT, "refact" },
|
||||||
{ LLM_ARCH_BLOOM, "bloom" },
|
{ LLM_ARCH_BLOOM, "bloom" },
|
||||||
{ LLM_ARCH_STABLELM, "stablelm" },
|
{ LLM_ARCH_STABLELM, "stablelm" },
|
||||||
|
|
||||||
|
{ LLM_ARCH_UNKNOWN, "unknown" },
|
||||||
};
|
};
|
||||||
|
|
||||||
enum llm_kv {
|
enum llm_kv {
|
||||||
|
@ -266,6 +269,8 @@ enum llm_kv {
|
||||||
LLM_KV_TOKENIZER_RWKV,
|
LLM_KV_TOKENIZER_RWKV,
|
||||||
|
|
||||||
LLM_KV_SPARSE_THRESHOLD,
|
LLM_KV_SPARSE_THRESHOLD,
|
||||||
|
|
||||||
|
LLM_KV_SPLIT_VRAM_CAPACITY,
|
||||||
};
|
};
|
||||||
|
|
||||||
static std::map<llm_kv, std::string> LLM_KV_NAMES = {
|
static std::map<llm_kv, std::string> LLM_KV_NAMES = {
|
||||||
|
@ -316,6 +321,8 @@ static std::map<llm_kv, std::string> LLM_KV_NAMES = {
|
||||||
{ LLM_KV_TOKENIZER_RWKV, "tokenizer.rwkv.world" },
|
{ LLM_KV_TOKENIZER_RWKV, "tokenizer.rwkv.world" },
|
||||||
|
|
||||||
{ LLM_KV_SPARSE_THRESHOLD, "powerinfer.sparse_threshold" },
|
{ LLM_KV_SPARSE_THRESHOLD, "powerinfer.sparse_threshold" },
|
||||||
|
|
||||||
|
{ LLM_KV_SPLIT_VRAM_CAPACITY, "split.vram_capacity" },
|
||||||
};
|
};
|
||||||
|
|
||||||
struct LLM_KV {
|
struct LLM_KV {
|
||||||
|
@ -756,9 +763,10 @@ struct llama_buffer {
|
||||||
struct llama_file {
|
struct llama_file {
|
||||||
// use FILE * so we don't have to re-open the file to mmap
|
// use FILE * so we don't have to re-open the file to mmap
|
||||||
FILE * fp;
|
FILE * fp;
|
||||||
|
std::string fname;
|
||||||
size_t size;
|
size_t size;
|
||||||
|
|
||||||
llama_file(const char * fname, const char * mode) {
|
llama_file(const char * fname, const char * mode): fname(fname) {
|
||||||
fp = std::fopen(fname, mode);
|
fp = std::fopen(fname, mode);
|
||||||
if (fp == NULL) {
|
if (fp == NULL) {
|
||||||
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
|
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
|
||||||
|
@ -1367,7 +1375,7 @@ struct llama_vocab {
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
struct llama_mlp_model_loader;
|
struct llama_gpu_split_loader;
|
||||||
struct llama_augmentation_model_loader;
|
struct llama_augmentation_model_loader;
|
||||||
|
|
||||||
struct llama_model {
|
struct llama_model {
|
||||||
|
@ -1405,7 +1413,7 @@ struct llama_model {
|
||||||
std::unique_ptr<llama_mmap> mapping;
|
std::unique_ptr<llama_mmap> mapping;
|
||||||
|
|
||||||
// aux model loaders for dynamically loaded/transformed model weights
|
// aux model loaders for dynamically loaded/transformed model weights
|
||||||
std::unique_ptr<struct llama_mlp_model_loader> mlp_model_loader;
|
std::unique_ptr<struct llama_gpu_split_loader> mlp_model_loader;
|
||||||
std::unique_ptr<struct llama_augmentation_model_loader> aug_model_loader;
|
std::unique_ptr<struct llama_augmentation_model_loader> aug_model_loader;
|
||||||
|
|
||||||
// objects representing data potentially being locked in memory
|
// objects representing data potentially being locked in memory
|
||||||
|
@ -2632,30 +2640,28 @@ static void llm_load_print_meta(llama_model_loader & ml, llama_model & model) {
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
struct llama_mlp_model_loader {
|
struct llama_gpu_split_loader {
|
||||||
int n_tensors = 0;
|
int n_tensors = 0;
|
||||||
size_t n_bytes = 0; // tensor data bytes
|
size_t n_bytes = 0; // tensor data bytes
|
||||||
|
|
||||||
const std::string fname;
|
const std::string fname;
|
||||||
llama_file file;
|
|
||||||
int fver;
|
int fver;
|
||||||
|
|
||||||
bool use_mmap = false; // only supports mmap yet
|
bool use_mmap = false; // only supports mmap yet
|
||||||
std::unique_ptr<llama_mmap> mapping;
|
std::unique_ptr<llama_mmap> mapping;
|
||||||
struct ggml_context * ctx_meta = nullptr;
|
struct ggml_context * ctx_meta = nullptr;
|
||||||
|
|
||||||
llama_mlp_model_loader(const std::string & fname, bool use_mmap) : fname(fname), use_mmap(use_mmap), file(fname.c_str(), "rb") {
|
llama_model_loader * idx_loader;
|
||||||
|
size_t vram_required = 0;
|
||||||
|
|
||||||
|
llama_gpu_split_loader(const std::string & fname, bool use_mmap) : fname(fname), use_mmap(use_mmap) {
|
||||||
GGML_ASSERT(use_mmap);
|
GGML_ASSERT(use_mmap);
|
||||||
|
|
||||||
// verify magic and version
|
idx_loader = new llama_model_loader(fname, use_mmap);
|
||||||
uint32_t magic = file.read_u32();
|
GGUF_GET_KEY(idx_loader->ctx_gguf, vram_required, gguf_get_val_u64, GGUF_TYPE_UINT64, true, LLM_KV_NAMES[LLM_KV_SPLIT_VRAM_CAPACITY]);
|
||||||
// TODO: assert on file magic once we have a stable format
|
printf("loaded gpu_idx, vram_required: %ld\n", vram_required);
|
||||||
GGML_ASSERT(magic == 0xDEADBEEF && "invalid file magic" || true);
|
|
||||||
|
|
||||||
fver = file.read_u32();
|
n_tensors = idx_loader->n_tensors;
|
||||||
GGML_ASSERT(fver == 1 && "unsupported file version");
|
|
||||||
|
|
||||||
n_tensors = file.read_u32();
|
|
||||||
|
|
||||||
// allocate memadata/data for mlp tensors
|
// allocate memadata/data for mlp tensors
|
||||||
// TODO: support allocating buffer for tensor data (when mmap is not used)
|
// TODO: support allocating buffer for tensor data (when mmap is not used)
|
||||||
|
@ -2667,138 +2673,43 @@ struct llama_mlp_model_loader {
|
||||||
/*.no_alloc =*/ true,
|
/*.no_alloc =*/ true,
|
||||||
};
|
};
|
||||||
ctx_meta = ggml_init(params);
|
ctx_meta = ggml_init(params);
|
||||||
|
}
|
||||||
|
|
||||||
// memory-map the mlp weights file
|
bool check_vram_allocable(size_t vram_budget) {
|
||||||
mapping.reset(new llama_mmap(&file, /* prefetch */ 0, ggml_is_numa()));
|
return vram_budget >= vram_required;
|
||||||
}
|
}
|
||||||
|
|
||||||
int apply_tensors_to_base_model(llama_model * model) {
|
int apply_tensors_to_base_model(llama_model * model) {
|
||||||
|
int n_layers = model->layers.size();
|
||||||
// TODO: assert fp is at the end of headers
|
// TODO: assert fp is at the end of headers
|
||||||
if (n_tensors != model -> layers.size() * 2) {
|
if (n_tensors != n_layers * 2) {
|
||||||
LLAMA_LOG_ERROR("%s: error: the number of mlp adapters does not match the layer of model\n", __func__);
|
LLAMA_LOG_ERROR("%s: error: the number of gpu splits does not match the layer of model\n", __func__);
|
||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
LLAMA_LOG_INFO("%s: applying gpu_idx adapter from '%s' - please wait ...\n", __func__, fname.c_str());
|
LLAMA_LOG_INFO("%s: applying gpu_idx adapter from '%s' - please wait ...\n", __func__, fname.c_str());
|
||||||
const int64_t t_start_mlp_us = ggml_time_us();
|
const int64_t t_start_mlp_us = ggml_time_us();
|
||||||
|
|
||||||
for (llama_layer &model_layer : model -> layers) {
|
for (int il = 0; il < n_layers; il++) {
|
||||||
ggml_tensor *mlp_fc1_tensor = load_mlp_tensor_from_stream();
|
llama_layer &model_layer = model->layers[il];
|
||||||
ggml_tensor *mlp_fc2_tensor = load_mlp_tensor_from_stream();
|
ggml_tensor * gpu_idx = idx_loader->get_tensor_meta(il*2);
|
||||||
#ifdef GGML_USE_CUBLAS
|
ggml_tensor * gpu_bucket = idx_loader->get_tensor_meta(il*2+1);
|
||||||
// ggml_set_backend(mlp_fc1_tensor, GGML_BACKEND_GPU);
|
if (gpu_idx == nullptr || gpu_bucket == nullptr) {
|
||||||
// ggml_cuda_transform_tensor(mlp_fc1_tensor->data, mlp_fc1_tensor);
|
LLAMA_LOG_ERROR("%s: error: failed to load gpu index or bucket\n", __func__);
|
||||||
|
|
||||||
// gpu bucket to GPU
|
|
||||||
ggml_set_backend(mlp_fc2_tensor, GGML_BACKEND_GPU);
|
|
||||||
ggml_cuda_transform_tensor(mlp_fc2_tensor->data, mlp_fc2_tensor);
|
|
||||||
#endif // GGML_USE_CUBLAS
|
|
||||||
if (mlp_fc1_tensor == nullptr || mlp_fc2_tensor == nullptr) {
|
|
||||||
LLAMA_LOG_ERROR("%s: error: failed to load mlp tensors\n", __func__);
|
|
||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
|
model_layer.gpu_idx = idx_loader->create_tensor_for(ctx_meta, gpu_idx, GGML_BACKEND_CPU);
|
||||||
// load model layer and check dimensions
|
model_layer.gpu_bucket = idx_loader->create_tensor_for(ctx_meta, gpu_bucket, GGML_BACKEND_GPU);
|
||||||
// ggml_tensor *model_up_t = model_layer.ffn_up;
|
|
||||||
// GGML_ASSERT(model_up_t != nullptr);
|
|
||||||
// if (model_up_t->ne[0] != mlp_fc1_tensor->ne[0] ||
|
|
||||||
// model_up_t->ne[1] != mlp_fc2_tensor->ne[1]) {
|
|
||||||
// LLAMA_LOG_ERROR("%s: incompatible tensor dimensions (%" PRId64
|
|
||||||
// " and %" PRId64
|
|
||||||
// ");"
|
|
||||||
// " are you sure that this adapter is for this model?\n",
|
|
||||||
// __func__, model_up_t->ne[0], mlp_fc1_tensor->ne[1]);
|
|
||||||
// return 1;
|
|
||||||
// }
|
|
||||||
|
|
||||||
// GGML_ASSERT(model_layer.mlp_pre_w1 == nullptr && model_layer.mlp_pre_w2 == nullptr);
|
|
||||||
model_layer.gpu_idx = mlp_fc1_tensor;
|
|
||||||
model_layer.gpu_bucket = mlp_fc2_tensor;
|
|
||||||
int *data1 = (int *)mlp_fc1_tensor->data;
|
|
||||||
int *data2 = (int *)mlp_fc2_tensor->data;
|
|
||||||
|
|
||||||
LLAMA_LOG_INFO(".");
|
|
||||||
}
|
}
|
||||||
|
llama_progress_callback cb = [](float progress, void *ctx) {
|
||||||
|
LLAMA_LOG_INFO(".");
|
||||||
|
};
|
||||||
|
idx_loader->load_all_data(ctx_meta, cb, nullptr, nullptr);
|
||||||
|
|
||||||
const int64_t t_mlp_us = ggml_time_us() - t_start_mlp_us;
|
const int64_t t_mlp_us = ggml_time_us() - t_start_mlp_us;
|
||||||
LLAMA_LOG_INFO(" done (%.2f ms)\n", t_mlp_us / 1000.0);
|
LLAMA_LOG_INFO(" done (%.2f ms)\n", t_mlp_us / 1000.0);
|
||||||
|
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Consumes the stream and returns a new mlp tensor.
|
|
||||||
// Returns nullptr on error.
|
|
||||||
// TODO: mmap mlp model file
|
|
||||||
ggml_tensor *load_mlp_tensor_from_stream() {
|
|
||||||
uint32_t n_dims = file.read_u32();
|
|
||||||
uint32_t name_length = file.read_u32();
|
|
||||||
uint32_t ftype = file.read_u32();
|
|
||||||
|
|
||||||
uint32_t ne[2] = {1, 1};
|
|
||||||
for (int i = 0; i < n_dims; ++i) {
|
|
||||||
ne[i] = file.read_u32();
|
|
||||||
}
|
|
||||||
|
|
||||||
std::string tensor_name;
|
|
||||||
{
|
|
||||||
char buf[1024];
|
|
||||||
file.read_raw(buf, name_length);
|
|
||||||
tensor_name = std::string(buf, name_length);
|
|
||||||
}
|
|
||||||
|
|
||||||
// const std::string mlp_suffix = ".mlp";
|
|
||||||
// size_t pos = tensor_name.rfind(mlp_suffix);
|
|
||||||
// if (pos == std::string::npos) {
|
|
||||||
// LLAMA_LOG_ERROR("%s: error: '%s' is not a mlp tensor\n", __func__,
|
|
||||||
// tensor_name.c_str());
|
|
||||||
// return nullptr;
|
|
||||||
// }
|
|
||||||
|
|
||||||
// std::string mlp_type = tensor_name.substr(pos + mlp_suffix.length());
|
|
||||||
// std::string base_name = tensor_name;
|
|
||||||
// base_name.erase(pos);
|
|
||||||
// LLAMA_LOG_INFO("%s: %s => %s (mlp type %s) (", __func__, tensor_name.c_str(),
|
|
||||||
// base_name.c_str(), mlp_type.c_str());
|
|
||||||
// for (int i = 0; i < n_dims; ++i) {
|
|
||||||
// LLAMA_LOG_INFO("%d ", ne[i]);
|
|
||||||
// }
|
|
||||||
// LLAMA_LOG_INFO(")\n");
|
|
||||||
// LLAMA_LOG_INFO("tensor name %s\n", tensor_name.c_str());
|
|
||||||
|
|
||||||
// create ggml tensor
|
|
||||||
ggml_type wtype;
|
|
||||||
switch (ftype) {
|
|
||||||
case 0:
|
|
||||||
wtype = GGML_TYPE_F32;
|
|
||||||
break;
|
|
||||||
case 1:
|
|
||||||
wtype = GGML_TYPE_F16;
|
|
||||||
break;
|
|
||||||
case 18:
|
|
||||||
wtype = GGML_TYPE_I32;
|
|
||||||
break;
|
|
||||||
default: {
|
|
||||||
LLAMA_LOG_ERROR("%s: invalid tensor data type '%d'\n", __func__, ftype);
|
|
||||||
return nullptr;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
ggml_tensor *mlp_tensor;
|
|
||||||
// if (n_dims != 2) {
|
|
||||||
// LLAMA_LOG_ERROR("%s: unsupported tensor dimension %d\n", __func__, n_dims);
|
|
||||||
// return nullptr;
|
|
||||||
// }
|
|
||||||
mlp_tensor = ggml_new_tensor_2d(ctx_meta, wtype, ne[0], ne[1]);
|
|
||||||
// ggml_set_name(mlp_tensor, "");
|
|
||||||
|
|
||||||
// load tensor data
|
|
||||||
size_t offset = file.tell();
|
|
||||||
size_t tensor_data_size = ggml_nbytes(mlp_tensor);
|
|
||||||
offset = (offset + 31) & -32;
|
|
||||||
file.seek(offset, SEEK_SET);
|
|
||||||
// point to the mmaped mlp model file
|
|
||||||
mlp_tensor -> data = (void *) (static_cast<char *>(mapping -> addr) + offset);
|
|
||||||
file.seek(tensor_data_size, SEEK_CUR);
|
|
||||||
return mlp_tensor;
|
|
||||||
}
|
|
||||||
};
|
};
|
||||||
|
|
||||||
// to dynamically load/transform llama model weights
|
// to dynamically load/transform llama model weights
|
||||||
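The new `llama_gpu_split_loader` relies on two invariants visible in the hunk above: the index file holds exactly two tensors per layer, laid out as `gpu_idx` at position `2*il` and `gpu_bucket` at `2*il + 1`, and the index is usable only when the VRAM budget covers the recorded `split.vram_capacity`. A minimal Python sketch of those two checks follows. The function `pair_split_tensors` is a hypothetical stand-in, and plain strings stand in for tensors.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pair_split_tensors(tensors: Sequence[T], n_layers: int) -> List[Tuple[T, T]]:
    """Group a flat tensor list into per-layer (gpu_idx, gpu_bucket) pairs."""
    if len(tensors) != n_layers * 2:
        raise ValueError("the number of gpu splits does not match the model's layers")
    # Layer il owns tensor 2*il (gpu_idx) and tensor 2*il + 1 (gpu_bucket).
    return [(tensors[2 * il], tensors[2 * il + 1]) for il in range(n_layers)]


def check_vram_allocable(vram_budget: int, vram_required: int) -> bool:
    """Mirror of the C++ check: the budget must cover the recorded capacity."""
    return vram_budget >= vram_required
```

For example, `pair_split_tensors(["idx0", "bucket0", "idx1", "bucket1"], 2)` returns `[("idx0", "bucket0"), ("idx1", "bucket1")]`.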
|
@ -2815,8 +2726,8 @@ struct llama_augmentation_model_loader {
|
||||||
// const int64_t ggml_aux_tensor_size = 4 * (100 * 100 + 5120*40*4 * ggml_tensor_overhead() + (int64_t)13824*5120*40*4);
|
// const int64_t ggml_aux_tensor_size = 4 * (100 * 100 + 5120*40*4 * ggml_tensor_overhead() + (int64_t)13824*5120*40*4);
|
||||||
int model_layer = model->layers.size();
|
int model_layer = model->layers.size();
|
||||||
int ffn_dim = model->layers[0].ffn_up->ne[1];
|
int ffn_dim = model->layers[0].ffn_up->ne[1];
|
||||||
const size_t ggml_aux_tensor_size = 4 * (100 * 100 + model_layer*ffn_dim*sizeof(float) * ggml_tensor_overhead() );
|
const size_t ggml_aux_tensor_size = 4 * (model_layer*ffn_dim*sizeof(float)*2+ model_layer*ffn_dim*sizeof(float) * ggml_tensor_overhead() );
|
||||||
printf("augmentation buffer: %ld\n", ggml_aux_tensor_size);
|
|
||||||
struct ggml_init_params params = {
|
struct ggml_init_params params = {
|
||||||
/*.mem_size =*/ ggml_aux_tensor_size,
|
/*.mem_size =*/ ggml_aux_tensor_size,
|
||||||
/*.mem_buffer =*/ nullptr,
|
/*.mem_buffer =*/ nullptr,
|
||||||
|
@ -2868,37 +2779,29 @@ struct llama_augmentation_model_loader {
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
|
|
||||||
void slice_ffn_mat_to_gpu(llama_layer & layer) {
|
size_t slice_ffn_mat_to_gpu(llama_layer & layer) {
|
||||||
std::vector<uint8_t> work_buffer;
|
std::vector<uint8_t> work_buffer;
|
||||||
ggml_cgraph * tmp_sum_gf = ggml_new_graph(aux_ctx);
|
|
||||||
ggml_tensor * gpu_idx = layer.gpu_idx;
|
ggml_tensor * gpu_idx = layer.gpu_idx;
|
||||||
|
|
||||||
// calculate the size of tensor to be copied
|
|
||||||
ggml_tensor * sum_t = ggml_sum(aux_ctx, gpu_idx);
|
|
||||||
ggml_build_forward_expand(tmp_sum_gf, sum_t);
|
|
||||||
ggml_graph_compute_helper(work_buffer, tmp_sum_gf, 2);
|
|
||||||
int64_t gpu_rows = *ggml_get_data_i32(sum_t);
|
|
||||||
|
|
||||||
|
|
||||||
int64_t gpu_index_len = gpu_idx->ne[0];
|
|
||||||
// ggml_tensor * gpu_bucket = ggml_new_tensor_1d(aux_ctx, GGML_TYPE_I32, gpu_rows);
|
|
||||||
// make bucket a reverse index back to unstriped mat
|
|
||||||
// int32_t * pbucket_data = (int32_t *)gpu_bucket->data;
|
|
||||||
// for (int i = 0; i < gpu_index_len; i++) {
|
|
||||||
// if (ggml_get_data_i32(gpu_idx)[i] == 0) {
|
|
||||||
// continue;
|
|
||||||
// }
|
|
||||||
// *pbucket_data = i;
|
|
||||||
// ++pbucket_data;
|
|
||||||
// }
|
|
||||||
// layer.gpu_bucket = gpu_bucket;
|
|
||||||
ggml_tensor *gpu_bucket = layer.gpu_bucket;
|
ggml_tensor *gpu_bucket = layer.gpu_bucket;
|
||||||
|
size_t offloaded_bytes = 0;
|
||||||
|
|
||||||
layer.ffn_gate_gpu = create_striped_mat_to_gpu(layer.ffn_gate, gpu_bucket);
|
layer.ffn_gate_gpu = create_striped_mat_to_gpu(layer.ffn_gate, gpu_bucket);
|
||||||
layer.ffn_up_gpu = create_striped_mat_to_gpu(layer.ffn_up, gpu_bucket);
|
layer.ffn_up_gpu = create_striped_mat_to_gpu(layer.ffn_up, gpu_bucket);
|
||||||
layer.ffn_down_gpu = create_striped_mat_to_gpu(layer.ffn_down_t, gpu_bucket);
|
layer.ffn_down_gpu = create_striped_mat_to_gpu(layer.ffn_down_t, gpu_bucket);
|
||||||
|
|
||||||
|
if (layer.ffn_gate_gpu) {
|
||||||
|
offloaded_bytes += ggml_nbytes(layer.ffn_gate_gpu);
|
||||||
|
}
|
||||||
|
if (layer.ffn_up_gpu) {
|
||||||
|
offloaded_bytes += ggml_nbytes(layer.ffn_up_gpu);
|
||||||
|
}
|
||||||
|
if (layer.ffn_down_gpu) {
|
||||||
|
offloaded_bytes += ggml_nbytes(layer.ffn_down_gpu);
|
||||||
|
}
|
||||||
|
return offloaded_bytes;
|
||||||
}
|
}
|
||||||
|
|
||||||
int apply_augmentation_to_base_model(llama_model * model) {
|
size_t offload_ffn_split(llama_model * model) {
|
||||||
LLAMA_LOG_INFO("%s: applying augmentation to model - please wait ...\n", __func__);
|
LLAMA_LOG_INFO("%s: applying augmentation to model - please wait ...\n", __func__);
|
||||||
const int64_t t_start_aug_us = ggml_time_us();
|
const int64_t t_start_aug_us = ggml_time_us();
|
||||||
std::vector<uint8_t> work_buffer;
|
std::vector<uint8_t> work_buffer;
|
||||||
|
@ -2910,6 +2813,7 @@ struct llama_augmentation_model_loader {
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
// load gpu_idx and slice mat to gpu
|
// load gpu_idx and slice mat to gpu
|
||||||
|
size_t offloaded_bytes = 0;
|
||||||
for (llama_layer &model_layer : model -> layers) {
|
for (llama_layer &model_layer : model -> layers) {
|
||||||
// gpu_idx load
|
// gpu_idx load
|
||||||
if (model_layer.gpu_idx == NULL && model_layer.gpu_bucket == NULL) {
|
if (model_layer.gpu_idx == NULL && model_layer.gpu_bucket == NULL) {
|
||||||
|
@ -2919,12 +2823,12 @@ struct llama_augmentation_model_loader {
|
||||||
ggml_tensor * gpu_bucket = ggml_new_tensor_1d(aux_ctx, GGML_TYPE_I32, 0);
|
ggml_tensor * gpu_bucket = ggml_new_tensor_1d(aux_ctx, GGML_TYPE_I32, 0);
|
||||||
model_layer.gpu_bucket = gpu_bucket;
|
model_layer.gpu_bucket = gpu_bucket;
|
||||||
}
|
}
|
||||||
slice_ffn_mat_to_gpu(model_layer);
|
offloaded_bytes += slice_ffn_mat_to_gpu(model_layer);
|
||||||
LLAMA_LOG_INFO(".");
|
LLAMA_LOG_INFO(".");
|
||||||
}
|
}
|
||||||
|
|
||||||
LLAMA_LOG_INFO(" done (%.2f ms)\n", (ggml_time_us() - t_start_aug_us) / 1000.0);
|
LLAMA_LOG_INFO(" done (%.2f ms)\n", (ggml_time_us() - t_start_aug_us) / 1000.0);
|
||||||
return 0;
|
return offloaded_bytes;
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
|
@ -2957,7 +2861,7 @@ struct buffered_tensor_allocator {
|
||||||
// For GPU tensors, we need to allocate them in VRAM as much as possible,
|
// For GPU tensors, we need to allocate them in VRAM as much as possible,
|
||||||
// and update the tensor data in-place. If the VRAM budget is exceeded,
|
// and update the tensor data in-place. If the VRAM budget is exceeded,
|
||||||
// we allocate the tensor in CPU memory.
|
// we allocate the tensor in CPU memory.
|
||||||
void flush() {
|
size_t flush() {
|
||||||
#if defined(GGML_USE_CUBLAS)
|
#if defined(GGML_USE_CUBLAS)
|
||||||
// iterate over offloading priorities
|
// iterate over offloading priorities
|
||||||
for (int enum_i = TENSOR_OFFLOAD_ATTN; enum_i <= TENSOR_OFFLOAD_KV_CACHE; enum_i ++) {
|
for (int enum_i = TENSOR_OFFLOAD_ATTN; enum_i <= TENSOR_OFFLOAD_KV_CACHE; enum_i ++) {
|
||||||
|
@ -2965,7 +2869,7 @@ struct buffered_tensor_allocator {
|
||||||
for (ggml_tensor * meta_tensor : alloc_queues[level]) {
|
for (ggml_tensor * meta_tensor : alloc_queues[level]) {
|
||||||
size_t tensor_data_size = ggml_nbytes(meta_tensor);
|
size_t tensor_data_size = ggml_nbytes(meta_tensor);
|
||||||
if (vram_allocated_bytes + tensor_data_size > vram_budget_bytes) {
|
if (vram_allocated_bytes + tensor_data_size > vram_budget_bytes) {
|
||||||
return;
|
return vram_allocated_bytes;
|
||||||
}
|
}
|
||||||
// allocate in VRAM
|
// allocate in VRAM
|
||||||
ggml_set_backend(meta_tensor, GGML_BACKEND_GPU);
|
ggml_set_backend(meta_tensor, GGML_BACKEND_GPU);
|
||||||
|
@ -2974,15 +2878,83 @@ struct buffered_tensor_allocator {
|
||||||
}
|
}
|
||||||
ml.done_getting_tensors();
|
ml.done_getting_tensors();
|
||||||
#endif
|
#endif
|
||||||
|
return vram_allocated_bytes;
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
|
static bool load_gpu_split_from_split_file(llama_model & model, std::string split_path, size_t vram_budget) {
|
||||||
|
llama_gpu_split_loader loader(split_path, true);
|
||||||
|
return loader.check_vram_allocable(vram_budget)
|
||||||
|
&& loader.apply_tensors_to_base_model(&model) == 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
static bool llm_load_gpu_split_with_budget(llama_model_loader & ml, llama_model & model, size_t vram_allocatable_bytes, bool no_cache) {
|
||||||
|
const char * model_path = ml.file.fname.c_str();
|
||||||
|
std::string cached_split_path = std::string(model_path) + ".generated.gpuidx";
|
||||||
|
const char * model_basedir = dirname(const_cast<char *>(model_path));
|
||||||
|
|
||||||
|
// Load GPU split from previously generated cache
|
||||||
|
if (access(cached_split_path.c_str(), F_OK) == 0 && !no_cache) {
|
||||||
|
if (load_gpu_split_from_split_file(model, cached_split_path, vram_allocatable_bytes)) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
LLAMA_LOG_ERROR("%s: error: failed to apply previously generated gpu split from '%s'\n", __func__, cached_split_path.c_str());
|
||||||
|
}
|
||||||
|
|
||||||
|
// Generate GPU split
|
||||||
|
std::string activation_path = std::string(model_basedir) + "/activation";
|
||||||
|
if (access(activation_path.c_str(), F_OK) != 0) {
|
||||||
|
LLAMA_LOG_ERROR("%s: error: activation files under '%s' not found\n", __func__, activation_path.c_str());
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Calculate solver parameters
|
||||||
|
ggml_tensor * ffn_up = model.layers[0].ffn_up;
|
||||||
|
ggml_tensor * ffn_gate = model.layers[0].ffn_gate;
|
||||||
|
int slice_size = ffn_up->ne[1] * ggml_type_size(ffn_up->type) / ggml_blck_size(ffn_up->type);
|
||||||
|
// For model arch with FFN gate, the gate is also sliced, otherwise only the up and down matrices are sliced
|
||||||
|
int vram_bytes_per_slice = slice_size * (ffn_gate ? 4.5 : 2); // TODO: why 4.5, not 3?
|
||||||
|
int neuron_cap = floor((double)vram_allocatable_bytes / vram_bytes_per_slice) * 4;
|
||||||
|
|
||||||
|
LLAMA_LOG_INFO("invoking powerinfer Python module to generate gpu split for %.2f MiB of VRAM\n", vram_allocatable_bytes / 1024.0 / 1024.0);
|
||||||
|
|
||||||
|
std::stringstream command_ss;
|
||||||
|
command_ss << "python3 -m powerinfer"
|
||||||
|
<< " --activation " << activation_path
|
||||||
|
<< " --layer " << model.hparams.n_layer
|
||||||
|
<< " --neuron " << ffn_up->ne[1]
|
||||||
|
<< " --capacity " << neuron_cap
|
||||||
|
<< " --vram-capacity " << vram_allocatable_bytes
|
||||||
|
<< " --output " << cached_split_path;
|
||||||
|
if (system(command_ss.str().c_str()) != 0 || access(cached_split_path.c_str(), F_OK) != 0) {
|
||||||
|
LLAMA_LOG_ERROR("%s: error: failed to generate gpu split\n", __func__);
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
return load_gpu_split_from_split_file(model, cached_split_path, vram_allocatable_bytes);
|
||||||
|
}
|
||||||
|
|
||||||
|
static void llm_load_gpu_split(llama_model_loader & ml, llama_model & model, size_t vram_budget_bytes, bool no_cache, bool no_offload) {
|
||||||
|
#if defined(GGML_USE_CUBLAS)
|
||||||
|
if (vram_budget_bytes >= 512ull * 1024 * 1024 && !no_offload) {
|
||||||
|
vram_budget_bytes -= 512ull * 1024 * 1024; // leave 512 MiB as a safety margin
|
||||||
|
if (!llm_load_gpu_split_with_budget(ml, model, vram_budget_bytes, no_cache)) {
|
||||||
|
LLAMA_LOG_ERROR("%s: error: failed to generate gpu split, an empty one will be used\n", __func__);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
// Apply GPU index and split FFNs to GPU
|
||||||
|
size_t ffn_offloaded_bytes = llama_model_offload_ffn_split(&model);
|
||||||
|
LLAMA_LOG_INFO("%s: offloaded %.2f MiB of FFN weights to GPU\n", __func__, ffn_offloaded_bytes / 1024.0 / 1024.0);
|
||||||
|
}
|
||||||
|
|
||||||
static void llm_load_sparse_model_tensors(
|
static void llm_load_sparse_model_tensors(
|
||||||
llama_model_loader & ml,
|
llama_model_loader & ml,
|
||||||
llama_model & model,
|
llama_model & model,
|
||||||
int main_gpu,
|
int main_gpu,
|
||||||
long int vram_budget_bytes,
|
long int vram_budget_bytes,
|
||||||
const float * tensor_split,
|
bool reset_gpu_index,
|
||||||
|
bool disable_ffn_split,
|
||||||
bool use_mlock,
|
bool use_mlock,
|
||||||
llama_progress_callback progress_callback,
|
llama_progress_callback progress_callback,
|
||||||
void * progress_callback_user_data) {
|
void * progress_callback_user_data) {
|
||||||
|
@ -3131,19 +3103,20 @@ static void llm_load_sparse_model_tensors(
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
alloc.flush();
|
size_t vram_allocated_bytes = alloc.flush();
|
||||||
|
GGML_ASSERT(vram_allocated_bytes < vram_capacity);
|
||||||
|
|
||||||
// print memory requirements
|
// print memory requirements
|
||||||
{
|
{
|
||||||
// this is the total memory required to run the inference
|
// this is the total memory required to run the inference
|
||||||
size_t mem_required =
|
size_t mem_required =
|
||||||
ctx_size +
|
ctx_size +
|
||||||
mmapped_size - alloc.vram_allocated_bytes; // weights in VRAM not in memory
|
mmapped_size - vram_allocated_bytes; // weights in VRAM not in memory
|
||||||
|
|
||||||
LLAMA_LOG_INFO("%s: mem required = %7.2f MB\n", __func__, mem_required / 1024.0 / 1024.0);
|
LLAMA_LOG_INFO("%s: mem required = %7.2f MB\n", __func__, mem_required / 1024.0 / 1024.0);
|
||||||
|
|
||||||
#if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
|
#if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
|
||||||
LLAMA_LOG_INFO("%s: VRAM used: %.2f MB\n", __func__, alloc.vram_allocated_bytes / 1024.0 / 1024.0);
|
LLAMA_LOG_INFO("%s: VRAM used: %.2f MB\n", __func__, vram_allocated_bytes / 1024.0 / 1024.0);
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -3161,11 +3134,14 @@ static void llm_load_sparse_model_tensors(
|
||||||
|
|
||||||
model.mapping = std::move(ml.mapping);
|
model.mapping = std::move(ml.mapping);
|
||||||
|
|
||||||
|
// Offload FFN segments to GPU if possible
|
||||||
|
llm_load_gpu_split(ml, model, vram_capacity - vram_allocated_bytes, reset_gpu_index, disable_ffn_split);
|
||||||
|
|
||||||
// loading time will be recalculated after the first eval, so
|
// loading time will be recalculated after the first eval, so
|
||||||
// we take page faults deferred by mmap() into consideration
|
// we take page faults deferred by mmap() into consideration
|
||||||
model.t_load_us = ggml_time_us() - model.t_start_us;
|
model.t_load_us = ggml_time_us() - model.t_start_us;
|
||||||
|
|
||||||
model.n_gpu_layers = -1; // based on offloading results?
|
model.n_gpu_layers = -1; // TODO: based on offloading results, by category?
|
||||||
}
|
}
|
||||||
|
|
||||||
static void llm_load_tensors(
|
static void llm_load_tensors(
|
||||||
|
@ -3893,6 +3869,10 @@ static bool llama_model_load(const std::string & fname, llama_model & model, con
|
||||||
try {
|
try {
|
||||||
llama_model_loader ml(fname, params.use_mmap);
|
llama_model_loader ml(fname, params.use_mmap);
|
||||||
|
|
||||||
|
if (ml.sparse_deriv == GGML_SPARSE_INFERENCE) {
|
||||||
|
LLAMA_LOG_INFO("%s: PowerInfer model loaded. Sparse inference will be used.\n", __func__);
|
||||||
|
}
|
||||||
|
|
||||||
model.hparams.vocab_only = params.vocab_only;
|
model.hparams.vocab_only = params.vocab_only;
|
||||||
model.sparse_deriv = ml.sparse_deriv;
|
model.sparse_deriv = ml.sparse_deriv;
|
||||||
|
|
||||||
|
@ -3918,8 +3898,8 @@ static bool llama_model_load(const std::string & fname, llama_model & model, con
|
||||||
}
|
}
|
||||||
double vram_budget_bytes = params.vram_budget_gb * 1024.0 * 1024.0 * 1024.0;
|
double vram_budget_bytes = params.vram_budget_gb * 1024.0 * 1024.0 * 1024.0;
|
||||||
llm_load_sparse_model_tensors(
|
llm_load_sparse_model_tensors(
|
||||||
ml, model, params.main_gpu, vram_budget_bytes, params.tensor_split, params.use_mlock,
|
ml, model, params.main_gpu, vram_budget_bytes, params.reset_gpu_index, params.disable_gpu_index,
|
||||||
params.progress_callback, params.progress_callback_user_data
|
params.use_mlock, params.progress_callback, params.progress_callback_user_data
|
||||||
);
|
);
|
||||||
} else {
|
} else {
|
||||||
llm_load_tensors(
|
llm_load_tensors(
|
||||||
|
@ -9671,24 +9651,19 @@ int llama_model_apply_lora_from_file(const struct llama_model * model, const cha
|
||||||
}
|
}
|
||||||
|
|
||||||
int llama_model_apply_gpu_idx_from_file(struct llama_model * model, const char * path_mlp, bool use_mmap) {
|
int llama_model_apply_gpu_idx_from_file(struct llama_model * model, const char * path_mlp, bool use_mmap) {
|
||||||
llama_mlp_model_loader * mlp_ml = new llama_mlp_model_loader(path_mlp, use_mmap);
|
llama_gpu_split_loader * mlp_ml = new llama_gpu_split_loader(path_mlp, use_mmap);
|
||||||
if (mlp_ml -> apply_tensors_to_base_model(model) > 0) {
|
if (mlp_ml -> apply_tensors_to_base_model(model) > 0) {
|
||||||
LLAMA_LOG_ERROR("%s: failed to apply mlp adapter\n", __func__);
|
LLAMA_LOG_ERROR("%s: failed to apply gpu split\n", __func__);
|
||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
model -> mlp_model_loader = std::unique_ptr<llama_mlp_model_loader>(mlp_ml);
|
model -> mlp_model_loader = std::unique_ptr<llama_gpu_split_loader>(mlp_ml);
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Apply postprocessing steps for PowerInfer derived models
|
size_t llama_model_offload_ffn_split(struct llama_model * model) {
|
||||||
int llama_model_apply_augmentation(struct llama_model * model) {
|
|
||||||
llama_augmentation_model_loader * aug_ml = new llama_augmentation_model_loader(model);
|
llama_augmentation_model_loader * aug_ml = new llama_augmentation_model_loader(model);
|
||||||
if (aug_ml -> apply_augmentation_to_base_model(model) > 0) {
|
size_t offloaded_bytes = aug_ml->offload_ffn_split(model);
|
||||||
LLAMA_LOG_ERROR("%s: failed to apply augmentation adapter\n", __func__);
|
return offloaded_bytes;
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
model -> aug_model_loader = std::unique_ptr<llama_augmentation_model_loader>(aug_ml);
|
|
||||||
return 0;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
int llama_get_kv_cache_token_count(const struct llama_context * ctx) {
|
int llama_get_kv_cache_token_count(const struct llama_context * ctx) {
|
||||||
|
|
4
llama.h
|
@ -173,6 +173,8 @@ extern "C" {
|
||||||
bool vocab_only; // only load the vocabulary, no weights
|
bool vocab_only; // only load the vocabulary, no weights
|
||||||
bool use_mmap; // use mmap if possible
|
bool use_mmap; // use mmap if possible
|
||||||
bool use_mlock; // force system to keep model in RAM
|
bool use_mlock; // force system to keep model in RAM
|
||||||
|
bool reset_gpu_index; // force reset of the GPU index
|
||||||
|
bool disable_gpu_index; // bypass the GPU index and FFN split
|
||||||
};
|
};
|
||||||
|
|
||||||
struct llama_context_params {
|
struct llama_context_params {
|
||||||
|
@ -347,7 +349,7 @@ extern "C" {
|
||||||
const char * path_mlp,
|
const char * path_mlp,
|
||||||
bool use_mmap);
|
bool use_mmap);
|
||||||
|
|
||||||
LLAMA_API int llama_model_apply_augmentation(struct llama_model * model);
|
LLAMA_API size_t llama_model_offload_ffn_split(struct llama_model * model);
|
||||||
|
|
||||||
//
|
//
|
||||||
// KV cache
|
// KV cache
|
||||||
|
|
0
powerinfer-py/powerinfer/__init__.py
Normal file
43
powerinfer-py/powerinfer/__main__.py
Normal file
|
@ -0,0 +1,43 @@
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
|
||||||
|
from .solver import solve_gpu_split
|
||||||
|
from .export_split import export_split
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
|
||||||
|
# Set up command line arguments
|
||||||
|
parser = argparse.ArgumentParser(description='Optimize neuron activation based on VRAM capacity and other parameters.')
|
||||||
|
parser.add_argument('--activation', type=str, required=True, help='Path to the directory containing activation data.')
|
||||||
|
parser.add_argument('--neuron', type=int, default=8192*4, help='Total number of neurons in the network.')
|
||||||
|
parser.add_argument('--capacity', type=int, default=int(8192*4*32*0.1), help='Total VRAM capacity for the model.')
|
||||||
|
parser.add_argument('--layer', type=int, default=59, help='Total number of layers in the neural network.')
|
||||||
|
parser.add_argument('--vram-capacity', type=int, help='Total VRAM capacity (Bytes) available for splitting')
|
||||||
|
parser.add_argument('--batch', type=int, default=256, help='Batch size for processing.')
|
||||||
|
parser.add_argument('--threshold', type=int, default=0, help='Threshold for splitting a layer across multiple GPUs.')
|
||||||
|
parser.add_argument('--output', type=str, required=True, help='File path for the output GPU index file.')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
print("solver args:", args)
|
||||||
|
|
||||||
|
solved = solve_gpu_split(
|
||||||
|
activation_path=args.activation,
|
||||||
|
neuron=args.neuron,
|
||||||
|
capacity=args.capacity,
|
||||||
|
layer=args.layer,
|
||||||
|
batch=args.batch,
|
||||||
|
threshold=args.threshold,
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"solved: {solved}, total neurons: {sum(solved)}")
|
||||||
|
|
||||||
|
export_split(
|
||||||
|
activations_path=args.activation,
|
||||||
|
output_path=args.output,
|
||||||
|
solved_list=solved,
|
||||||
|
vram_capacity=args.vram_capacity
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"Exported to {args.output}")
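For context, the C++ loader in this commit derives `--capacity` from the remaining VRAM budget and then shells out to this entry point. The sketch below restates that arithmetic and the resulting command line in Python. The helper names `neuron_capacity` and `build_solver_argv` are hypothetical and not part of the commit. The 4.5 and 2 slice factors and the final `* 4` are copied from the C++ caller, which itself marks them as provisional. The sketch builds an argv list, whereas the C++ code passes a single shell string to `system()`.

```python
import math


def neuron_capacity(vram_bytes: int, slice_size: int, has_gate: bool) -> int:
    """Mirror the C++ caller: how many neurons fit in the VRAM budget."""
    # The C++ code truncates the per-neuron cost to an int before dividing.
    bytes_per_slice = int(slice_size * (4.5 if has_gate else 2))
    # The trailing * 4 is the commit's "temporary fix for neuron factor".
    return math.floor(vram_bytes / bytes_per_slice) * 4


def build_solver_argv(activation_dir, n_layer, n_neuron, capacity, vram_bytes, output_path):
    """Assemble the `python3 -m powerinfer` call as an argv list (hypothetical helper)."""
    return [
        "python3", "-m", "powerinfer",
        "--activation", str(activation_dir),
        "--layer", str(n_layer),
        "--neuron", str(n_neuron),
        "--capacity", str(capacity),
        "--vram-capacity", str(vram_bytes),
        "--output", str(output_path),
    ]


if __name__ == "__main__":
    cap = neuron_capacity(vram_bytes=9000, slice_size=1000, has_gate=True)
    print(build_solver_argv("model/activation", 2, 8, cap, 9000, "model.gguf.generated.gpuidx"))
```

An argv list can be handed to `subprocess.run` without shell quoting, which matters when the model path contains spaces. That is a design choice of this sketch, not something the commit does.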
|
70
powerinfer-py/powerinfer/export_split.py
Normal file
|
@ -0,0 +1,70 @@
|
||||||
|
import argparse
|
||||||
|
import pickle
|
||||||
|
import gguf
|
||||||
|
from gguf.constants import GGMLQuantizationType
|
||||||
|
from gguf.gguf_writer import GGUFWriter
|
||||||
|
import torch
|
||||||
|
from pathlib import Path
|
||||||
|
import os
|
||||||
|
import struct
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
def load_activation_weights(models_base: Path):
|
||||||
|
# TODO: might need a specification file to indicate which models to load.
|
||||||
|
# But for now, let's assume it is a plain directory of activation_{0, ... , n_layers - 1}.pt
|
||||||
|
*_, files = next(os.walk(models_base))
|
||||||
|
return [torch.load(models_base / f"activation_{i}.pt") for i in range(len(files))]
|
||||||
|
|
||||||
|
def append_gpu_idx(gguf: GGUFWriter, i_layer: int, activation, select_count) -> None:
|
||||||
|
_, indices = torch.topk(activation, k=int(select_count))
|
||||||
|
gpu_idx = torch.zeros_like(activation)
|
||||||
|
gpu_idx[indices] = 1
|
||||||
|
gpu_idx = gpu_idx.numpy().astype(np.int32)
|
||||||
|
key = f"blk.{i_layer}.gpu_idx"
|
||||||
|
print(
|
||||||
|
f"{key} => {key} {gpu_idx.shape} {gpu_idx.dtype} {gpu_idx.nbytes/1024/1024} MiB"
|
||||||
|
)
|
||||||
|
gguf.add_tensor(
|
||||||
|
name=key,
|
||||||
|
tensor=gpu_idx,
|
||||||
|
raw_shape=gpu_idx.shape[::-1],
|
||||||
|
raw_dtype=GGMLQuantizationType.I32,
|
||||||
|
)
|
||||||
|
|
||||||
|
indices = indices.numpy().astype(np.int32)
|
||||||
|
gpu_bucket = np.sort(indices)
|
||||||
|
key = f"blk.{i_layer}.gpu_bucket"
|
||||||
|
print(
|
||||||
|
f"{key} => {key} {gpu_bucket.shape} {gpu_bucket.dtype} {gpu_bucket.nbytes/1024/1024} MiB"
|
||||||
|
)
|
||||||
|
gguf.add_tensor(
|
||||||
|
name=key,
|
||||||
|
tensor=gpu_bucket,
|
||||||
|
raw_shape=gpu_bucket.shape[::-1],
|
||||||
|
raw_dtype=GGMLQuantizationType.I32,
|
||||||
|
)
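The relationship between the two tensors written by `append_gpu_idx` can be shown without torch. The sketch below assumes plain Python lists and a hypothetical helper `gpu_index_and_bucket`. It selects the `select_count` most frequently activated neurons, marks them in a 0/1 mask (the `gpu_idx` tensor), and lists their positions in ascending order (the `gpu_bucket` tensor). Tie-breaking between equal activation values may differ from `torch.topk`.

```python
def gpu_index_and_bucket(activation, select_count):
    """Return (gpu_idx mask, sorted gpu_bucket) for the top-k activations."""
    # Rank neuron positions by activation count, highest first.
    order = sorted(range(len(activation)), key=lambda i: activation[i], reverse=True)
    chosen = order[:select_count]

    # gpu_idx: 1 where the neuron is offloaded to the GPU, 0 otherwise.
    gpu_idx = [0] * len(activation)
    for i in chosen:
        gpu_idx[i] = 1

    # gpu_bucket: ascending row indices, mapping GPU rows back to the full matrix.
    gpu_bucket = sorted(chosen)
    return gpu_idx, gpu_bucket


if __name__ == "__main__":
    mask, bucket = gpu_index_and_bucket([0.1, 0.9, 0.4, 0.7], select_count=2)
    print(mask, bucket)
```

The two outputs carry the same information in different shapes: the number of ones in the mask always equals the length of the bucket.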
|
||||||
|
|
||||||
|
def export_split(activations_path: str, output_path: str, solved_list: list[int], vram_capacity: int):
|
||||||
|
predictors = load_activation_weights(Path(activations_path)) # predictor => activation count
|
||||||
|
gguf_out = GGUFWriter(output_path, "generic.gpu_index")
|
||||||
|
for i, (activation, selected_count) in enumerate(zip(predictors, solved_list)):
|
||||||
|
append_gpu_idx(gguf_out, i, activation, selected_count)
|
||||||
|
|
||||||
|
# set kvs
|
||||||
|
gguf_out.add_block_count(len(predictors))
|
||||||
|
# TODO: better to save the actual capacity that split neurons require
|
||||||
|
gguf_out.add_uint64(gguf.Keys.Split.VRAM_CAPACITY, vram_capacity)
|
||||||
|
|
||||||
|
gguf_out.write_header_to_file()
|
||||||
|
gguf_out.write_kv_data_to_file()
|
||||||
|
gguf_out.write_tensors_to_file()
|
||||||
|
gguf_out.close()
|
||||||
|
|
||||||
|
# post-process: write another unique file header to distinguish from the original GGUF file
|
||||||
|
with open(output_path, "r+b") as fout:
|
||||||
|
POWERINFER_MAGIC = int.from_bytes(b"PWRI", "little")
|
||||||
|
fout.write(struct.pack("<I", POWERINFER_MAGIC))
|
||||||
|
fout.write(struct.pack("<I", 3))
|
||||||
|
|
||||||
|
print(f"exported GPU index to {output_path}")
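The post-processing step at the end of `export_split` overwrites the first eight bytes of the file in place. The sketch below reproduces that step on an in-memory buffer, with `io.BytesIO` standing in for the real output file. The helpers `patch_header` and `read_header` are hypothetical names, and the leading `GGUF` bytes in the demo are an assumption about what `GGUFWriter` emits. How the C++ loader validates these bytes is not shown in this part of the diff.

```python
import io
import struct

# Same construction as in export_split: the 4 ASCII bytes read as a little-endian u32.
POWERINFER_MAGIC = int.from_bytes(b"PWRI", "little")


def patch_header(buf, version=3):
    """Overwrite the first 8 bytes with the PowerInfer magic and a version."""
    buf.seek(0)
    buf.write(struct.pack("<I", POWERINFER_MAGIC))
    buf.write(struct.pack("<I", version))


def read_header(buf):
    """Read back (magic bytes, version) from the start of the buffer."""
    buf.seek(0)
    magic, version = struct.unpack("<II", buf.read(8))
    return magic.to_bytes(4, "little"), version


if __name__ == "__main__":
    # Stand-in for a freshly written file: assumed magic, version, then payload.
    fake_file = io.BytesIO(b"GGUF" + struct.pack("<I", 3) + b"payload")
    patch_header(fake_file)
    print(read_header(fake_file))
```

Because the magic is converted with little-endian byte order and packed with `<I`, the bytes on disk are the literal ASCII string `PWRI`. Everything after the first eight bytes is left untouched.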
|
||||||
|
|
90
powerinfer-py/powerinfer/solver.py
Normal file
|
@ -0,0 +1,90 @@
|
||||||
|
#!/usr/bin/env python
|
||||||
|
# coding=utf-8
|
||||||
|
import argparse
|
||||||
|
from cvxopt.glpk import ilp
|
||||||
|
import numpy as np
|
||||||
|
from cvxopt import matrix
|
||||||
|
import torch
|
||||||
|
import pickle
|
||||||
|
|
||||||
|
def solve_gpu_split(
|
||||||
|
activation_path: str,
|
||||||
|
neuron: int,
|
||||||
|
capacity: int,
|
||||||
|
layer: int,
|
||||||
|
batch: int,
|
||||||
|
threshold: int,
|
||||||
|
):
|
||||||
|
# Processing activation data
|
||||||
|
values = []
|
||||||
|
for i in range(layer):
|
||||||
|
# Load and sort activation data for each layer
|
||||||
|
freq = torch.load(f"{activation_path}/activation_{i}.pt")
|
||||||
|
freq, _ = torch.sort(freq, descending=True)
|
||||||
|
freq = freq * -1.0
|
||||||
|
freq = freq.view(-1, batch)
|
||||||
|
freq = freq.sum(dim=1)
|
||||||
|
freq = freq.tolist()
|
||||||
|
values += freq
|
||||||
|
|
||||||
|
# Padding zero values for additional constraints
|
||||||
|
for i in range(layer):
|
||||||
|
values += [0.0]
|
||||||
|
c = np.array(values, dtype=float)
|
||||||
|
c = matrix(c)
|
||||||
|
|
||||||
|
# Setting capacity and neuron count per batch
|
||||||
|
CAP = capacity
|
||||||
|
CAP = int(CAP / batch)
|
||||||
|
neuron = int(neuron / batch)
|
||||||
|
coeff = []
|
||||||
|
h = []
|
||||||
|
|
||||||
|
# Constraint 1: Total neuron activation constraint
|
||||||
|
lst = []
|
||||||
|
for i in range(neuron * layer):
|
||||||
|
lst.append(1)
|
||||||
|
for i in range(layer):
|
||||||
|
lst.append(0)
|
||||||
|
coeff.append(lst)
|
||||||
|
h.append(CAP)
|
||||||
|
|
||||||
|
# Constraint 2: Threshold constraint for GPU split per layer
|
||||||
|
for i in range(layer):
|
||||||
|
lst = [0] * (neuron * layer + layer)
|
||||||
|
for j in range(neuron):
|
||||||
|
lst[i * neuron + j] = -1
|
||||||
|
lst[neuron * layer + i] = int(threshold / batch)
|
||||||
|
coeff.append(lst)
|
||||||
|
h.append(0)
|
||||||
|
|
||||||
|
# Constraint 3: Upper bound on neuron activations
|
||||||
|
for i in range(layer):
|
||||||
|
lst = [0] * (neuron * layer + layer)
|
||||||
|
for j in range(neuron):
|
||||||
|
lst[i * neuron + j] = 1
|
||||||
|
lst[neuron * layer + i] = -1000000 # Arbitrary large negative number as an upper bound
|
||||||
|
coeff.append(lst)
|
||||||
|
h.append(0)
|
||||||
|
|
||||||
|
# Convert lists to matrix format for ILP solver
|
||||||
|
coeff = np.array(coeff, dtype=float)
|
||||||
|
G = matrix(coeff)
|
||||||
|
h = np.array(h, dtype=float)
|
||||||
|
h = matrix(h)
|
||||||
|
|
||||||
|
# Define the set of integer and binary variables
|
||||||
|
I = set(range(neuron * layer + layer))
|
||||||
|
B = set()
|
||||||
|
|
||||||
|
# Solving the ILP problem
|
||||||
|
(status, x) = ilp(c, G, h, None, None, B, I, options={'tm_lim' : 30000}) # with 30s timeout
|
||||||
|
print(f"ILP Status: {status}")
|
||||||
|
ans = list(x)
|
||||||
|
print(f"Total Activation Units: {sum(ans)}")
|
||||||
|
|
||||||
|
aligned_lst = []
|
||||||
|
for i in range(layer):
|
||||||
|
aligned_lst.append(sum(ans[i * neuron:i * neuron + neuron] * batch))
|
||||||
|
|
||||||
|
return aligned_lst
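Two steps of `solve_gpu_split` are easy to misread: how per-neuron activation counts become the ILP cost vector, and how the solution vector is turned back into per-layer neuron counts. The sketch below isolates both using plain lists instead of torch tensors and leaves out the ILP solve itself. The helpers `batched_costs` and `neurons_per_layer` are hypothetical names that do not appear in the commit.

```python
def batched_costs(freq, batch):
    """Sort counts descending, group them in runs of `batch`, negate each group sum."""
    if len(freq) % batch != 0:
        # torch's view(-1, batch) has the same divisibility requirement.
        raise ValueError("neuron count must be a multiple of batch")
    ordered = sorted(freq, reverse=True)
    # Negated because the solver minimizes, while the aim is to maximize
    # the total activation count placed on the GPU.
    return [-sum(ordered[i:i + batch]) for i in range(0, len(ordered), batch)]


def neurons_per_layer(x, groups_per_layer, n_layer, batch):
    """Convert the solution vector (one entry per group) into neuron counts per layer."""
    counts = []
    for i in range(n_layer):
        chosen_groups = sum(x[i * groups_per_layer:(i + 1) * groups_per_layer])
        counts.append(int(chosen_groups) * batch)
    return counts


if __name__ == "__main__":
    print(batched_costs([1, 4, 2, 3], batch=2))
    # Two layers with three groups each; the trailing entries are the per-layer
    # indicator variables that the solver appends, ignored here.
    print(neurons_per_layer([1, 1, 0, 1, 0, 0, 1, 1], groups_per_layer=3, n_layer=2, batch=2))
```

Each group variable stands for `batch` neurons, which is why the result is multiplied by `batch`. The original code gets the same number by repeating the list slice `batch` times before summing.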
|
20
powerinfer-py/pyproject.toml
Normal file
|
@ -0,0 +1,20 @@
|
||||||
|
[build-system]
|
||||||
|
requires = [
|
||||||
|
"flit_core >=3.2,<4",
|
||||||
|
]
|
||||||
|
build-backend = "flit_core.buildapi"
|
||||||
|
|
||||||
|
[project]
|
||||||
|
name = "powerinfer"
|
||||||
|
authors = [
|
||||||
|
{name = "Holden", email = "hodlenx@gmail.com"},
|
||||||
|
]
|
||||||
|
requires-python = ">=3.9"
|
||||||
|
classifiers = ["License :: OSI Approved :: MIT License"]
|
||||||
|
version="0.0.1"
|
||||||
|
description="powerinfer.py: Python helpers for PowerInfer LLM inference engine"
|
||||||
|
|
||||||
|
dependencies = [
|
||||||
|
"torch>=2",
|
||||||
|
"cvxopt==1.3.2"
|
||||||
|
]
|
|
@ -1,3 +1,4 @@
|
||||||
numpy==1.24.4
|
numpy==1.24.4
|
||||||
sentencepiece==0.1.98
|
sentencepiece==0.1.98
|
||||||
-e ./gguf-py
|
-e ./gguf-py
|
||||||
|
-e ./powerinfer-py
|
|
@ -1,142 +0,0 @@
|
||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
import argparse
|
|
||||||
import torch
|
|
||||||
import torch.nn as tnn
|
|
||||||
from pathlib import Path
|
|
||||||
import os
|
|
||||||
import re
|
|
||||||
import struct
|
|
||||||
from typing import Any, BinaryIO
|
|
||||||
import numpy as np
|
|
||||||
import pickle
|
|
||||||
|
|
||||||
class ReluMLP(tnn.Module):
|
|
||||||
def __init__(self, input_dim, hidden_dim, output_dim):
|
|
||||||
super(ReluMLP, self).__init__()
|
|
||||||
self.fc1 = tnn.Linear(input_dim, hidden_dim, bias=False)
|
|
||||||
self.relu = tnn.ReLU()
|
|
||||||
self.fc2 = tnn.Linear(hidden_dim, output_dim, bias=False)
|
|
||||||
|
|
||||||
def forward(self, x):
|
|
||||||
x = self.fc1(x)
|
|
||||||
x = self.relu(x)
|
|
||||||
x = self.fc2(x)
|
|
||||||
return x
|
|
||||||
|
|
||||||
|
|
||||||
def _load_mlp_model(model_file: Path):
|
|
||||||
model = torch.load(model_file)
|
|
||||||
# hidden_size, input_size = model.get("fc1.weight").shape
|
|
||||||
# output_size, _ = model.get("fc2.weight").shape
|
|
||||||
# mlp = ReluMLP(input_size, hidden_size, output_size)
|
|
||||||
# mlp.load_state_dict(model)
|
|
||||||
return model
|
|
||||||
|
|
||||||
|
|
||||||
def load_mlp_predictors(models_base: Path):
|
|
||||||
# TODO: might need a specification file to indicate which models to load.
|
|
||||||
# But for now, let's assume it is a plain directory of models_{0, ... , n_layers - 1}.pt
|
|
||||||
*_, files = next(os.walk(models_base))
|
|
||||||
return [_load_mlp_model(models_base / f"activation_{i}.pt") for i in range(len(files))]
|
|
||||||
|
|
||||||
|
|
||||||
def write_file_header(fout: BinaryIO, n_tensors: int) -> None:
|
|
||||||
fout.write(b"gglp"[::-1]) # magic (GGml mLP)
|
|
||||||
fout.write(struct.pack("i", 1)) # file version
|
|
||||||
# TODO: If we found we need more common parameters, we can add them here.
|
|
||||||
fout.write(struct.pack("i", n_tensors))
|
|
||||||
|
|
||||||
|
|
||||||
def write_tensor_header(
|
|
||||||
fout: BinaryIO, key: str, shape: tuple[int, ...], dtype: np.dtype
|
|
||||||
) -> None:
|
|
||||||
_NUMPY_TYPE_TO_FTYPE: dict[str, int] = {"float32": 0, "float16": 1, "int32": 18}
|
|
||||||
bkey = key.encode("utf-8")
|
|
||||||
fout.write(
|
|
||||||
struct.pack("iii", len(shape), len(bkey), _NUMPY_TYPE_TO_FTYPE[dtype.name])
|
|
||||||
)
|
|
||||||
fout.write(struct.pack("i" * len(shape), *shape))
|
|
||||||
fout.write(bkey)
|
|
||||||
# Aligns to 32 bytes
|
|
||||||
fout.seek((fout.tell() + 31) & -32)
|
|
||||||
|
|
||||||
|
|
||||||
# TODO: need to add more details in key name to indicate the network, layer number, etc.
|
|
||||||
def _translate_mlp_key(key: str) -> str:
|
|
||||||
match = re.match(r"^(fc\d+).weight$", key)
|
|
||||||
if not match or len(match.groups()) != 1:
|
|
||||||
raise ValueError(f"Unexpected key: {key}")
|
|
||||||
return f"{match.group(1)}.weight.mlp"
|
|
||||||
|
|
||||||
|
|
||||||
def append_mlp_model(fout: BinaryIO, model: ReluMLP) -> None:
|
|
||||||
model_dict = model.state_dict()
|
|
||||||
for k, v in model_dict.items():
|
|
||||||
key = _translate_mlp_key(k)
|
|
||||||
# torch.nn.Linear stores the weight matrix as (output_dim, input_dim), so does GGML.
|
|
||||||
weights = v.half().detach().numpy()
|
|
||||||
# GGML stores the weight matrix as (input_dim, output_dim)
|
|
||||||
dims = weights.shape[::-1]
|
|
||||||
print(
|
|
||||||
f"{k} => {key} {weights.shape} {weights.dtype} {weights.nbytes/1024/1024} MiB"
|
|
||||||
)
|
|
||||||
# TODO: add option to write in float32
|
|
||||||
write_tensor_header(fout, key, dims, np.dtype("float16"))
|
|
||||||
weights.tofile(fout)
|
|
||||||
|
|
||||||
def append_gpu_idx(fout: BinaryIO, activation, select_count) -> None:
|
|
||||||
values, indices = torch.topk(activation, k=int(select_count))
|
|
||||||
gpu_idx = torch.zeros_like(activation)
|
|
||||||
gpu_idx[indices] = 1
|
|
||||||
gpu_idx = gpu_idx.numpy().astype(np.int32)
|
|
||||||
weights = gpu_idx
|
|
||||||
dims = gpu_idx.shape[::-1]
|
|
||||||
key = "gpu_idx"
|
|
||||||
print(
|
|
||||||
f"{key} => {key} {weights.shape} {weights.dtype} {weights.nbytes/1024/1024} MiB"
|
|
||||||
)
|
|
||||||
write_tensor_header(fout, key, dims, np.dtype("int32"))
|
|
||||||
weights.tofile(fout)
|
|
||||||
|
|
||||||
indices = indices.numpy().astype(np.int32)
|
|
||||||
weights = indices
|
|
||||||
dims = weights.shape[::-1]
|
|
||||||
key = "gpu_bucket"
|
|
||||||
print(
|
|
||||||
f"{key} => {key} {weights.shape} {weights.dtype} {weights.nbytes/1024/1024} MiB"
|
|
||||||
)
|
|
||||||
write_tensor_header(fout, key, dims, np.dtype("int32"))
|
|
||||||
weights = np.sort(weights)
|
|
||||||
weights.tofile(fout)
|
|
||||||
|
|
||||||
def main(predictors_path: str, output_path: str, solver_path: str):
|
|
||||||
predictors = load_mlp_predictors(Path(predictors_path)) # predictor => activation count
|
|
||||||
n_tensors = len(predictors) * 2 # gpu_idx and gpu_bucket
|
|
||||||
print(f"found {len(predictors)} MLP adapters with {n_tensors} tensors")
|
|
||||||
with open(solver_path, "rb") as f:
|
|
||||||
loaded_lst = pickle.load(f)
|
|
||||||
# print(f"check solver {loaded_lst}")
|
|
||||||
with open(output_path, "wb") as fout:
|
|
||||||
fout.truncate()
|
|
||||||
write_file_header(fout, n_tensors=n_tensors)
|
|
||||||
for i, activation in enumerate(predictors):
|
|
||||||
print(f"appending gpu idx layer-{i}")
|
|
||||||
append_gpu_idx(fout, activation, loaded_lst[i])
|
|
||||||
# append_gpu_idx(fout, activation, (32768*0.0))
|
|
||||||
|
|
||||||
print(f"converted MLP adapters from {predictors_path} to {output_path}")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
parser = argparse.ArgumentParser()
|
|
||||||
parser.add_argument("predictors_path", help="path to the MLP predictors")
|
|
||||||
parser.add_argument(
|
|
||||||
"output_path",
|
|
||||||
help="path to the output GGML adapter",
|
|
||||||
default="./gpu-index.bin",
|
|
||||||
)
|
|
||||||
parser.add_argument("solver", help="path to the solver")
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
main(args.predictors_path, args.output_path, args.solver)
|
|
106
solver.py
|
@ -1,106 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding=utf-8
|
|
||||||
import argparse
|
|
||||||
from cvxopt.glpk import ilp
|
|
||||||
import numpy as np
|
|
||||||
from cvxopt import matrix
|
|
||||||
import torch
|
|
||||||
import pickle
|
|
||||||
|
|
||||||
# Set up command line arguments
|
|
||||||
parser = argparse.ArgumentParser(description='Optimize neuron activation based on VRAM capacity and other parameters.')
|
|
||||||
parser.add_argument('--activation_path', type=str, required=True, help='Path to the directory containing activation data.')
|
|
||||||
parser.add_argument('--neuron', type=int, default=8192*4, help='Total number of neurons in the network.')
|
|
||||||
parser.add_argument('--capacity', type=int, default=int(8192*4*32*0.1), help='Total VRAM capacity for the model.')
|
|
||||||
parser.add_argument('--layer', type=int, default=59, help='Total number of layers in the neural network.')
|
|
||||||
parser.add_argument('--batch', type=int, default=32, help='Batch size for processing.')
|
|
||||||
parser.add_argument('--threshold', type=int, default=512, help='Threshold for splitting a layer across multiple GPUs.')
|
|
||||||
parser.add_argument('--output', type=str, required=True, help='File path for the output pickle file.')
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Assigning command line arguments to variables
|
|
||||||
activation_path = args.activation_path
|
|
||||||
neuron = args.neuron
|
|
||||||
layer = args.layer
|
|
||||||
batch = args.batch
|
|
||||||
output_path = args.output
|
|
||||||
|
|
||||||
# Processing activation data
|
|
||||||
values = []
|
|
||||||
for i in range(layer):
|
|
||||||
# Load and sort activation data for each layer
|
|
||||||
freq = torch.load(f"{activation_path}/activation_{i}.pt")
|
|
||||||
freq, _ = torch.sort(freq, descending=True)
|
|
||||||
freq = freq * -1.0
|
|
||||||
freq = freq.view(-1, batch)
|
|
||||||
freq = freq.sum(dim=1)
|
|
||||||
freq = freq.tolist()
|
|
||||||
values += freq
|
|
||||||
|
|
||||||
# Padding zero values for additional constraints
|
|
||||||
for i in range(layer):
|
|
||||||
values += [0.0]
|
|
||||||
c = np.array(values, dtype=float)
|
|
||||||
c = matrix(c)
|
|
||||||
|
|
||||||
# Setting capacity and neuron count per batch
|
|
||||||
CAP = args.capacity
|
|
||||||
CAP = int(CAP / batch)
|
|
||||||
neuron = int(neuron / batch)
|
|
||||||
coeff = []
|
|
||||||
h = []
|
|
||||||
|
|
||||||
# Constraint 1: Total neuron activation constraint
|
|
||||||
lst = []
|
|
||||||
for i in range(neuron * layer):
|
|
||||||
lst.append(1)
|
|
||||||
for i in range(layer):
|
|
||||||
lst.append(0)
|
|
||||||
coeff.append(lst)
|
|
||||||
h.append(CAP)
|
|
||||||
|
|
||||||
# Constraint 2: Threshold constraint for GPU split per layer
|
|
||||||
for i in range(layer):
|
|
||||||
lst = [0] * (neuron * layer + layer)
|
|
||||||
for j in range(neuron):
|
|
||||||
lst[i * neuron + j] = -1
|
|
||||||
lst[neuron * layer + i] = int(args.threshold / batch)
|
|
||||||
coeff.append(lst)
|
|
||||||
h.append(0)
|
|
||||||
|
|
||||||
# Constraint 3: Upper bound on neuron activations
|
|
||||||
for i in range(layer):
|
|
||||||
lst = [0] * (neuron * layer + layer)
|
|
||||||
for j in range(neuron):
|
|
||||||
lst[i * neuron + j] = 1
|
|
||||||
lst[neuron * layer + i] = -1000000 # Big-M coefficient: the layer's indicator variable must be nonzero before any of its neurons can be selected
|
|
||||||
coeff.append(lst)
|
|
||||||
h.append(0)
|
|
||||||
|
|
||||||
# Convert lists to matrix format for ILP solver
|
|
||||||
coeff = np.array(coeff, dtype=float)
|
|
||||||
G = matrix(coeff)
|
|
||||||
h = np.array(h, dtype=float)
|
|
||||||
h = matrix(h)
|
|
||||||
|
|
||||||
# Define the set of integer and binary variables
|
|
||||||
I = set(range(neuron * layer + layer))
|
|
||||||
B = set()
|
|
||||||
|
|
||||||
# Solving the ILP problem
|
|
||||||
(status, x) = ilp(c, G, h, None, None, B, I)
|
|
||||||
print(f"ILP Status: {status}")
|
|
||||||
ans = list(x)
|
|
||||||
print(f"Total Activation Units: {sum(ans)}")
|
|
||||||
|
|
||||||
# Serialize the solution
|
|
||||||
serialize = []
|
|
||||||
for i in range(layer):
|
|
||||||
serialize.append(sum(ans[i * neuron:(i + 1) * neuron]) * batch)
|
|
||||||
|
|
||||||
aligned_lst = serialize
|
|
||||||
|
|
||||||
# Save the solution to a pickle file
|
|
||||||
with open(output_path, 'wb') as handle:
|
|
||||||
pickle.dump(aligned_lst, handle)
|
|
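The three constraint families built in `solver.py` can be reproduced on a toy instance without an ILP solver. The sketch below assumes the same variable layout as the script: `n_groups * n_layers` group variables followed by one indicator variable per layer. The names `build_constraints`, `is_feasible`, `min_groups`, and `big_m` are hypothetical, and the feasibility check assumes 0/1 values, which the original script does not state explicitly.

```python
def build_constraints(n_groups, n_layers, capacity, min_groups, big_m=1_000_000):
    """Return (rows, rhs) so that every row satisfies dot(row, x) <= rhs."""
    n_vars = n_groups * n_layers + n_layers
    rows, rhs = [], []

    # Constraint 1: the total number of selected groups must fit in capacity.
    rows.append([1] * (n_groups * n_layers) + [0] * n_layers)
    rhs.append(capacity)

    # Constraint 2: min_groups * y_layer - sum(x_layer) <= 0, so a layer
    # whose indicator is set must select at least min_groups groups.
    for layer in range(n_layers):
        row = [0] * n_vars
        for j in range(n_groups):
            row[layer * n_groups + j] = -1
        row[n_groups * n_layers + layer] = min_groups
        rows.append(row)
        rhs.append(0)

    # Constraint 3: sum(x_layer) - big_m * y_layer <= 0, so groups can only
    # be selected in a layer whose indicator is set.
    for layer in range(n_layers):
        row = [0] * n_vars
        for j in range(n_groups):
            row[layer * n_groups + j] = 1
        row[n_groups * n_layers + layer] = -big_m
        rows.append(row)
        rhs.append(0)

    return rows, rhs


def is_feasible(rows, rhs, x):
    """Check a candidate assignment against every constraint row."""
    return all(sum(a * v for a, v in zip(row, x)) <= b for row, b in zip(rows, rhs))


rows, rhs = build_constraints(n_groups=3, n_layers=2, capacity=4, min_groups=2)
ok = is_feasible(rows, rhs, [1, 1, 0, 0, 0, 0, 1, 0])         # layer 0 offloads two groups
too_few = is_feasible(rows, rhs, [1, 0, 0, 0, 0, 0, 1, 0])    # below the per-layer minimum
no_flag = is_feasible(rows, rhs, [1, 1, 0, 0, 0, 0, 0, 0])    # groups selected, indicator unset
```

Together, constraints 2 and 3 make each layer either contribute nothing to the GPU or contribute at least the minimum number of groups.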