We highly appreciate the great effort behind the [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp) fork. Our support for the RedPajama model is largely based on that implementation. We extended the model configuration, fixed a bug that occurred when the `use_parallel_residual` flag was set to `False` in the original implementation, and added chat support for RedPajama.
## Usage:
### RedPajama Chat model:
- Build the binaries with `make`:
make redpajama-chat quantize-gptneox
- Prepare the RedPajama Chat model (f16 and q4_0) for ggml, for example:
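A rough sketch of this step is shown below; the script path, model location, and output file names are illustrative assumptions and may differ in this repo:

```bash
# Convert the Hugging Face checkpoint to an f16 ggml file
# (script path and output locations are illustrative assumptions)
python ./examples/redpajama/scripts/convert_gptneox_to_ggml.py \
    /path/to/RedPajama-INCITE-Chat-3B-v1 \
    ./models/RedPajama-INCITE-Chat-3B-v1

# Quantize the f16 file to q4_0 (argument order may differ in your build)
./quantize-gptneox \
    ./models/RedPajama-INCITE-Chat-3B-v1/ggml-model-f16.bin \
    ./models/RedPajama-INCITE-Chat-3B-v1/ggml-model-q4_0.bin \
    q4_0
```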
- Run other quantized versions of the RedPajama Chat model (make sure you have prepared the f16 model before running this):
- Build the quantization binary if you have not already:
make quantize-gptneox
- Generate the quantized model. The supported types include q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, to run q4_1 you need to do the following conversion:
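As an illustrative sketch of the q4_1 conversion followed by a chat run (the file paths and command-line flags below are assumptions, not verified against this repo):

```bash
# Quantize the prepared f16 ggml file to q4_1 (paths are illustrative)
./quantize-gptneox \
    ./models/RedPajama-INCITE-Chat-3B-v1/ggml-model-f16.bin \
    ./models/RedPajama-INCITE-Chat-3B-v1/ggml-model-q4_1.bin \
    q4_1

# Chat with the q4_1 model (-m for the model path and -n for the number of
# tokens follow common llama.cpp-style conventions and are an assumption here)
./redpajama-chat -m ./models/RedPajama-INCITE-Chat-3B-v1/ggml-model-q4_1.bin -n 256
```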
### RedPajama Base/Instruct model:
- Run other quantized versions of the RedPajama Base/Instruct model (make sure you have prepared the f16 model before running this). You can then generate the quantized model; the supported types include q4_0, q4_1, q4_2, q5_0, q5_1, and q8_0. For example, for RedPajama-Base q8_0 you need to do the following conversion:
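A minimal sketch for quantizing RedPajama-Base to q8_0, assuming illustrative file paths (substitute your own f16 model location):

```bash
# Quantize the Base model's f16 ggml file to q8_0 (paths are illustrative)
./quantize-gptneox \
    ./models/RedPajama-INCITE-Base-3B-v1/ggml-model-f16.bin \
    ./models/RedPajama-INCITE-Base-3B-v1/ggml-model-q8_0.bin \
    q8_0
```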