Support all LLaMA models + change Q4_0 quantization storage

commit 007a8f6f45
parent 5f2f970d51
5 changed files with 404 additions and 205 deletions

README.md (44 lines changed):

@@ -17,12 +17,11 @@ The main goal is to run the model using 4-bit quantization on a MacBook.
 
 This was hacked in an evening - I have no idea if it works correctly.
 
-So far, I've tested just the 7B model.
-Here is a typical run:
+Here is a typical run using LLaMA-7B:
 
 ```java
-make -j && ./main -m ../LLaMA-4bit/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
+make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
 I llama.cpp build info:
 I UNAME_S: Darwin
 I UNAME_P: arm
 I UNAME_M: arm64
@@ -34,7 +33,7 @@ I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
 
 make: Nothing to be done for `default'.
 main: seed = 1678486056
-llama_model_load: loading model from '../LLaMA-4bit/7B/ggml-model-q4_0.bin' - please wait ...
+llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
 llama_model_load: n_vocab = 32000
 llama_model_load: n_ctx = 512
 llama_model_load: n_embd = 4096
@@ -110,6 +109,8 @@ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8
 
 ## Usage
 
 Here are the steps for the LLaMA-7B model:
 
 ```bash
 # build this repo
 git clone https://github.com/ggerganov/llama.cpp
@@ -133,9 +134,40 @@ python3 convert-pth-to-ggml.py models/7B/ 1
 ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
 ```
 
+For the bigger models, there are a few extra quantization steps. For example, for LLaMA-13B, converting to FP16 format
+will create 2 ggml files, instead of one:
+
+```bash
+ggml-model-f16.bin
+ggml-model-f16.bin.1
+```
+
+You need to quantize each of them separately like this:
+
+```bash
+./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_0.bin 2
+./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
+```
+
+Everything else is the same. Simply run:
+
+```bash
+./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128
+```
+
+The number of files generated for each model is as follows:
+
+```
+7B -> 1 file
+13B -> 2 files
+33B -> 4 files
+65B -> 8 files
+```
+
+When running the larger models, make sure you have enough disk space to store all the intermediate files.
+
 ## Limitations
 
-- Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook
 - Not sure if my tokenizer is correct. There are a few places where we might have a mistake:
   - https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/convert-pth-to-ggml.py#L79-L87
   - https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/utils.h#L65-L69
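
The Usage commands are spread across several hunks above (clone, the `make -j` shown in the typical run, the `convert-pth-to-ggml.py` call in the last hunk header, and the final `./main` invocation). Pulled together, a minimal end-to-end sketch for LLaMA-7B might look like the following; it assumes the original model weights are already in `./models/7B/`, and the `./quantize` line for 7B is an assumption generalized from the 13B commands added in this commit (single FP16 file, mode `2`):

```bash
# Build this repo (commands collected from the README hunks above)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Convert the 7B model to ggml FP16 format (7B produces a single output file)
python3 convert-pth-to-ggml.py models/7B/ 1

# Quantize to 4 bits -- assumed invocation, generalized from the 13B example in the diff
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# Run inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
```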
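
For the multi-file models (13B and up), the README additions quantize each FP16 part by hand. A small shell loop can do the same for any model size; this is only a sketch, assuming the `ggml-model-f16.bin`, `ggml-model-f16.bin.1`, ... naming and the `./quantize <in> <out> 2` invocation shown in the diff:

```bash
#!/usr/bin/env bash
# Quantize every FP16 part in a model directory to Q4_0.
# Sketch only: assumes the file naming and ./quantize usage from the README diff above.
set -e

MODEL_DIR=${1:-./models/13B}   # e.g. ./models/13B, ./models/33B, ./models/65B

# The larger models produce several intermediate files, so check free space first.
df -h .

for f16 in "$MODEL_DIR"/ggml-model-f16.bin*; do
  # ggml-model-f16.bin   -> ggml-model-q4_0.bin
  # ggml-model-f16.bin.1 -> ggml-model-q4_0.bin.1, and so on
  q4="${f16/ggml-model-f16/ggml-model-q4_0}"
  ./quantize "$f16" "$q4" 2
done
```

With the part counts listed in the diff (1, 2, 4, and 8 files for 7B, 13B, 33B, and 65B), the loop simply runs `./quantize` once per part instead of requiring the commands to be typed out by hand.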