Merge branch 'master' into concedo

# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	main.cpp
Concedo 2023-03-22 22:31:45 +08:00
commit 86c7457e24
25 changed files with 3028 additions and 1944 deletions

198
.github/ISSUE_TEMPLATE/custom.md vendored Normal file

@ -0,0 +1,198 @@
---
name: Custom issue template
about: Used to report user-related issues with the software
title: "[User] I encountered a problem .."
labels: ''
assignees: ''
---
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
# Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected `llama.cpp` to do.
# Current Behavior
Please provide a detailed written description of what `llama.cpp` did instead.
# Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is reproducible only under specific conditions.
* Physical (or virtual) hardware you are using, e.g. for Linux:
`$ lscpu`
* Operating System, e.g. for Linux:
`$ uname -a`
* SDK version, e.g. for Linux:
```
$ python3 --version
$ make --version
$ g++ --version
```
# Models
* The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [pull request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
* If your issue is with model conversion please verify the `sha256sum` of each of your `consolidated*.pth` and `ggml-model-XXX.bin` files to confirm that you have the correct model data files before logging an issue. [Latest sha256 sums for your reference](https://github.com/ggerganov/llama.cpp/issues/238).
* If your issue is with model generation quality then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
    * LLaMA:
        * [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
        * [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
    * GPT-3:
        * [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
    * GPT-3.5 / InstructGPT / ChatGPT:
        * [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
        * [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
# Failure Information (for bugs)
If this is a bug, please provide information about the failure. If it is not a bug, please remove the rest of this template.
# Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. step 1
2. step 2
3. step 3
4. etc.
# Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please **avoid using screenshots** if at all possible. Instead, copy and paste the console output and use [GitHub's Markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to format your logs for easy readability, e.g.:
```
llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c
llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
Virtualization: AMD-V
llama.cpp$ python3 --version
Python 3.10.9
llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy 1.24.2
numpydoc 1.5.0
sentencepiece 0.1.97
torch 1.13.1
torchvision 0.14.1
llama.cpp$ make --version | head -1
GNU Make 4.3
$ md5sum ./models/65B/ggml-model-q4_0.bin
dbdd682cce80e2d6e93cefc7449df487 ./models/65B/ggml-model-q4_0.bin
```
Here's a run with the Linux command [perf](https://www.brendangregg.com/perf.html)
```
llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
main: seed = 1679149377
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size = 2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: prompt: 'Please close your issue when it has been answered.'
main: number of tokens in prompt = 11
1 -> ''
12148 -> 'Please'
3802 -> ' close'
596 -> ' your'
2228 -> ' issue'
746 -> ' when'
372 -> ' it'
756 -> ' has'
1063 -> ' been'
7699 -> ' answered'
29889 -> '.'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
main: mem per token = 71159620 bytes
main: load time = 19309.95 ms
main: sample time = 168.62 ms
main: predict time = 223895.61 ms / 888.47 ms per token
main: total time = 246406.42 ms
Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
3636882.89 msec task-clock # 14.677 CPUs utilized
13509 context-switches # 3.714 /sec
2436 cpu-migrations # 0.670 /sec
10476679 page-faults # 2.881 K/sec
13133115082869 cycles # 3.611 GHz (16.77%)
29314462753 stalled-cycles-frontend # 0.22% frontend cycles idle (16.76%)
10294402631459 stalled-cycles-backend # 78.39% backend cycles idle (16.74%)
23479217109614 instructions # 1.79 insn per cycle
# 0.44 stalled cycles per insn (16.76%)
2353072268027 branches # 647.002 M/sec (16.77%)
1998682780 branch-misses # 0.08% of all branches (16.76%)
247.802177522 seconds time elapsed
3618.573072000 seconds user
18.491698000 seconds sys
```
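The environment details requested in this template can also be gathered by a small script. The following is a minimal sketch, assuming Python 3 is installed; the helper names `first_line` and `collect_environment` are hypothetical and are not part of llama.cpp.

```python
import platform
import shutil
import subprocess


def first_line(cmd):
    """Return the first line of a command's output, or None if unavailable."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.splitlines()
    return lines[0] if lines else None


def collect_environment():
    """Gather the version details listed under "Environment and Context"."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "make": first_line(["make", "--version"]),
        "g++": first_line(["g++", "--version"]),
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value or 'not found'}")
```

The printed lines can be pasted into the issue as text, which keeps the report searchable.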


@ -17,7 +17,7 @@ CXXV := $(shell $(CXX) --version | head -n 1)
# ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789 # ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
ifeq ($(UNAME_S),Darwin) ifeq ($(UNAME_S),Darwin)
ifneq ($(UNAME_P),arm) ifneq ($(UNAME_P),arm)
SYSCTL_M := $(shell sysctl -n hw.optional.arm64) SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
ifeq ($(SYSCTL_M),1) ifeq ($(SYSCTL_M),1)
# UNAME_P := arm # UNAME_P := arm
# UNAME_M := arm64 # UNAME_M := arm64
@ -30,8 +30,9 @@ endif
# Compile flags # Compile flags
# #
# keep standard at C11 and C++11
CFLAGS = -I. -O3 -DNDEBUG -std=c11 -fPIC CFLAGS = -I. -O3 -DNDEBUG -std=c11 -fPIC
CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++17 -fPIC CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
LDFLAGS = LDFLAGS =
# OS specific # OS specific
@ -52,6 +53,10 @@ ifeq ($(UNAME_S),NetBSD)
CFLAGS += -pthread CFLAGS += -pthread
CXXFLAGS += -pthread CXXFLAGS += -pthread
endif endif
ifeq ($(UNAME_S),OpenBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),Haiku) ifeq ($(UNAME_S),Haiku)
CFLAGS += -pthread CFLAGS += -pthread
CXXFLAGS += -pthread CXXFLAGS += -pthread
@ -95,30 +100,59 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
ifneq (,$(findstring sse3,$(SSE3_M))) ifneq (,$(findstring sse3,$(SSE3_M)))
CFLAGS += -msse3 CFLAGS += -msse3
endif endif
AVX512F_M := $(shell grep "avx512f " /proc/cpuinfo)
ifneq (,$(findstring avx512f,$(AVX512F_M)))
CFLAGS += -mavx512f
endif
AVX512BW_M := $(shell grep "avx512bw " /proc/cpuinfo)
ifneq (,$(findstring avx512bw,$(AVX512BW_M)))
CFLAGS += -mavx512bw
endif
AVX512DQ_M := $(shell grep "avx512dq " /proc/cpuinfo)
ifneq (,$(findstring avx512dq,$(AVX512DQ_M)))
CFLAGS += -mavx512dq
endif
AVX512VL_M := $(shell grep "avx512vl " /proc/cpuinfo)
ifneq (,$(findstring avx512vl,$(AVX512VL_M)))
CFLAGS += -mavx512vl
endif
AVX512CD_M := $(shell grep "avx512cd " /proc/cpuinfo)
ifneq (,$(findstring avx512cd,$(AVX512CD_M)))
CFLAGS += -mavx512cd
endif
AVX512ER_M := $(shell grep "avx512er " /proc/cpuinfo)
ifneq (,$(findstring avx512er,$(AVX512ER_M)))
CFLAGS += -mavx512er
endif
AVX512IFMA_M := $(shell grep "avx512ifma " /proc/cpuinfo)
ifneq (,$(findstring avx512ifma,$(AVX512IFMA_M)))
CFLAGS += -mavx512ifma
endif
AVX512PF_M := $(shell grep "avx512pf " /proc/cpuinfo)
ifneq (,$(findstring avx512pf,$(AVX512PF_M)))
CFLAGS += -mavx512pf
endif
else ifeq ($(UNAME_S),Haiku) else ifeq ($(UNAME_S),Haiku)
AVX1_M := $(shell sysinfo -cpu | grep "AVX ") AVX1_M := $(shell sysinfo -cpu | grep -w "AVX")
ifneq (,$(findstring avx,$(AVX1_M))) ifneq (,$(findstring AVX,$(AVX1_M)))
CFLAGS += -mavx CFLAGS += -mavx
endif endif
AVX2_M := $(shell sysinfo -cpu | grep "AVX2 ") AVX2_M := $(shell sysinfo -cpu | grep -w "AVX2")
ifneq (,$(findstring avx2,$(AVX2_M))) ifneq (,$(findstring AVX2,$(AVX2_M)))
CFLAGS += -mavx2 CFLAGS += -mavx2
endif endif
FMA_M := $(shell sysinfo -cpu | grep "FMA ") FMA_M := $(shell sysinfo -cpu | grep -w "FMA")
ifneq (,$(findstring fma,$(FMA_M))) ifneq (,$(findstring FMA,$(FMA_M)))
CFLAGS += -mfma CFLAGS += -mfma
endif endif
F16C_M := $(shell sysinfo -cpu | grep "F16C ") F16C_M := $(shell sysinfo -cpu | grep -w "F16C")
ifneq (,$(findstring f16c,$(F16C_M))) ifneq (,$(findstring F16C,$(F16C_M)))
CFLAGS += -mf16c CFLAGS += -mf16c
endif endif
else else
CFLAGS += -mfma -mf16c -mavx -mavx2 CFLAGS += -mfma -mf16c -mavx -mavx2
endif endif
endif endif
ifeq ($(UNAME_M),amd64)
CFLAGS += -mavx -mavx2 -mfma -mf16c
endif
ifneq ($(filter ppc64%,$(UNAME_M)),) ifneq ($(filter ppc64%,$(UNAME_M)),)
POWER9_M := $(shell grep "POWER9" /proc/cpuinfo) POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
ifneq (,$(findstring POWER9,$(POWER9_M))) ifneq (,$(findstring POWER9,$(POWER9_M)))
@ -130,7 +164,8 @@ ifneq ($(filter ppc64%,$(UNAME_M)),)
endif endif
endif endif
ifndef LLAMA_NO_ACCELERATE ifndef LLAMA_NO_ACCELERATE
# Mac M1 - include Accelerate framework # Mac M1 - include Accelerate framework.
# `-framework Accelerate` works on Mac Intel as well, with negligible performance boost (as of the predict time).
ifeq ($(UNAME_S),Darwin) ifeq ($(UNAME_S),Darwin)
CFLAGS += -DGGML_USE_ACCELERATE CFLAGS += -DGGML_USE_ACCELERATE
LDFLAGS += -framework Accelerate LDFLAGS += -framework Accelerate
@ -185,6 +220,9 @@ default: main llamalib quantize
ggml.o: ggml.c ggml.h ggml.o: ggml.c ggml.h
$(CC) $(CFLAGS) -c ggml.c -o ggml.o $(CC) $(CFLAGS) -c ggml.c -o ggml.o
llama.o: llama.cpp llama.h
$(CXX) $(CXXFLAGS) -c llama.cpp -o llama.o
utils.o: utils.cpp utils.h utils.o: utils.cpp utils.h
$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o $(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
@ -194,15 +232,16 @@ extra.o: extra.cpp extra.h
clean: clean:
rm -f *.o main quantize rm -f *.o main quantize
main: main.cpp ggml.o utils.o extra.o main: main.cpp ggml.o extra.o utils.o
$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o extra.o -o main $(LDFLAGS) $(CXX) $(CXXFLAGS) main.cpp ggml.o extra.o utils.o -o main $(LDFLAGS)
./main -h @echo "\x1b[36mrun ./main -h for help\x1b[0m"
llamalib: expose.cpp ggml.o utils.o extra.o llamalib: expose.cpp ggml.o utils.o extra.o
$(CXX) $(CXXFLAGS) expose.cpp ggml.o utils.o extra.o -shared -o llamacpp.dll $(LDFLAGS) $(CXX) $(CXXFLAGS) expose.cpp ggml.o utils.o extra.o -shared -o llamacpp.dll $(LDFLAGS)
quantize: quantize.cpp ggml.o utils.o
$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS) quantize: quantize.cpp ggml.o llama.o utils.o
$(CXX) $(CXXFLAGS) quantize.cpp ggml.o llama.o utils.o -o quantize $(LDFLAGS)
# #
# Tests # Tests
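The AVX-512 detection added in this Makefile greps `/proc/cpuinfo` for each feature name and appends the matching `-m` option to `CFLAGS`. The sketch below shows the same mapping in Python, under two assumptions: it reads a Linux-style `flags` line, and it matches whole tokens rather than using the Makefile's substring search. The names `cpu_flags` and `avx512_cflags` are hypothetical.

```python
# Feature names taken from the Makefile hunk above.
AVX512_FEATURES = [
    "avx512f", "avx512bw", "avx512dq", "avx512vl",
    "avx512cd", "avx512er", "avx512ifma", "avx512pf",
]


def cpu_flags(cpuinfo_text):
    """Return the set of feature names from the first "flags" line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_cflags(cpuinfo_text):
    """Return the -mavx512* options for the features the CPU reports."""
    present = cpu_flags(cpuinfo_text)
    return ["-m" + name for name in AVX512_FEATURES if name in present]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    print(" ".join(avx512_cflags(text)))
```

Whole-token matching avoids accidental hits on longer names that merely contain the searched string, which is why the Makefile's grep patterns carry a trailing space.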

53
SHA256SUMS Normal file

@ -0,0 +1,53 @@
700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d models/7B/consolidated.00.pth
abe4aec2cdc297e2916011f66c7efd6fb4424e0e84315503005b5c118358cc22 models/7B/ggml-model-f16.bin
f495fa02a0b5ef265e1864d9680eede7fd23a60b0a2f93edba8091e2a4ca68b9 models/7B/ggml-model-q4_0.bin
7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265 models/7B/params.json
745bf4e29a4dd6f411e72976d92b452da1b49168a4f41c951cfcc8051823cf08 models/13B/consolidated.00.pth
d5ccbcc465c71c0de439a5aeffebe8344c68a519bce70bc7f9f92654ee567085 models/13B/consolidated.01.pth
a6bd0537c6873f36c47292df0b6f794e1135f5aafb89c3343bcc9e93264bf167 models/13B/ggml-model-f16.bin
0fb0951b90f2ec46c1f2f2372af5dacb4614b27e9fb6c10c69fbec58d7dd0e36 models/13B/ggml-model-f16.bin.1
1c218ba37ae61e15e35efd9949c78d6edf553b6280824c263cad56ae0b9d5a8f models/13B/ggml-model-q4_0.bin
c37a20c2ab9fa74b006b389085660269ee06110d1e45a494eb57d4602c9bcdb2 models/13B/ggml-model-q4_0.bin.1
4ab77bec4d4405ccb66a97b282574c89a94417e3c32e5f68f37e2876fc21322f models/13B/params.json
e23294a58552d8cdec5b7e8abb87993b97ea6eced4178ff2697c02472539d067 models/30B/consolidated.00.pth
4e077b7136c7ae2302e954860cf64930458d3076fcde9443f4d0e939e95903ff models/30B/consolidated.01.pth
24a87f01028cbd3a12de551dcedb712346c0b5cbdeff1454e0ddf2df9b675378 models/30B/consolidated.02.pth
1adfcef71420886119544949767f6a56cb6339b4d5fcde755d80fe68b49de93b models/30B/consolidated.03.pth
def20ea508f4e36793719f857471e85b85f96e497a2cbffbbaa1b60e2b18202c models/30B/ggml-model-f16.bin
b37040aa67fa8608cb2d8e0719132cf3e267fd35ec1e2f0d37dbc9fa43d674f1 models/30B/ggml-model-f16.bin.1
e7f263557e99069fe29003262ea5fa9ed885dbe79069083e6eb569b328cf30d3 models/30B/ggml-model-f16.bin.2
2ad6a23af05eb720f202f63d130f4fc5de9b6d2efc95b921be003209a56695aa models/30B/ggml-model-f16.bin.3
7de31d005e6d02ebd9603b2cf5329ad2f832b65d08873a098c5cafc4046cb9ed models/30B/ggml-model-q4_0.bin
f91feef9f30f9a023616db2e91297ca6d5d5d7b9eb351e452a82115c46f7da9e models/30B/ggml-model-q4_0.bin.1
66f3a0916ac7a81839153eb061fa861030ed1892477c2f7af2ce4f98d2f6d06f models/30B/ggml-model-q4_0.bin.2
e3c587ba97f83d2088b001bcda3026571065649ee3090bef6743a51390b01d3b models/30B/ggml-model-q4_0.bin.3
2c07118ea98d69dbe7810d88520e30288fa994751b337f8fca02b171955f44cb models/30B/params.json
135c563f6b3938114458183afb01adc9a63bef3d8ff7cccc3977e5d3664ecafe models/65B/consolidated.00.pth
9a600b37b19d38c7e43809485f70d17d1dc12206c07efa83bc72bb498a568bde models/65B/consolidated.01.pth
e7babf7c5606f165a3756f527cb0fedc4f83e67ef1290391e52fb1cce5f26770 models/65B/consolidated.02.pth
73176ffb426b40482f2aa67ae1217ef79fbbd1fff5482bae5060cdc5a24ab70e models/65B/consolidated.03.pth
882e6431d0b08a8bc66261a0d3607da21cbaeafa96a24e7e59777632dbdac225 models/65B/consolidated.04.pth
a287c0dfe49081626567c7fe87f74cce5831f58e459b427b5e05567641f47b78 models/65B/consolidated.05.pth
72b4eba67a1a3b18cb67a85b70f8f1640caae9b40033ea943fb166bd80a7b36b models/65B/consolidated.06.pth
d27f5b0677d7ff129ceacd73fd461c4d06910ad7787cf217b249948c3f3bc638 models/65B/consolidated.07.pth
7eba2625260cd91f8de901fd9704a1aa39448425514a335a0d3878de4ab9dc77 models/65B/ggml-model-f16.bin
f6aa886575df0785d4231f30cc776d499ccde18857818effc0378c65b178e0b5 models/65B/ggml-model-f16.bin.1
076037141682f5d7537955058c4740ab27f285aa4588915f830874a589c0693d models/65B/ggml-model-f16.bin.2
7853d96d2903ad7de2b2a89c4acf5a33a2f8e3c24ac39c9df6b44cdb42bf530a models/65B/ggml-model-f16.bin.3
b16b7b941abb3bc03a14df1656140855e9360a5371c83e919b9da83a72362314 models/65B/ggml-model-f16.bin.4
5291270216f888697695acb78ef28df0c080f9e85d3245c92fb9992d1fde6678 models/65B/ggml-model-f16.bin.5
0685ee77715f34686841006f8f94d3e7eaf148b97cecc9d3eee72808b0f7989c models/65B/ggml-model-f16.bin.6
00d993d73bb21d7c29388ffe0dced008cbaa0d391831dea77d7eb8f0b5c404b9 models/65B/ggml-model-f16.bin.7
4e398f05842206e08cdc5e7bb4f6c7c34b9dc373435ece6f261b14b7b4fe9b89 models/65B/ggml-model-q4_0.bin
4c4e899e3b12d9f57c9dcea5a1fb41bbc72023323535551f6273582ca7d7294b models/65B/ggml-model-q4_0.bin.1
d7b4594bbbd192043b3db0e5acc2561c42e6944e1cb91cc6e61510eee89dbcd8 models/65B/ggml-model-q4_0.bin.2
9a099d271648863d923d0d097391ea0bc75591f27a2ca3a327760f42e6b69af2 models/65B/ggml-model-q4_0.bin.3
5ee474051e418c5732b7949190b084d9d679db447f83c1de0d2a82daaa1a0cfa models/65B/ggml-model-q4_0.bin.4
a45aa05e7212bd6782790722d68056c5419667ea6b564ccc94bbcb8111d79b8b models/65B/ggml-model-q4_0.bin.5
a58fda714b759c28ad5e4c1d8bf8fda7b158fd5e4c4a49f851f36342fa97a105 models/65B/ggml-model-q4_0.bin.6
a3540cfcbcda33c223c6b0d606034adbd78f17e0e5de1582b78795e78754f7a8 models/65B/ggml-model-q4_0.bin.7
999ed1659b469ccc2a941714c0a9656fa571d17c9f7c8c7589817ca90edef51b models/65B/params.json
1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13 models/alpaca-7B/ggml-model-q4_0.bin
e17730c6b62b565b098af023ca446dcb9e3535d4222ead6369c7aae67207eb3d models/alpaca-13B/ggml-model-q4_0.bin
9bcd1bb30e679c939f367be11b030fe20b3eb9a3606b9bc4106420f1827b6ae4 models/alpaca-30B/ggml-model-q4_0.bin
36079249f53c292a4c2302d7784005dcae94c865f0bedfdbfa51d9ddad402935 models/alpaca-30B/params.json
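The checksums listed above can be compared against local files. Where coreutils is available, `sha256sum -c SHA256SUMS` does this directly. The following is a minimal Python sketch of the same check, assuming each line has the form `<hex digest> <path>`; the function names `sha256_of`, `parse_sums`, and `check` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte models do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse "<hex digest> <path>" lines into (digest, path) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            entries.append((parts[0].lower(), parts[1].strip()))
    return entries


def check(sums_file="SHA256SUMS", root="."):
    """Return {path: "OK" | "MISMATCH" | "MISSING"} for each listed file."""
    results = {}
    for expected, rel_path in parse_sums(Path(sums_file).read_text()):
        target = Path(root) / rel_path
        if not target.is_file():
            results[rel_path] = "MISSING"
        elif sha256_of(target) == expected:
            results[rel_path] = "OK"
        else:
            results[rel_path] = "MISMATCH"
    return results


if __name__ == "__main__":
    if Path("SHA256SUMS").is_file():
        for path, status in check().items():
            print(f"{status:8} {path}")
```

Files that were never downloaded are reported as `MISSING` rather than treated as errors, since most users only hold one or two model sizes.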


@ -3,4 +3,4 @@
# Temporary script - will be removed in the future # Temporary script - will be removed in the future
# #
./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.96 --repeat_penalty 1 -t 7 ./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7

6
chat.sh Executable file

@ -0,0 +1,6 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

172
convert-gptq-to-ggml.py Normal file

@ -0,0 +1,172 @@
# Convert a GPTQ quantized LLaMA model to a ggml compatible file
# Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
#
import os
import re
import sys
import json
import struct
import numpy as np
import torch
from sentencepiece import SentencePieceProcessor
if len(sys.argv) != 4:
    print("Usage: convert-gptq-to-ggml.py llamaXXb-4bit.pt tokenizer.model out.bin\n")
    sys.exit(1)

fname_model = sys.argv[1]
fname_tokenizer = sys.argv[2]
dir_out = sys.argv[3]

model = torch.load(fname_model, map_location="cpu")

n_vocab, n_embd = model['model.embed_tokens.weight'].shape
n_layer = 1 + max(int(m.group(1)) for name in model
                  if (m := re.match(r'model\.layers\.([0-9]+)', name)))
# hardcoded:
n_mult = 256
n_head = {32: 32, 40: 40, 60: 52, 80: 64}[n_layer]
tokenizer = SentencePieceProcessor(fname_tokenizer)
assert tokenizer.vocab_size() == n_vocab
fname_out = sys.argv[3]
fout = open(fname_out, "wb")
fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
fout.write(struct.pack("i", n_vocab))
fout.write(struct.pack("i", n_embd))
fout.write(struct.pack("i", n_mult))
fout.write(struct.pack("i", n_head))
fout.write(struct.pack("i", n_layer))
fout.write(struct.pack("i", n_embd // n_head)) # rot (obsolete)
fout.write(struct.pack("i", 4))
# This loop unchanged from convert-pth-to-ggml.py:
for i in range(tokenizer.vocab_size()):
    if tokenizer.is_unknown(i):
        # "<unk>" token (translated as ??)
        text = " \u2047 ".encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
    elif tokenizer.is_control(i):
        # "<s>"/"</s>" tokens
        fout.write(struct.pack("i", 0))
    elif tokenizer.is_byte(i):
        # "<U+XX>" tokens (which may be invalid UTF-8)
        piece = tokenizer.id_to_piece(i)
        if len(piece) != 6:
            print("Invalid token: " + piece)
            sys.exit(1)
        byte_value = int(piece[3:-1], 16)
        fout.write(struct.pack("i", 1))
        fout.write(struct.pack("B", byte_value))
    else:
        # normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
        text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
def write_header(shape, dst_name, ftype_cur):
    sname = dst_name.encode('utf-8')
    fout.write(struct.pack("iii", len(shape), len(sname), ftype_cur))
    fout.write(struct.pack("i" * len(shape), *shape[::-1]))
    fout.write(sname)

def convert_non_q4(src_name, dst_name):
    v = model[src_name]
    shape = v.shape
    print("Processing non-Q4 variable: " + src_name + " with shape: ", shape, " and type: ", v.dtype)
    if len(shape) == 1:
        print(" Converting to float32")
        v = v.to(torch.float32)

    ftype_cur = {torch.float16: 1, torch.float32: 0}[v.dtype]

    # header
    write_header(shape, dst_name, ftype_cur)

    # data
    v.numpy().tofile(fout)
def convert_q4(src_name, dst_name, permute=False):
    zeros = model[f"{src_name}.zeros"].numpy()
    scales = model[f"{src_name}.scales"].numpy()
    bias = model[f"{src_name}.bias"].numpy()
    qweight = model[f"{src_name}.qweight"].numpy().T # transpose

    # Q4_1 does not support bias; good thing the bias is always all zeros.
    assert not np.any(bias)

    # Each int32 item is actually 8 int4 items packed together, and it's transposed.
    shape = (qweight.shape[0], qweight.shape[1] * 8)

    print("Processing Q4 variable: " + src_name + " with shape: ", shape)

    # The output format has the int4 weights in groups of 32 rather than 8.
    # It looks like this:
    # For each row:
    #   For each group of 32 columns:
    #     - scale (float32, 4 bytes)
    #     - addend (float32, 4 bytes)
    #     - weights (int4 * 32, 16 bytes)
    # Note that in the input, the scales and addends are shared between all
    # the columns in a row, so we end up wasting quite a bit of memory with
    # repeated scales and addends.

    addends = -zeros # flip sign

    # Since the output format is mixed between integers and floats, we have
    # to hackily view the floats as int32s just so numpy will let us
    # concatenate them.
    addends_view = addends.view(dtype=np.int32)
    scales_view = scales.view(dtype=np.int32)

    # Split into groups of 4 columns (i.e. 32 columns of quantized data):
    grouped = qweight.reshape([qweight.shape[0], qweight.shape[1] // 4, 4])

    # Repeat addends and scales:
    addends_rep = np.atleast_3d(addends_view).repeat(grouped.shape[1], axis=1)
    scales_rep = np.atleast_3d(scales_view).repeat(grouped.shape[1], axis=1)

    blob = np.concatenate([scales_rep, addends_rep, grouped], axis=2, casting='no')

    if permute:
        # Permute some rows to undo the permutation done by convert_llama_weights_to_hf.py.
        # This can be done after the above conversion because it doesn't affect column order/layout.
        blob = (blob.reshape(n_head, 2, shape[0] // n_head // 2, *blob.shape[1:])
                    .swapaxes(1, 2)
                    .reshape(blob.shape))

    # header
    write_header(shape, dst_name, 3) # ftype = Q4_1

    # data
    blob.tofile(fout)
convert_non_q4("model.embed_tokens.weight", "tok_embeddings.weight")
convert_non_q4("model.norm.weight", "norm.weight")
convert_non_q4("lm_head.weight", "output.weight")
for i in range(n_layer):
    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
    convert_q4(f"model.layers.{i}.self_attn.k_proj", f"layers.{i}.attention.wk.weight", permute=True)
    convert_q4(f"model.layers.{i}.self_attn.v_proj", f"layers.{i}.attention.wv.weight")
    convert_q4(f"model.layers.{i}.self_attn.o_proj", f"layers.{i}.attention.wo.weight")

    convert_q4(f"model.layers.{i}.mlp.gate_proj", f"layers.{i}.feed_forward.w1.weight")
    convert_q4(f"model.layers.{i}.mlp.down_proj", f"layers.{i}.feed_forward.w2.weight")
    convert_q4(f"model.layers.{i}.mlp.up_proj", f"layers.{i}.feed_forward.w3.weight")

    convert_non_q4(f"model.layers.{i}.input_layernorm.weight", f"layers.{i}.attention_norm.weight")
    convert_non_q4(f"model.layers.{i}.post_attention_layernorm.weight", f"layers.{i}.ffn_norm.weight")
fout.close()
print("Done. Output file: " + fname_out)
print("")
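The block layout that `convert_q4` writes can be reproduced on toy data. The sketch below follows the same NumPy steps as the script (view floats as int32, repeat per group, concatenate) and then inspects one block. The array sizes and values are made up for illustration, and the `(rows, 1)` shape for scales and addends is an assumption inferred from how the script repeats them along axis 1.

```python
import numpy as np

rows, groups = 2, 3  # toy sizes; real tensors are much larger

# Assumed input shapes: one scale and one addend per row.
scales = np.array([[0.5], [0.25]], dtype=np.float32)
addends = np.array([[-1.0], [-2.0]], dtype=np.float32)

# Each int32 packs eight 4-bit weights; four of them form one group of 32.
qweight = np.arange(rows * groups * 4, dtype=np.int32).reshape(rows, groups * 4)
grouped = qweight.reshape(rows, groups, 4)

# Reinterpret the float bits as int32 (no value conversion) so that
# np.concatenate sees a single dtype.
scales_rep = np.atleast_3d(scales.view(np.int32)).repeat(groups, axis=1)
addends_rep = np.atleast_3d(addends.view(np.int32)).repeat(groups, axis=1)

blob = np.concatenate([scales_rep, addends_rep, grouped], axis=2, casting="no")

# One block = 2 floats + 4 packed int32 words = 24 bytes.
block = blob[0, 0].tobytes()
scale, addend = np.frombuffer(block, dtype=np.float32, count=2)

print(blob.shape)     # (2, 3, 6)
print(len(block))     # 24
print(scale, addend)  # 0.5 -1.0
```

Reading the first eight bytes back as float32 recovers the scale and then the addend, which matches the order of the arrays passed to `np.concatenate`.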


@ -10,12 +10,10 @@
# - Name (char[name_length]) # - Name (char[name_length])
# - Data (float[n_dims]) # - Data (float[n_dims])
# #
# By default, the bigger matrices are converted to 16-bit floats.
# This can be disabled by adding the "use-f32" CLI argument.
#
# At the start of the ggml file we write the model parameters # At the start of the ggml file we write the model parameters
# and vocabulary. # and vocabulary.
# #
import argparse import argparse
import os import os
import sys import sys
@ -23,13 +21,15 @@ import json
import struct import struct
import numpy as np import numpy as np
import torch import torch
from sentencepiece import SentencePieceProcessor from sentencepiece import SentencePieceProcessor
def parse_args(): def parse_args():
parser = argparse.ArgumentParser(description='Convert a LLaMA model checkpoint to a ggml compatible file') parser = argparse.ArgumentParser(description='Convert a LLaMA model checkpoint to a ggml compatible file')
parser.add_argument('dir_model', help='directory containing the model checkpoint') parser.add_argument('dir_model', help='directory containing the model checkpoint')
parser.add_argument('ftype', type=int, choices=[0, 1], default=1, help='file type (0: float32, 1: float16)') parser.add_argument('ftype', help='file type (0: float32, 1: float16)', type=int, choices=[0, 1], default=1)
parser.add_argument('vocab_only', help='only write vocab to file', type=int, default=0, nargs='?')
return parser.parse_args() return parser.parse_args()
def get_n_parts(dim): def get_n_parts(dim):
@ -67,7 +67,7 @@ def write_header(fout, hparams, ftype):
keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"] keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"]
values = [ values = [
0x67676d66, # magic: ggml in hex 0x67676d66, # magic: ggmf in hex
1, # file version 1, # file version
*[hparams[key] for key in keys], *[hparams[key] for key in keys],
hparams["dim"] // hparams["n_heads"], # rot (obsolete) hparams["dim"] // hparams["n_heads"], # rot (obsolete)
@ -134,6 +134,29 @@ def main():
ftype_str = ["f32", "f16"] ftype_str = ["f32", "f16"]
hparams, tokenizer = load_hparams_and_tokenizer(dir_model) hparams, tokenizer = load_hparams_and_tokenizer(dir_model)
print(args)
# if only writing vocab to file
if args.vocab_only:
fname_model = f"{dir_model}/consolidated.00.pth"
fname_out = f"{dir_model}/ggml-vocab.bin"
print(f"Extracting only the vocab from '{fname_model}'\n")
model = torch.load(fname_model, map_location="cpu")
with open(fname_out, "wb") as fout:
write_header(fout, hparams, ftype)
write_tokens(fout, tokenizer)
del model
print(f"Done. Output file: {fname_out}\n")
return
n_parts = get_n_parts(hparams["dim"]) n_parts = get_n_parts(hparams["dim"])
for p in range(n_parts): for p in range(n_parts):
@ -151,6 +174,7 @@ def main():
process_and_write_variables(fout, model, ftype) process_and_write_variables(fout, model, ftype)
del model del model
print(f"Done. Output file: {fname_out}, (part {p})\n") print(f"Done. Output file: {fname_out}, (part {p})\n")
if __name__ == "__main__": if __name__ == "__main__":
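The header comment change from "ggml" to "ggmf" reflects what the magic constant actually spells. A short sketch of how the two constants in this commit decode to ASCII follows. It assumes the header is written little-endian (the scripts use the native byte order of `struct.pack("i", ...)`), and the helper names `magic_to_text` and `read_magic` are hypothetical.

```python
import struct

MAGIC_GGML = 0x67676D6C  # written by convert-gptq-to-ggml.py
MAGIC_GGMF = 0x67676D66  # written by convert-pth-to-ggml.py, followed by a file version


def magic_to_text(magic):
    """Spell out the four ASCII characters encoded in a 32-bit magic value."""
    return magic.to_bytes(4, "big").decode("ascii")


def read_magic(data):
    """Decode the leading int32 of a header; assumes little-endian byte order."""
    (magic,) = struct.unpack("<i", data[:4])
    return magic


header = struct.pack("<ii", MAGIC_GGMF, 1)  # magic, then file version 1

print(magic_to_text(MAGIC_GGML))          # ggml
print(magic_to_text(read_magic(header)))  # ggmf
print(header[:4])                         # b'fmgg'
```

On a little-endian machine the bytes appear reversed on disk, so a hex dump of the file starts with `fmgg` rather than `ggmf`.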

53
examples/chatLLaMa Executable file

@ -0,0 +1,53 @@
#!/bin/bash
cd "$(dirname "$0")/.." || exit
MODEL="${MODEL:-./models/13B/ggml-model-q4_0.bin}"
USER_NAME="${USER_NAME:-User}"
AI_NAME="${AI_NAME:-ChatLLaMa}"
# Adjust to the number of CPU cores you want to use.
N_THREAD="${N_THREAD:-8}"
# Number of tokens to predict (made it larger than default because we want a long interaction)
N_PREDICTS="${N_PREDICTS:-2048}"
# Note: you can also override the generation options by specifying them on the command line:
# For example, override the context size by doing: ./chatLLaMa --ctx_size 1024
GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647}"
# shellcheck disable=SC2086 # Intended splitting of GEN_OPTIONS
./main $GEN_OPTIONS \
--model "$MODEL" \
--threads "$N_THREAD" \
--n_predict "$N_PREDICTS" \
--color --interactive \
--reverse-prompt "${USER_NAME}:" \
--prompt "
Text transcript of a never ending dialog, where ${USER_NAME} interacts with an AI assistant named ${AI_NAME}.
${AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer ${USER_NAME}'s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what ${USER_NAME} and ${AI_NAME} say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
$USER_NAME: Hello, $AI_NAME!
$AI_NAME: Hello $USER_NAME! How may I help you today?
$USER_NAME: What time is it?
$AI_NAME: It is $(date +%H:%M).
$USER_NAME: What year is it?
$AI_NAME: We are in $(date +%Y).
$USER_NAME: Please tell me the largest city in Europe.
$AI_NAME: The largest city in Europe is Moscow, the capital of Russia.
$USER_NAME: What can you tell me about Moscow?
$AI_NAME: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia's symbolic center.
$USER_NAME: What is a cat?
$AI_NAME: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
$USER_NAME: How do I pass command line arguments to a Node.js program?
$AI_NAME: The arguments are stored in process.argv.
argv[0] is the path to the Node.js executable.
argv[1] is the path to the script file.
argv[2] is the first argument passed to the script.
argv[3] is the second argument passed to the script and so on.
$USER_NAME: Name a color.
$AI_NAME: Blue
$USER_NAME:" "$@"


@ -39,31 +39,42 @@ extern "C" {
char text[16384]; //16kb should be enough for any response char text[16384]; //16kb should be enough for any response
}; };
gpt_params api_params;
gpt_vocab api_vocab;
llama_model api_model;
int api_n_past = 0;
gpt_vocab::id old_embd_id = -1;
std::vector<float> api_logits;
std::vector<gpt_vocab::id> last_n_tokens;
size_t mem_per_token = 0;
bool legacy_format = false; bool legacy_format = false;
llama_context_params ctx_params;
gpt_params params;
int n_past = 0;
llama_token old_embd_id = -1;
int n_threads = 4;
int n_batch = 8;
std::string model;
llama_context * ctx;
std::vector<llama_token> last_n_tokens;
bool load_model(const load_model_inputs inputs) bool load_model(const load_model_inputs inputs)
{ {
api_params.n_threads = inputs.threads; ctx_params = llama_context_default_params();
api_params.n_ctx = inputs.max_context_length;
api_params.n_batch = inputs.batch_size;
api_params.model = inputs.model_filename;
int n_parts_overwrite = inputs.n_parts_overwrite; n_threads = inputs.threads;
n_batch = inputs.batch_size;
model = inputs.model_filename;
int loadresult = llama_model_load(api_params.model, api_model, api_vocab, api_params.n_ctx, GGML_TYPE_F16, n_parts_overwrite); ctx_params.n_ctx = inputs.max_context_length;
if (!loadresult) { ctx_params.n_parts = inputs.n_parts_overwrite;
fprintf(stderr, "%s: failed to load model from '%s'\n", __func__, api_params.model.c_str()); ctx_params.seed = -1;
ctx_params.f16_kv = true;
ctx_params.logits_all = false;
ctx = llama_init_from_file(model.c_str(), ctx_params);
if (ctx == NULL) {
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, model.c_str());
return false; return false;
} }
legacy_format = (loadresult==2?true:false);
//return val: 0=fail, 1=legacy, 2=newformat
int fileformat = check_file_format(model.c_str());
legacy_format = (fileformat==1?true:false);
if(legacy_format) if(legacy_format)
{ {
printf("\n---\nWarning: Your model is using an OUTDATED format. Please reconvert it for better results!\n"); printf("\n---\nWarning: Your model is using an OUTDATED format. Please reconvert it for better results!\n");
@ -74,69 +85,76 @@ extern "C" {
generation_outputs generate(const generation_inputs inputs, generation_outputs & output) generation_outputs generate(const generation_inputs inputs, generation_outputs & output)
{ {
api_params.prompt = inputs.prompt; params.prompt = inputs.prompt;
api_params.seed = inputs.seed; params.seed = inputs.seed;
api_params.n_predict = inputs.max_length; params.n_predict = inputs.max_length;
api_params.top_k = inputs.top_k; params.top_k = inputs.top_k;
api_params.top_p = inputs.top_p; params.top_p = inputs.top_p;
api_params.temp = inputs.temperature; params.temp = inputs.temperature;
api_params.repeat_last_n = inputs.rep_pen_range; params.repeat_last_n = inputs.rep_pen_range;
api_params.repeat_penalty = inputs.rep_pen; params.repeat_penalty = inputs.rep_pen;
api_params.n_ctx = inputs.max_context_length; params.n_ctx = inputs.max_context_length;
params.n_batch = n_batch;
params.n_threads = n_threads;
bool reset_state = inputs.reset_state; bool reset_state = inputs.reset_state;
if(api_n_past==0) if(n_past==0)
{ {
reset_state = true; reset_state = true;
} }
if(api_params.repeat_last_n<1) if(params.repeat_last_n<1)
{ {
api_params.repeat_last_n = 1; params.repeat_last_n = 1;
} }
if(api_params.top_k<1) if(params.top_k<1)
{ {
api_params.top_k = 300; //to disable top_k we actually need to increase this value to a very high number params.top_k = 300; //to disable top_k we actually need to increase this value to a very high number
} }
if (api_params.seed < 0) if (params.seed <= 0)
{ {
api_params.seed = time(NULL); params.seed = time(NULL);
} }
if(reset_state)
{
params.prompt.insert(0, 1, ' ');
}
// tokenize the prompt
std::vector<llama_token> embd_inp;
if(legacy_format)
{
embd_inp = ::legacy_llama_tokenize(ctx, params.prompt, true);
}else{
embd_inp = ::llama_tokenize(ctx, params.prompt, true);
}
//params.n_predict = std::min(params.n_predict, params.n_ctx - (int) embd_inp.size());
//truncate the front of the prompt if it's too long
if (embd_inp.size() + params.n_predict > params.n_ctx) {
int offset = embd_inp.size() - params.n_ctx + params.n_predict;
embd_inp = std::vector<llama_token>(embd_inp.begin() + offset, embd_inp.end());
}
std::vector<llama_token> embd;
int last_n_size = params.repeat_last_n;
last_n_tokens.resize(last_n_size);
//display usage //display usage
// std::string tst = " "; // std::string tst = " ";
// char * tst2 = (char*)tst.c_str(); // char * tst2 = (char*)tst.c_str();
// gpt_print_usage(1,&tst2,api_params); // gpt_print_usage(1,&tst2,params);
if(reset_state) if(reset_state)
{ {
api_params.prompt.insert(0, 1, ' '); const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
} llama_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
// tokenize the prompt
std::vector<gpt_vocab::id> embd_inp;
if(legacy_format)
{
embd_inp = ::legacy_llama_tokenize(api_vocab, api_params.prompt, true);
}else{
embd_inp = ::llama_tokenize(api_vocab, api_params.prompt, true);
}
//api_params.n_predict = std::min(api_params.n_predict, api_model.hparams.n_ctx - (int)embd_inp.size());
//truncate to front of the prompt if its too long
if (embd_inp.size() + api_params.n_predict > api_model.hparams.n_ctx) {
int offset = embd_inp.size() - api_model.hparams.n_ctx + api_params.n_predict;
embd_inp = std::vector<gpt_vocab::id>(embd_inp.begin() + offset, embd_inp.end());
}
std::vector<gpt_vocab::id> embd;
int last_n_size = api_params.repeat_last_n;
last_n_tokens.resize(last_n_size);
if(reset_state)
{
llama_eval(api_model, api_params.n_threads, 0, {0, 1, 2, 3}, api_logits, mem_per_token);
std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0); std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
api_n_past = 0; n_past = 0;
}else{ }
else
{
//strip out the reset token (1) at the start of the embedding //strip out the reset token (1) at the start of the embedding
if(embd_inp.size()>0) if(embd_inp.size()>0)
{ {
@ -147,96 +165,97 @@ extern "C" {
embd.push_back(old_embd_id); embd.push_back(old_embd_id);
} }
} }
int remaining_tokens = api_params.n_predict; int remaining_tokens = params.n_predict;
int input_consumed = 0; int input_consumed = 0;
std::mt19937 api_rng(api_params.seed); std::mt19937 rng(params.seed);
std::string concat_output = ""; std::string concat_output = "";
bool startedsampling = false; bool startedsampling = false;
printf("\nProcessing Prompt: "); printf("\nProcessing Prompt: ");
while (remaining_tokens > 0)
{ while (remaining_tokens > 0)
gpt_vocab::id id = 0; {
// predict llama_token id = 0;
if (embd.size() > 0) // predict
{ if (embd.size() > 0)
{
printf("|");
// for (auto i: embd) { // for (auto i: embd) {
// std::cout << i << ','; // std::cout << i << ',';
// } // }
//printf("\nnp:%d embd:%d mem:%d",api_n_past,embd.size(),mem_per_token); // printf("\nnp:%d embd:%d",n_past,embd.size());
printf("|"); if (llama_eval(ctx, embd.data(), embd.size(), n_past, params.n_threads))
if (!llama_eval(api_model, api_params.n_threads, api_n_past, embd, api_logits, mem_per_token)) {
{ fprintf(stderr, "Failed to predict\n");
fprintf(stderr, "Failed to predict\n");
snprintf(output.text, sizeof(output.text), "%s", ""); snprintf(output.text, sizeof(output.text), "%s", "");
output.status = 0; output.status = 0;
return output; return output;
} }
} }
api_n_past += embd.size(); n_past += embd.size();
embd.clear(); embd.clear();
if (embd_inp.size() <= input_consumed) if ((int) embd_inp.size() <= input_consumed)
{ {
// out of user input, sample next token // out of user input, sample next token
const float top_k = api_params.top_k; const float top_k = params.top_k;
const float top_p = api_params.top_p; const float top_p = params.top_p;
const float temp = api_params.temp; const float temp = params.temp;
const float repeat_penalty = api_params.repeat_penalty; const float repeat_penalty = params.repeat_penalty;
const int n_vocab = api_model.hparams.n_vocab;
if(!startedsampling)
if(!startedsampling)
{ {
startedsampling = true; startedsampling = true;
printf("\nGenerating: "); printf("\nGenerating: ");
} }
{ {
// set the logit of the eos token (2) to zero to avoid sampling it auto logits = llama_get_logits(ctx);
api_logits[api_logits.size() - n_vocab + EOS_TOKEN_ID] = 0; // set the logit of the eos token (2) to zero to avoid sampling it
//set logits of opening square bracket to zero. logits[llama_token_eos()] = 0;
api_logits[api_logits.size() - n_vocab + 518] = 0; //set logits of opening square bracket to zero.
api_logits[api_logits.size() - n_vocab + 29961] = 0; logits[518] = 0;
logits[29961] = 0;
id = llama_sample_top_p_top_k(ctx, last_n_tokens.data(), last_n_tokens.size(), top_k, top_p, temp, repeat_penalty);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(id);
}
// add it to the context
old_embd_id = id;
embd.push_back(id);
id = llama_sample_top_p_top_k(api_vocab, api_logits.data() + (api_logits.size() - n_vocab), last_n_tokens, repeat_penalty, top_k, top_p, temp, api_rng); // decrement remaining sampling budget
--remaining_tokens;
//printf("\nid:%d word:%s\n",id,llama_token_to_str(ctx, id));
concat_output += llama_token_to_str(ctx, id);
}
else
{
// some user input remains from prompt or interaction, forward it to processing
while ((int) embd_inp.size() > input_consumed)
{
old_embd_id = embd_inp[input_consumed];
embd.push_back(embd_inp[input_consumed]);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(embd_inp[input_consumed]);
++input_consumed;
if ((int) embd.size() >= params.n_batch)
{
break;
}
}
}
last_n_tokens.erase(last_n_tokens.begin()); }
last_n_tokens.push_back(id);
} output.status = 1;
// add it to the context
old_embd_id = id;
embd.push_back(id);
// decrement remaining sampling budget
--remaining_tokens;
//printf("\nid:%d word:%s\n",id,api_vocab.id_to_token[id].c_str());
concat_output += api_vocab.id_to_token[id].c_str();
}
else
{
// some user input remains from prompt or interaction, forward it to processing
while (embd_inp.size() > input_consumed)
{
old_embd_id = embd_inp[input_consumed];
embd.push_back(embd_inp[input_consumed]);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(embd_inp[input_consumed]);
++input_consumed;
if (embd.size() > api_params.n_batch)
{
break;
}
}
}
}
//printf("output: %s",concat_output.c_str());
output.status = 1;
snprintf(output.text, sizeof(output.text), "%s", concat_output.c_str()); snprintf(output.text, sizeof(output.text), "%s", concat_output.c_str());
return output; return output;
} }
} }
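
Note on the prompt truncation in generate(): the new code drops tokens from the front of the prompt so that the prompt plus n_predict fits inside n_ctx. A minimal standalone sketch of that step follows, assuming plain int token ids. The helper name `truncate_prompt_front` is hypothetical and does not appear in the patch. Unlike the patch, the sketch clamps the offset, so a request with n_predict larger than n_ctx cannot index past the end of the vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the patch): discard leading tokens until
// prompt.size() + n_predict <= n_ctx, keeping the most recent tokens.
std::vector<int> truncate_prompt_front(const std::vector<int> & prompt, int n_predict, int n_ctx) {
    const int n_prompt = static_cast<int>(prompt.size());
    if (n_prompt + n_predict <= n_ctx) {
        return prompt; // already fits, nothing to drop
    }
    // Number of leading tokens to discard, clamped to the prompt length.
    const int offset = std::min(n_prompt, n_prompt + n_predict - n_ctx);
    return std::vector<int>(prompt.begin() + offset, prompt.end());
}
```

Doing the arithmetic in int also avoids the mixed signed/unsigned comparison between embd_inp.size() and params.n_ctx.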


@ -1,5 +1,6 @@
#include "extra.h" #include "extra.h"
#include "llama.cpp"
#include <cassert> #include <cassert>
#include <cstring> #include <cstring>
@ -17,13 +18,41 @@
#include <alloca.h> #include <alloca.h>
#endif #endif
//return val: 0=fail, 1=legacy, 2=newformat
int check_file_format(const std::string & fname)
{
std::vector<char> f_buf(1024*1024);
auto fin = std::ifstream(fname, std::ios::binary);
fin.rdbuf()->pubsetbuf(f_buf.data(), f_buf.size());
if (!fin) {
fprintf(stderr, "%s: failed to open '%s'\n", __func__, fname.c_str());
return 0; // 0 = fail, matching the documented return values
}
int fileformat = 0;
uint32_t magic;
fin.read((char *) &magic, sizeof(magic));
if (magic == LLAMA_FILE_MAGIC_UNVERSIONED) {
fileformat = 1;
}else{
fileformat = 2;
}
fin.close();
return fileformat;
}
// TODO: Calculate this constant from the vocabulary // TODO: Calculate this constant from the vocabulary
#define MAX_TOKEN_LEN 18 #define MAX_TOKEN_LEN 18
// SentencePiece implementation after https://guillaume-be.github.io/2020-05-30/sentence_piece // SentencePiece implementation after https://guillaume-be.github.io/2020-05-30/sentence_piece
std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos) { std::vector<llama_token> legacy_llama_tokenize(const llama_vocab & vocab, const std::string & text, bool bos) {
std::vector<gpt_vocab::id> res; std::vector<llama_token> res;
std::vector<int> score; std::vector<int> score;
std::vector<gpt_vocab::id> prev; std::vector<llama_token> prev;
int len = text.length(); int len = text.length();
score.resize(len + 1); score.resize(len + 1);
@ -50,14 +79,14 @@ std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const
// Backward pass // Backward pass
int i = len; int i = len;
while (i > 0) { while (i > 0) {
gpt_vocab::id token_id = prev[i]; llama_token token_id = prev[i];
if (token_id == 0) { if (token_id == 0) {
// TODO: Return error or something more meaningful // TODO: Return error or something more meaningful
printf("failed to tokenize string!\n"); printf("failed to tokenize string!\n");
break; break;
} }
res.push_back(token_id); res.push_back(token_id);
auto token = (*vocab.id_to_token.find(token_id)).second; auto token = vocab.id_to_token[token_id].tok;
i -= token.length(); i -= token.length();
} }
@ -68,5 +97,33 @@ std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const
// Pieces are in reverse order so correct that // Pieces are in reverse order so correct that
std::reverse(res.begin(), res.end()); std::reverse(res.begin(), res.end());
return res;
}
int legacy_llama_tokenize(
struct llama_context * ctx,
const char * text,
llama_token * tokens,
int n_max_tokens,
bool add_bos) {
auto res = legacy_llama_tokenize(ctx->vocab, text, add_bos);
if (n_max_tokens < (int) res.size()) {
fprintf(stderr, "%s: too many tokens\n", __func__);
return -((int) res.size());
}
for (size_t i = 0; i < res.size(); i++) {
tokens[i] = res[i];
}
return res.size();
}
std::vector<llama_token> legacy_llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
std::vector<llama_token> res(8096);
int n = legacy_llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
res.resize(n);
return res; return res;
} }
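
Note on the tokenizer wrappers: the C-style overload returns the token count on success and the negated required count when the buffer is too small, while the std::vector overload passes that value straight to resize(). The sketch below shows one way a caller could honor the contract by retrying with the reported size. It uses a stand-in tokenizer (one token per byte) purely for illustration; `toy_tokenize_c` and `toy_tokenize` are hypothetical names, not functions from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in tokenizer (assumption for illustration): one token per byte.
// Follows the same contract as the patch: returns the token count on success,
// or the negated required count when n_max_tokens is too small.
int toy_tokenize_c(const std::string & text, int * tokens, int n_max_tokens) {
    const int needed = static_cast<int>(text.size());
    if (n_max_tokens < needed) {
        return -needed;
    }
    for (int i = 0; i < needed; i++) {
        tokens[i] = static_cast<unsigned char>(text[i]);
    }
    return needed;
}

// Wrapper that retries with the exact size instead of trusting a fixed buffer.
std::vector<int> toy_tokenize(const std::string & text) {
    std::vector<int> res(4); // deliberately small initial guess
    int n = toy_tokenize_c(text, res.data(), static_cast<int>(res.size()));
    if (n < 0) {
        res.resize(-n); // grow to the size the callee asked for
        n = toy_tokenize_c(text, res.data(), static_cast<int>(res.size()));
    }
    res.resize(n);
    return res;
}
```

The point of the retry is that a negative count is never passed to resize(), where it would convert to a very large unsigned size.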


@ -11,4 +11,9 @@
#include <string> #include <string>
#include <vector> #include <vector>
std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos); #include "llama.h"
int check_file_format(const std::string & fname);
std::vector<llama_token> legacy_llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);

203
ggml.c

@ -2,7 +2,7 @@
#if defined(_MSC_VER) || defined(__MINGW32__) #if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW #include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__) && !defined(__NetBSD__) #elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
#include <alloca.h> #include <alloca.h>
#endif #endif
@ -361,7 +361,7 @@ static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);
// AVX routines provided by GH user Const-me // AVX routines provided by GH user Const-me
// ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600 // ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600
#if __AVX2__ #if __AVX2__ || __AVX512F__
// Unpack 32 4-bit fields into 32 bytes // Unpack 32 4-bit fields into 32 bytes
// The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval // The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval
static inline __m256i bytesFromNibbles( const uint8_t* rsi ) static inline __m256i bytesFromNibbles( const uint8_t* rsi )
@ -397,7 +397,6 @@ static inline __m128i packNibbles( __m256i bytes )
} }
#endif #endif
// method 5 // method 5
// blocks of QK elements // blocks of QK elements
// represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors) // represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors)
@ -1262,6 +1261,47 @@ inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float
*s = sumf; *s = sumf;
} }
#if __AVX512F__ && QK == 32
static inline __m512 dot_q4_0_oneblock_avx512(
__m512 acc,
const uint8_t * pd0,
const uint8_t * pd1,
const uint8_t * pb0,
const uint8_t * pb1,
size_t bs,
int i
) {
const float * d0_0 = (const float *) (pd0 + i*bs);
const float * d1_0 = (const float *) (pd1 + i*bs);
const uint8_t * restrict p0 = pb0 + (i+0)*bs;
const uint8_t * restrict p1 = pb1 + (i+0)*bs;
// Compute combined scale for the block
float scaleScalar = d0_0[0] * d1_0[0];
__m512 scale = _mm512_set1_ps( scaleScalar );
__m256i bx = bytesFromNibbles( p0 );
__m256i by = bytesFromNibbles( p1 );
// Now we have a vector with bytes in [ 0 .. 15 ] interval. Offset them into [ -8 .. +7 ] interval.
const __m256i off = _mm256_set1_epi8( 8 );
bx = _mm256_sub_epi8( bx, off );
by = _mm256_sub_epi8( by, off );
// Sign-extend 32 signed bytes into 32 int16_t lanes
__m512i x32 = _mm512_cvtepi8_epi16( bx );
__m512i y32 = _mm512_cvtepi8_epi16( by );
// Compute products of int16_t integers, add pairwise
__m512i i64 = _mm512_madd_epi16( x32, y32 );
// Convert int32_t to float
__m512 p = _mm512_cvtepi32_ps( i64 );
// Apply the scale, and accumulate
return _mm512_fmadd_ps( scale, p, acc );
}
#endif
inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) { inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
ggml_float sumf = 0.0; ggml_float sumf = 0.0;
@ -1417,6 +1457,40 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
#else #else
#error "not implemented for QK" #error "not implemented for QK"
#endif #endif
#elif defined(__AVX512F__)
#if QK == 32
// Initialize accumulator with zeros
__m512 acc0 = _mm512_setzero_ps();
__m512 acc1 = _mm512_setzero_ps();
const int superblock_size = 8;
const int superblock_count = nb / superblock_size;
const int remainder = nb % superblock_size;
for (int superblock_ix = 0; superblock_ix < superblock_count; superblock_ix += 1) {
int i = superblock_ix * superblock_size;
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+0 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+1 );
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+2 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+3 );
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+4 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+5 );
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+6 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+7 );
}
// Remainders
for (int i = superblock_count * superblock_size; i < nb; ++i) {
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i );
}
// Horizontal sum of all lanes of the accumulator
sumf = _mm512_reduce_add_ps( acc0 ) + _mm512_reduce_add_ps( acc1 );
#else
#error "not implemented for QK"
#endif
#elif defined(__AVX2__) #elif defined(__AVX2__)
#if QK == 32 #if QK == 32
const size_t countBlocks = nb; const size_t countBlocks = nb;
@ -1928,7 +2002,7 @@ inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, void * res
const size_t bs = 2*sizeof(float) + QK/2; const size_t bs = 2*sizeof(float) + QK/2;
const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs); const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs + sizeof(float)); const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs + sizeof(float));
const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float)); const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
for (int i = 0; i < nb; i++) { for (int i = 0; i < nb; i++) {
@ -10628,6 +10702,127 @@ enum ggml_opt_result ggml_opt(
//////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////////
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
const size_t pp_size = qk / 2;
uint8_t * pp = (uint8_t *) alloca(pp_size);
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
{
for (int l = 0; l < qk; l++) {
const float v = src[j + i*qk + l];
amax = MAX(amax, fabsf(v));
}
const float d = amax / ((1 << 3) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *) pd = d;
pd += bs;
for (int l = 0; l < qk; l += 2) {
const float v0 = (src[j + i*qk + l + 0])*id;
const float v1 = (src[j + i*qk + l + 1])*id;
const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
hist[vi0]++;
hist[vi1]++;
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, pp_size);
pb += bs;
}
}
}
return (n/k)*row_size;
}
size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (2*sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
const size_t pp_size = qk / 2;
uint8_t * pp = (uint8_t *) alloca(pp_size);
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pm = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
//printf("n = %d, k = %d, nb = %d, row_size = %d, j = %d, pm = %p, pd = %p, pb = %p\n", n, k, nb, row_size, j, pm, pd, pb);
for (int i = 0; i < nb; i++) {
float min = FLT_MAX;
float max = -FLT_MAX;
{
for (int l = 0; l < qk; l++) {
const float v = src[j + i*qk + l];
if (v < min) min = v;
if (v > max) max = v;
}
const float d = (max - min) / ((1 << 4) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *) pd = d;
*(float *) pm = min;
pd += bs;
pm += bs;
for (int l = 0; l < qk; l += 2) {
const float v0 = (src[j + i*qk + l + 0] - min)*id;
const float v1 = (src[j + i*qk + l + 1] - min)*id;
const uint8_t vi0 = round(v0);
const uint8_t vi1 = round(v1);
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
hist[vi0]++;
hist[vi1]++;
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, pp_size);
pb += bs;
}
}
}
return (n/k)*row_size;
}
////////////////////////////////////////////////////////////////////////////////
int ggml_cpu_has_avx(void) { int ggml_cpu_has_avx(void) {
#if defined(__AVX__) #if defined(__AVX__)
return 1; return 1;
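
Note on the AVX512 q4_0 kernel: each call to dot_q4_0_oneblock_avx512 multiplies the two block scales, computes an integer dot product of the offset nibbles, and accumulates scale times that sum. A scalar reference for a single block pair can be sketched as below. This is an illustrative reading of the kernel, not code from the patch; `dot_q4_0_block_scalar` is a hypothetical name, and the scales and packed bytes are passed separately rather than through the interleaved byte layout that ggml.c uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for one pair of q4_0 blocks.
// q0/q1 hold qk/2 bytes each; every byte packs two 4-bit values
// (low nibble first) stored with an offset of +8.
float dot_q4_0_block_scalar(float d0, const uint8_t * q0,
                            float d1, const uint8_t * q1, int qk) {
    int isum = 0;
    for (int j = 0; j < qk / 2; j++) {
        // Undo the +8 offset to get signed values in [-8, 7].
        const int x0 = (q0[j] & 0x0F) - 8;
        const int x1 = (q0[j] >> 4)   - 8;
        const int y0 = (q1[j] & 0x0F) - 8;
        const int y1 = (q1[j] >> 4)   - 8;
        isum += x0 * y0 + x1 * y1;
    }
    // Apply the combined scale once per block, as the vector kernel does.
    return d0 * d1 * static_cast<float>(isum);
}
```

A reference like this could be used to cross-check a SIMD path block by block.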

7
ggml.h

@ -741,6 +741,13 @@ enum ggml_opt_result ggml_opt(
struct ggml_opt_params params, struct ggml_opt_params params,
struct ggml_tensor * f); struct ggml_tensor * f);
//
// quantization
//
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist);
size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist);
// //
// system info // system info
// //
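
Note on the q4_0 format behind ggml_quantize_q4_0: each block of qk floats is stored as one float scale d = amax / 7 plus qk/2 bytes, where each byte packs two values round(x / d) + 8. The round trip for a single block can be sketched as follows. `BlockQ4_0`, `quantize_block_q4_0` and `dequantize_block_q4_0` are hypothetical names for illustration; the patch writes the same fields into a raw byte buffer and also updates a histogram, which is omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for one block (the patch uses a raw byte layout).
struct BlockQ4_0 {
    float d;                // scale: amax / 7
    std::vector<uint8_t> q; // qk/2 bytes, two 4-bit values per byte
};

BlockQ4_0 quantize_block_q4_0(const float * x, int qk) {
    float amax = 0.0f; // absolute max over the block
    for (int l = 0; l < qk; l++) {
        amax = std::max(amax, std::fabs(x[l]));
    }
    const float d  = amax / ((1 << 3) - 1);
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    BlockQ4_0 b{d, std::vector<uint8_t>(qk / 2)};
    for (int l = 0; l < qk; l += 2) {
        // Scaled values lie in [-7, 7]; adding 8 maps them into [1, 15].
        const int v0 = static_cast<int>(std::round(x[l + 0] * id)) + 8;
        const int v1 = static_cast<int>(std::round(x[l + 1] * id)) + 8;
        b.q[l / 2] = static_cast<uint8_t>(v0 | (v1 << 4));
    }
    return b;
}

std::vector<float> dequantize_block_q4_0(const BlockQ4_0 & b) {
    std::vector<float> out;
    out.reserve(b.q.size() * 2);
    for (uint8_t byte : b.q) {
        out.push_back(((byte & 0x0F) - 8) * b.d);
        out.push_back(((byte >> 4)   - 8) * b.d);
    }
    return out;
}
```

With this scheme the reconstruction error per element is bounded by roughly d / 2, which is why larger outliers in a block reduce the precision of every other value in it.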

1576
llama.cpp Normal file

File diff suppressed because it is too large

139
llama.h Normal file

@ -0,0 +1,139 @@
#ifndef LLAMA_H
#define LLAMA_H
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#ifdef LLAMA_SHARED
# ifdef _WIN32
# ifdef LLAMA_BUILD
# define LLAMA_API __declspec(dllexport)
# else
# define LLAMA_API __declspec(dllimport)
# endif
# else
# define LLAMA_API __attribute__ ((visibility ("default")))
# endif
#else
# define LLAMA_API
#endif
#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
#define LLAMA_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
#ifdef __cplusplus
extern "C" {
#endif
//
// C interface
//
// TODO: show sample usage
//
struct llama_context;
typedef int llama_token;
typedef struct llama_token_data {
llama_token id; // token id
float p; // probability of the token
float plog; // log probability of the token
} llama_token_data;
struct llama_context_params {
int n_ctx; // text context
int n_parts; // -1 for default
int seed; // RNG seed, 0 for random
bool f16_kv; // use fp16 for KV cache
bool logits_all; // the llama_eval() call computes all logits, not just the last one
bool vocab_only; // only load the vocabulary, no weights
};
LLAMA_API struct llama_context_params llama_context_default_params();
// Various functions for loading a ggml llama model.
// Allocate (almost) all memory needed for the model.
// Return NULL on failure
LLAMA_API struct llama_context * llama_init_from_file(
const char * path_model,
struct llama_context_params params);
// Frees all allocated memory
LLAMA_API void llama_free(struct llama_context * ctx);
// TODO: not great API - very likely to change
// Returns 0 on success
LLAMA_API int llama_model_quantize(
const char * fname_inp,
const char * fname_out,
int itype,
int qk);
// Run the llama inference to obtain the logits and probabilities for the next token.
// tokens + n_tokens is the provided batch of new tokens to process
// n_past is the number of tokens to use from previous eval calls
// Returns 0 on success
LLAMA_API int llama_eval(
struct llama_context * ctx,
const llama_token * tokens,
int n_tokens,
int n_past,
int n_threads);
// Convert the provided text into tokens.
// The tokens pointer must be large enough to hold the resulting tokens.
// Returns the number of tokens on success, no more than n_max_tokens
// Returns a negative number on failure - the number of tokens that would have been returned
// TODO: not sure if correct
LLAMA_API int llama_tokenize(
struct llama_context * ctx,
const char * text,
llama_token * tokens,
int n_max_tokens,
bool add_bos);
LLAMA_API int llama_n_vocab(struct llama_context * ctx);
LLAMA_API int llama_n_ctx (struct llama_context * ctx);
// Token logits obtained from the last call to llama_eval()
// The logits for the last token are stored in the last row
// Can be mutated in order to change the probabilities of the next token
// Rows: n_tokens
// Cols: n_vocab
LLAMA_API float * llama_get_logits(struct llama_context * ctx);
// Token Id -> String. Uses the vocabulary in the provided context
LLAMA_API const char * llama_token_to_str(struct llama_context * ctx, llama_token token);
// Special tokens
LLAMA_API llama_token llama_token_bos();
LLAMA_API llama_token llama_token_eos();
// TODO: improve the last_n_tokens interface ?
LLAMA_API llama_token llama_sample_top_p_top_k(
llama_context * ctx,
const llama_token * last_n_tokens_data,
int last_n_tokens_size,
int top_k,
double top_p,
double temp,
double repeat_penalty);
// Performance information
LLAMA_API void llama_print_timings(struct llama_context * ctx);
LLAMA_API void llama_reset_timings(struct llama_context * ctx);
// Print system information
LLAMA_API const char * llama_print_system_info(void);
#ifdef __cplusplus
}
#endif
#endif
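
Note on the file magic constants: LLAMA_FILE_MAGIC marks versioned files and LLAMA_FILE_MAGIC_UNVERSIONED marks the older format, and check_file_format() in this commit tells them apart by reading the first four bytes. A minimal sketch of that classification is shown below. The enum and function names are hypothetical; the two constants are copied from the header. One deliberate difference, labeled here as an assumption about what a stricter check might look like: the sketch reports failure for an unknown magic or a short read, whereas the patch treats every non-legacy magic as the new format.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>

// Hypothetical names; the numeric values mirror the 0/1/2 return codes
// documented for check_file_format().
enum FileFormat { FORMAT_FAIL = 0, FORMAT_LEGACY = 1, FORMAT_VERSIONED = 2 };

const uint32_t MAGIC_VERSIONED   = 0x67676d66; // same value as LLAMA_FILE_MAGIC
const uint32_t MAGIC_UNVERSIONED = 0x67676d6c; // same value as LLAMA_FILE_MAGIC_UNVERSIONED

FileFormat classify_magic(uint32_t magic) {
    if (magic == MAGIC_UNVERSIONED) return FORMAT_LEGACY;
    if (magic == MAGIC_VERSIONED)   return FORMAT_VERSIONED;
    return FORMAT_FAIL; // stricter than the patch: unknown magic is an error
}

FileFormat read_file_format(std::istream & in) {
    uint32_t magic = 0;
    in.read(reinterpret_cast<char *>(&magic), sizeof(magic));
    if (!in) {
        return FORMAT_FAIL; // fewer than 4 bytes available
    }
    return classify_magic(magic);
}
```

Taking a std::istream rather than a file name keeps the check easy to exercise without touching the file system.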

Binary file not shown.

1037
main.cpp

File diff suppressed because it is too large

BIN
main.exe

Binary file not shown.

BIN
models/ggml-vocab.bin Normal file

Binary file not shown.


@ -1,317 +1,17 @@
#include "ggml.h" #include "ggml.h"
#include "llama.h"
#include "utils.h"
#include <cassert>
#include <cinttypes>
#include <cmath>
#include <cstdio> #include <cstdio>
#include <cstring>
#include <fstream>
#include <map>
#include <string> #include <string>
#include <vector>
#include <regex>
// TODO: move somewhere else const int QK = 32;
#define QK 32
// default hparams (LLaMA 7B)
struct llama_hparams {
int32_t n_vocab = 32000;
int32_t n_ctx = 512; // this is provided as user input?
int32_t n_embd = 4096;
int32_t n_mult = 256;
int32_t n_head = 32;
int32_t n_layer = 32;
int32_t n_rot = 64;
int32_t f16 = 1;
};
// quantize a model
bool llama_model_quantize(const std::string & fname_inp, const std::string & fname_out, int itype) {
ggml_type type = GGML_TYPE_Q4_1;
switch (itype) {
case 2: type = GGML_TYPE_Q4_0; break;
case 3: type = GGML_TYPE_Q4_1; break;
default: fprintf(stderr, "%s: invalid quantization type %d\n", __func__, itype); return 1;
};
if (type != GGML_TYPE_Q4_0 && type != GGML_TYPE_Q4_1) {
fprintf(stderr, "%s: invalid quantization type %d\n", __func__, type);
return false;
}
gpt_vocab vocab;
printf("%s: loading model from '%s'\n", __func__, fname_inp.c_str());
auto finp = std::ifstream(fname_inp, std::ios::binary);
if (!finp) {
fprintf(stderr, "%s: failed to open '%s' for reading\n", __func__, fname_inp.c_str());
return false;
}
auto fout = std::ofstream(fname_out, std::ios::binary);
if (!fout) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname_out.c_str());
return false;
}
// verify magic
{
uint32_t magic;
finp.read((char *) &magic, sizeof(magic));
if (magic == FILE_MAGIC_UNVERSIONED) {
fprintf(stderr, "%s: invalid model file '%s' (too old, regenerate your model files!)\n",
__func__, fname_inp.c_str());
return false;
}
if (magic != FILE_MAGIC) {
fprintf(stderr, "%s: invalid model file '%s' (bad magic)\n", __func__, fname_inp.c_str());
return false;
}
fout.write((char *) &magic, sizeof(magic));
uint32_t format_version;
finp.read((char *) &format_version, sizeof(format_version));
if (format_version != FILE_VERSION) {
fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ", expected %d)\n",
__func__, fname_inp.c_str(), format_version, FILE_VERSION);
return false;
}
fout.write((char *) &format_version, sizeof(format_version));
}
llama_hparams hparams;
// load hparams
{
finp.read((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
//finp.read((char *) &hparams.n_ctx, sizeof(hparams.n_ctx));
finp.read((char *) &hparams.n_embd, sizeof(hparams.n_embd));
finp.read((char *) &hparams.n_mult, sizeof(hparams.n_mult));
finp.read((char *) &hparams.n_head, sizeof(hparams.n_head));
finp.read((char *) &hparams.n_layer, sizeof(hparams.n_layer));
finp.read((char *) &hparams.n_rot, sizeof(hparams.n_rot));
finp.read((char *) &hparams.f16, sizeof(hparams.f16));
printf("%s: n_vocab = %d\n", __func__, hparams.n_vocab);
printf("%s: n_ctx = %d\n", __func__, hparams.n_ctx);
printf("%s: n_embd = %d\n", __func__, hparams.n_embd);
printf("%s: n_mult = %d\n", __func__, hparams.n_mult);
printf("%s: n_head = %d\n", __func__, hparams.n_head);
printf("%s: n_layer = %d\n", __func__, hparams.n_layer);
printf("%s: f16 = %d\n", __func__, hparams.f16);
fout.write((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
//fout.write((char *) &hparams.n_ctx, sizeof(hparams.n_ctx));
fout.write((char *) &hparams.n_embd, sizeof(hparams.n_embd));
fout.write((char *) &hparams.n_mult, sizeof(hparams.n_mult));
fout.write((char *) &hparams.n_head, sizeof(hparams.n_head));
fout.write((char *) &hparams.n_layer, sizeof(hparams.n_layer));
fout.write((char *) &hparams.n_rot, sizeof(hparams.n_rot));
fout.write((char *) &itype, sizeof(hparams.f16));
}
// load vocab
{
const int32_t n_vocab = hparams.n_vocab;
if (n_vocab != hparams.n_vocab) {
fprintf(stderr, "%s: invalid model file '%s' (bad vocab size %d != %d)\n",
__func__, fname_inp.c_str(), n_vocab, hparams.n_vocab);
return false;
}
std::string word;
for (int i = 0; i < n_vocab; i++) {
uint32_t len;
finp.read ((char *) &len, sizeof(len));
fout.write((char *) &len, sizeof(len));
word.resize(len);
finp.read ((char *) word.data(), len);
fout.write((char *) word.data(), len);
float score;
finp.read ((char *) &score, sizeof(score));
fout.write((char *) &score, sizeof(score));
vocab.token_to_id[word] = i;
vocab.id_to_token[i] = word;
vocab.score[i] = score;
}
}
// load weights
{
size_t total_size_org = 0;
size_t total_size_new = 0;
std::vector<float> work;
std::vector<uint8_t> data_u8;
std::vector<ggml_fp16_t> data_f16;
std::vector<float> data_f32;
std::vector<int64_t> hist_all(1 << 4, 0);
while (true) {
int32_t n_dims;
int32_t length;
int32_t ftype;
finp.read(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
finp.read(reinterpret_cast<char *>(&length), sizeof(length));
finp.read(reinterpret_cast<char *>(&ftype), sizeof(ftype));
if (finp.eof()) {
break;
}
int32_t nelements = 1;
int32_t ne[2] = { 1, 1 };
for (int i = 0; i < n_dims; ++i) {
finp.read (reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
nelements *= ne[i];
}
std::string name(length, 0);
finp.read (&name[0], length);
{
static const char * ftype_str[] = { "f32", "f16", "q4_0", "q4_1", };
printf("%48s - [%5d, %5d], type = %6s ", name.data(), ne[0], ne[1], ftype_str[ftype]);
}
// regexes of tensor names to be quantized
const std::vector<std::string> k_names = {
".*weight",
};
bool quantize = false;
for (const auto & s : k_names) {
if (std::regex_match(name, std::regex(s))) {
quantize = true;
break;
}
}
// quantize only 2D tensors
quantize &= (n_dims == 2);
if (quantize) {
if (ftype != 0 && ftype != 1) {
fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", __func__, ftype);
return false;
}
if (ftype == 1) {
data_f16.resize(nelements);
finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
data_f32.resize(nelements);
for (int i = 0; i < nelements; ++i) {
data_f32[i] = ggml_fp16_to_fp32(data_f16[i]);
}
} else {
data_f32.resize(nelements);
finp.read(reinterpret_cast<char *>(data_f32.data()), nelements * sizeof(float));
}
ftype = itype;
} else {
const int bpe = (ftype == 0) ? sizeof(float) : sizeof(uint16_t);
data_u8.resize(nelements*bpe);
finp.read(reinterpret_cast<char *>(data_u8.data()), nelements * bpe);
}
fout.write(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
fout.write(reinterpret_cast<char *>(&length), sizeof(length));
fout.write(reinterpret_cast<char *>(&ftype), sizeof(ftype));
for (int i = 0; i < n_dims; ++i) {
fout.write(reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
}
fout.write(&name[0], length);
if (quantize) {
printf("quantizing .. ");
work.resize(nelements); // for quantization
size_t cur_size = 0;
std::vector<int64_t> hist_cur(1 << 4, 0);
switch (type) {
case GGML_TYPE_Q4_0:
{
cur_size = ggml_quantize_q4_0(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
} break;
case GGML_TYPE_Q4_1:
{
cur_size = ggml_quantize_q4_1(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
} break;
default:
{
fprintf(stderr, "%s: unsupported quantization type %d\n", __func__, type);
return false;
}
}
fout.write(reinterpret_cast<char *>(work.data()), cur_size);
total_size_new += cur_size;
printf("size = %8.2f MB -> %8.2f MB | hist: ", nelements * sizeof(float)/1024.0/1024.0, cur_size/1024.0/1024.0);
for (int i = 0; i < hist_cur.size(); ++i) {
hist_all[i] += hist_cur[i];
}
for (int i = 0; i < hist_cur.size(); ++i) {
printf("%5.3f ", hist_cur[i] / (float)nelements);
}
printf("\n");
} else {
printf("size = %8.3f MB\n", data_u8.size()/1024.0/1024.0);
fout.write(reinterpret_cast<char *>(data_u8.data()), data_u8.size());
total_size_new += data_u8.size();
}
total_size_org += nelements * sizeof(float);
}
printf("%s: model size = %8.2f MB\n", __func__, total_size_org/1024.0/1024.0);
printf("%s: quant size = %8.2f MB\n", __func__, total_size_new/1024.0/1024.0);
{
int64_t sum_all = 0;
for (int i = 0; i < hist_all.size(); ++i) {
sum_all += hist_all[i];
}
printf("%s: hist: ", __func__);
for (int i = 0; i < hist_all.size(); ++i) {
printf("%5.3f ", hist_all[i] / (float)sum_all);
}
printf("\n");
}
}
finp.close();
fout.close();
return true;
}
// usage: // usage:
// ./llama-quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type // ./llama-quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
// //
int main(int argc, char ** argv) { int main(int argc, char ** argv) {
ggml_time_init(); ggml_time_init();
if (argc != 4) { if (argc != 4) {
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]); fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]);
fprintf(stderr, " type = 2 - q4_0\n"); fprintf(stderr, " type = 2 - q4_0\n");
@ -339,7 +39,7 @@ int main(int argc, char ** argv) {
{ {
const int64_t t_start_us = ggml_time_us(); const int64_t t_start_us = ggml_time_us();
if (!llama_model_quantize(fname_inp, fname_out, itype)) { if (llama_model_quantize(fname_inp.c_str(), fname_out.c_str(), itype, QK)) {
fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str()); fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
return 1; return 1;
} }
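
Note on the header check in the quantizer code removed above: it reads a 32-bit magic and a 32-bit format version before any hyperparameters are copied. The following is a minimal sketch of that check against an in-memory stream, not the project's implementation. The constant values are taken from the `utils.h` lines removed in this same commit; `header_status`, `check_header` and `make_header` are hypothetical names introduced for illustration, and the explicit read-failure check is an addition the removed code does not have.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Values as defined in the utils.h lines removed by this commit.
constexpr uint32_t kFileMagicUnversioned = 0x67676d6c; // pre-versioned files
constexpr uint32_t kFileMagic            = 0x67676d66; // 'ggmf' in hex
constexpr uint32_t kFileVersion          = 1;

// Hypothetical result type, for illustration only.
enum class header_status { ok, truncated, too_old, bad_magic, bad_version };

// Hypothetical helper: validates magic and format version at the stream's current position.
header_status check_header(std::istream & in) {
    uint32_t magic = 0;
    if (!in.read(reinterpret_cast<char *>(&magic), sizeof(magic))) {
        return header_status::truncated;   // fewer than 4 bytes available
    }
    if (magic == kFileMagicUnversioned) {
        return header_status::too_old;     // model must be regenerated
    }
    if (magic != kFileMagic) {
        return header_status::bad_magic;
    }
    uint32_t version = 0;
    if (!in.read(reinterpret_cast<char *>(&version), sizeof(version))) {
        return header_status::truncated;
    }
    return version == kFileVersion ? header_status::ok : header_status::bad_version;
}

// Hypothetical helper: builds an 8-byte header in host byte order (as the file code does).
std::string make_header(uint32_t magic, uint32_t version) {
    std::string s;
    s.append(reinterpret_cast<const char *>(&magic), sizeof(magic));
    s.append(reinterpret_cast<const char *>(&version), sizeof(version));
    return s;
}
```

Returning a status instead of printing lets the caller decide how to report each failure; the removed code prints to stderr and returns false at each step.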

Binary file not shown.

4
tests/CMakeLists.txt Normal file

@ -0,0 +1,4 @@
set(TEST_TARGET test-tokenizer-0)
add_executable(${TEST_TARGET} ${TEST_TARGET}.cpp)
target_link_libraries(${TEST_TARGET} PRIVATE llama ggml utils)
add_test(NAME ${TEST_TARGET} COMMAND $<TARGET_FILE:${TEST_TARGET}> ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab.bin)


@ -0,0 +1,79 @@
#include "utils.h"
#include "llama.h"
#include <cstdio>
#include <string>
#include <map>
static const std::map<std::string, std::vector<llama_token>> k_tests = {
{ "Hello World", { 1, 10994, 2787, }, },
{ " Hello World", { 1, 15043, 2787, }, },
{ " Hello World!", { 1, 15043, 2787, 29991, }, },
{ " this is 🦙.cpp", { 1, 445, 338, 29871, 243, 162, 169, 156, 29889, 8223, }, },
{ "w048 7tuijk dsdfhu", { 1, 29893, 29900, 29946, 29947, 29871, 29955, 9161, 13535, 18031, 2176, 6905, }, },
{ "нещо на Български", { 1, 821, 4851, 665, 1386, 29713, 1305, }, },
};
int main(int argc, char **argv) {
if (argc < 2) {
fprintf(stderr, "Usage: %s <vocab-file>\n", argv[0]);
return 1;
}
const std::string fname = argv[1];
fprintf(stderr, "%s : reading vocab from: '%s'\n", __func__, fname.c_str());
llama_context * ctx;
// load the vocab
{
auto lparams = llama_context_default_params();
lparams.vocab_only = true;
ctx = llama_init_from_file(fname.c_str(), lparams);
if (ctx == NULL) {
fprintf(stderr, "%s: error: failed to load vocab '%s'\n", __func__, fname.c_str());
return 1;
}
}
const int n_vocab = llama_n_vocab(ctx);
if (n_vocab != 32000) {
fprintf(stderr, "%s : expected 32000 tokens, got %d\n", __func__, n_vocab);
return 2;
}
for (const auto & test_kv : k_tests) {
const auto res = ::llama_tokenize(ctx, test_kv.first, true);
bool correct = res.size() == test_kv.second.size();
for (int i = 0; i < (int) res.size() && correct; ++i) {
if (res[i] != test_kv.second[i]) {
correct = false;
}
}
if (!correct) {
fprintf(stderr, "%s : failed test: '%s'\n", __func__, test_kv.first.c_str());
fprintf(stderr, "%s : expected tokens: ", __func__);
for (const auto & t : test_kv.second) {
fprintf(stderr, "%6d, ", t);
}
fprintf(stderr, "\n");
fprintf(stderr, "%s : got tokens: ", __func__);
for (const auto & t : res) {
fprintf(stderr, "%6d, ", t);
}
fprintf(stderr, "\n");
return 3;
}
}
return 0;
}
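
Note on the expected tokens for " this is 🦙.cpp" in the test above: the ids 243, 162, 169, 156 are the four UTF-8 bytes of the emoji, each shifted by 3. That matches the byte-fallback rule in the tokenizer that this commit removes from `utils.cpp` (`static_cast<uint8_t>(symbol.text[j]) + 3`). Whether the replacement tokenizer inside the library uses the identical rule is not shown in this diff, so treat the sketch below as a consistency check only. `byte_fallback_ids` is a hypothetical helper name; `utf8_len` copies the lookup table from the removed code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Same lookup as utf8_len in the removed utils.cpp: the high nibble of the
// lead byte determines how many bytes the UTF-8 sequence occupies.
size_t utf8_len(char src) {
    const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
    const uint8_t highbits = static_cast<uint8_t>(src) >> 4;
    return lookup[highbits];
}

// Hypothetical helper: maps every byte of a symbol that has no vocabulary
// entry to (byte value + 3), as the removed tokenizer does.
std::vector<int32_t> byte_fallback_ids(const std::string & symbol) {
    std::vector<int32_t> ids;
    ids.reserve(symbol.size());
    for (char c : symbol) {
        ids.push_back(static_cast<uint8_t>(c) + 3);
    }
    return ids;
}
```

The emoji U+1F999 encodes as the bytes F0 9F A6 99, so `byte_fallback_ids("\xF0\x9F\xA6\x99")` yields 243, 162, 169, 156.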

555
utils.cpp

@ -3,16 +3,13 @@
#include <cassert> #include <cassert>
#include <cstring> #include <cstring>
#include <fstream> #include <fstream>
#include <regex>
#include <iostream>
#include <iterator>
#include <queue>
#include <string> #include <string>
#include <math.h> #include <iterator>
#include <algorithm>
#if defined(_MSC_VER) || defined(__MINGW32__) #if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW #include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__) && !defined(__NetBSD__) #elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
#include <alloca.h> #include <alloca.h>
#endif #endif
@ -72,8 +69,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
params.use_color = true; params.use_color = true;
} else if (arg == "-r" || arg == "--reverse-prompt") { } else if (arg == "-r" || arg == "--reverse-prompt") {
params.antiprompt.push_back(argv[++i]); params.antiprompt.push_back(argv[++i]);
} else if (arg == "--perplexity") {
params.perplexity = true;
} else if (arg == "--ignore-eos") { } else if (arg == "--ignore-eos") {
params.ignore_eos = true; params.ignore_eos = true;
} else if (arg == "--n_parts") {
params.n_parts = std::stoi(argv[++i]);
} else if (arg == "-h" || arg == "--help") { } else if (arg == "-h" || arg == "--help") {
gpt_print_usage(argc, argv, params); gpt_print_usage(argc, argv, params);
exit(0); exit(0);
@ -100,7 +101,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stderr, " in interactive mode, poll user input upon seeing PROMPT (can be\n"); fprintf(stderr, " in interactive mode, poll user input upon seeing PROMPT (can be\n");
fprintf(stderr, " specified more than once for multiple prompts).\n"); fprintf(stderr, " specified more than once for multiple prompts).\n");
fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n"); fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1)\n"); fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for <= 0)\n");
fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads); fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
fprintf(stderr, " -p PROMPT, --prompt PROMPT\n"); fprintf(stderr, " -p PROMPT, --prompt PROMPT\n");
fprintf(stderr, " prompt to start generation with (default: empty)\n"); fprintf(stderr, " prompt to start generation with (default: empty)\n");
@ -116,7 +117,9 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating\n"); fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating\n");
fprintf(stderr, " --memory_f16 use f16 instead of f32 for memory key+value\n"); fprintf(stderr, " --memory_f16 use f16 instead of f32 for memory key+value\n");
fprintf(stderr, " --temp N temperature (default: %.1f)\n", params.temp); fprintf(stderr, " --temp N temperature (default: %.1f)\n", params.temp);
fprintf(stderr, " --n_parts N number of model parts (default: -1 = determine from dimensions)\n");
fprintf(stderr, " -b N, --batch_size N batch size for prompt processing (default: %d)\n", params.n_batch); fprintf(stderr, " -b N, --batch_size N batch size for prompt processing (default: %d)\n", params.n_batch);
fprintf(stderr, " --perplexity compute perplexity over the prompt\n");
fprintf(stderr, " -m FNAME, --model FNAME\n"); fprintf(stderr, " -m FNAME, --model FNAME\n");
fprintf(stderr, " model path (default: %s)\n", params.model.c_str()); fprintf(stderr, " model path (default: %s)\n", params.model.c_str());
fprintf(stderr, "\n"); fprintf(stderr, "\n");
@ -141,535 +144,11 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
return "The"; return "The";
} }
void replace(std::string & str, const std::string & needle, const std::string & replacement) { // TODO: not great allocating this every time
size_t pos = 0; std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
while ((pos = str.find(needle, pos)) != std::string::npos) { std::vector<llama_token> res(8096);
str.replace(pos, needle.length(), replacement); int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
pos += replacement.length(); res.resize(n);
}
} return res;
std::map<std::string, int32_t> json_parse(const std::string & fname) {
std::map<std::string, int32_t> result;
// read file into string
std::string json;
{
std::ifstream ifs(fname);
if (!ifs) {
fprintf(stderr, "Failed to open %s\n", fname.c_str());
exit(1);
}
json = std::string((std::istreambuf_iterator<char>(ifs)),
(std::istreambuf_iterator<char>()));
}
if (json[0] != '{') {
return result;
}
// parse json
{
bool has_key = false;
bool in_token = false;
std::string str_key = "";
std::string str_val = "";
int n = json.size();
for (int i = 1; i < n; ++i) {
if (!in_token) {
if (json[i] == ' ') continue;
if (json[i] == '"') {
in_token = true;
continue;
}
} else {
if (json[i] == '\\' && i+1 < n) {
if (has_key == false) {
str_key += json[i];
} else {
str_val += json[i];
}
++i;
} else if (json[i] == '"') {
if (has_key == false) {
has_key = true;
++i;
while (json[i] == ' ') ++i;
++i; // :
while (json[i] == ' ') ++i;
if (json[i] != '\"') {
while (json[i] != ',' && json[i] != '}') {
str_val += json[i++];
}
has_key = false;
} else {
in_token = true;
continue;
}
} else {
has_key = false;
}
::replace(str_key, "\\u0120", " " ); // \u0120 -> space
::replace(str_key, "\\u010a", "\n"); // \u010a -> new line
::replace(str_key, "\\\"", "\""); // \\\" -> "
try {
result[str_key] = std::stoi(str_val);
} catch (...) {
//fprintf(stderr, "%s: ignoring key '%s' with value '%s'\n", fname.c_str(), str_key.c_str(), str_val.c_str());
}
str_key = "";
str_val = "";
in_token = false;
continue;
}
if (has_key == false) {
str_key += json[i];
} else {
str_val += json[i];
}
}
}
}
return result;
}
std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text) {
std::vector<std::string> words;
// first split the text into words
{
std::string str = text;
std::string pat = R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)";
std::regex re(pat);
std::smatch m;
while (std::regex_search(str, m, re)) {
for (auto x : m) {
words.push_back(x);
}
str = m.suffix();
}
}
// find the longest tokens that form the words:
std::vector<gpt_vocab::id> tokens;
for (const auto & word : words) {
if (word.size() == 0) continue;
int i = 0;
int n = word.size();
while (i < n) {
int j = n;
while (j > i) {
auto it = vocab.token_to_id.find(word.substr(i, j-i));
if (it != vocab.token_to_id.end()) {
tokens.push_back(it->second);
i = j;
break;
}
--j;
}
if (i == n) {
break;
}
if (j == i) {
auto sub = word.substr(i, 1);
if (vocab.token_to_id.find(sub) != vocab.token_to_id.end()) {
tokens.push_back(vocab.token_to_id.at(sub));
} else {
fprintf(stderr, "%s: unknown token '%s'\n", __func__, sub.data());
}
++i;
}
}
}
return tokens;
}
static size_t utf8_len(char src) {
const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
uint8_t highbits = static_cast<uint8_t>(src) >> 4;
return lookup[highbits];
}
struct llama_sp_symbol {
using index = int;
index prev;
index next;
std::string_view text;
};
struct llama_sp_bigram {
struct comparator {
bool operator()(llama_sp_bigram & l, llama_sp_bigram & r) {
return (l.score < r.score) || (l.score == r.score && l.left > r.left);
}
};
using queue_storage = std::vector<llama_sp_bigram>;
using queue = std::priority_queue<llama_sp_bigram, queue_storage, comparator>;
llama_sp_symbol::index left;
llama_sp_symbol::index right;
float score;
size_t size;
};
struct llama_tokenizer {
llama_tokenizer(const gpt_vocab & vocab): vocab_(vocab) {}
void tokenize(std::string_view text, std::vector<gpt_vocab::id> & output) {
// split string into utf8 chars
int index = 0;
while (!text.empty()) {
llama_sp_symbol sym;
size_t char_len = std::min(text.size(), utf8_len(text.data()[0]));
sym.text = std::string_view(text.data(), char_len);
sym.prev = index - 1;
text.remove_prefix(char_len);
sym.next = text.empty() ? -1 : index + 1;
index++;
symbols_.emplace_back(std::move(sym));
}
// seed the work queue with all possible 2-character tokens.
for (size_t i = 1; i < symbols_.size(); ++i) {
try_add_bigram(i - 1, i);
}
// keep substituting the highest frequency pairs for as long as we can.
while (!work_queue_.empty()) {
auto bigram = work_queue_.top();
work_queue_.pop();
auto & left_sym = symbols_[bigram.left];
auto & right_sym = symbols_[bigram.right];
// if one of the symbols already got merged, skip it.
if (left_sym.text.empty() || right_sym.text.empty() ||
left_sym.text.size() + right_sym.text.size() != bigram.size) {
continue;
}
// merge the right sym into the left one
left_sym.text = std::string_view(left_sym.text.data(), left_sym.text.size() + right_sym.text.size());
right_sym.text = std::string_view("");
// remove the right sym from the chain
left_sym.next = right_sym.next;
if (right_sym.next >= 0) {
symbols_[right_sym.next].prev = bigram.left;
}
// find more substitutions
try_add_bigram(left_sym.prev, bigram.left);
try_add_bigram(bigram.left, left_sym.next);
}
for (int i = 0; i != -1; i = symbols_[i].next) {
auto& symbol = symbols_[i];
auto token = vocab_.token_to_id.find(std::string(symbol.text));
if (token == vocab_.token_to_id.end()) {
// output any symbols that did not form tokens as bytes.
for (int j = 0; j < symbol.text.size(); ++j) {
gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
output.push_back(token_id);
}
} else {
output.push_back((*token).second);
}
}
}
private:
void try_add_bigram(int left, int right) {
if (left == -1 || right == -1) {
return;
}
std::string_view text(symbols_[left].text.data(), symbols_[left].text.size() + symbols_[right].text.size());
auto token = vocab_.token_to_id.find(std::string(text));
if (token == vocab_.token_to_id.end()) {
return;
}
auto score = vocab_.score.find((*token).second);
if (score == vocab_.score.end()) {
return;
}
llama_sp_bigram bigram;
bigram.left = left;
bigram.right = right;
bigram.score = (*score).second;
bigram.size = text.size();
work_queue_.push(bigram);
}
const gpt_vocab & vocab_;
std::vector<llama_sp_symbol> symbols_;
llama_sp_bigram::queue work_queue_;
};
std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_view text, bool bos) {
llama_tokenizer tokenizer(vocab);
std::vector<gpt_vocab::id> output;
if (text.size() == 0) {
return output;
}
if (bos) {
output.push_back(1);
}
tokenizer.tokenize(text, output);
return output;
}
bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab) {
printf("%s: loading vocab from '%s'\n", __func__, fname.c_str());
vocab.token_to_id = ::json_parse(fname);
for (const auto & kv : vocab.token_to_id) {
vocab.id_to_token[kv.second] = kv.first;
}
printf("%s: vocab size = %d\n", __func__, (int) vocab.token_to_id.size());
// print the vocabulary
//for (auto kv : vocab.token_to_id) {
// printf("'%s' -> %d\n", kv.first.data(), kv.second);
//}
return true;
}
void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k) {
// find the top K tokens
std::partial_sort(
logits_id.begin(),
logits_id.begin() + top_k, logits_id.end(),
[](const std::pair<double, gpt_vocab::id> & a, const std::pair<double, gpt_vocab::id> & b) {
return a.first > b.first;
});
logits_id.resize(top_k);
}
gpt_vocab::id llama_sample_top_p_top_k(
const gpt_vocab & vocab,
const float * logits,
std::vector<gpt_vocab::id> & last_n_tokens,
double repeat_penalty,
int top_k,
double top_p,
double temp,
std::mt19937 & rng) {
int n_logits = vocab.id_to_token.size();
std::vector<std::pair<double, gpt_vocab::id>> logits_id;
logits_id.reserve(n_logits);
{
const double scale = 1.0/temp;
for (int i = 0; i < n_logits; ++i) {
// repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)
// credit https://github.com/facebookresearch/llama/compare/main...shawwn:llama:main
if (std::find(last_n_tokens.begin(), last_n_tokens.end(), i) != last_n_tokens.end()) {
// if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
if (logits[i] < 0.0) {
logits_id.push_back(std::make_pair(logits[i]*scale*repeat_penalty, i));
} else {
logits_id.push_back(std::make_pair(logits[i]*scale/repeat_penalty, i));
}
} else {
logits_id.push_back(std::make_pair(logits[i]*scale, i));
}
}
}
sample_top_k(logits_id, top_k);
double maxl = -INFINITY;
for (const auto & kv : logits_id) {
maxl = std::max(maxl, kv.first);
}
// compute probs for the top K tokens
std::vector<double> probs;
probs.reserve(logits_id.size());
double sum = 0.0;
for (const auto & kv : logits_id) {
double p = exp(kv.first - maxl);
probs.push_back(p);
sum += p;
}
// normalize the probs
for (auto & p : probs) {
p /= sum;
}
if (top_p < 1.0f) {
double cumsum = 0.0f;
for (int i = 0; i < (int) probs.size(); i++) {
cumsum += probs[i];
if (cumsum >= top_p) {
probs.resize(i + 1);
logits_id.resize(i + 1);
break;
}
}
cumsum = 1.0/cumsum;
for (int i = 0; i < (int) probs.size(); i++) {
probs[i] *= cumsum;
}
}
//printf("\n");
//for (int i = 0; i < (int) 10; i++) {
// printf("%d: '%s' %f\n", i, vocab.id_to_token.at(logits_id[i].second).c_str(), probs[i]);
//}
//printf("\n\n");
//exit(0);
std::discrete_distribution<> dist(probs.begin(), probs.end());
int idx = dist(rng);
return logits_id[idx].second;
}
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
const size_t pp_size = qk / 2;
uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
{
for (int l = 0; l < qk; l++) {
const float v = src[j + i*qk + l];
amax = std::max(amax, fabsf(v));
}
const float d = amax / ((1 << 3) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *) pd = d;
pd += bs;
for (int l = 0; l < qk; l += 2) {
const float v0 = (src[j + i*qk + l + 0])*id;
const float v1 = (src[j + i*qk + l + 1])*id;
const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
hist[vi0]++;
hist[vi1]++;
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, pp_size);
pb += bs;
}
}
}
return (n/k)*row_size;
}
size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (2*sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
const size_t pp_size = qk / 2;
uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pm = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
//printf("n = %d, k = %d, nb = %d, row_size = %d, j = %d, pm = %p, pd = %p, pb = %p\n", n, k, nb, row_size, j, pm, pd, pb);
for (int i = 0; i < nb; i++) {
float min = std::numeric_limits<float>::max();
float max = std::numeric_limits<float>::min();
{
for (int l = 0; l < qk; l++) {
const float v = src[j + i*qk + l];
if (v < min) min = v;
if (v > max) max = v;
}
const float d = (max - min) / ((1 << 4) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *) pd = d;
*(float *) pm = min;
pd += bs;
pm += bs;
for (int l = 0; l < qk; l += 2) {
const float v0 = (src[j + i*qk + l + 0] - min)*id;
const float v1 = (src[j + i*qk + l + 1] - min)*id;
const uint8_t vi0 = round(v0);
const uint8_t vi1 = round(v1);
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
hist[vi0]++;
hist[vi1]++;
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, pp_size);
pb += bs;
}
}
}
return (n/k)*row_size;
} }
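
Note on the Q4_0 routine removed above: each block stores one float scale `d = amax / 7` followed by `qk/2` bytes, each byte packing two 4-bit values offset by 8 (low nibble first). The sketch below quantizes and dequantizes a single block to make that mapping concrete. It rests on assumptions: the block size of 32 is a guess (the diff passes `QK` without showing its value), the dequantization formula `(q - 8) * d` is inferred from the quantizer rather than shown in this diff, and `q4_0_block`, `quantize_block` and `dequantize_block` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32; // assumed block size; the actual value of QK is not shown in the diff

// Hypothetical per-block layout: scale first, then packed nibbles.
struct q4_0_block {
    float   d;               // scale: amax / 7
    uint8_t qs[kBlock / 2];  // two 4-bit values per byte, low nibble first
};

q4_0_block quantize_block(const float * x) {
    float amax = 0.0f; // absolute max over the block
    for (int i = 0; i < kBlock; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    q4_0_block b;
    b.d = amax / 7.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < kBlock; i += 2) {
        // round to [-7, 7], then shift into [1, 15] so the value fits an unsigned nibble
        const int q0 = static_cast<int>(std::round(x[i + 0] * id)) + 8;
        const int q1 = static_cast<int>(std::round(x[i + 1] * id)) + 8;
        b.qs[i / 2] = static_cast<uint8_t>(q0 | (q1 << 4));
    }
    return b;
}

// Assumed inverse of the mapping above.
std::vector<float> dequantize_block(const q4_0_block & b) {
    std::vector<float> out(kBlock);
    for (int i = 0; i < kBlock; i += 2) {
        const int q0 = b.qs[i / 2] & 0x0F;
        const int q1 = b.qs[i / 2] >> 4;
        out[i + 0] = (q0 - 8) * b.d;
        out[i + 1] = (q1 - 8) * b.d;
    }
    return out;
}
```

Under these assumptions the round-trip error per element is at most half a quantization step, `d / 2`, and an all-zero block round-trips exactly because `d` is 0 and every nibble is 8.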

102
utils.h

@ -2,8 +2,9 @@
#pragma once #pragma once
#include "llama.h"
#include <string> #include <string>
#include <map>
#include <vector> #include <vector>
#include <random> #include <random>
#include <thread> #include <thread>
@ -13,33 +14,34 @@
// //
struct gpt_params { struct gpt_params {
int32_t seed = -1; // RNG seed int32_t seed = -1; // RNG seed
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency()); int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_predict = 128; // new tokens to predict int32_t n_predict = 128; // new tokens to predict
int32_t repeat_last_n = 64; // last n tokens to penalize int32_t repeat_last_n = 64; // last n tokens to penalize
int32_t n_ctx = 512; //context size int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
bool memory_f16 = false; // use f16 instead of f32 for memory kv int32_t n_ctx = 512; //context size
// sampling parameters // sampling parameters
int32_t top_k = 40; int32_t top_k = 40;
float top_p = 0.95f; float top_p = 0.95f;
float temp = 0.80f; float temp = 0.80f;
float repeat_penalty = 1.30f; float repeat_penalty = 1.10f;
int32_t n_batch = 8; // batch size for prompt processing int32_t n_batch = 8; // batch size for prompt processing
std::string model = "models/lamma-7B/ggml-model.bin"; // model path std::string model = "models/lamma-7B/ggml-model.bin"; // model path
std::string prompt = ""; std::string prompt = "";
bool random_prompt = false;
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool interactive_start = false; // reverse prompt immediately
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
bool instruct = false; // instruction mode (used for Alpaca models)
bool ignore_eos = false; // do not stop generating after eos bool memory_f16 = false; // use f16 instead of f32 for memory kv
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool interactive_start = false; // reverse prompt immediately
bool instruct = false; // instruction mode (used for Alpaca models)
bool ignore_eos = false; // do not stop generating after eos
bool perplexity = false; // compute perplexity over the prompt
}; };
bool gpt_params_parse(int argc, char ** argv, gpt_params & params); bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
@ -48,72 +50,8 @@ void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
std::string gpt_random_prompt(std::mt19937 & rng); std::string gpt_random_prompt(std::mt19937 & rng);
//
// Model file parsing
//
#define FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
#define FILE_MAGIC 0x67676d66 // 'ggmf' in hex
#define FILE_VERSION 1
// //
// Vocab utils // Vocab utils
// //
struct gpt_vocab { std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
using id = int32_t;
using token = std::string;
std::map<token, id> token_to_id;
std::map<id, token> id_to_token;
std::map<id, float> score;
};
void replace(std::string & str, const std::string & needle, const std::string & replacement);
// poor-man's JSON parsing
std::map<std::string, int32_t> json_parse(const std::string & fname);
// split text into tokens
//
// ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
//
// Regex (Python):
// r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
//
// Regex (C++):
// R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)"
//
std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text);
// TODO: this is probably wrong, but I cannot figure out how this tokenizer works ..
// ref: https://github.com/google/sentencepiece
std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_view text, bool bos);
// load the tokens from encoder.json
bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab);
// sample next token given probabilities for each embedding
//
// - consider only the top K tokens
// - from them, consider only the top tokens with cumulative probability > P
//
gpt_vocab::id llama_sample_top_p_top_k(
const gpt_vocab & vocab,
const float * logits,
std::vector<gpt_vocab::id> & last_n_tokens,
double repeat_penalty,
int top_k,
double top_p,
double temp,
std::mt19937 & rng);
// filer to top K tokens from list of logits
void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k);
//
// Quantization
//
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist);
size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist);
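
Note on the sampling declaration removed above: its comment describes keeping the top K tokens and then, among those, the smallest prefix whose cumulative probability reaches P. The sketch below illustrates just that truncation step. It is not the removed `llama_sample_top_p_top_k`: it leaves out temperature and the repetition penalty, it clamps `top_k` to the number of logits (the removed code does not), and `top_k_top_p` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper: returns (token id, probability) pairs that survive
// top-k followed by top-p truncation, renormalized to sum to 1.
std::vector<std::pair<int, double>> top_k_top_p(const std::vector<float> & logits, int top_k, double top_p) {
    std::vector<std::pair<double, int>> cand;
    cand.reserve(logits.size());
    for (int i = 0; i < (int) logits.size(); ++i) {
        cand.emplace_back(logits[i], i);
    }

    // keep the k largest logits, sorted in descending order
    const int k = std::max(0, std::min(top_k, (int) cand.size()));
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
        [](const std::pair<double, int> & a, const std::pair<double, int> & b) {
            return a.first > b.first;
        });
    cand.resize(k);

    // softmax over the survivors, subtracting the max for numerical stability
    std::vector<std::pair<int, double>> out;
    out.reserve(cand.size());
    const double maxl = cand.empty() ? 0.0 : cand.front().first;
    double sum = 0.0;
    for (const auto & c : cand) {
        const double p = std::exp(c.first - maxl);
        out.emplace_back(c.second, p);
        sum += p;
    }
    for (auto & o : out) {
        o.second /= sum;
    }

    // cut at the first position where the cumulative probability reaches top_p
    double cumsum = 0.0;
    for (size_t i = 0; i < out.size(); ++i) {
        cumsum += out[i].second;
        if (cumsum >= top_p) {
            out.resize(i + 1);
            break;
        }
    }
    for (auto & o : out) {
        o.second /= cumsum; // renormalize the kept prefix
    }
    return out;
}
```

The probabilities returned here could be passed to `std::discrete_distribution` to draw a token, which is how the removed implementation picks its result.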