Merge branch 'master' into concedo

# Conflicts:
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	main.cpp
2023-03-22 22:31:45 +08:00
commit 86c7457e24
parent 5c475503ce ae44e23ee3
25 changed files with 3028 additions and 1944 deletions
--- a/.github/ISSUE_TEMPLATE/custom.md
+++ b/.github/ISSUE_TEMPLATE/custom.md
@ -0,0 +1,198 @@
 ---
 name: Custom issue template
 about: Used to report user-related issues with the software
 title: "[User] I encountered a problem .."
 labels: ''
 assignees: ''
 ---
 # Prerequisites
 Please answer the following questions for yourself before submitting an issue.
 - [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
 - [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
 - [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
 - [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
 # Expected Behavior
 Please provide a detailed written description of what you were trying to do, and what you expected `llama.cpp` to do.
 # Current Behavior
 Please provide a detailed written description of what `llama.cpp` did instead.
 # Environment and Context 
 Please provide detailed information about your computer setup. This is important in case the issue is reproducible only under specific conditions.
 * Physical (or virtual) hardware you are using, e.g. for Linux:
 `$ lscpu`
 * Operating System, e.g. for Linux:
 `$ uname -a`
 * SDK version, e.g. for Linux:
 ```
 $ python3 --version
 $ make --version
 $ g++ --version
 ```
 # Models
 * The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [pull request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
 * If your issue is with model conversion please verify the `sha256sum` of each of your `consolidated*.pth` and `ggml-model-XXX.bin` files to confirm that you have the correct model data files before logging an issue. [Latest sha256 sums for your reference](https://github.com/ggerganov/llama.cpp/issues/238).
 * If your issue is with model generation quality then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
  * LLaMA:
    * [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    * [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
  * GPT-3
    * [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
  * GPT-3.5 / InstructGPT / ChatGPT:
    * [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    * [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
 # Failure Information (for bugs)
 Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
 # Steps to Reproduce
 Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
 1. step 1
 2. step 2
 3. step 3
 4. etc.
 # Failure Logs
 Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
 Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [GitHub's Markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability. For example:
 ```
 llama.cpp$ git log | head -1
 commit 2af23d30434a677c6416812eea52ccc0af65119c
 llama.cpp$ lscpu | egrep "AMD|Flags"
 Vendor ID:                       AuthenticAMD
 Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
 Virtualization:                  AMD-V
 llama.cpp$ python3 --version
 Python 3.10.9
 llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
 numpy                         1.24.2
 numpydoc                      1.5.0
 sentencepiece                 0.1.97
 torch                         1.13.1
 torchvision                   0.14.1
 llama.cpp$ make --version | head -1
 GNU Make 4.3
 $ md5sum ./models/65B/ggml-model-q4_0.bin
 dbdd682cce80e2d6e93cefc7449df487  ./models/65B/ggml-model-q4_0.bin
 ```
 Here's a run with the Linux command [perf](https://www.brendangregg.com/perf.html)
 ```
 llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
 main: seed = 1679149377
 llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
 llama_model_load: n_vocab = 32000
 llama_model_load: n_ctx   = 512
 llama_model_load: n_embd  = 8192
 llama_model_load: n_mult  = 256
 llama_model_load: n_head  = 64
 llama_model_load: n_layer = 80
 llama_model_load: n_rot   = 128
 llama_model_load: f16     = 2
 llama_model_load: n_ff    = 22016
 llama_model_load: n_parts = 8
 llama_model_load: ggml ctx size = 41477.73 MB
 llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
 llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
 main: prompt: 'Please close your issue when it has been answered.'
 main: number of tokens in prompt = 11
     1 -> ''
 12148 -> 'Please'
  3802 -> ' close'
   596 -> ' your'
  2228 -> ' issue'
   746 -> ' when'
   372 -> ' it'
   756 -> ' has'
  1063 -> ' been'
  7699 -> ' answered'
 29889 -> '.'
 sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
 Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
 I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
 main: mem per token = 71159620 bytes
 main:     load time = 19309.95 ms
 main:   sample time =   168.62 ms
 main:  predict time = 223895.61 ms / 888.47 ms per token
 main:    total time = 246406.42 ms
 Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
        3636882.89 msec task-clock                #   14.677 CPUs utilized          
             13509      context-switches          #    3.714 /sec                   
              2436      cpu-migrations            #    0.670 /sec                   
          10476679      page-faults               #    2.881 K/sec                  
    13133115082869      cycles                    #    3.611 GHz                      (16.77%)
       29314462753      stalled-cycles-frontend   #    0.22% frontend cycles idle     (16.76%)
    10294402631459      stalled-cycles-backend    #   78.39% backend cycles idle      (16.74%)
    23479217109614      instructions              #    1.79  insn per cycle         
                                                  #    0.44  stalled cycles per insn  (16.76%)
     2353072268027      branches                  #  647.002 M/sec                    (16.77%)
        1998682780      branch-misses             #    0.08% of all branches          (16.76%)
     247.802177522 seconds time elapsed
    3618.573072000 seconds user
      18.491698000 seconds sys
 ```
--- a/Makefile
+++ b/Makefile
@ -17,7 +17,7 @@ CXXV := $(shell $(CXX) --version | head -n 1)
 # ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
 ifeq ($(UNAME_S),Darwin)
 	ifneq ($(UNAME_P),arm)
-		SYSCTL_M := $(shell sysctl -n hw.optional.arm64)
+		SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
 		ifeq ($(SYSCTL_M),1)
 			# UNAME_P := arm
 			# UNAME_M := arm64
@ -30,8 +30,9 @@ endif
 # Compile flags
 #
 # keep standard at C11 and C++11
 CFLAGS   = -I.              -O3 -DNDEBUG -std=c11   -fPIC
-CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++17 -fPIC
+CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
 LDFLAGS  =
 # OS specific
@ -52,6 +53,10 @@ ifeq ($(UNAME_S),NetBSD)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
 endif
 ifeq ($(UNAME_S),OpenBSD)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
 endif
 ifeq ($(UNAME_S),Haiku)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
@ -95,30 +100,59 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
 		ifneq (,$(findstring sse3,$(SSE3_M)))
 			CFLAGS += -msse3
 		endif
 		AVX512F_M := $(shell grep "avx512f " /proc/cpuinfo)
 		ifneq (,$(findstring avx512f,$(AVX512F_M)))
 			CFLAGS += -mavx512f
 		endif
 		AVX512BW_M := $(shell grep "avx512bw " /proc/cpuinfo)
 		ifneq (,$(findstring avx512bw,$(AVX512BW_M)))
 			CFLAGS += -mavx512bw
 		endif
 		AVX512DQ_M := $(shell grep "avx512dq " /proc/cpuinfo)
 		ifneq (,$(findstring avx512dq,$(AVX512DQ_M)))
 			CFLAGS += -mavx512dq
 		endif
 		AVX512VL_M := $(shell grep "avx512vl " /proc/cpuinfo)
 		ifneq (,$(findstring avx512vl,$(AVX512VL_M)))
 			CFLAGS += -mavx512vl
 		endif
 		AVX512CD_M := $(shell grep "avx512cd " /proc/cpuinfo)
 		ifneq (,$(findstring avx512cd,$(AVX512CD_M)))
 			CFLAGS += -mavx512cd
 		endif
 		AVX512ER_M := $(shell grep "avx512er " /proc/cpuinfo)
 		ifneq (,$(findstring avx512er,$(AVX512ER_M)))
 			CFLAGS += -mavx512er
 		endif
 		AVX512IFMA_M := $(shell grep "avx512ifma " /proc/cpuinfo)
 		ifneq (,$(findstring avx512ifma,$(AVX512IFMA_M)))
 			CFLAGS += -mavx512ifma
 		endif
 		AVX512PF_M := $(shell grep "avx512pf " /proc/cpuinfo)
 		ifneq (,$(findstring avx512pf,$(AVX512PF_M)))
 			CFLAGS += -mavx512pf
 		endif
 	else ifeq ($(UNAME_S),Haiku)
-		AVX1_M := $(shell sysinfo -cpu | grep "AVX ")
+		AVX1_M := $(shell sysinfo -cpu | grep -w "AVX")
-		ifneq (,$(findstring avx,$(AVX1_M)))
+		ifneq (,$(findstring AVX,$(AVX1_M)))
 			CFLAGS += -mavx
 		endif
-		AVX2_M := $(shell sysinfo -cpu | grep "AVX2 ")
+		AVX2_M := $(shell sysinfo -cpu | grep -w "AVX2")
-		ifneq (,$(findstring avx2,$(AVX2_M)))
+		ifneq (,$(findstring AVX2,$(AVX2_M)))
 			CFLAGS += -mavx2
 		endif
-		FMA_M := $(shell sysinfo -cpu | grep "FMA ")
+		FMA_M := $(shell sysinfo -cpu | grep -w "FMA")
-		ifneq (,$(findstring fma,$(FMA_M)))
+		ifneq (,$(findstring FMA,$(FMA_M)))
 			CFLAGS += -mfma
 		endif
-		F16C_M := $(shell sysinfo -cpu | grep "F16C ")
+		F16C_M := $(shell sysinfo -cpu | grep -w "F16C")
-		ifneq (,$(findstring f16c,$(F16C_M)))
+		ifneq (,$(findstring F16C,$(F16C_M)))
 			CFLAGS += -mf16c
 		endif
 	else
 		CFLAGS += -mfma -mf16c -mavx -mavx2
 	endif
 endif
 ifeq ($(UNAME_M),amd64)
 	CFLAGS += -mavx -mavx2 -mfma -mf16c
 endif
 ifneq ($(filter ppc64%,$(UNAME_M)),)
 	POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
 	ifneq (,$(findstring POWER9,$(POWER9_M)))
@ -130,7 +164,8 @@ ifneq ($(filter ppc64%,$(UNAME_M)),)
 	endif
 endif
 ifndef LLAMA_NO_ACCELERATE
-	# Mac M1 - include Accelerate framework
+	# Mac M1 - include Accelerate framework.
 	# `-framework Accelerate` works on Mac Intel as well, with negligible performance boost (in terms of predict time).
 	ifeq ($(UNAME_S),Darwin)
 		CFLAGS  += -DGGML_USE_ACCELERATE
 		LDFLAGS += -framework Accelerate
@ -185,6 +220,9 @@ default: main llamalib quantize
 ggml.o: ggml.c ggml.h
 	$(CC)  $(CFLAGS)   -c ggml.c -o ggml.o
 llama.o: llama.cpp llama.h
 	$(CXX) $(CXXFLAGS) -c llama.cpp -o llama.o
 utils.o: utils.cpp utils.h
 	$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
@ -194,15 +232,16 @@ extra.o: extra.cpp extra.h
 clean:
 	rm -f *.o main quantize
-main: main.cpp ggml.o utils.o extra.o
+main: main.cpp ggml.o extra.o utils.o
-	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o extra.o -o main $(LDFLAGS)
+	$(CXX) $(CXXFLAGS) main.cpp ggml.o extra.o utils.o -o main $(LDFLAGS)
-	./main -h
+	@echo "\x1b[36mrun ./main -h for help\x1b[0m"
-	
+
 llamalib: expose.cpp ggml.o utils.o extra.o
 	$(CXX) $(CXXFLAGS) expose.cpp ggml.o utils.o extra.o -shared -o llamacpp.dll $(LDFLAGS)
-quantize: quantize.cpp ggml.o utils.o
+
-	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
+quantize: quantize.cpp ggml.o llama.o utils.o
 	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o llama.o utils.o -o quantize $(LDFLAGS)
 #
 # Tests
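Review note on the Haiku hunk above: GNU make's `findstring` is case-sensitive, so the old lowercase patterns (`avx`, `fma`, ...) could never match output that was grepped for uppercase names, and the new `grep -w` matches whole words instead of relying on a trailing space. The idea of whole-word feature matching can be sketched in Python as follows. This is only an illustration under stated assumptions: the `FEATURE_FLAGS` table and the `cflags_for` helper are hypothetical and are not part of the Makefile, and the sketch tokenizes the text, which is stricter than what the Makefile does.

```python
# Hypothetical table mirroring a few of the Makefile's feature checks.
FEATURE_FLAGS = {
    "avx": "-mavx",
    "avx2": "-mavx2",
    "fma": "-mfma",
    "f16c": "-mf16c",
}


def cflags_for(cpu_info_text):
    """Return compiler flags for features that appear as whole words.

    Matching is case-insensitive and token-based, so "avx2" alone does
    not enable "-mavx", and "AVX" is treated the same as "avx".
    """
    tokens = {tok.lower() for tok in cpu_info_text.split()}
    return [flag for feature, flag in FEATURE_FLAGS.items() if feature in tokens]


if __name__ == "__main__":
    # Sample inputs are made up for illustration.
    print(cflags_for("fpu sse avx2 fma"))
    print(cflags_for("AVX AVX2 F16C"))
```

Normalizing case before comparing avoids the class of mismatch this hunk fixes, at the cost of assuming feature names are whitespace-separated.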
--- a/53
+++ b/53
@ -0,0 +1,53 @@
 700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d  models/7B/consolidated.00.pth
 abe4aec2cdc297e2916011f66c7efd6fb4424e0e84315503005b5c118358cc22  models/7B/ggml-model-f16.bin
 f495fa02a0b5ef265e1864d9680eede7fd23a60b0a2f93edba8091e2a4ca68b9  models/7B/ggml-model-q4_0.bin
 7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265  models/7B/params.json
 745bf4e29a4dd6f411e72976d92b452da1b49168a4f41c951cfcc8051823cf08  models/13B/consolidated.00.pth
 d5ccbcc465c71c0de439a5aeffebe8344c68a519bce70bc7f9f92654ee567085  models/13B/consolidated.01.pth
 a6bd0537c6873f36c47292df0b6f794e1135f5aafb89c3343bcc9e93264bf167  models/13B/ggml-model-f16.bin
 0fb0951b90f2ec46c1f2f2372af5dacb4614b27e9fb6c10c69fbec58d7dd0e36  models/13B/ggml-model-f16.bin.1
 1c218ba37ae61e15e35efd9949c78d6edf553b6280824c263cad56ae0b9d5a8f  models/13B/ggml-model-q4_0.bin
 c37a20c2ab9fa74b006b389085660269ee06110d1e45a494eb57d4602c9bcdb2  models/13B/ggml-model-q4_0.bin.1
 4ab77bec4d4405ccb66a97b282574c89a94417e3c32e5f68f37e2876fc21322f  models/13B/params.json
 e23294a58552d8cdec5b7e8abb87993b97ea6eced4178ff2697c02472539d067  models/30B/consolidated.00.pth
 4e077b7136c7ae2302e954860cf64930458d3076fcde9443f4d0e939e95903ff  models/30B/consolidated.01.pth
 24a87f01028cbd3a12de551dcedb712346c0b5cbdeff1454e0ddf2df9b675378  models/30B/consolidated.02.pth
 1adfcef71420886119544949767f6a56cb6339b4d5fcde755d80fe68b49de93b  models/30B/consolidated.03.pth
 def20ea508f4e36793719f857471e85b85f96e497a2cbffbbaa1b60e2b18202c  models/30B/ggml-model-f16.bin
 b37040aa67fa8608cb2d8e0719132cf3e267fd35ec1e2f0d37dbc9fa43d674f1  models/30B/ggml-model-f16.bin.1
 e7f263557e99069fe29003262ea5fa9ed885dbe79069083e6eb569b328cf30d3  models/30B/ggml-model-f16.bin.2
 2ad6a23af05eb720f202f63d130f4fc5de9b6d2efc95b921be003209a56695aa  models/30B/ggml-model-f16.bin.3
 7de31d005e6d02ebd9603b2cf5329ad2f832b65d08873a098c5cafc4046cb9ed  models/30B/ggml-model-q4_0.bin
 f91feef9f30f9a023616db2e91297ca6d5d5d7b9eb351e452a82115c46f7da9e  models/30B/ggml-model-q4_0.bin.1
 66f3a0916ac7a81839153eb061fa861030ed1892477c2f7af2ce4f98d2f6d06f  models/30B/ggml-model-q4_0.bin.2
 e3c587ba97f83d2088b001bcda3026571065649ee3090bef6743a51390b01d3b  models/30B/ggml-model-q4_0.bin.3
 2c07118ea98d69dbe7810d88520e30288fa994751b337f8fca02b171955f44cb  models/30B/params.json
 135c563f6b3938114458183afb01adc9a63bef3d8ff7cccc3977e5d3664ecafe  models/65B/consolidated.00.pth
 9a600b37b19d38c7e43809485f70d17d1dc12206c07efa83bc72bb498a568bde  models/65B/consolidated.01.pth
 e7babf7c5606f165a3756f527cb0fedc4f83e67ef1290391e52fb1cce5f26770  models/65B/consolidated.02.pth
 73176ffb426b40482f2aa67ae1217ef79fbbd1fff5482bae5060cdc5a24ab70e  models/65B/consolidated.03.pth
 882e6431d0b08a8bc66261a0d3607da21cbaeafa96a24e7e59777632dbdac225  models/65B/consolidated.04.pth
 a287c0dfe49081626567c7fe87f74cce5831f58e459b427b5e05567641f47b78  models/65B/consolidated.05.pth
 72b4eba67a1a3b18cb67a85b70f8f1640caae9b40033ea943fb166bd80a7b36b  models/65B/consolidated.06.pth
 d27f5b0677d7ff129ceacd73fd461c4d06910ad7787cf217b249948c3f3bc638  models/65B/consolidated.07.pth
 7eba2625260cd91f8de901fd9704a1aa39448425514a335a0d3878de4ab9dc77  models/65B/ggml-model-f16.bin
 f6aa886575df0785d4231f30cc776d499ccde18857818effc0378c65b178e0b5  models/65B/ggml-model-f16.bin.1
 076037141682f5d7537955058c4740ab27f285aa4588915f830874a589c0693d  models/65B/ggml-model-f16.bin.2
 7853d96d2903ad7de2b2a89c4acf5a33a2f8e3c24ac39c9df6b44cdb42bf530a  models/65B/ggml-model-f16.bin.3
 b16b7b941abb3bc03a14df1656140855e9360a5371c83e919b9da83a72362314  models/65B/ggml-model-f16.bin.4
 5291270216f888697695acb78ef28df0c080f9e85d3245c92fb9992d1fde6678  models/65B/ggml-model-f16.bin.5
 0685ee77715f34686841006f8f94d3e7eaf148b97cecc9d3eee72808b0f7989c  models/65B/ggml-model-f16.bin.6
 00d993d73bb21d7c29388ffe0dced008cbaa0d391831dea77d7eb8f0b5c404b9  models/65B/ggml-model-f16.bin.7
 4e398f05842206e08cdc5e7bb4f6c7c34b9dc373435ece6f261b14b7b4fe9b89  models/65B/ggml-model-q4_0.bin
 4c4e899e3b12d9f57c9dcea5a1fb41bbc72023323535551f6273582ca7d7294b  models/65B/ggml-model-q4_0.bin.1
 d7b4594bbbd192043b3db0e5acc2561c42e6944e1cb91cc6e61510eee89dbcd8  models/65B/ggml-model-q4_0.bin.2
 9a099d271648863d923d0d097391ea0bc75591f27a2ca3a327760f42e6b69af2  models/65B/ggml-model-q4_0.bin.3
 5ee474051e418c5732b7949190b084d9d679db447f83c1de0d2a82daaa1a0cfa  models/65B/ggml-model-q4_0.bin.4
 a45aa05e7212bd6782790722d68056c5419667ea6b564ccc94bbcb8111d79b8b  models/65B/ggml-model-q4_0.bin.5
 a58fda714b759c28ad5e4c1d8bf8fda7b158fd5e4c4a49f851f36342fa97a105  models/65B/ggml-model-q4_0.bin.6
 a3540cfcbcda33c223c6b0d606034adbd78f17e0e5de1582b78795e78754f7a8  models/65B/ggml-model-q4_0.bin.7
 999ed1659b469ccc2a941714c0a9656fa571d17c9f7c8c7589817ca90edef51b  models/65B/params.json
 1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13  models/alpaca-7B/ggml-model-q4_0.bin
 e17730c6b62b565b098af023ca446dcb9e3535d4222ead6369c7aae67207eb3d  models/alpaca-13B/ggml-model-q4_0.bin
 9bcd1bb30e679c939f367be11b030fe20b3eb9a3606b9bc4106420f1827b6ae4  models/alpaca-30B/ggml-model-q4_0.bin
 36079249f53c292a4c2302d7784005dcae94c865f0bedfdbfa51d9ddad402935  models/alpaca-30B/params.json
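Review note: the list above uses the `<digest>  <path>` layout that `sha256sum` produces, so it can be checked without that tool as well. A minimal Python sketch, assuming the list has been read into a sequence of lines and that paths are relative to some root directory; the function names (`sha256_of`, `parse_checksum_lines`, `verify`) are illustrative and do not exist in the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines):
    """Parse '<hex digest>  <relative path>' lines into {path: digest}."""
    expected = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        digest, path = line.split(maxsplit=1)
        expected[path] = digest.lower()
    return expected


def verify(expected, root="."):
    """Return {path: status}, where status is 'ok', 'mismatch' or 'missing'."""
    results = {}
    for rel_path, digest in expected.items():
        full = Path(root) / rel_path
        if not full.is_file():
            results[rel_path] = "missing"
        elif sha256_of(full) == digest:
            results[rel_path] = "ok"
        else:
            results[rel_path] = "mismatch"
    return results
```

Reading in chunks matters here because the model files are several gigabytes each; reporting `missing` separately from `mismatch` keeps the result usable when only some model sizes are present locally.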
--- a/alpaca.sh
+++ b/alpaca.sh
@ -3,4 +3,4 @@
 # Temporary script - will be removed in the future
 #
-./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.96 --repeat_penalty 1 -t 7
+./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7
--- a/chat.sh
+++ b/chat.sh
@ -0,0 +1,6 @@
 #!/bin/bash
 #
 # Temporary script - will be removed in the future
 #
 ./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
--- a/convert-gptq-to-ggml.py
+++ b/convert-gptq-to-ggml.py
@ -0,0 +1,172 @@
 # Convert a GPTQ quantized LLaMA model to a ggml compatible file
 # Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
 #
 import os
 import re
 import sys
 import json
 import struct
 import numpy as np
 import torch
 from sentencepiece import SentencePieceProcessor
 if len(sys.argv) != 4:
    print("Usage: convert-gptq-to-ggml.py llamaXXb-4bit.pt tokenizer.model out.bin\n")
    sys.exit(1)
 fname_model = sys.argv[1]
 fname_tokenizer = sys.argv[2]
 dir_out = sys.argv[3]
 model = torch.load(fname_model, map_location="cpu")
 n_vocab, n_embd = model['model.embed_tokens.weight'].shape
 n_layer = 1 + max(int(m.group(1)) for name in model
                  if (m := re.match(r'model\.layers\.([0-9]+)', name)))
 # hardcoded:
 n_mult = 256
 n_head = {32: 32, 40: 40, 60: 52, 80: 64}[n_layer]
 tokenizer = SentencePieceProcessor(fname_tokenizer)
 assert tokenizer.vocab_size() == n_vocab
 fname_out = sys.argv[3]
 fout = open(fname_out, "wb")
 fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
 fout.write(struct.pack("i", n_vocab))
 fout.write(struct.pack("i", n_embd))
 fout.write(struct.pack("i", n_mult))
 fout.write(struct.pack("i", n_head))
 fout.write(struct.pack("i", n_layer))
 fout.write(struct.pack("i", n_embd // n_head)) # rot (obsolete)
 fout.write(struct.pack("i", 4))
 # This loop unchanged from convert-pth-to-ggml.py:
 for i in range(tokenizer.vocab_size()):
    if tokenizer.is_unknown(i):
        # "<unk>" token (translated as ??)
        text = " \u2047 ".encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
    elif tokenizer.is_control(i):
        # "<s>"/"</s>" tokens
        fout.write(struct.pack("i", 0))
    elif tokenizer.is_byte(i):
        # "<U+XX>" tokens (which may be invalid UTF-8)
        piece = tokenizer.id_to_piece(i)
        if len(piece) != 6:
            print("Invalid token: " + piece)
            sys.exit(1)
        byte_value = int(piece[3:-1], 16)
        fout.write(struct.pack("i", 1))
        fout.write(struct.pack("B", byte_value))
    else:
        # normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
        text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
 def write_header(shape, dst_name, ftype_cur):
    sname = dst_name.encode('utf-8')
    fout.write(struct.pack("iii", len(shape), len(sname), ftype_cur))
    fout.write(struct.pack("i" * len(shape), *shape[::-1]))
    fout.write(sname)
 def convert_non_q4(src_name, dst_name):
    v = model[src_name]
    shape = v.shape
    print("Processing non-Q4 variable: " + src_name + " with shape: ", shape, " and type: ", v.dtype)
    if len(shape) == 1:
        print("  Converting to float32")
        v = v.to(torch.float32)
    ftype_cur = {torch.float16: 1, torch.float32: 0}[v.dtype]
    # header
    write_header(shape, dst_name, ftype_cur)
    # data
    v.numpy().tofile(fout)
 def convert_q4(src_name, dst_name, permute=False):
    zeros = model[f"{src_name}.zeros"].numpy()
    scales = model[f"{src_name}.scales"].numpy()
    bias = model[f"{src_name}.bias"].numpy()
    qweight = model[f"{src_name}.qweight"].numpy().T # transpose
    # Q4_1 does not support bias; good thing the bias is always all zeros.
    assert not np.any(bias)
    # Each int32 item is actually 8 int4 items packed together, and it's transposed.
    shape = (qweight.shape[0], qweight.shape[1] * 8)
    print("Processing Q4 variable: " + src_name + " with shape: ", shape)
    # The output format has the int4 weights in groups of 32 rather than 8.
    # It looks like this:
    # For each row:
    #   For each group of 32 columns:
     #     - scale (float32, 4 bytes)
     #     - addend (float32, 4 bytes)
    #     - weights (int4 * 32, 16 bytes)
    # Note that in the input, the scales and addends are shared between all
    # the columns in a row, so we end up wasting quite a bit of memory with
    # repeated scales and addends.
    addends = -zeros # flip sign
    # Since the output format is mixed between integers and floats, we have
    # to hackily view the floats as int32s just so numpy will let us
    # concatenate them.
    addends_view = addends.view(dtype=np.int32)
    scales_view = scales.view(dtype=np.int32)
    # Split into groups of 4 columns (i.e. 32 columns of quantized data):
    grouped = qweight.reshape([qweight.shape[0], qweight.shape[1] // 4, 4])
    # Repeat addends and scales:
    addends_rep = np.atleast_3d(addends_view).repeat(grouped.shape[1], axis=1)
    scales_rep = np.atleast_3d(scales_view).repeat(grouped.shape[1], axis=1)
    blob = np.concatenate([scales_rep, addends_rep, grouped], axis=2, casting='no')
    if permute:
        # Permute some rows to undo the permutation done by convert_llama_weights_to_hf.py.
        # This can be done after the above conversion because it doesn't affect column order/layout.
        blob = (blob.reshape(n_head, 2, shape[0] // n_head // 2, *blob.shape[1:])
                    .swapaxes(1, 2)
                    .reshape(blob.shape))
    # header
    write_header(shape, dst_name, 3) # ftype = Q4_1
    # data
    blob.tofile(fout)
 convert_non_q4("model.embed_tokens.weight", "tok_embeddings.weight")
 convert_non_q4("model.norm.weight", "norm.weight")
 convert_non_q4("lm_head.weight", "output.weight")
 for i in range(n_layer):
    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
    convert_q4(f"model.layers.{i}.self_attn.k_proj", f"layers.{i}.attention.wk.weight", permute=True)
    convert_q4(f"model.layers.{i}.self_attn.v_proj", f"layers.{i}.attention.wv.weight")
    convert_q4(f"model.layers.{i}.self_attn.o_proj", f"layers.{i}.attention.wo.weight")
    convert_q4(f"model.layers.{i}.mlp.gate_proj", f"layers.{i}.feed_forward.w1.weight")
    convert_q4(f"model.layers.{i}.mlp.down_proj", f"layers.{i}.feed_forward.w2.weight")
    convert_q4(f"model.layers.{i}.mlp.up_proj",   f"layers.{i}.feed_forward.w3.weight")
    convert_non_q4(f"model.layers.{i}.input_layernorm.weight", f"layers.{i}.attention_norm.weight")
    convert_non_q4(f"model.layers.{i}.post_attention_layernorm.weight", f"layers.{i}.ffn_norm.weight")
 fout.close()
 print("Done. Output file: " + fname_out)
 print("")
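Review note: `convert_q4` relies on each int32 in `qweight` holding eight 4-bit values, which is why the logical shape multiplies the second dimension by 8. The script itself never unpacks the words; it copies them through as-is. The sketch below only illustrates the packing idea. It assumes the lowest nibble comes first within each word, which is an assumption of this example and is not confirmed by the diff, and `pack_int4` / `unpack_int4` are hypothetical helpers.

```python
import numpy as np

# Bit offsets of the eight nibbles in a 32-bit word, lowest nibble first
# (assumed ordering, see the note above).
_SHIFTS = np.arange(8, dtype=np.uint32) * 4


def pack_int4(values):
    """Pack values in [0, 15], eight at a time, into int32 words."""
    groups = np.asarray(values, dtype=np.uint32).reshape(-1, 8)
    packed = np.bitwise_or.reduce(groups << _SHIFTS, axis=1)
    return packed.astype(np.uint32).view(np.int32)


def unpack_int4(packed):
    """Expand int32 words into 4-bit values along the last axis."""
    words = np.asarray(packed, dtype=np.int32).view(np.uint32)
    nibbles = (words[..., None] >> _SHIFTS) & 0xF
    return nibbles.astype(np.uint8).reshape(*words.shape[:-1], -1)


if __name__ == "__main__":
    qweight = pack_int4(range(16)).reshape(1, 2)  # one row, two packed words
    print(qweight.shape, unpack_int4(qweight).shape)  # (1, 2) (1, 16)
```

Viewing the words as unsigned before shifting avoids sign extension when the top nibble is 8 or larger.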
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@ -10,12 +10,10 @@
 #   - Name (char[name_length])
 #   - Data (float[n_dims])
 #
 # By default, the bigger matrices are converted to 16-bit floats.
 # This can be disabled by adding the "use-f32" CLI argument.
 #
 # At the start of the ggml file we write the model parameters
 # and vocabulary.
 #
 import argparse
 import os
 import sys
@ -23,13 +21,15 @@ import json
 import struct
 import numpy as np
 import torch
 from sentencepiece import SentencePieceProcessor
 def parse_args():
    parser = argparse.ArgumentParser(description='Convert a LLaMA model checkpoint to a ggml compatible file')
-    parser.add_argument('dir_model', help='directory containing the model checkpoint')
+    parser.add_argument('dir_model',  help='directory containing the model checkpoint')
-    parser.add_argument('ftype', type=int, choices=[0, 1], default=1, help='file type (0: float32, 1: float16)')
+    parser.add_argument('ftype',      help='file type (0: float32, 1: float16)', type=int, choices=[0, 1], default=1)
    parser.add_argument('vocab_only', help='only write vocab to file', type=int, default=0, nargs='?')
    return parser.parse_args()
 def get_n_parts(dim):
@ -67,7 +67,7 @@ def write_header(fout, hparams, ftype):
    keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"]
    values = [
-        0x67676d66,  # magic: ggml in hex
+        0x67676d66,  # magic: ggmf in hex
        1, # file version
        *[hparams[key] for key in keys],
        hparams["dim"] // hparams["n_heads"],  # rot (obsolete)
@ -134,6 +134,29 @@ def main():
    ftype_str = ["f32", "f16"]
    hparams, tokenizer = load_hparams_and_tokenizer(dir_model)
    print(args)
    # if only writing vocab to file
    if args.vocab_only:
        fname_model = f"{dir_model}/consolidated.00.pth"
        fname_out = f"{dir_model}/ggml-vocab.bin"
        print(f"Extracting only the vocab from '{fname_model}'\n")
        model = torch.load(fname_model, map_location="cpu")
        with open(fname_out, "wb") as fout:
            write_header(fout, hparams, ftype)
            write_tokens(fout, tokenizer)
        del model
        print(f"Done. Output file: {fname_out}\n")
        return
    n_parts = get_n_parts(hparams["dim"])
    for p in range(n_parts):
@ -151,6 +174,7 @@ def main():
            process_and_write_variables(fout, model, ftype)
        del model
        print(f"Done. Output file: {fname_out}, (part {p})\n")
 if __name__ == "__main__":
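Review note on the magic-number comment fix in this file: the constant `0x67676d66` spells `ggmf`, while the `0x67676d6c` used by `convert-gptq-to-ggml.py` spells `ggml`. A quick way to check such comments is to decode the constant byte by byte. The helper name `magic_to_ascii` below is made up for this example, and the scripts pack with the native `"i"` format, whereas the sketch spells out little-endian explicitly.

```python
import struct


def magic_to_ascii(magic):
    """Spell a 32-bit magic number as ASCII, most significant byte first."""
    return magic.to_bytes(4, "big").decode("ascii")


if __name__ == "__main__":
    print(magic_to_ascii(0x67676D6C))  # ggml
    print(magic_to_ascii(0x67676D66))  # ggmf
    # On a little-endian machine the bytes land in the file reversed.
    print(struct.pack("<i", 0x67676D66))  # b'fmgg'
```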
--- a/examples/chatLLaMa
+++ b/examples/chatLLaMa
@ -0,0 +1,53 @@
 #!/bin/bash
 cd "$(dirname "$0")/.." || exit
 MODEL="${MODEL:-./models/13B/ggml-model-q4_0.bin}"
 USER_NAME="${USER_NAME:-User}"
 AI_NAME="${AI_NAME:-ChatLLaMa}"
 # Adjust to the number of CPU cores you want to use.
 N_THREAD="${N_THREAD:-8}"
 # Number of tokens to predict (made it larger than default because we want a long interaction)
 N_PREDICTS="${N_PREDICTS:-2048}"
 # Note: you can also override the generation options by specifying them on the command line:
 # For example, override the context size by doing: ./chatLLaMa --ctx_size 1024
 GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647}"
 # shellcheck disable=SC2086 # Intended splitting of GEN_OPTIONS
 ./main $GEN_OPTIONS \
  --model "$MODEL" \
  --threads "$N_THREAD" \
  --n_predict "$N_PREDICTS" \
  --color --interactive \
  --reverse-prompt "${USER_NAME}:" \
  --prompt "
 Text transcript of a never ending dialog, where ${USER_NAME} interacts with an AI assistant named ${AI_NAME}.
 ${AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer ${USER_NAME}’s requests immediately and with details and precision.
 There are no annotations like (30 seconds passed...) or (to himself), just what ${USER_NAME} and ${AI_NAME} say aloud to each other.
 The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
 The transcript only includes text, it does not include markup like HTML and Markdown.
 $USER_NAME: Hello, $AI_NAME!
 $AI_NAME: Hello $USER_NAME! How may I help you today?
 $USER_NAME: What time is it?
 $AI_NAME: It is $(date +%H:%M).
 $USER_NAME: What year is it?
 $AI_NAME: We are in $(date +%Y).
 $USER_NAME: Please tell me the largest city in Europe.
 $AI_NAME: The largest city in Europe is Moscow, the capital of Russia.
 $USER_NAME: What can you tell me about Moscow?
 $AI_NAME: Moscow, on the Moskva River in western Russia, is the nation’s cosmopolitan capital. In its historic core is the Kremlin, a complex that’s home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
 $USER_NAME: What is a cat?
 $AI_NAME: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
 $USER_NAME: How do I pass command line arguments to a Node.js program?
 $AI_NAME: The arguments are stored in process.argv.
    argv[0] is the path to the Node.js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
 $USER_NAME: Name a color.
 $AI_NAME: Blue
 $USER_NAME:" "$@"
--- a/expose.cpp
+++ b/expose.cpp
@ -39,31 +39,42 @@ extern "C" {
        char text[16384]; //16kb should be enough for any response
    };
    gpt_params api_params;
    gpt_vocab api_vocab;
    llama_model api_model;    
    int api_n_past = 0;
    gpt_vocab::id old_embd_id = -1;
    std::vector<float> api_logits;
    std::vector<gpt_vocab::id> last_n_tokens;
    size_t mem_per_token = 0;
    bool legacy_format = false;
    llama_context_params ctx_params;
    gpt_params params;
    int n_past = 0;
    llama_token old_embd_id = -1;
    int n_threads = 4;
    int n_batch = 8;
    std::string model;
    llama_context * ctx;
    std::vector<llama_token> last_n_tokens;
    bool load_model(const load_model_inputs inputs)
    {
-        api_params.n_threads = inputs.threads;
+        ctx_params = llama_context_default_params();
        api_params.n_ctx = inputs.max_context_length;
        api_params.n_batch = inputs.batch_size;
        api_params.model = inputs.model_filename;
-        int n_parts_overwrite =  inputs.n_parts_overwrite;
+        n_threads = inputs.threads;       
        n_batch = inputs.batch_size;
        model = inputs.model_filename;        
-        int loadresult = llama_model_load(api_params.model, api_model, api_vocab, api_params.n_ctx, GGML_TYPE_F16, n_parts_overwrite);
+        ctx_params.n_ctx      = inputs.max_context_length;
-        if (!loadresult) {  
+        ctx_params.n_parts    = inputs.n_parts_overwrite;
-            fprintf(stderr, "%s: failed to load model from '%s'\n", __func__, api_params.model.c_str());
+        ctx_params.seed       = -1;
        ctx_params.f16_kv     = true;
        ctx_params.logits_all = false;
        ctx = llama_init_from_file(model.c_str(), ctx_params);
        if (ctx == NULL) {
            fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, model.c_str());
            return false;
        }
-        legacy_format = (loadresult==2?true:false);
+
        //return val: 0=fail, 1=legacy, 2=newformat
        int fileformat = check_file_format(model.c_str());        
        legacy_format = (fileformat==1?true:false);
        if(legacy_format)
        {
            printf("\n---\nWarning: Your model is using an OUTDATED format. Please reconvert it for better results!\n");
@ -74,69 +85,76 @@ extern "C" {
    generation_outputs generate(const generation_inputs inputs, generation_outputs & output)
    {
-        api_params.prompt = inputs.prompt;
+        params.prompt = inputs.prompt;
-        api_params.seed = inputs.seed;
+        params.seed = inputs.seed;
-        api_params.n_predict = inputs.max_length;
+        params.n_predict = inputs.max_length;
-        api_params.top_k = inputs.top_k;
+        params.top_k = inputs.top_k;
-        api_params.top_p = inputs.top_p;
+        params.top_p = inputs.top_p;
-        api_params.temp = inputs.temperature;
+        params.temp = inputs.temperature;
-        api_params.repeat_last_n = inputs.rep_pen_range;
+        params.repeat_last_n = inputs.rep_pen_range;
-        api_params.repeat_penalty = inputs.rep_pen;
+        params.repeat_penalty = inputs.rep_pen;
-        api_params.n_ctx = inputs.max_context_length;
+        params.n_ctx = inputs.max_context_length;
        params.n_batch = n_batch;
        params.n_threads = n_threads;
        bool reset_state = inputs.reset_state;
-        if(api_n_past==0)
+        if(n_past==0)
        {
            reset_state = true;
        }
-        if(api_params.repeat_last_n<1)
+        if(params.repeat_last_n<1)
        {
-            api_params.repeat_last_n = 1;
+            params.repeat_last_n = 1;
        }
-        if(api_params.top_k<1)
+        if(params.top_k<1)
        {
-            api_params.top_k = 300; //to disable top_k we actually need to increase this value to a very high number
+            params.top_k = 300; //to disable top_k we actually need to increase this value to a very high number
        }
-        if (api_params.seed < 0)
+        if (params.seed <= 0)
        {
-            api_params.seed = time(NULL);
+            params.seed = time(NULL);
        }
 		if(reset_state)
 		{
 			params.prompt.insert(0, 1, ' ');
 		}
 	    // tokenize the prompt
 		std::vector<llama_token> embd_inp;
 		if(legacy_format)
        {
            embd_inp = ::legacy_llama_tokenize(ctx, params.prompt, true);
        }else{
            embd_inp = ::llama_tokenize(ctx, params.prompt, true);
        }
 		//params.n_predict = std::min(params.n_predict, params.n_ctx - (int) embd_inp.size());
        //truncate the front of the prompt if it's too long
        if (embd_inp.size() + params.n_predict > params.n_ctx) {
            int offset = embd_inp.size() - params.n_ctx + params.n_predict;
            embd_inp = std::vector<llama_token>(embd_inp.begin() + offset, embd_inp.end());
        }	   
   		std::vector<llama_token> embd;
 		int last_n_size = params.repeat_last_n;
    	last_n_tokens.resize(last_n_size);
        //display usage
        // std::string tst = " ";
        // char * tst2 = (char*)tst.c_str();
-        // gpt_print_usage(1,&tst2,api_params);
+        // gpt_print_usage(1,&tst2,params);
-        
+
-        if(reset_state)
+		if(reset_state)
        {
-            api_params.prompt.insert(0, 1, ' ');
+			const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
-        }
+	        llama_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);            
        // tokenize the prompt
        std::vector<gpt_vocab::id> embd_inp;
        if(legacy_format)
        {
            embd_inp = ::legacy_llama_tokenize(api_vocab, api_params.prompt, true);
        }else{
            embd_inp = ::llama_tokenize(api_vocab, api_params.prompt, true);
        }
        //api_params.n_predict = std::min(api_params.n_predict, api_model.hparams.n_ctx - (int)embd_inp.size());
        //truncate to front of the prompt if its too long
        if (embd_inp.size() + api_params.n_predict > api_model.hparams.n_ctx) {
            int offset = embd_inp.size() - api_model.hparams.n_ctx + api_params.n_predict;
            embd_inp = std::vector<gpt_vocab::id>(embd_inp.begin() + offset, embd_inp.end());
        }
        std::vector<gpt_vocab::id> embd;
        int last_n_size = api_params.repeat_last_n;
        last_n_tokens.resize(last_n_size);
        if(reset_state)
        {
            llama_eval(api_model, api_params.n_threads, 0, {0, 1, 2, 3}, api_logits, mem_per_token);
            std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
-            api_n_past = 0;
+            n_past = 0;
-        }else{
+        }
        else
        {
            //strip out the reset token (1) at the start of the embedding
            if(embd_inp.size()>0)
            {
@ -147,96 +165,97 @@ extern "C" {
                embd.push_back(old_embd_id);
            }
        }
-        
+		
-        int remaining_tokens = api_params.n_predict;
+ 		int remaining_tokens = params.n_predict;
-        int input_consumed = 0;
+		int input_consumed = 0;
-        std::mt19937 api_rng(api_params.seed);
+    	std::mt19937 rng(params.seed);   
-        std::string concat_output = "";        
+		std::string concat_output = "";  
-       
+    	
-        bool startedsampling = false;
+		bool startedsampling = false;
        printf("\nProcessing Prompt: ");
-        while (remaining_tokens > 0)
+
-        {
+		while (remaining_tokens > 0) 
-            gpt_vocab::id id = 0;
+		{
-            // predict
+			llama_token id = 0;
-            if (embd.size() > 0)
+	        // predict
-            {
+	        if (embd.size() > 0) 
 			{
 				printf("|");
                // for (auto i: embd) {                    
                //     std::cout << i << ',';
                // }
-                //printf("\nnp:%d embd:%d mem:%d",api_n_past,embd.size(),mem_per_token);
+                // printf("\nnp:%d embd:%d",n_past,embd.size());
-                printf("|");
+	            if (llama_eval(ctx, embd.data(), embd.size(), n_past, params.n_threads)) 
-                if (!llama_eval(api_model, api_params.n_threads, api_n_past, embd, api_logits, mem_per_token))
+				{
-                {
+	                fprintf(stderr, "Failed to predict\n");
                    fprintf(stderr, "Failed to predict\n");
                    snprintf(output.text, sizeof(output.text), "%s", "");
                    output.status = 0;
                    return output;
-                }
+	            }
-            }
+	        }
-            api_n_past += embd.size();
+        	n_past += embd.size();
-            embd.clear();            
+       		embd.clear();
-            if (embd_inp.size() <= input_consumed)
+        	if ((int) embd_inp.size() <= input_consumed) 
-            {
+			{
-                // out of user input, sample next token
+	            // out of user input, sample next token
-                const float top_k = api_params.top_k;
+	            const float top_k          = params.top_k;
-                const float top_p = api_params.top_p;
+	            const float top_p          = params.top_p;
-                const float temp = api_params.temp;
+	            const float temp           = params.temp;
-                const float repeat_penalty = api_params.repeat_penalty;
+	            const float repeat_penalty = params.repeat_penalty;
-                const int n_vocab = api_model.hparams.n_vocab;
+
-                
+            	if(!startedsampling)
                if(!startedsampling)
                {
                    startedsampling = true;
                    printf("\nGenerating: ");
                }
-                {
+	            {
-                    // set the logit of the eos token (2) to zero to avoid sampling it
+	                auto logits = llama_get_logits(ctx);
-                    api_logits[api_logits.size() - n_vocab + EOS_TOKEN_ID] = 0;
+					// set the logit of the eos token (2) to zero to avoid sampling it
-                    //set logits of opening square bracket to zero.
+	                logits[llama_token_eos()] = 0;
-                    api_logits[api_logits.size() - n_vocab + 518] = 0;
+					//set logits of opening square bracket to zero.
-                    api_logits[api_logits.size() - n_vocab + 29961] = 0;
+					logits[518] = 0;
 					logits[29961] = 0;
 	                id = llama_sample_top_p_top_k(ctx, last_n_tokens.data(), last_n_tokens.size(), top_k, top_p, temp, repeat_penalty);
 	                last_n_tokens.erase(last_n_tokens.begin());
 	                last_n_tokens.push_back(id);
 	            }
 	            // add it to the context
 				old_embd_id = id;
 	            embd.push_back(id);
-                    id = llama_sample_top_p_top_k(api_vocab, api_logits.data() + (api_logits.size() - n_vocab), last_n_tokens, repeat_penalty, top_k, top_p, temp, api_rng);
+	            // decrement remaining sampling budget
 	            --remaining_tokens;
                //printf("\nid:%d word:%s\n",id,llama_token_to_str(ctx, id));
 				concat_output += llama_token_to_str(ctx, id);
        	} 
 			else 
 			{
 	            // some user input remains from prompt or interaction, forward it to processing
 	            while ((int) embd_inp.size() > input_consumed) 
 				{
 					old_embd_id = embd_inp[input_consumed];
 	                embd.push_back(embd_inp[input_consumed]);
 	                last_n_tokens.erase(last_n_tokens.begin());
 	                last_n_tokens.push_back(embd_inp[input_consumed]);
 	                ++input_consumed;
 	                if ((int) embd.size() >= params.n_batch) 
 					{
 	                    break;
 	                }
            	}
        	}
-                    last_n_tokens.erase(last_n_tokens.begin());
+		}
-                    last_n_tokens.push_back(id);
+       		
-                }
+		output.status = 1;
                // add it to the context
                old_embd_id = id;
                embd.push_back(id);
                // decrement remaining sampling budget
                --remaining_tokens;
                //printf("\nid:%d word:%s\n",id,api_vocab.id_to_token[id].c_str());
                concat_output += api_vocab.id_to_token[id].c_str();
            }
            else
            {
                // some user input remains from prompt or interaction, forward it to processing
                while (embd_inp.size() > input_consumed)
                {
                    old_embd_id = embd_inp[input_consumed];
                    embd.push_back(embd_inp[input_consumed]);
                    last_n_tokens.erase(last_n_tokens.begin());
                    last_n_tokens.push_back(embd_inp[input_consumed]);
                    ++input_consumed;
                    if (embd.size() > api_params.n_batch)
                    {
                        break;
                    }
                }
            }
        }
        //printf("output: %s",concat_output.c_str());
        output.status = 1;
        snprintf(output.text, sizeof(output.text), "%s", concat_output.c_str());
        return output;
    }
 }
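Review note: the prompt handling in `generate` does two things before sampling. It drops tokens from the front of the prompt so that prompt plus `n_predict` fits in `n_ctx`, and it then feeds the remaining tokens to the model in chunks of at most `n_batch`. Below is a small Python sketch of just that bookkeeping. `truncate_prompt` and `batches` are hypothetical names used for illustration only; the real loop interleaves batching with `llama_eval` and sampling.

```python
def truncate_prompt(tokens, n_ctx, n_predict):
    """Drop tokens from the front so prompt plus generated tokens fit in n_ctx."""
    overflow = len(tokens) + n_predict - n_ctx
    if overflow > 0:
        # Python slicing clamps an oversized offset to an empty list;
        # the C++ iterator arithmetic in the diff has no such guard.
        return tokens[overflow:]
    return tokens


def batches(tokens, n_batch):
    """Yield consecutive slices of at most n_batch tokens."""
    for start in range(0, len(tokens), n_batch):
        yield tokens[start:start + n_batch]
```

One thing the sketch makes visible: when `n_predict` alone exceeds `n_ctx`, the computed offset is larger than the prompt, which may be worth guarding against in the C++ version.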
--- a/extra.cpp
+++ b/extra.cpp
@ -1,5 +1,6 @@
 #include "extra.h"
 #include "llama.cpp"
 #include <cassert>
 #include <cstring>
@ -17,13 +18,41 @@
 #include <alloca.h>
 #endif
 //return val: 0=fail, 1=legacy, 2=newformat
 int check_file_format(const std::string & fname)
 {
    std::vector<char> f_buf(1024*1024);
    auto fin = std::ifstream(fname, std::ios::binary);
    fin.rdbuf()->pubsetbuf(f_buf.data(), f_buf.size());
    if (!fin) {
        fprintf(stderr, "%s: failed to open '%s'\n", __func__, fname.c_str());
        return 0;
    }
    int fileformat = 0;
    uint32_t magic;
    fin.read((char *) &magic, sizeof(magic));
    if (magic == LLAMA_FILE_MAGIC_UNVERSIONED) {
       fileformat = 1;
    }else{
        fileformat = 2;
    }
    fin.close();
    return fileformat;
 }
 // TODO: Calculate this constant from the vocabulary
 #define MAX_TOKEN_LEN 18
 // SentencePiece implementation after https://guillaume-be.github.io/2020-05-30/sentence_piece
-std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos) {
+std::vector<llama_token> legacy_llama_tokenize(const llama_vocab & vocab, const std::string & text, bool bos) {
-    std::vector<gpt_vocab::id> res;
+    std::vector<llama_token> res;
    std::vector<int> score;
-    std::vector<gpt_vocab::id> prev;
+    std::vector<llama_token> prev;
    int len = text.length();
    score.resize(len + 1);
@ -50,14 +79,14 @@ std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const
    // Backward pass
    int i = len;
    while (i > 0) {
-        gpt_vocab::id token_id = prev[i];
+        llama_token token_id = prev[i];
        if (token_id == 0) {
 	    // TODO: Return error or something more meaningful
            printf("failed to tokenize string!\n");
 	    break;
        }
        res.push_back(token_id);
-        auto token = (*vocab.id_to_token.find(token_id)).second;
+        auto token = vocab.id_to_token[token_id].tok;
        i -= token.length();
    }
@ -68,5 +97,33 @@ std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const
    // Pieces are in reverse order so correct that
    std::reverse(res.begin(), res.end());
    return res;
 }
 int legacy_llama_tokenize(
        struct llama_context * ctx,
                  const char * text,
                 llama_token * tokens,
                         int   n_max_tokens,
                        bool   add_bos) {
    auto res = legacy_llama_tokenize(ctx->vocab, text, add_bos);
    if (n_max_tokens < (int) res.size()) {
        fprintf(stderr, "%s: too many tokens\n", __func__);
        return -((int) res.size());
    }
    for (size_t i = 0; i < res.size(); i++) {
        tokens[i] = res[i];
    }
    return res.size();
 }
 std::vector<llama_token> legacy_llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
    std::vector<llama_token> res(8096);
    int n = legacy_llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
    res.resize(n);
    return res;
 }
--- a/extra.h
+++ b/extra.h
@ -11,4 +11,9 @@
 #include <string>
 #include <vector>
-std::vector<gpt_vocab::id> legacy_llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos);
+#include "llama.h"
 int check_file_format(const std::string & fname);
 std::vector<llama_token> legacy_llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
--- a/ggml.c
+++ b/ggml.c
@ -2,7 +2,7 @@
 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
-#elif !defined(__FreeBSD__) && !defined(__NetBSD__)
+#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
 #include <alloca.h>
 #endif
@ -361,7 +361,7 @@ static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);
 // AVX routines provided by GH user Const-me
 // ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600
-#if __AVX2__
+#if __AVX2__ || __AVX512F__
 // Unpack 32 4-bit fields into 32 bytes
 // The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval
 static inline __m256i bytesFromNibbles( const uint8_t* rsi )
@ -397,7 +397,6 @@ static inline __m128i packNibbles( __m256i bytes )
 }
 #endif
 // method 5
 // blocks of QK elements
 // represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors)
@ -1262,6 +1261,47 @@ inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float
    *s = sumf;
 }
 #if __AVX512F__ && QK == 32
 static inline __m512 dot_q4_0_oneblock_avx512(
    __m512 acc,
    const uint8_t * pd0,
    const uint8_t * pd1,
    const uint8_t * pb0,
    const uint8_t * pb1,
    size_t bs,
    int i
 ) {
    const float * d0_0 = (const float *) (pd0 + i*bs);
    const float * d1_0 = (const float *) (pd1 + i*bs);
    const uint8_t * restrict p0 = pb0 + (i+0)*bs;
    const uint8_t * restrict p1 = pb1 + (i+0)*bs;
    // Compute combined scale for the block
    float scaleScalar = d0_0[0] * d1_0[0];
    __m512 scale = _mm512_set1_ps( scaleScalar );
    __m256i bx = bytesFromNibbles( p0 );
    __m256i by = bytesFromNibbles( p1 );
    // Now we have a vector with bytes in [ 0 .. 15 ] interval. Offset them into [ -8 .. +7 ] interval.
    const __m256i off = _mm256_set1_epi8( 8 );
    bx = _mm256_sub_epi8( bx, off );
    by = _mm256_sub_epi8( by, off );
    // Sign-extend 32 signed bytes into 32 int16_t lanes
    __m512i x32 = _mm512_cvtepi8_epi16( bx );
    __m512i y32 = _mm512_cvtepi8_epi16( by );
    // Compute products of int16_t integers, add pairwise
    __m512i i64 = _mm512_madd_epi16( x32, y32 );
    // Convert int32_t to float
    __m512 p = _mm512_cvtepi32_ps( i64 );
    // Apply the scale, and accumulate
    return _mm512_fmadd_ps( scale, p, acc );
 }
 #endif
 inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
    ggml_float sumf = 0.0;
@ -1417,6 +1457,40 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
 #else
 #error "not implemented for QK"
 #endif
 #elif defined(__AVX512F__)
 #if QK == 32
    // Initialize accumulator with zeros
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    const int superblock_size = 8;
    const int superblock_count = nb / superblock_size;
    const int remainder = nb % superblock_size;
    for (int superblock_ix = 0; superblock_ix < superblock_count; superblock_ix += 1) {
        int i = superblock_ix * superblock_size;
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+0 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+1 );
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+2 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+3 );
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+4 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+5 );
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+6 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+7 );
    }
    // Remainders
    for (int i = superblock_count * superblock_size; i < nb; ++i) {
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i );
    }
    // Horizontal sum of all lanes of the accumulator
    sumf = _mm512_reduce_add_ps( acc0 ) + _mm512_reduce_add_ps( acc1 );
 #else
 #error "not implemented for QK"
 #endif
 #elif defined(__AVX2__)
 #if QK == 32
    const size_t countBlocks = nb;
@ -1928,7 +2002,7 @@ inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, void * res
    const size_t bs = 2*sizeof(float) + QK/2;
    const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
-    const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs +   sizeof(float)); 
+    const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs +   sizeof(float));
    const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
    for (int i = 0; i < nb; i++) {
@ -10628,6 +10702,127 @@ enum ggml_opt_result ggml_opt(
 ////////////////////////////////////////////////////////////////////////////////
 size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
    const int nb = k / qk;
    const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
    const size_t row_size = nb*bs;
    assert(k % qk == 0);
    const size_t pp_size = qk / 2;
    uint8_t * pp = (uint8_t *) alloca(pp_size);
    char * pdst = (char *) dst;
    for (int j = 0; j < n; j += k) {
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
        for (int i = 0; i < nb; i++) {
            float amax = 0.0f; // absolute max
            {
                for (int l = 0; l < qk; l++) {
                    const float v = src[j + i*qk + l];
                    amax = MAX(amax, fabsf(v));
                }
                const float d = amax / ((1 << 3) - 1);
                const float id = d ? 1.0f/d : 0.0f;
                *(float *) pd = d;
                pd += bs;
                for (int l = 0; l < qk; l += 2) {
                    const float v0 = (src[j + i*qk + l + 0])*id;
                    const float v1 = (src[j + i*qk + l + 1])*id;
                    const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
                    const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
                    assert(vi0 >= 0 && vi0 < 16);
                    assert(vi1 >= 0 && vi1 < 16);
                    hist[vi0]++;
                    hist[vi1]++;
                    pp[l/2] = vi0 | (vi1 << 4);
                }
                memcpy(pb, pp, pp_size);
                pb += bs;
            }
        }
    }
    return (n/k)*row_size;
 }
 size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
    const int nb = k / qk;
    const size_t bs = (2*sizeof(float) + sizeof(uint8_t)*qk/2);
    const size_t row_size = nb*bs;
    assert(k % qk == 0);
    const size_t pp_size = qk / 2;
    uint8_t * pp = (uint8_t *) alloca(pp_size);
    char * pdst = (char *) dst;
    for (int j = 0; j < n; j += k) {
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pm = (uint8_t *) (pdst + (j/k)*row_size + 0*bs +   sizeof(float));
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
        //printf("n = %d, k = %d, nb = %d, row_size = %d, j = %d, pm = %p, pd = %p, pb = %p\n", n, k, nb, row_size, j, pm, pd, pb);
        for (int i = 0; i < nb; i++) {
            float min = FLT_MAX;
            float max = -FLT_MAX;
            {
                for (int l = 0; l < qk; l++) {
                    const float v = src[j + i*qk + l];
                    if (v < min) min = v;
                    if (v > max) max = v;
                }
                const float d = (max - min) / ((1 << 4) - 1);
                const float id = d ? 1.0f/d : 0.0f;
                *(float *) pd = d;
                *(float *) pm = min;
                pd += bs;
                pm += bs;
                for (int l = 0; l < qk; l += 2) {
                    const float v0 = (src[j + i*qk + l + 0] - min)*id;
                    const float v1 = (src[j + i*qk + l + 1] - min)*id;
                    const uint8_t vi0 = round(v0);
                    const uint8_t vi1 = round(v1);
                    assert(vi0 >= 0 && vi0 < 16);
                    assert(vi1 >= 0 && vi1 < 16);
                    hist[vi0]++;
                    hist[vi1]++;
                    pp[l/2] = vi0 | (vi1 << 4);
                }
                memcpy(pb, pp, pp_size);
                pb += bs;
            }
        }
    }
    return (n/k)*row_size;
 }
 ////////////////////////////////////////////////////////////////////////////////
 int ggml_cpu_has_avx(void) {
 #if defined(__AVX__)
    return 1;
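Review note: for readers following the `ggml_quantize_q4_0` and `dot_q4_0_oneblock_avx512` hunks, here is a scalar Python sketch of the same block layout. Each block is one float32 scale followed by `QK/2` bytes holding two 4-bit values, and the dot product multiplies the integer sum by both scales once. All function names are hypothetical, the byte order is assumed little-endian, and the arithmetic is done in double precision, so results can differ from the C code on exact rounding ties.

```python
import math
import struct

QK = 32  # elements per block, as in this commit


def c_round(x):
    # C's round() rounds halves away from zero; Python's round() does not.
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_block_q4_0(values):
    """Pack QK floats into a 4-byte float32 scale plus QK/2 bytes of nibbles."""
    assert len(values) == QK
    amax = max(abs(v) for v in values)
    d = amax / 7.0                      # 7 == (1 << 3) - 1
    inv_d = 1.0 / d if d else 0.0
    packed = bytearray()
    for i in range(0, QK, 2):
        q0 = c_round(values[i] * inv_d) + 8      # stored value in [1, 15]
        q1 = c_round(values[i + 1] * inv_d) + 8
        packed.append(q0 | (q1 << 4))            # low nibble holds the first value
    return struct.pack("<f", d) + bytes(packed)


def unpack_block_q4_0(block):
    """Return (scale, list of QK signed integers after removing the +8 offset)."""
    (d,) = struct.unpack_from("<f", block, 0)
    quants = []
    for byte in block[4:]:
        quants.append((byte & 0x0F) - 8)
        quants.append((byte >> 4) - 8)
    return d, quants


def dequantize_block_q4_0(block):
    d, quants = unpack_block_q4_0(block)
    return [q * d for q in quants]


def dot_q4_0(block_x, block_y):
    """Integer dot product of the nibbles, scaled once by both deltas."""
    dx, qx = unpack_block_q4_0(block_x)
    dy, qy = unpack_block_q4_0(block_y)
    return dx * dy * sum(a * b for a, b in zip(qx, qy))
```

The reconstruction error per element is bounded by roughly half the block scale, which is why blocks with one large outlier lose precision on their small values.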
--- a/ggml.h
+++ b/ggml.h
@ -741,6 +741,13 @@ enum ggml_opt_result ggml_opt(
        struct ggml_opt_params params,
        struct ggml_tensor * f);
 //
 // quantization
 //
 size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist);
 size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist);
 //
 // system info
 //
--- a/llama.cpp
+++ b/llama.cpp
--- a/llama.h
+++ b/llama.h
@ -0,0 +1,139 @@
 #ifndef LLAMA_H
 #define LLAMA_H
 #include <stddef.h>
 #include <stdint.h>
 #include <stdbool.h>
 #ifdef LLAMA_SHARED
 #    ifdef _WIN32
 #        ifdef LLAMA_BUILD
 #            define LLAMA_API __declspec(dllexport)
 #        else
 #            define LLAMA_API __declspec(dllimport)
 #        endif
 #    else
 #        define LLAMA_API __attribute__ ((visibility ("default")))
 #    endif
 #else
 #    define LLAMA_API
 #endif
 #define LLAMA_FILE_VERSION 1
 #define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
 #define LLAMA_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
 #ifdef __cplusplus
 extern "C" {
 #endif
    //
    // C interface
    //
    // TODO: show sample usage
    //
    struct llama_context;
    typedef int llama_token;
    typedef struct llama_token_data {
        llama_token id;  // token id
        float p;     // probability of the token
        float plog;  // log probability of the token
    } llama_token_data;
    struct llama_context_params {
        int n_ctx;   // text context
        int n_parts; // -1 for default
        int seed;    // RNG seed, 0 for random
        bool f16_kv;     // use fp16 for KV cache
        bool logits_all; // the llama_eval() call computes all logits, not just the last one
        bool vocab_only; // only load the vocabulary, no weights
    };
    LLAMA_API struct llama_context_params llama_context_default_params();
    // Various functions for loading a ggml llama model.
    // Allocate (almost) all memory needed for the model.
    // Return NULL on failure
    LLAMA_API struct llama_context * llama_init_from_file(
                             const char * path_model,
            struct llama_context_params   params);
    // Frees all allocated memory
    LLAMA_API void llama_free(struct llama_context * ctx);
    // TODO: not great API - very likely to change
    // Returns 0 on success
    LLAMA_API int llama_model_quantize(
            const char * fname_inp,
            const char * fname_out,
                   int   itype,
                   int   qk);
    // Run the llama inference to obtain the logits and probabilities for the next token.
    // tokens + n_tokens is the provided batch of new tokens to process
    // n_past is the number of tokens to use from previous eval calls
    // Returns 0 on success
    LLAMA_API int llama_eval(
            struct llama_context * ctx,
               const llama_token * tokens,
                             int   n_tokens,
                             int   n_past,
                             int   n_threads);
    // Convert the provided text into tokens.
    // The tokens pointer must be large enough to hold the resulting tokens.
    // Returns the number of tokens on success, no more than n_max_tokens
    // Returns a negative number on failure - the number of tokens that would have been returned
    // TODO: not sure if correct
    LLAMA_API int llama_tokenize(
            struct llama_context * ctx,
                      const char * text,
                     llama_token * tokens,
                             int   n_max_tokens,
                            bool   add_bos);
    LLAMA_API int llama_n_vocab(struct llama_context * ctx);
    LLAMA_API int llama_n_ctx  (struct llama_context * ctx);
    // Token logits obtained from the last call to llama_eval()
    // The logits for the last token are stored in the last row
    // Can be mutated in order to change the probabilities of the next token
    // Rows: n_tokens
    // Cols: n_vocab
    LLAMA_API float * llama_get_logits(struct llama_context * ctx);
    // Token Id -> String. Uses the vocabulary in the provided context
    LLAMA_API const char * llama_token_to_str(struct llama_context * ctx, llama_token token);
    // Special tokens
    LLAMA_API llama_token llama_token_bos();
    LLAMA_API llama_token llama_token_eos();
    // TODO: improve the last_n_tokens interface ?
    LLAMA_API llama_token llama_sample_top_p_top_k(
              llama_context * ctx,
          const llama_token * last_n_tokens_data,
                        int   last_n_tokens_size,
                        int   top_k,
                     double   top_p,
                     double   temp,
                     double   repeat_penalty);
    // Performance information
    LLAMA_API void llama_print_timings(struct llama_context * ctx);
    LLAMA_API void llama_reset_timings(struct llama_context * ctx);
    // Print system information
    LLAMA_API const char * llama_print_system_info(void);
 #ifdef __cplusplus
 }
 #endif
 #endif
--- a/llamacpp.dll
+++ b/llamacpp.dll
--- a/main.cpp
+++ b/main.cpp
--- a/main.exe
+++ b/main.exe
--- a/models/ggml-vocab.bin
+++ b/models/ggml-vocab.bin
--- a/quantize.cpp
+++ b/quantize.cpp
@ -1,317 +1,17 @@
 #include "ggml.h"
 #include "llama.h"
 #include "utils.h"
 #include <cassert>
 #include <cinttypes>
 #include <cmath>
 #include <cstdio>
 #include <cstring>
 #include <fstream>
 #include <map>
 #include <string>
 #include <vector>
 #include <regex>
-// TODO: move somewhere else
+const int QK = 32;
 #define QK 32
 // default hparams (LLaMA 7B)
 struct llama_hparams {
    int32_t n_vocab = 32000;
    int32_t n_ctx   = 512;   // this is provided as user input?
    int32_t n_embd  = 4096;
    int32_t n_mult  = 256;
    int32_t n_head  = 32;
    int32_t n_layer = 32;
    int32_t n_rot   = 64;
    int32_t f16     = 1;
 };
 // quantize a model
 bool llama_model_quantize(const std::string & fname_inp, const std::string & fname_out, int itype) {
    ggml_type type = GGML_TYPE_Q4_1;
    switch (itype) {
        case 2: type = GGML_TYPE_Q4_0; break;
        case 3: type = GGML_TYPE_Q4_1; break;
        default: fprintf(stderr, "%s: invalid quantization type %d\n", __func__, itype); return false;
    };
    if (type != GGML_TYPE_Q4_0 && type != GGML_TYPE_Q4_1) {
        fprintf(stderr, "%s: invalid quantization type %d\n", __func__, type);
        return false;
    }
    gpt_vocab vocab;
    printf("%s: loading model from '%s'\n", __func__, fname_inp.c_str());
    auto finp = std::ifstream(fname_inp, std::ios::binary);
    if (!finp) {
        fprintf(stderr, "%s: failed to open '%s' for reading\n", __func__, fname_inp.c_str());
        return false;
    }
    auto fout = std::ofstream(fname_out, std::ios::binary);
    if (!fout) {
        fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname_out.c_str());
        return false;
    }
    // verify magic
    {
        uint32_t magic;
        finp.read((char *) &magic, sizeof(magic));
        if (magic == FILE_MAGIC_UNVERSIONED) {
            fprintf(stderr, "%s: invalid model file '%s' (too old, regenerate your model files!)\n",
                    __func__, fname_inp.c_str());
            return false;
        }
        if (magic != FILE_MAGIC) {
            fprintf(stderr, "%s: invalid model file '%s' (bad magic)\n", __func__, fname_inp.c_str());
            return false;
        }
        fout.write((char *) &magic, sizeof(magic));
        uint32_t format_version;
        finp.read((char *) &format_version, sizeof(format_version));
        if (format_version != FILE_VERSION) {
            fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ", expected %d)\n",
                    __func__, fname_inp.c_str(), format_version, FILE_VERSION);
            return false;
        }
        fout.write((char *) &format_version, sizeof(format_version));
    }
    llama_hparams hparams;
    // load hparams
    {
        finp.read((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
        //finp.read((char *) &hparams.n_ctx,   sizeof(hparams.n_ctx));
        finp.read((char *) &hparams.n_embd,  sizeof(hparams.n_embd));
        finp.read((char *) &hparams.n_mult,  sizeof(hparams.n_mult));
        finp.read((char *) &hparams.n_head,  sizeof(hparams.n_head));
        finp.read((char *) &hparams.n_layer, sizeof(hparams.n_layer));
        finp.read((char *) &hparams.n_rot,   sizeof(hparams.n_rot));
        finp.read((char *) &hparams.f16,     sizeof(hparams.f16));
        printf("%s: n_vocab = %d\n", __func__, hparams.n_vocab);
        printf("%s: n_ctx   = %d\n", __func__, hparams.n_ctx);
        printf("%s: n_embd  = %d\n", __func__, hparams.n_embd);
        printf("%s: n_mult  = %d\n", __func__, hparams.n_mult);
        printf("%s: n_head  = %d\n", __func__, hparams.n_head);
        printf("%s: n_layer = %d\n", __func__, hparams.n_layer);
        printf("%s: f16     = %d\n", __func__, hparams.f16);
        fout.write((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
        //fout.write((char *) &hparams.n_ctx,   sizeof(hparams.n_ctx));
        fout.write((char *) &hparams.n_embd,  sizeof(hparams.n_embd));
        fout.write((char *) &hparams.n_mult,  sizeof(hparams.n_mult));
        fout.write((char *) &hparams.n_head,  sizeof(hparams.n_head));
        fout.write((char *) &hparams.n_layer, sizeof(hparams.n_layer));
        fout.write((char *) &hparams.n_rot,   sizeof(hparams.n_rot));
        fout.write((char *) &itype,           sizeof(hparams.f16));
    }
    // load vocab
    {
        const int32_t n_vocab = hparams.n_vocab;
        if (n_vocab != hparams.n_vocab) {
            fprintf(stderr, "%s: invalid model file '%s' (bad vocab size %d != %d)\n",
                    __func__, fname_inp.c_str(), n_vocab, hparams.n_vocab);
            return false;
        }
        std::string word;
        for (int i = 0; i < n_vocab; i++) {
            uint32_t len;
            finp.read ((char *) &len, sizeof(len));
            fout.write((char *) &len, sizeof(len));
            word.resize(len);
            finp.read ((char *) word.data(), len);
            fout.write((char *) word.data(), len);
            float score;
            finp.read ((char *) &score, sizeof(score));
            fout.write((char *) &score, sizeof(score));
            vocab.token_to_id[word] = i;
            vocab.id_to_token[i] = word;
            vocab.score[i] = score;
        }
    }
    // load weights
    {
        size_t total_size_org = 0;
        size_t total_size_new = 0;
        std::vector<float> work;
        std::vector<uint8_t>     data_u8;
        std::vector<ggml_fp16_t> data_f16;
        std::vector<float>       data_f32;
        std::vector<int64_t> hist_all(1 << 4, 0);
        while (true) {
            int32_t n_dims;
            int32_t length;
            int32_t ftype;
            finp.read(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
            finp.read(reinterpret_cast<char *>(&length), sizeof(length));
            finp.read(reinterpret_cast<char *>(&ftype),  sizeof(ftype));
            if (finp.eof()) {
                break;
            }
            int32_t nelements = 1;
            int32_t ne[2] = { 1, 1 };
            for (int i = 0; i < n_dims; ++i) {
                finp.read (reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
                nelements *= ne[i];
            }
            std::string name(length, 0);
            finp.read (&name[0], length);
            {
                static const char * ftype_str[] = { "f32", "f16", "q4_0", "q4_1", };
                printf("%48s - [%5d, %5d], type = %6s ", name.data(), ne[0], ne[1], ftype_str[ftype]);
            }
            // regexes of tensor names to be quantized
            const std::vector<std::string> k_names = {
                ".*weight",
            };
            bool quantize = false;
            for (const auto & s : k_names) {
                if (std::regex_match(name, std::regex(s))) {
                    quantize = true;
                    break;
                }
            }
            // quantize only 2D tensors
            quantize &= (n_dims == 2);
            if (quantize) {
                if (ftype != 0 && ftype != 1) {
                    fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", __func__, ftype);
                    return false;
                }
                if (ftype == 1) {
                    data_f16.resize(nelements);
                    finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
                    data_f32.resize(nelements);
                    for (int i = 0; i < nelements; ++i) {
                        data_f32[i] = ggml_fp16_to_fp32(data_f16[i]);
                    }
                } else {
                    data_f32.resize(nelements);
                    finp.read(reinterpret_cast<char *>(data_f32.data()), nelements * sizeof(float));
                }
                ftype = itype;
            } else {
                const int bpe = (ftype == 0) ? sizeof(float) : sizeof(uint16_t);
                data_u8.resize(nelements*bpe);
                finp.read(reinterpret_cast<char *>(data_u8.data()), nelements * bpe);
            }
            fout.write(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
            fout.write(reinterpret_cast<char *>(&length), sizeof(length));
            fout.write(reinterpret_cast<char *>(&ftype),  sizeof(ftype));
            for (int i = 0; i < n_dims; ++i) {
                fout.write(reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
            }
            fout.write(&name[0], length);
            if (quantize) {
                printf("quantizing .. ");
                work.resize(nelements); // for quantization
                size_t cur_size = 0;
                std::vector<int64_t> hist_cur(1 << 4, 0);
                switch (type) {
                    case GGML_TYPE_Q4_0:
                        {
                            cur_size = ggml_quantize_q4_0(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
                        } break;
                    case GGML_TYPE_Q4_1:
                        {
                            cur_size = ggml_quantize_q4_1(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
                        } break;
                    default:
                        {
                            fprintf(stderr, "%s: unsupported quantization type %d\n", __func__, type);
                            return false;
                        }
                }
                fout.write(reinterpret_cast<char *>(work.data()), cur_size);
                total_size_new += cur_size;
                printf("size = %8.2f MB -> %8.2f MB | hist: ", nelements * sizeof(float)/1024.0/1024.0, cur_size/1024.0/1024.0);
                for (int i = 0; i < hist_cur.size(); ++i) {
                    hist_all[i] += hist_cur[i];
                }
                for (int i = 0; i < hist_cur.size(); ++i) {
                    printf("%5.3f ", hist_cur[i] / (float)nelements);
                }
                printf("\n");
            } else {
                printf("size = %8.3f MB\n", data_u8.size()/1024.0/1024.0);
                fout.write(reinterpret_cast<char *>(data_u8.data()), data_u8.size());
                total_size_new += data_u8.size();
            }
            total_size_org += nelements * sizeof(float);
        }
        printf("%s: model size  = %8.2f MB\n", __func__, total_size_org/1024.0/1024.0);
        printf("%s: quant size  = %8.2f MB\n", __func__, total_size_new/1024.0/1024.0);
        {
            int64_t sum_all = 0;
            for (int i = 0; i < hist_all.size(); ++i) {
                sum_all += hist_all[i];
            }
            printf("%s: hist: ", __func__);
            for (int i = 0; i < hist_all.size(); ++i) {
                printf("%5.3f ", hist_all[i] / (float)sum_all);
            }
            printf("\n");
        }
    }
    finp.close();
    fout.close();
    return true;
 }
 // usage:
 //  ./llama-quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
 //
 int main(int argc, char ** argv) {
    ggml_time_init();
    if (argc != 4) {
        fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]);
        fprintf(stderr, "  type = 2 - q4_0\n");
@ -339,7 +39,7 @@ int main(int argc, char ** argv) {
    {
        const int64_t t_start_us = ggml_time_us();
-        if (!llama_model_quantize(fname_inp, fname_out, itype)) {
+        if (llama_model_quantize(fname_inp.c_str(), fname_out.c_str(), itype, QK)) {
            fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
            return 1;
        }
--- a/quantize.exe
+++ b/quantize.exe
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@ -0,0 +1,4 @@
 set(TEST_TARGET test-tokenizer-0)
 add_executable(${TEST_TARGET} ${TEST_TARGET}.cpp)
 target_link_libraries(${TEST_TARGET} PRIVATE llama ggml utils)
 add_test(NAME ${TEST_TARGET} COMMAND $<TARGET_FILE:${TEST_TARGET}> ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab.bin)
--- a/tests/test-tokenizer-0.cpp
+++ b/tests/test-tokenizer-0.cpp
@ -0,0 +1,79 @@
 #include "utils.h"
 #include "llama.h"
 #include <cstdio>
 #include <string>
 #include <map>
 static const std::map<std::string, std::vector<llama_token>> k_tests = {
    { "Hello World",        { 1,  10994,   2787, }, },
    { " Hello World",       { 1,  15043,   2787, }, },
    { " Hello World!",      { 1,  15043,   2787,  29991, }, },
    { " this is 🦙.cpp",    { 1,    445,    338,  29871,    243,    162,    169,    156,  29889,   8223, }, },
    { "w048 7tuijk dsdfhu", { 1,  29893,  29900,  29946,  29947,  29871,  29955,   9161,  13535,  18031,   2176,   6905, }, },
    { "нещо на Български",  { 1,    821,   4851,    665,   1386,  29713,   1305, }, },
 };
 int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <vocab-file>\n", argv[0]);
        return 1;
    }
    const std::string fname = argv[1];
    fprintf(stderr, "%s : reading vocab from: '%s'\n", __func__, fname.c_str());
    llama_context * ctx;
    // load the vocab
    {
        auto lparams = llama_context_default_params();
        lparams.vocab_only = true;
        ctx = llama_init_from_file(fname.c_str(), lparams);
        if (ctx == NULL) {
            fprintf(stderr, "%s: error: failed to load vocab '%s'\n", __func__, fname.c_str());
            return 1;
        }
    }
    const int n_vocab = llama_n_vocab(ctx);
    if (n_vocab != 32000) {
        fprintf(stderr, "%s : expected 32000 tokens, got %d\n", __func__, n_vocab);
        return 2;
    }
    for (const auto & test_kv : k_tests) {
        const auto res = ::llama_tokenize(ctx, test_kv.first, true);
        bool correct = res.size() == test_kv.second.size();
        for (int i = 0; i < (int) res.size() && correct; ++i) {
            if (res[i] != test_kv.second[i]) {
                correct = false;
            }
        }
        if (!correct) {
            fprintf(stderr, "%s : failed test: '%s'\n", __func__, test_kv.first.c_str());
            fprintf(stderr, "%s : expected tokens: ", __func__);
            for (const auto & t : test_kv.second) {
                fprintf(stderr, "%6d, ", t);
            }
            fprintf(stderr, "\n");
            fprintf(stderr, "%s : got tokens:      ", __func__);
            for (const auto & t : res) {
                fprintf(stderr, "%6d, ", t);
            }
            fprintf(stderr, "\n");
            return 3;
        }
    }
    return 0;
 }
--- a/utils.cpp
+++ b/utils.cpp
@ -3,16 +3,13 @@
 #include <cassert>
 #include <cstring>
 #include <fstream>
 #include <regex>
 #include <iostream>
 #include <iterator>
 #include <queue>
 #include <string>
-#include <math.h>
+#include <iterator>
 #include <algorithm>
 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
- #elif !defined(__FreeBSD__) && !defined(__NetBSD__)
+ #elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
 #include <alloca.h>
 #endif
@ -72,8 +69,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
            params.use_color = true;
        } else if (arg == "-r" || arg == "--reverse-prompt") {
            params.antiprompt.push_back(argv[++i]);
        } else if (arg == "--perplexity") {
            params.perplexity = true;
        } else if (arg == "--ignore-eos") {
            params.ignore_eos = true;
        } else if (arg == "--n_parts") {
            params.n_parts = std::stoi(argv[++i]);
        } else if (arg == "-h" || arg == "--help") {
            gpt_print_usage(argc, argv, params);
            exit(0);
@ -100,7 +101,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
    fprintf(stderr, "                        in interactive mode, poll user input upon seeing PROMPT (can be\n");
    fprintf(stderr, "                        specified more than once for multiple prompts).\n");
    fprintf(stderr, "  --color               colorise output to distinguish prompt and user input from generations\n");
-    fprintf(stderr, "  -s SEED, --seed SEED  RNG seed (default: -1)\n");
+    fprintf(stderr, "  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for <= 0)\n");
    fprintf(stderr, "  -t N, --threads N     number of threads to use during computation (default: %d)\n", params.n_threads);
    fprintf(stderr, "  -p PROMPT, --prompt PROMPT\n");
    fprintf(stderr, "                        prompt to start generation with (default: empty)\n");
@ -116,7 +117,9 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
    fprintf(stderr, "  --ignore-eos          ignore end of stream token and continue generating\n");
    fprintf(stderr, "  --memory_f16          use f16 instead of f32 for memory key+value\n");
    fprintf(stderr, "  --temp N              temperature (default: %.1f)\n", params.temp);
    fprintf(stderr, "  --n_parts N           number of model parts (default: -1 = determine from dimensions)\n");
    fprintf(stderr, "  -b N, --batch_size N  batch size for prompt processing (default: %d)\n", params.n_batch);
    fprintf(stderr, "  --perplexity          compute perplexity over the prompt\n");
    fprintf(stderr, "  -m FNAME, --model FNAME\n");
    fprintf(stderr, "                        model path (default: %s)\n", params.model.c_str());
    fprintf(stderr, "\n");
@ -141,535 +144,11 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
    return "The";
 }
-void replace(std::string & str, const std::string & needle, const std::string & replacement) {
+// TODO: not great allocating this every time
-    size_t pos = 0;
+std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
-    while ((pos = str.find(needle, pos)) != std::string::npos) {
+    std::vector<llama_token> res(8096);
-        str.replace(pos, needle.length(), replacement);
+    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
-        pos += replacement.length();
+    res.resize(n);
-    }
+
-}
+    return res;
 std::map<std::string, int32_t> json_parse(const std::string & fname) {
    std::map<std::string, int32_t> result;
    // read file into string
    std::string json;
    {
        std::ifstream ifs(fname);
        if (!ifs) {
            fprintf(stderr, "Failed to open %s\n", fname.c_str());
            exit(1);
        }
        json = std::string((std::istreambuf_iterator<char>(ifs)),
                (std::istreambuf_iterator<char>()));
    }
    if (json[0] != '{') {
        return result;
    }
    // parse json
    {
        bool has_key  = false;
        bool in_token = false;
        std::string str_key = "";
        std::string str_val = "";
        int n = json.size();
        for (int i = 1; i < n; ++i) {
            if (!in_token) {
                if (json[i] == ' ') continue;
                if (json[i] == '"') {
                    in_token = true;
                    continue;
                }
            } else {
                if (json[i] == '\\' && i+1 < n) {
                    if (has_key == false) {
                        str_key += json[i];
                    } else {
                        str_val += json[i];
                    }
                    ++i;
                } else if (json[i] == '"') {
                    if (has_key == false) {
                        has_key = true;
                        ++i;
                        while (json[i] == ' ') ++i;
                        ++i; // :
                        while (json[i] == ' ') ++i;
                        if (json[i] != '\"') {
                            while (json[i] != ',' && json[i] != '}') {
                                str_val += json[i++];
                            }
                            has_key = false;
                        } else {
                            in_token = true;
                            continue;
                        }
                    } else {
                        has_key = false;
                    }
                    ::replace(str_key, "\\u0120", " " ); // \u0120 -> space
                    ::replace(str_key, "\\u010a", "\n"); // \u010a -> new line
                    ::replace(str_key, "\\\"",    "\""); // \\\"   -> "
                    try {
                        result[str_key] = std::stoi(str_val);
                    } catch (...) {
                        //fprintf(stderr, "%s: ignoring key '%s' with value '%s'\n", fname.c_str(), str_key.c_str(), str_val.c_str());
                    }
                    str_key = "";
                    str_val = "";
                    in_token = false;
                    continue;
                }
                if (has_key == false) {
                    str_key += json[i];
                } else {
                    str_val += json[i];
                }
            }
        }
    }
    return result;
 }
 std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text) {
    std::vector<std::string> words;
    // first split the text into words
    {
        std::string str = text;
        std::string pat = R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)";
        std::regex re(pat);
        std::smatch m;
        while (std::regex_search(str, m, re)) {
            for (auto x : m) {
                words.push_back(x);
            }
            str = m.suffix();
        }
    }
    // find the longest tokens that form the words:
    std::vector<gpt_vocab::id> tokens;
    for (const auto & word : words) {
        if (word.size() == 0) continue;
        int i = 0;
        int n = word.size();
        while (i < n) {
            int j = n;
            while (j > i) {
                auto it = vocab.token_to_id.find(word.substr(i, j-i));
                if (it != vocab.token_to_id.end()) {
                    tokens.push_back(it->second);
                    i = j;
                    break;
                }
                --j;
            }
            if (i == n) {
                break;
            }
            if (j == i) {
                auto sub = word.substr(i, 1);
                if (vocab.token_to_id.find(sub) != vocab.token_to_id.end()) {
                    tokens.push_back(vocab.token_to_id.at(sub));
                } else {
                    fprintf(stderr, "%s: unknown token '%s'\n", __func__, sub.data());
                }
                ++i;
            }
        }
    }
    return tokens;
 }
 static size_t utf8_len(char src) {
    const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
    uint8_t highbits = static_cast<uint8_t>(src) >> 4;
    return lookup[highbits];
 }
 struct llama_sp_symbol {
    using index = int;
    index prev;
    index next;
    std::string_view text;
 };
 struct llama_sp_bigram {
    struct comparator {
        bool operator()(llama_sp_bigram & l, llama_sp_bigram & r) {
            return (l.score < r.score) || (l.score == r.score && l.left > r.left);
        }
    };
    using queue_storage = std::vector<llama_sp_bigram>;
    using queue = std::priority_queue<llama_sp_bigram, queue_storage, comparator>;
    llama_sp_symbol::index left;
    llama_sp_symbol::index right;
    float score;
    size_t size;
 };
 struct llama_tokenizer {
    llama_tokenizer(const gpt_vocab & vocab): vocab_(vocab) {}
    void tokenize(std::string_view text, std::vector<gpt_vocab::id> & output) {
        // split string into utf8 chars
        int index = 0;
        while (!text.empty()) {
            llama_sp_symbol sym;
            size_t char_len = std::min(text.size(), utf8_len(text.data()[0]));
            sym.text = std::string_view(text.data(), char_len);
            sym.prev = index - 1;
            text.remove_prefix(char_len);
            sym.next = text.empty() ? -1 : index + 1;
            index++;
            symbols_.emplace_back(std::move(sym));
        }
        // seed the work queue with all possible 2-character tokens.
        for (size_t i = 1; i < symbols_.size(); ++i) {
            try_add_bigram(i - 1, i);
        }
        // keep substituting the highest frequency pairs for as long as we can.
        while (!work_queue_.empty()) {
            auto bigram = work_queue_.top();
            work_queue_.pop();
            auto & left_sym = symbols_[bigram.left];
            auto & right_sym = symbols_[bigram.right];
            // if one of the symbols already got merged, skip it.
            if (left_sym.text.empty() || right_sym.text.empty() ||
                left_sym.text.size() + right_sym.text.size() != bigram.size) {
                continue;
            }
            // merge the right sym into the left one
            left_sym.text = std::string_view(left_sym.text.data(), left_sym.text.size() + right_sym.text.size());
            right_sym.text = std::string_view("");
            // remove the right sym from the chain
            left_sym.next = right_sym.next;
            if (right_sym.next >= 0) {
                symbols_[right_sym.next].prev = bigram.left;
            }
            // find more substitutions
            try_add_bigram(left_sym.prev, bigram.left);
            try_add_bigram(bigram.left, left_sym.next);
        }
        for (int i = 0; i != -1; i = symbols_[i].next) {
            auto& symbol = symbols_[i];
            auto token = vocab_.token_to_id.find(std::string(symbol.text));
            if (token == vocab_.token_to_id.end()) {
                // output any symbols that did not form tokens as bytes.
                for (int j = 0; j < symbol.text.size(); ++j) {
                    gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
                    output.push_back(token_id);
                }
            } else {
                output.push_back((*token).second);
            }
        }
    }
 private:
    void try_add_bigram(int left, int right) {
        if (left == -1 || right == -1) {
            return;
        }
        std::string_view text(symbols_[left].text.data(), symbols_[left].text.size() + symbols_[right].text.size());
        auto token = vocab_.token_to_id.find(std::string(text));
        if (token == vocab_.token_to_id.end()) {
            return;
        }
        auto score = vocab_.score.find((*token).second);
        if (score == vocab_.score.end()) {
            return;
        }
        llama_sp_bigram bigram;
        bigram.left = left;
        bigram.right = right;
        bigram.score = (*score).second;
        bigram.size = text.size();
        work_queue_.push(bigram);
    }
    const gpt_vocab & vocab_;
    std::vector<llama_sp_symbol> symbols_;
    llama_sp_bigram::queue work_queue_;
 };
 std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_view text, bool bos) {
    llama_tokenizer tokenizer(vocab);
    std::vector<gpt_vocab::id> output;
    if (text.size() == 0) {
        return output;
    }
    if (bos) {
        output.push_back(1);
    }
    tokenizer.tokenize(text, output);
    return output;
 }
 bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab) {
    printf("%s: loading vocab from '%s'\n", __func__, fname.c_str());
    vocab.token_to_id = ::json_parse(fname);
    for (const auto & kv : vocab.token_to_id) {
        vocab.id_to_token[kv.second] = kv.first;
    }
    printf("%s: vocab size = %d\n", __func__, (int) vocab.token_to_id.size());
    // print the vocabulary
    //for (auto kv : vocab.token_to_id) {
    //    printf("'%s' -> %d\n", kv.first.data(), kv.second);
    //}
    return true;
 }
 void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k) {
    // find the top K tokens
    std::partial_sort(
            logits_id.begin(),
            logits_id.begin() + top_k, logits_id.end(),
            [](const std::pair<double, gpt_vocab::id> & a, const std::pair<double, gpt_vocab::id> & b) {
        return a.first > b.first;
    });
    logits_id.resize(top_k);
 }
 gpt_vocab::id llama_sample_top_p_top_k(
        const gpt_vocab & vocab,
        const float * logits,
        std::vector<gpt_vocab::id> & last_n_tokens,
        double repeat_penalty,
        int top_k,
        double top_p,
        double temp,
        std::mt19937 & rng) {
    int n_logits = vocab.id_to_token.size();
    std::vector<std::pair<double, gpt_vocab::id>> logits_id;
    logits_id.reserve(n_logits);
    {
        const double scale = 1.0/temp;
        for (int i = 0; i < n_logits; ++i) {
            // repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)
            // credit https://github.com/facebookresearch/llama/compare/main...shawwn:llama:main
            if (std::find(last_n_tokens.begin(), last_n_tokens.end(), i) != last_n_tokens.end()) {
                // if score < 0 then the repetition penalty has to be multiplied to reduce the previous token's probability
                if (logits[i] < 0.0) {
                    logits_id.push_back(std::make_pair(logits[i]*scale*repeat_penalty, i));
                } else {
                    logits_id.push_back(std::make_pair(logits[i]*scale/repeat_penalty, i));
                }
            } else {
                logits_id.push_back(std::make_pair(logits[i]*scale, i));
            }
        }
    }
    sample_top_k(logits_id, top_k);
    double maxl = -INFINITY;
    for (const auto & kv : logits_id) {
        maxl = std::max(maxl, kv.first);
    }
    // compute probs for the top K tokens
    std::vector<double> probs;
    probs.reserve(logits_id.size());
    double sum = 0.0;
    for (const auto & kv : logits_id) {
        double p = exp(kv.first - maxl);
        probs.push_back(p);
        sum += p;
    }
    // normalize the probs
    for (auto & p : probs) {
        p /= sum;
    }
    if (top_p < 1.0f) {
        double cumsum = 0.0f;
        for (int i = 0; i < (int) probs.size(); i++) {
            cumsum += probs[i];
            if (cumsum >= top_p) {
                probs.resize(i + 1);
                logits_id.resize(i + 1);
                break;
            }
        }
        cumsum = 1.0/cumsum;
        for (int i = 0; i < (int) probs.size(); i++) {
            probs[i] *= cumsum;
        }
    }
    //printf("\n");
    //for (int i = 0; i < (int) 10; i++) {
    //    printf("%d: '%s' %f\n", i, vocab.id_to_token.at(logits_id[i].second).c_str(), probs[i]);
    //}
    //printf("\n\n");
    //exit(0);
    std::discrete_distribution<> dist(probs.begin(), probs.end());
    int idx = dist(rng);
    return logits_id[idx].second;
 }
 size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
    const int nb = k / qk;
    const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
    const size_t row_size = nb*bs;
    assert(k % qk == 0);
    const size_t pp_size = qk / 2;
    uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
    char * pdst = (char *) dst;
    for (int j = 0; j < n; j += k) {
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
        for (int i = 0; i < nb; i++) {
            float amax = 0.0f; // absolute max
            {
                for (int l = 0; l < qk; l++) {
                    const float v = src[j + i*qk + l];
                    amax = std::max(amax, fabsf(v));
                }
                const float d = amax / ((1 << 3) - 1);
                const float id = d ? 1.0f/d : 0.0f;
                *(float *) pd = d;
                pd += bs;
                for (int l = 0; l < qk; l += 2) {
                    const float v0 = (src[j + i*qk + l + 0])*id;
                    const float v1 = (src[j + i*qk + l + 1])*id;
                    const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
                    const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
                    assert(vi0 >= 0 && vi0 < 16);
                    assert(vi1 >= 0 && vi1 < 16);
                    hist[vi0]++;
                    hist[vi1]++;
                    pp[l/2] = vi0 | (vi1 << 4);
                }
                memcpy(pb, pp, pp_size);
                pb += bs;
            }
        }
    }
    return (n/k)*row_size;
 }
 size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
    const int nb = k / qk;
    const size_t bs = (2*sizeof(float) + sizeof(uint8_t)*qk/2);
    const size_t row_size = nb*bs;
    assert(k % qk == 0);
    const size_t pp_size = qk / 2;
    uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
    char * pdst = (char *) dst;
    for (int j = 0; j < n; j += k) { 
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pm = (uint8_t *) (pdst + (j/k)*row_size + 0*bs +   sizeof(float));
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
        //printf("n = %d, k = %d, nb = %d, row_size = %d, j = %d, pm = %p, pd = %p, pb = %p\n", n, k, nb, row_size, j, pm, pd, pb);
        for (int i = 0; i < nb; i++) {
            float min = std::numeric_limits<float>::max();
            float max = std::numeric_limits<float>::lowest(); // min() is the smallest positive float, not the most negative
            {
                for (int l = 0; l < qk; l++) {
                    const float v = src[j + i*qk + l];
                    if (v < min) min = v;
                    if (v > max) max = v;
                }
                const float d = (max - min) / ((1 << 4) - 1);
                const float id = d ? 1.0f/d : 0.0f;
                *(float *) pd = d;
                *(float *) pm = min;
                pd += bs; 
                pm += bs;
                for (int l = 0; l < qk; l += 2) {
                    const float v0 = (src[j + i*qk + l + 0] - min)*id;
                    const float v1 = (src[j + i*qk + l + 1] - min)*id;
                    const uint8_t vi0 = round(v0);
                    const uint8_t vi1 = round(v1);
                    assert(vi0 >= 0 && vi0 < 16);
                    assert(vi1 >= 0 && vi1 < 16);
                    hist[vi0]++;
                    hist[vi1]++;
                    pp[l/2] = vi0 | (vi1 << 4);
                }
                memcpy(pb, pp, pp_size);
                pb += bs;
            }
        }
    }
    return (n/k)*row_size;
 }
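
The Q4_0 routine above stores, for each block of `qk` values, one `float` scale followed by `qk/2` bytes that each pack two 4-bit values. A minimal round-trip sketch of that layout follows. The helper names `q4_0_block_size`, `q4_0_quantize_block` and `q4_0_dequantize_block` are hypothetical and not part of this commit; the dequantization step is inferred from the packing shown above, not copied from the repository.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: bytes used by one block (scale + packed nibbles).
static size_t q4_0_block_size(int qk) {
    return sizeof(float) + (size_t) qk / 2;
}

// Hypothetical helper: quantize one block of qk floats (qk must be even).
// Mirrors the inner loop above: scale by amax/7, round, shift by +8, pack.
static void q4_0_quantize_block(const float * src, uint8_t * dst, int qk) {
    float amax = 0.0f; // absolute max of the block
    for (int l = 0; l < qk; l++) {
        amax = std::fmax(amax, std::fabs(src[l]));
    }
    const float d  = amax / ((1 << 3) - 1);
    const float id = d ? 1.0f / d : 0.0f;

    std::memcpy(dst, &d, sizeof(float));
    uint8_t * q = dst + sizeof(float);

    for (int l = 0; l < qk; l += 2) {
        // rounded values lie in [-7, 7]; adding 8 maps them to [1, 15]
        const uint8_t v0 = (uint8_t) ((int) std::round(src[l + 0] * id) + 8);
        const uint8_t v1 = (uint8_t) ((int) std::round(src[l + 1] * id) + 8);
        q[l / 2] = (uint8_t) (v0 | (v1 << 4)); // low nibble first
    }
}

// Hypothetical helper: inverse of the packing above (assumed, not from the commit).
static void q4_0_dequantize_block(const uint8_t * src, float * dst, int qk) {
    float d;
    std::memcpy(&d, src, sizeof(float));
    const uint8_t * q = src + sizeof(float);

    for (int l = 0; l < qk; l += 2) {
        dst[l + 0] = (float) ((int) (q[l / 2] & 0x0F) - 8) * d;
        dst[l + 1] = (float) ((int) (q[l / 2] >> 4)   - 8) * d;
    }
}
```

With `qk = 32`, each block occupies `sizeof(float) + 16` bytes, and each reconstructed value should differ from its input by roughly half the scale `d` at most, since every value is rounded to the nearest of 15 levels.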
--- a/utils.h
+++ b/utils.h
@ -2,8 +2,9 @@
 #pragma once
 #include "llama.h"
 #include <string>
 #include <map>
 #include <vector>
 #include <random>
 #include <thread>
@ -13,33 +14,34 @@
 //
 struct gpt_params {
-    int32_t seed      = -1; // RNG seed
+    int32_t seed          = -1;  // RNG seed
-    int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
+    int32_t n_threads     = std::min(4, (int32_t) std::thread::hardware_concurrency());
-    int32_t n_predict = 128; // new tokens to predict
+    int32_t n_predict     = 128; // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
-    int32_t n_ctx = 512; //context size
+    int32_t n_parts       = -1;  // amount of model parts (-1 = determine from model dimensions)
-    bool memory_f16 = false; // use f16 instead of f32 for memory kv
+    int32_t n_ctx         = 512; //context size
    // sampling parameters
    int32_t top_k = 40;
    float   top_p = 0.95f;
    float   temp  = 0.80f;
-    float   repeat_penalty  = 1.30f;
+    float   repeat_penalty  = 1.10f;
    int32_t n_batch = 8; // batch size for prompt processing
-    std::string model      = "models/lamma-7B/ggml-model.bin"; // model path
+    std::string model  = "models/lamma-7B/ggml-model.bin"; // model path
-    std::string prompt     = "";
+    std::string prompt = "";
    bool random_prompt = false;
    bool use_color = false; // use color to distinguish generations and inputs
    bool interactive = false; // interactive mode
    bool interactive_start = false; // reverse prompt immediately
    std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
-    bool instruct    = false; // instruction mode (used for Alpaca models)
+
-    bool ignore_eos = false; // do not stop generating after eos
+    bool memory_f16        = false; // use f16 instead of f32 for memory kv
    bool random_prompt     = false; // randomize the prompt if none is provided
    bool use_color         = false; // use color to distinguish generations and inputs
    bool interactive       = false; // interactive mode
    bool interactive_start = false; // reverse prompt immediately
    bool instruct          = false; // instruction mode (used for Alpaca models)
    bool ignore_eos        = false; // do not stop generating after eos
    bool perplexity        = false; // compute perplexity over the prompt
 };
 bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
@ -48,72 +50,8 @@ void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
 std::string gpt_random_prompt(std::mt19937 & rng);
 //
 // Model file parsing
 //
 #define FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
 #define FILE_MAGIC 0x67676d66 // 'ggmf' in hex
 #define FILE_VERSION 1
 //
 // Vocab utils
 //
-struct gpt_vocab {
+std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
    using id    = int32_t;
    using token = std::string;
    std::map<token, id> token_to_id;
    std::map<id, token> id_to_token;
    std::map<id, float> score;
 };
 void replace(std::string & str, const std::string & needle, const std::string & replacement);
 // poor-man's JSON parsing
 std::map<std::string, int32_t> json_parse(const std::string & fname);
 // split text into tokens
 //
 // ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
 //
 // Regex (Python):
 // r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
 //
 // Regex (C++):
 // R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)"
 //
 std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text);
 // TODO: this is probably wrong, but I cannot figure out how this tokenizer works ..
 // ref: https://github.com/google/sentencepiece
 std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_view text, bool bos);
 // load the tokens from encoder.json
 bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab);
 // sample next token given probabilities for each embedding
 //
 //   - consider only the top K tokens
 //   - from them, consider only the top tokens with cumulative probability > P
 //
 gpt_vocab::id llama_sample_top_p_top_k(
        const gpt_vocab & vocab,
        const float * logits,
        std::vector<gpt_vocab::id> & last_n_tokens,
        double repeat_penalty,
        int top_k,
        double top_p,
        double temp,
        std::mt19937 & rng);
 // filter to top K tokens from list of logits
 void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k);
 //
 // Quantization
 //
 size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist);
 size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist);
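
The sampling comment in the header above (keep the top K tokens, then keep the most likely of those until their cumulative probability reaches P) can be sketched as a standalone function. This is an illustration under assumptions, not the repository's `llama_sample_top_p_top_k`: the name `sample_top_k_top_p` is hypothetical, it takes a plain vector of logits, it omits the repeat penalty, and it expects a non-empty input and `temp > 0`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical sketch: returns the index of the sampled logit.
static int sample_top_k_top_p(const std::vector<float> & logits,
                              int top_k, double top_p, double temp,
                              std::mt19937 & rng) {
    // pair each temperature-scaled logit with its token index
    std::vector<std::pair<double, int>> logits_id;
    logits_id.reserve(logits.size());
    for (int i = 0; i < (int) logits.size(); i++) {
        logits_id.emplace_back(logits[i] / temp, i);
    }

    // keep only the top K entries, sorted by descending logit
    top_k = std::min<int>(std::max(top_k, 1), (int) logits_id.size());
    std::partial_sort(
        logits_id.begin(), logits_id.begin() + top_k, logits_id.end(),
        [](const std::pair<double, int> & a, const std::pair<double, int> & b) {
            return a.first > b.first;
        });
    logits_id.resize(top_k);

    // softmax over the survivors; subtracting the max keeps exp() stable
    const double maxl = logits_id[0].first;
    std::vector<double> probs;
    probs.reserve(logits_id.size());
    double sum = 0.0;
    for (const auto & kv : logits_id) {
        const double p = std::exp(kv.first - maxl);
        probs.push_back(p);
        sum += p;
    }
    for (auto & p : probs) {
        p /= sum;
    }

    // nucleus (top-p) cut: stop once the cumulative probability reaches top_p
    if (top_p < 1.0) {
        double cumsum = 0.0;
        for (size_t i = 0; i < probs.size(); i++) {
            cumsum += probs[i];
            if (cumsum >= top_p) {
                probs.resize(i + 1);
                logits_id.resize(i + 1);
                break;
            }
        }
    }

    // std::discrete_distribution normalizes its weights, so no explicit rescale is needed here
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return logits_id[dist(rng)].second;
}
```

Because `std::discrete_distribution` normalizes the weights it is given, the sketch skips the explicit rescaling of the truncated probabilities; the sampled distribution should be the same either way.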