Merge branch 'master' into interactive-eos-fix

2023-03-21 14:23:16 -05:00
commit 6bcbe50792
parent 52f46ef78a 0f61352708
20 changed files with 1262 additions and 365 deletions
--- a/.github/ISSUE_TEMPLATE/custom.md
+++ b/.github/ISSUE_TEMPLATE/custom.md
@ -0,0 +1,198 @@
 ---
 name: Custom issue template
 about: Used to report user-related issues with the software
 title: "[User] I encountered a problem .."
 labels: ''
 assignees: ''
 ---
 # Prerequisites
 Please answer the following questions for yourself before submitting an issue.
 - [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
 - [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
 - [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
 - [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
 # Expected Behavior
 Please provide a detailed written description of what you were trying to do and what you expected `llama.cpp` to do.
 # Current Behavior
 Please provide a detailed written description of what `llama.cpp` did instead.
 # Environment and Context 
 Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
 * Physical (or virtual) hardware you are using, e.g. for Linux:
 `$ lscpu`
 * Operating System, e.g. for Linux:
 `$ uname -a`
 * SDK version, e.g. for Linux:
 ```
 $ python3 --version
 $ make --version
 $ g++ --version
 ```
 # Models
 * The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [pull request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
 * If your issue is with model conversion please verify the `sha256sum` of each of your `consolidated*.pth` and `ggml-model-XXX.bin` files to confirm that you have the correct model data files before logging an issue. [Latest sha256 sums for your reference](https://github.com/ggerganov/llama.cpp/issues/238).
 * If your issue is with model generation quality then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
  * LLaMA:
    * [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    * [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
  * GPT-3
    * [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
  * GPT-3.5 / InstructGPT / ChatGPT:
    * [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    * [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
 # Failure Information (for bugs)
 Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
 # Steps to Reproduce
 Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
 1. step 1
 2. step 2
 3. step 3
 4. etc.
 # Failure Logs
 Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
 Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [GitHub's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability. For example:
 ```
 llama.cpp$ git log | head -1
 commit 2af23d30434a677c6416812eea52ccc0af65119c
 llama.cpp$ lscpu | egrep "AMD|Flags"
 Vendor ID:                       AuthenticAMD
 Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
 Virtualization:                  AMD-V
 llama.cpp$ python3 --version
 Python 3.10.9
 llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
 numpy                         1.24.2
 numpydoc                      1.5.0
 sentencepiece                 0.1.97
 torch                         1.13.1
 torchvision                   0.14.1
 llama.cpp$ make --version | head -1
 GNU Make 4.3
 $ md5sum ./models/65B/ggml-model-q4_0.bin
 dbdd682cce80e2d6e93cefc7449df487  ./models/65B/ggml-model-q4_0.bin
 ```
 Here's a run with the Linux command [perf](https://www.brendangregg.com/perf.html)
 ```
 llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
 main: seed = 1679149377
 llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
 llama_model_load: n_vocab = 32000
 llama_model_load: n_ctx   = 512
 llama_model_load: n_embd  = 8192
 llama_model_load: n_mult  = 256
 llama_model_load: n_head  = 64
 llama_model_load: n_layer = 80
 llama_model_load: n_rot   = 128
 llama_model_load: f16     = 2
 llama_model_load: n_ff    = 22016
 llama_model_load: n_parts = 8
 llama_model_load: ggml ctx size = 41477.73 MB
 llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
 llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
 llama_model_load: .......................................................................................... done
 llama_model_load: model size =  4869.09 MB / num tensors = 723
 system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
 main: prompt: 'Please close your issue when it has been answered.'
 main: number of tokens in prompt = 11
     1 -> ''
 12148 -> 'Please'
  3802 -> ' close'
   596 -> ' your'
  2228 -> ' issue'
   746 -> ' when'
   372 -> ' it'
   756 -> ' has'
  1063 -> ' been'
  7699 -> ' answered'
 29889 -> '.'
 sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
 Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
 I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
 main: mem per token = 71159620 bytes
 main:     load time = 19309.95 ms
 main:   sample time =   168.62 ms
 main:  predict time = 223895.61 ms / 888.47 ms per token
 main:    total time = 246406.42 ms
 Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
        3636882.89 msec task-clock                #   14.677 CPUs utilized          
             13509      context-switches          #    3.714 /sec                   
              2436      cpu-migrations            #    0.670 /sec                   
          10476679      page-faults               #    2.881 K/sec                  
    13133115082869      cycles                    #    3.611 GHz                      (16.77%)
       29314462753      stalled-cycles-frontend   #    0.22% frontend cycles idle     (16.76%)
    10294402631459      stalled-cycles-backend    #   78.39% backend cycles idle      (16.74%)
    23479217109614      instructions              #    1.79  insn per cycle         
                                                  #    0.44  stalled cycles per insn  (16.76%)
     2353072268027      branches                  #  647.002 M/sec                    (16.77%)
        1998682780      branch-misses             #    0.08% of all branches          (16.76%)
     247.802177522 seconds time elapsed
    3618.573072000 seconds user
      18.491698000 seconds sys
 ```
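The Models section of the template above asks reporters to verify the `sha256sum` of each model file before filing an issue. Where the `sha256sum` tool is not available, the same check can be sketched in Python. This is a minimal sketch and not part of the commit: the helper name `sha256_of_file` and the chunk size are assumptions chosen for illustration, and the expected digests still have to come from the issue linked in the template.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for multi-gigabyte model files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage (the path is only an example):
#   print(sha256_of_file("./models/7B/ggml-model-q4_0.bin"))
```

The printed digest can then be compared by eye, or with `==`, against the published reference value.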
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@ -54,6 +54,7 @@ jobs:
          cd build
          cmake ..
          cmake --build . --config Release
          ctest --output-on-failure
  macOS-latest-make:
    runs-on: macos-latest
@ -90,6 +91,7 @@ jobs:
          cd build
          cmake ..
          cmake --build . --config Release
          ctest --output-on-failure
  windows-latest-cmake:
    runs-on: windows-latest
@ -106,6 +108,7 @@ jobs:
          cd build
          cmake ..
          cmake --build . --config Release
          ctest --output-on-failure
      - name: Get commit hash
        id: commit
--- a/.github/workflows/docker.yml
+++ b/.github/workflows/docker.yml
@ -40,7 +40,7 @@ jobs:
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
-          username: ${{ github.actor }}
+          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push Docker image (versioned)
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -1,131 +1,252 @@
-cmake_minimum_required(VERSION 3.8)
+cmake_minimum_required(VERSION 3.12) # Don't bump this version for no reason
-project("llama.cpp")
+project("llama.cpp" C CXX)
-set(CMAKE_CXX_STANDARD 20)
+set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
 set(CMAKE_CXX_STANDARD_REQUIRED true)
 set(CMAKE_C_STANDARD 11)
 set(THREADS_PREFER_PTHREAD_FLAG ON)
 find_package(Threads REQUIRED)
 if (NOT XCODE AND NOT MSVC AND NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
    set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
 endif()
-option(LLAMA_ALL_WARNINGS            "llama: enable all compiler warnings"                   ON)
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
 option(LLAMA_ALL_WARNINGS_3RD_PARTY  "llama: enable all compiler warnings in 3rd party libs" OFF)
-option(LLAMA_SANITIZE_THREAD         "llama: enable thread sanitizer"    OFF)
+if(CMAKE_SOURCE_DIR STREQUAL CMAKE_CURRENT_SOURCE_DIR)
-option(LLAMA_SANITIZE_ADDRESS        "llama: enable address sanitizer"   OFF)
+    set(LLAMA_STANDALONE ON)
 option(LLAMA_SANITIZE_UNDEFINED      "llama: enable undefined sanitizer" OFF)
-if (APPLE)
+    # configure project version
-    option(LLAMA_NO_ACCELERATE       "llama: disable Accelerate framework" OFF)
+    # TODO
-    option(LLAMA_NO_AVX              "llama: disable AVX" OFF)
+else()
-    option(LLAMA_NO_AVX2             "llama: disable AVX2" OFF)
+    set(LLAMA_STANDALONE OFF)
    option(LLAMA_NO_FMA              "llama: disable FMA" OFF)
 endif()
 if (EMSCRIPTEN)
    set(BUILD_SHARED_LIBS_DEFAULT OFF)
    option(LLAMA_WASM_SINGLE_FILE "llama: embed WASM inside the generated llama.js" ON)
 else()
    if (MINGW)
        set(BUILD_SHARED_LIBS_DEFAULT OFF)
    else()
        set(BUILD_SHARED_LIBS_DEFAULT ON)
    endif()
 endif()
 #
 # Option list
 #
 # general
 option(LLAMA_STATIC                 "llama: static link libraries"                          OFF)
 option(LLAMA_NATIVE                 "llama: enable -march=native flag"                      OFF)
 option(LLAMA_LTO                    "llama: enable link time optimization"                  OFF)
 # debug
 option(LLAMA_ALL_WARNINGS           "llama: enable all compiler warnings"                   ON)
 option(LLAMA_ALL_WARNINGS_3RD_PARTY "llama: enable all compiler warnings in 3rd party libs" OFF)
 option(LLAMA_GPROF                  "llama: enable gprof"                                   OFF)
 # sanitizers
 option(LLAMA_SANITIZE_THREAD        "llama: enable thread sanitizer"                        OFF)
 option(LLAMA_SANITIZE_ADDRESS       "llama: enable address sanitizer"                       OFF)
 option(LLAMA_SANITIZE_UNDEFINED     "llama: enable undefined sanitizer"                     OFF)
 # instruction set specific
 option(LLAMA_AVX                    "llama: enable AVX"                                     ON)
 option(LLAMA_AVX2                   "llama: enable AVX2"                                    ON)
 option(LLAMA_FMA                    "llama: enable FMA"                                     ON)
 # 3rd party libs
 option(LLAMA_ACCELERATE             "llama: enable Accelerate framework"                    ON)
 option(LLAMA_OPENBLAS               "llama: use OpenBLAS"                                   OFF)
 option(LLAMA_BUILD_TESTS            "llama: build tests"    ${LLAMA_STANDALONE})
 option(LLAMA_BUILD_EXAMPLES         "llama: build examples" ${LLAMA_STANDALONE})
 #
 # Compile flags
 #
 set(CMAKE_CXX_STANDARD_REQUIRED true)
 set(CMAKE_C_STANDARD_REQUIRED true)
 set(THREADS_PREFER_PTHREAD_FLAG ON)
 find_package(Threads REQUIRED)
 if (NOT MSVC)
    if (LLAMA_SANITIZE_THREAD)
-        set(CMAKE_C_FLAGS   "${CMAKE_C_FLAGS}   -fsanitize=thread")
+        add_compile_options(-fsanitize=thread)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread")
    endif()
    if (LLAMA_SANITIZE_ADDRESS)
-        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}     -fsanitize=address -fno-omit-frame-pointer")
+        add_compile_options(-fsanitize=address -fno-omit-frame-pointer)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fno-omit-frame-pointer")
    endif()
    if (LLAMA_SANITIZE_UNDEFINED)
-        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}     -fsanitize=undefined")
+        add_compile_options(-fsanitize=undefined)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=undefined")
    endif()
 endif()
-if (APPLE AND NOT LLAMA_NO_ACCELERATE)
+if (APPLE AND LLAMA_ACCELERATE)
    find_library(ACCELERATE_FRAMEWORK Accelerate)
    if (ACCELERATE_FRAMEWORK)
        message(STATUS "Accelerate framework found")
-        set(LLAMA_EXTRA_LIBS  ${LLAMA_EXTRA_LIBS}  ${ACCELERATE_FRAMEWORK})
+        add_compile_definitions(GGML_USE_ACCELERATE)
-        set(LLAMA_EXTRA_FLAGS ${LLAMA_EXTRA_FLAGS} -DGGML_USE_ACCELERATE)
+        set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} ${ACCELERATE_FRAMEWORK})
    else()
        message(WARNING "Accelerate framework not found")
    endif()
 endif()
 if (LLAMA_OPENBLAS)
    if (LLAMA_STATIC)
        set(BLA_STATIC ON)
    endif()
    set(BLA_VENDOR OpenBLAS)
    find_package(BLAS)
    if (BLAS_FOUND)
        message(STATUS "OpenBLAS found")
        add_compile_definitions(GGML_USE_OPENBLAS)
        add_link_options(${BLAS_LIBRARIES})
    else()
        message(WARNING "OpenBLAS not found")
    endif()
 endif()
 if (LLAMA_ALL_WARNINGS)
    if (NOT MSVC)
-        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
+        set(c_flags
-            -Wall                           \
+            -Wall
-            -Wextra                         \
+            -Wextra
-            -Wpedantic                      \
+            -Wpedantic
-            -Wshadow                        \
+            -Wshadow
-            -Wcast-qual                     \
+            -Wcast-qual
-            -Wstrict-prototypes             \
+            -Wstrict-prototypes
-            -Wpointer-arith                 \
+            -Wpointer-arith
-            -Wno-unused-function            \
+            -Wno-unused-function
-        ")
+        )
-        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} \
+        set(cxx_flags
-            -Wall                           \
+            -Wall
-            -Wextra                         \
+            -Wextra
-            -Wpedantic                      \
+            -Wpedantic
-            -Wcast-qual                     \
+            -Wcast-qual
-        ")
+        )
    else()
        # todo : msvc
    endif()
    add_compile_options(
            "$<$<COMPILE_LANGUAGE:C>:${c_flags}>"
            "$<$<COMPILE_LANGUAGE:CXX>:${cxx_flags}>"
    )
 endif()
-message(STATUS "CMAKE_SYSTEM_PROCESSOR: ${CMAKE_SYSTEM_PROCESSOR}")
+if (LLAMA_LTO)
-
+    include(CheckIPOSupported)
-if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
+    check_ipo_supported(RESULT result OUTPUT output)
-    message(STATUS "ARM detected")
+    if (result)
-else()
+        set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
    message(STATUS "x86 detected")
    if (MSVC)
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2")
        set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /arch:AVX2")
        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} /arch:AVX2")
    else()
-        if(NOT LLAMA_NO_AVX)
+        message(WARNING "IPO is not supported: ${output}")
            set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mavx")
        endif()
        if(NOT LLAMA_NO_AVX2)
            set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mavx2")
        endif()
        if(NOT LLAMA_NO_FMA)
            set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mfma")
        endif()
        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mf16c")
    endif()
 endif()
-# if (LLAMA_PERF)
+# Architecture specific
-#     set(LLAMA_EXTRA_FLAGS ${LLAMA_EXTRA_FLAGS} -DGGML_PERF)
+# TODO: probably these flags need to be tweaked on some architectures
-# endif()
+#       feel free to update the Makefile for your architecture and send a pull request or issue
 message(STATUS "CMAKE_SYSTEM_PROCESSOR: ${CMAKE_SYSTEM_PROCESSOR}")
 if (NOT MSVC)
    if (LLAMA_STATIC)
        add_link_options(-static)
        if (MINGW)
            add_link_options(-static-libgcc -static-libstdc++)
        endif()
    endif()
    if (LLAMA_GPROF)
        add_compile_options(-pg)
    endif()
    if (LLAMA_NATIVE)
        add_compile_options(-march=native)
    endif()
 endif()
-add_executable(llama
+if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
-    main.cpp
+    message(STATUS "ARM detected")
-    utils.cpp
+    if (MSVC)
-    utils.h)
+        # TODO: arm msvc?
    else()
        if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
            add_compile_options(-mcpu=native)
        endif()
        # TODO: armv6,7,8 version specific flags
    endif()
 elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "^(x86_64|i686|AMD64)$")
    message(STATUS "x86 detected")
    if (MSVC)
        if (LLAMA_AVX2)
            add_compile_options(/arch:AVX2)
        elseif (LLAMA_AVX)
            add_compile_options(/arch:AVX)
        endif()
    else()
        add_compile_options(-mf16c)
        if (LLAMA_FMA)
            add_compile_options(-mfma)
        endif()
        if (LLAMA_AVX)
            add_compile_options(-mavx)
        endif()
        if (LLAMA_AVX2)
            add_compile_options(-mavx2)
        endif()
    endif()
 else()
    # TODO: support PowerPC
    message(STATUS "Unknown architecture")
 endif()
 add_executable(quantize
    quantize.cpp
    utils.cpp
    utils.h)
-add_library(ggml
+#
-    ggml.c
+# Build library
-    ggml.h)
+#
-target_compile_definitions(ggml PUBLIC ${LLAMA_EXTRA_FLAGS})
+add_executable(llama main.cpp)
-target_compile_definitions(llama PUBLIC ${LLAMA_EXTRA_FLAGS})
+
-target_compile_definitions(quantize PUBLIC ${LLAMA_EXTRA_FLAGS})
+add_executable(quantize quantize.cpp)
 add_library(utils OBJECT
            utils.cpp
            utils.h)
 target_include_directories(utils PUBLIC .)
 target_compile_features(utils PUBLIC cxx_std_11) # don't bump
 add_library(ggml OBJECT
            ggml.c
            ggml.h)
 target_link_libraries(ggml PRIVATE ${LLAMA_EXTRA_LIBS})
 target_include_directories(ggml PUBLIC .)
-target_link_libraries(quantize PRIVATE ggml)
+target_compile_features(ggml PUBLIC c_std_11) # don't bump
-target_link_libraries(llama PRIVATE ggml)
+
-target_link_libraries(ggml PRIVATE Threads::Threads)
+#
 # Linking
 #
 target_link_libraries(ggml PRIVATE Threads::Threads ${LLAMA_EXTRA_LIBS})
 target_link_libraries(llama PRIVATE ggml utils)
 target_link_libraries(quantize PRIVATE ggml utils)
 #
 # programs, examples and tests
 #
 if (LLAMA_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
    enable_testing()
    add_subdirectory(tests)
 endif ()
 #if (LLAMA_BUILD_EXAMPLES)
 #    add_subdirectory(examples)
 #endif()
--- a/Makefile
+++ b/Makefile
@ -17,7 +17,7 @@ CXXV := $(shell $(CXX) --version | head -n 1)
 # ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
 ifeq ($(UNAME_S),Darwin)
 	ifneq ($(UNAME_P),arm)
-		SYSCTL_M := $(shell sysctl -n hw.optional.arm64)
+		SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
 		ifeq ($(SYSCTL_M),1)
 			# UNAME_P := arm
 			# UNAME_M := arm64
@ -30,8 +30,9 @@ endif
 # Compile flags
 #
 # keep standard at C11 and C++11
 CFLAGS   = -I.              -O3 -DNDEBUG -std=c11   -fPIC
-CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++17 -fPIC
+CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
 LDFLAGS  =
 # OS specific
@ -52,6 +53,10 @@ ifeq ($(UNAME_S),NetBSD)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
 endif
 ifeq ($(UNAME_S),OpenBSD)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
 endif
 ifeq ($(UNAME_S),Haiku)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
@ -95,30 +100,59 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
 		ifneq (,$(findstring sse3,$(SSE3_M)))
 			CFLAGS += -msse3
 		endif
 		AVX512F_M := $(shell grep "avx512f " /proc/cpuinfo)
 		ifneq (,$(findstring avx512f,$(AVX512F_M)))
 			CFLAGS += -mavx512f
 		endif
 		AVX512BW_M := $(shell grep "avx512bw " /proc/cpuinfo)
 		ifneq (,$(findstring avx512bw,$(AVX512BW_M)))
 			CFLAGS += -mavx512bw
 		endif
 		AVX512DQ_M := $(shell grep "avx512dq " /proc/cpuinfo)
 		ifneq (,$(findstring avx512dq,$(AVX512DQ_M)))
 			CFLAGS += -mavx512dq
 		endif
 		AVX512VL_M := $(shell grep "avx512vl " /proc/cpuinfo)
 		ifneq (,$(findstring avx512vl,$(AVX512VL_M)))
 			CFLAGS += -mavx512vl
 		endif
 		AVX512CD_M := $(shell grep "avx512cd " /proc/cpuinfo)
 		ifneq (,$(findstring avx512cd,$(AVX512CD_M)))
 			CFLAGS += -mavx512cd
 		endif
 		AVX512ER_M := $(shell grep "avx512er " /proc/cpuinfo)
 		ifneq (,$(findstring avx512er,$(AVX512ER_M)))
 			CFLAGS += -mavx512er
 		endif
 		AVX512IFMA_M := $(shell grep "avx512ifma " /proc/cpuinfo)
 		ifneq (,$(findstring avx512ifma,$(AVX512IFMA_M)))
 			CFLAGS += -mavx512ifma
 		endif
 		AVX512PF_M := $(shell grep "avx512pf " /proc/cpuinfo)
 		ifneq (,$(findstring avx512pf,$(AVX512PF_M)))
 			CFLAGS += -mavx512pf
 		endif
 	else ifeq ($(UNAME_S),Haiku)
-		AVX1_M := $(shell sysinfo -cpu | grep "AVX ")
+		AVX1_M := $(shell sysinfo -cpu | grep -w "AVX")
-		ifneq (,$(findstring avx,$(AVX1_M)))
+		ifneq (,$(findstring AVX,$(AVX1_M)))
 			CFLAGS += -mavx
 		endif
-		AVX2_M := $(shell sysinfo -cpu | grep "AVX2 ")
+		AVX2_M := $(shell sysinfo -cpu | grep -w "AVX2")
-		ifneq (,$(findstring avx2,$(AVX2_M)))
+		ifneq (,$(findstring AVX2,$(AVX2_M)))
 			CFLAGS += -mavx2
 		endif
-		FMA_M := $(shell sysinfo -cpu | grep "FMA ")
+		FMA_M := $(shell sysinfo -cpu | grep -w "FMA")
-		ifneq (,$(findstring fma,$(FMA_M)))
+		ifneq (,$(findstring FMA,$(FMA_M)))
 			CFLAGS += -mfma
 		endif
-		F16C_M := $(shell sysinfo -cpu | grep "F16C ")
+		F16C_M := $(shell sysinfo -cpu | grep -w "F16C")
-		ifneq (,$(findstring f16c,$(F16C_M)))
+		ifneq (,$(findstring F16C,$(F16C_M)))
 			CFLAGS += -mf16c
 		endif
 	else
 		CFLAGS += -mfma -mf16c -mavx -mavx2
 	endif
 endif
 ifeq ($(UNAME_M),amd64)
 	CFLAGS += -mavx -mavx2 -mfma -mf16c
 endif
 ifneq ($(filter ppc64%,$(UNAME_M)),)
 	POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
 	ifneq (,$(findstring POWER9,$(POWER9_M)))
@ -130,7 +164,8 @@ ifneq ($(filter ppc64%,$(UNAME_M)),)
 	endif
 endif
 ifndef LLAMA_NO_ACCELERATE
-	# Mac M1 - include Accelerate framework
+	# Mac M1 - include Accelerate framework.
 	# `-framework Accelerate` works on Mac Intel as well, with negligible performance boost (as of the predict time).
 	ifeq ($(UNAME_S),Darwin)
 		CFLAGS  += -DGGML_USE_ACCELERATE
 		LDFLAGS += -framework Accelerate
@ -193,7 +228,7 @@ clean:
 main: main.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
-	./main -h
+	@echo "\x1b[36mrun ./main -h for help\x1b[0m"
 quantize: quantize.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
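The Haiku hunk above replaces `grep "AVX "` plus a lowercase `findstring avx` with `grep -w "AVX"` plus an uppercase `findstring AVX`. The sketch below mirrors that logic in Python to show why both parts matter: the comparison is case-sensitive, and the match has to respect word boundaries so that `AVX2` does not count as `AVX`. The sample string is hypothetical and is not real `sysinfo -cpu` output; `has_cpu_flag` is an illustrative name, not something from the Makefile.

```python
import re


def has_cpu_flag(cpu_info, flag):
    """Case-sensitive whole-word search, similar in spirit to `grep -w`."""
    return re.search(r"\b" + re.escape(flag) + r"\b", cpu_info) is not None


# Hypothetical feature line; real `sysinfo -cpu` output may look different.
sample = "FPU SSE SSE2 AVX2 FMA F16C"

print(has_cpu_flag(sample, "AVX2"))  # True
print(has_cpu_flag(sample, "AVX"))   # False: "AVX2" is a different word
print("avx" in sample)               # False: a lowercase needle never matches
print("AVX" in sample)               # True: a plain substring test also hits "AVX2"
```

The last two lines correspond to the two failure modes: a lowercase needle finds nothing in uppercase text, and a substring test without word boundaries reports AVX on a CPU that only lists AVX2.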
--- a/README.md
+++ b/README.md
@ -178,10 +178,15 @@ If you want a more ChatGPT-like experience, you can run in interactive mode by p
 In this mode, you can always interrupt generation by pressing Ctrl+C and enter one or more lines of text which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt which makes LLaMa emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
 Here is an example few-shot interaction, invoked with the command
 ```
 ./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
 ```bash
 # default arguments using 7B model
 ./chat.sh
 # custom arguments using 13B model
 ./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
 ```
 Note the use of `--color` to distinguish between user input and generated text.
 ![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)
@ -192,11 +197,10 @@ First, download the `ggml` Alpaca model into the `./models` folder:
 ```
 # use one of these
 # NOTE: these are copied from the alpaca.cpp repo - not sure how long these will work
 # TODO: add a script to simplify the download
-curl -o ggml-alpaca-7b-q4.bin -C - https://gateway.estuary.tech/gw/ipfs/QmQ1bf2BTnYxq73MFJWu1B7bQ2UD6qG7D7YDCxhTndVkPC
+curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://gateway.estuary.tech/gw/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
-curl -o ggml-alpaca-7b-q4.bin -C - https://ipfs.io/ipfs/QmQ1bf2BTnYxq73MFJWu1B7bQ2UD6qG7D7YDCxhTndVkPC
+curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://ipfs.io/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
-curl -o ggml-alpaca-7b-q4.bin -C - https://cloudflare-ipfs.com/ipfs/QmQ1bf2BTnYxq73MFJWu1B7bQ2UD6qG7D7YDCxhTndVkPC
+curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://cloudflare-ipfs.com/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
 ```
 Now run the `main` tool like this:
@ -219,7 +223,7 @@ Sample run:
 There 26 letters in the English Alphabet
 > What is the most common way of transportation in Amsterdam?
 The majority (54%) are using public transit. This includes buses, trams and metros with over 100 lines throughout the city which make it very accessible for tourists to navigate around town as well as locals who commute by tram or metro on a daily basis
-> List 5 words that start with "ca".                                                                       
+> List 5 words that start with "ca".
 cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
 > 
 ```
--- a/alpaca.sh
+++ b/alpaca.sh
@ -3,4 +3,4 @@
 # Temporary script - will be removed in the future
 #
-./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.96 --repeat_penalty 1 -t 7
+./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7
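The hunk above lowers `--temp` from 0.96 to 0.2. As a rough illustration of the effect, the sketch below applies temperature scaling to a softmax over three made-up logits. It assumes the usual formulation in which logits are divided by the temperature before the softmax; the logit values are hypothetical and the function is not taken from llama.cpp.

```python
import math


def softmax_with_temperature(logits, temp):
    """Divide logits by temp, then apply a numerically stable softmax."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temp in (0.96, 0.2):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At the lower temperature the top candidate receives nearly all of the probability mass, so sampling becomes much more deterministic.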
--- a/chat.sh
+++ b/chat.sh
@ -0,0 +1,6 @@
 #!/bin/bash
 #
 # Temporary script - will be removed in the future
 #
 ./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
--- a/convert-gptq-to-ggml.py
+++ b/convert-gptq-to-ggml.py
@ -0,0 +1,172 @@
 # Convert a GPTQ quantized LLaMA model to a ggml compatible file
 # Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
 #
 import os
 import re
 import sys
 import json
 import struct
 import numpy as np
 import torch
 from sentencepiece import SentencePieceProcessor
 if len(sys.argv) != 4:
    print("Usage: convert-gptq-to-ggml.py llamaXXb-4bit.pt tokenizer.model out.bin\n")
    sys.exit(1)
 fname_model = sys.argv[1]
 fname_tokenizer = sys.argv[2]
 dir_out = sys.argv[3]
 model = torch.load(fname_model, map_location="cpu")
 n_vocab, n_embd = model['model.embed_tokens.weight'].shape
 n_layer = 1 + max(int(m.group(1)) for name in model
                  if (m := re.match(r'model\.layers\.([0-9]+)', name)))
 # hardcoded:
 n_mult = 256
 n_head = {32: 32, 40: 40, 60: 52, 80: 64}[n_layer]
 tokenizer = SentencePieceProcessor(fname_tokenizer)
 assert tokenizer.vocab_size() == n_vocab
 fname_out = sys.argv[3]
 fout = open(fname_out, "wb")
 fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
 fout.write(struct.pack("i", n_vocab))
 fout.write(struct.pack("i", n_embd))
 fout.write(struct.pack("i", n_mult))
 fout.write(struct.pack("i", n_head))
 fout.write(struct.pack("i", n_layer))
 fout.write(struct.pack("i", n_embd // n_head)) # rot (obsolete)
 fout.write(struct.pack("i", 4))
 # This loop unchanged from convert-pth-to-ggml.py:
 for i in range(tokenizer.vocab_size()):
    if tokenizer.is_unknown(i):
        # "<unk>" token (translated as ??)
        text = " \u2047 ".encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
    elif tokenizer.is_control(i):
        # "<s>"/"</s>" tokens
        fout.write(struct.pack("i", 0))
    elif tokenizer.is_byte(i):
        # "<U+XX>" tokens (which may be invalid UTF-8)
        piece = tokenizer.id_to_piece(i)
        if len(piece) != 6:
            print("Invalid token: " + piece)
            sys.exit(1)
        byte_value = int(piece[3:-1], 16)
        fout.write(struct.pack("i", 1))
        fout.write(struct.pack("B", byte_value))
    else:
        # normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
        text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
 def write_header(shape, dst_name, ftype_cur):
    sname = dst_name.encode('utf-8')
    fout.write(struct.pack("iii", len(shape), len(sname), ftype_cur))
    fout.write(struct.pack("i" * len(shape), *shape[::-1]))
    fout.write(sname)
 def convert_non_q4(src_name, dst_name):
    v = model[src_name]
    shape = v.shape
    print("Processing non-Q4 variable: " + src_name + " with shape: ", shape, " and type: ", v.dtype)
    if len(shape) == 1:
        print("  Converting to float32")
        v = v.to(torch.float32)
    ftype_cur = {torch.float16: 1, torch.float32: 0}[v.dtype]
    # header
    write_header(shape, dst_name, ftype_cur)
    # data
    v.numpy().tofile(fout)
 def convert_q4(src_name, dst_name, permute=False):
    zeros = model[f"{src_name}.zeros"].numpy()
    scales = model[f"{src_name}.scales"].numpy()
    bias = model[f"{src_name}.bias"].numpy()
    qweight = model[f"{src_name}.qweight"].numpy().T # transpose
    # Q4_1 does not support bias; good thing the bias is always all zeros.
    assert not np.any(bias)
    # Each int32 item is actually 8 int4 items packed together, and it's transposed.
    shape = (qweight.shape[0], qweight.shape[1] * 8)
    print("Processing Q4 variable: " + src_name + " with shape: ", shape)
    # The output format has the int4 weights in groups of 32 rather than 8.
    # It looks like this:
    # For each row:
    #   For each group of 32 columns:
     #     - scale (float32, 4 bytes)
     #     - addend (float32, 4 bytes)
    #     - weights (int4 * 32, 16 bytes)
    # Note that in the input, the scales and addends are shared between all
    # the columns in a row, so we end up wasting quite a bit of memory with
    # repeated scales and addends.
    addends = -zeros # flip sign
    # Since the output format is mixed between integers and floats, we have
    # to hackily view the floats as int32s just so numpy will let us
    # concatenate them.
    addends_view = addends.view(dtype=np.int32)
    scales_view = scales.view(dtype=np.int32)
    # Split into groups of 4 columns (i.e. 32 columns of quantized data):
    grouped = qweight.reshape([qweight.shape[0], qweight.shape[1] // 4, 4])
    # Repeat addends and scales:
    addends_rep = np.atleast_3d(addends_view).repeat(grouped.shape[1], axis=1)
    scales_rep = np.atleast_3d(scales_view).repeat(grouped.shape[1], axis=1)
    blob = np.concatenate([scales_rep, addends_rep, grouped], axis=2, casting='no')
    if permute:
        # Permute some rows to undo the permutation done by convert_llama_weights_to_hf.py.
        # This can be done after the above conversion because it doesn't affect column order/layout.
        blob = (blob.reshape(n_head, 2, shape[0] // n_head // 2, *blob.shape[1:])
                    .swapaxes(1, 2)
                    .reshape(blob.shape))
    # header
    write_header(shape, dst_name, 3) # ftype = Q4_1
    # data
    blob.tofile(fout)
 convert_non_q4("model.embed_tokens.weight", "tok_embeddings.weight")
 convert_non_q4("model.norm.weight", "norm.weight")
 convert_non_q4("lm_head.weight", "output.weight")
 for i in range(n_layer):
    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
    convert_q4(f"model.layers.{i}.self_attn.k_proj", f"layers.{i}.attention.wk.weight", permute=True)
    convert_q4(f"model.layers.{i}.self_attn.v_proj", f"layers.{i}.attention.wv.weight")
    convert_q4(f"model.layers.{i}.self_attn.o_proj", f"layers.{i}.attention.wo.weight")
    convert_q4(f"model.layers.{i}.mlp.gate_proj", f"layers.{i}.feed_forward.w1.weight")
    convert_q4(f"model.layers.{i}.mlp.down_proj", f"layers.{i}.feed_forward.w2.weight")
    convert_q4(f"model.layers.{i}.mlp.up_proj",   f"layers.{i}.feed_forward.w3.weight")
    convert_non_q4(f"model.layers.{i}.input_layernorm.weight", f"layers.{i}.attention_norm.weight")
    convert_non_q4(f"model.layers.{i}.post_attention_layernorm.weight", f"layers.{i}.ffn_norm.weight")
 fout.close()
 print("Done. Output file: " + fname_out)
 print("")
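As a rough illustration of the block layout that `convert_q4` writes (scale, then addend, then 32 packed 4-bit weights, 24 bytes in total), here is a minimal stdlib-only sketch. The helper names `pack_q4_1_block` and `unpack_q4_1_block` are hypothetical and do not appear in the script, and the low-nibble-first ordering is an assumption inferred from the little-endian int32 words, not something the script states.

```python
import struct

QK = 32  # weights per block, matching the groups of 32 described in convert_q4


def pack_q4_1_block(scale, addend, quants):
    """Pack one block: float32 scale, float32 addend, 32 nibbles (16 bytes)."""
    assert len(quants) == QK and all(0 <= q <= 15 for q in quants)
    # Assumption: the even-indexed weight sits in the low nibble of each byte.
    nibbles = bytes(quants[i] | (quants[i + 1] << 4) for i in range(0, QK, 2))
    return struct.pack("<ff", scale, addend) + nibbles


def unpack_q4_1_block(block):
    """Recover approximate weights as scale * q + addend."""
    scale, addend = struct.unpack_from("<ff", block, 0)
    weights = []
    for byte in block[8:]:
        weights.append(scale * (byte & 0x0F) + addend)
        weights.append(scale * (byte >> 4) + addend)
    return weights


block = pack_q4_1_block(0.5, -4.0, list(range(16)) * 2)
values = unpack_q4_1_block(block)
```

Because the converter repeats the row-wide scale and addend for every group of 32 columns, each block carries 8 bytes of metadata for 16 bytes of weights, which is the memory overhead the comments in `convert_q4` mention.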
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@ -10,25 +10,26 @@
 #   - Name (char[name_length])
 #   - Data (float[n_dims])
 #
 # By default, the bigger matrices are converted to 16-bit floats.
 # This can be disabled by adding the "use-f32" CLI argument.
 #
 # At the start of the ggml file we write the model parameters
 # and vocabulary.
 #
 import argparse
 import os
 import sys
 import json
 import struct
 import numpy as np
 import torch
 from sentencepiece import SentencePieceProcessor
 def parse_args():
    parser = argparse.ArgumentParser(description='Convert a LLaMA model checkpoint to a ggml compatible file')
-    parser.add_argument('dir_model', help='directory containing the model checkpoint')
+    parser.add_argument('dir_model',  help='directory containing the model checkpoint')
-    parser.add_argument('ftype', type=int, choices=[0, 1], default=1, help='file type (0: float32, 1: float16)')
+    parser.add_argument('ftype',      help='file type (0: float32, 1: float16)', type=int, choices=[0, 1], default=1)
    parser.add_argument('vocab_only', help='only write vocab to file', type=int, default=0, nargs='?')
    return parser.parse_args()
 def get_n_parts(dim):
@ -44,8 +45,14 @@ def get_n_parts(dim):
 def load_hparams_and_tokenizer(dir_model):
    # `dir_model` is something like `models/7B` or `models/7B/`.
    # "tokenizer.model" is expected under model's parent dir.
    # When `dir_model` is a symlink, f"{dir_model}/../tokenizer.model" would not be found.
    # Let's use the model's parent dir directly.
    model_parent_dir = os.path.dirname(os.path.normpath(dir_model))
    fname_hparams = f"{dir_model}/params.json"
-    fname_tokenizer = f"{dir_model}/../tokenizer.model"
+    fname_tokenizer = f"{model_parent_dir}/tokenizer.model"
    with open(fname_hparams, "r") as f:
        hparams = json.load(f)
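The effect of the `normpath` call in this hunk can be seen in a small sketch. The function name `tokenizer_path` is hypothetical; only the `os.path` calls come from the change itself. The sketch covers the trailing-slash case only and does not exercise the symlink case that the comment describes.

```python
import os.path


def tokenizer_path(dir_model):
    """Locate tokenizer.model in the parent of the model directory."""
    # normpath drops a trailing slash, so dirname returns the real parent
    # instead of the model directory itself.
    model_parent_dir = os.path.dirname(os.path.normpath(dir_model))
    return f"{model_parent_dir}/tokenizer.model"


with_slash = tokenizer_path("models/7B/")
without_slash = tokenizer_path("models/7B")
```

Both calls resolve to the same path under `models`, whereas `os.path.dirname("models/7B/")` on its own would return `models/7B`.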
@ -60,7 +67,7 @@ def write_header(fout, hparams, ftype):
    keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"]
    values = [
-        0x67676d66,  # magic: ggml in hex
+        0x67676d66,  # magic: ggmf in hex
        1, # file version
        *[hparams[key] for key in keys],
        hparams["dim"] // hparams["n_heads"],  # rot (obsolete)
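For readers unfamiliar with the two magic values, the sketch below shows how they spell `ggmf` and `ggml` and how a loader might tell them apart. The constant names mirror the ones used in `main.cpp` in this same commit; `magic_to_text` and `check_header` are hypothetical helpers, and the little-endian byte order is an assumption (the converter packs with the native `"i"` format).

```python
import struct

FILE_MAGIC = 0x67676D66              # "ggmf": versioned files
FILE_MAGIC_UNVERSIONED = 0x67676D6C  # "ggml": older files without a version field


def magic_to_text(magic):
    """Render the magic as ASCII, most significant byte first."""
    return struct.pack(">I", magic).decode("ascii")


def check_header(header, expected_version=1):
    """Return None if the header looks loadable, else a short reason."""
    (magic,) = struct.unpack_from("<I", header, 0)
    if magic == FILE_MAGIC_UNVERSIONED:
        return "too old, regenerate your model files"
    if magic != FILE_MAGIC:
        return "bad magic"
    (version,) = struct.unpack_from("<I", header, 4)
    if version != expected_version:
        return f"unsupported format version {version}"
    return None


good = check_header(struct.pack("<II", FILE_MAGIC, 1))
old = check_header(struct.pack("<I", FILE_MAGIC_UNVERSIONED))
```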
@ -127,6 +134,29 @@ def main():
    ftype_str = ["f32", "f16"]
    hparams, tokenizer = load_hparams_and_tokenizer(dir_model)
    print(args)
    # if only writing vocab to file
    if args.vocab_only:
        fname_model = f"{dir_model}/consolidated.00.pth"
        fname_out = f"{dir_model}/ggml-vocab.bin"
        print(f"Extracting only the vocab from '{fname_model}'\n")
        model = torch.load(fname_model, map_location="cpu")
        with open(fname_out, "wb") as fout:
            fout.write(struct.pack("i", hparams["vocab_size"]))
            write_tokens(fout, tokenizer)
        del model
        print(f"Done. Output file: {fname_out}\n")
        return
    n_parts = get_n_parts(hparams["dim"])
    for p in range(n_parts):
@ -144,6 +174,7 @@ def main():
            process_and_write_variables(fout, model, ftype)
        del model
        print(f"Done. Output file: {fname_out}, (part {p})\n")
 if __name__ == "__main__":
--- a/examples/chatLLaMa
+++ b/examples/chatLLaMa
@ -0,0 +1,53 @@
 #!/bin/bash
 cd "$(dirname "$0")/.." || exit
 MODEL="${MODEL:-./models/13B/ggml-model-q4_0.bin}"
 USER_NAME="${USER_NAME:-User}"
 AI_NAME="${AI_NAME:-ChatLLaMa}"
 # Adjust to the number of CPU cores you want to use.
 N_THREAD="${N_THREAD:-8}"
 # Number of tokens to predict (made it larger than default because we want a long interaction)
 N_PREDICTS="${N_PREDICTS:-2048}"
 # Note: you can also override the generation options by specifying them on the command line:
 # For example, override the context size by doing: ./chatLLaMa --ctx_size 1024
 GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647}"
 # shellcheck disable=SC2086 # Intended splitting of GEN_OPTIONS
 ./main $GEN_OPTIONS \
  --model "$MODEL" \
  --threads "$N_THREAD" \
  --n_predict "$N_PREDICTS" \
  --color --interactive \
  --reverse-prompt "${USER_NAME}:" \
  --prompt "
 Text transcript of a never ending dialog, where ${USER_NAME} interacts with an AI assistant named ${AI_NAME}.
 ${AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer ${USER_NAME}’s requests immediately and with details and precision.
 There are no annotations like (30 seconds passed...) or (to himself), just what ${USER_NAME} and ${AI_NAME} say aloud to each other.
 The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
 The transcript only includes text, it does not include markup like HTML and Markdown.
 $USER_NAME: Hello, $AI_NAME!
 $AI_NAME: Hello $USER_NAME! How may I help you today?
 $USER_NAME: What time is it?
 $AI_NAME: It is $(date +%H:%M).
 $USER_NAME: What year is it?
 $AI_NAME: We are in $(date +%Y).
 $USER_NAME: Please tell me the largest city in Europe.
 $AI_NAME: The largest city in Europe is Moscow, the capital of Russia.
 $USER_NAME: What can you tell me about Moscow?
 $AI_NAME: Moscow, on the Moskva River in western Russia, is the nation’s cosmopolitan capital. In its historic core is the Kremlin, a complex that’s home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
 $USER_NAME: What is a cat?
 $AI_NAME: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
 $USER_NAME: How do I pass command line arguments to a Node.js program?
 $AI_NAME: The arguments are stored in process.argv.
     argv[0] is the path to the Node.js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
 $USER_NAME: Name a color.
 $AI_NAME: Blue
 $USER_NAME:" "$@"
--- a/flake.nix
+++ b/flake.nix
@ -34,6 +34,7 @@
            cat ${./convert-pth-to-ggml.py} >> $out/bin/convert-pth-to-ggml
            chmod +x $out/bin/convert-pth-to-ggml
          '';
          meta.mainProgram = "llama";
        };
        devShells.default = pkgs.mkShell {
          packages = with pkgs; [
--- a/ggml.c
+++ b/ggml.c
@ -2,7 +2,7 @@
 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
-#elif !defined(__FreeBSD__) && !defined(__NetBSD__)
+#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
 #include <alloca.h>
 #endif
@ -361,7 +361,7 @@ static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);
 // AVX routines provided by GH user Const-me
 // ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600
-#if __AVX2__
+#if __AVX2__ || __AVX512F__
 // Unpack 32 4-bit fields into 32 bytes
 // The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval
 static inline __m256i bytesFromNibbles( const uint8_t* rsi )
@ -397,7 +397,6 @@ static inline __m128i packNibbles( __m256i bytes )
 }
 #endif
 // method 5
 // blocks of QK elements
 // represented with a single float (delta) and QK/2 8-bit ints (i.e. QK 4-bit signed integer factors)
@ -1262,6 +1261,47 @@ inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float
    *s = sumf;
 }
 #if __AVX512F__ && QK == 32
 static inline __m512 dot_q4_0_oneblock_avx512(
    __m512 acc,
    const uint8_t * pd0,
    const uint8_t * pd1,
    const uint8_t * pb0,
    const uint8_t * pb1,
    size_t bs,
    int i
 ) {
    const float * d0_0 = (const float *) (pd0 + i*bs);
    const float * d1_0 = (const float *) (pd1 + i*bs);
    const uint8_t * restrict p0 = pb0 + (i+0)*bs;
    const uint8_t * restrict p1 = pb1 + (i+0)*bs;
    // Compute combined scale for the block
    float scaleScalar = d0_0[0] * d1_0[0];
    __m512 scale = _mm512_set1_ps( scaleScalar );
    __m256i bx = bytesFromNibbles( p0 );
    __m256i by = bytesFromNibbles( p1 );
    // Now we have a vector with bytes in [ 0 .. 15 ] interval. Offset them into [ -8 .. +7 ] interval.
    const __m256i off = _mm256_set1_epi8( 8 );
    bx = _mm256_sub_epi8( bx, off );
    by = _mm256_sub_epi8( by, off );
     // Sign-extend 32 signed bytes into int16_t
    __m512i x32 = _mm512_cvtepi8_epi16( bx );
    __m512i y32 = _mm512_cvtepi8_epi16( by );
    // Compute products of int16_t integers, add pairwise
    __m512i i64 = _mm512_madd_epi16( x32, y32 );
    // Convert int32_t to float
    __m512 p = _mm512_cvtepi32_ps( i64 );
    // Apply the scale, and accumulate
    return _mm512_fmadd_ps( scale, p, acc );
 }
 #endif
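What `dot_q4_0_oneblock_avx512` computes per block can be written as a scalar reference, which may help when checking the vectorized path. This is a sketch in Python rather than C; the function name `dot_q4_0_reference` and the `(delta, quants)` tuple representation are assumptions made for illustration and are not part of ggml.

```python
def dot_q4_0_reference(blocks_x, blocks_y):
    """Scalar dot product of two Q4_0 rows.

    Each block is (delta, quants) with 32 unsigned 4-bit values in [0, 15].
    """
    total = 0.0
    for (dx, qx), (dy, qy) in zip(blocks_x, blocks_y):
        # Shift [0, 15] to [-8, 7], then take the integer dot product.
        isum = sum((a - 8) * (b - 8) for a, b in zip(qx, qy))
        # One combined scale per block pair, as in the AVX-512 helper.
        total += dx * dy * isum
    return total


x = [(0.5, [9] * 32)]   # every weight dequantizes to 0.5 * (9 - 8) = 0.5
y = [(2.0, [10] * 32)]  # every weight dequantizes to 2.0 * (10 - 8) = 4.0
result = dot_q4_0_reference(x, y)  # 32 * 0.5 * 4.0 = 64.0
```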
 inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
    ggml_float sumf = 0.0;
@ -1417,6 +1457,40 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
 #else
 #error "not implemented for QK"
 #endif
 #elif defined(__AVX512F__)
 #if QK == 32
    // Initialize accumulator with zeros
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();
    const int superblock_size = 8;
    const int superblock_count = nb / superblock_size;
    const int remainder = nb % superblock_size;
    for (int superblock_ix = 0; superblock_ix < superblock_count; superblock_ix += 1) {
        int i = superblock_ix * superblock_size;
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+0 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+1 );
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+2 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+3 );
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+4 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+5 );
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+6 );
        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+7 );
    }
    // Remainders
    for (int i = superblock_count * superblock_size; i < nb; ++i) {
        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i );
    }
    // Horizontal sum of all lanes of the accumulator
    sumf = _mm512_reduce_add_ps( acc0 ) + _mm512_reduce_add_ps( acc1 );
 #else
 #error "not implemented for QK"
 #endif
 #elif defined(__AVX2__)
 #if QK == 32
    const size_t countBlocks = nb;
@ -1928,7 +2002,7 @@ inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, void * res
    const size_t bs = 2*sizeof(float) + QK/2;
    const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
-    const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs +   sizeof(float)); 
+    const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs +   sizeof(float));
    const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
    for (int i = 0; i < nb; i++) {
--- a/main.cpp
+++ b/main.cpp
@ -9,7 +9,6 @@
 #include <cstring>
 #include <fstream>
 #include <iostream>
 #include <map>
 #include <string>
 #include <vector>
@ -20,6 +19,13 @@
 #include <signal.h>
 #endif
 #if defined (_WIN32)
 #pragma comment(lib,"kernel32.lib")
 extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
 extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
 extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
 #endif
 #define ANSI_COLOR_RED     "\x1b[31m"
 #define ANSI_COLOR_GREEN   "\x1b[32m"
 #define ANSI_COLOR_YELLOW  "\x1b[33m"
@ -29,10 +35,40 @@
 #define ANSI_COLOR_RESET   "\x1b[0m"
 #define ANSI_BOLD          "\x1b[1m"
 /* Keep track of current color of output, and emit ANSI code if it changes. */
 enum console_state {
    CONSOLE_STATE_DEFAULT=0,
    CONSOLE_STATE_PROMPT,
    CONSOLE_STATE_USER_INPUT
 }; 
 static console_state con_st = CONSOLE_STATE_DEFAULT;
 static bool con_use_color = false;
 void set_console_state(console_state new_st)
 {
    if (!con_use_color) return;
    // only emit color code if state changed
    if (new_st != con_st) {
        con_st = new_st;
        switch(con_st) {
        case CONSOLE_STATE_DEFAULT:
            printf(ANSI_COLOR_RESET);
            return;
        case CONSOLE_STATE_PROMPT:
            printf(ANSI_COLOR_YELLOW);
            return;
        case CONSOLE_STATE_USER_INPUT:
            printf(ANSI_BOLD ANSI_COLOR_GREEN);
            return;
        }
    }
 }
 static const int EOS_TOKEN_ID = 2;
 // determine number of model parts based on the dimension
-static const std::map<int, int> LLAMA_N_PARTS = {
+static const std::unordered_map<int, int> LLAMA_N_PARTS = {
    { 4096, 1 },
    { 5120, 2 },
    { 6656, 4 },
@ -86,11 +122,12 @@ struct llama_model {
    //
    struct ggml_context * ctx;
-    std::map<std::string, struct ggml_tensor *> tensors;
+    std::unordered_map<std::string, struct ggml_tensor *> tensors;
 };
 // load the model's weights from a file
-bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab & vocab, int n_ctx, ggml_type memory_type = GGML_TYPE_F32) {
+
 bool llama_model_load(const std::string & fname, llama_model & model, llama_vocab & vocab, int n_ctx, int n_parts, ggml_type memory_type = GGML_TYPE_F32) {
    fprintf(stderr, "%s: loading model from '%s' - please wait ...\n", __func__, fname.c_str());
    std::vector<char> f_buf(1024*1024);
@ -106,12 +143,12 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
    {
        uint32_t magic;
        fin.read((char *) &magic, sizeof(magic));
-        if (magic == 0x67676d6c) {
+        if (magic == FILE_MAGIC_UNVERSIONED) {
            fprintf(stderr, "%s: invalid model file '%s' (too old, regenerate your model files!)\n",
                    __func__, fname.c_str());
            return false;
        }
-        if (magic != 0x67676d66) {
+        if (magic != FILE_MAGIC) {
            fprintf(stderr, "%s: invalid model file '%s' (bad magic)\n", __func__, fname.c_str());
            return false;
        }
@ -119,15 +156,14 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
        uint32_t format_version;
        fin.read((char *) &format_version, sizeof(format_version));
-        if (format_version != 1) {
+        if (format_version != FILE_VERSION) {
-            fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ")\n",
+            fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ", expected %d)\n",
-                    __func__, fname.c_str(), format_version);
+                    __func__, fname.c_str(), format_version, FILE_VERSION);
            return false;
        }
    }
    int n_ff = 0;
    int n_parts = 0;
    // load hparams
    {
@ -145,7 +181,16 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
        hparams.n_ctx = n_ctx;
        n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;
-        n_parts = LLAMA_N_PARTS.at(hparams.n_embd);
+
        if (n_parts < 1) {
            n_parts = LLAMA_N_PARTS.at(hparams.n_embd);
        }
        // temp warning to tell the user to use "--n_parts"
        if (hparams.f16 == 4 && n_parts != 1) {
            fprintf(stderr, "%s: GPTQ model detected - are you sure n_parts should be %d? we normally expect it to be 1\n", __func__, n_parts);
            fprintf(stderr, "%s: use '--n_parts 1' if necessary\n", __func__);
        }
        fprintf(stderr, "%s: n_vocab = %d\n", __func__, hparams.n_vocab);
        fprintf(stderr, "%s: n_ctx   = %d\n", __func__, hparams.n_ctx);
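The `n_ff` expression in this hunk rounds two thirds of `4*n_embd` up to the next multiple of `n_mult` using integer arithmetic. A worked sketch follows; the function name `feed_forward_dim` is hypothetical, and `n_mult = 256` is taken from the hardcoded value in `convert-gptq-to-ggml.py`.

```python
def feed_forward_dim(n_embd, n_mult):
    """Round 2/3 of 4*n_embd up to a multiple of n_mult (integer arithmetic)."""
    hidden = 2 * (4 * n_embd) // 3
    return ((hidden + n_mult - 1) // n_mult) * n_mult


n_ff_4096 = feed_forward_dim(4096, 256)  # 10922 rounds up to 43 * 256 = 11008
n_ff_5120 = feed_forward_dim(5120, 256)  # 13653 rounds up to 54 * 256 = 13824
```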
@ -162,34 +207,43 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
    // load vocab
    {
        std::string word;
        vocab.id_to_token.resize(model.hparams.n_vocab);
        std::vector<char> tmp(64);
        for (int i = 0; i < model.hparams.n_vocab; i++) {
            uint32_t len;
            fin.read((char *) &len, sizeof(len));
            word.resize(len);
-            fin.read((char *) word.data(), len);
+            if (len > 0) {
                tmp.resize(len);
                fin.read(tmp.data(), len);
                word.assign(tmp.data(), len);
            } else {
                word.clear();
            }
            float score;
            fin.read((char *) &score, sizeof(score));
            vocab.token_to_id[word] = i;
            vocab.id_to_token[i] = word;
            vocab.score[i] = score;
-            //if (i < 30000) {
+            auto &tok_score = vocab.id_to_token[i];
-            //    fprintf(stderr, "%s: vocab[%d] = '%s'\n", __func__, i, word.c_str());
+            tok_score.tok = word;
-            //}
+            tok_score.score = score;
        }
    }
    // for the big tensors, we have the option to store the data in 16-bit floats or quantized
    // in order to save memory and also to speed up the computation
-    ggml_type wtype = GGML_TYPE_COUNT;
+    // wtype is for per-layer weights, while vtype is for other weights
    ggml_type wtype, vtype;
    switch (model.hparams.f16) {
-        case 0: wtype = GGML_TYPE_F32;  break;
+        case 0: wtype = vtype = GGML_TYPE_F32;  break;
-        case 1: wtype = GGML_TYPE_F16;  break;
+        case 1: wtype = vtype = GGML_TYPE_F16;  break;
-        case 2: wtype = GGML_TYPE_Q4_0; break;
+        case 2: wtype = vtype = GGML_TYPE_Q4_0; break;
-        case 3: wtype = GGML_TYPE_Q4_1; break;
+        case 3: wtype = vtype = GGML_TYPE_Q4_1; break;
        case 4: wtype = GGML_TYPE_Q4_1; vtype = GGML_TYPE_F16; break;
        default:
                {
                    fprintf(stderr, "%s: invalid model file '%s' (bad f16 value %d)\n",
@ -210,11 +264,11 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
        const int n_ctx   = hparams.n_ctx;
        const int n_vocab = hparams.n_vocab;
-        ctx_size += n_embd*n_vocab*ggml_type_sizef(wtype); // tok_embeddings
+        ctx_size += n_embd*n_vocab*ggml_type_sizef(vtype); // tok_embeddings
        ctx_size += n_embd*ggml_type_sizef(GGML_TYPE_F32); // norm
-        ctx_size += n_embd*n_vocab*ggml_type_sizef(wtype); // output
+        ctx_size += n_embd*n_vocab*ggml_type_sizef(vtype); // output
        ctx_size += n_layer*(n_embd*ggml_type_sizef(GGML_TYPE_F32)); // attention_norm
@ -261,10 +315,10 @@ bool llama_model_load(const std::string & fname, llama_model & model, gpt_vocab
        model.layers.resize(n_layer);
-        model.tok_embeddings = ggml_new_tensor_2d(ctx, wtype, n_embd, n_vocab);
+        model.tok_embeddings = ggml_new_tensor_2d(ctx, vtype, n_embd, n_vocab);
        model.norm   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd);
-        model.output = ggml_new_tensor_2d(ctx, wtype,         n_embd, n_vocab);
+        model.output = ggml_new_tensor_2d(ctx, vtype,         n_embd, n_vocab);
        // map by name
        model.tensors["tok_embeddings.weight"] = model.tok_embeddings;
@ -544,9 +598,10 @@ bool llama_eval(
        const llama_model & model,
        const int n_threads,
        const int n_past,
-        const std::vector<gpt_vocab::id> & embd_inp,
+        const std::vector<llama_vocab::id> & embd_inp,
-              std::vector<float>         & embd_w,
+              std::vector<float>           & embd_w,
-              size_t                     & mem_per_token) {
+              size_t                       & mem_per_token,
              bool return_all_logits = false) {
    const int N = embd_inp.size();
    const auto & hparams = model.hparams;
@ -564,7 +619,7 @@ bool llama_eval(
    static void * buf = malloc(buf_size);
    if (mem_per_token > 0 && mem_per_token*N > buf_size) {
-        const size_t buf_size_new = 1.1*(mem_per_token*N); // add 10% to account for ggml object overhead
+        const size_t buf_size_new = 1.3*(mem_per_token*N); // add 30% to account for ggml object overhead
        //fprintf(stderr, "\n%s: reallocating buffer from %zu to %zu bytes\n", __func__, buf_size, buf_size_new);
        // reallocate
@ -750,9 +805,14 @@ bool llama_eval(
    //embd_w.resize(n_vocab*N);
    //memcpy(embd_w.data(), ggml_get_data(inpL), sizeof(float)*n_vocab*N);
-    // return result for just the last token
+    if (return_all_logits) {
-    embd_w.resize(n_vocab);
+        embd_w.resize(n_vocab * N);
-    memcpy(embd_w.data(), (float *) ggml_get_data(inpL) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
+        memcpy(embd_w.data(), (float *) ggml_get_data(inpL), sizeof(float)*n_vocab*N);
    } else {
        // return result for just the last token
        embd_w.resize(n_vocab);
        memcpy(embd_w.data(), (float *) ggml_get_data(inpL) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
    }
    if (mem_per_token == 0) {
        mem_per_token = ggml_used_mem(ctx0)/N;
@ -764,11 +824,81 @@ bool llama_eval(
    return true;
 }
 std::vector<double> softmax(const std::vector<float>& logits) {
    std::vector<double> probs(logits.size());
    float max_logit = logits[0];
    for (float v : logits) max_logit = std::max(max_logit, v);
    double sum_exp = 0.0;
    for (size_t i = 0; i < logits.size(); i++) {
        // Subtract the maximum logit value from the current logit value for numerical stability
        float logit = logits[i] - max_logit;
        double exp_logit = std::exp(logit);
        sum_exp += exp_logit;
        probs[i] = exp_logit;
    }
    for (size_t i = 0; i < probs.size(); i++) probs[i] /= sum_exp;
    return probs;
 }
 void perplexity(const llama_vocab &vocab, const llama_model &model, const gpt_params &params, size_t mem_per_token) {
    // Download: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
    // Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
    // Output: `perplexity: 13.5106 [114/114]`
    std::vector<llama_vocab::id> tokens = ::llama_tokenize(vocab, params.prompt, true);
    int count = 0;
    double nll = 0.0;
    int seq_count = tokens.size() / params.n_ctx;
    printf("Calculating perplexity over %d chunks\n", seq_count);
    for (int i = 0; i < seq_count; ++i) {
        int start = i * params.n_ctx;
        int end = start + params.n_ctx - 1;
        std::vector<llama_vocab::id> embd(tokens.begin() + start, tokens.begin() + end);
        std::vector<float> logits;
        auto start_t = std::chrono::high_resolution_clock::now();
        if (!llama_eval(model, params.n_threads, 0, embd, logits, mem_per_token, true)) {
            fprintf(stderr, "Failed to predict\n");
            return;
        }
        auto end_t = std::chrono::high_resolution_clock::now();
        if (i == 0) {
            double seconds = std::chrono::duration<double>(end_t - start_t).count();
            printf("%.2f seconds per pass - ETA %.2f hours\n", seconds, (seconds * seq_count) / (60.0*60.0));
        }
        // We get the logits for all the tokens in the context window (params.n_ctx)
        // from llama_eval above.  Now, based on https://huggingface.co/docs/transformers/perplexity,
         // calculate the perplexity over the last half of the window (so the model always has
        // some context to predict the token).
        //
        // We rely on the fact that attention in the forward pass only looks at previous
        // tokens here, so the logits returned for each token are an accurate representation
        // of what the model would have predicted at that point.
        //
         // For example, with a context window of 512, we compute perplexity for each of the
         // last 256 tokens.  The input is split into chunks of the context window size so
         // that the entire prompt is processed.
        for (int j = params.n_ctx / 2; j < params.n_ctx - 1; ++j) {
            // Calculate probability of next token, given the previous ones.
            int n_vocab = model.hparams.n_vocab;
            std::vector<float> tok_logits(
                logits.begin() + j * n_vocab,
                logits.begin() + (j + 1) * n_vocab);
            double prob = softmax(tok_logits)[tokens[start + j + 1]];
            nll += -std::log(prob);
            ++count;
        }
        // perplexity is e^(average negative log-likelihood)
        printf("[%d]%.4lf,", i + 1, std::exp(nll / count));
        fflush(stdout);
    }
    printf("\n");
 }
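The two functions above boil down to a numerically stable softmax followed by the exponential of the mean negative log-likelihood. A compact sketch of that calculation is given here in Python rather than C++; it omits the chunking and the half-window restriction, and the list-of-lists input format is an assumption made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    max_logit = max(logits)
    exps = [math.exp(v - max_logit) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perplexity(logit_rows, targets):
    """exp of the mean negative log-likelihood of each target token."""
    nll = 0.0
    for logits, target in zip(logit_rows, targets):
        nll += -math.log(softmax(logits)[target])
    return math.exp(nll / len(targets))


# Uniform logits over 4 tokens: every token has probability 1/4,
# so the perplexity is approximately 4.
uniform = perplexity([[0.0] * 4] * 3, [0, 1, 2])
```

Subtracting the maximum logit leaves the probabilities unchanged but keeps `exp` from overflowing on large logits, which is the same reason the C++ `softmax` does it.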
 static bool is_interacting = false;
 #if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
 void sigint_handler(int signo) {
-    printf(ANSI_COLOR_RESET);
+    set_console_state(CONSOLE_STATE_DEFAULT);
     printf("\n"); // this also forces a flush of stdout.
    if (signo == SIGINT) {
        if (!is_interacting) {
@ -827,19 +957,23 @@ int main(int argc, char ** argv) {
        params.prompt = gpt_random_prompt(rng);
    }
    // save choice to use color for later
    // (note for later: this is a slightly awkward choice)
    con_use_color = params.use_color;
 //    params.prompt = R"(// this function checks if the number n is prime
 //bool is_prime(int n) {)";
    int64_t t_load_us = 0;
-    gpt_vocab vocab;
+    llama_vocab vocab;
    llama_model model;
    // load the model
    {
        const ggml_type memory_type = params.memory_f16 ? GGML_TYPE_F16 : GGML_TYPE_F32;
        const int64_t t_start_us = ggml_time_us();
-        if (!llama_model_load(params.model, model, vocab, params.n_ctx, memory_type)) {
+        if (!llama_model_load(params.model, model, vocab, params.n_ctx, params.n_parts, memory_type)) {
            fprintf(stderr, "%s: failed to load model from '%s'\n", __func__, params.model.c_str());
            return 1;
        }
@ -854,23 +988,32 @@ int main(int argc, char ** argv) {
                params.n_threads, std::thread::hardware_concurrency(), llama_print_system_info());
    }
    std::vector<float> logits;
    // determine the required inference memory per token:
    size_t mem_per_token = 0;
    llama_eval(model, params.n_threads, 0, { 0, 1, 2, 3 }, logits, mem_per_token);
    if (params.perplexity) {
        perplexity(vocab, model, params, mem_per_token);
        exit(0);
    }
    int n_past = 0;
    int64_t t_sample_us  = 0;
    int64_t t_predict_us = 0;
    std::vector<float> logits;
    // Add a space in front of the first character to match OG llama tokenizer behavior
    params.prompt.insert(0, 1, ' ');
    // tokenize the prompt
-    std::vector<gpt_vocab::id> embd_inp = ::llama_tokenize(vocab, params.prompt, true);
+    std::vector<llama_vocab::id> embd_inp = ::llama_tokenize(vocab, params.prompt, true);
    params.n_predict = std::min(params.n_predict, model.hparams.n_ctx - (int) embd_inp.size());
    // prefix & suffix for instruct mode
-    const std::vector<gpt_vocab::id> inp_pfx = ::llama_tokenize(vocab, "\n\n### Instruction:\n\n", true);
+    const std::vector<llama_vocab::id> inp_pfx = ::llama_tokenize(vocab, "\n\n### Instruction:\n\n", true);
-    const std::vector<gpt_vocab::id> inp_sfx = ::llama_tokenize(vocab, "\n\n### Response:\n\n", false);
+    const std::vector<llama_vocab::id> inp_sfx = ::llama_tokenize(vocab, "\n\n### Response:\n\n", false);
    // in instruct mode, we inject a prefix and a suffix to each input by the user
    if (params.instruct) {
@ -878,18 +1021,14 @@ int main(int argc, char ** argv) {
        params.antiprompt.push_back("### Instruction:\n\n");
    }
-    // tokenize the reverse prompt
+    // tokenize the first reverse prompt
-    std::vector<std::vector<gpt_vocab::id>> antipromptv_inp;
+    std::vector<llama_vocab::id> first_antiprompt;
-    
+    if (!params.antiprompt.empty()) {
-    for (auto antiprompt : params.antiprompt) {
+        first_antiprompt = ::llama_tokenize(vocab, params.antiprompt.front(), false);
-        antipromptv_inp.push_back(::llama_tokenize(vocab, antiprompt, false));
    }
-    // tokenize the first reverse prompt
-    std::vector<llama_vocab::id> first_antiprompt = ::llama_tokenize(vocab, params.antiprompt.front(), false);
    // enable interactive mode if reverse prompt is specified
-    if (antipromptv_inp.size() != 0) {
+    if (params.antiprompt.size() != 0) {
        params.interactive = true;
    }
@ -897,7 +1036,7 @@ int main(int argc, char ** argv) {
    fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
    fprintf(stderr, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
    for (int i = 0; i < (int) embd_inp.size(); i++) {
-        fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], vocab.id_to_token.at(embd_inp[i]).c_str());
+        fprintf(stderr, "%6d -> '%s'\n", embd_inp[i], vocab.id_to_token.at(embd_inp[i]).tok.c_str());
    }
    fprintf(stderr, "\n");
    if (params.interactive) {
@ -913,29 +1052,19 @@ int main(int argc, char ** argv) {
        fprintf(stderr, "%s: interactive mode on.\n", __func__);
-        if(antipromptv_inp.size()) {
+        if(params.antiprompt.size()) {
-            for (size_t apindex = 0; apindex < antipromptv_inp.size(); ++apindex) {
+            for (auto antiprompt : params.antiprompt) {
-                auto antiprompt_inp = antipromptv_inp.at(apindex);
+                fprintf(stderr, "Reverse prompt: '%s'\n", antiprompt.c_str());
-                fprintf(stderr, "%s: reverse prompt: '%s'\n", __func__, params.antiprompt.at(apindex).c_str());
-                fprintf(stderr, "%s: number of tokens in reverse prompt = %zu\n", __func__, antiprompt_inp.size());
-                for (int i = 0; i < (int) antiprompt_inp.size(); i++) {
-                    fprintf(stderr, "%6d -> '%s'\n", antiprompt_inp[i], vocab.id_to_token.at(antiprompt_inp[i]).c_str());
-                }
-                fprintf(stderr, "\n");
            }
        }
    }
    fprintf(stderr, "sampling parameters: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n", params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
    fprintf(stderr, "\n\n");
-    std::vector<gpt_vocab::id> embd;
+    std::vector<llama_vocab::id> embd;
-    // determine the required inference memory per token:
-    size_t mem_per_token = 0;
-    llama_eval(model, params.n_threads, 0, { 0, 1, 2, 3 }, logits, mem_per_token);
    int last_n_size = params.repeat_last_n;
-    std::vector<gpt_vocab::id> last_n_tokens(last_n_size);
+    std::vector<llama_vocab::id> last_n_tokens(last_n_size);
    std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
    if (params.interactive) {
@ -956,10 +1085,18 @@ int main(int argc, char ** argv) {
    // dynamically determine the newline token
    const auto NEWLINE_TOKEN_ID = vocab.token_to_id["\n"];
-    // set the color for the prompt which will be output initially
+#if defined (_WIN32)
-    if (params.use_color) {
+  if (params.use_color) {
-        printf(ANSI_COLOR_YELLOW);
+        // Enable ANSI colors on Windows 10+
+        unsigned long dwMode = 0;
+        void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
+        if (hConOut && hConOut != (void*)-1 && GetConsoleMode(hConOut, &dwMode) && !(dwMode & 0x4)) {
+            SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
+        }
    }
+#endif
+    // the first thing we will do is to output the prompt, so set color accordingly
+    set_console_state(CONSOLE_STATE_PROMPT);
    while (remaining_tokens > 0 || params.interactive) {
        // predict
@ -977,7 +1114,7 @@ int main(int argc, char ** argv) {
        n_past += embd.size();
        embd.clear();
-        if (embd_inp.size() <= input_consumed) {
+        if ((int) embd_inp.size() <= input_consumed) {
            // out of user input, sample next token
            const float top_k = params.top_k;
            const float top_p = params.top_p;
@ -986,7 +1123,7 @@ int main(int argc, char ** argv) {
            const int n_vocab = model.hparams.n_vocab;
-            gpt_vocab::id id = 0;
+            llama_vocab::id id = 0;
            {
                const int64_t t_start_sample_us = ggml_time_us();
@ -1023,7 +1160,7 @@ int main(int argc, char ** argv) {
            --remaining_tokens;
        } else {
            // some user input remains from prompt or interaction, forward it to processing
-            while (embd_inp.size() > input_consumed) {
+            while ((int) embd_inp.size() > input_consumed) {
                embd.push_back(embd_inp[input_consumed]);
                last_n_tokens.erase(last_n_tokens.begin());
                last_n_tokens.push_back(embd_inp[input_consumed]);
@ -1037,27 +1174,35 @@ int main(int argc, char ** argv) {
        // display text
        if (!input_noecho) {
            for (auto id : embd) {
-                printf("%s", vocab.id_to_token[id].c_str());
+                printf("%s", vocab.id_to_token[id].tok.c_str());
            }
            fflush(stdout);
        }
        // reset color to default if we there is no pending user input
-        if (!input_noecho && params.use_color && (int)embd_inp.size() == input_consumed) {
+        if (!input_noecho && (int)embd_inp.size() == input_consumed) {
-            printf(ANSI_COLOR_RESET);
+            set_console_state(CONSOLE_STATE_DEFAULT);
        }
        // in interactive mode, and not currently processing queued inputs;
        // check if we should prompt the user for more
-        if (params.interactive && embd_inp.size() <= input_consumed) {
+        if (params.interactive && (int) embd_inp.size() <= input_consumed) {
            // check for reverse prompt
-            for (auto antiprompt_inp : antipromptv_inp) {
+            std::string last_output;
-                if (antiprompt_inp.size() && std::equal(antiprompt_inp.rbegin(), antiprompt_inp.rend(), last_n_tokens.rbegin())) {
+            for (auto id : last_n_tokens) {
-                    // reverse prompt found
+                last_output += vocab.id_to_token[id].tok;
+            }
+            // Check if each of the reverse prompts appears at the end of the output.
+            for (std::string antiprompt : params.antiprompt) {
+                if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
                    is_interacting = true;
                    break;
                }
            }
            if (is_interacting) {
+                // potentially set color to indicate we are taking user input
+                set_console_state(CONSOLE_STATE_USER_INPUT);
                if (params.instruct) {
                    input_consumed = embd_inp.size();
                    embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
@ -1065,8 +1210,6 @@ int main(int argc, char ** argv) {
                    printf("\n> ");
                }
-                // currently being interactive
-                if (params.use_color) printf(ANSI_BOLD ANSI_COLOR_GREEN);
                std::string buffer;
                std::string line;
                bool another_line = true;
@ -1079,9 +1222,11 @@ int main(int argc, char ** argv) {
                    }
                    buffer += line + '\n'; // Append the line to the result
                } while (another_line);
-                if (params.use_color) printf(ANSI_COLOR_RESET);
-                std::vector<gpt_vocab::id> line_inp = ::llama_tokenize(vocab, buffer, false);
+                // done taking input, reset color
+                set_console_state(CONSOLE_STATE_DEFAULT);
+                std::vector<llama_vocab::id> line_inp = ::llama_tokenize(vocab, buffer, false);
                embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
                if (params.instruct) {
@ -1126,9 +1271,7 @@ int main(int argc, char ** argv) {
    ggml_free(model.ctx);
-    if (params.use_color) {
+    set_console_state(CONSOLE_STATE_DEFAULT);
-        printf(ANSI_COLOR_RESET);
-    }
    return 0;
 }
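
The reverse-prompt check in this file now compares strings instead of token sequences: it concatenates the text of `last_n_tokens` and tests whether any entry of `params.antiprompt` sits at the very end. A minimal standalone sketch of that idea follows. The helper names `ends_with` and `hit_reverse_prompt` are hypothetical and do not appear in the commit, and treating an empty reverse prompt as "no match" is an assumption of this sketch, not something the diff states.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not in the commit): true if `text` ends with `suffix`.
// An empty suffix is treated as "no match" here, which is an assumption.
static bool ends_with(const std::string & text, const std::string & suffix) {
    if (suffix.empty() || suffix.size() > text.size()) {
        return false;
    }
    return text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: true if any reverse prompt terminates the recent output.
static bool hit_reverse_prompt(const std::string & last_output,
                               const std::vector<std::string> & antiprompts) {
    for (const auto & antiprompt : antiprompts) {
        if (ends_with(last_output, antiprompt)) {
            return true;
        }
    }
    return false;
}
```

Comparing decoded text rather than token ids means a reverse prompt can still be detected when the model emits it with a different token split than the tokenizer would produce for the same string.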
--- a/models/ggml-vocab.bin
+++ b/models/ggml-vocab.bin
--- a/quantize.cpp
+++ b/quantize.cpp
@ -8,7 +8,6 @@
 #include <cstdio>
 #include <cstring>
 #include <fstream>
-#include <map>
 #include <string>
 #include <vector>
 #include <regex>
@ -44,7 +43,7 @@ bool llama_model_quantize(const std::string & fname_inp, const std::string & fna
        return false;
    }
-    gpt_vocab vocab;
+    llama_vocab vocab;
    printf("%s: loading model from '%s'\n", __func__, fname_inp.c_str());
@ -64,12 +63,12 @@ bool llama_model_quantize(const std::string & fname_inp, const std::string & fna
    {
        uint32_t magic;
        finp.read((char *) &magic, sizeof(magic));
-        if (magic == 0x67676d6c) {
+        if (magic == FILE_MAGIC_UNVERSIONED) {
            fprintf(stderr, "%s: invalid model file '%s' (too old, regenerate your model files!)\n",
                    __func__, fname_inp.c_str());
            return false;
        }
-        if (magic != 0x67676d66) {
+        if (magic != FILE_MAGIC) {
            fprintf(stderr, "%s: invalid model file '%s' (bad magic)\n", __func__, fname_inp.c_str());
            return false;
        }
@ -79,9 +78,9 @@ bool llama_model_quantize(const std::string & fname_inp, const std::string & fna
        uint32_t format_version;
        finp.read((char *) &format_version, sizeof(format_version));
-        if (format_version != 1) {
+        if (format_version != FILE_VERSION) {
-            fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ")\n",
+            fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ", expected %d)\n",
-                    __func__, fname_inp.c_str(), format_version);
+                    __func__, fname_inp.c_str(), format_version, FILE_VERSION);
            return false;
        }
@ -130,6 +129,7 @@ bool llama_model_quantize(const std::string & fname_inp, const std::string & fna
        }
        std::string word;
+        vocab.id_to_token.resize(n_vocab);
        for (int i = 0; i < n_vocab; i++) {
            uint32_t len;
            finp.read ((char *) &len, sizeof(len));
@ -144,8 +144,10 @@ bool llama_model_quantize(const std::string & fname_inp, const std::string & fna
            fout.write((char *) &score, sizeof(score));
            vocab.token_to_id[word] = i;
-            vocab.id_to_token[i] = word;
+
-            vocab.score[i] = score;
+            auto &tok_score = vocab.id_to_token[i];
+            tok_score.tok = word;
+            tok_score.score = score;
        }
    }
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@ -0,0 +1,4 @@
 set(TEST_TARGET test-tokenizer-0)
 add_executable(${TEST_TARGET} ${TEST_TARGET}.cpp)
 target_link_libraries(${TEST_TARGET} PRIVATE utils)
 add_test(NAME ${TEST_TARGET} COMMAND $<TARGET_FILE:${TEST_TARGET}> ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab.bin)
--- a/tests/test-tokenizer-0.cpp
+++ b/tests/test-tokenizer-0.cpp
@ -0,0 +1,69 @@
 #include "utils.h"
 #include <cstdio>
 #include <string>
 #include <map>
 static const std::map<std::string, std::vector<llama_vocab::id>> k_tests = {
    { "Hello World",        { 1,  10994,   2787, }, },
    { " Hello World",       { 1,  15043,   2787, }, },
    { " Hello World!",      { 1,  15043,   2787,  29991, }, },
    { " this is 🦙.cpp",    { 1,    445,    338,  29871,    243,    162,    169,    156,  29889,   8223, }, },
    { "w048 7tuijk dsdfhu", { 1,  29893,  29900,  29946,  29947,  29871,  29955,   9161,  13535,  18031,   2176,   6905, }, },
    { "нещо на Български",  { 1,    821,   4851,    665,   1386,  29713,   1305, }, },
 };
 int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <vocab-file>\n", argv[0]);
        return 1;
    }
    const std::string fname = argv[1];
    fprintf(stderr, "%s : reading vocab from: '%s'\n", __func__, fname.c_str());
    llama_vocab vocab;
    if (!llama_vocab_load(fname, vocab)) {
        fprintf(stderr, "%s : failed to load vocab from: '%s'\n", __func__, fname.c_str());
        return 1;
    }
    const int n_vocab = vocab.id_to_token.size();
    if (n_vocab != 32000) {
        fprintf(stderr, "%s : expected 32000 tokens, got %d\n", __func__, n_vocab);
        return 2;
    }
    for (const auto & test_kv : k_tests) {
        const auto res = llama_tokenize(vocab, test_kv.first, true);
        bool correct = res.size() == test_kv.second.size();
        for (int i = 0; i < (int) res.size() && correct; ++i) {
            if (res[i] != test_kv.second[i]) {
                correct = false;
            }
        }
        if (!correct) {
            fprintf(stderr, "%s : failed test: '%s'\n", __func__, test_kv.first.c_str());
            fprintf(stderr, "%s : expected tokens: ", __func__);
            for (const auto & t : test_kv.second) {
                fprintf(stderr, "%6d, ", t);
            }
            fprintf(stderr, "\n");
            fprintf(stderr, "%s : got tokens:      ", __func__);
            for (const auto & t : res) {
                fprintf(stderr, "%6d, ", t);
            }
            fprintf(stderr, "\n");
            return 3;
        }
    }
    return 0;
 }
--- a/utils.cpp
+++ b/utils.cpp
@ -12,7 +12,7 @@
 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
- #elif !defined(__FreeBSD__) && !defined(__NetBSD__)
+ #elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
 #include <alloca.h>
 #endif
@ -72,8 +72,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
            params.use_color = true;
        } else if (arg == "-r" || arg == "--reverse-prompt") {
            params.antiprompt.push_back(argv[++i]);
+        } else if (arg == "--perplexity") {
+            params.perplexity = true;
        } else if (arg == "--ignore-eos") {
            params.ignore_eos = true;
+        } else if (arg == "--n_parts") {
+            params.n_parts = std::stoi(argv[++i]);
        } else if (arg == "-h" || arg == "--help") {
            gpt_print_usage(argc, argv, params);
            exit(0);
@ -116,7 +120,9 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
    fprintf(stderr, "  --ignore-eos          ignore end of stream token and continue generating\n");
    fprintf(stderr, "  --memory_f16          use f16 instead of f32 for memory key+value\n");
    fprintf(stderr, "  --temp N              temperature (default: %.1f)\n", params.temp);
+    fprintf(stderr, "  --n_parts N           number of model parts (default: -1 = determine from dimensions)\n");
    fprintf(stderr, "  -b N, --batch_size N  batch size for prompt processing (default: %d)\n", params.n_batch);
+    fprintf(stderr, "  --perplexity          compute perplexity over the prompt\n");
    fprintf(stderr, "  -m FNAME, --model FNAME\n");
    fprintf(stderr, "                        model path (default: %s)\n", params.model.c_str());
    fprintf(stderr, "\n");
@ -149,8 +155,8 @@ void replace(std::string & str, const std::string & needle, const std::string &
    }
 }
-std::map<std::string, int32_t> json_parse(const std::string & fname) {
+std::unordered_map<std::string, int32_t> json_parse(const std::string & fname) {
-    std::map<std::string, int32_t> result;
+    std::unordered_map<std::string, int32_t> result;
    // read file into string
    std::string json;
@ -240,61 +246,6 @@ std::map<std::string, int32_t> json_parse(const std::string & fname) {
    return result;
 }
-std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text) {
-    std::vector<std::string> words;
-    // first split the text into words
-    {
-        std::string str = text;
-        std::string pat = R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)";
-        std::regex re(pat);
-        std::smatch m;
-        while (std::regex_search(str, m, re)) {
-            for (auto x : m) {
-                words.push_back(x);
-            }
-            str = m.suffix();
-        }
-    }
-    // find the longest tokens that form the words:
-    std::vector<gpt_vocab::id> tokens;
-    for (const auto & word : words) {
-        if (word.size() == 0) continue;
-        int i = 0;
-        int n = word.size();
-        while (i < n) {
-            int j = n;
-            while (j > i) {
-                auto it = vocab.token_to_id.find(word.substr(i, j-i));
-                if (it != vocab.token_to_id.end()) {
-                    tokens.push_back(it->second);
-                    i = j;
-                    break;
-                }
-                --j;
-            }
-            if (i == n) {
-                break;
-            }
-            if (j == i) {
-                auto sub = word.substr(i, 1);
-                if (vocab.token_to_id.find(sub) != vocab.token_to_id.end()) {
-                    tokens.push_back(vocab.token_to_id.at(sub));
-                } else {
-                    fprintf(stderr, "%s: unknown token '%s'\n", __func__, sub.data());
-                }
-                ++i;
-            }
-        }
-    }
-    return tokens;
-}
 static size_t utf8_len(char src) {
    const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
    uint8_t highbits = static_cast<uint8_t>(src) >> 4;
@ -305,7 +256,8 @@ struct llama_sp_symbol {
    using index = int;
    index prev;
    index next;
-    std::string_view text;
+    const char * text;
+    size_t n;
 };
 struct llama_sp_bigram {
@ -322,19 +274,23 @@ struct llama_sp_bigram {
    size_t size;
 };
+// original implementation:
+// https://github.com/ggerganov/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4
 struct llama_tokenizer {
-    llama_tokenizer(const gpt_vocab & vocab): vocab_(vocab) {}
+    llama_tokenizer(const llama_vocab & vocab): vocab_(vocab) {}
-    void tokenize(std::string_view text, std::vector<gpt_vocab::id> & output) {
+    void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
        // split string into utf8 chars
        int index = 0;
-        while (!text.empty()) {
+        size_t offs = 0;
+        while (offs < text.size()) {
            llama_sp_symbol sym;
-            size_t char_len = std::min(text.size(), utf8_len(text.data()[0]));
+            size_t char_len = std::min(text.size() - offs, utf8_len(text[offs]));
-            sym.text = std::string_view(text.data(), char_len);
+            sym.text = text.c_str() + offs;
+            sym.n = char_len;
+            offs += char_len;
            sym.prev = index - 1;
-            text.remove_prefix(char_len);
+            sym.next = offs == text.size() ? -1 : index + 1;
-            sym.next = text.empty() ? -1 : index + 1;
            index++;
            symbols_.emplace_back(std::move(sym));
        }
@ -353,14 +309,16 @@ struct llama_tokenizer {
            auto & right_sym = symbols_[bigram.right];
            // if one of the symbols already got merged, skip it.
-            if (left_sym.text.empty() || right_sym.text.empty() ||
+            if (left_sym.n == 0 || right_sym.n == 0 ||
-                left_sym.text.size() + right_sym.text.size() != bigram.size) {
+                left_sym.n + right_sym.n != bigram.size) {
                continue;
            }
            // merge the right sym into the left one
-            left_sym.text = std::string_view(left_sym.text.data(), left_sym.text.size() + right_sym.text.size());
+            left_sym.n += right_sym.n;
-            right_sym.text = std::string_view("");
+            right_sym.n = 0;
+            //printf("left = '%*s' size = %zu\n", (int) left_sym.n, left_sym.text, bigram.size);
            // remove the right sym from the chain
            left_sym.next = right_sym.next;
@ -374,13 +332,13 @@ struct llama_tokenizer {
        }
        for (int i = 0; i != -1; i = symbols_[i].next) {
-            auto& symbol = symbols_[i];
+            auto & symbol = symbols_[i];
-            auto token = vocab_.token_to_id.find(std::string(symbol.text));
+            auto token = vocab_.token_to_id.find(std::string(symbol.text, symbol.n));
            if (token == vocab_.token_to_id.end()) {
                // output any symbols that did not form tokens as bytes.
-                for (int j = 0; j < symbol.text.size(); ++j) {
+                for (int j = 0; j < (int) symbol.n; ++j) {
-                    gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
+                    llama_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
                    output.push_back(token_id);
                }
            } else {
@ -395,35 +353,77 @@ private:
            return;
        }
-        std::string_view text(symbols_[left].text.data(), symbols_[left].text.size() + symbols_[right].text.size());
+        const std::string text = std::string(symbols_[left].text, symbols_[left].n + symbols_[right].n);
-        auto token = vocab_.token_to_id.find(std::string(text));
+        auto token = vocab_.token_to_id.find(text);
        if (token == vocab_.token_to_id.end()) {
            return;
        }
-        auto score = vocab_.score.find((*token).second);
+        if (static_cast<size_t>((*token).second) >= vocab_.id_to_token.size()) {
-        if (score == vocab_.score.end()) {
            return;
        }
+        const auto &tok_score = vocab_.id_to_token[(*token).second];
        llama_sp_bigram bigram;
        bigram.left = left;
        bigram.right = right;
-        bigram.score = (*score).second;
+        bigram.score = tok_score.score;
        bigram.size = text.size();
        work_queue_.push(bigram);
    }
-    const gpt_vocab & vocab_;
+    const llama_vocab & vocab_;
    std::vector<llama_sp_symbol> symbols_;
    llama_sp_bigram::queue work_queue_;
 };
-std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_view text, bool bos) {
+// TODO: temporary code duplication with llama.cpp
+//       will resolve after #77 is merged
+bool llama_vocab_load(const std::string & fname, llama_vocab & vocab) {
+    std::ifstream fin(fname, std::ios::binary);
+    if (!fin.is_open()) {
+        return false;
+    }
+    int n_vocab = 0;
+    fin.read((char *) &n_vocab, sizeof(n_vocab));
+    std::string word;
+    std::vector<char> tmp(64);
+    vocab.id_to_token.resize(n_vocab);
+    for (int i = 0; i < n_vocab; i++) {
+        uint32_t len;
+        fin.read((char *) &len, sizeof(len));
+        word.resize(len);
+        if (len > 0) {
+            tmp.resize(len);
+            fin.read(tmp.data(), len);
+            word.assign(tmp.data(), len);
+        } else {
+            word.clear();
+        }
+        float score;
+        fin.read((char *) &score, sizeof(score));
+        vocab.token_to_id[word] = i;
+        auto &tok_score = vocab.id_to_token[i];
+        tok_score.tok = word;
+        tok_score.score = score;
+    }
+    return true;
+}
+std::vector<llama_vocab::id> llama_tokenize(const llama_vocab & vocab, const std::string & text, bool bos) {
    llama_tokenizer tokenizer(vocab);
-    std::vector<gpt_vocab::id> output;
+    std::vector<llama_vocab::id> output;
    if (text.size() == 0) {
        return output;
@ -437,42 +437,22 @@ std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_v
    return output;
 }
-bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab) {
+void sample_top_k(std::vector<std::pair<double, llama_vocab::id>> & logits_id, int top_k) {
-    printf("%s: loading vocab from '%s'\n", __func__, fname.c_str());
-    vocab.token_to_id = ::json_parse(fname);
-    for (const auto & kv : vocab.token_to_id) {
-        vocab.id_to_token[kv.second] = kv.first;
-    }
-    printf("%s: vocab size = %d\n", __func__, (int) vocab.token_to_id.size());
-    // print the vocabulary
-    //for (auto kv : vocab.token_to_id) {
-    //    printf("'%s' -> %d\n", kv.first.data(), kv.second);
-    //}
-    return true;
-}
-void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k) {
    // find the top K tokens
    std::partial_sort(
            logits_id.begin(),
            logits_id.begin() + top_k, logits_id.end(),
-            [](const std::pair<double, gpt_vocab::id> & a, const std::pair<double, gpt_vocab::id> & b) {
+            [](const std::pair<double, llama_vocab::id> & a, const std::pair<double, llama_vocab::id> & b) {
        return a.first > b.first;
    });
    logits_id.resize(top_k);
 }
-gpt_vocab::id llama_sample_top_p_top_k(
+llama_vocab::id llama_sample_top_p_top_k(
-        const gpt_vocab & vocab,
+        const llama_vocab & vocab,
        const float * logits,
-        std::vector<gpt_vocab::id> & last_n_tokens,
+        std::vector<llama_vocab::id> & last_n_tokens,
        double repeat_penalty,
        int top_k,
        double top_p,
@ -480,7 +460,7 @@ gpt_vocab::id llama_sample_top_p_top_k(
        std::mt19937 & rng) {
    int n_logits = vocab.id_to_token.size();
-    std::vector<std::pair<double, gpt_vocab::id>> logits_id;
+    std::vector<std::pair<double, llama_vocab::id>> logits_id;
    logits_id.reserve(n_logits);
    {
@ -623,7 +603,7 @@ size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t
    char * pdst = (char *) dst;
-    for (int j = 0; j < n; j += k) { 
+    for (int j = 0; j < n; j += k) {
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pm = (uint8_t *) (pdst + (j/k)*row_size + 0*bs +   sizeof(float));
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
@ -646,7 +626,7 @@ size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t
                *(float *) pd = d;
                *(float *) pm = min;
-                pd += bs; 
+                pd += bs;
                pm += bs;
                for (int l = 0; l < qk; l += 2) {
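
The tokenizer changes in this file replace `std::string_view` symbols with a `const char *` pointer plus a byte count `n`, and split the input into UTF-8 characters by advancing an offset. The sketch below isolates that splitting step. The lookup table is copied from the `utf8_len` context lines in the diff, but its `return` statement is not visible in the hunk and is an assumption here. `utf8_span` and `split_utf8` are hypothetical names that stand in for `llama_sp_symbol` and the loop inside `llama_tokenizer::tokenize`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Byte length of a UTF-8 sequence, keyed by the high 4 bits of its first byte.
// The table comes from the diff; the return line is assumed.
static size_t utf8_len(char src) {
    const size_t lookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4 };
    uint8_t highbits = static_cast<uint8_t>(src) >> 4;
    return lookup[highbits];
}

// Hypothetical stand-in for the (text, n) pair of llama_sp_symbol,
// stored as an offset into the input instead of a raw pointer.
struct utf8_span {
    size_t offs;
    size_t n;
};

// Splits `text` into one span per UTF-8 character. A truncated trailing
// sequence is clamped to the remaining bytes, as in the diff's std::min call.
static std::vector<utf8_span> split_utf8(const std::string & text) {
    std::vector<utf8_span> out;
    size_t offs = 0;
    while (offs < text.size()) {
        const size_t n = std::min(text.size() - offs, utf8_len(text[offs]));
        out.push_back({ offs, n });
        offs += n;
    }
    return out;
}
```

Tracking an explicit length makes "already merged" easy to express as `n == 0`, which is what the updated merge loop tests instead of `text.empty()`.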
--- a/utils.h
+++ b/utils.h
@ -3,7 +3,7 @@
 #pragma once
 #include <string>
-#include <map>
+#include <unordered_map>
 #include <vector>
 #include <random>
 #include <thread>
@ -13,33 +13,34 @@
 //
 struct gpt_params {
-    int32_t seed      = -1; // RNG seed
+    int32_t seed          = -1;  // RNG seed
-    int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
+    int32_t n_threads     = std::min(4, (int32_t) std::thread::hardware_concurrency());
-    int32_t n_predict = 128; // new tokens to predict
+    int32_t n_predict     = 128; // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
-    int32_t n_ctx = 512; //context size
+    int32_t n_parts       = -1;  // amount of model parts (-1 = determine from model dimensions)
-    bool memory_f16 = false; // use f16 instead of f32 for memory kv
+    int32_t n_ctx         = 512; //context size
    // sampling parameters
    int32_t top_k = 40;
    float   top_p = 0.95f;
    float   temp  = 0.80f;
-    float   repeat_penalty  = 1.30f;
+    float   repeat_penalty  = 1.10f;
    int32_t n_batch = 8; // batch size for prompt processing
-    std::string model      = "models/lamma-7B/ggml-model.bin"; // model path
+    std::string model  = "models/lamma-7B/ggml-model.bin"; // model path
-    std::string prompt     = "";
+    std::string prompt = "";
-    bool random_prompt = false;
-    bool use_color = false; // use color to distinguish generations and inputs
-    bool interactive = false; // interactive mode
-    bool interactive_start = false; // reverse prompt immediately
    std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
-    bool instruct    = false; // instruction mode (used for Alpaca models)
+
-    bool ignore_eos = false; // do not stop generating after eos
+    bool memory_f16        = false; // use f16 instead of f32 for memory kv
+    bool random_prompt     = false; // do not randomize prompt if none provided
+    bool use_color         = false; // use color to distinguish generations and inputs
+    bool interactive       = false; // interactive mode
+    bool interactive_start = false; // reverse prompt immediately
+    bool instruct          = false; // instruction mode (used for Alpaca models)
+    bool ignore_eos        = false; // do not stop generating after eos
+    bool perplexity        = false; // compute perplexity over the prompt
 };
 bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
@ -48,52 +49,52 @@ void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
 std::string gpt_random_prompt(std::mt19937 & rng);
+//
+// Model file parsing
+//
+#define FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
+#define FILE_MAGIC 0x67676d66 // 'ggmf' in hex
+#define FILE_VERSION 1
 //
 // Vocab utils
 //
-struct gpt_vocab {
+struct llama_vocab {
    using id    = int32_t;
    using token = std::string;
-    std::map<token, id> token_to_id;
+    struct token_score {
-    std::map<id, token> id_to_token;
+        token tok;
-    std::map<id, float> score;
+        float score;
+    };
+    std::unordered_map<token, id> token_to_id;
+    std::vector<token_score> id_to_token;
 };
 void replace(std::string & str, const std::string & needle, const std::string & replacement);
 // poor-man's JSON parsing
-std::map<std::string, int32_t> json_parse(const std::string & fname);
+std::unordered_map<std::string, int32_t> json_parse(const std::string & fname);
-// split text into tokens
+// TODO: temporary until #77 is merged, need this now for some tokenizer tests
-//
+bool llama_vocab_load(const std::string & fname, llama_vocab & vocab);
-// ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
-//
-// Regex (Python):
-// r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
-//
-// Regex (C++):
-// R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)"
-//
-std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text);
 // TODO: this is probably wrong, but I cannot figure out how this tokenizer works ..
 // ref: https://github.com/google/sentencepiece
-std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, std::string_view text, bool bos);
+std::vector<llama_vocab::id> llama_tokenize(const llama_vocab & vocab, const std::string & text, bool bos);
-// load the tokens from encoder.json
-bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab);
 // sample next token given probabilities for each embedding
 //
 //   - consider only the top K tokens
 //   - from them, consider only the top tokens with cumulative probability > P
 //
-gpt_vocab::id llama_sample_top_p_top_k(
+llama_vocab::id llama_sample_top_p_top_k(
-        const gpt_vocab & vocab,
+        const llama_vocab & vocab,
        const float * logits,
-        std::vector<gpt_vocab::id> & last_n_tokens,
+        std::vector<llama_vocab::id> & last_n_tokens,
        double repeat_penalty,
        int top_k,
        double top_p,
@ -101,7 +102,7 @@ gpt_vocab::id llama_sample_top_p_top_k(
        std::mt19937 & rng);
 // filer to top K tokens from list of logits
-void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k);
+void sample_top_k(std::vector<std::pair<double, llama_vocab::id>> & logits_id, int top_k);
 //
 // Quantization