llama_print_timings: load time = 3021.72 ms
llama_print_timings: sample time = 128.90 ms / 128 runs ( 1.01 ms per token)
llama_print_timings: prompt eval time = 2826.35 ms / 8 tokens ( 353.29 ms per token)
llama_print_timings: eval time = 53198.13 ms / 127 runs ( 418.88 ms per token)
llama_print_timings: total time = 56380.69 ms
- Removed the first accumulation
Ideas taken from here: https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/
llama_print_timings: load time = 3087.59 ms
llama_print_timings: sample time = 132.04 ms / 128 runs ( 1.03 ms per token)
llama_print_timings: prompt eval time = 2894.28 ms / 8 tokens ( 361.78 ms per token)
llama_print_timings: eval time = 58529.67 ms / 127 runs ( 460.86 ms per token)
llama_print_timings: total time = 61780.98 ms
- Accumulate into two accumulators instead of one (see the sketch after the timings below)
llama_print_timings: load time = 3137.95 ms
llama_print_timings: sample time = 132.54 ms / 128 runs ( 1.04 ms per token)
llama_print_timings: prompt eval time = 2943.22 ms / 8 tokens ( 367.90 ms per token)
llama_print_timings: eval time = 59539.50 ms / 127 runs ( 468.81 ms per token)
llama_print_timings: total time = 62843.23 ms
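A minimal sketch of the two-accumulator idea, assuming a plain AVX2/FMA f32 dot product; the function name `dot_f32` and the loop shape are made up for illustration and are not the actual ggml kernel:

```c
#include <immintrin.h>

// Two independent partial sums let consecutive FMAs overlap instead of
// serializing on a single register's dependency chain.
static float dot_f32(const float *x, const float *y, int n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i),     _mm256_loadu_ps(y + i),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i + 8), _mm256_loadu_ps(y + i + 8), acc1);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, _mm256_add_ps(acc0, acc1));  // combine the partial sums
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    for (; i < n; ++i) {                                // scalar tail
        sum += x[i] * y[i];
    }
    return sum;
}
```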
Reference:
llama_print_timings: load time = 14251.20 ms
llama_print_timings: sample time = 129.15 ms / 128 runs ( 1.01 ms per token)
llama_print_timings: prompt eval time = 14050.58 ms / 8 tokens ( 1756.32 ms per token)
llama_print_timings: eval time = 238504.60 ms / 127 runs ( 1877.99 ms per token)
llama_print_timings: total time = 252916.56 ms
SSE3 instructions:
llama_print_timings: load time = 3349.09 ms
llama_print_timings: sample time = 53.06 ms / 52 runs ( 1.02 ms per token)
llama_print_timings: prompt eval time = 3154.19 ms / 8 tokens ( 394.27 ms per token)
llama_print_timings: eval time = 23759.20 ms / 51 runs ( 465.87 ms per token)
llama_print_timings: total time = 27174.93 ms
* main : add option to save full output to session
* split behavior into --session and --prompt-cache
* restore original implementation with new names
* PR comments
* move the check for incompatible parameters to gpt_params_parse
* Fix whitespace
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
---------
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
* use the pause asm instruction in the busy loop to run the CPU (13600K) 10 °C cooler
Tested with a 13B model.
* use _mm_pause() in busy loop
* use _mm_pause() in busy loop on x86_64 to reduce power consumption (see the sketch below)
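A hedged sketch of the busy-wait pattern the bullets above describe; the flag, function name, and loop structure are invented for illustration, only the `_mm_pause()` hint itself is the point:

```c
#include <emmintrin.h>   // _mm_pause
#include <stdatomic.h>

// Spin until another thread sets the flag; the pause hint tells the core
// this is a spin-wait, which lowers power draw (and heat) while polling.
static void spin_wait(const atomic_int *flag) {
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        _mm_pause();
    }
}
```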
* when loading a safetensors file, ignore the metadata header
* check for safetensors files first, and only use PyTorch versions when safetensors aren't available
Minor edit in ggml.c, which originally prevented OpenCL from loading at all if GGML_USE_ACCELERATE was defined (see the sketch below).
Minor speedup in prompt eval time.
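A guess at the kind of preprocessor interaction behind the ggml.c note above; the exact chain in ggml.c may differ, this only illustrates how an exclusive #elif chain keeps the OpenCL (CLBlast) path from being compiled in once GGML_USE_ACCELERATE is defined:

```c
// Before (illustrative): exclusive chain, ACCELERATE wins and the
// CLBlast/OpenCL branch is never built.
#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#elif defined(GGML_USE_CLBLAST)
#include "ggml-opencl.h"
#endif

// After (illustrative): independent checks, so both can be enabled.
#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#endif
#if defined(GGML_USE_CLBLAST)
#include "ggml-opencl.h"
#endif
```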
* Line 698 has a stray @staticmethod that it should not have;
left in place, it makes unpickle.load() throw an error because the method is not callable
* Update convert.py
---------
Co-authored-by: Ivan Stepanov <ivanstepanovftw@gmail.com>
* change immintrin.h to intrin.h for compatibility
Building on Windows 11 ARM throws an error on this line. It seems that using intrin.h covers both x86 and ARM.
* conditionally include intrin.h (see the sketch below)
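A sketch of the conditional include being described, assuming MSVC is where intrin.h is wanted; the actual guard in ggml.c may differ:

```c
#if defined(_MSC_VER)
// on MSVC, intrin.h covers both x86 and ARM intrinsics
#include <intrin.h>
#else
// other compilers keep the x86 header
#include <immintrin.h>
#endif
```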
* fix typo in ggml.c
* fix reverse prompt and multi-line input
* Code Formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Python script to verify the checksums of the LLaMA models
Added a Python script for verifying the SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output for better readability.
* Update README.md
Update the README for improved readability and to explain the usage of the Python checksum verification script.
* update the verification script
I've extended the script based on suggestions by @prusnak.
The script now checks the available RAM; if there is enough to check the file at once, it will do so. If not, the file is read in chunks.
* minor improvement
Small change so that the available RAM is checked instead of the total RAM.
* remove the part of the code that reads the file at once if enough RAM is available
Based on suggestions from @prusnak, I removed the part of the code that checks whether the user has enough RAM to read the entire model at once. The file is now always read in chunks.
* Update verify-checksum-models.py
Quick fix to pass the git check.