The llama_set_state_data function restores the RNG state to what it
was at the time llama_copy_state_data was called. However, users may
want to restore the state and then proceed with a different seed.
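A minimal sketch of that workflow (assuming `llama_set_rng_seed` is available alongside the state API; error handling omitted):

```c
#include <stdint.h>
#include "llama.h"

// Restore a previously saved state, then override the restored RNG
// so generation proceeds with a different seed.
void restore_with_new_seed(struct llama_context * ctx, uint8_t * saved, int new_seed) {
    llama_set_state_data(ctx, saved);  // also brings back the old RNG state
    llama_set_rng_seed(ctx, new_seed); // proceed with a fresh seed
}
```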
* Updated build information
First update to the build instructions to include BLAS.
* Update README.md
* Update information about BLAS
* Better BLAS explanation
Added a clearer BLAS explanation and a link to download the CUDA toolkit.
* Better BLAS explanation
* BLAS for Mac
Noted that BLAS is already supported on macOS via the Accelerate framework.
* Clarify the effect of BLAS
* Windows Make instructions
Added instructions for building with Make on Windows.
* Fixing typo
* Fix trailing whitespace
* Allow `quantize` to accept the quantization type by name instead of `int` (the `int` option is still supported)
This allows the following usage:
`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`
instead of:
`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
* Use full range for q4_0 quantization
By keeping the sign of the highest-magnitude value, we can make sure
that value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.
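A scalar sketch of the idea (illustrative names, not the exact ggml internals; the real format also packs two 4-bit codes per byte rather than one):

```c
#include <math.h>
#include <stdint.h>

#define QK 32 // q4_0 block size at the time

void quantize_block_q4_0_full_range(const float * x, uint8_t * q, float * d_out) {
    float amax = 0.0f; // largest magnitude in the block
    float max  = 0.0f; // the value with that magnitude, sign preserved
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > amax) {
            amax = fabsf(x[i]);
            max  = x[i];
        }
    }
    // Dividing by -8 maps the highest-magnitude value exactly onto code -8,
    // so all sixteen codes in [-8, 7] become usable.
    const float d  = max / -8.0f;
    const float id = d ? 1.0f/d : 0.0f;
    *d_out = d;
    for (int i = 0; i < QK; ++i) {
        const float v = x[i]*id + 8.5f;          // shift into [0.5, 16.5]
        q[i] = (uint8_t)(v < 15.0f ? v : 15.0f); // clamp the top code to 15
    }
}
```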
* Update quantize_row_q4_0 for AVX/AVX2
* Update quantize_row_q4_0 for WASM
Untested
* Update quantize_row_q4_0 for Arm NEON
* Update quantize_row_q4_0 for PowerPC
Untested
* Use full range for q4_2 quantization
* add save_load_state example
* use `<cstdio>` instead of `<iostream>` and `fprintf`/`printf` instead of `cout`
* renamed save-load-state example files, replacing underscores with dashes
* Unit test for quantization functions
Use the `ggml_internal_get_quantize_fn` function to loop through all
quantization formats and run a sanity check on the result.
Also add a microbenchmark that times these functions directly without
running the rest of the GGML graph.
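A hedged sketch of the round-trip check (the `quantize_fns_t` field names and the error metric are assumptions about the internal table, not the actual test code):

```c
#include "ggml.h"

void roundtrip_all_formats(const float * src, float * dst, void * tmp, int n) {
    for (int i = 0; i < GGML_TYPE_COUNT; ++i) {
        quantize_fns_t qfns = ggml_internal_get_quantize_fn(i);
        if (!qfns.quantize_row_q || !qfns.dequantize_row_q) {
            continue; // this type has no quantization kernels
        }
        qfns.quantize_row_q(src, tmp, n);   // float -> quantized blocks
        qfns.dequantize_row_q(tmp, dst, n); // quantized blocks -> float
        double err = 0.0;
        for (int j = 0; j < n; ++j) {
            err += (double)(src[j] - dst[j])*(src[j] - dst[j]);
        }
        // compare sqrt(err/n) against a per-format RMSE threshold ...
    }
}
```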
* test-quantize-fns: CI fixes
Fix issues uncovered in CI:
- need to use sizes divisible by 32*8 for loop unrolling
- use intrinsic header that should work on Mac
* test-quantize: remove
Per PR comment, subsumed by test-quantize-fns
* test-quantize: fix for q8_0 intermediates
* set default n_batch to 512 when using BLAS
* spacing
* alternate implementation of setting a different n_batch for BLAS
* set n_batch to 512 for all cases
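The intermediate BLAS-aware variant looked roughly like this sketch (`ggml_cpu_has_blas()` is the real ggml helper; the non-BLAS value of 8 matches the old default but is otherwise illustrative, and the final commit settled on 512 unconditionally):

```c
#include "ggml.h"

// BLAS-aware default batch size (superseded by "512 for all cases").
int default_n_batch(void) {
    return ggml_cpu_has_blas() ? 512 : 8;
}
```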
* ggml : prefer vzip to vuzp
This way we always use the same type of instruction across all quantizations
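For reference, the interleaving that vzip performs (AArch64 intrinsics; illustrative only, not the actual dot-product code):

```c
#include <arm_neon.h>

// vzip1q_u8(a, b) = {a0, b0, a1, b1, ..., a7, b7}
// vzip2q_u8(a, b) = {a8, b8, a9, b9, ..., a15, b15}
uint8x16_t interleave_lo(uint8x16_t a, uint8x16_t b) { return vzip1q_u8(a, b); }
uint8x16_t interleave_hi(uint8x16_t a, uint8x16_t b) { return vzip2q_u8(a, b); }
```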
* ggml : alternative Q4_3 implementation using modified Q8_0
* ggml : fix Q4_3 scalar implementation
* ggml : slight improvement of Q4_3 - no need for loop unrolling
* ggml : fix AVX paths for Q8_0 quantization
* Moving parameters to separate lines for readability.
* Increasing `repeat_penalty` to 1.1 to make Alpaca more usable by default.
* Adding trailing newline.
* reserve correct size for logits
* add functions to get and set the whole llama state,
including rng, logits, embedding and kv_cache
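A minimal save/restore sketch using these entry points (`llama_get_state_size`, `llama_copy_state_data`, `llama_set_state_data`; error handling omitted, and the destination context must be created from the same model and parameters):

```c
#include <stdlib.h>
#include <stdint.h>
#include "llama.h"

void save_and_restore(struct llama_context * src_ctx, struct llama_context * dst_ctx) {
    const size_t n_state = llama_get_state_size(src_ctx);
    uint8_t * state = malloc(n_state);
    llama_copy_state_data(src_ctx, state); // rng, logits, embedding, kv_cache
    llama_set_state_data(dst_ctx, state);  // restore into the other context
    free(state);
}
```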
* remove unused variables
* remove trailing whitespace
* fix comment