quantize : add '--keep-split' to quantize model into shards (#6688)

* Implement '--keep-split' to quantize model into several shards

* Add test script

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Split model correctly even if tensor id is out-of-order

* Update llama_model_quantize_params

* Fix CI failures

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
jiez 2024-04-25 18:29:35 +08:00 committed by GitHub
parent 784e11dea1
commit 1966eb2615
4 changed files with 148 additions and 26 deletions
@@ -288,6 +288,7 @@ extern "C" {
bool quantize_output_tensor; // quantize output.weight
bool only_copy; // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
bool pure; // quantize all tensors to the default type
bool keep_split; // quantize to the same number of shards
void * imatrix; // pointer to importance matrix data
void * kv_overrides; // pointer to vector containing overrides
} llama_model_quantize_params;