# llama.cpp/tools/imatrix
Compute an importance matrix for a model and given text dataset. Can be used during quantization to enhance the quality of the quantized models. More information is available in https://github.com/ggml-org/llama.cpp/pull/4861.
## Usage
```bash
./llama-imatrix \
    -m model.gguf -f some-text.txt [-o imatrix.gguf] [--no-ppl] \
    [--process-output] [--chunk 123] [--save-frequency 0] [--output-frequency 10] \
    [--in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf ...] [--parse-special] \
    [--show-statistics] [...]
```
Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (such as e.g. `wiki.train.raw`) are mandatory.
The parameters in square brackets are optional and have the following meaning:
- `-h | --help` shows usage information and exits.
- `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. The default verbosity level is `1`.
- `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing, `imatrix.gguf` is used.
- `-ofreq | --output-frequency` specifies how often the results computed so far are saved to disk. Default is 10 (i.e., every 10 chunks).
- `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never).
- `--process-output` specifies if data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
- `--in-file` one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
- `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
- `--chunk | --from-chunk` skips the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
- `--chunks` maximum number of chunks to process. Default is -1 for all available chunks.
- `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up the processing and do not care about perplexity.
- `--show-statistics` displays the imatrix file's statistics.
For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.
Recent versions of `llama-imatrix` store data in GGUF format by default. For the legacy format, use an extension other than `.gguf` when saving the output file. More information is available in https://github.com/ggml-org/llama.cpp/pull/9400.
## Examples
```bash
# generate importance matrix using the default filename (imatrix.gguf), offloading 99 layers to GPU
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99

# use the imatrix to perform a Q4_K_M quantization
./llama-quantize --imatrix imatrix.gguf ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m

# generate and save the imatrix using the legacy format
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -o imatrix-legacy-format.dat -ngl 99

# convert a legacy (binary) imatrix file to the new (GGUF) format
./llama-imatrix --in-file imatrix-legacy-format.dat -o imatrix-new-format.gguf

# combine existing imatrices
./llama-imatrix --in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf -o imatrix-combined.gguf

# skip the first 5 chunks, save intermediate results every 20 chunks and snapshots every 50, parsing special tokens
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special

# analyse an imatrix file and display summary statistics instead of running inference
./llama-imatrix --in-file imatrix.gguf --show-statistics
```
`--show-statistics` will display the following statistics:
### Per tensor
- Σ(Act²): sum of all squared activations (the importance scores)
- Min & Max: minimum and maximum squared activation values
- μ & σ: mean and standard deviation of the squared activations
- % Active: proportion of elements whose average squared activation exceeds a small threshold (1e-5). Helpful to determine how alive/dormant the tensor is during inference
- N: number of squared activations
- Entropy: entropy of the squared activation distribution, in bits (standard Shannon entropy measurement)

  $$S = -\sum_{i=1}^N p_i \log_2 p_i$$
- E (norm): normalized entropy

  $$E(norm) = \frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$$

  These two metrics can be used to determine how well a prompt "exercises" the model's capabilities (see the sketch after this list)
- ZD Score: z-score distribution as described in section 3.1 (Layer Importance Scores) of Layer-Wise Quantization
- CosSim: cosine similarity with respect to the previous layer's tensor. Useful to determine how similar the squared activations of the current layer are to the previous layer's squared activations.
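The entropy, normalized entropy, and cosine-similarity metrics above follow directly from the formulas given. The following is a minimal, self-contained C++ sketch of those computations, assuming the per-tensor squared activations are available as a plain vector of floats; it is illustrative only, not the tool's actual implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Shannon entropy (in bits) of the distribution obtained by normalizing
// the squared activations so they sum to 1.
static double entropy_bits(const std::vector<float> & act2) {
    const double total = std::accumulate(act2.begin(), act2.end(), 0.0);
    if (total <= 0.0) return 0.0;
    double S = 0.0;
    for (float v : act2) {
        const double p = v / total;
        if (p > 0.0) S -= p * std::log2(p);
    }
    return S;
}

// Normalized entropy: divide by log2(N) so the result lies in [0, 1].
static double entropy_norm(const std::vector<float> & act2) {
    return act2.size() > 1 ? entropy_bits(act2) / std::log2((double) act2.size()) : 0.0;
}

// Cosine similarity between the squared activations of two tensors
// (e.g., the current and previous layer); vectors must have equal length.
static double cossim(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    return (na > 0.0 && nb > 0.0) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.0;
}
```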
### Per layer
Weighted averages of Σ(Act²), ZD Score and CosSim are also calculated.
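For illustration, a per-layer weighted average could look like the sketch below. Weighting each tensor's statistic by its element count is a hypothetical assumption made for this example, not a description of what `llama-imatrix` does internally:

```cpp
#include <cstddef>
#include <vector>

struct tensor_stat {
    double value;   // e.g. the tensor's Σ(Act²), ZD Score, or CosSim
    size_t n_elem;  // number of squared activations in the tensor (assumed weight)
};

// Weighted average over all tensors in a layer; returns 0 if empty.
static double layer_weighted_avg(const std::vector<tensor_stat> & stats) {
    double sum = 0.0, weight = 0.0;
    for (const auto & s : stats) {
        sum    += s.value * (double) s.n_elem;
        weight += (double) s.n_elem;
    }
    return weight > 0.0 ? sum / weight : 0.0;
}
```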
### Important note on the computed statistics
When using these statistics, please note that they are computed on the squared activations, not on the actual (raw) activations. Whilst the results are still useful, they're less reliable than using the raw values, and in the case of the cosine similarity, could be misleading if the tensor contains opposite vectors.