* llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner. * llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when runnning Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON * cuda : graceful fallback for Mamba-1 models with weird embd size
gguf
This is a Python package for writing binary files in the GGUF (GGML Universal File) format.
See convert_hf_to_gguf.py as an example for its usage.
Installation
pip install gguf
Optionally, you can install gguf with the extra 'gui' to enable the visual GGUF editor.
pip install gguf[gui]
API Examples/Simple Tools
examples/writer.py — Generates example.gguf
in the current directory to demonstrate generating a GGUF file. Note that this file cannot be used as a model.
examples/reader.py — Extracts and displays key-value pairs and tensor details from a GGUF file in a readable format.
gguf/scripts/gguf_dump.py — Dumps a GGUF file's metadata to the console.
gguf/scripts/gguf_set_metadata.py — Allows changing simple metadata values in a GGUF file by key.
gguf/scripts/gguf_convert_endian.py — Allows converting the endianness of GGUF files.
gguf/scripts/gguf_new_metadata.py — Copies a GGUF file with added/modified/removed metadata values.
gguf/scripts/gguf_editor_gui.py — Allows for viewing, editing, adding, or removing metadata values within a GGUF file as well as viewing its tensors with a Qt interface.
Development
Maintainers who participate in development of this package are advised to install it in editable mode:
cd /path/to/llama.cpp/gguf-py
pip install --editable .
Note: This may require to upgrade your Pip installation, with a message saying that editable installation currently requires setup.py
.
In this case, upgrade Pip to the latest:
pip install --upgrade pip
Automatic publishing with CI
There's a GitHub workflow to make a release automatically upon creation of tags in a specified format.
- Bump the version in
pyproject.toml
. - Create a tag named
gguf-vx.x.x
wherex.x.x
is the semantic version number.
git tag -a gguf-v1.0.0 -m "Version 1.0 release"
- Push the tags.
git push origin --tags
Manual publishing
If you want to publish the package manually for any reason, you need to have twine
and build
installed:
pip install build twine
Then, follow these steps to release a new version:
- Bump the version in
pyproject.toml
. - Build the package:
python -m build
- Upload the generated distribution archives:
python -m twine upload dist/*
Run Unit Tests
From root of this repository you can run this command to run all the unit tests
python -m unittest discover ./gguf-py -v
TODO
- Include conversion scripts as command line entry points in this package.