* ggml-zdnn: inital backend impl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: temp change z17 to arch15
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: fix build bugs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: tensor->extra logging check
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: add layout name mapping, ztensor information
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: separate logging into its own line
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: add shape comparison
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: add ggml_tensor shape log
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
ggml-zdnn: fix incorrect shape logging
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add output buffer check
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: run compute and store into tensor->extra
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add set_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add more loggers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: update set_tensor logging to check only for matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: last working matmul version
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add comments to prevent accidentally deleting lines
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: support op out_prod
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: update op out_prod to use tensor->extra
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rewrite the backend implementation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: bugfix new impl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix compiler warnings and bugfixes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: test ztensor finding in init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: implement at least 1 op to test
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: assign tensor->extra to buffer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add check for view tensors to prevent init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rework init_tensor to create new buffers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: switch to std vector instead of array
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: switch buffers back and set to arbitrary number
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: impl init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: update supports_op matmul matrix
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix incorrect ztensor shape, reduce memory padding
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: code clean up
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: impl matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix compiler error missing type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix missing data transform call
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add bias init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: tighten memory usage, change string allocation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add bias ztensor and data free
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add bias data transform
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add more debug info for extra buffer transform
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add logger to check if mat mul ops go through set_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: activate bias transform in matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: move weights transform into mulmat
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add more safeguards in matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix sequencing of transforms
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: bugfix transform ztensor vs origtensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: figure out why sigtrap is happening
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix sigsegv
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: move everything back to local declaration
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: move bias data to local also
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: bring back working matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: rewrite into mre
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix missing vector import
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix missing vector import in header
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: attempt to fix sigsegv
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix missing load tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix invalid ztensor buffer release
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add logging to debug free buffer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: remove free_buffer debug info
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add parmblkformat detections
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add nnpa installed detection
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add zdnn_init call for static libs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: attempt at fixing invalid buffer
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: switch to using deque to fix pointer deref problem
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add weights logging to check
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: attempt to use unique ptr
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add tensor to pre_tfm_desc logging
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add inputs logging
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: disable op_none initialisation for testing
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix missing return from init_tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: load ztensors in cgraph exec
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: work on moving output ztensor as well
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: disable logging and breakpoints for full test
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: attempt at manually changing the layout
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: attempt at using default nwhc format instead
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: disable global load ztensor for now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix errorenous output load tensor
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: add guards to prevent loading ztensor if transformed
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: code cleanup
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: bring load ztensor back to init routine
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: code clean up
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix ztensor deallocation abort
stabilise ggml <-> zdnn api
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: clean up matmul selection
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: clean up project structure
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: update documentation, prepare for upstream
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* chore: add codeowners
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: disable batched matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: attempt at fixing tensor views during matmul
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: deny all view tensors directly
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix pr comments
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* docs: update ops docs for zdnn
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: redo test-backend-ops for ops.md
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-zdnn: fix typo in build-s390x.md
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* codeowners: remove taronaeo for now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "codeowners: remove taronaeo for now"
This reverts commit 411ea4ed78
.
* ggml-zdnn: remove unused ggml_zdnn macro
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
12 KiB
Important
This build documentation is specific only to IBM Z & LinuxONE mainframes (s390x). You can find the build documentation for other architectures: build.md.
Build llama.cpp locally (for s390x)
The main product of this project is the llama
library. Its C-style interface can be found in include/llama.h.
The project also includes many example programs and tools using the llama
library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.
To get the code:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
CPU Build with BLAS
Building llama.cpp with BLAS support is highly recommended as it has shown to provide performance improvements. Make sure to have OpenBLAS installed in your environment.
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j $(nproc)
Notes:
-
For faster repeated compilation, install ccache
-
By default, VXE/VXE2 is enabled. To disable it (not recommended):
cmake -S . -B build \ -DCMAKE_BUILD_TYPE=Release \ -DGGML_BLAS=ON \ -DGGML_BLAS_VENDOR=OpenBLAS \ -DGGML_VXE=OFF cmake --build build --config Release -j $(nproc)
-
By default, NNPA is disabled by default. To enable it:
cmake -S . -B build \ -DCMAKE_BUILD_TYPE=Release \ -DGGML_BLAS=ON \ -DGGML_BLAS_VENDOR=OpenBLAS \ -DGGML_NNPA=ON cmake --build build --config Release -j $(nproc)
-
For debug builds:
cmake -S . -B build \ -DCMAKE_BUILD_TYPE=Debug \ -DGGML_BLAS=ON \ -DGGML_BLAS_VENDOR=OpenBLAS cmake --build build --config Debug -j $(nproc)
-
For static builds, add
-DBUILD_SHARED_LIBS=OFF
:cmake -S . -B build \ -DCMAKE_BUILD_TYPE=Release \ -DGGML_BLAS=ON \ -DGGML_BLAS_VENDOR=OpenBLAS \ -DBUILD_SHARED_LIBS=OFF cmake --build build --config Release -j $(nproc)
IBM zDNN Accelerator
This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the IBM zDNN library installed.
Compile from source from IBM
You may find the official build instructions here: Building and Installing zDNN
Compilation
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_ZDNN=ON
cmake --build build --config Release -j$(nproc)
Getting GGUF Models
All models need to be converted to Big-Endian. You can achieve this in three cases:
-
Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)
You can find popular models pre-converted and verified at s390x Verified Models or s390x Runnable Models.
These models have already been converted from
safetensors
toGGUF
Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later system. -
Convert safetensors model to GGUF Big-Endian directly (recommended)
The model you are trying to convert must be in
safetensors
file format (for example IBM Granite 3.3 2B). Make sure you have downloaded the model repository for this case.Ensure that you have installed the required packages in advance
pip3 install -r requirements.txt
Convert the
safetensors
model toGGUF
python3 convert_hf_to_gguf.py \ --outfile model-name-be.f16.gguf \ --outtype f16 \ --bigendian \ model-directory/
For example,
python3 convert_hf_to_gguf.py \ --outfile granite-3.3-2b-instruct-be.f16.gguf \ --outtype f16 \ --bigendian \ granite-3.3-2b-instruct/
-
Convert existing GGUF Little-Endian model to Big-Endian
The model you are trying to convert must be in
gguf
file format (for example IBM Granite 3.3 2B GGUF). Make sure you have downloaded the model file for this case.python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
For example,
python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
Notes:
- The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2.
IBM Accelerators
1. SIMD Acceleration
Only available in IBM z15/LinuxONE 3 or later system with the -DGGML_VXE=ON
(turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.
2. NNPA Vector Intrinsics Acceleration
Only available in IBM z16/LinuxONE 4 or later system with the -DGGML_NNPA=ON
(turned off by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
3. zDNN Accelerator (WIP)
Only available in IBM z17/LinuxONE 5 or later system with the -DGGML_ZDNN=ON
compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs will default back to CPU routines.
4. Spyre Accelerator
Only available with IBM z17 / LinuxONE 5 or later system. No support currently available.
Performance Tuning
1. Virtualization Setup
It is strongly recommended to use only LPAR (Type-1) virtualization to get the most performance.
Note: Type-2 virtualization is not supported at the moment, while you can get it running, the performance will not be the best.
2. IFL (Core) Count
It is recommended to allocate a minimum of 8 shared IFLs assigned to the LPAR. Increasing the IFL count past 8 shared IFLs will only improve Prompt Processing performance but not Token Generation.
Note: IFL count does not equate to vCPU count.
3. SMT vs NOSMT (Simultaneous Multithreading)
It is strongly recommended to disable SMT via the kernel boot parameters as it negatively affects performance. Please refer to your Linux distribution's guide on disabling SMT via kernel boot parameters.
4. BLAS vs NOBLAS
IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.
Frequently Asked Questions (FAQ)
-
I'm getting the following error message while trying to load a model:
gguf_init_from_file_impl: failed to load model: this GGUF file version 50331648 is extremely large, is there a mismatch between the host and model endianness?
Answer: Please ensure that the model you have downloaded/converted is GGUFv3 Big-Endian. These models are usually denoted with the
-be
suffix, i.e.,granite-3.3-2b-instruct-be.F16.gguf
.You may refer to the Getting GGUF Models section to manually convert a
safetensors
model toGGUF
Big Endian. -
I'm getting extremely poor performance when running inference on a model
Answer: Please refer to the Appendix B: SIMD Support Matrix to check if your model quantization is supported by SIMD acceleration.
-
I'm building on IBM z17 and getting the following error messages:
invalid switch -march=z17
Answer: Please ensure that your GCC compiler is of minimum GCC 15.1.0 version, and have
binutils
updated to the latest version. If this does not fix the problem, kindly open an issue. -
Failing to install the
sentencepiece
package using GCC 15+Answer: The
sentencepiece
team are aware of this as seen in this issue.As a temporary workaround, please run the installation command with the following environment variables.
export CXXFLAGS="-include cstdint"
For example,
CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
-
-DGGML_NNPA=ON
generates gibberish outputAnswer: We are aware of this as detailed in this issue. Please either try reducing the number of threads, or disable the compile option using
-DGGML_NNPA=OFF
.
Getting Help on IBM Z & LinuxONE
-
Bugs, Feature Requests
Please file an issue in llama.cpp and ensure that the title contains "s390x".
-
Other Questions
Please reach out directly to aionz@us.ibm.com.
Appendix A: Hardware Support Matrix
Support | Minimum Compiler Version | |
---|---|---|
IBM z15 | ✅ | |
IBM z16 | ✅ | |
IBM z17 | ✅ | GCC 15.1.0 |
IBM zDNN | ✅ |
- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support
Appendix B: SIMD Support Matrix
VX/VXE/VXE2 | NNPA | zDNN | Spyre | |
---|---|---|---|---|
FP32 | ✅ | ✅ | ✅ | ❓ |
FP16 | ✅ | ✅ | ❓ | ❓ |
BF16 | 🚫 | 🚫 | ❓ | ❓ |
Q4_0 | ✅ | ✅ | ❓ | ❓ |
Q4_1 | ✅ | ✅ | ❓ | ❓ |
Q5_0 | 🚫 | 🚫 | ❓ | ❓ |
Q5_1 | 🚫 | 🚫 | ❓ | ❓ |
Q8_0 | ✅ | ✅ | ❓ | ❓ |
Q2_K | 🚫 | 🚫 | ❓ | ❓ |
Q3_K | ✅ | ✅ | ❓ | ❓ |
Q4_K | ✅ | ✅ | ❓ | ❓ |
Q5_K | ✅ | ✅ | ❓ | ❓ |
Q6_K | ✅ | ✅ | ❓ | ❓ |
TQ1_0 | 🚫 | 🚫 | ❓ | ❓ |
TQ2_0 | 🚫 | 🚫 | ❓ | ❓ |
IQ2_XXS | 🚫 | 🚫 | ❓ | ❓ |
IQ2_XS | 🚫 | 🚫 | ❓ | ❓ |
IQ2_S | 🚫 | 🚫 | ❓ | ❓ |
IQ3_XXS | 🚫 | 🚫 | ❓ | ❓ |
IQ3_S | 🚫 | 🚫 | ❓ | ❓ |
IQ1_S | 🚫 | 🚫 | ❓ | ❓ |
IQ1_M | 🚫 | 🚫 | ❓ | ❓ |
IQ4_NL | ✅ | ✅ | ❓ | ❓ |
IQ4_XS | ✅ | ✅ | ❓ | ❓ |
FP32->FP16 | 🚫 | ✅ | ❓ | ❓ |
FP16->FP32 | 🚫 | ✅ | ❓ | ❓ |
- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself
Last Updated by Aaron Teo (aaron.teo1@ibm.com) on July 31, 2025.