This is useful for testing for regressions on GCN with CDNA hardware.
With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
llvm with the amdgcn target dose not support unrolling loops with conditional break statements, when those statements can not be resolved at compile time. Similar to other places in GGML lets simply ignore this warning.
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices.
Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries.
This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs
Gemma3n uses Matrix-Matrix addition as part of their input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size
of 1 is used.
* Exclude `project_per_layer_input` by matching node names
This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.
* Revert unnecessary formatting changes
* CUDA: add set rows for f32 and f16
* Review: change kernel params, use strides from host
* Use 1-d kernel
* Review: use int64_t for blockDim.x, rename nb->s for clarity
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size
* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontigous input/output
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels
ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints
ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci]
ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (#14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* CUDA: add bf16 and f32 support to cublas_mul_mat_batched
* Review: add type traits and make function more generic
* Review: make check more explicit, add back comments, and fix formatting
* Review: fix formatting, remove useless type conversion, fix naming for bools