* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments
---------
Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit.
Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
This is useful for testing for regressions on GCN with CDNA hardware.
With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
llvm with the amdgcn target dose not support unrolling loops with conditional break statements, when those statements can not be resolved at compile time. Similar to other places in GGML lets simply ignore this warning.
* remove redundant code in riscv
* remove redundant code in arm
* remove redundant code in loongarch
* remove redundant code in ppc
* remove redundant code in s390
* remove redundant code in wasm
* remove redundant code in x86
* remove fallback headers
* fix x86 ggml_vec_dot_q8_0_q8_0
* SYCL: Add set_rows support for quantized types
This commit adds support for GGML_OP_SET_ROWS operation for various
quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16
type in the SYCL backend.
The quantization/dequantization copy kernels were moved from cpy.cpp
to cpy.hpp to make them available for set_rows.cpp.
This addresses part of the TODOs mentioned in the code.
* Use get_global_linear_id() instead
ggml-ci
* Fix formatting
ggml-ci
* Use const for ne11 and size_t variables in set_rows_sycl_q
ggml-ci
* Increase block size for q kernel to 256
ggml-ci
* Cleanup imports
* Add float.h to cpy.hpp
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices.
Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries.
This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.
* feat: Add s_off as a parameter in the args struct
This may not be necessary, but it more closely mirrors the CUDA kernel
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state
This is a first attempt at optimizing the metal kernel. The changes here
are:
- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
computation
When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Update logic to correctly do the multi-layer parallel sum
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Correctly size the shared memory bufer and assert expected size relationships
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Compute block offsets once rather than once per token
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Use local variable for state recursion
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Use a secondary simd_sum instead of a for loop
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add assertion and comment about relationship between simd size and num simd groups
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parallelize of d_state for mamba-1
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parallel sum in SSM_CONV
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Revert "feat: Parallel sum in SSM_CONV"
After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.
https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357
This reverts commit 16bc059660.
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Simplify shared memory sizing
Branch: GraniteFourPerf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):
> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".
I tested locally for it to fall back to default portPos specifier if an
invalid portPos is specified. As a consequence, we can remove associated
code.
* CMake config: Create target only once
Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.
* CMake config: Add CUDA link libs
* CMake config: Add OpenCL link libs
* CMake config: Use canonical find_dependency
Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.
* CMake config: Wire OpenMP dependency
This commit removes the inclusion of `<cstdlib>`.
The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
* weight format to nz for 310p
* remove quant weight format to nz
* clean code
* fix
* make the conditions for converting weights to NZ format consistent
* clean code