llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-06-27 20:05:20 +00:00

Author	SHA1	Message	Date
Jeff Bolz	0090950f67	vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833 ) q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap. This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0. The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.	2025-04-09 07:25:08 +02:00
Jeff Bolz	80b717d493	vulkan: Use unclamped loads for flash attention mask (#12720 ) nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.	2025-04-06 10:47:13 +02:00
0cc4m	6bf28f0111	Vulkan: Tune Vulkan mmq int dot shader for performance (#12767 )	2025-04-05 18:04:03 +02:00
Jeff Bolz	74d4f5b041	vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630 ) There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increased variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete and we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.	2025-04-04 07:54:35 +02:00
Jeff Bolz	f01bd02376	vulkan: Implement split_k for coopmat2 flash attention. (#12627 ) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-02 14:25:08 -05:00
Jeff Bolz	be0a0f8cae	vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559 ) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-02 19:40:32 +02:00
Wagner Bruna	2bb3597e42	vulkan: fix build when glslc doesn't support coopmat (#12683 )	2025-04-01 11:38:07 +02:00
0cc4m	a8a1f33567	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135 ) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-03-31 14:37:01 +02:00
Georgi Gerganov	b4ae50810e	metal : improve FA + improve MoE (#12612 ) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 20:21:59 +02:00
Jeff Bolz	eddfb43850	vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505 ) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-22 09:40:11 +01:00
stduhpf	4375415b4a	Vulkan: RTE rounding for cpy to quant (#12480 ) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-21 20:34:50 +01:00
Jeff Bolz	c446b2edd2	vulkan: Submit once enough matmul work has been recorded (#12406 ) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-19 08:26:26 +01:00
0cc4m	fd123cfead	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434 )	2025-03-18 07:21:40 +01:00
Molly Sophia	7dfad387e3	llama: Add support for RWKV v7 architecture (#12412 ) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-18 07:27:50 +08:00
Jeff Bolz	484a8ab513	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312 )	2025-03-17 09:26:18 -05:00
Daniele	cf2270e4d3	vulkan: subgroup size tuning (#12087 ) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-17 12:42:33 +01:00
Jeff Bolz	891c63956d	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273 ) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-17 10:41:59 +01:00
Jeff Bolz	2f21123c1d	vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258 )	2025-03-17 10:35:00 +01:00
cmdr2	0cbee131ad	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) ggml-ci	2025-03-03 18:18:11 +02:00
William Tambellini	70680c48e5	ggml : upgrade init_tensor API to return a ggml_status (#11854 ) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return an OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-02-28 14:41:47 +01:00
Rémy O	438a83926a	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595 ) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-02-28 09:42:52 +01:00
Jeff Bolz	a82c9e7c23	vulkan: fix assertion when qy_needs_dequant (#12068 ) Looks like a copy/paste bug from qx_needs_dequant.	2025-02-25 16:30:21 +01:00
Judd	c132239bfb	add OP sigmoid (#12056 ) Co-authored-by: Judd <foldl@boxvest.com>	2025-02-25 12:32:20 +01:00
Rémy O	61d4f39dfe	vulkan: implement more backpropagation operators (#11914 ) * vulkan: implement GGML_OP_ROPE_BACK * vulkan: implement GGML_OP_RMS_NORM_BACK * vulkan: implement GGML_OP_SILU_BACK * vulkan: implement GGML_OP_SOFTMAX_BACK	2025-02-25 12:04:45 +01:00
Rémy O	2eea03d86a	vulkan: implement several ops relevant for ggml_opt (#11769 ) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-17 07:55:57 +01:00
Jeff Bolz	bf42a23d0a	vulkan: support multi/vision rope, and noncontiguous rope (#11902 )	2025-02-16 08:52:23 +01:00
Rémy O	fc1b0d0936	vulkan: initial support for IQ1_S and IQ1_M quantizations (#11528 ) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-15 09:01:40 +01:00
Eve	a4f011e8d0	vulkan: linux builds + small subgroup size fixes (#11767 ) * mm subgroup size * upload vulkan x86 builds	2025-02-14 02:59:40 +00:00
Danny Milosavljevic	c2a67efe38	vulkan: Make Vulkan optional at runtime (#11493 ). (#11494 ) Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-10 07:17:21 +01:00
Wagner Bruna	b044a0fe3c	vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (#11592 )	2025-02-10 07:08:22 +01:00
Jeff Bolz	98f6b0fd1e	vulkan: account for lookup tables when checking shared memory size (#11502 )	2025-02-09 08:43:51 +01:00
Jeff Bolz	c026ba3c23	vulkan: print shared memory size (#11719 )	2025-02-07 11:26:03 +01:00
Rémy O	8a7e3bf17a	vulkan: initial support for IQ4_XS quantization (#11501 )	2025-02-06 07:09:59 +01:00
Jeff Bolz	1b598b3058	vulkan: use smaller combined allocations to avoid fragmentation (#11551 )	2025-02-06 07:02:18 +01:00
Johannes Gäßler	fd08255d0d	CUDA: non-contiguous (RMS) norm support (#11659 ) * CUDA: non-contiguous (RMS) norm support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-04 22:21:42 +01:00
Rémy Oudompheng	66ee4f297c	vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360 ) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-01-29 18:29:39 +01:00
Jeff Bolz	2711d0215f	vulkan: Catch pipeline creation failure and print an error message (#11436 ) * vulkan: Catch pipeline creation failure and print an error message Also, fix some warnings from my on-demand compile change. * vulkan: fix pipeline creation logging	2025-01-29 09:26:50 -06:00
Jeff Bolz	4a75d19376	vulkan: compile shaders on-demand (#11406 ) Reduce first-run startup time and memory consumption. Should fix #11339.	2025-01-25 22:29:57 +01:00
amd-dwang	955a6c2d91	Vulkan-run-test: fix mmq_wg_denoms (#11343 ) There appears to be a copy-and-paste error here. mmq_wg_denoms should be used together with warptile_mmq, instead of wg_denoms.	2025-01-23 08:14:28 +01:00
Jeff Bolz	5245729e33	vulkan: fix diag_mask_inf (#11323 ) With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.	2025-01-23 08:01:17 +01:00
Jeff Bolz	aea8ddd516	vulkan: fix coopmat2 validation failures (#11284 ) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert and this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-01-20 10:38:32 -06:00
Jeff Bolz	44e18ef939	vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281 ) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-01-18 09:26:50 +01:00
Jeff Bolz	bd38ddea01	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11166 ) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-01-16 22:47:10 +01:00
0cc4m	c3f9d25706	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (#11161 ) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-10 06:39:33 +01:00
Molly Sophia	ee7136c6d1	llama: add support for QRWKV6 model architecture (#11001 ) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-10 09:58:08 +08:00
Mathieu Baudier	02f0430141	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (#11117 ) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive	2025-01-08 09:18:13 +01:00
0cc4m	b56f079e28	Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (#11074 ) * Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver * Add (TM) to AMD name check	2025-01-04 21:09:59 +01:00
Jeff Bolz	716bd6dec3	vulkan: optimize mul_mat for small values of N (#10991 ) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2024-12-30 18:27:11 +01:00
Jeff Bolz	a813badbbd	vulkan: im2col and matmul optimizations for stable diffusion (#10942 ) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2024-12-29 10:16:34 +01:00
Jeff Bolz	fdd2188912	vulkan: Use push constant offset to handle misaligned descriptors (#10987 )	2024-12-29 09:35:11 +01:00
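
The workgroup arithmetic described in the grouped query attention commit (be0a0f8cae) can be sketched as below. This is only an illustration of the counting, not the shader's dispatch code: the function name gqa_workgroups and its parameters are hypothetical, and it assumes the number of Q heads is an exact multiple of the number of K/V heads.

```python
def gqa_workgroups(n_q_heads: int, n_kv_heads: int) -> tuple[int, int]:
    """Return (workgroups, results_per_workgroup) when adjacent Q heads
    that share the same K/V head are batched into one workgroup.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    if n_kv_heads <= 0 or n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a positive multiple of n_kv_heads")
    # Each K/V head serves a group of this many adjacent Q heads.
    group_size = n_q_heads // n_kv_heads
    # One workgroup per K/V head, each producing group_size results.
    return n_kv_heads, group_size


# Shapes from the commit message: q has 32 heads, k and v have 8 heads.
workgroups, results_each = gqa_workgroups(32, 8)
print(workgroups, results_each)  # 8 4
```

Without grouping, the same shapes would dispatch 32 workgroups computing 1 result each, which is the "previously" case in the commit message.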


77 Commits