976 Commits

31f7803bc4 ggml-cpu-impl.h: do not redefine bool on POWER9 (#12856)
error: unknown type name '_Bool'
2025-04-10 01:00:34 +02:00
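A plausible sketch of the guard's shape (the actual change is in ggml-cpu-impl.h and may differ in detail): on POWER9, altivec.h turns `bool` into a macro, and `_Bool` exists only in C, so the redefinition must be limited to C translation units.

```cpp
#if defined(__POWER9_VECTOR__)
#include <altivec.h>
#undef bool            // drop altivec's macro
#ifndef __cplusplus
#define bool _Bool     // C only: C++ has a built-in bool and no _Bool type
#endif
#endif
```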
2391506ace ggml-impl.h: fix build on POWER9 (#12855)
error: ISO C++17 does not allow 'register' storage class specifier
2025-04-10 01:00:25 +02:00
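For context: C++17 removed the `register` storage class specifier, so a declaration that is merely deprecated in C becomes a hard error under -std=c++17. The fix is simply to drop the specifier, e.g.:

```cpp
// register float sum = 0.0f;  // error: ISO C++17 does not allow 'register'
float sum = 0.0f;              // fine in both C and C++; the hint was ignored anyway
```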
6e1c4cebdb CANN: Support Opt CONV_TRANSPOSE_1D and ELU (#12786)
* [CANN] Support ELU and CONV_TRANSPOSE_1D

* [CANN] Address review comments

* [CANN] Address review comments

* [CANN] Name adjustment

* [CANN] Remove lambda used in template

* [CANN] Use std::function instead of template

* [CANN]Modify the code according to the review comments

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-09 14:04:14 +08:00
0090950f67 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
2025-04-09 07:25:08 +02:00
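In shader terms the change is roughly the following pattern, modeled here as scalar C++ (block sizes, names, and the decode below are illustrative stand-ins, not the real q4_k layout or shader code): decode the per-block scales once per outer block into a cache that plays the role of shared memory, instead of re-decoding them on every inner-loop iteration.

```cpp
#include <cstdint>

struct Block { uint8_t packed_scales[16]; float data[32]; };

static float decode_scale(const uint8_t * packed, int i) {
    return packed[i] * 0.25f; // stand-in for the real 6-bit scale decode
}

float block_dot(const Block * blocks, const float * x, int nblocks) {
    float acc = 0.0f;
    for (int b = 0; b < nblocks; ++b) {    // whole blocks in the outer loop
        float scales[16];                  // plays the role of shared memory
        for (int s = 0; s < 16; ++s) {     // decoded once per outer block
            scales[s] = decode_scale(blocks[b].packed_scales, s);
        }
        for (int i = 0; i < 32; ++i) {     // unrolled inner loop in the shader
            acc += scales[i / 2] * blocks[b].data[i] * x[b * 32 + i];
        }
    }
    return acc;
}
```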
7ecd780b1a vulkan: Use fp16 for the flash attention P*V multiplication (#12783)
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
2025-04-09 07:12:57 +02:00
7538246e7c cuda : add f32 to bf16 copy op (#12806)
This allows BF16 KV-cache on CUDA.
2025-04-08 23:21:31 +02:00
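A scalar sketch of what such a copy op does per element, assuming round-to-nearest-even (the actual CUDA kernel may differ; NaN edge cases omitted):

```cpp
#include <cstdint>
#include <cstring>

uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFF + ((bits >> 16) & 1); // rounding bias for round-to-nearest-even
    return (uint16_t)(bits >> 16);       // keep sign, exponent, top 7 mantissa bits
}
```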
a19b5cef16 llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
2025-04-08 19:54:51 +03:00
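A hedged sketch of the graph-level idea: when no KV cache is used (the embeddings path), K and V arrive as F32, so cast them to F16 before the FA op. ggml_cast and ggml_flash_attn_ext are real ggml API; the surrounding logic and the scale/bias arguments are placeholders, not the actual llama.cpp graph code.

```cpp
#include "ggml.h"

static struct ggml_tensor * build_attn(struct ggml_context * ctx,
        struct ggml_tensor * q, struct ggml_tensor * k,
        struct ggml_tensor * v, struct ggml_tensor * mask,
        bool kv_cache_used, float kq_scale) {
    if (!kv_cache_used && k->type == GGML_TYPE_F32) {
        k = ggml_cast(ctx, k, GGML_TYPE_F16);
        v = ggml_cast(ctx, v, GGML_TYPE_F16);
    }
    return ggml_flash_attn_ext(ctx, q, k, v, mask, kq_scale,
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}
```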
656babd6c2 Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (#12812)
* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"

This reverts commit 518a01480e.

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

* rm tail space
2025-04-08 15:03:21 +08:00
82974011f3 opencl: better identify Adreno GPU (#12760) 2025-04-07 13:22:54 -07:00
1a1ab7e7a4 cuda : fix HIP and MUSA BF16 (#0)
ggml-ci
2025-04-07 18:44:17 +03:00
ff067dbcb9 ggml : simplify Arm fp16 CPU logic (ggml/1177)
* ggml : simplify Arm fp16 CPU logic

ggml-ci

* cont : bring back CUDA/MUSA checks

ggml-ci
2025-04-07 18:44:17 +03:00
36ca8b3628 CUDA: don't convert BF16 weights to FP32 (ggml/1174)
* add bf16 support

* use convert_from_bf16_cuda instead of convert_unary_cuda for f32

* revert 7ec5085

* move functionality into convert_unary with constexpr
2025-04-07 18:44:17 +03:00
995083e4ed cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
* cpu: refactor SIMD mappings and vectorized op functions into separate files

* Fix warning for ggml_float to float

* Fix warnings

* cpu: move all the operations (except mul_mat) to a separate c++ file

* fix whitespace

* Update ggml/src/ggml-cpu/vec.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp

* Reverse the include order of ops.h and vec.h to match what was previously present in ggml-cpu.c

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-04-07 18:44:17 +03:00
518a01480e sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (#12734) 2025-04-07 17:22:57 +02:00
52b3d71f12 CANN: fix typo in ggml-cann (#12733) 2025-04-07 19:34:14 +08:00
d0d5b2232b CANN: Refactor to reduce duplicate code (#12731)
* CANN: Refactor to reduce duplicate code

* CANN: fix review comment
2025-04-07 17:10:36 +08:00
916c83bfe7 musa: fix compilation warnings in mp_22/31 (#12780)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-06 15:23:54 +02:00
0c74b04376 vulkan: fix NaN issue in flash attention shader (#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
2025-04-06 11:03:47 +02:00
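Why -inf is hazardous as the initial running max: in a fully masked row every score is -inf, the max stays -inf, and exp(-inf - -inf) is exp(NaN). Starting from -FLT_MAX/2 keeps the subtraction finite (the halving leaves headroom so differences cannot overflow). A scalar sketch of the principle, not the shader itself:

```cpp
#include <cfloat>
#include <cmath>

float softmax_denom(const float * scores, int n) {
    float m = -FLT_MAX / 2.0f;                               // was: -INFINITY
    for (int i = 0; i < n; ++i) m = std::fmax(m, scores[i]);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += std::exp(scores[i] - m); // masked -inf -> 0
    return sum;
}
```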
80b717d493 vulkan: Use unclamped loads for flash attention mask (#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
2025-04-06 10:47:13 +02:00
6bf28f0111 Vulkan: Tune Vulkan mmq int dot shader for performance (#12767) 2025-04-05 18:04:03 +02:00
94148ba330 sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (#12625) 2025-04-04 16:00:46 +02:00
9ac4d611d0 cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)
fixes error for compiler paths with spaces
2025-04-04 10:12:40 -03:00
74d4f5b041 vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)
There seems to be a bubble when waking up from waitForFences, which costs a few
percent of performance and also increases variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
2025-04-04 07:54:35 +02:00
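A sketch of the hybrid wait (fence names are illustrative): block on an "almost_ready" fence submitted roughly 80% of the way through the graph, then spin-poll the final fence with _mm_pause (x86) instead of paying the wakeup latency of a second full vkWaitForFences.

```cpp
#include <cstdint>
#include <immintrin.h>
#include <vulkan/vulkan.h>

void hybrid_wait(VkDevice dev, VkFence almost_ready, VkFence final_fence) {
    // sleep-wait until ~80% of the graph has executed
    vkWaitForFences(dev, 1, &almost_ready, VK_TRUE, UINT64_MAX);
    // brief spin for the remainder, avoiding the wakeup bubble
    while (vkGetFenceStatus(dev, final_fence) == VK_NOT_READY) {
        _mm_pause();
    }
}
```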
35e592eb30 vulkan: set cmake minimum and project name in vulkan-shaders (#12744) 2025-04-04 07:53:20 +02:00
c262beddf2 CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models

Vector flash decoding kernel was not being picked for models with head dimension 256; Gemma models are in this category.
Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-03 18:20:29 +02:00
1c059995e0 vulkan: Fix missing cmake logic for dot product extension (#12721) 2025-04-03 10:08:26 -05:00
193c3e03a6 fix MUSA compiler warning (#12704)
* fix MUSA compiler warning

* replace (void) with GGML_UNUSED
2025-04-03 09:32:55 +02:00
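For reference, GGML_UNUSED is ggml's macro for silencing unused-parameter warnings; it is defined roughly as the (void) cast it replaces:

```cpp
#define GGML_UNUSED(x) (void)(x)

static void example(int unused_arg) {
    GGML_UNUSED(unused_arg); // reads better than a bare (void)unused_arg;
}
```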
65cfe136a0 CANN: Support operator SIN COS ARGMAX (#12709)
* [CANN]support sin cos argmax

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]Remove redundant code

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2025-04-03 15:18:08 +08:00
3f9da22c2b Simplify and improve CUDA graphs through use of indirect copy pointers (#9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due to
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies the implementation by using indirection
so that these parameters no longer change, avoiding the need for frequent
graph updates.

Fixes #12152

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-04-03 03:31:15 +02:00
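A sketch of the indirection idea (all names here are illustrative, not the actual ggml-cuda symbols): instead of baking the KV destination pointer into the copy kernel's parameters, which changes every token and forces the captured graph to be updated, the kernel dereferences a fixed device-side slot, and only the slot's contents are refreshed before each launch.

```cpp
#include <cuda_runtime.h>

static char ** dest_ptr_slot = nullptr; // device memory holding the current dst

void init_copy_indirection() {
    // allocated once; its address, and thus the kernel params, never change
    cudaMalloc((void **)&dest_ptr_slot, sizeof(char *));
}

void set_copy_dest(char * new_dst, cudaStream_t stream) {
    // cheap async update; the graph's kernel parameters stay identical
    cudaMemcpyAsync(dest_ptr_slot, &new_dst, sizeof(char *),
                    cudaMemcpyHostToDevice, stream);
}
// In the copy kernel: char * dst = *dest_ptr_slot; then write through dst.
```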
2a0dc97e56 CANN: Fix failed test cases (#12708)
* CANN: Fix memory waste in aclnn_tensor

* CANN: fix backend ops fail

* CANN: fix acl_tensor memory alloc.

* CANN: format

* CANN: remove trailing whitespace
2025-04-03 08:49:51 +08:00
97a20c012b opencl: use max_alloc_size in backend ctx instead of querying again (#12705) 2025-04-02 17:01:42 -07:00
f01bd02376 vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
2025-04-02 14:25:08 -05:00
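For reference, partial results from split_k slices combine with the standard streaming-softmax (log-sum-exp) rule; a scalar sketch with the output reduced to one element for clarity (the real reduction is a separate shader pass):

```cpp
#include <cmath>

struct Partial {
    float m; // running max over this KV slice
    float l; // sum of exp(score - m) over this slice
    float o; // normalized partial output (one element, for clarity)
};

Partial merge(Partial a, Partial b) {
    Partial r;
    r.m = std::fmax(a.m, b.m);
    const float wa = std::exp(a.m - r.m); // rescale each slice to the new max
    const float wb = std::exp(b.m - r.m);
    r.l = a.l * wa + b.l * wb;
    r.o = (a.o * a.l * wa + b.o * b.l * wb) / r.l;
    return r;
}
```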
6f3bd38640 cmake: remove caching from vulkan coopmat checks (#12719) 2025-04-02 14:56:26 -03:00
be0a0f8cae vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
2025-04-02 19:40:32 +02:00
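The arithmetic behind the batching, as a sketch: consecutive Q heads that map to the same KV head can share one workgroup. For the example above (32 Q heads, 8 KV heads) the group size is 32/8 = 4.

```cpp
// which KV head a given Q head reads under grouped query attention
int kv_head_for_q(int q_head, int n_q_heads, int n_kv_heads) {
    return q_head / (n_q_heads / n_kv_heads);
}
```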
92e3006bb6 Vulkan: Fix mmq int dot float cache size (#12722) 2025-04-02 19:12:30 +02:00
e0e912f49b llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
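The ggml_nbytes underflow mentioned in the second bullet has roughly this shape; a simplified sketch (the real function also accounts for block-quantized types):

```cpp
#include <cstddef>
#include <cstdint>

size_t tensor_nbytes(const int64_t ne[4], const size_t nb[4], size_t type_size) {
    for (int i = 0; i < 4; ++i) {
        if (ne[i] <= 0) {
            return 0; // empty tensor: without this, (ne[i] - 1) below wraps around
        }
    }
    size_t nbytes = type_size;
    for (int i = 0; i < 4; ++i) {
        nbytes += (ne[i] - 1) * nb[i];
    }
    return nbytes;
}
```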
9bacd6b374 [CANN] get_rows and dup optimization (#12671)
* [CANN]get_rows and dup optimization.

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]GET_ROWS and CPY/DUP optimization

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-04-02 15:22:13 +08:00
f423981ac8 opencl : fix memory allocation size (#12649)
issue:
https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283

This patch ensures that the memory allocation size
does not exceed the maximum allocation size of the OpenCL device.
2025-04-01 09:54:34 -07:00
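A sketch of the relevant query (illustrative; the actual fix lives in the backend's buffer allocation path, and per the later commit the value is cached in the backend context rather than queried repeatedly):

```cpp
#include <CL/cl.h>

cl_ulong device_max_alloc(cl_device_id dev) {
    cl_ulong max_alloc = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    // single allocations must stay <= this, not just <= global memory size
    return max_alloc;
}
```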
3fd072a540 metal : use F32 prec in FA kernels (#12688)
* metal : use F32 prec in FA kernels

ggml-ci

* cont : fix FA vec kernel

ggml-ci
2025-04-01 14:57:19 +03:00
a6f32f0b34 Fix clang warning in gguf_check_reserved_keys (#12686)
* Fix clang warning in gguf_check_reserved_keys

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix typo

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-01 13:12:53 +02:00
2bb3597e42 vulkan: fix build when glslc doesn't support coopmat (#12683) 2025-04-01 11:38:07 +02:00
8293970542 SYCL: Rename oneMKL to oneMath (#12192)
* Rename oneMKL Interface to oneMath

* Use oneMath for Intel vendor

* Rename occurrences to mkl

* clang-format

* Silence verbose warnings

* Set oneMath HIP_TARGETS

* Fix silence warnings

* Remove step to build oneMath from build instructions

* Use fixed oneMath version

* Remove INTEL_CPU

* Fold CMake oneDNN conditions

* Use Intel oneMKL for Intel devices

* Improve CMake message

* Link against MKL::MKL_SYCL::BLAS only

* Move oneMath documentation to Nvidia and AMD sections
2025-04-01 16:24:29 +08:00
8bbf26083d SYCL: switch to SYCL namespace (#12674) 2025-04-01 10:11:39 +02:00
250d7953e8 ggml : faster ssm scan (#10558)
* faster ssm_scan

* delete unused comment

* clang format

* add space

* remove unnecessary calculations

* faster ssm conv implementation

* rename file to use a dash
2025-03-31 18:05:13 +02:00
a8a1f33567 Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135)
* Vulkan: Add DP4A MMQ and Q8_1 quantization shader

* Add q4_0 x q8_1 matrix matrix multiplication support

* Vulkan: Add int8 coopmat MMQ support

* Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code

* Add GL_EXT_integer_dot_product check

* Remove ggml changes, fix mmq pipeline picker

* Remove ggml changes, restore Intel coopmat behaviour

* Fix glsl compile attempt when integer vec dot is not supported

* Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq

* Remove redundant comment

* Fix integer dot check

* Fix compile issue with unsupported int dot glslc

* Update Windows build Vulkan SDK version
2025-03-31 14:37:01 +02:00
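A sketch of q8_1-style block quantization feeding the int dot shaders: 32 values share one scale, and a weighted sum is precomputed so the q4_1/q5_1 offset term folds into the dot product cheaply. The struct layout here is simplified relative to ggml's real block_q8_1.

```cpp
#include <cmath>
#include <cstdint>

struct q8_1_block {
    float  d;     // scale
    float  s;     // d * sum(q), consumed by the dot product
    int8_t q[32]; // quantized values
};

q8_1_block quantize_q8_1(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    q8_1_block b;
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        b.q[i] = (int8_t)std::round(x[i] * id);
        sum += b.q[i];
    }
    b.s = b.d * (float)sum;
    return b;
}
```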
1790e73157 cmake : fix whitespace (#0) 2025-03-31 15:07:32 +03:00
a7724480fd cmake: improve Vulkan cooperative matrix support checks (whisper/2966)
Co-authored-by: Sandro Hanea <me@sandro.rocks>
2025-03-31 15:07:32 +03:00
6c02a032fa SYCL: Remove misleading ggml_sycl_op_flatten function (#12387)
* SYCL: Remove misleading ggml_sycl_op_flatten function

* remove trailing whitespace

* Fix L2 norm from rebase

* remove try catch block from element_wise.cpp

* remove comment from common.hpp

* ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward

* norm.cpp: remove try catch exception block
2025-03-31 11:25:24 +02:00
4663bd353c metal : use constexpr in FA kernels + fix typedef (#12659)
* metal : use constexpr in FA kernels

ggml-ci

* cont

ggml-ci

* cont : fix typedef

ggml-ci
2025-03-30 22:04:04 +03:00
492d7f1ff7 musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611)
* musa: fix all warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update ci doc (install ccache)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix Windows build issue

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-30 10:59:38 +02:00