llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-06-26 19:55:04 +00:00

Author	SHA1	Message	Date
Christian Kastner	ec9e0301fe	cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)	2025-05-30 01:28:54 +02:00
Yibo Cai	54a2c7a8cd	arm64: optimize q4_k_q8_k kernel with i8mm (#13886)	2025-05-29 14:39:20 +03:00
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|  PP   |   TG   |  B   |      S_PP t/s       |      S_TG t/s       |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
Christian Kastner	21fcc21ad5	cmake: Factor out CPU architecture detection (#13883) * cmake: Define function for querying architecture The tests and results match exactly those of ggml/src/CMakeLists.txt * Switch arch detection over to new function	2025-05-29 12:50:25 +02:00
Vineel Abhinav	dd8ba93416	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882) * F32-Mamba-Seq_Scan-SVE * Fix formatting * ggml : missing space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-05-29 12:18:43 +03:00
Vineel Abhinav	1b8fb8152d	ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843) * F32-Mamba-SVE * F32-Mamba-SVE * Resolve test errors-1 * Resolve test errors-2 * F32-vec-SVE * F32-vec-SVE * F32-vec-SVE	2025-05-29 09:01:33 +03:00
Johannes Gäßler	a68247439b	CUDA: fix FA tg at long context for CC >= 8.9 (#13852)	2025-05-28 13:33:37 +02:00
leo-pony	1e8659e65a	CANN: Add SOC TYPE printing in cmake configuration (#13837)	2025-05-28 11:54:20 +08:00
lhez	a3c30846e4	opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (#13787) * opencl: add `argsort` * opencl: add `div` * opencl: add `add_rows` * opencl: add `sub` * opencl: add `sigmoid`, both `f16` and `f32` * opencl: add `group_norm`	2025-05-27 12:56:08 -07:00
lhez	1701d4c54f	opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (#13790)	2025-05-27 12:53:14 -07:00
Jeff Bolz	bef8176387	vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817) Also change it to be controlled by an env var rather than cmake flag	2025-05-27 18:39:07 +02:00
Akarshan Biswas	f3101a8cc6	SYCL: add gelu_erf kernel (#13749) * SYCL: add gelu_erf kernel * refactor code Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com> * Use scope_op_debug_print --------- Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>	2025-05-27 20:52:59 +05:30
Xuan-Son Nguyen	a8ea03d8ad	ggml : add ggml_repeat_4d (#13824)	2025-05-27 15:53:55 +02:00
xctan	05f6ac6283	ggml : riscv: add xtheadvector support (#13720) * ggml : riscv: add xtheadvector support * ggml : clean up some macro usage	2025-05-27 16:21:36 +03:00
Christian Kastner	7fe03e7446	ggml-cpu: x86 feature detection is specific to x86 (#13811)	2025-05-27 13:18:39 +02:00
Diego Devesa	952f3953c1	ggml : allow CUDA graphs when using pipeline parallelism (#13814)	2025-05-27 13:05:18 +02:00
Georgi Gerganov	4265a87b59	cuda : avoid cuGetErrorString (#13791) ggml-ci	2025-05-26 22:14:52 +03:00
Akarshan Biswas	6f180b915c	SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611) * SYCL: Add non contiguous input support to norm kernel * refactor and add RMS_NORM non contiguous input support ggml-ci * restore subgroup reduction for multi-subgroup thread blocks in norm kernels * Swap grid dims of nsamples and nrows ggml-ci * Revert "Swap grid dims of nsamples and nrows" This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf. * restore not required changes ggml-ci * address review comments: change it to more like SYCL * Use a common function to calculate offset * remove wrap around logic for handling broadcasts * remove static from calculate_offset fn and use ceil_div	2025-05-26 21:10:36 +05:30
Romain Biessy	9012eb9b45	sycl: Add more debug prints (#13640)	2025-05-26 10:28:53 +02:00
Jeff Bolz	fef693dc6b	vulkan: mark IM2COL as supporting non-contig (#13783)	2025-05-26 06:02:07 +02:00
Bizhao Shi	2d38b6e400	CANN: Add the basic supports of Flash Attention kernel (#13627) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constrints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-26 10:20:18 +08:00
Akarshan Biswas	515fdbf7ed	SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752) Temporarily reverted due to failing fp16 DIV operation This reverts commit `02cdd2d8b0`. ggml-ci	2025-05-25 10:08:37 +03:00
Diego Devesa	2bd1b30f69	ggml-cpu : set openmp wait time if not set (#13758)	2025-05-24 22:26:47 +02:00
Xuan-Son Nguyen	4c32832c59	ggml : add ggml_gelu_erf() CUDA kernel (#13719) * ggml : add ggml_gelu_erf() CUDA kernel * missing semicolon	2025-05-24 13:06:47 +02:00
Johannes Gäßler	ffd0eae60b	CUDA: fix race condition in FA vector kernels (#13742)	2025-05-24 11:46:19 +02:00
Chenguang Li	faaaff5f94	CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705) * [CANN]Support MUL_MAT_ID Q8 && Q4 Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-23 16:47:53 +08:00
Xuan-Son Nguyen	e16c4731c7	ggml : fix the order of ggml_unary_op (#13718)	2025-05-23 08:12:48 +02:00
Jeff Bolz	1dcd01960c	vulkan: support CPY from any type to itself (#13695) Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.	2025-05-23 06:45:02 +02:00
Jeff Bolz	c10ed6cbcc	vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696)	2025-05-23 06:33:45 +02:00
Judd	a127ff1780	use LOG_WARN to replace `std::cerr` (#13657)	2025-05-23 06:33:08 +02:00
Nicolò Scipione	d394a9aedc	sycl : Remove waits from function calls (#13702) * removes the waits in async memcpy functions	2025-05-22 12:54:43 +01:00
Ewan Crawford	6b56a64690	SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587) Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there are two operations that throw an exception from the blocking waits during queue recording. * `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187 * `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 We've noticed that `ggml-cuda.cu` has the `check_node_graph_compatibility_and_refresh_copy_ops` method (`ggml/src/ggml-cuda/ggml-cuda.cu`, L2458 at commit `39e73ae0d6`) for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.	2025-05-22 16:24:09 +08:00
Henry Linjamäki	a4e8912dfd	opencl: Add support for multiple devices (#12622) * opencl: Add support for multiple devices ... but limited to one platform. A platform with a GPU will be preferred. Additionally: * Filter out devices that lack capabilities needed by the backend implementation (half support, OpenCL 2.0+, etc). * Make ggml_backend_opencl_reg() thread-safe. * fixup: fix an error in sync_with_other_backends ... when there is only one OpenCL device available.	2025-05-21 16:21:45 -07:00
Henry Linjamäki	edbf42edfd	opencl: fix couple crashes (#12795) * opencl: fix couple crashes * fix kernel launches failed on devices which do not support non-uniform work-groups. When non-uniform work-groups are not supported, set `local_work_size` to NULL (= let driver choose the work-group sizes). This patch does not cover everything - just the cases tested by test-backend-ops. * fix sub-buffer creation failed due to `cl_buffer_region::origin` not being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`. * OpenCL: query non-uniform WG sizes only on OpenCL 3.0+	2025-05-21 13:21:17 -07:00
Xuan-Son Nguyen	cf4cb59e64	ggml : add ggml_gelu_erf() (#13667) * ggml : add ggml_gelu_na (not approximated) * fix naming order * rename na --> erf * apply review suggesions * revert naming order	2025-05-21 16:26:33 +02:00
R0CKSTAR	33983057d0	musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647) * musa: fix build warning (unused parameter) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: upgrade MUSA SDK version to rc4.0.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/cpy.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-05-21 09:58:49 +08:00
Eve	fb1cab201c	vulkan: fix warnings (#13626) * small fixes * remove ifdef	2025-05-20 21:35:16 +00:00
Johannes Gäßler	b69f1647f9	CUDA: skip fully masked-out KV in FA vec kernel (#13584) * CUDA: skip fully masked-out KV in FA vec kernel	2025-05-20 14:45:07 +02:00
Svetlozar Georgiev	4245e622e0	sycl: disable reorder for sycl mulmat (#13536)	2025-05-20 11:34:15 +02:00
Georgi Gerganov	c00a2634be	metal : fix typo in FA kernel comments (#13651)	2025-05-20 10:41:40 +03:00
Nicolò Scipione	f7c9429c85	sycl : Overcoming workaround for mmap() allocation on Windows (#13482) * Remove mmap workaround on windows After some testing I found that mmap is supported on windows and for many GPUs on Linux. Therefore I remove the workaround for windows since it is not necessary. * Update llama-bench README SYCL backend introduced a workaround that allows execution of llama-bench also without specifying `--mmp 0` flag	2025-05-20 08:54:43 +08:00
0cc4m	8960efd0a6	Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607)	2025-05-19 17:54:08 +02:00
Johannes Gäßler	6c35981a64	mnist: fix segmentation fault (ggml/1227)	2025-05-19 13:29:56 +03:00
Diego Devesa	8b5e19aea6	ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)	2025-05-19 13:29:56 +03:00
Daniel Tang	60aea028b5	ggml : Fix missing backtrace on Linux (ggml/1228) * Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1 * Fixed lldb attach * Simplify by having the child do ggml_print_backtrace_symbols	2025-05-19 13:29:56 +03:00
Chenguang Li	33d7aed4a8	CANN: Support MOE Model MUL_MAT_ID (#13042) Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-19 14:21:17 +08:00
Gilad S.	e3a7cf6c5b	cmake: use the current build config for vulkan-shaders-gen (#13595) * fix: use the current build config for `vulkan-shaders-gen` * fix: only pass a valid build type to `--config`	2025-05-17 15:26:43 -03:00
Jeff Bolz	2f5a4e1e09	vulkan: move common FA code to flash_attn_base.comp (#13556) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-17 09:14:55 +02:00
Jeff Bolz	4f41ee11d6	vulkan: use scalar FA rather than coopmat2 when N==1 (#13554)	2025-05-17 08:35:47 +02:00
Georgi Gerganov	654a67794f	metal : add FA-vec kernel for head size 64 (#13583) ggml-ci	2025-05-16 20:32:58 +03:00
Łukasz Ślusarczyk	0a338ed013	sycl : fixed compilation warnings (#13582)	2025-05-16 18:15:29 +08:00

886 Commits