llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-06 09:10:11 -04:00

Author	SHA1	Message	Date
Johannes Gäßler	946b1f6859	CUDA: fix pointer incrementation in FA (#14916 ) b6013	2025-07-28 14:30:22 +02:00
Dongliang Wei	6c6e397aff	model : add support for SmallThinker series (#14898 ) * support smallthinker * support 20b softmax, 4b no sliding window * new build_moe_ffn_from_probs, and can run 4b * fix 4b rope bug * fix python type check * remove is_moe judge * remove set_dense_start_swa_pattern function and modify set_swa_pattern function * trim trailing whitespace * remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * better whitespace Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use GGML_ASSERT for expert count validation Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Improve null pointer check for probs Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use template parameter for SWA attention logic * better whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * move the creation of inp_out_ids before the layer loop * remove redundant judge for probs --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b6012	2025-07-28 13:47:00 +02:00
Alberto Cabrera Pérez	afc0e89698	sycl: refactor quantization to q8_1 (#14815 ) * sycl: quantization to q8_1 refactor * Refactored src1 copy logic in op_mul_mat b6011	2025-07-28 11:05:53 +01:00
Georgi Gerganov	a5771c9eea	ops : update BLAS (#14914 )	2025-07-28 10:01:03 +02:00
Georgi Gerganov	c35f9eaf09	ops : update Metal (#14912 )	2025-07-28 08:22:56 +03:00
Georgi Gerganov	1f45f2890e	sync : ggml	2025-07-28 08:15:01 +03:00
Kai Pastor	613c5095c3	cmake : Indent ggml-config.cmake (ggml/1310)	2025-07-28 08:15:01 +03:00
Ed Addario	7f97599581	quantize : update README.md (#14905 ) * Update README.md * Fix trailing whitespace * Update README.md Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-27 23:31:11 +02:00
Ruben Ortlam	bf78f5439e	vulkan: add ops docs (#14900 )	2025-07-27 15:33:08 +02:00
Akarshan Biswas	bbfc849274	SYCL: add ops doc (#14901 )	2025-07-27 17:52:58 +05:30
Daniel Bevenius	ca0ef2dddb	llama : clarify comment about pp and tg graphs [no ci] (#14895 ) * llama : clarify comment about pp and tg graphs [no ci] This commit clarifies the comment in `llama-context.cpp` regarding the prefill prompt (pp), and token generation (tg) graphs. The motivation for this is that I've struggled to remember these and had to look them up more than once, so I thought it would be helpful to add a comment that makes it clear what these stand for. * squash! llama : clarify comment about pp and tg graphs [no ci] Change "pp" to "prompt processing".	2025-07-27 12:10:51 +02:00
Erik Scholz	89d1029559	vulkan : add fp16 support for the conv_2d kernel (#14872 ) * add f16 to conv_2d testing * weaken conv2d test error threshold b6002	2025-07-27 12:04:33 +02:00
Jeff Bolz	f1a4e72de5	vulkan: skip empty set_rows to avoid invalid API usage (#14860 ) b6001	2025-07-27 11:05:34 +02:00
Gabriel Larson	4762ad7316	model : make rope_yarn_log_mul optional for deepseek2 (#14896 ) * make rope_yarn_log_mul optional for deepseek2 * default rope_yarn_log_mul = 0.0f b6000	2025-07-27 11:18:37 +03:00
Shunta Saito	1dc9614e06	llama : fix kq_scale for the attention layers of PLaMo2 (#14892 ) * Fix dimensions for expand * Change dimensions to copy states to cache * Fix the default value for plamo2 conversion * Fix scale given to build_attn * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b5999	2025-07-27 09:38:44 +02:00
Aman Gupta	446595b9b3	Docs: add instructions for adding backends (#14889 ) b5998	2025-07-27 09:36:43 +08:00
deepsek	66906cd82a	HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624 ) This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices. Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries. This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus. b5997	2025-07-27 00:28:14 +02:00
hipudding	11dd5a44eb	CANN: Implement GLU ops (#14884 ) Implement REGLU, GEGLU, SWIGLU ops according to #14158 b5996	2025-07-26 17:56:18 +08:00
R0CKSTAR	9b8f3c6c77	musa: fix build warnings (unused variable) (#14869 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b5995	2025-07-26 10:36:02 +08:00
Aaron Teo	c7f3169cd5	ggml-cpu : disable GGML_NNPA by default due to instability (#14880 ) * docs: update s390x document for sentencepiece Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `e086c5e3a7`) * docs: update huggingface links + reword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `8410b085ea`) * ggml-cpu: disable ggml-nnpa compile flag by default fixes #14877 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `412f4c7c88`) * docs: update s390x build docs to reflect nnpa disable Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `c1eeae1d0c`) --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b5994	2025-07-25 19:09:03 +02:00
Gabe Goodhart	793c0d7f46	metal: SSM_SCAN performance (#14743 ) * feat: Add s_off as a parameter in the args struct This may not be necessary, but it more closely mirrors the CUDA kernel Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state This is a first attempt at optimizing the metal kernel. The changes here are: - Launch the kernel with a thread group of size d_state - Use simd groups and shared memory to do the summation for the y computation When tested with G4 tiny preview, this shows roughly a 3x speedup on prefill and 15% speedup on decode. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update logic to correctly do the multi-layer parallel sum Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Correctly size the shared memory bufer and assert expected size relationships Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Compute block offsets once rather than once per token Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Use local variable for state recursion Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Use a secondary simd_sum instead of a for loop Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add assertion and comment about relationship between simd size and num simd groups Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parallelize of d_state for mamba-1 Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parallel sum in SSM_CONV Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Revert "feat: Parallel sum in SSM_CONV" After discussion with @compilade, the size of the parallelism here is not worth the cost in complexity or overhead of the parallel for. https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357 This reverts commit `16bc059660`. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify shared memory sizing Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b5993	2025-07-25 10:47:39 -06:00
lhez	ce111d39d6	opencl: add fused `rms_norm_mul` (#14841 ) * opencl: add fused `rms_norm` + `mul` * opencl: improve workgroup size for `rms_norm_mul` b5992	2025-07-25 17:12:13 +02:00
wooksong	e7fecba934	docs : update HOWTO‑add‑model.md for ModelBase and new model classes (#14874 ) This patch updates the example in docs/development/HOWTO-add-model.md to reflect recent changes after `TextModel` and `MmprojModel` were introduced. It replaces the outdated `Model` base class with `TextModel` or `MmprojModel` and updates the registration example accordingly. Signed-off-by: Wook Song <wook16.song@samsung.com>	2025-07-25 16:25:05 +02:00
Oliver Simons	e2b7621e7c	ggml : remove invalid portPos specifiers from dot files (#14838 ) Neither "g" nor "x" are valid portPos specifiers per the official [graphviz documents](https://graphviz.org/docs/attr-types/portPos/): > If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_". I tested locally for it to fall back to default portPos specifier if an invalid portPos is specified. As a consequence, we can remove associated code. b5990	2025-07-25 14:29:57 +03:00
Georgi Gerganov	c1dbea752a	context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870 ) ggml-ci b5989	2025-07-25 14:28:06 +03:00
kiwi	749e0d27f0	mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503 ) * [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip * Update export-lora.cpp * Update clip.cpp * Update export-lora.cpp * format: use space to replace tab b5988	2025-07-25 13:08:04 +02:00
Chris Rohlf	64bf1c3744	rpc : check for null buffers in get/set/copy tensor endpoints (#14868 ) b5987	2025-07-25 12:17:02 +02:00
Diego Devesa	c12bbde372	sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855 ) ggml-ci b5986	2025-07-25 11:07:26 +03:00
R0CKSTAR	3f4fc97f1d	musa: upgrade musa sdk to rc4.2.0 (#14498 ) * musa: apply mublas API changes Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update musa version to 4.2.0 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: restore MUSA graph settings in CMakeLists.txt Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: disable mudnnMemcpyAsync by default Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: switch back to non-mudnn images Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * minor changes Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: restore rc in docker image tag Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b5985	2025-07-24 20:05:37 +01:00
Georgi Gerganov	2df255da3c	sync : ggml ggml-ci b5984	2025-07-24 20:27:23 +03:00
Kai Pastor	60f816a79d	cmake : fix usage issues (ggml/1257) * CMake config: Create target only once Fix error on repeated find_package(ggml). For simplicity, check only for the top-level ggml::ggml. * CMake config: Add CUDA link libs * CMake config: Add OpenCL link libs * CMake config: Use canonical find_dependency Use set and append to control link lib variables. Apply more $<LINK_ONLY...>. * CMake config: Wire OpenMP dependency	2025-07-24 20:27:23 +03:00
Daniel Bevenius	5592f278b6	ggml-cpu : remove stdlib include from repack.cpp (ggml/1276) This commit removes the inclusion of `<cstdlib>`. The motivation for this change is that this source file does not seem to use any functions from this header and the comment about `qsort` is a little misleading/confusing.	2025-07-24 20:27:23 +03:00
Georgi Gerganov	e4868d16d2	context : perform output reorder lazily upon access after sync (#14853 ) * context : perform output reorder after lazily upon access after sync ggml-ci * cont : add TODO b5981	2025-07-24 16:31:48 +03:00
Xuan-Son Nguyen	820de57d4f	chat : fix kimi-k2 chat template (#14852 ) b5980	2025-07-24 13:59:56 +02:00
Alberto Cabrera Pérez	cb4a63aad6	sycl: fixed semantics of block offset calculation (#14814 ) b5979	2025-07-24 11:09:57 +01:00
yummy	86f5623d90	llama : fix MiniCPM inference after Granite Four changes (#14850 ) MiniCPM models use the llm_build_granite constructor which was changed in the Granite Four PR to use hparams.rope_finetuned instead of a use_rope parameter. MiniCPM models need rope enabled by default. Fixes inference from gibberish to correct responses. b5978	2025-07-24 11:50:51 +02:00
Pouya	39cffdf188	docs: add libcurl-dev install hint for Linux distros (#14801 ) * docs: add libcurl-dev install hint for Linux distros Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com> * Update docs/build.md --------- Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-07-24 11:26:44 +02:00
Georgi Gerganov	065908cb09	metal : fix fusion across different encoders (#14849 ) * metal : fix fusion across different encoders ggml-ci * cont : add assertion ggml-ci b5976	2025-07-24 10:24:05 +03:00
Donghyeon Jeong	4ec6291a24	sycl: fix undefined variable in work group size check (#14843 ) b5975	2025-07-24 12:50:41 +08:00
jacekpoplawski	a12363bbf0	convert : text-only support for GLM-4.1V-9B-Thinking (#14823 ) * use language_model part only, ignore visual layers * fix rope_dim calculation	2025-07-23 23:23:57 +02:00
Johannes Gäßler	a86f52b285	CUDA: fix overflow in FA, tune performance (#14840 ) b5973	2025-07-23 21:43:25 +02:00
Johannes Gäßler	b284197df4	CUDA: fix compilation with GGML_CUDA_F16 (#14837 ) b5972	2025-07-23 18:22:30 +02:00
Sigbjørn Skjæret	221c0e0c58	ci : correct label refactor->refactoring (#14832 )	2025-07-23 14:27:54 +02:00
Johannes Gäßler	07a19e27a2	CUDA: fix quantized KV cache + multiple sequences (#14822 ) * CUDA: fix quantized KV cache + multiple sequences * Update ggml/src/ggml-cuda/fattn-common.cuh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b5970	2025-07-23 14:08:09 +03:00
Georgi Gerganov	18f3b5ff9e	tests : add non-cont K,V FA tests ggml-ci	2025-07-23 14:08:09 +03:00
l3utterfly	7233358d29	memory : handle saving/loading null layers in recurrent memory (#14675 ) * Update llama-memory-recurrent.cpp handle saving/loading null layers in recurrent memory * fixed styling issues and updated comments * fix styling issue Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b5968	2025-07-23 11:16:41 +03:00
lixing-star	6c88b3bb25	ggml: fix loongarch quantize_row_q8_1 error (#14827 ) b5967	2025-07-23 09:39:51 +03:00
chen fan	14c28dfc50	CANN: weight format to NZ for Ascend310P3 (#14407 ) * weight format to nz for 310p * remove quant weight format to nz * clean code * fix * make the conditions for converting weights to NZ format consistent * clean code b5966	2025-07-23 11:58:00 +08:00
Aman Gupta	8c988fa41d	CUDA: add fused rms norm (#14800 ) b5965	2025-07-23 09:25:42 +08:00
Csaba Kecskemeti	acd6cb1c41	ggml : model card yaml tab->2xspace (#14819 )	2025-07-22 19:29:43 +03:00

6013 Commits