llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-01 15:09:32 -04:00

Author	SHA1	Message	Date
Georgi Gerganov	64978340b0	ggml : add asserts (#14720) * ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-07-16 14:43:32 +03:00
Jeff Bolz	ba1ceb3456	vulkan: fix noncontig check for mat_mul_id splitting (#14683) * vulkan: fix noncontig check for mat_mul_id splitting Remove supports_op check for > 4096 (splitting fixes this) * vulkan: fix batched matmul dequant for Q*_K	2025-07-15 21:51:09 +02:00
Jeff Bolz	10a0351a97	vulkan: add RTE variants for glu/add/sub/mul/div (#14653)	2025-07-15 21:32:11 +02:00
R0CKSTAR	cbc68be51d	cuda: fix build warnings in set-rows.cu (unused variable) (#14687) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-15 15:28:53 +08:00
Anton Mitkov	bdca38376f	sycl: Hotfix for non dnnl codepath (#14677)	2025-07-14 18:12:42 +01:00
shalinib-ibm	55c509daf5	ggml : refactor llamafile_sgemm PPC code (#14673) Remove unnecessary templates from class definition and packing functions. Reduce deeply nested conditionals and if-else switching in the mnpack function. Replace repetitive code with inline functions in packing functions. 2 ~ 7% improvement in Q8 model; 15 ~ 50% improvement in Q4 model. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-07-14 16:16:42 +03:00
Akarshan Biswas	0f4c6ec0f1	SYCL: use 1D kernel for set_rows (#14618) * SYCL: Use 1D kernel for set_rows * Remove dangling comment * Refactor and use ceil_div	2025-07-14 10:37:55 +01:00
Anton Mitkov	65a3ebb0aa	sycl: Batched mulmat rework for oneDNN dispatch (#14617)	2025-07-14 10:37:35 +01:00
Sigbjørn Skjæret	923e3ea2e3	cuda : add set rows for bf16 (#14664)	2025-07-13 15:01:24 +02:00
Yavor Ivanov	e743cddb60	cuda : add ELU support (#14657)	2025-07-13 11:33:16 +02:00
Georgi Gerganov	05fec5bd29	ggml : add build-time message to remind about ggml_set_rows (#14661) ggml-ci	2025-07-13 10:36:33 +03:00
Yavor Ivanov	dcf7f2ea3c	metal : Add missing unary ops Metal support (#14660)	2025-07-13 08:38:13 +03:00
Aman Gupta	7de5c7cab6	CUDA: add set rows for f32 and f16 (#14551) * CUDA: add set rows for f32 and f16 * Review: change kernel params, use strides from host * Use 1-d kernel * Review: use int64_t for blockDim.x, rename nb->s for clarity	2025-07-12 16:31:38 +03:00
Georgi Gerganov	3120413ccd	vulkan : remove unused vars (#0) ggml-ci	2025-07-12 14:25:44 +03:00
Acly	74bb294591	vulkan : implement bilinear interpolation (ggml/1291) ggml-ci	2025-07-12 14:25:44 +03:00
Acly	3e303b1107	vulkan : implement ggml_roll (ggml/1290) ggml-ci	2025-07-12 14:25:44 +03:00
Jeff Bolz	b3ad3a0191	vulkan: support SET_ROWS (#14587) * vulkan: support SET_ROWS Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now. * vulkan: optimize set_rows Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.	2025-07-12 12:12:26 +02:00
Jeff Bolz	98197e5c98	vulkan: optimizations for deepseek prompt processing (#14555) * vulkan: allow unclamped loads in coopmat2 mul_mat_id shader * vulkan: increase coopmat2 mul_mat_id tile size * vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path * vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)	2025-07-12 11:51:58 +02:00
Tarek Dakhran	f5e96b368f	model : support LiquidAI LFM2 hybrid family (#14620) Important: LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released. To convert into gguf, install transformers from source: `pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"`	2025-07-11 20:27:01 +02:00
Slobodan Josic	756aa1020a	HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)	2025-07-11 18:55:00 +02:00
rmatif	6bdda13981	opencl: add tiled mul_mat_f16_f32 (#14535) * add tiled mul_mat_f16_f32 * fix trailing whitespace * add insightful comments	2025-07-10 14:58:12 -07:00
lhez	0b8855775c	opencl: add `set_rows` for `f16` and `f32` (#14547) * opencl: add `set_rows` for `f16` and `f32` * opencl: better choose workgroup size for `set_rows`	2025-07-10 11:48:52 -07:00
Akarshan Biswas	704bb7a71c	SYCL: Initial set_rows kernel implementation (#14562) * SYCL: Initial set_rows kernel implementation * Revert max_threads to 256 * Refactor set_rows and address review comments * Deduplicate conversion function * Remove guard before kernel launch and refactor * Fix and add back SFINAE	2025-07-10 09:29:38 +01:00
compilade	a57d1bcb3c	cuda : support Falcon-H1 state size for SSM_SCAN (#14602)	2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen	98bab638fb	ggml : add ggml_scale_bias (#14417) * ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code look more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32	2025-07-09 18:16:12 +02:00
Miaoqian Lin	26a48ad699	ggml : prevent integer overflow in gguf tensor size calculation (#14595)	2025-07-09 14:33:53 +02:00
Jeff Bolz	6efcd65945	vulkan: optimize flash attention split_k_reduce (#14554) * vulkan: allow FA split_k with smaller KV values * vulkan: spread split_k_reduce work across more threads k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).	2025-07-08 20:11:42 +02:00
Jeff Bolz	b8eeb8741d	vulkan : fix rope with partial rotation and non-cont src (#14582)	2025-07-08 15:21:21 +02:00
Georgi Gerganov	4d0dcd4a06	cuda : fix rope with partial rotation and non-cont src (#14580) * cuda : fix rope non-cont ggml-ci * cont : fix multi-rope + add test ggml-ci * sycl : try fix ggml-ci * cont : fix sycl + clean-up cuda ggml-ci	2025-07-08 10:15:21 +03:00
Aman Gupta	75c91de6e9	CUDA: add bilinear interpolation for upscale (#14563)	2025-07-08 10:11:18 +08:00
R0CKSTAR	68155c66f0	musa: fix build warnings (unused variable) (#14561) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-08 07:58:30 +08:00
Aman Gupta	b9c3eefde1	CUDA: add bf16 and i32 to getrows (#14529)	2025-07-07 21:45:43 +08:00
Eve	6491d6e4f1	vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485) Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260 Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>	2025-07-06 12:29:36 +02:00
Jeff Bolz	e592be1575	vulkan: fix rms_norm+mul fusion (#14545) The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.	2025-07-06 10:08:16 +02:00
Jeff Bolz	a0374a67e2	vulkan: Handle updated FA dim2/3 definition (#14518) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-05 09:26:04 +02:00
Sigbjørn Skjæret	6681688146	opencl: add GELU_ERF (#14476)	2025-07-04 23:24:56 -07:00
Georgi Gerganov	ef797db357	metal : disable fast math in all quantize kernels (#14528) ggml-ci	2025-07-04 19:19:09 +03:00
luyhcsu	499a8f5a78	CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002) Co-authored-by: luyuhong <luyuhong@kylinos.cn>	2025-07-04 11:50:07 +08:00
Sigbjørn Skjæret	28657a8229	ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)	2025-07-03 23:07:22 +02:00
lhez	bee28421be	opencl : broadcast for soft_max (#14510)	2025-07-03 20:22:24 +02:00
Jeff Bolz	2b72bedec1	vulkan: support mixed/deepseekR1 FA head sizes (#14509) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-03 20:21:14 +02:00
Johannes Gäßler	c8c4495b8d	ggml: backward pass for split swiglu (#14483)	2025-07-03 17:05:18 +02:00
Nicolò Scipione	7b63a71a6b	Fix conditional enabling following arch checks for ggml-sycl (#14504) Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-07-03 11:00:03 +02:00
Georgi Gerganov	a70c8a0c4b	kv-cache : use ggml_set_rows (#14285) * kv-cache : use ggml_set_rows ggml-ci * graph : separate k and v indices ggml-ci * cont : remove redundant ifs ggml-ci * kv-cache : improve find_slot impl * kv-cache : bounds-check when accessing slot_info indices * kv-cache : add comments ggml-ci * ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci	2025-07-03 10:53:35 +03:00
Georgi Gerganov	9067487c44	ggml : fix FA mask dim 2 and 3 (#14505) * ggml : fix FA mask dim 2 and 3 ggml-ci * backends : unsupport batched FA in CUDA and Vulkan ggml-ci * vulkan : disable FA for mask->ne[2] != 1	2025-07-03 10:46:57 +03:00
Georgi Gerganov	d4cdd9c1c3	ggml : remove kompute backend (#14501) ggml-ci	2025-07-03 07:48:32 +03:00
Aman Gupta	55c2646b45	CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)	2025-07-03 07:45:11 +08:00
compilade	5d46babdc2	llama : initial Mamba-2 support (#9126) * llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON * cuda : graceful fallback for Mamba-1 models with weird embd size	2025-07-02 13:10:24 -04:00
Daniel Bevenius	c46944aa25	ggml : add version function to get lib version (ggml/1286) * ggml : add version function to get lib version This commit adds a function `ggml_version()` to the ggml library that returns the version of the library as a string. The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used. Usage: `printf("GGML version: %s\n", ggml_version());` Output: `GGML version: 0.0.2219` * ggml : add ggml_commit() --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-02 20:08:45 +03:00
Aman Gupta	55a1c5a5fd	CUDA: add softmax broadcast (#14475) * CUDA: add softmax broadcast * Pass by const ref * Review: Use blockDims for indexing, remove designated initializers * Add TODO for noncontiguous input/output	2025-07-02 15:48:33 +03:00

1031 Commits