5833 Commits

Author SHA1 Message Date
a0374a67e2 vulkan: Handle updated FA dim2/3 definition (#14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
b5833
2025-07-05 09:26:04 +02:00
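
The packing mentioned in a0374a67e2 (a mask boolean and `n_head_log2` sharing one 32-bit word) might look roughly like the sketch below. The bit layout, the macro names, and the helper functions are assumptions made for illustration; the commit message does not state which bits the Vulkan backend actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout (not taken from the commit):
 *   bit 16     -> "mask present" flag
 *   bits 0..15 -> n_head_log2
 */
#define MASK_FLAG_SHIFT  16u
#define N_HEAD_LOG2_BITS 0xFFFFu

/* Combine both values into one dword so they occupy 4 bytes instead of 8. */
uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_BITS);
    return ((uint32_t) has_mask << MASK_FLAG_SHIFT) | n_head_log2;
}

/* Recover the mask flag from the packed dword. */
bool unpack_has_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0u;
}

/* Recover n_head_log2 from the packed dword. */
uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}
```

Saving one dword this way frees 4 bytes in a fixed-size parameter block, which matters when the block is already close to a hard limit such as the 128B one the commit refers to.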
ddef99522d server : fix assistant prefilling when content is an array (#14360) b5832 2025-07-05 09:17:14 +02:00
6681688146 opencl: add GELU_ERF (#14476) b5831 2025-07-04 23:24:56 -07:00
bac8bed248 eval-callback : check for empty input (#14539) b5830 2025-07-05 07:18:09 +03:00
b81510a7b7 test-backend-ops: add support for specifying output format (#14368)
* test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* refactor

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* remove visitor nonsense

* remove visitor comment

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
b5829
2025-07-05 12:10:53 +08:00
ef797db357 metal : disable fast math in all quantize kernels (#14528)
ggml-ci
b5828
2025-07-04 19:19:09 +03:00
67d1ef23c6 batch : add optional for sequential equal split (#14511)
ggml-ci
b5827
2025-07-04 09:08:59 +03:00
7b50f7c025 graph : prepare for 4D mask (#14515)
ggml-ci
b5826
2025-07-04 09:05:36 +03:00
c79184d2d1 batch : add n_used count (#14512)
ggml-ci
b5825
2025-07-04 09:04:59 +03:00
499a8f5a78 CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
b5824
2025-07-04 11:50:07 +08:00
28657a8229 ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445) b5823 2025-07-03 23:07:22 +02:00
bee28421be opencl : broadcast for soft_max (#14510) b5822 2025-07-03 20:22:24 +02:00
2b72bedec1 vulkan: support mixed/deepseekR1 FA head sizes (#14509)
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes
b5821
2025-07-03 20:21:14 +02:00
c8c4495b8d ggml: backward pass for split swiglu (#14483) b5820 2025-07-03 17:05:18 +02:00
7b63a71a6b Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5819
2025-07-03 11:00:03 +02:00
0c2ee38ab7 convert : correct gemma 3n conversion (#14450)
* convert : correct gemma 3n conversion

* rm redundant code
2025-07-03 10:03:06 +02:00
a70c8a0c4b kv-cache : use ggml_set_rows (#14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
b5817
2025-07-03 10:53:35 +03:00
9067487c44 ggml : fix FA mask dim 2 and 3 (#14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
b5816
2025-07-03 10:46:57 +03:00
d4cdd9c1c3 ggml : remove kompute backend (#14501)
ggml-ci
b5815
2025-07-03 07:48:32 +03:00
55c2646b45 CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) b5814 2025-07-03 07:45:11 +08:00
e75ba4c043 gguf-py : add support for chat template jinja files (#14508)
* add support for chat template jinja files

* remove gemma3n hack
2025-07-02 21:02:35 +02:00
5d46babdc2 llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size
b5812
2025-07-02 13:10:24 -04:00
e17991c466 sync : ggml
ggml-ci
b5811
2025-07-02 20:08:45 +03:00
c46944aa25 ggml : add version function to get lib version (ggml/1286)
* ggml : add version function to get lib version

This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.

The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.

Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-02 20:08:45 +03:00
f3ed38d793 Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309) b5809 2025-07-02 18:37:16 +02:00
55a1c5a5fd CUDA: add softmax broadcast (#14475)
* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for noncontiguous input/output
b5808
2025-07-02 15:48:33 +03:00
12a81af45f CUDA: broadcasting for FlashAttention mask (#14500) 2025-07-02 15:48:33 +03:00
8875523eb3 vulkan: support softmax/FA batch and broadcast (#14449) 2025-07-02 15:48:33 +03:00
ec68e84c32 ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
2025-07-02 15:48:33 +03:00
307e79d33d opencl : fix possible buffer overflow in dump_tensor (#14490) b5804 2025-07-02 14:38:10 +02:00
d7f5f4e578 simple-chat : fix context-exceeded condition (#14494)
* simple-chat : fix context-exceeded condition

ggml-ci

* cont : fix n_ctx_used computation

ggml-ci
b5803
2025-07-02 14:12:07 +03:00
c8a4e470f6 opencl : skip empty nodes on cgraph compute (#14491) b5802 2025-07-02 13:00:04 +02:00
603e43dc91 opencl : update upscale to support align corners (#14488) b5801 2025-07-02 09:07:42 +02:00
611ba4b264 ci : add OpenCL to labeler workflow (#14496) 2025-07-02 09:02:51 +02:00
85841e121d github : add OpenCL backend to issue templates (#14492) 2025-07-02 08:41:35 +03:00
68b3cd6514 ggml : Callback before abort (#14481)
* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.

* Return previous callback to allow callback chaining

* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5798
2025-07-02 08:19:31 +03:00
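
The pattern described in 68b3cd6514 (a callback invoked just before abort, with the setter returning the previous callback so handlers can be chained) can be sketched as follows. This is a self-contained illustration, not the ggml API: the type `abort_callback_t`, the functions `set_abort_callback`, `notify_abort` and `fatal_error`, and both example handlers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback type: receives the error message. */
typedef void (*abort_callback_t)(const char * message);

static abort_callback_t g_abort_callback = NULL;

/* Install a callback and return the one installed before it (may be NULL). */
abort_callback_t set_abort_callback(abort_callback_t callback) {
    abort_callback_t previous = g_abort_callback;
    g_abort_callback = callback;
    return previous;
}

/* Run the installed callback, if any. Kept separate from fatal_error()
 * so the callback path can be exercised without ending the process. */
void notify_abort(const char * message) {
    if (g_abort_callback != NULL) {
        g_abort_callback(message);
    }
}

/* Give the application a last chance to react, then terminate. */
void fatal_error(const char * message) {
    notify_abort(message);
    abort();
}

/* Two example handlers; the second forwards to whatever was installed before. */
int log_calls = 0;
int ui_calls  = 0;
static abort_callback_t previous_callback = NULL;

void log_handler(const char * message) {
    (void) message;
    log_calls++;
}

void ui_handler(const char * message) {
    ui_calls++;
    if (previous_callback != NULL) {
        previous_callback(message); /* chain to the earlier handler */
    }
}
```

Installing `log_handler` first and then storing the return value of `set_abort_callback(ui_handler)` in `previous_callback` means a single failure reaches both handlers, which is the chaining the commit message describes.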
de56944147 ci : disable fast-math for Metal GHA CI (#14478)
* ci : disable fast-math for Metal GHA CI

ggml-ci

* cont : remove -g flag

ggml-ci
b5797
2025-07-01 18:04:08 +03:00
1b2aaf28ac Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
2025-07-01 15:44:11 +02:00
343b6e94b6 CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
* [CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon <757486878@qq.com>

* Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon <757486878@qq.com>

* fix editorconfig

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
b5795
2025-07-01 16:47:30 +08:00
6a746cf9c4 vulkan: Split large mul_mat_id to fit in shared memory (#14451) b5794 2025-07-01 10:43:08 +02:00
eff5e45443 add GELU_ERF (#14455) b5793 2025-07-01 10:14:21 +02:00
a6a47958a1 ggml : remove trailing whitespace (#0) b5792 2025-07-01 11:06:39 +03:00
f61c05d4b1 sync : ggml
ggml-ci
2025-07-01 11:06:39 +03:00
431b2c24f3 ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
* add "align corners" mode for bilinear upscale, and allow downscaling
* add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
* test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
2025-07-01 11:06:39 +03:00
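
For readers unfamiliar with the "align corners" mode named in 431b2c24f3, the sketch below shows the coordinate mapping as it is commonly defined for bilinear resizing. It is an assumption that ggml follows this convention exactly; the function `src_coord` is a hypothetical helper and does not appear in the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Map a destination pixel index to a (fractional) source coordinate.
 *
 * align_corners = true : the first and last pixels of input and output
 *                        coincide, so the scale is (in - 1) / (out - 1).
 * align_corners = false: pixel centers are aligned ("half-pixel" mapping),
 *                        so the scale is in / out with a 0.5 offset.
 *
 * The same formulas cover upscaling and downscaling.
 */
float src_coord(int dst, int in_size, int out_size, bool align_corners) {
    if (align_corners) {
        if (out_size <= 1) {
            return 0.0f; /* a single output pixel maps to the first input pixel */
        }
        return (float) dst * (float) (in_size - 1) / (float) (out_size - 1);
    }
    return ((float) dst + 0.5f) * (float) in_size / (float) out_size - 0.5f;
}
```

A bilinear kernel would then interpolate between the two source pixels on either side of the returned coordinate, clamping at the borders, since the half-pixel mapping can yield slightly negative values.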
497be7c01d ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable `best_mad` to `best_error` in the
`make_qkx2_quants` function.

The motivation for this is that the name `best_mad` can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
2025-07-01 11:06:39 +03:00
79b33b2317 opencl : add GEGLU, REGLU, SWIGLU (#14456) b5788 2025-07-01 09:19:16 +02:00
0a5a3b5cdf Add Conv2d for CPU (#14388)
* Conv2D: Add CPU version

* Half decent

* Tiled approach for F32

* remove file

* Fix tests

* Support F16 operations

* add assert about size

* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
b5787
2025-06-30 23:57:04 +08:00
745f11fed0 memory : correctly handle failure in apply() (#14438)
ggml-ci
2025-06-30 18:03:03 +03:00
5dd942de59 metal : disable fast-math for some cpy kernels (#14460)
* metal : disable fast-math for some cpy kernels

ggml-ci

* cont : disable for q4_1

ggml-ci

* cont : disable for iq4_nl

ggml-ci
b5785
2025-06-30 17:04:05 +03:00
a7417f5594 ggml-cpu: sycl: Re-enable exp f16 (#14462) b5784 2025-06-30 14:52:02 +02:00