* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size
* ggml : add version function to get lib version
This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.
Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```
* ggml : add ggml_commit()
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontigous input/output