Commit Graph

5178 Commits

Author SHA1 Message Date
7c727fbe39 arg : add --no-mmproj-offload (#13093)
* arg : add --no-mmproj-offload

* Update common/arg.cpp
b5178
2025-04-24 14:04:14 +02:00
80982e815e arg : clean up handling --mmproj with -hf (#13082)
* arg : clean up handling --mmproj with -hf

* rm change about no_mmproj

* Revert "rm change about no_mmproj"

This reverts commit 2cac8e0efb.

* handle no_mmproj explicitly

* skip download mmproj on examples not using it
b5177
2025-04-24 12:14:13 +02:00
7604a7d6b8 metal : fix floating-point range of attention scores in FA kernels (#13090)
ggml-ci
b5176
2025-04-24 10:38:30 +03:00
Eve b3b6d862cf vulkan: matmul gcn tuning (#13016)
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <picard12@live.de>
b5175
2025-04-24 09:18:33 +02:00
5630406959 llama-mtmd-cli: SIGINT rework in mtmd vision example (#13080)
* SIGINT rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5174
2025-04-23 23:32:35 +02:00
ecda2ec4b3 mtmd : Support Pixtral 12B (#13065)
* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script
b5173
2025-04-23 20:21:59 +02:00
eb1776b15a convert : Append multi-eos, half-rope, bos to GLM4-0414 and Z (#13021)
* append multi-eos, half-rope, bos to GLM4-0414

* remove unset var
2025-04-23 16:59:14 +02:00
2cca6c01e4 rpc : add command line option for number of threads for the CPU backend (#13060)
closes #13051
b5171
2025-04-23 10:32:49 +03:00
658987cfc9 CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs
b5170
2025-04-22 21:27:40 +02:00
dc39a5e7a8 mtmd : support SmolVLM (version 1 and 2) (#13050)
* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test
b5169
2025-04-22 16:24:54 +02:00
ab47dec3d3 security : add note about RPC and server functionality (#13061)
* security : add note about RPC functionality

* security : add note about llama-server
2025-04-22 16:16:10 +03:00
7b53389c24 metal : add memory pool for temp allocs (#12850)
* metal : add memory pool for temp allocs (wip) [no ci]

* cont : free buffers from the heap

* cont : resize heap [no ci]

* cont : refactor heap [no ci]

* cont : heap for each cmd buffer [no ci]

* cont : fix free

* wip

* cont : fix alignment [no ci]

* cont : not working .. [no ci]

* cont : heap allocation now works [no ci]

* cont : use MTLHeapTypePlacement

ggml-ci

* metal : use dynamic MTLHeap allocations

ggml-ci

* metal : add comments

* metal : disable softmax use of mem_pool

ggml-ci

* metal : final touches
2025-04-22 16:15:51 +03:00
243453533e llava : update documentation (#13055)
* llava : update documentation

* fix typo
b5166
2025-04-22 10:37:00 +02:00
1d735c0b4f ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)
* ggml : add SSE 4.2 variant for CPUs without AVX

* ggml : add x64 base ABI variant
b5165
2025-04-21 18:13:51 +02:00
5368ddda7a SYCL: Add non-contiguous support in ROPE (#12993)
ggml-ci
b5164
2025-04-21 19:13:30 +05:30
84a9bf2fc2 mtmd : merge llava, gemma3 and minicpmv CLI into a single llama-mtmd-cli (#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into a single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5163
2025-04-21 15:32:58 +02:00
2016f07bd1 convert : experimental support for --mmproj flag (#13023)
* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Mode --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* fix Mistral3Model

* fix typo

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
b5162
2025-04-20 23:29:36 +02:00
6602304814 llava: fix errors in clip.h on certain compilers (#13030) b5161 2025-04-20 12:15:41 +02:00
66168204be vulkan: support noncontiguous rms_norm (#13031) b5160 2025-04-20 10:50:02 +02:00
4ba9d711ba metal: add neg operator (#13029) b5159 2025-04-20 08:28:40 +03:00
00137157fc Disable CI cross-compile builds (#13022) b5158 2025-04-19 18:05:03 +02:00
fb28f4f80e gguf-py : fix upload python package workflow (#13020) gguf-v0.16.2 2025-04-19 16:26:38 +02:00
37b9f0d29d clip : refactor, add image_manipulation and llava_uhd classes (#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
b5156
2025-04-19 09:15:45 +02:00
6408210082 main : Fix Ctrl+D/newline handling (#12951)
This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).

This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes #12949
b5155
2025-04-18 22:02:55 +02:00
aff9d107b0 gguf-py : GGUF Editor GUI - Python + Qt6 (#12930) gguf-v0.16.1 2025-04-18 20:30:41 +02:00
35370ba945 server : use std::move whenever possible (#12936)
* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5153
2025-04-18 19:58:12 +02:00
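The std::move cleanup above is a general C++ pattern; here is a minimal, generic sketch (the `task` struct is hypothetical, not the server's actual task type) of why moving a large payload into a queue beats copying it:

```
#include <string>
#include <utility>
#include <vector>

// Hypothetical task type standing in for the server's task structs.
struct task {
    std::string prompt; // potentially large payload
    int id = 0;
};

int main() {
    std::vector<task> queue;
    task t{std::string(1 << 20, 'x'), 42}; // ~1 MB prompt
    // std::move transfers the string's heap buffer into the queue entry
    // instead of duplicating a megabyte of prompt data.
    queue.push_back(std::move(t));
    return 0;
}
```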
8d66005763 SYCL: Refactor and enable FP16 in binary broadcast OPs (#12975)
* SYCL: refactor move to a separate file

* Fix binbcast

* Remove duplicates

* fix include formatting

* fix typo
b5152
2025-04-18 15:57:56 +02:00
b9154ecff9 mtmd : add methods to access mtmd_image_tokens (#12906)
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-defined ID (fixed)

* fix prompt_modified

* rm redundant data member
b5151
2025-04-18 10:04:51 +02:00
2db9ba1464 rpc : add RPC_CMD_HELLO (#12955)
Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server. Follows the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465
b5150
2025-04-18 10:13:42 +03:00
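A sketch of the semver-style compatibility check such a handshake enables; the struct and helper below are illustrative, not the actual rpc-server types:

```
#include <cstdint>

// Illustrative version triple as reported by an RPC_CMD_HELLO reply.
struct rpc_version {
    uint8_t major, minor, patch;
};

// Under semantic versioning, breaking protocol changes bump the major
// version, so a client and server interoperate only on a major match.
static bool rpc_compatible(const rpc_version & client, const rpc_version & server) {
    return client.major == server.major;
}
```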
2f74c354c0 graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
b5149
2025-04-17 18:16:36 +03:00
207c22ec2d ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970) b5148 2025-04-17 15:19:42 +02:00
7a395f67a7 CANN: Add support for async operator submission (#12864)
Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
b5147
2025-04-17 20:34:16 +08:00
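A minimal sketch of how such an environment-variable gate might look; the commit only specifies the GGML_CANN_ASYNC_MODE variable and the disabled-by-default behavior, so the helper name and parsing rule below are assumptions:

```
#include <cstdlib>

// Hypothetical helper: async submission stays off unless the user opts in
// via GGML_CANN_ASYNC_MODE (any non-empty, non-"0" value enables it here).
static bool cann_async_enabled() {
    const char * v = std::getenv("GGML_CANN_ASYNC_MODE");
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}
```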
971f245b3b llama : recognize IBM Granite 3.3 FIM tokens (#12988)
Granite's FIM tokens are very similar to Qwen's; they just use an
underscore instead of a dash, so <fim_middle>, for example,
instead of <fim-middle>.

Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```
b5146
2025-04-17 11:37:05 +03:00
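For context, a fill-in-the-middle prompt built from these tokens could look like the sketch below, assuming the common prefix-suffix-middle ordering; the helper function is illustrative:

```
#include <string>

// Illustrative only: Granite uses underscore-style FIM tokens where Qwen
// uses dashes (<fim_prefix> vs <fim-prefix>, etc.).
static std::string granite_fim_prompt(const std::string & prefix, const std::string & suffix) {
    return "<fim_prefix>" + prefix + "<fim_suffix>" + suffix + "<fim_middle>";
}
```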
12b17501e6 opencl: fix incorrect local_size index in profiling log (#12868) b5145 2025-04-16 14:25:57 -07:00
015022bb53 vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optimization doesn't require a power-of-two
ratio; the only thing relying on it was the modulo operation written as a bitwise &.

split_k need not depend on gqa_ratio: enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicate a larger
FA operation that wouldn't need splitting.
b5144
2025-04-16 20:37:25 +02:00
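The power-of-two requirement came purely from the bitwise-AND modulo idiom mentioned above; this small self-contained check shows where the idiom holds and where it breaks:

```
#include <cassert>
#include <cstdint>

int main() {
    // x & (n - 1) equals x % n only when n is a power of two.
    uint32_t gqa_ratio = 8; // power of two: the idiom holds
    for (uint32_t x = 0; x < 64; ++x)
        assert((x & (gqa_ratio - 1)) == x % gqa_ratio);

    gqa_ratio = 6; // not a power of two: the idiom breaks
    bool diverged = false;
    for (uint32_t x = 0; x < 64; ++x)
        diverged |= (x & (gqa_ratio - 1)) != x % gqa_ratio;
    assert(diverged); // a plain % handles both cases
    return 0;
}
```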
b43d89e311 CANN: Add 310P operator support check (#12962) b5143 2025-04-16 16:21:05 +08:00
80f19b4186 opencl: split ggml-opencl.cl into multiple files and cleanup (#12886)
* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>

* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
b5142
2025-04-15 12:26:00 -07:00
f8f820cc4d metal : add FA-vec kernels for head size 96 (#12952)
ggml-ci
b5141
2025-04-15 14:45:05 +03:00
54a7272043 CANN: Add x86 build ci (#12950)
* CANN: Add x86 build ci

* CANN: fix code format
b5140
2025-04-15 12:08:55 +01:00
84778e9770 CUDA/HIP: Share the same unified memory allocation logic. (#12934)
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
2025-04-15 11:20:38 +02:00
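A sketch of the runtime switch this commit describes, assuming the standard CUDA runtime allocation calls; the wrapper function is illustrative, not the actual ggml allocator:

```
#include <cstdlib>
#include <cuda_runtime.h>

// One binary serves both integrated and dedicated GPUs: unified (managed)
// memory is chosen at runtime from the environment, not at compile time.
static cudaError_t alloc_gpu_buffer(void ** ptr, size_t size) {
    const char * uma = std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY");
    if (uma != nullptr && uma[0] == '1') {
        return cudaMallocManaged(ptr, size); // shared between CPU and GPU
    }
    return cudaMalloc(ptr, size);            // dedicated device memory
}
```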
510676475f SYCL: Add ROPE vision kernel (#12887)
* SYCL: Add ROPE vision kernel

* Add comment about rope mode
b5138
2025-04-15 10:37:42 +02:00
daa422881a llama : DeepSeek V2/V3 MLA implementation (#12801)
* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm)

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF

* Removed MQA optimisation from `build_attn_mha()` as it no longer shows gains

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`
b5137
2025-04-15 09:49:57 +03:00
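Several bullets above hinge on `ggml_reshape` returning a view rather than a copy, which is what makes dropping the `ggml_cont` calls possible; a minimal sketch (arbitrary shapes, not the DeepSeek graph code):

```
#include "ggml.h"

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // ggml_reshape_3d returns a view over the same data, so no copy (cont)
    // is needed when the source tensor is already contiguous.
    struct ggml_tensor * k_pe    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);
    struct ggml_tensor * k_pe_3d = ggml_reshape_3d(ctx, k_pe, 64, 1, 32);
    (void) k_pe_3d;

    ggml_free(ctx);
    return 0;
}
```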
eccc7a1602 ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (#12829)
* Add AVX512 implementation of GEMM - q4kx8

* Update changes to remove unnecessary whitespace
b5136
2025-04-15 09:22:36 +03:00
0019279bb5 CANN: optimize ROPE (#12865)
* [CANN] Optimize ROPE

* [CANN] Code style adjustment

* [CANN] Fix the ROPE precision issue

* [CANN] Code style fix

* [CANN] Add ROPE unsupported case

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
b5135
2025-04-15 10:09:35 +08:00
b0c75ac9f9 CANN: Optimize CANN buffer pool memory management (#12875)
Multiple optional memory pools are provided for CANN: VMM,
priority queue-based, and traditional memory pools. The pool is selected
as follows (see the sketch after this entry):
1. When the VMM pool is available and GGML_CANN_DISABLE_VMM_POOL
   is not defined, the VMM pool is selected by default.
2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined,
   the priority queue-based memory pool is used.
3. If neither condition is met, the traditional memory pool is used.
b5134
2025-04-15 10:04:24 +08:00
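A compile-time sketch of that selection order; the enum and function names are illustrative:

```
// Hypothetical enum mirroring the three pools described above.
enum class cann_pool_kind { vmm, buf_prio, legacy };

static cann_pool_kind select_cann_pool(bool vmm_available) {
#if !defined(GGML_CANN_DISABLE_VMM_POOL)
    if (vmm_available) {
        return cann_pool_kind::vmm;  // 1. VMM pool by default
    }
#endif
#if defined(GGML_CANN_ENABLE_BUF_PRIO_POOL)
    return cann_pool_kind::buf_prio; // 2. priority-queue pool if enabled
#endif
    return cann_pool_kind::legacy;   // 3. traditional pool otherwise
}
```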
d6d2c2ab8c Add performance print for gemma3 in example (#12929) b5133 2025-04-14 19:18:20 +02:00
75afa0ae31 SYCL: Fix im2col (#12910)
* SYCL: Fix im2col

* restore local workgroup size adjustments for large inputs

* restore format
b5132
2025-04-14 14:23:53 +02:00
c772d54926 rpc : use ggml_context_ptr (#12938) b5131 2025-04-14 13:59:34 +03:00
81c7e64fc2 disable curl lib check; this action was missed by commit bd3f59f812 (#12761) (#12937) 2025-04-14 18:19:07 +08:00
526739b879 sync : ggml
ggml-ci
b5129
2025-04-14 09:26:15 +03:00