Commit Graph

5380 Commits

SHA1 Message Date
237acc7cd5 server : update readme + return json for "meta" field 2025-05-14 15:30:12 +03:00
6190e1c1c9 server : passthrough the /models endpoint during loading 2025-05-14 14:17:20 +03:00
09d13d94fb cmake: simplify vulkan shader test logic (#13263) b5378 2025-05-14 07:53:57 -03:00
24e86cae72 vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons, so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may come from other optimizations, like staging
through shared memory or splitting by rows.
b5377
2025-05-14 11:55:26 +02:00
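
For orientation, the two products named above are the two halves of standard scaled-dot-product attention (a textbook formulation, not the shader source):

$$
P = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \qquad O = P\,V
$$

The commit moves the first product onto the KHR_cooperative_matrix hardware path; the P*V product stays on the scalar path for now.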
bb1681fbd5 webui : use fflate for more deterministic gzip compression (#13525)
* webui : use pako for more deterministic gzip compression

* simpler code

* use fflate instead of pako
2025-05-14 10:26:12 +02:00
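
fflate is a JavaScript library, but the determinism problem it solves here is general: the gzip header embeds a modification time, so compressing identical input twice can yield different bytes. A minimal C++ analogue using zlib (an illustration of the idea, not the webui code), pinning the header fields:

```cpp
#include <zlib.h>

#include <cstdint>
#include <vector>

// Gzip-compress `in` with a zeroed mtime and a fixed OS byte so that
// identical input always produces byte-identical output.
std::vector<uint8_t> gzip_deterministic(const std::vector<uint8_t> & in) {
    z_stream strm {};
    // windowBits = 15 + 16 selects the gzip wrapper
    deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);

    gz_header head {};
    head.time = 0;   // pin mtime instead of "now"
    head.os   = 255; // 255 = unknown OS, also pinned
    deflateSetHeader(&strm, &head);

    std::vector<uint8_t> out(deflateBound(&strm, (uLong) in.size()));
    strm.next_in   = const_cast<Bytef *>(in.data());
    strm.avail_in  = (uInt) in.size();
    strm.next_out  = out.data();
    strm.avail_out = (uInt) out.size();
    deflate(&strm, Z_FINISH);
    out.resize(strm.total_out);
    deflateEnd(&strm);
    return out;
}
```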
d486dd3e8e webui: Allow pasting file from clipboard (#13526)
* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00
21ca987fba docs: Update link to ggml-org in multimodal.md (#13513)
* Update multimodal.md

Minor change to include the huggingface link

* Update docs/multimodal.md

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-14 09:59:12 +02:00
be1d4a13db scripts : fix compare-llama-bench.py show parameter (#13514) 2025-05-14 08:41:01 +02:00
ab3971f2a0 vulkan: workaround FA compile failures on macos (#13517) b5372 2025-05-14 06:15:50 +02:00
e5c834f718 quantize : improve tensor-type pattern matching (#13033) b5371 2025-05-13 19:12:31 +02:00
71bdbdb587 clip : clip.h become private API (⚠️ breaking change) (#13510) b5370 2025-05-13 17:07:21 +02:00
f0995d28ce metal : use FA-vec kernel up to batch size 20 (#13496)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci
b5369
2025-05-13 18:04:39 +03:00
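
A sketch of the dispatch rule the title describes (the helper below is hypothetical; the real check lives in the Metal backend): the vector FA kernel, originally aimed at single-sequence decoding, now also covers small batches.

```cpp
// The threshold comes from the commit title; the rest is illustrative.
constexpr int FA_VEC_MAX_BATCH = 20;

enum class fa_kernel { vec, full };

fa_kernel pick_fa_kernel(int n_batch) {
    // small batches: the vector (row-at-a-time) kernel still wins;
    // beyond the threshold, use the full matrix kernel
    return n_batch <= FA_VEC_MAX_BATCH ? fa_kernel::vec : fa_kernel::full;
}
```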
c252e0c409 metal : optimize multi-sequence FA vec kernel (#13493)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci
b5368
2025-05-13 18:04:00 +03:00
4f711afed5 ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
b5367
2025-05-13 18:02:28 +03:00
b89d605a91 batched-bench : fix pp batch contents (#13492) b5366 2025-05-13 18:01:53 +03:00
b4726345ac mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)
* mtmd : remove libllava, remove clip-quantize-cli

* rm clip_model_quantize
b5365
2025-05-13 15:33:58 +02:00
bf79371120 scripts : support arbitrary input file formats in compare-llama-bench.py (#13455) 2025-05-13 15:31:12 +02:00
d590cd4c24 model : Granite MoE shared (#13269)
* feat: Add GGUF conversion for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: hparam and arch plumbing for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Split MoE fused tensors for shared experts in conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First WIP cut at model arch in cpp

The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Cleaner (maybe more correct?) splitting for gate/up

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix the input to the shared experts

I had misread this: the shared experts take their input from _before_ the standard
MoE layer, but I was feeding the output of the MoE into the shared experts.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Avoid architecture-specific checks for Granite MoE Shared

This is a cleaner way that will allow more flexibility in architecture
strings going forward.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Split granite architectures out of llm_build_llama

This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).

NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix compiler warning about uninitialized inp_pos

This should not have been reachable, but it warns on some compilers.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b5363
2025-05-13 15:12:01 +02:00
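
The heart of the "Fix the input to the shared experts" change, as a hedged ggml-style sketch (build_routed_moe and build_shared_ffn are hypothetical stand-ins for the real graph-building code): both expert paths must consume the same layer input, and their outputs are summed.

```cpp
#include "ggml.h"

// Hypothetical helpers standing in for the real llm_build_* code.
ggml_tensor * build_routed_moe(ggml_context * ctx, ggml_tensor * inp);
ggml_tensor * build_shared_ffn(ggml_context * ctx, ggml_tensor * inp);

ggml_tensor * build_moe_with_shared(ggml_context * ctx, ggml_tensor * inp) {
    ggml_tensor * routed = build_routed_moe(ctx, inp); // top-k routed experts
    ggml_tensor * shared = build_shared_ffn(ctx, inp); // always-on shared experts:
                                                       // same input `inp`, not the
                                                       // routed-MoE output
    return ggml_add(ctx, routed, shared);
}
```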
1e2809bc4b sync : ggml 2025-05-13 14:02:28 +03:00
cf0a43bb64 llama-bench : add defrag-thold, check for invalid ranges (#13487) b5361 2025-05-13 00:31:37 +02:00
f0d46ef157 opencl: remove unnecessary assert for add (#13257) b5360 2025-05-12 13:13:49 -07:00
de4c07f937 clip : cap max image size 1024 for qwen vl model (#13478) b5359 2025-05-12 15:06:51 +02:00
10d2af0eaa llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
b5358
2025-05-12 14:44:49 +02:00
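
The llama_opt_param_filter mentioned above decides which tensors participate in training. Its exact signature is not shown in this log; a hedged sketch assuming a per-tensor boolean callback:

```cpp
#include <cstring>

#include "ggml.h"

// Assumed callback shape: return true for tensors that should receive
// gradients. This illustrative policy freezes everything except norms.
static bool opt_filter_norms_only(const struct ggml_tensor * t, void * /*userdata*/) {
    return strstr(t->name, "norm") != nullptr;
}
```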
064cc596ac context : fix state io for memory-less contexts (#13470)
ggml-ci
b5357
2025-05-12 15:12:27 +03:00
91159ee9df server : allow content to be null in oaicompat_completion_params_parse (#13477) b5356 2025-05-12 13:56:42 +02:00
22cdab343b llama-bench : accept ranges for integer parameters (#13410) b5355 2025-05-12 13:08:22 +02:00
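A hedged sketch of what "accept ranges" can look like (the `first-last+step` syntax below is an assumption; the authoritative grammar is in the PR): an argument like `-n 16-64+16` would expand to 16, 32, 48, 64.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Expand "first-last+step" (step optional, default 1) into a value list;
// a bare integer passes through unchanged.
std::vector<int> expand_range(const std::string & s) {
    int first = 0, last = 0, step = 1;
    const int n = sscanf(s.c_str(), "%d-%d+%d", &first, &last, &step);
    if (n >= 2 && step > 0) {
        std::vector<int> out;
        for (int v = first; v <= last; v += step) {
            out.push_back(v);
        }
        return out;
    }
    return { std::stoi(s) };
}
```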
a71a4075cd ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* code review fixes

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

---------

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
b5354
2025-05-12 13:06:19 +02:00
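
For context on "fp32=bf16xbf16": bf16 is the top half of an fp32 bit pattern, so the kernel multiplies bf16 inputs and accumulates into fp32. A self-contained conversion sketch (not the KleidiAI code; NaN payloads are ignored for brevity):

```cpp
#include <cstdint>
#include <cstring>

// fp32 -> bf16 with round-to-nearest-even on the 16 dropped mantissa bits.
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFF + ((bits >> 16) & 1);
    return (uint16_t) (bits >> 16);
}

// bf16 -> fp32 is exact: re-expand with zeroed low mantissa bits.
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```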
95e18884fc CUDA: fix misaligned synchronization in FA (#13469) b5353 2025-05-12 10:51:21 +02:00
df8491922f ggml : add mrope kernel for metal (#13457) b5352 2025-05-12 10:29:13 +02:00
14492144c2 enable dpcpp nightly builds with libraries (#13406) b5351 2025-05-12 13:15:32 +08:00
c104023994 mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459) b5350 2025-05-12 00:39:06 +02:00
9a390c4829 tools : fix uninitialized llama_batch in server (#13436)
* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()

* Update tools/server/server.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* use C++11 initializer syntax

* switch from copy-list-initialization to direct-list-initialization

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5349
2025-05-11 17:08:26 +02:00
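
The fix above in miniature (member and type names simplified): value-initializing the member zeroes llama_batch's internal pointers, so the destructor's llama_batch_free degenerates to free(nullptr) no-ops when no batch was ever allocated.

```cpp
#include "llama.h"

struct server_context_sketch {
    llama_batch batch {}; // direct-list-initialization: all pointers start null

    ~server_context_sketch() {
        // safe whether or not a real batch was ever created
        llama_batch_free(batch);
    }
};
```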
09232370fc scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451) 2025-05-11 16:20:39 +02:00
7474e00b34 CUDA: fix crash with partial offloading of MoE (#13439) b5347 2025-05-11 16:09:33 +02:00
7f323a589f Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386) b5346 2025-05-11 14:18:39 +02:00
3eac209319 mtmd : support InternVL 3 38B and 78B mmproj (#13443)
* Support InternVL 3 38B and 78B mmproj

* Swap norms in clip.cpp

* Group variables together
b5345
2025-05-11 11:35:52 +02:00
a634d75d1b mtmd : move helpers to dedicated file (#13442)
* mtmd : move helpers to dedicated file

* fix windows build

* rm redundant include
b5344
2025-05-11 11:34:23 +02:00
62d4250e52 docs : Fix typo in InternVL3 model name (#13440) 2025-05-10 22:26:46 +02:00
0208355f42 CUDA: fix race conditions FlashAttention kernels (#13438) b5342 2025-05-10 22:22:48 +02:00
d2a4ef05c6 vocab : add ByteDance-Seed/Seed-Coder (#13423) b5341 2025-05-10 22:08:07 +02:00
15e6125a39 mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo
b5340
2025-05-10 19:57:54 +02:00
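
A hedged sketch of such a hard limit (hypothetical helper; the real limit and any alignment rules live in clip.cpp, and the related commit above caps qwen-vl at 1024): scale the image down so its longest side fits, preserving the aspect ratio.

```cpp
#include <algorithm>
#include <cmath>

void clamp_longest_side(int & w, int & h, int max_side) {
    const int longest = std::max(w, h);
    if (longest <= max_side) {
        return; // already within the limit
    }
    const float scale = (float) max_side / (float) longest;
    w = std::max(1, (int) std::lround(w * scale));
    h = std::max(1, (int) std::lround(h * scale));
}
```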
3b24d26c22 server : update docs (#13432) 2025-05-10 18:44:49 +02:00
43dfd741a5 llguidance : set tokenizer slices to default (#13424) b5338 2025-05-10 17:19:52 +02:00
b064a51a4e ci: free_disk_space flag enabled for intel variant (#13426)
free disk space before cleanup: 20G
free disk space after cleanup: 44G
free disk space after everything was built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245
2025-05-10 16:34:48 +02:00
053367d149 mtmd : support InternVL 2.5 and 3 (#13422)
* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps
b5336
2025-05-10 16:26:42 +02:00
d8919424f1 CUDA: fix FlashAttention on Turing (#13415) b5335 2025-05-10 09:16:52 +02:00
7fef11766c arg : add env var to control mmproj (#13416)
* arg : add env var to control mmproj

* small note about -hf --mmproj
b5334
2025-05-10 08:16:29 +02:00
dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
b5333
2025-05-10 08:07:07 +02:00
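
On "support q4_0/q8_0 KV in scalar FA": ggml's q8_0 format stores blocks of 32 int8 values sharing one fp16 scale, so a shader reading quantized K/V dequantizes on the fly. The C-side layout, mirrored here as a sketch (ggml_fp16_to_fp32 is the real ggml helper; the struct shadows ggml's block_q8_0):

```cpp
#include <cstdint>

#include "ggml.h"

// One fp16 scale per 32 quantized values, as in ggml's block_q8_0.
struct block_q8_0_sketch {
    ggml_fp16_t d;      // scale
    int8_t      qs[32]; // quants
};

// Dequantize one element: value = scale * quant.
static float dequant_q8_0(const block_q8_0_sketch & b, int i) {
    return ggml_fp16_to_fp32(b.d) * (float) b.qs[i];
}
```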
7c28a74e07 chore(llguidance): use tagged version that does not break the build (#13413) b5332 2025-05-09 23:15:39 +03:00
33eff40240 server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-defined ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip : fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prompt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5331
2025-05-09 19:29:37 +02:00
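
One detail from the long list above worth pinning down: image bitmaps are identified by an FNV hash of the decoded pixels rather than of the file bytes, presumably so the same image dedupes to one ID regardless of its container encoding. FNV-1a in its 64-bit form for reference (which width the server actually uses is an assumption here):

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a: xor each byte into the state, then multiply by the FNV prime.
static uint64_t fnv1a_64(const uint8_t * data, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL; // offset basis
    for (size_t i = 0; i < n; ++i) {
        h = (h ^ data[i]) * 0x100000001b3ULL;
    }
    return h;
}
```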