5543 Commits

SHA1 Message Tag Date
62d4250e52 docs : Fix typo in InternVL3 model name (#13440) 2025-05-10 22:26:46 +02:00
0208355f42 CUDA: fix race conditions FlashAttention kernels (#13438) b5342 2025-05-10 22:22:48 +02:00
d2a4ef05c6 vocab : add ByteDance-Seed/Seed-Coder (#13423) b5341 2025-05-10 22:08:07 +02:00
15e6125a39 mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo
b5340
2025-05-10 19:57:54 +02:00
3b24d26c22 server : update docs (#13432) 2025-05-10 18:44:49 +02:00
43dfd741a5 llguidance : set tokenizer slices to default (#13424) b5338 2025-05-10 17:19:52 +02:00
b064a51a4e ci: free_disk_space flag enabled for intel variant (#13426)
before cleanup: 20G
after cleanup: 44G
after all built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245
2025-05-10 16:34:48 +02:00
053367d149 mtmd : support InternVL 2.5 and 3 (#13422)
* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps
b5336
2025-05-10 16:26:42 +02:00
d8919424f1 CUDA: fix FlashAttention on Turing (#13415) b5335 2025-05-10 09:16:52 +02:00
7fef11766c arg : add env var to control mmproj (#13416)
* arg : add env var to control mmproj

* small note about -hf --mmproj
b5334
2025-05-10 08:16:29 +02:00
dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
b5333
2025-05-10 08:07:07 +02:00
7c28a74e07 chore(llguidance): use tagged version that does not break the build (#13413) b5332 2025-05-09 23:15:39 +03:00
33eff40240 server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prmpt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5331
2025-05-09 19:29:37 +02:00
17512a94d6 sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
b5330
2025-05-09 16:34:08 +01:00
611aa914ef metal : optimize MoE for large batches (#13388)
ggml-ci
b5329
2025-05-09 15:14:56 +03:00
0cf6725e9f CUDA: FA support for Deepseek (Ampere or newer) (#13306)
* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template
b5328
2025-05-09 13:34:58 +02:00
27ebfcacba llama : do not crash if there is no CPU backend (#13395)
* llama : do not crash if there is no CPU backend

* add checks to examples
b5327
2025-05-09 13:02:07 +02:00
5c86c9ed3e CUDA: fix crash on large batch size for MoE models (#13384) b5326 2025-05-09 12:14:04 +02:00
efb8b47eda imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace
b5325
2025-05-09 11:53:58 +02:00
0527771dd8 llama-run: add support for downloading models from ModelScope (#13370)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5324
2025-05-09 10:25:50 +01:00
2189fd3b63 mtmd : fix batch_view for m-rope (#13397)
* mtmd : fix batch_view for m-rope

* nits : fix comment
b5323
2025-05-09 11:18:02 +02:00
3f96aeff39 llama : one-off chat template fix for Mistral-Small-2503 (#13398)
* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken
b5322
2025-05-09 11:17:51 +02:00
b486ba05bf rpc : add rpc_msg_set_tensor_hash_req (#13353)
* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.

* fix
b5321
2025-05-09 10:31:07 +03:00
02115dcd9a vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
b5320
2025-05-09 09:23:41 +02:00
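
The arithmetic behind the assert in 02115dcd9a can be checked with a short sketch. It only mirrors the numbers quoted in the commit message (the 3072 bound, the 8 x 512 tensor, and the 4096 bound from the commit title). The helper name `fits_row_ids` and the assignment of 8 to `nei0` and 512 to `nei1` are assumptions made for illustration, not code from the Vulkan backend.

```python
# Illustrative only: reproduces the arithmetic from the commit message,
# not the actual Vulkan backend code.
OLD_LIMIT = 3072  # bound in the assert quoted in the commit message
NEW_LIMIT = 4096  # bound named in the commit title


def fits_row_ids(nei0: int, nei1: int, limit: int) -> bool:
    """Hypothetical helper: True if a nei0 x nei1 ids tensor fits within `limit` elements."""
    return nei0 * nei1 <= limit


# Tensor shape reported for Qwen_Qwen3-30B-A3B-Q2_K.gguf (axis order assumed).
nei0, nei1 = 8, 512

print(nei0 * nei1)                          # 4096
print(fits_row_ids(nei0, nei1, OLD_LIMIT))  # False: the old assert fires
print(fits_row_ids(nei0, nei1, NEW_LIMIT))  # True: the raised bound accommodates it
```

The product is exactly 4096, so the raised limit is the smallest bound that accepts this tensor.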
d9c4accaff server : (webui) rename has_multimodal --> modalities (#13393)
* server : (webui) rename has_multimodal --> modalities

* allow converting SVG to PNG

* less complicated code
2025-05-09 09:06:37 +02:00
15e03282bb ci : limit write permission to only the release step + fixes (#13392)
* ci : limit write permission to only the release step

* fix win cuda file name

* fix license file copy on multi-config generators
b5318
2025-05-08 23:45:22 +02:00
f05a6d71a0 mtmd : Expose helper_decode_image_chunk (#13366)
* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free

* Slim down

* Cleanups
b5317
2025-05-08 20:25:39 +02:00
ee01d71e58 server : (webui) fix a very small misalignment (#13387)
* server : (webui) fix a very small misalignment

* restore font-bold
2025-05-08 18:51:45 +02:00
8c83449cb7 server : (webui) revamp the input area, plus many small UI improvements (#13365)
* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log
b5315
2025-05-08 15:37:29 +02:00
1a844be132 convert : support rope_scaling type and rope_type (#13349) 2025-05-08 15:34:29 +02:00
0ccc121354 mtmd : fix the calculation of n_tokens for smolvlm (#13381)
Co-authored-by: Taichi Nishimura <Taichi.A.Nishimura@sony.com>
b5313
2025-05-08 15:03:53 +02:00
6562e5a4d6 context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
51fb96b1ff context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
b5311
2025-05-08 14:26:50 +03:00
70a6991edf ci : move release workflow to a separate file (#13362) b5310 2025-05-08 13:15:28 +02:00
f061021206 llama : print size and type of overridden tensors (#13364) b5309 2025-05-08 13:15:15 +02:00
8733e0cf6e sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel
b5308
2025-05-08 10:08:01 +01:00
814f795e06 docker : disable arm64 and intel images (#13356) 2025-05-07 16:36:33 +02:00
d879433824 sync : ggml
ggml-ci
b5306
2025-05-07 17:28:36 +03:00
13b0a04597 whisper: remove MSVC warnings pragmas (whisper/3090)
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
bba9d945c1 cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs

* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
bc4e1128f7 llama : deci : support ffn-free with attention (#13296) b5303 2025-05-07 12:49:27 +02:00
39e73ae0d6 common : Add a warning when we can't match samplers from a string or char. (#13330) b5302 2025-05-07 11:23:28 +03:00
1f73301b63 cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5301
2025-05-07 09:48:23 +02:00
4773d7a02f examples : remove infill (#13283)
ggml-ci
b5300
2025-05-07 10:28:02 +03:00
6c7fd67b64 llama : support tie embedding for chatglm models (#13328) b5299 2025-05-07 09:23:11 +02:00
141a908a59 CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135) b5298 2025-05-06 23:35:51 +02:00
32916a4907 clip : refactor graph builder (#13321)
* mtmd : refactor graph builder

* fix qwen2vl

* clean up siglip cgraph

* pixtral migrated

* move minicpmv to a dedicated build function

* move max_feature_layer to build_llava

* use build_attn for minicpm resampler

* fix windows build

* add comment for batch_size

* also support tinygemma3 test model

* qwen2vl does not use RMS norm

* fix qwen2vl norm (2)
b5297
2025-05-06 22:40:24 +02:00
ffc727203a sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345) b5296 2025-05-06 22:36:24 +02:00
91a86a6f35 sampling : don't consider -infinity values in top_n_sigma (#13344) b5295 2025-05-06 20:24:15 +02:00
f4ed10b69c cmake : remove arm64 msvc presets (#13342) 2025-05-06 20:15:31 +02:00