Commit Graph

5336 Commits

Author SHA1 Message Date
5c32fc3d13 Break down main function in llama-server
llama-server main function is getting meaty, just breaking it
down into smaller functions.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-05-10 13:31:48 +01:00
d8919424f1 CUDA: fix FlashAttention on Turing (#13415) b5335 2025-05-10 09:16:52 +02:00
7fef11766c arg : add env var to control mmproj (#13416)
* arg : add env var to control mmproj

* small note about -hf --mmproj
b5334
2025-05-10 08:16:29 +02:00
dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
b5333
2025-05-10 08:07:07 +02:00
7c28a74e07 chore(llguidance): use tagged version that does not break the build (#13413) b5332 2025-05-09 23:15:39 +03:00
33eff40240 server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prmpt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5331
2025-05-09 19:29:37 +02:00
17512a94d6 sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
b5330
2025-05-09 16:34:08 +01:00
611aa914ef metal : optimize MoE for large batches (#13388)
ggml-ci
b5329
2025-05-09 15:14:56 +03:00
0cf6725e9f CUDA: FA support for Deepseek (Ampere or newer) (#13306)
* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template
b5328
2025-05-09 13:34:58 +02:00
27ebfcacba llama : do not crash if there is no CPU backend (#13395)
* llama : do not crash if there is no CPU backend

* add checks to examples
b5327
2025-05-09 13:02:07 +02:00
5c86c9ed3e CUDA: fix crash on large batch size for MoE models (#13384) b5326 2025-05-09 12:14:04 +02:00
efb8b47eda imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace
b5325
2025-05-09 11:53:58 +02:00
0527771dd8 llama-run: add support for downloading models from ModelScope (#13370)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5324
2025-05-09 10:25:50 +01:00
2189fd3b63 mtmd : fix batch_view for m-rope (#13397)
* mtmd : fix batch_view for m-rope

* nits : fix comment
b5323
2025-05-09 11:18:02 +02:00
3f96aeff39 llama : one-off chat template fix for Mistral-Small-2503 (#13398)
* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken
b5322
2025-05-09 11:17:51 +02:00
b486ba05bf rpc : add rpc_msg_set_tensor_hash_req (#13353)
* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.

* fix
b5321
2025-05-09 10:31:07 +03:00
02115dcd9a vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
b5320
2025-05-09 09:23:41 +02:00
d9c4accaff server : (webui) rename has_multimodal --> modalities (#13393)
* server : (webui) rename has_multimodal --> modalities

* allow converting SVG to PNG

* less complicated code
2025-05-09 09:06:37 +02:00
15e03282bb ci : limit write permission to only the release step + fixes (#13392)
* ci : limit write permission to only the release step

* fix win cuda file name

* fix license file copy on multi-config generators
b5318
2025-05-08 23:45:22 +02:00
f05a6d71a0 mtmd : Expose helper_decode_image_chunk (#13366)
* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free

* Slim down

* Cleanups
b5317
2025-05-08 20:25:39 +02:00
ee01d71e58 server : (webui) fix a very small misalignment (#13387)
* server : (webui) fix a very small misalignment

* restore font-bold
2025-05-08 18:51:45 +02:00
8c83449cb7 server : (webui) revamp the input area, plus many small UI improvements (#13365)
* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log
b5315
2025-05-08 15:37:29 +02:00
1a844be132 convert : support rope_scaling type and rope_type (#13349) 2025-05-08 15:34:29 +02:00
0ccc121354 mtmd : fix the calculation of n_tokens for smolvlm (#13381)
Co-authored-by: Taichi Nishimura <Taichi.A.Nishimura@sony.com>
b5313
2025-05-08 15:03:53 +02:00
6562e5a4d6 context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
51fb96b1ff context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
b5311
2025-05-08 14:26:50 +03:00
70a6991edf ci : move release workflow to a separate file (#13362) b5310 2025-05-08 13:15:28 +02:00
f061021206 llama : print size and type of overridden tensors (#13364) b5309 2025-05-08 13:15:15 +02:00
8733e0cf6e sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel
b5308
2025-05-08 10:08:01 +01:00
814f795e06 docker : disable arm64 and intel images (#13356) 2025-05-07 16:36:33 +02:00
d879433824 sync : ggml
ggml-ci
b5306
2025-05-07 17:28:36 +03:00
13b0a04597 whisper: remove MSVC warnings pragmas (whisper/3090)
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
bba9d945c1 cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs

* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
bc4e1128f7 llama : deci : support ffn-free with attention (#13296) b5303 2025-05-07 12:49:27 +02:00
39e73ae0d6 common : Add a warning when we can't match samplers from a string or char. (#13330) b5302 2025-05-07 11:23:28 +03:00
1f73301b63 cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5301
2025-05-07 09:48:23 +02:00
4773d7a02f examples : remove infill (#13283)
ggml-ci
b5300
2025-05-07 10:28:02 +03:00
6c7fd67b64 llama : support tie embedding for chatglm models (#13328) b5299 2025-05-07 09:23:11 +02:00
141a908a59 CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135) b5298 2025-05-06 23:35:51 +02:00
32916a4907 clip : refactor graph builder (#13321)
* mtmd : refactor graph builder

* fix qwen2vl

* clean up siglip cgraph

* pixtral migrated

* move minicpmv to a dedicated build function

* move max_feature_layer to build_llava

* use build_attn for minicpm resampler

* fix windows build

* add comment for batch_size

* also support tinygemma3 test model

* qwen2vl does not use RMS norm

* fix qwen2vl norm (2)
b5297
2025-05-06 22:40:24 +02:00
ffc727203a sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345) b5296 2025-05-06 22:36:24 +02:00
91a86a6f35 sampling : don't consider -infinity values in top_n_sigma (#13344) b5295 2025-05-06 20:24:15 +02:00
f4ed10b69c cmake : remove arm64 msvc presets (#13342) 2025-05-06 20:15:31 +02:00
1e333d5bba SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default
b5293
2025-05-06 20:27:06 +05:30
2f54e348ad llama : fix build_ffn without gate (#13336)
* llama : fix build_ffn without gate

* fix build on windows

* Revert "fix build on windows"

This reverts commit fc420d3c7e.
b5292
2025-05-06 14:25:40 +02:00
2356fb1d53 CUDA: fix bad asserts for partial offload (#13337) 2025-05-06 13:58:51 +02:00
764b85627b convert : qwen2/3moe : set yarn metadata if present (#13331)
* set yarn metadata if present

* add comment about enabling YaRN

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-05-06 11:12:06 +02:00
15a28ec8c7 CUDA: fix --split-mode row for MMQ (#13323) b5289 2025-05-06 08:36:46 +02:00
a7366faa5b gguf-py : avoid requiring pyside6 for other scripts (#13036)
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed

Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.
gguf-v0.16.3
2025-05-05 22:27:31 -04:00
9070365020 CUDA: fix logic for clearing padding with -ngl 0 (#13320) b5287 2025-05-05 22:32:13 +02:00