Commit Graph

5336 Commits

Author SHA1 Message Date
6f67cf1f48 arg : -hf do not fail if url mismatch (#13219)
* arg : -hf do not fail if url mismatch

* do not return if metadata json cannot be parsed
b5236
2025-04-30 21:29:15 +01:00
16a457facd fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221) b5235 2025-04-30 21:28:43 +01:00
3e168bede4 convert : improve model arch handling (#13122)
* convert : improve model arch handling

* use AutoConfig

* rm trust_remote_code

* Update convert_hf_to_gguf.py

* fix self.block_count for vision

* fix NomicBertModel
2025-04-30 16:56:24 +02:00
ceda28ef8e llava : remove duplicate include (#13207) b5233 2025-04-30 15:25:20 +02:00
3b127c7385 common : add -jf / --json-schema-file flag (#12011) b5232 2025-04-30 14:52:35 +02:00
e5007a5edf vulkan: use uint array index to avoid glslang bug (#13193) b5231 2025-04-30 14:38:37 +02:00
416313773b ggml : fix ppc64le build (#13176)
The build fails with a compilation error on PowerPC; this patch fixes it.

Tested with unit tests run via
 cmake --build <build_dir> && cd <build_dir> && make test

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b5230
2025-04-30 13:17:08 +02:00
07c2e2f76c convert : correct typo image_mean --> image_std (#13208) 2025-04-30 13:06:15 +02:00
44cd8d91ff feat(ggml-cpu): enable z17 compile (#13182)
z17 compilation requires GCC 15.1.0 or later

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5228
2025-04-30 10:47:35 +01:00
5933e6fdc9 arg : allow using -hf offline (#13202)
* arg : allow using -hf offline

* add more comments in code [no ci]
2025-04-30 10:46:32 +02:00
da84c04d8f docker : do not build tests (#13204)
* docker : do not build tests

* include "ggml-cpu.h"
b5226
2025-04-30 10:44:07 +02:00
a0f7016d17 rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
b5225
2025-04-30 09:29:22 +03:00
19e899ce21 scripts: n_depth for compare-llama-bench [no ci] (#13201) 2025-04-29 23:32:04 +02:00
e2e1ddb93a server : Prefilling assistant message in OpenAI-compatible API (#13174)
* Prefilling assistant message in OpenAI-compatible API

* fixed indentation

* fixed code convention

* simplify method usage

* no more than one assistant message at end of messages

* merge checks into prefill code

* Update examples/server/utils.hpp

---------

Co-authored-by: matteo <matteo@naspc.lan>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5223
2025-04-29 20:33:10 +02:00
d9d398f84f sampling : when top-k <= 0 -> noop (#13173)
ggml-ci
b5222
2025-04-29 20:22:57 +03:00
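As a hedged illustration of the behavior d9d398f84f describes (this is not the actual llama.cpp sampler code; `token_cand` and the helper name are made up for this sketch), a non-positive k simply leaves the candidate list untouched:

```cpp
// Sketch of "top-k <= 0 -> noop": non-positive k keeps all candidates.
#include <algorithm>
#include <vector>

struct token_cand { int id; float logit; };

static void top_k_sketch(std::vector<token_cand> & cands, int k) {
    if (k <= 0 || (size_t) k >= cands.size()) {
        return; // noop: leave the candidate list untouched
    }
    // keep only the k highest-logit candidates
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
                      [](const token_cand & a, const token_cand & b) { return a.logit > b.logit; });
    cands.resize(k);
}
```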
5a63980117 llama-bench: fixed size of fields to correctly map to values (#13183) b5221 2025-04-29 17:24:36 +02:00
cdf76586b2 CUDA: fix non-cont. inputs for batched mat mul (#13155) b5220 2025-04-29 16:00:27 +02:00
7d3af70b08 llama : llm_type order by size (#13177) b5219 2025-04-29 13:25:53 +02:00
00e3e5a194 mtmd : add qwen2vl and qwen2.5vl (#13141)
* llava : add clip_n_output_tokens, deprecate clip_n_patches

* mtmd : add qwen2vl and qwen2.5vl

* decode_embd_batch::set_position_...

* working version

* deprecate llama-qwen2vl-cli

* correct order W, H of clip_embd_nbytes_by_img

* edit existing line in hot topics
b5218
2025-04-29 11:47:04 +02:00
e98b3692be llama : set qwen3 model type sizes (#13175) b5217 2025-04-29 11:00:31 +02:00
b6ce7430b7 llama-graph : fix text position for mrope (#13159)
* llama-graph : fix text position for mrope

* fix typo

* explicitly set 4th dim in the loop
b5216
2025-04-29 09:45:49 +03:00
5f5e39e1ba model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture (#12466)
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture

- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.

https://www.nomic.ai/blog/posts/nomic-embed-text-v2

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

* fix tokenizer

* don't rename this tensor

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
b5215
2025-04-28 22:52:15 +03:00
eaea325324 clip : fix model size display (#13153) b5214 2025-04-28 21:23:19 +02:00
43ddab6eee fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message, added earlier in this branch, that was
emitted when a tensor ID is not found in the provided map.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
The primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

The check breaks CI on ubuntu-cpu-make; the tensor type is uint32_t, so
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
b5213
2025-04-28 21:00:20 +03:00
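A minimal sketch of the validation pattern described in 43ddab6eee, assuming a simplified `rpc_tensor` wire struct (the real rpc-server message format differs); `ggml_new_tensor_4d` and `GGML_TYPE_COUNT` are the ggml APIs named in the commit:

```cpp
// Hedged sketch of the type/bounds validation described above.
// `rpc_tensor` is a simplified stand-in for the real wire format.
#include "ggml.h"
#include <cstdint>
#include <cstdio>

struct rpc_tensor {
    uint32_t type;   // enum ggml_type as sent by the client
    uint64_t ne[4];  // dimensions
};

// Reject out-of-range types before touching the ggml API; return nullptr
// instead of hitting an assertion inside ggml_new_tensor_4d.
static ggml_tensor * deserialize_tensor_sketch(ggml_context * ctx, const rpc_tensor & t) {
    if (t.type >= GGML_TYPE_COUNT) {
        fprintf(stderr, "invalid tensor type %u\n", t.type);
        return nullptr;
    }
    return ggml_new_tensor_4d(ctx, (ggml_type) t.type,
                              (int64_t) t.ne[0], (int64_t) t.ne[1],
                              (int64_t) t.ne[2], (int64_t) t.ne[3]);
}

// Bounds check in the spirit of the set_tensor handler: fail gracefully
// instead of GGML_ABORT when client-provided offset/size do not fit.
static bool check_in_bounds(size_t buf_size, uint64_t offset, uint64_t size) {
    return offset <= buf_size && size <= buf_size - offset; // also guards against overflow
}
```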
1831f538f7 llama-bench: add -d depth arg (#13096)
* add depth param

* update llama-bench README and add depth param

* llama-bench: default params for depth arg for faster execution

* Update examples/llama-bench/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix buffer print ub

* use user provided args

* remove extra whitespaces

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5212
2025-04-28 16:50:39 +02:00
4e87962e34 mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count

* fix chat template

* temporary disable GLMEdge test chat tmpl
b5211
2025-04-28 16:12:56 +02:00
fb0471d175 context : do not clear output buffer on reserve (#13152)
Co-authored-by: pockers21 <liyang2@uniontech.com>
b5210
2025-04-28 16:45:40 +03:00
d2b2031e5f llama : (mrope) allow using normal 1D position for text token (#13138)
* llama : (mrope) use normal position for text token

* rm n_pos_per_embd from llm_graph_input_attn_temp
b5209
2025-04-28 14:20:56 +02:00
5fa9e63be8 clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
* clip : refactor set input for cgraph

* more strict assert

* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere

* split qwen2 and qwen2.5 code blocks

* minor style fix
b5208
2025-04-28 12:18:59 +02:00
a4c340f974 SYCL: Add all missing unary kernels (#13074)
* SYCL: Add all missing unary kernels

ggml-ci

* decouple kernel launch range from data size using strided loop

* use ceil_div helper for num_blocks
ggml-ci

* clean auto imported header files
b5207
2025-04-28 11:33:25 +02:00
d0a417f3c7 readme : update hot topics (#13150) 2025-04-28 12:10:18 +03:00
43f2b07193 common : fix noreturn compile warning (#13151)
ggml-ci
b5205
2025-04-28 11:57:19 +03:00
e5d6c2554e llama-chat : fix typo GML --> GLM (#13143) b5204 2025-04-28 10:11:58 +02:00
f0dd6a1926 musa: fix typo in cc control (#13144)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-28 09:33:28 +02:00
69699be48a CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137) b5202 2025-04-28 09:29:26 +02:00
85f36e5e71 arg : fix unused variable (#13142) b5201 2025-04-28 08:16:59 +03:00
c0a97b762e llama-bench : Add --override-tensors arg (#12922)
* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split and appear in the test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
b5200
2025-04-27 23:48:26 +02:00
ced44be342 llama-chat : fix wrong template in GLM4-0414 (#13140)
* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check
b5199
2025-04-27 21:57:32 +02:00
e291450b76 musa: fix build warning (#13129)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5198
2025-04-27 13:22:49 +02:00
59e991c23c Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402, as the has_qwen2vl_merger migration was incomplete (#13133) b5197 2025-04-27 12:43:37 +02:00
ca2bb89eac clip : Add Qwen2.5VL support (#12402)
* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix a few incorrect tensor memory layouts

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b5196
2025-04-27 10:10:34 +02:00
2d451c8059 common : add common_remote_get_content (#13123)
* common : add common_remote_get_content

* support max size and timeout

* add tests
b5195
2025-04-26 22:58:12 +02:00
4753791e70 clip : improve projector naming (#13118)
* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused
b5194
2025-04-26 22:39:47 +02:00
77d5e9a76a ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)
* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
b5193
2025-04-26 16:05:31 +02:00
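A hedged usage sketch for the conversion APIs exported in 77d5e9a76a; the exact `ggml_cpu_fp16_to_fp32` signature shown here is an assumption based on the commit message:

```cpp
// Assumes ggml-cpu.h declares roughly:
//   void ggml_cpu_fp16_to_fp32(const ggml_fp16_t * x, float * y, int64_t n);
#include "ggml.h"
#include "ggml-cpu.h"
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> src = {0.5f, -1.25f, 3.0f};
    std::vector<ggml_fp16_t> half(src.size());
    std::vector<float> back(src.size());

    // pack to fp16, then expand back to fp32 via the CPU-backend routine
    for (size_t i = 0; i < src.size(); ++i) {
        half[i] = ggml_fp32_to_fp16(src[i]);
    }
    ggml_cpu_fp16_to_fp32(half.data(), back.data(), (int64_t) back.size());

    for (float v : back) {
        printf("%f\n", v);
    }
}
```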
d5fe4e81bd grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
b5192
2025-04-26 10:10:20 +02:00
295354ea68 llama : fix K-shift with quantized K and BLAS backend (#13113) b5191 2025-04-25 19:40:11 +02:00
558a764713 Force FP32 compute in GLM4 FFN Down (#13101)
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872732.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5190
2025-04-25 14:38:34 +02:00
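For context on 558a764713, a hedged sketch of how F32 accumulation is typically forced for a single mat-mul in ggml; whether the PR wires `ggml_mul_mat_set_prec` into the GLM4 ffn_down exactly like this is an assumption:

```cpp
// Hedged sketch: forcing F32 accumulation for one matrix multiply in ggml.
// The GLM4-specific wiring here is illustrative, not copied from the PR.
#include "ggml.h"

static ggml_tensor * build_ffn_down_f32(ggml_context * ctx,
                                        ggml_tensor * ffn_down_w,
                                        ggml_tensor * cur) {
    ggml_tensor * out = ggml_mul_mat(ctx, ffn_down_w, cur);
    // Request full FP32 compute for this op so backends (e.g. cuBLAS) do not
    // use reduced-precision accumulation that can overflow here.
    ggml_mul_mat_set_prec(out, GGML_PREC_F32);
    return out;
}
```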
edb18b6e8f clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
b5189
2025-04-25 14:31:42 +02:00
514c45608f move the tensor reorder from init to the execute OP (#13003) b5188 2025-04-25 17:37:51 +08:00
553a5c3a9f rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
b5187
2025-04-25 10:08:08 +03:00
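The optimization in 553a5c3a9f can be sketched as below; the framing (command byte + length + payload) is a simplification and not the actual llama.cpp RPC wire format:

```cpp
// Hedged sketch of the fire-and-forget optimization for RPC_CMD_SET_TENSOR.
#include <sys/socket.h>
#include <cstdint>
#include <vector>

static bool send_all(int fd, const void * buf, size_t n) {
    const uint8_t * p = static_cast<const uint8_t *>(buf);
    while (n > 0) {
        ssize_t k = send(fd, p, n, 0);
        if (k <= 0) return false;
        p += k; n -= (size_t) k;
    }
    return true;
}

static bool recv_all(int fd, void * buf, size_t n) {
    uint8_t * p = static_cast<uint8_t *>(buf);
    while (n > 0) {
        ssize_t k = recv(fd, p, n, 0);
        if (k <= 0) return false;
        p += k; n -= (size_t) k;
    }
    return true;
}

// Before: wait for the (always empty) response -> one extra network round trip
// per call, and SET_TENSOR is issued several times per generated token.
static bool set_tensor_blocking(int fd, uint8_t cmd, const std::vector<uint8_t> & payload) {
    uint64_t len = payload.size();
    if (!send_all(fd, &cmd, 1) || !send_all(fd, &len, 8) ||
        !send_all(fd, payload.data(), payload.size())) return false;
    uint64_t resp_len = 0;
    return recv_all(fd, &resp_len, 8) && resp_len == 0;
}

// After: the response carries no information, so just send and return.
static bool set_tensor_nowait(int fd, uint8_t cmd, const std::vector<uint8_t> & payload) {
    uint64_t len = payload.size();
    return send_all(fd, &cmd, 1) && send_all(fd, &len, 8) &&
           send_all(fd, payload.data(), payload.size());
}
```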