5661 Commits

SHA1 Message Date
1c49c70d07 sync : ggml 2025-05-27 18:05:33 +03:00
a8ea03d8ad ggml : add ggml_repeat_4d (#13824) b5510 2025-05-27 15:53:55 +02:00
05f6ac6283 ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
b5509
2025-05-27 16:21:36 +03:00
bc583e3c63 mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* stricter validation of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
b5508
2025-05-27 14:06:10 +02:00
72b090da2c docs: remove link for llama-cli function calling (#13810) 2025-05-27 08:52:40 -03:00
7fe03e7446 ggml-cpu: x86 feature detection is specific to x86 (#13811) b5506 2025-05-27 13:18:39 +02:00
952f3953c1 ggml : allow CUDA graphs when using pipeline parallelism (#13814) b5505 2025-05-27 13:05:18 +02:00
81713121ee kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions

ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

ggml-ci

* kv-cells : add comments

ggml-ci
b5504
2025-05-27 13:49:41 +03:00
f9cd68398b sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
b5503
2025-05-27 12:07:52 +03:00
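
The guarantee described in the commit above can be illustrated with a minimal Python sketch. This is not the llama.cpp implementation: the function name `min_p_filter` and its parameters are hypothetical. Min-p keeps tokens whose probability is at least `p` times the top probability, and the sketch additionally guarantees at least one survivor, even when `min_keep` is 0 or `p` is greater than 1.

```python
def min_p_filter(probs, p, min_keep=0):
    """Return indices of tokens kept by min-p filtering, most probable first.

    Hypothetical sketch: `probs` is a list of token probabilities.
    """
    if not probs:
        return []
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # A token survives if its probability reaches p times the top probability.
    threshold = p * probs[order[0]]
    kept = [i for i in order if probs[i] >= threshold]
    # Never return an empty candidate list, even if min_keep == 0.
    floor = max(1, min_keep)
    if len(kept) < floor:
        kept = order[:floor]
    return kept
```

With `p` greater than 1 no token reaches the threshold, so the fallback to the most probable token is what keeps the sampler from returning nothing.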
4f81b33e32 llama : validate seq id batch input (#13809)
* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci
b5502
2025-05-27 09:40:59 +03:00
cdf94a1802 server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5501
2025-05-26 22:34:27 +01:00
a26c4cc11e scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00
4265a87b59 cuda : avoid cuGetErrorString (#13791)
ggml-ci
b5499
2025-05-26 22:14:52 +03:00
6f180b915c SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes
ggml-ci

* address review comments: change it to more like SYCL

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div
b5498
2025-05-26 21:10:36 +05:30
03f582ae8f server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
b5497
2025-05-26 16:03:57 +01:00
88c125f2ac examples/training: Fix file name in README (#13803)
This patch fixes binary file names in README.md.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2025-05-26 16:55:24 +02:00
d74e94c1b3 server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
b5495
2025-05-26 14:56:49 +01:00
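
The delta layout described in the commit above (`id` and `type` on the tool call itself, `name` and `arguments` under `function`, empty `arguments` on the first delta) can be sketched from the client side in Python. The helper `merge_tool_call_deltas` is hypothetical, the values `call_0` and `get_weather` are invented for illustration, and the sketch assumes the function name arrives whole in a single delta.

```python
import json


def merge_tool_call_deltas(deltas):
    """Accumulate streamed tool call deltas into complete tool calls, by index."""
    calls = {}
    for d in deltas:
        call = calls.setdefault(
            d["index"],
            {"id": None, "type": None, "function": {"name": "", "arguments": ""}},
        )
        # id and type live on the tool call, not inside "function".
        if "id" in d:
            call["id"] = d["id"]
        if "type" in d:
            call["type"] = d["type"]
        fn = d.get("function", {})
        if "name" in fn:
            call["function"]["name"] = fn["name"]
        # Argument fragments are concatenated in arrival order.
        call["function"]["arguments"] += fn.get("arguments", "")
    return [calls[i] for i in sorted(calls)]


# Illustrative stream: the first delta carries id, type, name and empty arguments.
deltas = [
    {"index": 0, "id": "call_0", "type": "function",
     "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": "{\"city\": "}},
    {"index": 0, "function": {"arguments": "\"Paris\"}"}},
]
merged = merge_tool_call_deltas(deltas)
args = json.loads(merged[0]["function"]["arguments"])
```

Only after the last fragment arrives is the concatenated `arguments` string guaranteed to be valid JSON, which is why the sketch parses it once at the end.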
f13847cfb5 server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
b5494
2025-05-26 14:16:37 +01:00
79c137f776 examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
b5493
2025-05-26 14:03:54 +03:00
22229314fc llama : clarify deprecation message (#13794) b5492 2025-05-26 12:57:50 +03:00
9012eb9b45 sycl: Add more debug prints (#13640) 2025-05-26 10:28:53 +02:00
fef693dc6b vulkan: mark IM2COL as supporting non-contig (#13783) b5490 2025-05-26 06:02:07 +02:00
2d38b6e400 CANN: Add basic support for the Flash Attention kernel (#13627)
* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constraints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline
b5489
2025-05-26 10:20:18 +08:00
e121edc432 server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5488
2025-05-26 00:30:51 +01:00
2f099b510f webui : bump max upload file size to 500MB (#13779) 2025-05-25 18:02:18 +01:00
aa50ba462f tests : improve UGM tokenizer test coverage (#13773) b5486 2025-05-25 16:22:29 +02:00
de2ef53a4b kv-cache : rework kv_cell (#13706)
* kv-cache : rework kv_cell

ggml-ci

* kv-cells : use "shift" instead of "delta" consistently

ggml-ci

* llama : add llama_max_parallel_sequences()

ggml-ci

* kv-cells : update comments [no ci]

* context : fail upon construction if sequences exceed max value

ggml-ci

* kv-cells : get_pos() -> pos_get() + comments

ggml-ci

* kv-cells : fix tracking of "used" cells

ggml-ci
2025-05-25 16:34:36 +03:00
c508256db2 rpc : Fix build on OpenBSD (#13541) b5484 2025-05-25 15:35:53 +03:00
40aaa8a403 mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
* mtmd : add Qwen2-Audio support

* small clean up

* update discussion link

* clarify mtmd_get_output_embd

* clarification in multimodal.md

* fix ultravox bug

* ggml_cont
b5483
2025-05-25 14:06:32 +02:00
a08c1d2845 docs : add Moondream2 pre-quantized link (#13745)
* Multimodal: Added Moondream2 model and fixed ggml.org link

* Apply suggestions from code review

---------

Co-authored-by: name <none@none.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-25 14:04:49 +02:00
d785f9c1fd server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik <ochafik@google.com>
b5481
2025-05-25 10:45:49 +01:00
4032ca4066 llama : add support for Qwen3 MoE tied word embeddings (#13768) b5480 2025-05-25 10:29:43 +02:00
515fdbf7ed SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b0.

ggml-ci
b5479
2025-05-25 10:08:37 +03:00
f5cd27b71d server: streaming of tool calls and thoughts when --jinja is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
b5478
2025-05-25 01:48:08 +01:00
a2d02d5793 releases : bundle llvm omp library in windows release (#13763) b5477 2025-05-25 00:55:16 +02:00
17fc817b58 releases : enable openmp in windows cpu backend build (#13756) b5476 2025-05-24 22:27:03 +02:00
2bd1b30f69 ggml-cpu : set openmp wait time if not set (#13758) b5475 2025-05-24 22:26:47 +02:00
259469c4b5 Move GLM4 f32 attention fix to the correct function (#13750) b5474 2025-05-24 16:49:12 +02:00
4c32832c59 ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon
b5473
2025-05-24 13:06:47 +02:00
c3a2624339 vocab : fix ugm tokenizer precision (#13743) b5472 2025-05-24 12:29:09 +02:00
ffd0eae60b CUDA: fix race condition in FA vector kernels (#13742) b5471 2025-05-24 11:46:19 +02:00
b775345d78 ci : enable winget package updates (#13734) 2025-05-23 23:14:00 +03:00
a70a8a69c2 ci : add winget package updater (#13732) 2025-05-23 22:09:38 +02:00
d13d0f6135 hparams : initialize arrays (#13728)
ggml-ci
b5468
2025-05-23 20:16:13 +03:00
8a2afb7520 llama : allow custom list of swa_layers (#13726) 2025-05-23 17:07:04 +02:00
9ecf3e66a3 server : support audio input (#13714)
* server : support audio input

* add audio support on webui
b5466
2025-05-23 11:03:47 +02:00
faaaff5f94 CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
b5465
2025-05-23 16:47:53 +08:00
e16c4731c7 ggml : fix the order of ggml_unary_op (#13718) b5464 2025-05-23 08:12:48 +02:00
1dcd01960c vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
b5463
2025-05-23 06:45:02 +02:00
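
The idea in the commit body above is that a same-type copy is a raw byte copy, so a shader that moves 4-byte or 2-byte elements can serve any type once the element count is rescaled. A minimal Python sketch of that arithmetic follows. The function `plan_same_type_copy` is hypothetical, it assumes a contiguous tensor, and choosing the width by divisibility of the byte count is an assumption rather than the backend's actual selection logic.

```python
def plan_same_type_copy(n_elements, type_size, block_size=1):
    """Pick a generic copy width and element count for a same-type copy.

    n_elements: logical element count of the tensor
    type_size:  bytes per block of the type (per element if block_size == 1)
    block_size: elements per block (greater than 1 for block-quantized types)
    Returns (width_in_bytes, number_of_width_sized_elements).
    """
    if n_elements % block_size != 0:
        raise ValueError("n_elements must be a multiple of block_size")
    total_bytes = (n_elements // block_size) * type_size
    # Prefer the 4-byte (f32-like) copy, fall back to the 2-byte (f16-like) one.
    for width in (4, 2):
        if total_bytes % width == 0:
            return width, total_bytes // width
    raise ValueError("byte count is not a multiple of 2")
```

For example, a tensor of 256 elements stored in 34-byte blocks of 32 elements occupies 272 bytes, which the sketch maps to 68 four-byte elements.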
c10ed6cbcc vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696) b5462 2025-05-23 06:33:45 +02:00