Commit Graph

5678 Commits

Author SHA1 Message Date
f5cd27b71d server: streaming of tool calls and thoughts when --jinja is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
b5478
2025-05-25 01:48:08 +01:00
a2d02d5793 releases : bundle llvm omp library in windows release (#13763) b5477 2025-05-25 00:55:16 +02:00
17fc817b58 releases : enable openmp in windows cpu backend build (#13756) b5476 2025-05-24 22:27:03 +02:00
2bd1b30f69 ggml-cpu : set openmp wait time if not set (#13758) b5475 2025-05-24 22:26:47 +02:00
259469c4b5 Move GLM4 f32 attention fix to the correct function (#13750) b5474 2025-05-24 16:49:12 +02:00
4c32832c59 ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon
b5473
2025-05-24 13:06:47 +02:00
c3a2624339 vocab : fix ugm tokenizer precision (#13743) b5472 2025-05-24 12:29:09 +02:00
ffd0eae60b CUDA: fix race condition in FA vector kernels (#13742) b5471 2025-05-24 11:46:19 +02:00
b775345d78 ci : enable winget package updates (#13734) 2025-05-23 23:14:00 +03:00
a70a8a69c2 ci : add winget package updater (#13732) 2025-05-23 22:09:38 +02:00
d13d0f6135 hparams : initialize arrays (#13728)
ggml-ci
b5468
2025-05-23 20:16:13 +03:00
8a2afb7520 llama : allow custom list of swa_layers (#13726) 2025-05-23 17:07:04 +02:00
9ecf3e66a3 server : support audio input (#13714)
* server : support audio input

* add audio support on webui
b5466
2025-05-23 11:03:47 +02:00
faaaff5f94 CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
b5465
2025-05-23 16:47:53 +08:00
e16c4731c7 ggml : fix the order of ggml_unary_op (#13718) b5464 2025-05-23 08:12:48 +02:00
1dcd01960c vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
b5463
2025-05-23 06:45:02 +02:00
c10ed6cbcc vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696) b5462 2025-05-23 06:33:45 +02:00
a127ff1780 use LOG_WARN to replace std::cerr (#13657) b5461 2025-05-23 06:33:08 +02:00
3079e9ac8e release : fix windows hip release (#13707)
* release : fix windows hip release

* make single hip release with multiple targets
b5460
2025-05-23 00:21:37 +02:00
8a1d206f1d tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci
b5459
2025-05-22 22:21:07 +03:00
797990c4bc mtmd : add ultravox audio input (#13623)
* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()
b5458
2025-05-22 20:42:48 +02:00
ab86335760 common: Include torch package for s390x (#13699)
* common: update requirements.txt to include pytorch nightly for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* common: fix torch installation via pip for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-22 21:31:29 +03:00
cc74d5be99 server : pad small embedding batches (#13692)
ggml-ci
b5456
2025-05-22 16:33:39 +03:00
5be24af73d gguf-py : correct charsmap parameter typing (#13701) 2025-05-22 14:25:05 +02:00
d394a9aedc sycl : Remove waits from function calls (#13702)
* removes the waits in async memcpy functions
b5454
2025-05-22 12:54:43 +01:00
6b56a64690 SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458))
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
b5453
2025-05-22 16:24:09 +08:00
a4e8912dfd opencl: Add support for multiple devices (#12622)
* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.
b5452
2025-05-21 16:21:45 -07:00
edbf42edfd opencl: fix couple crashes (#12795)
* opencl: fix couple crashes

* fix kernel launches failed on devices which do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let driver choose the
  work-group sizes). This patch does not cover everything - just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
  being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
b5451
2025-05-21 13:21:17 -07:00
d643bb2c79 releases : build CPU backend separately (windows) (#13642) b5450 2025-05-21 22:09:57 +02:00
8e186ef0e7 hparams : support models for which all layers use SWA (#13682)
ggml-ci
b5449
2025-05-21 20:00:49 +03:00
5fbfe384d4 server : improve error reporting (#13680) b5448 2025-05-21 19:46:56 +03:00
c76532e7ba convert : add qwen2vl support for unsloth merges (#13686) 2025-05-21 18:40:35 +02:00
2aa777d86d examples : switch retrieval to llama_encode (#13685)
* switch retrieval to llama_encode

* enable --no-warmup for retrieval
b5446
2025-05-21 16:57:38 +02:00
eb0f5c28d3 gguf-py : display the invalid gguf type (#13687)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-21 16:33:54 +02:00
cf4cb59e64 ggml : add ggml_gelu_erf() (#13667)
* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggesions

* revert naming order
b5444
2025-05-21 16:26:33 +02:00
0d5c742161 server : Add the endpoints /api/tags and /api/chat (#13659)
* Add the endpoints /api/tags and /api/chat

Add the endpoints /api/tags and /api/chat, and improved the model metadata response

* Remove trailing whitespaces

* Removed code that is not needed for copilot to work.
b5443
2025-05-21 15:15:27 +02:00
42158ae2e8 server : fix first message identification (#13634)
* server : fix first message identification

When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.

Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>

* server : Fix checks for first role message for stream=True

Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>

---------

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
b5442
2025-05-21 15:07:57 +02:00
797f2ac062 kv-cache : simplify the interface (#13660)
* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci
b5441
2025-05-21 15:11:13 +03:00
b44890df2e model : disable SWA for Phi models (#13676)
* model : disable SWA for Phi models

ggml-ci

* model : update warning message

* model : print warning only if n_swa > 0

* model : fix typo
b5440
2025-05-21 13:09:21 +03:00
33983057d0 musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5439
2025-05-21 09:58:49 +08:00
Eve
fb1cab201c vulkan: fix warnings (#13626)
* small fixes

* remove ifdef
b5438
2025-05-20 21:35:16 +00:00
b7a17463ec mtmd-helper : bug fix to token batching in mtmd (#13650)
* Update mtmd-helper.cpp

* Update tools/mtmd/mtmd-helper.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5437
2025-05-20 18:55:30 +02:00
be0239693c model : fix llama4 graph (#13663)
ggml-ci
b5436
2025-05-20 19:21:04 +03:00
a4090d1174 llama : remove llama_kv_cache_view API + remove deprecated (#13653)
ggml-ci
b5435
2025-05-20 16:13:16 +03:00
b69f1647f9 CUDA: skip fully masked-out KV in FA vec kernel (#13584)
* CUDA: skip fully masked-out KV in FA vec kernel
b5434
2025-05-20 14:45:07 +02:00
759e37b0d8 tests : avoid github urls due to throttling (#13654) 2025-05-20 12:03:17 +02:00
4245e622e0 sycl: disable reorder for sycl mulmat (#13536) b5432 2025-05-20 11:34:15 +02:00
c9c64dee57 Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639) b5431 2025-05-20 10:11:56 +02:00
c00a2634be metal : fix typo in FA kernel comments (#13651) b5430 2025-05-20 10:41:40 +03:00
e298d2fbd0 kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
b5429
2025-05-20 08:05:46 +03:00