79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
...
ggml-ci
b5493
2025-05-26 14:03:54 +03:00
22229314fc
llama : clarify deprecation message ( #13794 )
b5492
2025-05-26 12:57:50 +03:00
9012eb9b45
sycl: Add more debug prints ( #13640 )
2025-05-26 10:28:53 +02:00
fef693dc6b
vulkan: mark IM2COL as supporting non-contig ( #13783 )
b5490
2025-05-26 06:02:07 +02:00
2d38b6e400
CANN: Add the basic supports of Flash Attention kernel ( #13627 )
...
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
b5489
2025-05-26 10:20:18 +08:00
e121edc432
server: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) ( #13771 )
...
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
b5488
2025-05-26 00:30:51 +01:00
2f099b510f
webui : bump max upload file size to 500MB ( #13779 )
2025-05-25 18:02:18 +01:00
aa50ba462f
tests : improve UGM tokenizer test coverage ( #13773 )
b5486
2025-05-25 16:22:29 +02:00
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
c508256db2
rpc : Fix build on OpenBSD ( #13541 )
b5484
2025-05-25 15:35:53 +03:00
40aaa8a403
mtmd : add support for Qwen2-Audio and SeaLLM-Audio ( #13760 )
...
* mtmd : add Qwen2-Audio support
* small clean up
* update discussion link
* clarify mtmd_get_output_embd
* clarification in multimodal.md
* fix ultravox bug
* ggml_cont
b5483
2025-05-25 14:06:32 +02:00
a08c1d2845
docs : add Moondream2 pre-quantized link ( #13745 )
...
* Multimodal: Added Moondream2 model and fixed ggml.org link
* Apply suggestions from code review
---------
Co-authored-by: name <none@none.com >
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-25 14:04:49 +02:00
d785f9c1fd
server: fix/test add_generation_prompt ( #13770 )
...
Co-authored-by: ochafik <ochafik@google.com >
b5481
2025-05-25 10:45:49 +01:00
4032ca4066
llama : add support for Qwen3 MoE tied word embeddings ( #13768 )
b5480
2025-05-25 10:29:43 +02:00
515fdbf7ed
SYCL: revert "sycl: simplify bin_bcast_kernel ( #13383 )" ( #13752 )
...
Temporarily reverted due to failing fp16 DIV operation
This reverts commit 02cdd2d8b0.
ggml-ci
b5479
2025-05-25 10:08:37 +03:00
f5cd27b71d
server: streaming of tool calls and thoughts when `--jinja` is on ( #12379 )
...
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
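The "truncated json healing" item above refers to making a partially streamed JSON document parseable before the model has finished emitting it. The sketch below illustrates the general idea with a simplified bracket-closing approach; it is not the `common_json` implementation, and the function name `heal_truncated_json` is hypothetical. It assumes the input was cut inside a string, array, or object value, and does not handle input cut right after a key, colon, or comma, or in the middle of a number, literal, or `\u` escape.

```python
import json


def heal_truncated_json(text: str) -> str:
    """Close any string, array, or object left open by truncation."""
    closers = []        # closing characters still owed, innermost last
    in_string = False
    escaped = False     # True when the previous char was a backslash in a string
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    healed = text
    if in_string:
        if escaped:
            healed = healed[:-1]   # drop a dangling backslash
        healed += '"'
    return healed + "".join(reversed(closers))


partial = '{"name": "get_weather", "arguments": {"city": "Par'
print(json.loads(heal_truncated_json(partial)))
```

A streaming server could parse the healed text after each chunk and send only the difference from the previous parse, which is the role the `common_chat_msg_diff` item appears to play.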
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com >
b5478
2025-05-25 01:48:08 +01:00
a2d02d5793
releases : bundle llvm omp library in windows release ( #13763 )
b5477
2025-05-25 00:55:16 +02:00
17fc817b58
releases : enable openmp in windows cpu backend build ( #13756 )
b5476
2025-05-24 22:27:03 +02:00
2bd1b30f69
ggml-cpu : set openmp wait time if not set ( #13758 )
b5475
2025-05-24 22:26:47 +02:00
259469c4b5
Move GLM4 f32 attention fix to the correct function ( #13750 )
b5474
2025-05-24 16:49:12 +02:00
4c32832c59
ggml : add ggml_gelu_erf() CUDA kernel ( #13719 )
...
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
b5473
2025-05-24 13:06:47 +02:00
c3a2624339
vocab : fix ugm tokenizer precision ( #13743 )
b5472
2025-05-24 12:29:09 +02:00
ffd0eae60b
CUDA: fix race condition in FA vector kernels ( #13742 )
b5471
2025-05-24 11:46:19 +02:00
b775345d78
ci : enable winget package updates ( #13734 )
2025-05-23 23:14:00 +03:00
a70a8a69c2
ci : add winget package updater ( #13732 )
2025-05-23 22:09:38 +02:00
d13d0f6135
hparams : initialize arrays ( #13728 )
...
ggml-ci
b5468
2025-05-23 20:16:13 +03:00
8a2afb7520
llama : allow custom list of swa_layers ( #13726 )
2025-05-23 17:07:04 +02:00
9ecf3e66a3
server : support audio input ( #13714 )
...
* server : support audio input
* add audio support on webui
b5466
2025-05-23 11:03:47 +02:00
faaaff5f94
CANN: Support MUL_MAT_ID for q8_0 and q4_0 ( #13705 )
...
* [CANN]Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <757486878@qq.com >
* codestyle adjustment
Signed-off-by: noemotiovon <757486878@qq.com >
---------
Signed-off-by: noemotiovon <757486878@qq.com >
b5465
2025-05-23 16:47:53 +08:00
e16c4731c7
ggml : fix the order of ggml_unary_op ( #13718 )
b5464
2025-05-23 08:12:48 +02:00
1dcd01960c
vulkan: support CPY from any type to itself ( #13695 )
...
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
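The element-count scaling described above can be sketched as follows. This is an illustration under assumptions, not the Vulkan backend code: the function name `plan_same_type_copy` and the rule for choosing between the 4-byte and 2-byte shader are hypothetical. The idea is that a same-type copy is a raw byte copy, so the tensor can be reinterpreted as a buffer of f32 or f16 words.

```python
def plan_same_type_copy(n_elements: int, block_elements: int, block_bytes: int):
    """Return (shader, word_count) for copying a tensor to one of the same type.

    block_elements / block_bytes describe the type's storage block,
    e.g. 32 elements in 18 bytes for ggml's q4_0.
    """
    if n_elements % block_elements != 0:
        raise ValueError("element count must be a whole number of blocks")
    total_bytes = (n_elements // block_elements) * block_bytes
    if block_bytes % 4 == 0:
        return "f32", total_bytes // 4   # copy as 4-byte words
    if block_bytes % 2 == 0:
        return "f16", total_bytes // 2   # copy as 2-byte words
    raise ValueError("block size is not a multiple of 2 bytes")


print(plan_same_type_copy(64, 32, 18))   # two q4_0 blocks
print(plan_same_type_copy(1024, 1, 4))   # plain f32 tensor
```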
b5463
2025-05-23 06:45:02 +02:00
c10ed6cbcc
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it ( #13696 )
b5462
2025-05-23 06:33:45 +02:00
a127ff1780
use LOG_WARN to replace `std::cerr` ( #13657 )
b5461
2025-05-23 06:33:08 +02:00
3079e9ac8e
release : fix windows hip release ( #13707 )
...
* release : fix windows hip release
* make single hip release with multiple targets
b5460
2025-05-23 00:21:37 +02:00
8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less ( #13713 )
...
ggml-ci
b5459
2025-05-22 22:21:07 +03:00
797990c4bc
mtmd : add ultravox audio input ( #13623 )
...
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
b5458
2025-05-22 20:42:48 +02:00
ab86335760
common: Include torch package for s390x ( #13699 )
...
* common: update requirements.txt to include pytorch nightly for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* common: fix torch installation via pip for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
2025-05-22 21:31:29 +03:00
cc74d5be99
server : pad small embedding batches ( #13692 )
...
ggml-ci
b5456
2025-05-22 16:33:39 +03:00
5be24af73d
gguf-py : correct charsmap parameter typing ( #13701 )
2025-05-22 14:25:05 +02:00
d394a9aedc
sycl : Remove waits from function calls ( #13702 )
...
* removes the waits in async memcpy functions
b5454
2025-05-22 12:54:43 +01:00
6b56a64690
SYCL: Avoid using with SYCL-Graph for unsupported nodes ( #13587 )
...
Currently, when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` with SYCL on a
CUDA backend, two operations throw an exception from blocking
waits during queue recording:
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
`check_node_graph_compatibility_and_refresh_copy_ops` method
(39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu, L2458-L2458)
for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
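The compatibility check described above can be pictured roughly as follows. This is a sketch under assumptions, not the code added to `ggml-sycl.cpp`: the names `GRAPH_UNSUPPORTED_OPS`, `graph_is_compatible`, and `should_use_graph` are hypothetical, and the set of unsupported operations is taken only from the two cases listed in this message.

```python
# Ops that issue blocking waits and therefore cannot be recorded (per the list above).
GRAPH_UNSUPPORTED_OPS = frozenset({"CONCAT", "MUL_MAT_ID"})


def graph_is_compatible(node_ops, unsupported=GRAPH_UNSUPPORTED_OPS) -> bool:
    """True if no node in the compute graph uses an unsupported op."""
    return not any(op in unsupported for op in node_ops)


def should_use_graph(user_enabled: bool, node_ops) -> bool:
    """Use graph recording only if the user asked for it AND every node allows it."""
    return user_enabled and graph_is_compatible(node_ops)


print(should_use_graph(True, ["MUL_MAT", "ADD"]))      # graph can be recorded
print(should_use_graph(True, ["MUL_MAT", "CONCAT"]))   # falls back to eager execution
```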
b5453
2025-05-22 16:24:09 +08:00
a4e8912dfd
opencl: Add support for multiple devices ( #12622 )
...
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
b5452
2025-05-21 16:21:45 -07:00
edbf42edfd
opencl: fix couple crashes ( #12795 )
...
* opencl: fix couple crashes
* fix kernel launch failures on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failing due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
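The two fixes above come down to small decisions that can be sketched in isolation. The helpers below are hypothetical illustrations, not the backend's code: `pick_local_work_size` returns `None` as a stand-in for passing NULL as `local_work_size`, and `split_origin` shows one possible way to satisfy the alignment rule, by rounding the origin down and carrying the remainder as an extra offset. Note that OpenCL reports `CL_DEVICE_MEM_BASE_ADDR_ALIGN` in bits, while `cl_buffer_region::origin` is in bytes.

```python
def pick_local_work_size(global_size: int, preferred_local: int,
                         supports_non_uniform: bool):
    """Return a local size, or None (NULL) to let the driver choose."""
    if global_size % preferred_local == 0 or supports_non_uniform:
        return preferred_local
    return None


def split_origin(origin_bytes: int, base_addr_align_bits: int):
    """Split a byte offset into (aligned sub-buffer origin, remaining offset)."""
    align_bytes = base_addr_align_bits // 8
    aligned = origin_bytes // align_bytes * align_bytes
    return aligned, origin_bytes - aligned


print(pick_local_work_size(100, 64, supports_non_uniform=False))
print(split_origin(1000, 1024))   # alignment of 1024 bits = 128 bytes
```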
b5451
2025-05-21 13:21:17 -07:00
d643bb2c79
releases : build CPU backend separately (windows) ( #13642 )
b5450
2025-05-21 22:09:57 +02:00
8e186ef0e7
hparams : support models for which all layers use SWA ( #13682 )
...
ggml-ci
b5449
2025-05-21 20:00:49 +03:00
5fbfe384d4
server : improve error reporting ( #13680 )
b5448
2025-05-21 19:46:56 +03:00
c76532e7ba
convert : add qwen2vl support for unsloth merges ( #13686 )
2025-05-21 18:40:35 +02:00
2aa777d86d
examples : switch retrieval to llama_encode ( #13685 )
...
* switch retrieval to llama_encode
* enable --no-warmup for retrieval
b5446
2025-05-21 16:57:38 +02:00
eb0f5c28d3
gguf-py : display the invalid gguf type ( #13687 )
...
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com >
2025-05-21 16:33:54 +02:00
cf4cb59e64
ggml : add ggml_gelu_erf() ( #13667 )
...
* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggestions
* revert naming order
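For context, the "not approximated" GELU named above is the exact form based on the error function, GELU(x) = 0.5 · x · (1 + erf(x / √2)). The sketch below compares it with the common tanh approximation; it is a plain-Python illustration rather than the ggml kernel, and the helper names `gelu_erf` and `gelu_tanh` are chosen here for illustration only.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  erf={gelu_erf(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two forms agree closely but not exactly, which is presumably why models trained with the exact form benefit from a dedicated op.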
b5444
2025-05-21 16:26:33 +02:00