a127ff1780
use LOG_WARN to replace std::cerr
( #13657 )
b5461
2025-05-23 06:33:08 +02:00
3079e9ac8e
release : fix windows hip release ( #13707 )
...
* release : fix windows hip release
* make single hip release with multiple targets
b5460
2025-05-23 00:21:37 +02:00
8a1d206f1d
tts : fix n_ubatch + make WavTokenizer cache-less ( #13713 )
...
ggml-ci
b5459
2025-05-22 22:21:07 +03:00
797990c4bc
mtmd : add ultravox audio input ( #13623 )
...
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporarily give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
b5458
2025-05-22 20:42:48 +02:00
ab86335760
common: Include torch package for s390x ( #13699 )
...
* common: update requirements.txt to include pytorch nightly for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* common: fix torch installation via pip for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-22 21:31:29 +03:00
cc74d5be99
server : pad small embedding batches ( #13692 )
...
ggml-ci
b5456
2025-05-22 16:33:39 +03:00
5be24af73d
gguf-py : correct charsmap parameter typing ( #13701 )
2025-05-22 14:25:05 +02:00
d394a9aedc
sycl : Remove waits from function calls ( #13702 )
...
* removes the waits in async memcpy functions
b5454
2025-05-22 12:54:43 +01:00
6b56a64690
SYCL: Avoid using with SYCL-Graph for unsupported nodes ( #13587 )
...
Currently, when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` with SYCL on a
CUDA backend, two operations throw an exception from blocking
waits during queue recording:
* `-o CONCAT`: Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
`check_node_graph_compatibility_and_refresh_copy_ops` method
(`ggml/src/ggml-cuda/ggml-cuda.cu`, L2458 at 39e73ae0d6)
for checking whether a graph can be used, even if graphs are enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` that checks
whether a graph can be used for the operations, even if a user has asked for it to be
enabled.
b5453
2025-05-22 16:24:09 +08:00
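The compatibility check described in the SYCL-Graph entry above can be sketched as follows. This is a minimal illustration in Python, not the code from `ggml-sycl.cpp`: the function name `can_use_graph` and the representation of a graph as a list of op-name strings are assumptions made for the example. Only the two op names come from the commit message.

```python
# Ops that the commit message reports as throwing during queue recording.
UNSUPPORTED_GRAPH_OPS = {"CONCAT", "MUL_MAT_ID"}


def can_use_graph(node_ops, graph_requested):
    """Return True only if the user asked for graphs AND every node is recordable.

    node_ops: iterable of op names for the nodes in the compute graph
              (hypothetical representation, for illustration only).
    graph_requested: whether the user enabled graph execution.
    """
    if not graph_requested:
        return False
    return not any(op in UNSUPPORTED_GRAPH_OPS for op in node_ops)


if __name__ == "__main__":
    print(can_use_graph(["ADD", "MUL_MAT"], graph_requested=True))
    print(can_use_graph(["ADD", "CONCAT"], graph_requested=True))
```

The point of the design is that a user request to enable graphs is treated as a preference, and the backend falls back to eager submission when any node cannot be recorded.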
a4e8912dfd
opencl: Add support for multiple devices ( #12622 )
...
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
b5452
2025-05-21 16:21:45 -07:00
edbf42edfd
opencl: fix couple crashes ( #12795 )
...
* opencl: fix couple crashes
* fix kernel launch failures on devices that do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (i.e. let the driver choose the
work-group sizes). This patch does not cover everything, just the
cases tested by test-backend-ops.
* fix sub-buffer creation failures caused by `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
b5451
2025-05-21 13:21:17 -07:00
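The two OpenCL fixes in the entry above come down to small pieces of arithmetic, sketched here in Python. This is an illustration under stated assumptions, not the backend's code: the helper names `pick_local_work_size` and `split_origin` are hypothetical, and splitting a misaligned origin into an aligned base plus a residual offset is only one possible way to satisfy the alignment rule. What is taken from the OpenCL specification is that `CL_DEVICE_MEM_BASE_ADDR_ALIGN` is reported in bits and that a sub-buffer origin must be a multiple of that alignment.

```python
def pick_local_work_size(global_size, preferred_local, supports_non_uniform):
    """Return the local work size to pass to the kernel launch, or None.

    None stands in for passing NULL as local_work_size, which lets the
    driver choose the work-group sizes.
    """
    divides_evenly = all(g % l == 0 for g, l in zip(global_size, preferred_local))
    if supports_non_uniform or divides_evenly:
        return preferred_local
    return None


def split_origin(origin_bytes, base_addr_align_bits):
    """Split a byte offset into (aligned sub-buffer origin, residual offset).

    base_addr_align_bits is the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value,
    which is expressed in bits, so it is converted to bytes first.
    """
    align_bytes = base_addr_align_bits // 8
    aligned = (origin_bytes // align_bytes) * align_bytes
    return aligned, origin_bytes - aligned


if __name__ == "__main__":
    print(pick_local_work_size((100,), (64,), supports_non_uniform=False))
    print(split_origin(300, 1024))
```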
d643bb2c79
releases : build CPU backend separately (windows) ( #13642 )
b5450
2025-05-21 22:09:57 +02:00
8e186ef0e7
hparams : support models for which all layers use SWA ( #13682 )
...
ggml-ci
b5449
2025-05-21 20:00:49 +03:00
5fbfe384d4
server : improve error reporting ( #13680 )
b5448
2025-05-21 19:46:56 +03:00
c76532e7ba
convert : add qwen2vl support for unsloth merges ( #13686 )
2025-05-21 18:40:35 +02:00
2aa777d86d
examples : switch retrieval to llama_encode ( #13685 )
...
* switch retrieval to llama_encode
* enable --no-warmup for retrieval
b5446
2025-05-21 16:57:38 +02:00
eb0f5c28d3
gguf-py : display the invalid gguf type ( #13687 )
...
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-21 16:33:54 +02:00
cf4cb59e64
ggml : add ggml_gelu_erf() ( #13667 )
...
* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggestions
* revert naming order
b5444
2025-05-21 16:26:33 +02:00
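For context on the `ggml_gelu_erf()` entry above: the "not approximated" GELU is x·Φ(x), where Φ is the standard normal CDF, and it can be written with the error function. The sketch below is a plain Python reference computation, not ggml code. The function names are chosen for the example, and the tanh form is shown only as the commonly used approximation for comparison.

```python
import math


def gelu_erf(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh approximation of GELU, for comparison."""
    k = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(k * (x + 0.044715 * x ** 3)))


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  erf={gelu_erf(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two forms agree closely over typical activation ranges, so the choice mainly matters for models whose reference implementation uses the exact variant.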
0d5c742161
server : Add the endpoints /api/tags and /api/chat ( #13659 )
...
* Add the endpoints /api/tags and /api/chat
Add the endpoints /api/tags and /api/chat, and improve the model metadata response
* Remove trailing whitespaces
* Removed code that is not needed for copilot to work.
b5443
2025-05-21 15:15:27 +02:00
42158ae2e8
server : fix first message identification ( #13634 )
...
* server : fix first message identification
When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
* server : Fix checks for first role message for stream=True
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
---------
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
b5442
2025-05-21 15:07:57 +02:00
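The behaviour that the first-message fix above restores can be sketched like this. It is a simplified Python illustration, not the server's implementation: the generator name `stream_deltas` and the reduced chunk shape (a bare delta dictionary) are assumptions for the example. The only property taken from the commit message is that the first streamed message must carry the assistant role.

```python
def stream_deltas(pieces):
    """Yield one delta per generated text piece.

    The first delta carries role="assistant"; later deltas carry content only.
    "First" is tracked with an explicit flag rather than inferred from the
    content of the piece.
    """
    first = True
    for piece in pieces:
        delta = {"content": piece}
        if first:
            delta["role"] = "assistant"
            first = False
        yield delta


if __name__ == "__main__":
    for delta in stream_deltas(["Hel", "lo", "!"]):
        print(delta)
```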
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
b5441
2025-05-21 15:11:13 +03:00
b44890df2e
model : disable SWA for Phi models ( #13676 )
...
* model : disable SWA for Phi models
ggml-ci
* model : update warning message
* model : print warning only if n_swa > 0
* model : fix typo
b5440
2025-05-21 13:09:21 +03:00
33983057d0
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy ( #13647 )
...
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5439
2025-05-21 09:58:49 +08:00
fb1cab201c
vulkan: fix warnings ( #13626 )
...
* small fixes
* remove ifdef
b5438
2025-05-20 21:35:16 +00:00
b7a17463ec
mtmd-helper : bug fix to token batching in mtmd ( #13650 )
...
* Update mtmd-helper.cpp
* Update tools/mtmd/mtmd-helper.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5437
2025-05-20 18:55:30 +02:00
be0239693c
model : fix llama4 graph ( #13663 )
...
ggml-ci
b5436
2025-05-20 19:21:04 +03:00
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
b5435
2025-05-20 16:13:16 +03:00
b69f1647f9
CUDA: skip fully masked-out KV in FA vec kernel ( #13584 )
...
* CUDA: skip fully masked-out KV in FA vec kernel
b5434
2025-05-20 14:45:07 +02:00
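Why skipping fully masked-out KV is safe, as in the CUDA entry above, can be shown with a small reference computation. The Python sketch below is not the CUDA kernel. It processes one query against KV in blocks with a running softmax, uses scalar values instead of vectors to stay short, and assumes at least one KV position is unmasked. The function name `attend` and the block size are illustrative.

```python
import math

NEG_INF = float("-inf")


def attend(scores, mask, values, block=2, skip_masked=True):
    """Single-query attention over KV blocks with a running (online) softmax.

    scores: raw attention scores, one per KV position.
    mask:   additive mask, 0.0 for visible and -inf for masked positions.
    values: one scalar value per KV position (simplification).
    """
    running_max = NEG_INF
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        idx = range(start, min(start + block, len(scores)))
        # A block whose mask is entirely -inf contributes exp(-inf) = 0
        # to every sum, so it can be skipped without loading K or V.
        if skip_masked and all(mask[i] == NEG_INF for i in idx):
            continue
        for i in idx:
            x = scores[i] + mask[i]
            if x == NEG_INF:
                continue  # masked position inside a partially visible block
            new_max = max(running_max, x)
            scale = math.exp(running_max - new_max) if running_max != NEG_INF else 0.0
            w = math.exp(x - new_max)
            running_sum = running_sum * scale + w
            acc = acc * scale + w * values[i]
            running_max = new_max
    return acc / running_sum


if __name__ == "__main__":
    scores = [0.5, 1.0, -0.3, 2.0, 0.1, 0.7]
    mask = [NEG_INF, NEG_INF, 0.0, 0.0, NEG_INF, NEG_INF]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(attend(scores, mask, values, skip_masked=True))
    print(attend(scores, mask, values, skip_masked=False))
```

Both calls produce the same result, because masked positions never enter the running sums in either path.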
759e37b0d8
tests : avoid github urls due to throttling ( #13654 )
2025-05-20 12:03:17 +02:00
4245e622e0
sycl: disable reorder for sycl mulmat ( #13536 )
b5432
2025-05-20 11:34:15 +02:00
c9c64dee57
Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output ( #13639 )
b5431
2025-05-20 10:11:56 +02:00
c00a2634be
metal : fix typo in FA kernel comments ( #13651 )
b5430
2025-05-20 10:41:40 +03:00
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
b5429
2025-05-20 08:05:46 +03:00
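As background for the SWA (sliding-window attention) entry above, the sketch below shows one common definition of a causal sliding-window mask and the oldest position that is still visible from a given query. It is a Python illustration only. The exact window convention used by each model and by the KV cache is not stated in the log, so the rule "attend to the last `n_swa` positions including the current one" is an assumption, as are the helper names.

```python
def swa_mask(n_tokens, n_swa):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed convention: causal, and limited to the n_swa most recent
    positions including position i itself.
    """
    return [[j <= i and i - j < n_swa for j in range(n_tokens)]
            for i in range(n_tokens)]


def min_visible_pos(pos, n_swa):
    """Oldest key position still visible from query position pos."""
    return max(0, pos - n_swa + 1)


if __name__ == "__main__":
    for row in swa_mask(5, 3):
        print("".join("x" if visible else "." for visible in row))
    print(min_visible_pos(10, 4))
```

Under this convention, cache cells older than `min_visible_pos` for the newest token are never read again by that token, which is what makes a smaller SWA cache possible.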
f0adb80bf7
CANN: Update CANN model support ( #13162 )
...
* Update CANN model support status
* Update of model support
* update
* update
* update
* fix format of CANN.md
* fix format of CANN.md
* fix format of CANN.md
2025-05-20 11:43:43 +08:00
f7c9429c85
sycl : Overcoming workaround for mmap() allocation on Windows ( #13482 )
...
* Remove mmap workaround on windows
After some testing I found that mmap is supported on Windows and for
many GPUs on Linux. Therefore I removed the workaround for Windows, since
it is not necessary.
* Update llama-bench README
The SYCL backend introduced a workaround that allows llama-bench to run
without specifying the `--mmp 0` flag
b5427
2025-05-20 08:54:43 +08:00
1dfbf2cf3a
common : add load_progress_callback ( #13617 )
b5426
2025-05-19 21:17:36 +02:00
8960efd0a6
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence ( #13607 )
b5425
2025-05-19 17:54:08 +02:00
725f23f1f3
sycl : backend documentation review ( #13544 )
...
* sycl: reviewing and updating docs
* Updates Runtime error codes
* Improves OOM troubleshooting entry
* Added a llama 3 sample
* Updated supported models
* Updated releases table
2025-05-19 14:38:20 +01:00
92ecdcc06a
mtmd : add vision support for llama 4 ( #13282 )
...
* wip llama 4 conversion
* rm redundant __init__
* fix conversion
* fix conversion
* test impl
* try this
* reshape patch_embeddings_0
* fix view
* rm ffn_post_norm
* cgraph ok
* f32 for pos embd
* add image marker tokens
* Llama4UnfoldConvolution
* correct pixel shuffle
* fix merge conflicts
* correct
* add debug_graph
* logits matched, but it still perceives the image incorrectly
* fix style
* add image_grid_pinpoints
* handle llama 4 preprocessing
* rm load_image_size
* rm unused line
* fix
* small fix 2
* add test & docs
* fix llava-1.6 test
* test: add notion of huge models
* add comment
* add warn about degraded quality
b5423
2025-05-19 13:04:14 +02:00
f71f40a284
ci : upgraded oneAPI version in SYCL workflows and dockerfile ( #13532 )
b5422
2025-05-19 11:46:09 +01:00
d30cb5a7fa
sync : ggml
...
ggml-ci
b5421
2025-05-19 13:29:56 +03:00
6c35981a64
mnist: fix segmentation fault (ggml/1227)
2025-05-19 13:29:56 +03:00
8b5e19aea6
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
2025-05-19 13:29:56 +03:00
60aea028b5
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 13:29:56 +03:00
9c55e5c5c2
fix: check model pointer validity before use ( #13631 )
b5417
2025-05-19 13:25:41 +03:00
33d7aed4a8
CANN: Support MOE Model MUL_MAT_ID ( #13042 )
...
Signed-off-by: noemotiovon <757486878@qq.com>
b5416
2025-05-19 14:21:17 +08:00
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
b5415
2025-05-17 23:59:48 +02:00
e3a7cf6c5b
cmake: use the current build config for vulkan-shaders-gen ( #13595 )
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
b5414
2025-05-17 15:26:43 -03:00
518329b2d4
parallel : add option for non-shared and larger prompts ( #13598 )
...
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci
2025-05-17 12:58:55 +03:00
2f5a4e1e09
vulkan: move common FA code to flash_attn_base.comp ( #13556 )
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
b5412
2025-05-17 09:14:55 +02:00