17512a94d6
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
* sycl : Implemented reorder Q4_0 mmvq
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
b5330
2025-05-09 16:34:08 +01:00
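A minimal sketch of the ceil_div helper named in the "safe_div -> ceil_div" bullet above (assumed shape, not the exact ggml source):

    // integer division that rounds up, e.g. ceil_div(10, 4) == 3
    template <typename T>
    static constexpr T ceil_div(T num, T den) {
        return (num + den - 1) / den; // valid for positive integers
    }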
611aa914ef
metal : optimize MoE for large batches (#13388)
ggml-ci
b5329
2025-05-09 15:14:56 +03:00
0cf6725e9f
CUDA: FA support for Deepseek (Ampere or newer) (#13306)
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
b5328
2025-05-09 13:34:58 +02:00
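The "loop unrolling via C++ template" bullet refers to making the trip count a compile-time constant; an illustrative sketch under that assumption (names are hypothetical, not the FA kernel itself):

    // with the bound as a template parameter, the compiler can fully unroll the loop
    template <int ncols>
    static void scale_row(float * row, float v) {
        for (int i = 0; i < ncols; ++i) { // ncols is known at compile time
            row[i] *= v;
        }
    }
    // callers dispatch once on the runtime width, e.g. scale_row<64>(row, v);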
27ebfcacba
llama : do not crash if there is no CPU backend (#13395)
* llama : do not crash if there is no CPU backend
* add checks to examples
b5327
2025-05-09 13:02:07 +02:00
5c86c9ed3e
CUDA: fix crash on large batch size for MoE models (#13384)
b5326
2025-05-09 12:14:04 +02:00
efb8b47eda
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
* Add --parse-special for enabling parsing of special tokens in imatrix calculation
* whitespace
b5325
2025-05-09 11:53:58 +02:00
0527771dd8
llama-run: add support for downloading models from ModelScope (#13370)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5324
2025-05-09 10:25:50 +01:00
2189fd3b63
mtmd : fix batch_view for m-rope (#13397)
* mtmd : fix batch_view for m-rope
* nits : fix comment
b5323
2025-05-09 11:18:02 +02:00
3f96aeff39
llama : one-off chat template fix for Mistral-Small-2503 (#13398)
* llama : one-off chat template fix for Mistral-Small-2503
* update readme
* add mistral-v7-tekken
b5322
2025-05-09 11:17:51 +02:00
b486ba05bf
rpc : add rpc_msg_set_tensor_hash_req (#13353)
* rpc : add rpc_msg_set_tensor_hash_req
Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH, which
makes the code cleaner.
* fix
b5321
2025-05-09 10:31:07 +03:00
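A hypothetical sketch of what such a dedicated request struct can look like; the field names and types are assumptions, not the actual llama.cpp definition:

    #include <cstdint>

    struct rpc_msg_set_tensor_hash_req {
        uint64_t tensor; // remote tensor identifier
        uint64_t offset; // write offset into the tensor's data
        uint64_t hash;   // payload hash, used to look up cached data on the server
    };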
02115dcd9a
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512, i.e. 4096 elements, so the array size is increased to accommodate it.
b5320
2025-05-09 09:23:41 +02:00
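Given the title, the bound in the assert was presumably raised to match, i.e. something like:

    GGML_ASSERT(nei0 * nei1 <= 4096); // 8 x 512 row_ids now fit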
d9c4accaff
server : (webui) rename has_multimodal --> modalities (#13393)
* server : (webui) rename has_multimodal --> modalities
* allow converting SVG to PNG
* less complicated code
2025-05-09 09:06:37 +02:00
15e03282bb
ci : limit write permission to only the release step + fixes (#13392)
* ci : limit write permission to only the release step
* fix win cuda file name
* fix license file copy on multi-config generators
b5318
2025-05-08 23:45:22 +02:00
f05a6d71a0
mtmd : Expose helper_decode_image_chunk (#13366)
* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free
* Slim down
* Cleanups
b5317
2025-05-08 20:25:39 +02:00
ee01d71e58
server : (webui) fix a very small misalignment (#13387)
* server : (webui) fix a very small misalignment
* restore font-bold
2025-05-08 18:51:45 +02:00
8c83449cb7
server : (webui) revamp the input area, plus many small UI improvements (#13365)
* rework the input area
* process selected file
* change all icons to heroicons
* fix thought process collapse
* move conversation more menu to sidebar
* sun icon --> moon icon
* rm default system message
* stricter upload file check, only allow image if server has mtmd
* build it
* add renaming
* better autoscroll
* build
* add conversation group
* fix scroll
* extra context first, then user input in the end
* fix <hr> tag
* clean up a bit
* build
* add mb-3 for <pre>
* throttle adjustTextareaHeight to make it less laggy
* (nits) missing padding in sidebar
* rm stray console log
b5315
2025-05-08 15:37:29 +02:00
1a844be132
convert : support rope_scaling type and rope_type (#13349)
2025-05-08 15:34:29 +02:00
0ccc121354
mtmd : fix the calculation of n_tokens for smolvlm (#13381)
Co-authored-by: Taichi Nishimura <Taichi.A.Nishimura@sony.com>
b5313
2025-05-08 15:03:53 +02:00
6562e5a4d6
context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
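A rough usage sketch of the cache-less encode path for embeddings (signatures per llama.h at the time; assumes sequence-level pooling is enabled, and error handling is elided):

    #include "llama.h"

    // embed one tokenized sequence without a KV cache: encode, then read the pooled embedding
    static const float * embed_tokens(llama_context * ctx, llama_token * tokens, int32_t n_tokens) {
        llama_batch batch = llama_batch_get_one(tokens, n_tokens);
        if (llama_encode(ctx, batch) != 0) {
            return nullptr; // encode failed
        }
        return llama_get_embeddings_seq(ctx, 0); // pooled embedding for sequence 0
    }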
51fb96b1ff
context : remove logits_all flag (#13284)
* context : remove logits_all flag
ggml-ci
* llama : remove logits_all flag + reorder llama_context_params
ggml-ci
b5311
2025-05-08 14:26:50 +03:00
70a6991edf
ci : move release workflow to a separate file (#13362)
b5310
2025-05-08 13:15:28 +02:00
f061021206
llama : print size and type of overridden tensors (#13364)
b5309
2025-05-08 13:15:15 +02:00
8733e0cf6e
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)
* Fixed wrong static_cast inside kernel
b5308
2025-05-08 10:08:01 +01:00
814f795e06
docker : disable arm64 and intel images (#13356)
2025-05-07 16:36:33 +02:00
d879433824
sync : ggml
ggml-ci
b5306
2025-05-07 17:28:36 +03:00
13b0a04597
whisper: remove MSVC warning pragmas (whisper/3090)
* ggml : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
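Representative of the removed pattern; the warning numbers shown are the usual MSVC conversion warnings and are illustrative, not a quote of the deleted code:

    #if defined(_MSC_VER)
    #pragma warning(disable: 4244) // conversion, possible loss of data
    #pragma warning(disable: 4267) // conversion from 'size_t', possible loss of data
    #endif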
bba9d945c1
cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs
* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
bc4e1128f7
llama : deci : support ffn-free with attention (#13296)
b5303
2025-05-07 12:49:27 +02:00
39e73ae0d6
common : Add a warning when we can't match samplers from a string or char (#13330)
b5302
2025-05-07 11:23:28 +03:00
1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5301
2025-05-07 09:48:23 +02:00
4773d7a02f
examples : remove infill (#13283)
ggml-ci
b5300
2025-05-07 10:28:02 +03:00
6c7fd67b64
llama : support tied embeddings for chatglm models (#13328)
b5299
2025-05-07 09:23:11 +02:00
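Tied embeddings reuse the token embedding matrix as the output head; a sketch of the usual fallback pattern (tensor names are illustrative, not the chatglm loader code):

    #include "ggml.h"

    // fall back to the token embeddings when the model ships no separate output matrix
    static struct ggml_tensor * lm_head(struct ggml_context * ctx, struct ggml_tensor * cur,
                                        struct ggml_tensor * output,    // may be NULL for tied models
                                        struct ggml_tensor * tok_embd) {
        struct ggml_tensor * w_out = output ? output : tok_embd; // tied: reuse the embedding matrix
        return ggml_mul_mat(ctx, w_out, cur);
    }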
141a908a59
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135)
b5298
2025-05-06 23:35:51 +02:00
32916a4907
clip : refactor graph builder (#13321)
* mtmd : refactor graph builder
* fix qwen2vl
* clean up siglip cgraph
* pixtral migrated
* move minicpmv to a dedicated build function
* move max_feature_layer to build_llava
* use build_attn for minicpm resampler
* fix windows build
* add comment for batch_size
* also support tinygemma3 test model
* qwen2vl does not use RMS norm
* fix qwen2vl norm (2)
b5297
2025-05-06 22:40:24 +02:00
ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)
b5296
2025-05-06 22:36:24 +02:00
91a86a6f35
sampling : don't consider -infinity values in top_n_sigma (#13344)
b5295
2025-05-06 20:24:15 +02:00
f4ed10b69c
cmake : remove arm64 msvc presets (#13342)
2025-05-06 20:15:31 +02:00
1e333d5bba
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled
* SYCL: Disable reorder optimize by default
b5293
2025-05-06 20:27:06 +05:30
2f54e348ad
llama : fix build_ffn without gate (#13336)
* llama : fix build_ffn without gate
* fix build on windows
* Revert "fix build on windows"
This reverts commit fc420d3c7e.
b5292
2025-05-06 14:25:40 +02:00
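For context, the two FFN shapes build_ffn has to handle, sketched with real ggml ops but placeholder tensor names: the gated form computes down(silu(gate(x)) * up(x)), the gate-free form down(silu(up(x))):

    #include "ggml.h"

    static struct ggml_tensor * ffn(struct ggml_context * ctx, struct ggml_tensor * x,
                                    struct ggml_tensor * w_up,
                                    struct ggml_tensor * w_gate, // NULL when the model has no gate
                                    struct ggml_tensor * w_down) {
        struct ggml_tensor * up = ggml_mul_mat(ctx, w_up, x);
        if (w_gate) {
            struct ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w_gate, x));
            up = ggml_mul(ctx, gate, up);          // gated: act(gate) * up
        } else {
            up = ggml_silu(ctx, up);               // no gate: activation on the up projection
        }
        return ggml_mul_mat(ctx, w_down, up);
    }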
2356fb1d53
CUDA: fix bad asserts for partial offload (#13337)
2025-05-06 13:58:51 +02:00
764b85627b
convert : qwen2/3moe : set yarn metadata if present (#13331)
* set yarn metadata if present
* add comment about enabling YaRN
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-05-06 11:12:06 +02:00
15a28ec8c7
CUDA: fix --split-mode row for MMQ (#13323)
b5289
2025-05-06 08:36:46 +02:00
a7366faa5b
gguf-py : avoid requiring pyside6 for other scripts (#13036)
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed
Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entry points in pyproject.toml can directly refer to the main functions.
gguf-v0.16.3
2025-05-05 22:27:31 -04:00
9070365020
CUDA: fix logic for clearing padding with -ngl 0 (#13320)
b5287
2025-05-05 22:32:13 +02:00
233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering
* revert: sampler ordering
* revert: VS' crappy auto-formatting
* revert: VS' crappy auto-formatting pt.2
* revert: my crappy eye sight...
* sampling: add XTC to Top-nσ sampler chain
* sampling: add Dynamic Temperature to Top-nσ sampler chain
* sampling: actually remove Top-nσ from sampler (oops)
* Integrate top_n_sigma into main sampler chain
* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA
* Formatting
* Lint
* Exit early in the sampler if nsigma < 0
---------
Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
b5286
2025-05-05 22:12:19 +02:00
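A minimal sketch of the Top-nσ idea, not the llama.cpp implementation: keep only candidates whose logit lies within n standard deviations of the maximum. The no-op guard and the -infinity handling mirror the two follow-up commits above (b5296, b5295):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    static void top_n_sigma(std::vector<float> & logits, float n) {
        const float neg_inf = -std::numeric_limits<float>::infinity();
        if (n <= 0.0f || logits.size() <= 1) return;   // no-op at <=0 or a single candidate
        double sum = 0.0, sum2 = 0.0;
        float  maxl = neg_inf;
        size_t cnt  = 0;
        for (float l : logits) {
            if (l == neg_inf) continue;                // don't consider masked-out tokens
            maxl  = std::max(maxl, l);
            sum  += l;
            sum2 += (double) l * l;
            ++cnt;
        }
        if (cnt == 0) return;
        const double mean  = sum / (double) cnt;
        const double sigma = std::sqrt(std::max(0.0, sum2 / (double) cnt - mean * mean));
        const float  cut   = maxl - n * (float) sigma; // keep logits in [max - n*sigma, max]
        for (float & l : logits) {
            if (l != neg_inf && l < cut) l = neg_inf;  // mask out-of-band candidates
        }
    }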
b34c859146
server : (webui) change setText command from parent window to also send the message (#13309)
* setText command from parent window for llama-vscode now sends the message automatically.
* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.
* Fix code formatting.
* Add index.html.gz changes.
* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."
This reverts commit 67687b7fda.
* easier approach
* add setTimeout
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-05 16:03:31 +02:00
9b61acf060
mtmd : rename llava directory to mtmd (#13311)
* mv llava to mtmd
* change ref everywhere
b5284
2025-05-05 16:02:55 +02:00
5215b91e93
clip : fix confused naming ffn_up and ffn_down (#13290)
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
b5283
2025-05-05 12:54:44 +02:00
ae803bfc3d
convert : bailingmoe : set yarn metadata if present (#13312)
2025-05-05 12:34:26 +02:00
66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
ggml-ci
b5281
2025-05-05 13:39:10 +05:30