6107303ab0
llama : remove logits_all flag + reorder llama_context_params
...
ggml-ci
2025-05-08 13:01:41 +03:00
6c0501adf7
context : remove logits_all flag
...
ggml-ci
2025-05-08 13:01:34 +03:00
8733e0cf6e
sycl: addressing non-contiguous src1 mul_mats (nc and batched) ( #13343 )
...
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)
* Fixed wrong static_cast inside kernel
b5308
2025-05-08 10:08:01 +01:00
814f795e06
docker : disable arm64 and intel images ( #13356 )
2025-05-07 16:36:33 +02:00
d879433824
sync : ggml
...
ggml-ci
b5306
2025-05-07 17:28:36 +03:00
13b0a04597
whisper: remove MSVC warning pragmas (whisper/3090)
...
* ggml : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
bba9d945c1
cmake : removed stdc++fs (whisper/3097)
...
* removed stdc++fs
* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
bc4e1128f7
llama : deci : support ffn-free with attention ( #13296 )
b5303
2025-05-07 12:49:27 +02:00
39e73ae0d6
common : Add a warning when we can't match samplers from a string or char. ( #13330 )
b5302
2025-05-07 11:23:28 +03:00
1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile ( #13325 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5301
2025-05-07 09:48:23 +02:00
4773d7a02f
examples : remove infill ( #13283 )
...
ggml-ci
b5300
2025-05-07 10:28:02 +03:00
6c7fd67b64
llama : support tie embedding for chatglm models ( #13328 )
b5299
2025-05-07 09:23:11 +02:00
141a908a59
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF ( #13135 )
b5298
2025-05-06 23:35:51 +02:00
32916a4907
clip : refactor graph builder ( #13321 )
...
* mtmd : refactor graph builder
* fix qwen2vl
* clean up siglip cgraph
* pixtral migrated
* move minicpmv to a dedicated build function
* move max_feature_layer to build_llava
* use build_attn for minicpm resampler
* fix windows build
* add comment for batch_size
* also support tinygemma3 test model
* qwen2vl does not use RMS norm
* fix qwen2vl norm (2)
b5297
2025-05-06 22:40:24 +02:00
ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate ( #13345 )
b5296
2025-05-06 22:36:24 +02:00
91a86a6f35
sampling : don't consider -infinity values in top_n_sigma ( #13344 )
b5295
2025-05-06 20:24:15 +02:00
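Note on the two top_n_sigma commits above (#13344, #13345): together they pin down the intended filtering rule, i.e. keep only tokens whose logit lies within n·σ of the maximum logit, treat n <= 0 or a single candidate as a no-op, and skip -infinity logits when computing the mean/stddev. A minimal standalone sketch of that rule, for illustration only; the `candidate` struct is a hypothetical stand-in for the actual llama.cpp token data types:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct candidate { int id; float logit; };   // hypothetical stand-in for llama_token_data

// Keep tokens whose logit >= max_logit - n_sigma * stddev(logits).
static void top_n_sigma_filter(std::vector<candidate> & cands, float n_sigma) {
    if (n_sigma <= 0.0f || cands.size() <= 1) {
        return; // no-op at <= 0 or a single candidate (#13345)
    }

    float  max_logit = -INFINITY;
    double sum = 0.0, sum_sq = 0.0;
    size_t n = 0;
    for (const auto & c : cands) {
        if (c.logit == -INFINITY) {
            continue; // masked tokens must not skew the mean/stddev (#13344)
        }
        max_logit = std::max(max_logit, c.logit);
        sum    += c.logit;
        sum_sq += (double) c.logit * c.logit;
        n++;
    }
    if (n == 0) {
        return;
    }

    const double mean   = sum / n;
    const double sigma  = std::sqrt(std::max(0.0, sum_sq / n - mean * mean));
    const float  cutoff = max_logit - n_sigma * (float) sigma;

    cands.erase(std::remove_if(cands.begin(), cands.end(),
                               [cutoff](const candidate & c) { return c.logit < cutoff; }),
                cands.end());
}
```

With this cutoff, already-masked (-infinity) tokens are still discarded, but they no longer inflate the measured spread of the distribution.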
f4ed10b69c
cmake : remove arm64 msvc presets ( #13342 )
2025-05-06 20:15:31 +02:00
1e333d5bba
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled ( #13254 )
...
* SYCL: Do not set tensor extras when reorder optimize is disabled
* SYCL: Disable reorder optimize by default
b5293
2025-05-06 20:27:06 +05:30
2f54e348ad
llama : fix build_ffn without gate ( #13336 )
...
* llama : fix build_ffn without gate
* fix build on windows
* Revert "fix build on windows"
This reverts commit fc420d3c7e.
b5292
2025-05-06 14:25:40 +02:00
2356fb1d53
CUDA: fix bad asserts for partial offload ( #13337 )
2025-05-06 13:58:51 +02:00
764b85627b
convert : qwen2/3moe : set yarn metadata if present ( #13331 )
...
* set yarn metadata if present
* add comment about enabling YaRN
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-05-06 11:12:06 +02:00
15a28ec8c7
CUDA: fix --split-mode row for MMQ ( #13323 )
b5289
2025-05-06 08:36:46 +02:00
a7366faa5b
gguf-py : avoid requiring pyside6 for other scripts ( #13036 )
...
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed
Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.
gguf-v0.16.3
2025-05-05 22:27:31 -04:00
9070365020
CUDA: fix logic for clearing padding with -ngl 0 ( #13320 )
b5287
2025-05-05 22:32:13 +02:00
233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) ( #13264 )
...
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering
* revert: sampler ordering
* revert: VS' crappy auto-formatting
* revert: VS' crappy auto-formatting pt.2
* revert: my crappy eye sight...
* sampling: add XTC to Top-nσ sampler chain
* sampling: add Dyna. Temp. to Top-nσ sampler chain
* sampling: actually remove Top-nσ from sampler (oops)
* Integrate top_n_sigma into main sampler chain
* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA
* Formatting
* Lint
* Exit early in the sampler if nsigma < 0
---------
Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
b5286
2025-05-05 22:12:19 +02:00
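For readers unfamiliar with what "integrating into the main sampling chain" means in practice: llama.cpp composes samplers as chain elements, so with Top-nσ available as a chain member the selection step can be assembled roughly as below. This is a hedged usage sketch, not the server's actual wiring; it assumes the llama.h sampler-chain API (llama_sampler_chain_init/_add, llama_sampler_init_top_n_sigma, etc.), whose exact names and signatures may differ between versions, and the sampler ordering is illustrative only.

```cpp
#include "llama.h"

// Hedged sketch (not the actual common/server sampler wiring): builds a small
// sampler chain with Top-n-sigma in it and samples one token from the last
// decoded position.
static llama_token sample_with_top_n_sigma(llama_context * ctx, float n_sigma, float temp) {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(n_sigma)); // e.g. 1.0f
    llama_sampler_chain_add(chain, llama_sampler_init_temp(temp));           // e.g. 0.8f
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    // sample from the logits of the last decoded position
    const llama_token tok = llama_sampler_sample(chain, ctx, -1);

    llama_sampler_free(chain);
    return tok;
}
```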
b34c859146
server : Webui - change setText command from parent window to also send the message. ( #13309 )
...
* setText command from parent window for llama-vscode now sends the message automatically.
* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.
* Fix code formatting.
* Add index.html.gz changes.
* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."
This reverts commit 67687b7fda.
* easier approach
* add setTimeout
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-05 16:03:31 +02:00
9b61acf060
mtmd : rename llava directory to mtmd ( #13311 )
...
* mv llava to mtmd
* change ref everywhere
b5284
2025-05-05 16:02:55 +02:00
5215b91e93
clip : fix confused naming ffn_up and ffn_down ( #13290 )
...
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
b5283
2025-05-05 12:54:44 +02:00
ae803bfc3d
convert : bailingmoe : set yarn metadata if present ( #13312 )
2025-05-05 12:34:26 +02:00
66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b ( #13308 )
...
ggml-ci
b5281
2025-05-05 13:39:10 +05:30
27aa259532
mtmd : add C public API ( #13184 )
...
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* add const to various places
* add warning about breaking changes
* helper: use mtmd_image_tokens_get_n_pos
b5280
2025-05-04 23:43:42 +02:00
9fdfcdaedd
rpc : use backend registry, support dl backends ( #13304 )
b5279
2025-05-04 21:25:43 +02:00
6eb7d25c70
ggml : activate s390x simd for Q3_K ( #13301 )
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5278
2025-05-04 19:49:12 +02:00
86bd60d3fe
llava/mtmd : fixes to fully support dl backends ( #13303 )
b5277
2025-05-04 17:05:20 +02:00
9f2da5871f
llama : build windows releases with dl backends ( #13220 )
b5276
2025-05-04 14:20:49 +02:00
93c4e23905
CUDA: fix race condition in MMQ stream-k fixup ( #13299 )
b5275
2025-05-04 14:16:39 +02:00
8afbd96818
CUDA: fix race condition in MMQ ids_dst ( #13294 )
b5274
2025-05-04 13:58:38 +02:00
8ae5ebcf85
vulkan: Additional type support for unary, binary, and copy ( #13266 )
...
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
b5273
2025-05-04 07:17:16 +02:00
3e959f0976
imatrix: fix oob writes if src1 is not contiguous ( #13286 )
b5272
2025-05-04 00:50:37 +02:00
36667c8edc
clip : revert the change of BOI/EOI token for GLM-edge ( ⚠️ breaking change) ( #13259 )
b5271
2025-05-03 20:07:54 +02:00
3bf785f3ef
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support ( #12843 )
b5270
2025-05-03 17:39:51 +02:00
1d36b3670b
llama : move end-user examples to tools directory ( #13249 )
...
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
b5269
2025-05-02 20:27:13 +02:00
b34443923c
sync : ggml ( #13268 )
...
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
* review: remove src_x/y < 0 checks; add performance tests
* sync : ggml
ggml-ci
* vulkan : fix lint (#0 )
---------
Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
a75cb30dc9
context : fix reorder logic ( #13267 )
...
ggml-ci
b5267
2025-05-02 20:54:13 +03:00
3f3769ba76
ggml : Enable MMA for BF16 in llamafile_sgemm ( #13148 )
...
This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.
This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens / total time), across various batch sizes tested with the llama-batched-bench benchmark.
The patch is tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated by using llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine.
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b5266
2025-05-02 19:53:12 +03:00
2f567611c0
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken ( #13245 )
b5265
2025-05-02 11:42:30 -04:00
7d2123484e
convert : use correct context length for nomic-embed-text-v2 ( #13216 )
2025-05-02 11:41:54 -04:00
074e42ab31
convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf ( #13209 )
...
* wip
* qwen2.5vl ok
* vision: fix models missing "text_config"
* add test
* fix test repo name
* fix 32B model
* Revert "fix 32B model"
This reverts commit 651752f1ae.
* clarify about 32B
* rm qwen surgery script
* update llava/readme
* move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel
2025-05-02 17:17:15 +02:00
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl ( #12799 )
...
* kv-cache : separate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> constructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
cb06a3c363
llama : orion rope type is neox ( #13261 )
b5261
2025-05-02 12:44:24 +02:00