Reese Levine
9515c6131a
ggml: WebGPU disable SET_ROWS for now ( #15078 )
...
* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* Disable set_rows until it's implemented
* Fix potential issue around empty queue submission
* Try synchronous submission
* Try waiting on all futures explicitly
* Add debug
* Add more debug messages
* Work on getting ssh access for debugging
* Debug on failure
* Disable other tests
* Remove extra if
* Try more locking
* maybe passes?
* test
* Some cleanups
* Restore build file
* Remove extra testing branch ci
b6097
2025-08-05 16:26:38 -07:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss ( #15091 )
...
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7 )
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1 )
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11 )
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6 )
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13 )
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
b6096
2025-08-05 22:10:36 +03:00
Sigbjørn Skjæret
f324a3b715
chat : only remove double bos/eos if added ( #15086 )
...
* only remove double bos/eos if added
* fix tests
b6095
2025-08-05 20:43:36 +02:00
Georgi Gerganov
be42642581
readme : update hot topics ( #15097 )
2025-08-05 20:19:33 +03:00
Romain Biessy
3306ceabf0
sycl: fix mul_mat selection ( #15092 )
b6093
2025-08-05 18:39:55 +02:00
Juk Armstrong
c81de6e107
Fix glm4moe bug ( #15088 )
b6092
2025-08-05 13:56:44 +01:00
Alex Wu
22f060c9c4
webui: fix markdown table ( #15081 )
...
* webui: fix markdown table
* webui: fix table display with themes
2025-08-05 13:56:44 +02:00
compilade
ee3a9fcf88
context : fix index overflow on huge outputs ( #15080 )
...
* context : fix overflow when re-ordering huge outputs
* context : fix logits size overflow for huge batches
b6090
2025-08-05 11:27:45 +02:00
Diego Devesa
ec428b02c3
llama : add --n-cpu-moe option ( #15077 )
...
* llama : add --n-cpu-moe option
Keeps the MoE weights of the first N layers on the CPU
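A minimal usage sketch (binary, model path, and layer count are placeholders):
```console
# keep the MoE expert weights of the first 10 layers on the CPU
./llama-cli -m models/moe-model.gguf --n-cpu-moe 10 -p "Hello"
```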
b6089
2025-08-05 01:05:36 +02:00
compilade
19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix ( #15076 )
...
* imatrix : add warning when suffix is not .gguf for GGUF imatrix
* imatrix : only warn about suffix when output format is unspecified
b6088
2025-08-04 23:26:52 +02:00
Christian Kastner
41613437ff
cmake: Add GGML_BACKEND_DIR option ( #15074 )
...
* cmake: Add GGML_BACKEND_DIR option
This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.
* Fix phrasing
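A hypothetical distro-style configure that exercises the new option (the install path is a placeholder):
```console
cmake -B build -DGGML_BACKEND_DL=ON -DGGML_BACKEND_DIR=/usr/lib/ggml
cmake --build build
```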
b6087
2025-08-04 21:29:14 +02:00
Sigbjørn Skjæret
e5bebe5251
gguf-py : add --chat-template-file to gguf_new_metadata ( #15075 )
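A sketch of the new flag, assuming gguf-py's console entry point is installed (file names are placeholders):
```console
pip install gguf
gguf-new-metadata in.gguf out.gguf --chat-template-file chat_template.jinja
```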
2025-08-04 21:01:48 +02:00
Sam
ef0144c087
model: support GLM 4.5 family of models ( #14939 )
...
* model: Add GLM 4.5 (#14921 )
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Merge in PR suggestions
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: Add GLM 4.5 family of models (#14921 )
1. Updated tensor_mapping.py with NextN tensor mappings
- Added proper tensor mappings for all NextN/MTP tensors in gguf-py/gguf/tensor_mapping.py
- Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm
2. Added num_nextn_predict_layers configuration
- Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
- Added num_nextn_predict_layers field to llama_hparams struct
- Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
- Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
- Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
- Updated conversion script to extract and write this parameter from HuggingFace config
3. Added FIM tokens for GLM4_MOE
- Added GLM-4.5's FIM tokens to llama-vocab.cpp:
- <|code_prefix|> for FIM_PRE
- <|code_suffix|> for FIM_SUF
- <|code_middle|> for FIM_MID
4. Removed manual NextN tensor handling
- Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
- NextN tensors are now handled automatically through the proper tensor mapping system
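With these mappings in place, conversion should go through the regular script; a sketch (paths and outtype are placeholders):
```console
python convert_hf_to_gguf.py /path/to/GLM-4.5 --outfile glm-4.5.gguf --outtype bf16
```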
* glm 4.5 update tensors names
* model: glm 4.5 apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
* Apply suggestions from code review
* patch broken chat template
* typings fix
* add TENSOR_SKIP flag
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update src/llama-model-loader.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
b6085
2025-08-04 20:29:25 +02:00
Sigbjørn Skjæret
2721257e3e
quantize : fix confusing error message if ftype is invalid ( #15071 )
b6084
2025-08-04 18:11:02 +02:00
Reese Levine
587d0118f5
ggml: WebGPU backend host improvements and style fixing ( #14978 )
...
* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
b6083
2025-08-04 08:52:43 -07:00
Jeff Bolz
5aa1105da2
vulkan: fix build when using glslang that does not support coopmat2 ( #15062 )
b6082
2025-08-04 07:09:19 +02:00
compilade
d31192b4ee
imatrix : use GGUF by default ( #14842 )
...
* imatrix : use GGUF by default
* imatrix : use GGUF regardless of the output filename
The legacy format can only be produced with --output-format dat
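A sketch of both modes, assuming the tool's usual -m/-f/-o flags (paths are placeholders):
```console
# default: GGUF output
./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.gguf
# the legacy format must now be requested explicitly
./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat --output-format dat
```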
b6081
2025-08-03 22:00:05 +02:00
compilade
0a2f5496be
imatrix : fix 3d activation handling for hybrid and recurrent models ( #14994 )
...
* imatrix : use a single count for dense 3d tensors
* imatrix : fix 3d activations when model tensor is 2d
* imatrix : fix 3d tensor counts
b6080
2025-08-03 21:49:13 +02:00
compilade
11a3811164
memory : handle kv_unified for hybrid models ( #15050 )
b6079
2025-08-03 21:43:07 +02:00
Csaba Kecskemeti
97366dc6ab
vocab : JetBrains Mellum pre-tokenizer ( #15045 )
b6078
2025-08-03 21:38:18 +02:00
Gabriel Larson
83bc2f288c
model : add text-only support for Kimi-VL (and find special tokens in text_config) ( #15051 )
...
* basic kimi-vl textmodel conversion
* check config["text_config"] for special tokens
2025-08-03 16:56:25 +02:00
Jeff Bolz
6c7a441161
vulkan: Use coopmat2 for conv2d ( #14982 )
b6076
2025-08-03 14:23:57 +02:00
lhez
5c0eb5ef54
opencl: fix adreno compiler detection logic ( #15029 )
b6075
2025-08-02 19:51:18 +02:00
Johannes Gäßler
03d4698218
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 ( #15035 )
b6074
2025-08-02 16:37:08 +02:00
leejet
3303c19b16
cuda: make im2col a little faster ( #15025 )
b6073
2025-08-02 17:15:36 +03:00
Daniel Bevenius
4fdea540bd
kv-cache : skip alignment of n_stream in kv-cache log msg [no ci] ( #15040 )
...
This commit removes the right alignment of the `n_stream` value in the
log message in the `llama_kv_cache_unified` constructor.
The motivation for this change is to enhance the readability of the log
message. Currently the output looks like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
Notice that the `n_stream` value is right aligned, which makes it a
little harder to read.
With the change in this commit, the output looks like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
2025-08-02 17:14:57 +03:00
Georgi Gerganov
a4569c41fd
llama : enable LLAMA_SET_ROWS=1 by default ( #14959 )
...
ggml-ci
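A sketch of opting back out via the environment variable (the assumption here is that 0 restores the previous behavior):
```console
LLAMA_SET_ROWS=0 ./llama-cli -m model.gguf -p "test"
```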
b6071
2025-08-02 17:14:21 +03:00
Georgi Gerganov
15e92fd337
cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ( #15038 )
...
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1
ggml-ci
* cont : fix cont types
ggml-ci
* cont : adopt variable names and comment from the other branch
b6070
2025-08-02 17:13:05 +03:00
Sigbjørn Skjæret
2bf3fbf0b5
ci : check that pre-tokenizer hashes are up-to-date ( #15032 )
...
* torch is not required for convert_hf_to_gguf_update
* add --check-missing parameter
* check that pre-tokenizer hashes are up-to-date
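A sketch of the new parameter; the exact semantics in the comment are an assumption based on the commit body:
```console
# verify that the pre-tokenizer hashes in convert_hf_to_gguf.py are up-to-date
python convert_hf_to_gguf_update.py --check-missing
```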
2025-08-02 14:39:01 +02:00
Douglas Hanley
711d5e6fe6
convert : fix Qwen3-Embedding pre-tokenizer hash ( #15030 )
2025-08-02 12:51:02 +02:00
Jhen-Jie Hong
f738989dcb
chat : fix multiple tool_calls on hermes-2-pro ( #14962 )
b6067
2025-08-02 18:04:48 +08:00
Jeff Bolz
4cb208c93c
vulkan: coopmat2 mul_mat optimizations ( #14934 )
...
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
b6066
2025-08-02 11:21:37 +02:00
R0CKSTAR
3025b621d1
llama-bench: rename DB table name from test to llama_bench ( #15003 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
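A sketch of where the renamed table shows up (the column names are assumptions about llama-bench's SQL schema):
```console
./llama-bench -m model.gguf -o sql | sqlite3 bench.db
sqlite3 bench.db 'SELECT build_commit, avg_ts FROM llama_bench;'
```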
b6065
2025-08-02 17:20:40 +08:00
Jeff Bolz
ec0b18802c
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply ( #15015 )
b6064
2025-08-02 10:48:30 +02:00
Douglas Hanley
339bd0268c
model : support Qwen3-Embedding ( #15023 )
b6063
2025-08-02 10:44:50 +02:00
Johannes Gäßler
f906275537
server: enable token array inputs for OAI API ( #15001 )
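A sketch of a token-array prompt against the OAI-compatible endpoint (port and token IDs are placeholders):
```console
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" \
    -d '{"prompt": [151644, 872, 198], "max_tokens": 16}'
```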
b6062
2025-08-02 10:12:41 +02:00
Jeff Bolz
a9f7541ec2
vulkan: optimizations for direct convolution ( #14933 )
...
* vulkan: optimizations for direct convolution
- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.
* Three tiles sizes for CONV_2D, and a heuristic to choose
* reallow collectives for pre-Turing
* make SHMEM_PAD a spec constant
* fixes for intel perf - no shmem padding, placeholder shader core count
* shader variants with/without unrolling
* 0cc4m's fixes for AMD perf
Co-authored-by: 0cc4m <picard12@live.de>
---------
Co-authored-by: 0cc4m <picard12@live.de>
b6061
2025-08-02 09:57:04 +02:00
Johannes Gäßler
9c35706b98
CUDA: fix MMQ nwarps for AMD with warp_size==32 ( #15014 )
b6060
2025-08-01 20:47:32 +02:00
l-austenfeld
c76b420e4c
vendor : update vendored copy of google/minja ( #15011 )
...
* vendor : update vendored copy of google/minja
Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>
* Re-remove trailing whitespace
Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>
* Remove another trailing whitespace
Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>
---------
Signed-off-by: Lennart Austenfeld <l.austenfeld@googlemail.com>
b6059
2025-08-01 16:59:06 +02:00
stevenkuang
0f5ccd6fd1
model : add hunyuan dense ( #14878 )
...
* support hunyuan_v1_dense
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* update hunyuan_moe to hunyuan_v1_moe
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* fix rope alpha assert and bos token
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* add blank line
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* Revert "update hunyuan_moe to hunyuan_v1_moe"
This reverts commit aa973ca219.
* use hunyuan_dense instead of hunyuan_v1_dense
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* fix hunyuan_moe chat template
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* remove leftover code
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* update hunyuan dense chat template
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
* fix hunyuan dense vocab and chat template
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
---------
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
b6058
2025-08-01 15:31:12 +02:00
lhez
1c872f71fb
opencl: add f16 for add, sub, mul, div ( #14984 )
b6057
2025-08-01 13:15:44 +02:00
Srihari-mcw
baad94885d
ggml : Q2k interleaving implementation - x86/x64 SIMD ( #14373 )
...
* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments
---------
Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
b6056
2025-08-01 09:20:33 +03:00
Georgi Gerganov
ba42794c9e
graph : fix equal_seq() check ( #14986 )
...
ggml-ci
b6055
2025-08-01 06:38:12 +03:00
diannao
2860d479b4
docker : add cann build pipeline ( #14591 )
...
* docker: add cann build pipeline
* docker: add cann build pipeline
* docker: fix cann devops
* cann : fix multi card hccl
* Update ggml/src/ggml-cann/ggml-cann.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Update ggml-cann.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b6054
2025-08-01 10:02:34 +08:00
R0CKSTAR
484b2091ce
compare-commits.sh: support both llama-bench and test-backend-ops ( #14392 )
...
* compare-commits.sh: support both llama-bench and test-backend-ops
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
* Speed up the build by specifying -j 12
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Remove build_number from test-backend-ops db
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Apply suggestion from @JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Refine tool selection logic
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
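A usage sketch; the tool-selection flag spelling here is a guess, so check the script's help for the actual interface:
```console
# compare two commits with test-backend-ops instead of the default llama-bench
./scripts/compare-commits.sh master my-branch --tool test-backend-ops
```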
2025-08-01 08:47:27 +08:00
Ed Addario
daf2dd7880
quantize : skip tensor override when in fallback mode ( #14995 )
b6052
2025-07-31 21:32:18 +02:00
Diego Devesa
a06ed5feae
llama : add simple option to enable CPU for MoE weights (--cpu-moe) ( #14992 )
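A minimal sketch (binary and model path are placeholders):
```console
# keep all MoE expert weights on the CPU, offloading the rest as usual
./llama-server -m models/moe-model.gguf --cpu-moe
```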
b6051
2025-07-31 20:15:41 +02:00
Aman Gupta
784524053d
Fix params bug in diffusion example ( #14993 )
b6050
2025-08-01 01:22:58 +08:00
Diego Devesa
d6818d06a6
llama : allow other bufts when overriding to CPU, add --no-repack option ( #14990 )
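A sketch combining the two behaviors; the interaction shown is an assumption based on the option names:
```console
# override experts to CPU but skip runtime repacking of those weights
./llama-cli -m models/moe-model.gguf --cpu-moe --no-repack
```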
b6049
2025-07-31 18:11:34 +02:00
Ruben Ortlam
e08a98826b
Vulkan: Fix minor debug mode issues ( #14899 )
...
* vulkan: fix debug mode issues
* vulkan: remove broken check_results GGML_OP_SET_ROWS support
b6048
2025-07-31 17:46:54 +02:00