rmatif
6bdda13981
opencl: add tiled mul_mat_f16_f32 ( #14535 )
* add tiled mul_mat_f16_f32
* fix trailing whitespace
* add insightful comments
b5867
2025-07-10 14:58:12 -07:00
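The tiled mul_mat_f16_f32 kernel added above multiplies an f16 weight matrix by f32 activations one tile at a time, so each tile can be kept in fast local memory. The commit body does not show the kernel; below is a minimal CPU-side C++ sketch of the same tiling idea, assuming row-major layouts. The TILE constant and the f16_to_f32 helper are illustrative, not taken from the kernel.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Minimal IEEE 754 half -> float conversion (subnormals flushed to zero for brevity).
static float f16_to_f32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if      (exp == 0)  bits = sign;                               // zero / subnormal -> 0
    else if (exp == 31) bits = sign | 0x7F800000u | (mant << 13);  // inf / NaN
    else                bits = sign | ((exp + 112) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// C = A * B with A (MxK, f16), B (KxN, f32), C (MxN, f32), all row-major.
// Work proceeds one TILE-sized block at a time so the A/B tiles stay hot in
// cache -- the same idea the OpenCL kernel applies with local memory.
constexpr int TILE = 32; // illustrative tile size, not the kernel's actual choice

void mul_mat_f16_f32_tiled(const uint16_t * A, const float * B, float * C,
                           int M, int N, int K) {
    for (int i = 0; i < M * N; ++i) C[i] = 0.0f;
    for (int i0 = 0; i0 < M; i0 += TILE)
    for (int j0 = 0; j0 < N; j0 += TILE)
    for (int k0 = 0; k0 < K; k0 += TILE) {
        const int i1 = std::min(i0 + TILE, M);
        const int j1 = std::min(j0 + TILE, N);
        const int k1 = std::min(k0 + TILE, K);
        for (int i = i0; i < i1; ++i)
        for (int k = k0; k < k1; ++k) {
            const float a = f16_to_f32(A[i * K + k]);
            for (int j = j0; j < j1; ++j) {
                C[i * N + j] += a * B[k * N + j];
            }
        }
    }
}
```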
lhez
0b8855775c
opencl: add set_rows for f16 and f32 ( #14547 )
* opencl: add `set_rows` for `f16` and `f32`
* opencl: better choose workgroup size for `set_rows`
b5866
2025-07-10 11:48:52 -07:00
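set_rows scatters rows of a source tensor into a destination tensor at row positions given by an index tensor (this is how the KV cache gets written). A small C++ sketch of that scatter, plus a toy version of a "choose the workgroup size" heuristic; the index type, shapes, and the heuristic's constants are assumptions for illustration, not the OpenCL kernel's actual choices.

```cpp
#include <cstdint>
#include <cstring>

// Scatter: for each source row r, copy it into dst at row idx[r].
// src: n_rows x n_cols (f32), dst: at least max(idx)+1 rows x n_cols (f32), idx: n_rows (i64).
void set_rows_f32(const float * src, const int64_t * idx, float * dst,
                  int64_t n_rows, int64_t n_cols) {
    for (int64_t r = 0; r < n_rows; ++r) {
        std::memcpy(dst + idx[r] * n_cols, src + r * n_cols, n_cols * sizeof(float));
    }
}

// Toy heuristic in the spirit of "better choose workgroup size": one work-item
// per row element, capped at the device limit and rounded down to a multiple
// of the subgroup width (all three values here are illustrative assumptions).
int choose_workgroup_size(int n_cols, int device_max = 256, int subgroup = 32) {
    int wg = n_cols < device_max ? n_cols : device_max;
    wg -= wg % subgroup;
    return wg > 0 ? wg : subgroup;
}
```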
Ryan Mangeno
4bb625b713
SmolDocling support ( #14597 )
* support for smoldocling
* fixed merge conflicts
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com >
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com >
* merge conflicts
* pre tokenizer merge fix
* convert : fix smollm3 jinja template (#14586 )
Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com >
* support for smoldocling
Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com >
* fixed merge conflicts
Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com >
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-model.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* safetensors tensor mapping
Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com >
* added back accidentally removed clean spaces for hunyuan
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* updated hash and reordered model list
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update include/llama.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* removed old tensor name
* removed tensor mappings -> handled by smolvlm
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
---------
Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com >
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com >
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co >
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
Co-authored-by: compilade <git@compilade.net >
b5865
2025-07-10 19:41:00 +02:00
Aman Gupta
11ee0fea2a
Docs: script to auto-generate ggml operations docs ( #14598 )
* Docs: script to auto-generate ggml operations docs
* Review: formatting changes + change github action
* Use built-in types instead of typing
* docs : add BLAS and Metal ops
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
b5864
2025-07-10 23:29:01 +08:00
Eric Zhang
a457551332
cmake : do not search for curl libraries by ourselves ( #14613 )
* cmake : do not search for curl libraries by ourselves
* run : do not search for curl libraries by ourselves
b5863
2025-07-10 15:29:05 +03:00
Akarshan Biswas
704bb7a71c
SYCL: Initial set_rows kernel implementation ( #14562 )
* SYCL: Initial set_rows kernel implementation
* Revert max_threads to 256
* Refactor set_rows and address review comments
* Deduplicate conversion function
* Remove guard before kernel launch and refactor
* Fix and add back SFINAE
b5862
2025-07-10 09:29:38 +01:00
Xuan-Son Nguyen
435a6d10d6
llama : minor coding style fix for smollm3 ( #14605 )
b5861
2025-07-10 10:00:20 +03:00
Eric Zhang
f9a867f592
cmake : bump llguidance version to v1.0.1 ( #14609 )
b5860
2025-07-10 08:19:37 +03:00
Eric Zhang
ac44eb6c80
cmake : llguidance build parser library only ( #14608 )
b5859
2025-07-10 08:19:13 +03:00
compilade
a57d1bcb3c
cuda : support Falcon-H1 state size for SSM_SCAN ( #14602 )
b5858
2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen
cb9178f885
llama : remove llm_graph_input_one ( #14603 )
b5857
2025-07-09 23:09:28 +02:00
compilade
4a5686da22
llama : support Jamba hybrid Transformer-Mamba models ( #7531 )
* wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba
(and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : use std::find for seq_nodes in llama_rs_cache
* llama : state checkpoints for recurrent models
* llama : correctly handle more edge cases for the rs cache
* llama : rename many llama_kv_cache_* functions
* llama : remove useless return value for some llama_cache_* functions
* llama : rethink recurrent state cell counts
* llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* llama : support Jamba
* llama : fix BERT inference without KV cache
* convert-hf : check for unprocessed Jamba experts
* convert-hf : support Mini-Jamba conversion
* llama : fix Jamba quantization sanity checks
* llama : sequence-length-aware batch splitting
* llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
* llama : fix batch split output count for embeddings
* llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
* llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
* llama : avoid copies for simple batch splits
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
* llama : fix .base() compilation error on Windows
* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it,
and this makes Mamba's conv step slightly faster.
* mamba : fix non-contiguous usage of ggml_silu
* llama : session saving and reloading for hybrid models
* convert_hf : fix Jamba conversion
* llama : fix mixed signedness comparison
* llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
* llama : begin renaming llama_past back to llama_kv_cache
* llama : remove implicit recurrent state rollbacks
* llama : partially apply clang-format style
* convert : fix jamba conv1d shape squeezing
* graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
* model : add Jamba to Mamba-specific hparams printing
* jamba : remove redundant nullptr initializations
* model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* model : make falcon-h1 use shared mamba2 layer builder
* memory : avoid referring to KV in recurrent cache logs
* gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
b5856
2025-07-09 14:59:57 -04:00
Xuan-Son Nguyen
98bab638fb
ggml : add ggml_scale_bias ( #14417 )
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code looks more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
b5855
2025-07-09 18:16:12 +02:00
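ggml_scale_bias extends plain scaling with an added bias, i.e. y = x*s + b, and ggml_vec_mad1_f32 is its per-element vector helper. Below is a scalar C++ reference of that helper under an assumed, obvious signature; the real implementation adds SIMD paths and, on Apple, vDSP_vsmsa.

```cpp
#include <cstddef>

// Scalar reference for the fused scale+bias: y[i] = x[i]*s + b.
// Signature assumed for illustration; ggml's version dispatches to SIMD where available.
void vec_mad1_f32(size_t n, float * y, const float * x, float s, float b) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * s + b;
    }
}
```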
Miaoqian Lin
26a48ad699
ggml : prevent integer overflow in gguf tensor size calculation ( #14595 )
b5854
2025-07-09 14:33:53 +02:00
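A GGUF tensor's byte size is the product of its dimensions and the element size, and with untrusted metadata that product can silently wrap a 64-bit integer. A hedged sketch of the checked-multiplication pattern such a fix relies on; the helper names and call shape are assumptions, not the commit's code.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Multiply two sizes, failing loudly instead of silently wrapping on overflow.
static uint64_t checked_mul_u64(uint64_t a, uint64_t b) {
    if (a != 0 && b > std::numeric_limits<uint64_t>::max() / a) {
        throw std::overflow_error("gguf: tensor size overflows uint64_t");
    }
    return a * b;
}

// Example: total byte size of a tensor with ne[] dimensions and ts bytes per element.
static uint64_t tensor_nbytes(const uint64_t * ne, int n_dims, uint64_t ts) {
    uint64_t n = ts;
    for (int i = 0; i < n_dims; ++i) {
        n = checked_mul_u64(n, ne[i]);
    }
    return n;
}
```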
Dowon
ffd59e7d18
model : add skt/A.X-4.0 model vocabulary ( #14589 )
b5853
2025-07-09 11:22:31 +03:00
Sigbjørn Skjæret
105554595f
llama : remove unintended whitespace ( #14592 )
b5852
2025-07-09 10:19:50 +02:00
ibrahim khadraoui
04655063c4
model : add support for Falcon-H1 family ( #14534 )
* v1
* push more fixes
* another fix
* fix
* more fixes
* minor fix
* more cleaning on python code
* python fixes
* changed precision for multipliers float 32->64
* fixes
* another fix
* fix
* pre-norm -> norm
* fix
* Revert "fix"
This reverts commit 243e4d1a50.
* fix
* small fix ffn_norm
* try
* mix instead of max
* fix vocab size
* conflict solve
* fixed multipliers
* falcon-h1 specific vocab resolved
* read arch from gguf.MODEL_ARCH
* mamba_d_ssm added to d_inner find_hparam
* remove unused functions from gguf_writer.py
* override modify_tensors instead of get_tensors
* fix conversion and d_inner
* added some cb functions for debugging purposes
* inp_out_ids moved outside of layers loop
* mup_vec create as float64
* fix rope_theta
* injected mup
* clean ups
* rm extra space
* rm unused MAMBA_CHUNK_SIZE
* rm unused key
* add bos False
* changed ROPE_TYPE
* cleaning debugging stuff
* cleaning debug quant
* fix comment
* some cleanups
* some cleanups
* Update src/llama-model-loader.cpp
* more cleanups
* more cleanups
* d_ssm -> d_inner;
* cleaning unused hparams
* cleanup
* more cleanups
* more cleanups on python conversion;
* minor cleanups
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* remove todo
* added falcon-h1
* tensor not required
* clean
* remove unneeded attributes
* more cleanups and fixed conversion
* remove final_norm
* flake8 fixes
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* flake8 fixes
* Update src/llama-hparams.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* added hashes
* Update src/llama-arch.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* update the update file
* Revert "update the update file"
This reverts commit 082ab4ad2a.
* fix: address suggestions
* fix: update convert_hf_to_gguf.py
* Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* d_inner fixed
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* reshaping ssm_norm for 34B
* removing generate_mup
* remove duplicates metadata keys
* rm comment
* final comment
* fix unused args
* fix constants
* fix bad merge
* Update src/llama-model.cpp
Co-authored-by: compilade <git@compilade.net >
* falcon-h1: remove unused ssm_in_b and bad merge
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* falcon-h1: fix last comment
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <git@compilade.net >
* falcon-h1: revert add_add_bos(False)
* falcon-h1: fix tied weights
* falcon-h1: remove whitespace
* falcon-h1: fix wrong size param
* falcon-h1: fix whitespace issues
---------
Co-authored-by: younesbelkada <younes.belkada@tii.ae >
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
Co-authored-by: compilade <git@compilade.net >
b5851
2025-07-09 10:03:49 +02:00
Xuan-Son Nguyen
20b7bf8a32
convert : fix smollm3 jinja template ( #14586 )
2025-07-09 09:26:13 +03:00
Jeff Bolz
6efcd65945
vulkan: optimize flash attention split_k_reduce ( #14554 )
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
b5849
2025-07-08 20:11:42 +02:00
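In split-k flash attention each split produces a partial output together with its running row max M and exp-sum L, and the reduce step merges them under a shared max before normalizing. Below is a scalar C++ sketch of one common formulation of that reduction; the Vulkan shader spreads the same arithmetic across the workgroup, one thread per element of the HSV dimension, and its exact bookkeeping may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Combine k_num partial flash-attention results for one output row.
// Each split k carries: a partial (unnormalized) output O[k] of size D (the HSV
// dimension), a running row max M[k], and a running exp-sum L[k].
void fa_split_k_reduce(const std::vector<std::vector<float>> & O, // k_num x D
                       const std::vector<float> & M,              // k_num
                       const std::vector<float> & L,              // k_num
                       std::vector<float> & out) {                // D
    const size_t k_num = O.size();
    const size_t D     = out.size();
    float m = -INFINITY;
    for (size_t k = 0; k < k_num; ++k) m = std::max(m, M[k]);        // shared max
    float l = 0.0f;
    for (size_t k = 0; k < k_num; ++k) l += std::exp(M[k] - m) * L[k]; // merged exp-sum
    for (size_t d = 0; d < D; ++d) {
        float acc = 0.0f;
        for (size_t k = 0; k < k_num; ++k) acc += std::exp(M[k] - m) * O[k][d];
        out[d] = acc / l;                                              // final normalization
    }
}
```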
stevenkuang
699f4392a3
model : fix hunyuan moe chat template ( #14584 )
Signed-off-by: stevenkuang <stevenkuang@tencent.com >
b5848
2025-07-08 18:29:29 +02:00
Xuan-Son Nguyen
08382869a2
model : add SmolLM3 ( #14581 )
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code
---------
Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com >
b5847
2025-07-08 18:07:01 +02:00
compilade
bb4f7a9e4e
memory : fix broken batch splits for recurrent cache ( #14575 )
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512 .
This fixes it by moving the completeness check after the ubatch split loop.
b5846
2025-07-08 18:37:47 +03:00
Jeff Bolz
b8eeb8741d
vulkan : fix rope with partial rotation and non-cont src ( #14582 )
b5845
2025-07-08 15:21:21 +02:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix ( #14544 )
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
b5844
2025-07-08 11:47:33 +03:00
Xuan-Son Nguyen
8f22dc0a53
model : add hunyuan moe ( #14425 )
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style
---------
Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com >
b5843
2025-07-08 11:24:06 +03:00
Jeff Bolz
53903ae6fa
vulkan: increase timeout for CI ( #14574 )
2025-07-08 09:38:31 +02:00
Georgi Gerganov
4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src ( #14580 )
* cuda : fix rope non-cont
ggml-ci
* cont : fix multi-rope + add test
ggml-ci
* sycl : try fix
ggml-ci
* cont : fix sycl + clean-up cuda
ggml-ci
b5841
2025-07-08 10:15:21 +03:00
Aman Gupta
75c91de6e9
CUDA: add bilinear interpolation for upscale ( #14563 )
b5840
2025-07-08 10:11:18 +08:00
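Bilinear upscaling blends the four nearest source pixels with weights given by the fractional sample position. A minimal single-channel C++ reference of the interpolation the CUDA kernel adds, assuming an "align corners = false"-style coordinate mapping; the kernel's actual convention may differ.

```cpp
#include <algorithm>

// Bilinear upscale of one channel: src is sw x sh, dst is dw x dh, both row-major.
void upscale_bilinear(const float * src, int sw, int sh,
                      float * dst, int dw, int dh) {
    const float sx = (float) sw / dw;
    const float sy = (float) sh / dh;
    for (int y = 0; y < dh; ++y) {
        for (int x = 0; x < dw; ++x) {
            // map the destination pixel center back into source coordinates
            const float fx = std::max(0.0f, (x + 0.5f) * sx - 0.5f);
            const float fy = std::max(0.0f, (y + 0.5f) * sy - 0.5f);
            const int x0 = (int) fx, x1 = std::min(x0 + 1, sw - 1);
            const int y0 = (int) fy, y1 = std::min(y0 + 1, sh - 1);
            const float tx = fx - x0, ty = fy - y0;
            // blend the four neighbours: two horizontal lerps, then one vertical
            const float top = src[y0*sw + x0] * (1 - tx) + src[y0*sw + x1] * tx;
            const float bot = src[y1*sw + x0] * (1 - tx) + src[y1*sw + x1] * tx;
            dst[y*dw + x] = top * (1 - ty) + bot * ty;
        }
    }
}
```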
R0CKSTAR
68155c66f0
musa: fix build warnings (unused variable) ( #14561 )
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
b5839
2025-07-08 07:58:30 +08:00
Sigbjørn Skjæret
e1a7059053
llama : fix incorrect minicpm3 v_states shape ( #14571 )
b5838
2025-07-07 23:35:35 +02:00
Sigbjørn Skjæret
12f55c302b
llama : remove ggml_cont where possible ( #14568 )
b5837
2025-07-07 21:35:08 +02:00
Aman Gupta
b9c3eefde1
CUDA: add bf16 and i32 to getrows ( #14529 )
b5836
2025-07-07 21:45:43 +08:00
Eve
6491d6e4f1
vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) ( #14485 )
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com >
b5835
2025-07-06 12:29:36 +02:00
Jeff Bolz
e592be1575
vulkan: fix rms_norm+mul fusion ( #14545 )
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
b5834
2025-07-06 10:08:16 +02:00
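rms_norm divides each row by the root-mean-square of its elements, with an epsilon added under the square root for stability, and the fused rms_norm+mul multiplies the result by a weight row in the same pass; the bug above was only about where that epsilon was read from. A scalar reference of what the fused op computes:

```cpp
#include <cmath>
#include <cstddef>

// y[i] = (x[i] / sqrt(mean(x^2) + eps)) * w[i]  -- fused rms_norm + mul,
// scalar reference for one row of n elements.
void rms_norm_mul_row(const float * x, const float * w, float * y,
                      size_t n, float eps) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += (double) x[i] * x[i];
    const float scale = 1.0f / std::sqrt((float)(sum / n) + eps);
    for (size_t i = 0; i < n; ++i) y[i] = x[i] * scale * w[i];
}
```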
Jeff Bolz
a0374a67e2
vulkan: Handle updated FA dim2/3 definition ( #14518 )
* vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
b5833
2025-07-05 09:26:04 +02:00
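Vulkan guarantees only 128 bytes of push constants, so the commit packs the mask-present flag and n_head_log2 into a single 32-bit word rather than spending a dword on each. A minimal sketch of that style of packing; the bit layout here is an assumption, not the shader's.

```cpp
#include <cstdint>

// Pack a boolean and a small integer into one 32-bit push-constant word.
// Bit 31: has_mask flag; bits 0..30: n_head_log2 (layout chosen for illustration).
static uint32_t pack_mask_nhead(bool has_mask, uint32_t n_head_log2) {
    return (has_mask ? 0x80000000u : 0u) | (n_head_log2 & 0x7FFFFFFFu);
}

static bool     unpack_has_mask   (uint32_t v) { return (v & 0x80000000u) != 0; }
static uint32_t unpack_n_head_log2(uint32_t v) { return v & 0x7FFFFFFFu; }
```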
Sigbjørn Skjæret
ddef99522d
server : fix assistant prefilling when content is an array ( #14360 )
b5832
2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret
6681688146
opencl: add GELU_ERF ( #14476 )
b5831
2025-07-04 23:24:56 -07:00
Georgi Gerganov
bac8bed248
eval-callback : check for empty input ( #14539 )
b5830
2025-07-05 07:18:09 +03:00
R0CKSTAR
b81510a7b7
test-backend-ops: add support for specifying output format ( #14368 )
* test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* refactor
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* remove visitor nonsense
* remove visitor comment
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
* Address review comments
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
---------
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
Co-authored-by: slaren <slarengh@gmail.com >
b5829
2025-07-05 12:10:53 +08:00
Georgi Gerganov
ef797db357
metal : disable fast math in all quantize kernels ( #14528 )
ggml-ci
b5828
2025-07-04 19:19:09 +03:00
Georgi Gerganov
67d1ef23c6
batch : add optional for sequential equal split ( #14511 )
ggml-ci
b5827
2025-07-04 09:08:59 +03:00
Georgi Gerganov
7b50f7c025
graph : prepare for 4D mask ( #14515 )
ggml-ci
b5826
2025-07-04 09:05:36 +03:00
Georgi Gerganov
c79184d2d1
batch : add n_used count ( #14512 )
ggml-ci
b5825
2025-07-04 09:04:59 +03:00
luyhcsu
499a8f5a78
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator ( #14002 )
Co-authored-by: luyuhong <luyuhong@kylinos.cn >
b5824
2025-07-04 11:50:07 +08:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
b5823
2025-07-03 23:07:22 +02:00
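GEGLU splits the input row into two halves a and b and returns gelu(a) * b; the ERF variant uses the exact erf-based GELU and the QUICK variant the sigmoid approximation. A scalar C++ reference, with the a-first/b-second split convention assumed:

```cpp
#include <cmath>
#include <cstddef>

// Exact GELU via erf, and the "quick" sigmoid approximation.
static float gelu_erf  (float x) { return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); }
static float gelu_quick(float x) { return x / (1.0f + std::exp(-1.702f * x)); }

// src holds 2*n_half values per row: first half a, second half b (assumed layout).
void geglu_row(const float * src, float * dst, size_t n_half, bool quick) {
    const float * a = src;
    const float * b = src + n_half;
    for (size_t i = 0; i < n_half; ++i) {
        dst[i] = (quick ? gelu_quick(a[i]) : gelu_erf(a[i])) * b[i];
    }
}
```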
lhez
bee28421be
opencl : broadcast for soft_max ( #14510 )
b5822
2025-07-03 20:22:24 +02:00
Jeff Bolz
2b72bedec1
vulkan: support mixed/deepseekR1 FA head sizes ( #14509 )
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
b5821
2025-07-03 20:21:14 +02:00
Johannes Gäßler
c8c4495b8d
ggml: backward pass for split swiglu ( #14483 )
b5820
2025-07-03 17:05:18 +02:00
Nicolò Scipione
7b63a71a6b
Fix conditional enabling following arch checks for ggml-sycl ( #14504 )
Signed-off-by: nscipione <nicolo.scipione@codeplay.com >
b5819
2025-07-03 11:00:03 +02:00
Xuan-Son Nguyen
0c2ee38ab7
convert : correct gemma 3n conversion ( #14450 )
* convert : correct gemma 3n conversion
* rm redundant code
2025-07-03 10:03:06 +02:00