07e4351ce6
convert : allow partial update to the chkhsh pre-tokenizer list ( #13847 )
...
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
2025-05-30 12:24:37 +02:00
66c92061f5
tests : remove json.hpp from a test ( #13880 )
...
ggml-ci
2025-05-29 12:17:16 +03:00
f9cd68398b
sampling : make sure samplers return at least 1 token ( #13822 )
...
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
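A minimal C++ sketch of the invariant this sampling commit enforces: even with min_keep == 0, filtering never drops below one candidate, so the most probable token always survives. The names and structure are illustrative, not the actual llama.cpp sampler code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct candidate { int id; float p; };

// Illustrative min-p style filter: drop tokens whose probability is below
// min_p * p_max, but never keep fewer than max(min_keep, 1) candidates.
static void min_p_filter(std::vector<candidate> & cands, float min_p, size_t min_keep) {
    if (cands.empty()) return;
    std::sort(cands.begin(), cands.end(),
              [](const candidate & a, const candidate & b) { return a.p > b.p; });
    const float  threshold = min_p * cands.front().p;
    const size_t keep_min  = std::max<size_t>(min_keep, 1);   // the key change: at least 1
    size_t keep = 0;
    while (keep < cands.size() && (cands[keep].p >= threshold || keep < keep_min)) {
        ++keep;
    }
    cands.resize(keep);   // keep >= 1, so at least one token is always returned
}
```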
2025-05-27 12:07:52 +03:00
03f582ae8f
server: fix streaming crashes ( #13786 )
...
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
d74e94c1b3
server : fix format of streamed tool call deltas (diff name, fix id location) ( #13800 )
...
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
e121edc432
server : add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) ( #13771 )
...
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-26 00:30:51 +01:00
aa50ba462f
tests : improve UGM tokenizer test coverage ( #13773 )
2025-05-25 16:22:29 +02:00
f5cd27b71d
server : streaming of tool calls and thoughts when --jinja is on ( #12379 )
...
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
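The first bullet of this entry mentions truncated JSON healing for partial tool-call arguments. A minimal C++ sketch of that general idea, assuming nothing about the real common_json API: track open braces, brackets, and strings while scanning the truncated text, then append the closers needed to make it parseable again.

```cpp
#include <string>
#include <vector>

// Sketch only: "heal" a truncated JSON string by appending whatever closers
// are still missing, so a partial tool-call payload can be parsed early.
static std::string heal_truncated_json(const std::string & s) {
    std::vector<char> closers;
    bool in_string = false, escaped = false;
    for (char c : s) {
        if (escaped)                   { escaped = false; continue; }
        if (in_string) {
            if      (c == '\\')        escaped   = true;
            else if (c == '"')         in_string = false;
            continue;
        }
        if      (c == '"')             in_string = true;
        else if (c == '{')             closers.push_back('}');
        else if (c == '[')             closers.push_back(']');
        else if (c == '}' || c == ']') { if (!closers.empty()) closers.pop_back(); }
    }
    std::string out = s;
    if (escaped)   out += '\\';        // finish a dangling escape sequence
    if (in_string) out += '"';         // close a dangling string
    for (auto it = closers.rbegin(); it != closers.rend(); ++it) out += *it;
    return out;
}

// heal_truncated_json(R"({"name": "get_weather", "arguments": {"city": "Par)")
//   -> {"name": "get_weather", "arguments": {"city": "Par"}}
```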
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com >
2025-05-25 01:48:08 +01:00
759e37b0d8
tests : avoid github urls due to throttling ( #13654 )
2025-05-20 12:03:17 +02:00
aa48e373f2
server : inject date_string in llama 3.x template + fix date for firefunction v2 ( #12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2025-05-15 02:39:51 +01:00
3198405e98
common : add partial regex support ( #12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-05-14 19:50:57 +01:00
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B ( #13386 )
2025-05-11 14:18:39 +02:00
ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate ( #13345 )
2025-05-06 22:36:24 +02:00
27aa259532
mtmd : add C public API ( #13184 )
...
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* add const to various places
* add warning about breaking changes
* helper: use mtmd_image_tokens_get_n_pos
2025-05-04 23:43:42 +02:00
9f2da5871f
llama : build windows releases with dl backends ( #13220 )
2025-05-04 14:20:49 +02:00
1d36b3670b
llama : move end-user examples to tools directory ( #13249 )
...
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
2025-05-02 20:27:13 +02:00
b34443923c
sync : ggml ( #13268 )
...
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
* review: remove src_x/y < 0 checks; add performance tests
* sync : ggml
ggml-ci
* vulkan : fix lint (#0 )
---------
Co-authored-by: Acly <aclysia@gmail.com >
2025-05-02 20:54:30 +03:00
2af6880178
llama-chat : reset glmedge chat template ( #13253 )
...
* reset glmedge chat template
* fix glmedge chat template
2025-05-02 11:06:09 +02:00
e0f572c846
llama-chat : update GLM4 chat template ( #13238 )
...
* update GLM4 chat template
* Update chat template
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-01 21:16:38 +02:00
b0ecbd434b
test: non-cont. b in test-backend-ops -o MUL_MAT ( #13187 )
2025-05-01 20:18:56 +02:00
e1e8e0991f
CUDA: batched+noncont MMQ, refactor bs>1 MoE code ( #13199 )
2025-04-30 23:12:59 +02:00
da84c04d8f
docker : do not build tests ( #13204 )
...
* docker : do not build tests
* include "ggml-cpu.h"
2025-04-30 10:44:07 +02:00
4e87962e34
mtmd : fix glm-edge redundant token count ( #13139 )
...
* mtmd : fix glm-edge redundant token count
* fix chat template
* temporary disable GLMEdge test chat tmpl
2025-04-28 16:12:56 +02:00
2d451c8059
common : add common_remote_get_content ( #13123 )
...
* common : add common_remote_get_content
* support max size and timeout
* add tests
2025-04-26 22:58:12 +02:00
d5fe4e81bd
grammar : handle maxItems == 0 in JSON schema ( #13117 )
...
Co-authored-by: Richard Lyons <frob@cloudstaff.com >
2025-04-26 10:10:20 +02:00
edb18b6e8f
clip : fix pixtral on some GPU backends ( #13097 )
...
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO
2025-04-25 14:31:42 +02:00
13b4548877
cmake : do not include ./src as public for libllama ( #13062 )
...
* cmake : do not include ./src as public for libllama
ggml-ci
* cmake : rework tests
ggml-ci
* llguidance : remove unicode include
ggml-ci
* cmake : make c++17 private
ggml-ci
2025-04-24 16:00:10 +03:00
658987cfc9
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID ( #13014 )
...
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs
2025-04-22 21:27:40 +02:00
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels ( #12953 )
...
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
2025-04-17 18:16:36 +03:00
015022bb53
vulkan: enable coopmat2 FA gqa and split_k optimizations more often ( #12931 )
...
The grouped query attention optimization doesn't require a power-of-two ratio;
the only thing relying on it was the modulo operation written as a bitwise &.
split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicate a larger
FA operation that wouldn't need splitting.
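A minimal C++ sketch of the arithmetic point behind this change (illustrative, not the shader code): `x & (ratio - 1)` equals `x % ratio` only when `ratio` is a power of two, so writing the operation as a true modulo removes the power-of-two restriction on the GQA ratio.

```cpp
#include <cassert>
#include <cstdint>

// Position of a query head within its GQA group (illustrative use of the modulo).
static uint32_t group_index_mod(uint32_t q_head, uint32_t gqa_ratio) {
    return q_head % gqa_ratio;            // valid for any ratio
}

static uint32_t group_index_and(uint32_t q_head, uint32_t gqa_ratio) {
    return q_head & (gqa_ratio - 1);      // only valid when gqa_ratio is a power of two
}

int main() {
    assert(group_index_mod(7, 4) == group_index_and(7, 4));   // 3 == 3: both agree for ratio 4
    assert(group_index_mod(7, 6) == 1);                       // modulo is still correct for ratio 6
    assert(group_index_and(7, 6) == 5);                       // AND trick gives the wrong answer
    return 0;
}
```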
2025-04-16 20:37:25 +02:00
b6930ebc42
tool-call : fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates ( #12900 )
...
* `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)
* test all chat formats w/o tools
2025-04-11 21:47:52 +02:00
1d2b613445
tests : fix init order ( #0 )
...
ggml-ci
2025-04-11 00:17:47 +03:00
fe92821ea9
ggml : add bilinear upscale support (ggml/1185)
2025-04-11 00:17:47 +03:00
381603a775
ci: detach common from the library ( #12827 )
...
* fix: detach common from the library
* fix: building chat test template
2025-04-09 10:11:11 +02:00
bd3f59f812
cmake : enable curl by default ( #12761 )
...
* cmake : enable curl by default
* no curl if no examples
* fix build
* fix build-linux-cross
* add windows-setup-curl
* fix
* shell
* fix path
* fix windows-latest-cmake*
* run: include_directories
* LLAMA_RUN_EXTRA_LIBS
* sycl: no llama_curl
* no test-arg-parser on windows
* clarification
* try riscv64 / arm64
* windows: include libcurl inside release binary
* add msg
* fix mac / ios / android build
* will this fix xcode?
* try clearing the cache
* add bunch of licenses
* revert clear cache
* fix xcode
* fix xcode (2)
* fix typo
2025-04-07 13:35:19 +02:00
5f696e88e0
sync : minja (inclusionAI/Ling) and update tests ( #12699 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
2025-04-03 13:51:35 +02:00
f01bd02376
vulkan: Implement split_k for coopmat2 flash attention. ( #12627 )
...
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
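A rough C++ sketch of the scheduling arithmetic described above, with made-up numbers and a made-up heuristic: when GQA leaves only a handful of workgroups, splitting the KV dimension into split_k pieces multiplies the workgroup count so the whole GPU is occupied, at the cost of a small reduction pass over the partial results.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t num_sms    = 80;   // streaming multiprocessors on the GPU (example value)
    const uint32_t workgroups = 8;    // one per KV batch under GQA - far too few to fill the GPU

    // Illustrative heuristic: grow split_k until there is at least one workgroup per SM.
    uint32_t split_k = 1;
    while (workgroups * split_k < num_sms) {
        split_k *= 2;
    }

    printf("workgroups %u -> %u with split_k = %u (plus a reduction over %u partial results)\n",
           workgroups, workgroups * split_k, split_k, split_k);
    return 0;
}
```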
2025-04-02 14:25:08 -05:00
267c1399f1
common : refactor downloading system, handle mmproj with -hf option ( #12694 )
...
* (wip) refactor downloading system [no ci]
* fix all examples
* fix mmproj with -hf
* gemma3: update readme
* only handle mmproj in llava example
* fix multi-shard download
* windows: fix problem with std::min and std::max
* fix 2
2025-04-01 23:44:05 +02:00
7242dd9675
llama-chat : Add Yandex instruct model template support ( #12621 )
...
* add yandex template
* update yandex chat template
* fix tests
* adjust chat template
* fix style
* fix tool macro in template
* add clarify comment
---------
Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru >
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-03-30 20:12:03 +02:00
b4ae50810e
metal : improve FA + improve MoE ( #12612 )
...
* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
2025-03-28 20:21:59 +02:00
2447ad8a98
upgrade to llguidance 0.7.10 ( #12576 )
2025-03-26 11:06:09 -07:00
9b169a4d4e
vulkan: fix mul_mat_vec failure in backend tests ( #12529 )
...
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
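A hedged C++ illustration of the general hazard this commit fixes (the real bug is in a Vulkan shader, not this code): when a loop is unrolled by a fixed factor, the cutoff for the unrolled part has to guarantee that every load in the final unrolled group is still in bounds, with a scalar tail handling the remainder.

```cpp
#include <cstddef>

// Sum n floats with the main loop unrolled by 4. The unrolled part must stop
// at the last index where a full group of 4 loads is in bounds; getting that
// cutoff wrong is exactly the kind of out-of-bounds access described above.
static float sum_unrolled(const float * x, size_t n) {
    float  acc = 0.0f;
    size_t i   = 0;
    const size_t n4 = n - (n % 4);          // largest multiple of 4 not exceeding n
    for (; i < n4; i += 4) {                // all four loads are in bounds here
        acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    }
    for (; i < n; ++i) {                    // scalar tail for the remaining 0..3 elements
        acc += x[i];
    }
    return acc;
}
```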
2025-03-24 07:56:17 +01:00
ba932dfb50
ggml : fix quantized cpy op ( #12310 )
...
* ggml : fix quantized cpy op
ggml-ci
* tests : add cpy tests for all types
ggml-ci
* tests : add BF16 copy tests
ggml-ci
* tests : fix loop for same-type copy
ggml-ci
* tests : add option to permute the dst tensor
ggml-ci
2025-03-22 16:23:26 +02:00
eddfb43850
vulkan: Optimize mul_mat_vec p021 and nc shaders ( #12505 )
...
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, use that conditionally.
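A small C++ sketch of the load-reuse idea from the p021 description above (names and layout are hypothetical, and the real change lives in a Vulkan shader): all query heads in a GQA group attend to the same KV head, so each value loaded from the shared matrix can be applied to every head in the group instead of being reloaded per head.

```cpp
#include <cstddef>
#include <vector>

// Dot products of one shared KV row against every query head in a GQA group.
static void mul_vec_gqa_group(const std::vector<float> & kv_row,               // shared across the group
                              const std::vector<std::vector<float>> & q_heads, // one vector per head
                              std::vector<float> & out) {                      // one dot product per head
    out.assign(q_heads.size(), 0.0f);
    for (size_t k = 0; k < kv_row.size(); ++k) {
        const float a = kv_row[k];                  // loaded once ...
        for (size_t h = 0; h < q_heads.size(); ++h) {
            out[h] += a * q_heads[h][k];            // ... reused for every head in the group
        }
    }
}
```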
2025-03-22 09:40:11 +01:00
517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case ( #12183 )
...
- Find out active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Fixes Issue: #12182
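A hedged host-side sketch (C++ with the CUDA runtime) of how the occupancy query mentioned above can drive parallel_blocks; the kernel stub, block size, and heuristic are illustrative, not the actual llama.cpp flash-decoding code.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void flash_decode_stub(float * out) { if (out) out[0] = 0.0f; }  // placeholder kernel

int main() {
    int device = 0, num_sms = 0, blocks_per_sm = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

    const int block_size = 128;   // threads per block (example value)
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, flash_decode_stub,
                                                  block_size, /*dynamic smem*/ 0);

    // Illustrative heuristic: run as many blocks per sequence as the device can
    // keep resident, so batch-size-1 decoding still saturates every SM.
    const int parallel_blocks = num_sms * blocks_per_sm;
    printf("SMs = %d, active blocks/SM = %d -> parallel_blocks = %d\n",
           num_sms, blocks_per_sm, parallel_blocks);
    return 0;
}
```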
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de >
2025-03-19 20:52:06 +01:00
7dfad387e3
llama: Add support for RWKV v7 architecture ( #12412 )
...
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
2025-03-18 07:27:50 +08:00
bf69cfe62f
vulkan: fix bug in coopmat1 mul_mat_id ( #12316 )
...
* tests: run mul_mat_id with a larger N
* vulkan: fix bug in coopmat1 mul_mat_id
2025-03-12 06:59:19 +01:00
e128a1bf5b
tests : fix test-quantize-fns to init the CPU backend ( #12306 )
...
ggml-ci
2025-03-10 14:07:15 +02:00
4e39a3c332
server : extract <think> tags from qwq outputs ( #12297 )
...
* extract <think> tags from qwq outputs
* const for all static regexes in chat.cpp
2025-03-10 10:59:03 +00:00