Sigbjørn Skjæret
de8ec1348b
Merge branch 'master' into cisc/test-tokenizers-remote
2025-05-31 21:25:34 +02:00
Olivier Chafik
e15898d1c7
server: allow unclosed thinking tags ( #13931 )
2025-05-31 08:26:10 -07:00
Georgi Gerganov
53f925074d
sync : vendor ( #13901 )
...
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen
10961339b2
mtmd : move helpers to dedicated library ( ⚠️ breaking change) ( #13866 )
...
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
Đinh Trọng Huy
e0e3aa231d
llama : add support for BertForSequenceClassification reranker ( #13858 )
...
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp >
2025-05-28 19:01:58 +02:00
Sigbjørn Skjæret
0fe7183ae4
fix prototype for non-curl builds
2025-05-28 11:11:02 +02:00
Sigbjørn Skjæret
2d2e059f4f
make common_download_file_single/multiple public
2025-05-28 09:50:41 +02:00
Olivier Chafik
cdf94a1802
server: --offline mode ( #13804 )
...
* server: --offline mode (env: LLAMA_OFFLINE)
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-26 22:34:27 +01:00
Olivier Chafik
03f582ae8f
server: fix streaming crashes ( #13786 )
...
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
Olivier Chafik
d74e94c1b3
server
: fix format of streamed tool call deltas (diff name, fix id location) (#13800 )
...
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik
f13847cfb5
server: fix regression on streamed non-chat completion w/ stops ( #13785 )
...
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
Olivier Chafik
e121edc432
server
: add --reasoning-budget 0
to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771 )
...
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-26 00:30:51 +01:00
Percy Piper
c508256db2
rpc : Fix build on OpenBSD ( #13541 )
2025-05-25 15:35:53 +03:00
Olivier Chafik
f5cd27b71d
server
: streaming of tool calls and thoughts when --jinja
is on (#12379 )
...
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com >
2025-05-25 01:48:08 +01:00
Xuan-Son Nguyen
797990c4bc
mtmd : add ultravox audio input ( #13623 )
...
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
Sigbjørn Skjæret
2aa777d86d
examples : switch retrieval to llama_encode ( #13685 )
...
* switch retrieval to llama_encode
* enable --no-warmup for retrieval
2025-05-21 16:57:38 +02:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
psocolovsky
1dfbf2cf3a
common : add load_progress_callback ( #13617 )
2025-05-19 21:17:36 +02:00
Isaac McFadyen
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
2025-05-17 23:59:48 +02:00
Georgi Gerganov
518329b2d4
parallel : add option for non-shared and larger prompts ( #13598 )
...
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci
2025-05-17 12:58:55 +03:00
Z
3e0be1cace
llguidance : official v0.7.20 release (no actual changes) [noci] ( #13594 )
2025-05-16 22:56:28 +02:00
Olivier Chafik
bc098c3cf0
minja: sync (qwen3) ( #13573 )
...
* minja: sync f06140fa52
- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58
---------
Co-authored-by: ochafik <ochafik@google.com >
2025-05-15 23:29:10 +01:00
Olivier Chafik
aa48e373f2
server
: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2025-05-15 02:39:51 +01:00
Olivier Chafik
3198405e98
common
: add partial regex support (#12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-05-14 19:50:57 +01:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang
7f323a589f
Add --no-op-offload
to improve -ot
pp perf in MoE models like llama4 400B ( #13386 )
2025-05-11 14:18:39 +02:00
Sigbjørn Skjæret
43dfd741a5
llguidance : set tokenizer slices to default ( #13424 )
2025-05-10 17:19:52 +02:00
Xuan-Son Nguyen
7fef11766c
arg : add env var to control mmproj ( #13416 )
...
* arg : add env var to control mmproj
* small note about -hf --mmproj
2025-05-10 08:16:29 +02:00
Helton Reis
7c28a74e07
chore(llguidance): use tagged version that does not break the build ( #13413 )
2025-05-09 23:15:39 +03:00
Xuan-Son Nguyen
33eff40240
server : vision support via libmtmd ( #12898 )
...
* server : (experimental) vision support via libmtmd
* mtmd : add more api around mtmd_image_tokens
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-define ID (fixed)
* abstract out the batch management
* small fix
* refactor logic adding tokens to batch
* implement hashing image
* use FNV hash, now hash bitmap instead of file data
* allow decoding image embedding to be split into batches
* rm whitespace
* disable some features when mtmd is on
* fix --no-mmproj-offload
* mtmd_context_params no timings
* refactor server_inp to server_tokens
* fix the failing test case
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* improve server_input struct
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
* fix detokenize
* add const to various places
* add warning about breaking changes
* add c api
* helper: use mtmd_image_tokens_get_n_pos
* fix ctx_shift
* fix name shadowing
* more strict condition
* support remote image_url
* remote image_url log
* add CI test
* do not log base64
* add "has_multimodal" to /props
* remove dangling image
* speculative: use slot.cache_tokens.insert
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* rm can_be_detokenized
* on prmpt processing done, assert cache_tokens.size
* handle_completions_impl returns void
* adapt the new web ui
* update docs and hot topics
* rm assert
* small fix (2)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-05-09 19:29:37 +02:00
Bartowski
efb8b47eda
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation ( #13389 )
...
* Add --parse-special for enabling parsing of special tokens in imatrix calculation
* whitespace
2025-05-09 11:53:58 +02:00
Diego Devesa
15e03282bb
ci : limit write permission to only the release step + fixes ( #13392 )
...
* ci : limit write permission to only the release step
* fix win cuda file name
* fix license file copy on multi-config generators
2025-05-08 23:45:22 +02:00
Xuan-Son Nguyen
8c83449cb7
server : (webui) revamp the input area, plus many small UI improvements ( #13365 )
...
* rework the input area
* process selected file
* change all icons to heroicons
* fix thought process collapse
* move conversation more menu to sidebar
* sun icon --> moon icon
* rm default system message
* stricter upload file check, only allow image if server has mtmd
* build it
* add renaming
* better autoscroll
* build
* add conversation group
* fix scroll
* extra context first, then user input in the end
* fix <hr> tag
* clean up a bit
* build
* add mb-3 for <pre>
* throttle adjustTextareaHeight to make it less laggy
* (nits) missing padding in sidebar
* rm stray console log
2025-05-08 15:37:29 +02:00
Georgi Gerganov
51fb96b1ff
context : remove logits_all flag ( #13284 )
...
* context : remove logits_all flag
ggml-ci
* llama : remove logits_all flag + reorder llama_context_params
ggml-ci
2025-05-08 14:26:50 +03:00
Ycros
39e73ae0d6
common : Add a warning when we can't match samplers from a string or char. ( #13330 )
2025-05-07 11:23:28 +03:00
Georgi Gerganov
4773d7a02f
examples : remove infill ( #13283 )
...
ggml-ci
2025-05-07 10:28:02 +03:00
oobabooga
233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) ( #13264 )
...
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering
* revert: sampler ordering
* revert: VS' crappy auto-formatting
* revert: VS' crappy auto-formatting pt.2
* revert: my crappy eye sight...
* sampling: add XTC to Top-nσ sampler chain
* sampling: add Dyna. Temp. to Top-nσ sampler chain
* sampling: actually remove Top-nσ from sampler(oops)
* Integrate top_n_sigma into main sampler chain
* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA
* Formatting
* Lint
* Exit early in the sampler if nsigma < 0
---------
Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com >
2025-05-05 22:12:19 +02:00
Xuan-Son Nguyen
9b61acf060
mtmd : rename llava directory to mtmd ( #13311 )
...
* mv llava to mtmd
* change ref everywhere
2025-05-05 16:02:55 +02:00
Diego Devesa
1d36b3670b
llama : move end-user examples to tools directory ( #13249 )
...
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
2025-05-02 20:27:13 +02:00
Georgi Gerganov
fab647e884
server : add cache reuse card link to help ( #13230 )
...
* server : add cache reuse card link to help
* args : use short url
2025-05-02 09:48:31 +03:00
Diego Devesa
d7a14c42a1
build : fix build info on windows ( #13239 )
...
* build : fix build info on windows
* fix cuda host compiler msg
2025-05-01 21:48:08 +02:00
Xuan-Son Nguyen
13c9a3319b
arg : remove CURLINFO_EFFECTIVE_METHOD ( #13228 )
2025-05-01 10:23:25 +02:00
Xuan-Son Nguyen
6f67cf1f48
arg : -hf do not fail if url mismatch ( #13219 )
...
* arg : -hf do not fail if url mismatch
* do not return if cannot parse metadata json
2025-04-30 21:29:15 +01:00
Olivier Chafik
3b127c7385
common : add -jf / --json-schema-file flag ( #12011 )
2025-04-30 14:52:35 +02:00
Xuan-Son Nguyen
5933e6fdc9
arg : allow using -hf offline ( #13202 )
...
* arg : allow using -hf offline
* add more comments in code [no ci]
2025-04-30 10:46:32 +02:00
Georgi Gerganov
43f2b07193
common : fix noreturn compile warning ( #13151 )
...
ggml-ci
2025-04-28 11:57:19 +03:00
Xuan-Son Nguyen
85f36e5e71
arg : fix unused variable ( #13142 )
2025-04-28 08:16:59 +03:00
Xuan-Son Nguyen
2d451c8059
common : add common_remote_get_content ( #13123 )
...
* common : add common_remote_get_content
* support max size and timeout
* add tests
2025-04-26 22:58:12 +02:00
frob
d5fe4e81bd
grammar : handle maxItems == 0 in JSON schema ( #13117 )
...
Co-authored-by: Richard Lyons <frob@cloudstaff.com >
2025-04-26 10:10:20 +02:00