88fc854b4b
llama : improve sep token handling ( #14272 )
2025-06-20 14:04:09 +02:00
4c9fdfbe15
ubatch : new splitting logic ( #14217 )
...
ggml-ci
2025-06-20 10:14:14 +03:00
d67341dc18
server : add server parameters for draft model cache type ( #13782 )
...
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com >
2025-06-19 16:01:03 +03:00
89fea80d29
server : fix incorrect usage of llama_get_embeddings() ( #14225 )
...
* server : fix incorrect usage of llama_get_embeddings()
ggml-ci
* cont : fix the fix
ggml-ci
2025-06-16 22:33:27 +03:00
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
cd355eda7d
server : When listening on a unix domain socket don't print http:// and port ( #14180 )
...
Instead show something like this:
main: server is listening on file.sock - starting the main loop
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-06-15 23:36:22 +02:00
ffad043973
server : fix SWA condition for full context reprocess ( #14163 )
...
ggml-ci
2025-06-13 11:18:25 +03:00
7d516443dd
server : re-enable SWA speculative decoding ( #14131 )
...
ggml-ci
2025-06-12 11:51:38 +03:00
7781e5fe99
webui: Wrap long numbers instead of infinite horizontal scroll ( #14062 )
...
* webui: Wrap long numbers instead of infinite horizontal scroll
* Use tailwind class
* update index.html.gz
2025-06-11 16:42:25 +02:00
2baf07727f
server : pass default --keep argument ( #14120 )
2025-06-11 13:43:43 +03:00
3a12db23b6
Fixed spec timings to: accepted/tested instead of accepted/drafted ( #14104 )
2025-06-10 16:48:07 +01:00
dc0623fddb
webui: fix sidebar being covered by main content ( #14082 )
...
* webui: fix sidebar being covered by main content
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* webui: update index.html.gz
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
2025-06-09 12:01:17 +02:00
87d34b381d
server : fix LRU check ( #14079 )
...
ggml-ci
2025-06-09 12:57:58 +03:00
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
3637576288
server : disable speculative decoding for SWA models ( #13970 )
...
* server : use swa-full fo draft context
ggml-ci
* server : disable speculative decoding for SWA models
2025-06-02 21:34:40 +03:00
c9bbc77931
server
: update deepseek reasoning format (pass reasoning_content as diffs) (#13933 )
...
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
c7e0a2054b
webui : Replace alert and confirm with custom modals. ( #13711 )
...
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.
* use Modal Provider to simplify the use of confirm and alert modals.
* Increase the z index of the modal dialogs.
* Update index.html.gz
* also add showPrompt
* rebuild
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch >
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
2025-05-31 11:56:08 +02:00
3f55f781f1
llama : auto-batch preparation ( #13845 )
...
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00
51fa76f172
mtmd : drop _shared
from libmtmd
name, merge helpers into libmtmd ( ⚠️ breaking change) ( #13917 )
...
* mtmd : fix missing public header
* no object
* apply suggestion from Georgi
* rm mtmd-helper, merge it to mtmd
* missing vendor include dir
2025-05-31 10:14:29 +02:00
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
53f925074d
sync : vendor ( #13901 )
...
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
2025-05-30 16:25:45 +03:00
10961339b2
mtmd : move helpers to dedicated library ( ⚠️ breaking change) ( #13866 )
...
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
e0e3aa231d
llama : add support for BertForSequenceClassification reranker ( #13858 )
...
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp >
2025-05-28 19:01:58 +02:00
c962ae3382
server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode ( #13853 )
...
[fix]: remove 'image_url'/'input_audio' effectlly for 'llama_params' in multimodal-model-mode
2025-05-28 16:33:54 +02:00
03f582ae8f
server: fix streaming crashes ( #13786 )
...
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
d74e94c1b3
server
: fix format of streamed tool call deltas (diff name, fix id location) (#13800 )
...
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
f13847cfb5
server: fix regression on streamed non-chat completion w/ stops ( #13785 )
...
* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
...
ggml-ci
2025-05-26 14:03:54 +03:00
e121edc432
server
: add --reasoning-budget 0
to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771 )
...
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-26 00:30:51 +01:00
2f099b510f
webui : bump max upload file size to 500MB ( #13779 )
2025-05-25 18:02:18 +01:00
d785f9c1fd
server: fix/test add_generation_prompt ( #13770 )
...
Co-authored-by: ochafik <ochafik@google.com >
2025-05-25 10:45:49 +01:00
f5cd27b71d
server
: streaming of tool calls and thoughts when --jinja
is on (#12379 )
...
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com >
2025-05-25 01:48:08 +01:00
9ecf3e66a3
server : support audio input ( #13714 )
...
* server : support audio input
* add audio support on webui
2025-05-23 11:03:47 +02:00
797990c4bc
mtmd : add ultravox audio input ( #13623 )
...
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
cc74d5be99
server : pad small embedding batches ( #13692 )
...
ggml-ci
2025-05-22 16:33:39 +03:00
5fbfe384d4
server : improve error reporting ( #13680 )
2025-05-21 19:46:56 +03:00
0d5c742161
server : Add the endpoints /api/tags and /api/chat ( #13659 )
...
* Add the endpoints /api/tags and /api/chat
Add the endpoints /api/tags and /api/chat, and improved the model metadata response
* Remove trailing whitespaces
* Removed code that is not needed for copilot to work.
2025-05-21 15:15:27 +02:00
42158ae2e8
server : fix first message identification ( #13634 )
...
* server : fix first message identification
When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626 ) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com >
Signed-off-by: Dorin Geman <dorin.geman@docker.com >
* server : Fix checks for first role message for stream=True
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com >
Signed-off-by: Dorin Geman <dorin.geman@docker.com >
---------
Signed-off-by: Dorin Geman <dorin.geman@docker.com >
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com >
2025-05-21 15:07:57 +02:00
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
2025-05-17 23:59:48 +02:00
6aa892ec2a
server : do not return error out of context (with ctx shift disabled) ( #13577 )
2025-05-16 21:50:00 +02:00
aea9f8b4e7
webui : improve accessibility for visually impaired people ( #13551 )
...
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
2025-05-16 21:49:01 +02:00
3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )
...
* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages
2025-05-15 14:24:50 +02:00
c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )
2025-05-15 08:40:58 +02:00
aa48e373f2
server
: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2025-05-15 02:39:51 +01:00
3198405e98
common
: add partial regex support (#12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-05-14 19:50:57 +01:00
053174436f
server : passthrough the /models endpoint during loading ( #13535 )
...
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
360a9c98e1
server : fix cache_tokens bug with no cache_prompt ( #13533 )
2025-05-14 13:35:07 +02:00