Dorin-Andrei Geman
42158ae2e8
server : fix first message identification ( #13634 )
...
* server : fix first message identification
When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626 ) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com >
Signed-off-by: Dorin Geman <dorin.geman@docker.com >
* server : Fix checks for first role message for stream=True
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com >
Signed-off-by: Dorin Geman <dorin.geman@docker.com >
---------
Signed-off-by: Dorin Geman <dorin.geman@docker.com >
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com >
2025-05-21 15:07:57 +02:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Isaac McFadyen
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
2025-05-17 23:59:48 +02:00
Xuan-Son Nguyen
6aa892ec2a
server : do not return error out of context (with ctx shift disabled) ( #13577 )
2025-05-16 21:50:00 +02:00
Xuan-Son Nguyen
aea9f8b4e7
webui : improve accessibility for visually impaired people ( #13551 )
...
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
2025-05-16 21:49:01 +02:00
Xuan-Son Nguyen
3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )
...
* webui : handle PDF input (as text or image)
* handle the case where pdf image + server without mtmd
* fix bug missing pages
2025-05-15 14:24:50 +02:00
Piotr Wilkin (ilintar)
c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )
2025-05-15 08:40:58 +02:00
Olivier Chafik
aa48e373f2
server
: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2025-05-15 02:39:51 +01:00
Olivier Chafik
3198405e98
common
: add partial regex support (#12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-05-14 19:50:57 +01:00
Georgi Gerganov
053174436f
server : passthrough the /models endpoint during loading ( #13535 )
...
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
Xuan-Son Nguyen
360a9c98e1
server : fix cache_tokens bug with no cache_prompt ( #13533 )
2025-05-14 13:35:07 +02:00
Xuan-Son Nguyen
bb1681fbd5
webui : use fflate for more deterministic gzip compress ( #13525 )
...
* webui : use pako for more deterministic gzip compress
* simpler code
* use fflate instead of pako
2025-05-14 10:26:12 +02:00
Luca Stefani
d486dd3e8e
webui: Allow pasting file from clipboard ( #13526 )
...
* server: Allow pasting file from clipboard
* server: Prevent default action on file paste
* update build
* format then build combined
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
2025-05-14 10:07:31 +02:00
Anudit Nagar
91159ee9df
server : allow content to be null in oaicompat_completion_params_parse ( #13477 )
2025-05-12 13:56:42 +02:00
Anthony Umfer
9a390c4829
tools : fix uninitialized llama_batch in server ( #13436 )
...
* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()
* Update tools/server/server.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
* use C++11 initializer syntax
* switch from Copy-list-initialization to Direct-list-initialization
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com >
2025-05-11 17:08:26 +02:00
Xuan-Son Nguyen
3b24d26c22
server : update docs ( #13432 )
2025-05-10 18:44:49 +02:00
Xuan-Son Nguyen
33eff40240
server : vision support via libmtmd ( #12898 )
...
* server : (experimental) vision support via libmtmd
* mtmd : add more api around mtmd_image_tokens
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-define ID (fixed)
* abstract out the batch management
* small fix
* refactor logic adding tokens to batch
* implement hashing image
* use FNV hash, now hash bitmap instead of file data
* allow decoding image embedding to be split into batches
* rm whitespace
* disable some features when mtmd is on
* fix --no-mmproj-offload
* mtmd_context_params no timings
* refactor server_inp to server_tokens
* fix the failing test case
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* improve server_input struct
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
* fix detokenize
* add const to various places
* add warning about breaking changes
* add c api
* helper: use mtmd_image_tokens_get_n_pos
* fix ctx_shift
* fix name shadowing
* more strict condition
* support remote image_url
* remote image_url log
* add CI test
* do not log base64
* add "has_multimodal" to /props
* remove dangling image
* speculative: use slot.cache_tokens.insert
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* rm can_be_detokenized
* on prmpt processing done, assert cache_tokens.size
* handle_completions_impl returns void
* adapt the new web ui
* update docs and hot topics
* rm assert
* small fix (2)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-05-09 19:29:37 +02:00
Xuan-Son Nguyen
d9c4accaff
server : (webui) rename has_multimodal --> modalities ( #13393 )
...
* server : (webui) rename has_multimodal --> modalities
* allow converting SVG to PNG
* less complicated code
2025-05-09 09:06:37 +02:00
Xuan-Son Nguyen
ee01d71e58
server : (webui) fix a very small misalignment ( #13387 )
...
* server : (webui) fix a very small misalignment
* restore font-bold
2025-05-08 18:51:45 +02:00
Xuan-Son Nguyen
8c83449cb7
server : (webui) revamp the input area, plus many small UI improvements ( #13365 )
...
* rework the input area
* process selected file
* change all icons to heroicons
* fix thought process collapse
* move conversation more menu to sidebar
* sun icon --> moon icon
* rm default system message
* stricter upload file check, only allow image if server has mtmd
* build it
* add renaming
* better autoscroll
* build
* add conversation group
* fix scroll
* extra context first, then user input in the end
* fix <hr> tag
* clean up a bit
* build
* add mb-3 for <pre>
* throttle adjustTextareaHeight to make it less laggy
* (nits) missing padding in sidebar
* rm stray console log
2025-05-08 15:37:29 +02:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings ( #13108 )
...
* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
oobabooga
233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) ( #13264 )
...
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering
* revert: sampler ordering
* revert: VS' crappy auto-formatting
* revert: VS' crappy auto-formatting pt.2
* revert: my crappy eye sight...
* sampling: add XTC to Top-nσ sampler chain
* sampling: add Dyna. Temp. to Top-nσ sampler chain
* sampling: actually remove Top-nσ from sampler(oops)
* Integrate top_n_sigma into main sampler chain
* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA
* Formatting
* Lint
* Exit early in the sampler if nsigma < 0
---------
Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com >
2025-05-05 22:12:19 +02:00
igardev
b34c859146
server : Webui - change setText command from parent window to also send the message. ( #13309 )
...
* setText command from parent window for llama-vscode now sends the message automatically.
* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.
* Fix code formatting.
* Add index.html.gz changes.
* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."
This reverts commit 67687b7fda
.
* easier approach
* add setTimeout
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch >
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
2025-05-05 16:03:31 +02:00
Diego Devesa
1d36b3670b
llama : move end-user examples to tools directory ( #13249 )
...
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co >
2025-05-02 20:27:13 +02:00