f66f582927
llama : refactor src/llama.cpp ( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00
0da5d86026
server : allow using LoRA adapters per-request ( #10994 )
...
* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* lora_base
* remove redundant check
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-02 15:05:18 +01:00
45095a61bf
server : clean up built-in template detection ( #11026 )
...
* server : clean up built-in template detection
* fix compilation
* add chat template test
* fix condition
2024-12-31 15:22:01 +01:00
5896c65232
server : add OAI compat for /v1/completions ( #10974 )
...
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs
2024-12-31 12:34:13 +01:00
9ba399dfa7
server : add support for "encoding_format": "base64" to the */embeddings endpoints ( #10967 )
...
* add support for base64
* fix base64 test
* improve test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 21:33:04 +01:00
09fe2e7613
server: allow filtering llama server response fields ( #10940 )
...
* llama_server_response_fields
* llama_server_response_fields_fix_issues
* params fixes
* fix
* clarify docs
* change to "response_fields"
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 17:39:49 +01:00
485dc01214
server : add system_fingerprint to chat/completion ( #10917 )
...
* server : add system_fingerprint to chat/completion
* update README
2024-12-23 12:02:44 +01:00
57bb2c40cd
server : fix logprobs, make it OAI-compatible ( #10783 )
...
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before applying softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
46828872c3
server : (embeddings) using same format for "input" and "content" ( #10872 )
...
* server : (embeddings) using same format for "input" and "content"
* fix test case
* handle empty input case
* fix test
2024-12-18 10:55:09 +02:00
05c3a444b8
server : fill usage info in embeddings and rerank responses ( #10852 )
...
* server : fill usage info in embeddings response
* server : fill usage info in reranking response
2024-12-17 18:00:24 +02:00
89d604f2c8
server: Fix `has_next_line` in JSON response ( #10818 )
...
* Update server JSON response.
* Add unit test to check `has_new_line` JSON response
* Remove `has_new_line` unit test changes.
* Address code review comment: type check for `has_new_line` in unit test
2024-12-14 23:29:45 +01:00
484d2f31ae
bug-fix: snprintf prints NULL in place of the last character ( #10419 )
...
* bug-fix: snprintf prints NULL in place of the last character
We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string.
* add comment about extra null-term byte requirement
2024-12-11 14:48:04 +01:00
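The buffer sizing described in the snprintf fix above ( #10419 ) can be illustrated with a minimal sketch. The helper name `int_to_string` and the `"%d"` format are assumptions made for this example; this is not the code the commit changed. It only shows the pattern the message describes: allocate one byte more than the reported length, then drop that byte when building the `std::string`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not the function changed in the commit.
std::string int_to_string(int value) {
    // With a null buffer and size 0, snprintf only reports the number of
    // characters it would write, excluding the terminating null character.
    const int len = std::snprintf(nullptr, 0, "%d", value);
    assert(len >= 0);

    // One extra byte so both the last character and the terminator fit.
    std::vector<char> buf(static_cast<std::size_t>(len) + 1);
    std::snprintf(buf.data(), buf.size(), "%d", value);

    // Ignore the extra terminator byte when building the std::string.
    return std::string(buf.data(), static_cast<std::size_t>(len));
}
```

If the buffer were sized to exactly `len` bytes, snprintf would truncate the output to leave room for the terminator, which matches the symptom in the commit title: a null character where the last character should be.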
3573fa8e7b
server : (refactor) no more json in server_task input ( #10691 )
...
* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme
2024-12-07 20:21:09 +01:00
ce4a7b8493
server : various fixes ( #10704 )
...
* server : various fixes
ggml-ci
* server : show current seed in slot_params
ggml-ci
* fix /slots endpoint
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : reflect endpoint response changes in the readme
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-07 18:02:05 +02:00
6c5bc0625f
server : (refactoring) do not rely on JSON internally ( #10643 )
...
* server : (refactoring) reduce usage of json internally
* move all response types to struct
* wip [no ci]
* many fixes
* add virtual function
* fix index
* minor style fix
* add std::move
* refactor handle_completions_generic
* add virtual functions
* remove server.hpp
* clarify server_sent_event RFC specs
* apply review comments
* fix model_alias and completion_probabilities
* small clean up
* remove virtual for to_json_oai_compat()
* naming oai_compat --> oaicompat
* fix unwanted recursive call
* update docs
2024-12-06 11:14:32 +01:00
64ed2091b2
server: Add "tokens per second" information in the backend ( #10548 )
...
* add cmake rvv support
* add timings
* remove space
* update readme
* fix
* fix code
* remove empty line
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-02 14:45:54 +01:00
d9d54e498d
speculative : refactor and add a simpler example ( #10362 )
...
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00
42cadc74bd
server : fix slot selection by lru ( #10126 )
...
* server : fix slot selection by lru, migrate lcs to `size_t`
* minor debug log fix
2024-11-02 18:34:56 +02:00
d865d1478c
server : fix smart selection of available slot ( #10120 )
...
* Fix smart selection of available slot
* minor fix
* replace vectors of tokens with shorthands
2024-11-01 14:33:14 +01:00
8d8ff71536
llama : remove Tail-Free sampling ( #10071 )
...
ggml-ci
2024-10-29 10:42:05 +02:00
8125e6cbfc
server : don't overfill the batch during infill ( #10018 )
...
ggml-ci
2024-10-28 08:49:32 +02:00
958367bf53
server : refactor slot input data, move tokenizer to HTTP thread ( #10023 )
...
* server : refactor slot input data, move tokenizer to HTTP thread
* move prompt_tokens.empty() check
* fix incorrect if branch
* fix infinite generation loop
* bring back infill validation
* add infill test
* try fixing format_infill
* fix test
* remove redundant code
* rename completion to inference
* update docs
* use llama_tokens everywhere
2024-10-24 21:51:22 +02:00
a89f75e1b7
server : handle "logprobs" field with false value ( #9871 )
...
Co-authored-by: Gimling <huangjl@ruyi.ai>
2024-10-14 10:04:36 +03:00
c7181bd294
server : reuse cached context chunks ( #9866 )
...
ggml-ci
2024-10-13 18:52:48 +03:00
7eee341bee
common : use common_ prefix for common library functions ( #9805 )
...
* common : use common_ prefix for common library functions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
458367a906
server : better security control for public deployments ( #9776 )
...
* server : more explicit endpoint access settings
* protect /props endpoint
* fix tests
* update server docs
* fix typo
* fix tests
2024-10-08 13:27:04 +02:00
f4d2b8846a
llama : add reranking support ( #9510 )
...
* py : add XLMRobertaForSequenceClassification [no ci]
* py : fix scalar-tensor conversion [no ci]
* py : fix position embeddings chop [no ci]
* llama : read new cls tensors [no ci]
* llama : add classification head (wip) [no ci]
* llama : add "rank" pooling type
ggml-ci
* server : add rerank endpoint
ggml-ci
* llama : avoid ggml_repeat during classification
* rerank : cleanup + comments
* server : accept /rerank endpoint in addition to /v1/rerank [no ci]
* embedding : parse special tokens
* jina : support v1 reranker
* vocab : minor style
ggml-ci
* server : initiate tests for later
ggml-ci
* server : add docs
* llama : add comment [no ci]
* llama : fix uninitialized tensors
* ci : add rerank tests
ggml-ci
* add reranking test
* change test data
* Update examples/server/server.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add `--reranking` argument
* update server docs
* llama : fix comment [no ci]
ggml-ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
8a308354f6
server : match OAI structured output response ( #9527 )
2024-09-18 09:50:34 +03:00
6262d13e0b
common : reimplement logging ( #9418 )
...
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-15 20:46:12 +03:00
78203641fe
server : Add option to return token pieces in /tokenize endpoint ( #9108 )
...
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fixes windows ci?
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-09-12 22:30:11 +02:00
6e7d133a5f
server : refactor multitask handling ( #9274 )
...
* server : remove multitask from server_task
* refactor completions handler
* fix embeddings
* use res_ok everywhere
* small change for handle_slots_action
* use unordered_set everywhere
* (try) fix test
* no more "mutable" lambda
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* use deque
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-02 17:11:51 +02:00
978ba3d83d
Server: Don't ignore llama.cpp params ( #8754 )
...
* Don't ignore llama.cpp params
* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
4e24cffd8c
server : handle content array in chat API ( #8449 )
...
* server : handle content array in chat API
* Update examples/server/utils.hpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-12 14:48:15 +03:00
48e6b92cc3
Add chat template support for llama-cli ( #8068 )
...
* add chat template support for llama-cli
* add help message
* server: simplify format_chat
* more consistent naming
* improve
* add llama_chat_format_example
* fix server
* code style
* code style
* Update examples/main/main.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
7a16ce7db2
server : smart slot selection using Longest Common Prefix ( #7728 )
...
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
2024-06-08 10:50:31 +03:00
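The idea behind slot selection by Longest Common Prefix ( #7728 ) can be sketched as follows. This is an illustrative reading of the commit title, not the server's actual code: the `token` alias, the function names `common_prefix_len` and `pick_slot`, and the choice to return -1 when no slot shares a prefix are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using token = std::int32_t; // assumption: tokens are plain 32-bit integers

// Number of leading tokens that two sequences have in common.
std::size_t common_prefix_len(const std::vector<token> & a, const std::vector<token> & b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Hypothetical selector: returns the index of the slot whose cached tokens
// share the longest prefix with the new prompt, or -1 if no slot shares any.
int pick_slot(const std::vector<std::vector<token>> & slot_caches, const std::vector<token> & prompt) {
    int         best     = -1;
    std::size_t best_len = 0;
    for (std::size_t s = 0; s < slot_caches.size(); ++s) {
        const std::size_t len = common_prefix_len(slot_caches[s], prompt);
        if (len > best_len) {
            best_len = len;
            best     = static_cast<int>(s);
        }
    }
    return best;
}
```

A prefix match is cheap to compute (a single linear scan) and, under the assumption that a slot can only reuse cached context from the start of the sequence, it measures exactly the part of the cache that is reusable; a caller would fall back to some other policy when `pick_slot` returns -1.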
1442677f92
common : refactor cli arg parsing ( #7675 )
...
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remove --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
e586ee4259
change default temperature of OAI compat API from 0 to 1 ( #7226 )
...
* change default temperature of OAI compat API from 0 to 1
* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
c12452c7ae
JSON: [key] -> .at(key), assert() -> GGML_ASSERT ( #7143 )
2024-05-08 21:53:08 +02:00
1fd9c1741d
clean up json_value & server_log ( #7142 )
2024-05-08 13:24:14 +02:00
b97bc3966e
llama : support Llama 3 HF conversion ( #6745 )
...
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
75cd4c7729
ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response ( #6495 )
...
* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate
2024-04-06 05:40:47 +02:00
60cdf40cc3
server : handle exception on wrong type in request ( #6452 )
...
Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>
2024-04-03 21:09:52 +03:00
ad3a0505e3
Server: clean up OAI params parsing function ( #6284 )
...
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
2024-03-25 09:42:17 +01:00
1b26aebe4d
server: flush stdout after logging in both text and json layout ( #6253 )
2024-03-23 13:18:45 +01:00
72114edf06
json-schema-to-grammar : fix order of props + non-str const/enum ( #6232 )
...
* json: ordered json in server/schema converter to respect orig order
* json: ws nits
* json: support non-string const / enums
2024-03-22 15:07:44 +02:00
5b7b0ac8df
json-schema-to-grammar improvements (+ added to server) ( #5978 )
...
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* json: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json: fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives
{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update ts-type-to-grammar.sh
* Update ts-type-to-grammar.sh
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names `{number:{number:{root:number}}}`
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
2024-03-21 11:50:43 +00:00
47cc7a7bf9
Server: Handle n_keep parameter in the request ( #6174 )
2024-03-20 12:02:34 +01:00
99b71c068f
Server: Use multi-task for embeddings endpoint ( #6001 )
...
* use multitask for embd endpoint
* specify types
* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
caa106d4e0
Server: format error to json ( #5961 )
...
* server: format error to json
* server: do not crash on grammar error
* fix api key test case
* revert limit max n_predict
* small fix
* correct coding style
* update completion.js
* launch_slot_with_task
* update docs
* update_slots
* update webui
* update readme
2024-03-11 10:56:41 +01:00
332bdfd798
server : maintain chat completion id for streaming responses ( #5988 )
...
* server: maintain chat completion id for streaming responses
* Update examples/server/utils.hpp
* Update examples/server/utils.hpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-11 10:09:32 +02:00