Commit Graph

112 Commits

Author SHA1 Message Date
35a2ee9143 Remove unused data and add fixes (#5154)
* Remove unused data and add fixes

* Add missing file

* Address review comments

* Replace the scope of vq allocation
2024-01-27 15:25:55 +01:00
ec903c0341 server : add self-extend support (#5104)
* Ported self extension to server example

* Update server.cpp

* Fixed prompt caching without self extend

* Update server.cpp

* Added description to server readme.

* Update server.cpp

* Update server.cpp

* Update server.cpp

* Update server.cpp

* Update README.md

* Changed descriptions

* server : formatting

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update server.cpp

* Update server.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-27 15:38:05 +02:00
48c857aa10 server : refactored the task processing logic (#5065)
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
2024-01-26 14:42:20 +02:00
821f0a271e server : defer tasks when "slot unavailable" (#5018)
* server: defer task when no slot is available

* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
2024-01-18 22:33:05 +02:00
0ea069b87b server : fix prompt caching with system prompt (#4914) 2024-01-13 19:31:26 +02:00
356327feb3 server : fix deadlock that occurs in multi-prompt scenarios (#4905)
* * fix deadlock

* * dont ruint all whitespace
2024-01-13 16:20:46 +02:00
ee8243adaa server : fix crash with multimodal models without BOS token (#4904) 2024-01-13 16:16:11 +02:00
e7e4df031b llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-01-12 20:07:38 +01:00
1d118386fe server : fix infill when prompt is empty (#4833) 2024-01-11 23:23:49 +02:00
4330bd83fe server : implement credentialed CORS (#4514)
* Implement credentialed CORS according to MDN

* Fix syntax error

* Move validate_api_key up so it is defined before its first usage
2024-01-11 20:02:48 +02:00
27379455c3 server : support for multiple api keys (#4864)
* server: added support for multiple api keys, added loading api keys from file

* minor: fix whitespace

* added file error handling to --api-key-file, changed code to better
reflect current style

* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-01-11 19:51:17 +02:00
eab6795006 server : add LOG_INFO when model is successfully loaded (#4881)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line

* updated `server` readme to document the `/health` endpoint too

* used LOG_INFO after successful model loading
2024-01-11 19:41:39 +02:00
2f043328e3 server : fix typo in model name (#4876) 2024-01-11 16:33:26 +02:00
5c1980d8d4 server : fix build + rename enums (#4870) 2024-01-11 09:10:34 +02:00
cd108e641d server : add a /health endpoint (#4860)
* added /health endpoint to the server

* added comments on the additional /health endpoint

* Better handling of server state

When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.

* initialized server_state

* fixed a typo

* starting http server before initializing the model

* Update server.cpp

* Update server.cpp

* fixes

* fixes

* fixes

* made ServerState atomic and turned two-line spaces into one-line
2024-01-10 21:56:05 +02:00
67984921a7 server : fix n_predict check (#4798) 2024-01-07 08:45:26 +02:00
012cf349ae server : send token probs for "stream == false" (#4714) 2024-01-04 19:56:33 +02:00
32866c5edd editorconfig : fix whitespace and indentation #4710 2024-01-02 13:28:15 +02:00
5d7002d437 server : add --override-kv parameter (#4710)
* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
2024-01-02 12:38:15 +02:00
9fbda719de clip : refactor + bug fixes (#4696)
* clip : refactor + bug fixes

ggml-ci

* server : add log message
2023-12-30 23:24:42 +02:00
db49ff8ed7 server : replace sleep with condition variables (#4673)
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available cpu. Having the explicit notify() / wait() code
also helps aid in the readability of the server code.

See mozilla-Ocho/llamafile@711344b
2023-12-29 16:24:12 +02:00
60f55e888c server : fix OpenAI server sampling w.r.t. penalty. (#4675) 2023-12-29 16:22:44 +02:00
b93edd22f5 server : allow to generate multimodal embeddings (#4681) 2023-12-29 16:22:10 +02:00
65e5f6dadb Fix OpenAI server sampling w.r.t. temp and seed (#4668)
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.

The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.

See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
2023-12-28 15:20:00 -04:00
6123979952 server : allow to specify custom prompt for penalty calculation (#3727) 2023-12-23 11:31:49 +02:00
0ffc92d2d2 server : disable llm logs if SERVER_VERBOSE is off (#3792) 2023-12-17 17:02:16 +02:00
8edd2b40fd server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
2023-12-17 16:57:56 +02:00
eb16dae7e7 server : fix possible ambiguity in content type charset (#4501) 2023-12-17 16:56:09 +02:00
62bd52b7bf server : allow requests larger than 8K (#4500) 2023-12-17 16:54:37 +02:00
88ae8952b6 server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-15 13:49:01 +02:00
948ff137ec server : fix handling of characters that span multiple tokens when streaming (#4446) 2023-12-13 21:57:15 +02:00
d9d4cfef64 server : fix local model name in server (#4420) 2023-12-12 11:25:29 +02:00
bcc0eb4591 llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
2023-12-07 13:03:17 +02:00
05cd6e5036 server : recognize cache_prompt parameter in OAI API (#4347) 2023-12-06 20:21:59 +02:00
33e171d1e9 server : fix OpenAI API stop field to be optional (#4299)
(cherry picked from commit Mozilla-Ocho/llamafile@e8c92bcb84)
2023-12-03 11:10:43 +02:00
d5a1cbde60 llama : support optional tensors (#4283) 2023-12-01 20:35:47 +02:00
1d144112c0 server : add --log-disable to disable logging to file (#4260)
* * add --log-disable to disable logging to file in the server example

* * typo fix
2023-12-01 00:25:49 +02:00
f43f09366d server : add single-client multi-prompt support (#4232)
* * add multiprompt support

* * cleanup

* * more cleanup

* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests

* * remove all references to mutex_multitasks

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update examples/server/server.cpp

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* * change to set

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-12-01 00:25:04 +02:00
af19d35734 server : OAI API compatibility (#4198)
* Add openai-compatible POST /v1/chat/completions API endpoint to server example

* fix code style

* Update server README.md

* Improve server README.md

* Fix server.cpp code style according to review

* server : some style changes

* server : indentation

* server : enable special tokens during tokenization by default

* server : minor code style

* server : change random string generator

* straightforward /v1/models endpoint

---------

Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
2023-11-25 11:29:06 +02:00
55978ce09b Fix incorrect format strings and uninitialized variables. (#4133)
* Fix incorrect format strings and uninitialized variables.

* Address comments

* Add the missing include statement
2023-11-23 22:56:53 +01:00
936c79b227 server : relay error messages (#4131) 2023-11-19 18:54:10 +02:00
91f6499393 Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040)
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
2023-11-16 19:14:37 -07:00
d96ca7ded7 server : fix crash when prompt exceeds context size (#3996) 2023-11-10 23:48:21 -06:00
57ad015dc3 server : add min_p param (#3877)
* Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841

* Use spaces instead of tabs

* Update index.html.hpp after running deps.sh

* Fix test - fix line ending
2023-11-08 20:00:34 -06:00
b12fa0d1c1 build : link against build info instead of compiling against it (#3879)
* cmake : fix build when .git does not exist

* cmake : simplify BUILD_INFO target

* cmake : add missing dependencies on BUILD_INFO

* build : link against build info instead of compiling against it

* zig : make build info a .cpp source instead of a header

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>

* cmake : revert change to CMP0115

---------

Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 08:50:16 +02:00
898aeca90a llama : implement YaRN RoPE scaling (#2268)
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 18:04:33 -04:00
ca190bca8e server : re-enable completion and embedded at the same time (#3876) 2023-11-01 11:28:28 +02:00
6e08281e58 Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843)
* Extend llama_kv_cache_seq_rm to allow matichng any sequence

* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear

Use llama_kv_cache_clear for cache clearing

Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
2023-10-29 11:31:40 -06:00
34b2a5e1ee server : do not release slot on image input (#3798) 2023-10-26 22:54:17 +03:00
ad93962657 server : add parameter -tb N, --threads-batch N (#3584) (#3768)
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2023-10-24 23:10:43 +03:00