Commit Graph

1477 Commits

Author SHA1 Message Date
bc21975084 speculative : fix handling of some input params (#9963)
* speculative : fix batch sizes at initialization

ggml-ci

* speculative : handle params.n_predict == -1

* speculative : limit batch size to llama_n_batch
2024-10-21 09:37:12 +03:00
cda0e4b648 llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
* refactor llama_batch_get_one

* adapt all examples

* fix simple.cpp

* fix llama_bench

* fix

* fix context shifting

* free batch before return

* use common_batch_add, reuse llama_batch in loop

* null terminated seq_id list

* fix save-load-state example

* fix perplexity

* correct token pos in llama_batch_allocr
2024-10-18 23:18:01 +02:00
87421a23e8 [SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
2024-10-18 06:46:16 +01:00
8901755ba3 server : add n_indent parameter for line indentation requirement (#9929)
ggml-ci
2024-10-18 07:32:19 +03:00
17bb928080 readme : remove --memory-f32 references (#9925) 2024-10-17 23:43:05 +03:00
dbf18e4de9 llava : fix typo in error message [no ci] (#9884) 2024-10-16 20:24:05 +03:00
66c2c93082 grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
2024-10-16 19:03:24 +03:00
1f66b699c4 server : fix the disappearance of the end of the text (#9867)
* server: fix the disappearance of the end of the text when streaming with stop strings

* simplify "send text" checks
2024-10-16 11:35:53 +03:00
755a9b2bf0 llama : add infill sampler (#9896)
ggml-ci
2024-10-15 16:35:33 +03:00
223c25a72f server : improve infill context reuse (#9894)
ggml-ci
2024-10-15 16:28:55 +03:00
fbc98b748e sampling : add XTC sampler (#9742)
* Initial XTC commit

Adds XTC sampler, not activated by default, but recommended settings by default.

* Cleanup

* Simplified chances calculation

To be more inline with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproduceable 

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
2024-10-15 12:54:55 +02:00
dcdd535302 server : update preact (#9895) 2024-10-15 12:48:44 +03:00
a89f75e1b7 server : handle "logprobs" field with false value (#9871)
Co-authored-by: Gimling <huangjl@ruyi.ai>
2024-10-14 10:04:36 +03:00
d4c19c0f5c server : accept extra_context for the infill endpoint (#9874)
* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci
2024-10-13 21:31:35 +03:00
c7181bd294 server : reuse cached context chunks (#9866)
ggml-ci
2024-10-13 18:52:48 +03:00
edc265661c server : add option to time limit the generation phase (#9865)
ggml-ci
2024-10-12 16:14:27 +03:00
1bde94dd02 server : remove self-extend features (#9860)
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
2024-10-12 16:06:31 +03:00
95c76e8e92 server : remove legacy system_prompt feature (#9857)
* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props
2024-10-12 14:51:54 +03:00
11ac9800af llama : improve infill support and special token detection (#9798)
* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (#9800)

* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00
7eee341bee common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
0e9f760eb1 rpc : add backend registry / device interfaces (#9812)
* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
2024-10-10 20:14:55 +02:00
c7499c557c examples : do not use common library in simple example (#9803)
* examples : do not use common library in simple example

* add command line parser, simplify code
2024-10-10 19:50:49 +02:00
c81f3bbb05 cmake : do not build common library by default when standalone (#9804) 2024-10-09 18:49:52 +02:00
e7022064ab perplexity : fix integer overflow (#9783)
* perplexity : fix integer overflow

ggml-ci

* perplexity : keep n_vocab as int and make appropriate casts

ggml-ci
2024-10-09 17:00:18 +03:00
3dc48fe75a examples : remove llama.vim
An updated version will be added in #9787
2024-10-09 10:55:42 +03:00
dca1d4b58a ggml : fix BLAS with unsupported types (#9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
2024-10-08 14:21:43 +02:00
458367a906 server : better security control for public deployments (#9776)
* server : more explicit endpoint access settings

* protect /props endpoint

* fix tests

* update server docs

* fix typo

* fix tests
2024-10-08 13:27:04 +02:00
f4b2dcdf49 readme : fix typo [no ci] 2024-10-06 13:49:41 +03:00
8c475b97b8 rerank : use [SEP] token instead of [BOS] (#9737)
* rerank : use [SEP] token instead of [BOS]

ggml-ci

* common : sanity check for non-NULL tokens

ggml-ci

* ci : adjust rank score interval

ggml-ci

* ci : add shebang to run.sh

ggml-ci
2024-10-05 15:55:04 +03:00
133c7b46b3 Fixed RNG seed docs (#9723)
* Update README.md

fixed RNG seed info

* changed print format to unsigned
2024-10-04 10:54:44 +02:00
841713e1e4 rpc : enable vulkan (#9714)
closes #8536
2024-10-03 13:00:52 +03:00
76b37d1541 gguf-split : improve --split and --merge logic (#9619)
* make sure params --split and --merge are not specified at same time

* update gguf-split params parse logic

* Update examples/gguf-split/gguf-split.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-10-02 10:21:57 +03:00
148844fe97 examples : remove benchmark (#9704)
ggml-ci
2024-10-02 10:14:44 +03:00
cad341d889 metal : reduce command encoding overhead (#9698)
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
2024-10-01 16:00:25 +03:00
511636df0c ci : reduce severity of unused Pyright ignore comments (#9697) 2024-09-30 14:13:16 -04:00
vb
08a43d05b6 py : update transfomers version (#9694)
* update transfomers version.

* update hfh version.
2024-09-30 18:03:47 +03:00
f4d2b8846a llama : add reranking support (#9510)
* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classigication head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : aboud ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
6102037bbb vocab : refactor tokenizer to reduce init overhead (#9449)
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-28 15:10:58 +03:00
afbbfaa537 server : add more env vars, improve gen-docs (#9635)
* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
2024-09-25 14:05:13 +02:00
cea1486ecf log : add CONT level for continuing previous log entry (#9610) 2024-09-24 10:15:35 +03:00
0aa15011e3 server : add newline after chat example (#9616) 2024-09-24 09:04:39 +03:00
b0f27361f3 sampling : avoid expensive softmax during greedy sampling (#9605)
* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 09:03:17 +03:00
0b3bf966f4 server : add --no-context-shift option (#9607)
* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-23 22:23:54 +02:00
37f8c7b4c9 perplexity : remove extra new lines after chunks (#9596) 2024-09-23 11:28:02 +03:00
63351143b2 quantize : improve type name parsing (#9570)
quantize : do not ignore invalid types in arg parsing

quantize : ignore case of type and ftype arguments
2024-09-20 20:55:36 +02:00
d39e26741f examples : flush log upon ctrl+c (#9559) 2024-09-20 11:46:56 +03:00
722ec1eb51 perplexity : do not escape input data by default (#9548) 2024-09-20 09:38:10 +03:00
6026da52d6 server : clean-up completed tasks from waiting list (#9531)
ggml-ci
2024-09-19 12:44:53 +03:00
eca0fab44e imatrix : disable prompt escape by default (#9543) 2024-09-19 10:58:14 +03:00
8a308354f6 server : match OAI structured output response (#9527) 2024-09-18 09:50:34 +03:00