Commit Graph

1242 Commits

Author SHA1 Message Date
3fe847b574 server : do not apply Markdown formatting in code sections (#6850) 2024-04-24 13:54:24 +03:00
37246b1031 common : revert showing control tokens by default for server (#6860)
* fix: revert showing control tokens by default

* feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens

* feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses"

* common : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 13:15:29 +03:00
28103f4832 Server: fix seed for multiple slots (#6835)
* Server: add tests for consistent results

* sampling: separate rng per sampling context
2024-04-24 11:08:36 +02:00
5cf5e7d490 build: generate hex dump of server assets during build (#6661)
* `build`: generate hex dumps of server assets on the fly

* build: workaround lack of -n on gnu xxd

* build: don't use xxd in cmake

* build: don't call xxd from build.zig

* build: more idiomatic hexing

* build: don't use xxd in Makefile (od hackery instead)

* build: avoid exceeding max cmd line limit in makefile hex dump

* build: hex dump assets at cmake build time (not config time)
2024-04-21 18:48:53 +01:00
40f74e4d73 llama : add option to render special/control tokens (#6807)
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
2024-04-21 18:36:45 +03:00
89b0bf0d5d llava : use logger in llava-cli (#6797)
This change removes printf() logging so llava-cli is shell scriptable.
2024-04-21 15:19:04 +03:00
b97bc3966e llama : support Llama 3 HF conversion (#6745)
* Support Llama 3 conversion

The tokenizer is BPE.

* style

* Accept suggestion

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* llama : add llama_token_is_eog()

ggml-ci

* llama : auto-detect more EOT tokens when missing in KV data

* convert : replacing EOS token is a hack

* llama : fix codegemma EOT token + add TODOs

* llama : fix model type string for 8B model

---------

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
b8109bc013 doc : server tests require llama to be built with curl enabled (#6788) 2024-04-20 18:29:50 +02:00
637e9a86c2 server: static: upstream upgrade (#6765) 2024-04-19 13:19:01 +02:00
8b1b1f4982 train : add general name (#6752)
* llama : make general.name optional

* train: Add 'general.name' to model metadata

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-19 10:16:45 +03:00
0d56246f4b ggml : group all experts in a single ggml_mul_mat_id (#6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
8cc91dc63c ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00
8a56075b07 gritlm : add --outdir option to hf.sh script (#6699)
This commit updates the hf.sh script usage to include the --outdir option
and specifies the models directory as the output directory.

The motivation for this is to avoid cluttering the root directory with
model files.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-16 09:34:06 +03:00
58227ffdeb perplexity : require positive --ctx-size arg (#6695) 2024-04-16 09:28:33 +03:00
7593639ce3 main: add --json-schema / -j flag (#6659)
* main: add --json-schema / -j

* json: move json-schema-to-grammar to common lib

* json: fix zig build
2024-04-15 18:35:21 +01:00
3272896d79 server : revert "minor layout improvements" (#6684)
This reverts commit b3a96f27f0.
2024-04-15 15:18:47 +03:00
8800226d65 Fix --split-max-size (#6655)
* Fix --split-max-size

Byte size calculation was done on int and overflowed.

* add tests.sh

* add examples test scripts to ci run

Will autodiscover examples/*/tests.sh scripts and run them.

* move WORK_PATH to a subdirectory

* clean up before and after test

* explicitly define which scripts to run

* add --split-max-size to readme
2024-04-14 13:12:59 +02:00
de17e3f745 fix memcpy() crash, add missed cmd in guide, fix softmax (#6622)
* disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax

* refactor to disable mmap for SYCL backend

* fix compile error in other os

* refactor the solution, use host buf to fix it, instead of disable mmap

* keep to support mmap()

* use host buff to reduce malloc times

* revert to malloc/free solution, for threaad safe
2024-04-14 10:42:29 +08:00
4bd0f93e4a model: support arch DbrxForCausalLM (#6515)
* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
2024-04-13 11:33:52 +02:00
ab9a3240a9 JSON schema conversion: ️ faster repetitions, min/maxLength for strings, cap number length (#6555)
* json: rename python schema converter to make import easier

* server: skip null json_schema / grammar fields

* json: deps management for primitive rules (+ allow null values)

* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`

* grammars: add troubleshooting section to readme

* json: cap length of numbers to 15 digits before/after decimal point

(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)

* json: unify all repetition code (w/ or w/o sep)

* json: support string minLength/maxLength

* server+json: update server/README w/ result_format

* nits

* json: fix type error w/ python 3.8

* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)

* json: simplify DOT `{"type": "string", "pattern": "^.$"}`

* json: remove recursion in opt_repetitions (avoids Python stack overflow)

* json: rm dead code

* json: rm useless assert & ggml.h import
2024-04-12 19:43:38 +01:00
4cc120c744 infill : add download instructions for model (#6626)
* infill : add download instructions for model

This commit adds instructions on how to download a CodeLlama model
using the `hf.sh` script. This will download the model and place it
in the `models` directory which is the same model use later by the
infill example.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! infill : add download instructions for model

Clarify the reason for using CodeLlama.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-12 15:11:46 +03:00
24ee66ed0d server : coherent log output for KV cache full (#6637) 2024-04-12 14:49:21 +03:00
5c4d767ac0 chore: Fix markdown warnings (#6625) 2024-04-12 10:52:36 +02:00
ef21ce4ccb imatrix : remove invalid assert (#6632) 2024-04-12 11:49:58 +03:00
81da18e71c eval-callback: use ggml_op_desc to pretty print unary operator name (#6631) 2024-04-12 10:26:47 +02:00
f7001ccc5a As suggested by @slaren, disabling Metal for test to fix CI build on OSX from #6576 (#6619) 2024-04-11 17:44:48 -04:00
cbaadc9294 grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609)
* grammars: reserve rejects & next candidates

* grammars: reuse new_stacks

* grammars: fix missing sig change in llama.h

* grammars: fix test (api changed)

* grammars: update gbnf-validator.cpp

* grammars: simpler syntax (no swap)
2024-04-11 19:47:34 +01:00
b804b1ef77 eval-callback: Example how to use eval callback for debugging (#6576)
* gguf-debug: Example how to use ggml callback for debugging

* gguf-debug: no mutex, verify type, fix stride.

* llama: cv eval: move cb eval field in common gpt_params

* ggml_debug: use common gpt_params to pass cb eval.
Fix get tensor SIGV random.

* ggml_debug: ci: add tests

* ggml_debug: EOL in CMakeLists.txt

* ggml_debug: Remove unused param n_batch, no batching here

* ggml_debug: fix trailing spaces

* ggml_debug: fix trailing spaces

* common: fix cb_eval and user data not initialized

* ci: build revert label

* ggml_debug: add main test label

* doc: add a model: add a link to ggml-debug

* ggml-debug: add to make toolchain

* ggml-debug: tests add the main label

* ggml-debug: ci add test curl label

* common: allow the warmup to be disabled in llama_init_from_gpt_params

* ci: add curl test

* ggml-debug: better tensor type support

* gitignore : ggml-debug

* ggml-debug: printing also the sum of each tensor

* ggml-debug: remove block size

* eval-callback: renamed from ggml-debug

* eval-callback: fix make toolchain

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-11 14:51:07 +02:00
8228b66dbc gguf : add option to not check tensor data (#6582)
This commit adds an option to the gguf example to not check the tensor
data.

The motivation for this is that it can be nice to use the gguf tool to
read other .gguf files that were not created by the gguf tool.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-10 21:16:48 +03:00
b3a96f27f0 minor layout improvements (#6572)
* minor layout improvements

* added missing file, run deps.sh locally
2024-04-10 19:18:25 +02:00
1b67731e18 BERT tokenizer fixes (#6498)
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does
2024-04-09 13:44:08 -04:00
400d5d722d server : detect search query to start webchat (#6554) 2024-04-09 10:31:47 +02:00
beea6e1b16 llama : save and restore kv cache for single seq id (#6341)
* llama : save and restore kv cache for single seq id

* remove trailing whitespace

* respond error in case there's no space in the kv cache

* add kv seq save restore to test case

* add --slot-save-path arg to enable save restore and restrict save location

* Returning 0 for some cases, instead of asserting.

* cleanup error cases

* rename sequence state functions

* rename state get set functions

* add previous function names back in with DEPRECATED notice

* update doc

* adjust endpoints to preferred style

* fix restoring zero cell count

* handle seq rm return value

* unused param

* keep in the size check

* fix return types

* add server test case for slot save restore

* cleanup

* add cake

* cleanup style

* add special

* removing a whole sequence never fails

* move sequence state file functionality from server to llama to match session api and add version tags

* catch exceptions on save as well

* error log messages

* check types for stricter restore

* update server doc

* readme : update API changes date

* strict filename validation

* move include, reject bom as well

* also reject empty filename

* reject whitespace and trailing dot

---------

Co-authored-by: Martin Evans <martindevans@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-08 15:43:30 +03:00
75cd4c7729 ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)
* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as it is not accurate

* ci: bench: fix case when there is no token generated

* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate
2024-04-06 05:40:47 +02:00
87e21bbacd bench : make n_batch and n_ubatch configurable in Batched bench (#6500)
* bench: make n_batch and n_ubatch configurable

* bench: update doc for batched bench
2024-04-05 21:34:53 +03:00
2e66913e5f server: allow penalizing repetition of newlines on server webpage (#6431) 2024-04-04 17:03:00 +02:00
7a2c92637a ci: bench: add more ftype, fix triggers and bot comment (#6466)
* ci: bench: change trigger path to not spawn on each PR

* ci: bench: add more file type for phi-2: q8_0 and f16.
- do not show the comment by default

* ci: bench: add seed parameter in k6 script

* ci: bench: artefact name perf job

* Add iteration in the commit status, reduce again the autocomment

* ci: bench: add per slot metric in the commit status

* Fix trailing spaces
2024-04-04 12:57:58 +03:00
9b84ae1806 examples : add GBNF validator program (#5948)
* Revising GBNF validator program to be much simpler.

* Changing from streams to using cstdio

* Adding final newline character.
2024-04-04 10:44:28 +03:00
4399f13fb9 server : remove obsolete --memory-f32 option 2024-04-04 09:34:58 +03:00
1a43c7254e server : add option to disable KV offload (#6468) 2024-04-04 09:33:48 +03:00
5fb1574c81 A few small fixes to server's README docs (#6428)
* Typo fix to server's README.md

Fix minor typo ("tonen") in server README.

* server readme grammar/style fixes.

Quickly went through this file to look for inconsistencies in
presentation of defaults, flag options, and looked for typos
and grammar issues.

Not perfect, but hopefully improved.

* Update README.md

Remove an extra space before newline.
2024-04-03 22:22:57 +02:00
60cdf40cc3 server : handle exception on wrong type in request (#6452)
Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>
2024-04-03 21:09:52 +03:00
08a0c02060 ggml : mul_mat_id use the same tensor for all the experts (#6387)
* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03 16:07:05 +03:00
f7fc5f6c6f split: allow --split-max-size option (#6343)
* split by max size

* clean up arg parse

* split: ok

* add dry run option

* error on 0 tensors

* be positive

* remove next_metadata_size
2024-03-29 22:34:44 +01:00
66ba560256 llava : fix MobileVLM (#6364)
* fix empty bug

* Update MobileVLM-README.md

added more results on devices

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update MobileVLM-README.md

* Update examples/llava/MobileVLM-README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update MobileVLM-README.md

remove gguf links

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-28 16:33:10 +02:00
cfc4d75df6 doc: fix outdated default value of batch size (#6336)
* doc: fix outdated default value of batch size

* doc: add doc for ubatch-size
2024-03-28 09:51:06 +01:00
6902cb7f2e server : stop gracefully on SIGTERM (#6348) 2024-03-28 09:50:48 +01:00
d0e2f6416b doc: fix typo in MobileVLM-README.md (#6181) 2024-03-28 13:03:30 +09:00
a016026a3a server: continuous performance monitoring and PR comment (#6283)
* server: bench: init

* server: bench: reduce list of GPU nodes

* server: bench: fix graph, fix output artifact

* ci: bench: add mermaid in case of image cannot be uploaded

* ci: bench: more resilient, more metrics

* ci: bench: trigger build

* ci: bench: fix duration

* ci: bench: fix typo

* ci: bench: fix mermaid values, markdown generated

* typo on the step name

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* ci: bench: trailing spaces

* ci: bench: move images in a details section

* ci: bench: reduce bullet point size

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-03-27 20:26:49 +01:00
1e13987fba embedding : show full embedding for single prompt (#6342)
* embedding : show full embedding for single prompt

To support the use case of creating an embedding for a given prompt, the entire embedding and not just the first part needed to be printed.

Also, show cosine similarity matrix only if there is more than one prompt, as the cosine similarity matrix for a single prompt is always `1.00`.

* Update examples/embedding/embedding.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-27 13:15:44 +02:00