Commit Graph

3128 Commits

Author SHA1 Message Date
b648243496 add/fix gbnf-validator subfolder to cmake 2024-06-08 14:07:56 +01:00
81222f02db prefix more cmake targets w/ llama- 2024-06-08 14:05:34 +01:00
10650b692d rename {main->llama}-cmake-pkg binary 2024-06-08 13:57:06 +01:00
78bca8cb07 fix main refs 2024-06-08 13:52:03 +01:00
ab5efbb3b6 Prefix all example bins w/ llama- 2024-06-08 13:42:01 +01:00
23d0df5bd5 main: target name -> llama-cli 2024-06-08 12:50:35 +01:00
fe93cc96cc Merge remote-tracking branch 'origin/master' into bins 2024-06-08 12:04:52 +01:00
7a16ce7db2 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
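The slot-selection idea above can be sketched as a longest-common-prefix score over each slot's cached tokens: the server reuses the slot whose KV cache already matches the most leading tokens of the new prompt. A minimal sketch with hypothetical helper names (the real logic lives in examples/server):

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def pick_slot(slot_caches: list[list[int]], prompt: list[int]) -> int:
    """Pick the slot whose cached tokens share the longest prefix with the
    new prompt, so its KV cache can be reused instead of recomputed."""
    return max(range(len(slot_caches)),
               key=lambda i: common_prefix_len(slot_caches[i], prompt))

caches = [[1, 2, 3], [1, 2, 9, 9], [7, 8]]
print(pick_slot(caches, [1, 2, 9, 5]))  # 1: slot 1 shares prefix [1, 2, 9]
```

LCP (rather than longest common substring) is the right score here because a KV cache is only reusable up to the first mismatching position.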
da799b4189 vulkan : reuse parent extra for views (#7806)
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-06-07 19:47:49 +02:00
c00fad71e5 gguf-split : change binary multi-byte units to decimal (#7803) 2024-06-07 15:56:01 +03:00
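The unit change above is the difference between binary multi-byte units (KiB/MiB, powers of 1024) and decimal ones (KB/MB, powers of 1000). A small illustration of the two conventions (hypothetical helper, not the gguf-split code):

```python
def human_size(n_bytes: int, binary: bool = False) -> str:
    """Format a byte count with decimal (KB/MB/GB) or binary (KiB/MiB/GiB) units."""
    base = 1024 if binary else 1000
    units = ["B"] + [p + ("iB" if binary else "B") for p in "KMGT"]
    size = float(n_bytes)
    for u in units:
        if size < base or u == units[-1]:
            return f"{size:.2f} {u}"
        size /= base

print(human_size(5_000_000_000))               # 5.00 GB
print(human_size(5_000_000_000, binary=True))  # 4.66 GiB
```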
27615f5ab2 cmake : fix BUILD_SHARED_LIBS=ON build (#7784)
common depends on pthreads in Linux
2024-06-07 15:15:07 +03:00
0dba58269f Update server-llm.sh 2024-06-07 11:52:40 +01:00
7027b27d76 server: update cache_prompt documentation [no ci] (#7745) 2024-06-07 11:15:49 +02:00
af8f0169da Update .gitignore 2024-06-07 10:14:03 +01:00
7fbe6006c9 update straggling refs 2024-06-07 09:42:21 +01:00
99df4cc091 rm accidentally checked in bins 2024-06-07 09:40:09 +01:00
a5cabd7649 server : do not get prompt in infill mode (#7286)
* avoid getting the prompt in infill mode and embedding mode
* avoid getting the prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
d5c938cd77 [SYCL] fix softmax r2r result wrong issue (#7811) 2024-06-07 14:28:26 +08:00
c9ee7118d5 check for nans in imatrix and quantize (#7807)
* imatrix : detect nan/inf values

* quantize : check imatrix for nan/inf values
2024-06-07 09:01:29 +03:00
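The check described above amounts to scanning importance-matrix values for non-finite entries before they can silently corrupt quantization. A hedged sketch (hypothetical function name, not the imatrix/quantize code):

```python
import math

def validate_imatrix(values: list[float], name: str = "tensor") -> None:
    """Reject importance-matrix data containing NaN or inf, which would
    otherwise propagate into the quantized weights undetected."""
    for i, v in enumerate(values):
        if not math.isfinite(v):  # catches nan, inf and -inf
            raise ValueError(f"{name}: non-finite value {v!r} at index {i}")

validate_imatrix([0.5, 1.25, 3.0])  # ok, returns None
try:
    validate_imatrix([0.5, float("nan")])
except ValueError as e:
    print(e)
```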
fbd83131f5 Merge remote-tracking branch 'origin/master' into bins 2024-06-07 00:51:31 +01:00
a0a7f2b031 Update build.yml 2024-06-07 00:38:05 +01:00
8695baebc0 update more names 2024-06-07 00:21:01 +01:00
ee459f40f6 server : fix --threads-http arg (#7801) 2024-06-06 19:19:59 +03:00
9a03341094 main/server: fix targets 2024-06-06 15:53:25 +01:00
8b7c734473 main: update refs -> llama
fix examples/main ref
2024-06-06 15:44:51 +01:00
f5f19a236f server: simplify nix package 2024-06-06 15:44:40 +01:00
f298cc63d2 server: update refs -> llama-server
gitignore llama-server
2024-06-06 15:44:40 +01:00
849842916d main/server: rename to llama / llama-server for consistency w/ homebrew 2024-06-06 15:28:27 +01:00
f83351f9a6 imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
ad675e1c67 Added support for . (any character) token in grammar engine. (#6467)
* Added support for . (any character) token in grammar engine.

* Add integration tests for any-character symbol.
2024-06-06 06:08:52 -07:00
a143c04375 README minor fixes (#7798) [no ci]
derievatives --> derivatives
2024-06-06 22:17:54 +10:00
55b2d0849d grammars: x{min,max} repetition operator (#6640)
* grammars: x{min,max} repetition operator + tweak +/*/? to avoid duplication of original over alternates

* grammars: handle `x{n}` and fix `x{n,n}`

* grammars: document new repetition operators

* grammars: uniform use of int for min & max

* grammars: refactor parser test

* grammar: parsing tests w/ natural pretty print of updated expectations

* grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all)

* grammars: improve test pretty print again

* grammars: pretty print rules and chars

* grammars: fix copy rule skipping

* grammars: disallow `a{,}` (not allowed in regexps)

* Update common/grammar-parser.cpp

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: fix copy rule skipping (again) & display of expectations

* grammars: more test cases

* grammars: update reps parsing to bring ? / * / + closer to before

* json: use new GBNF repetitions{m,n} syntax

* grammars: update performance gotchas w/ repetition advice

* Update examples/json_schema_to_grammar.py

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: comment on rule repetitions

* grammars: ensure unambiguous number alternatives

* grammar: nit typo switched error msgs

* grammar: nit numbering in comment

* json: update numeric rule to be unambiguous

* Apply suggestions from code review

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* json: fix integral-part

* grammar: add repetition tests

---------

Co-authored-by: Clint Herron <hanclinto@gmail.com>
2024-06-06 10:07:06 +01:00
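The `x{min,max}` operator summarized above lets a GBNF rule bound repetitions directly instead of hand-unrolling alternates: `x{m,n}` means m to n repetitions, `x{m,}` means m or more, and `x{n}` means exactly n (while `a{,}` is disallowed, as in regexps). An illustrative grammar, not taken from the PR:

```
root  ::= digit{1,3} ("." digit{1,3}){3}   # e.g. an IPv4-style dotted quad
digit ::= [0-9]
```

Without the operator, `digit{1,3}` would have to be spelled out as `digit | digit digit | digit digit digit`.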
f5d7b268ec llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
2d08b7fbb4 docker : build only main and server in their images (#7782)
* add openmp lib to dockerfiles

* build only main and server in their docker images
2024-06-06 08:19:49 +03:00
d67caea0d6 docker : add openmp lib (#7780) 2024-06-06 08:17:21 +03:00
7672adeec7 Fix encoding in python scripts (#7733) 2024-06-06 03:07:24 +10:00
7d1a378b8f CUDA: refactor mmq, dmmv, mmvq (#7716)
* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits
b3092
2024-06-05 16:53:00 +02:00
2b3389677a ggml : refactor rope norm/neox (#7634)
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
b3091
2024-06-05 11:29:20 +03:00
9973e81c5c readme : remove -ins (#7759)
-ins and --instruct were moved in https://github.com/ggerganov/llama.cpp/pull/7675

I have adjusted the README accordingly.
There was no trace of --chatml in the README.
2024-06-05 09:40:49 +03:00
c90dbe026b Fix per token attributes bits (#7749) b3089 2024-06-05 01:26:14 +02:00
b90dc566c1 Allow number of nodes in CUDA graph to change (#7738)
Previously the code failed when the number of nodes changed in an existing
CUDA graph. This fixes the issue by removing an unnecessary conditional.
b3088
2024-06-04 22:06:49 +02:00
1442677f92 common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remove --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
b3087
2024-06-04 21:23:39 +03:00
554c247caf ggml : remove OpenCL (#7735)
ggml-ci
b3086
2024-06-04 21:23:20 +03:00
0cd6bd3483 llama : remove beam search (#7736) b3085 2024-06-04 21:23:05 +03:00
5ca0944a15 readme : remove obsolete Zig instructions (#7471) b3084 2024-06-04 19:43:01 +03:00
adc9ff3841 llama-bench : allow using a different printer for stderr with -oe (#7722)
compare-commits.sh : hide stdout, use -oe to print markdown
b3083
2024-06-04 14:32:42 +02:00
987d743d6b Improve hipBLAS support in CMake (#7696)
* Improve hipBLAS support in CMake

This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.

* Set ROCM_PATH correctly
b3082
2024-06-04 14:09:15 +02:00
b226c1227b refine .gitignore (#7688)
This adds tags and the Android NDK to the git ignore list
2024-06-04 21:21:26 +10:00
3b38d48609 Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
b3080
2024-06-04 09:17:17 +02:00
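The per-token attributes above can be pictured as bit flags attached to each vocabulary entry, consulted during detokenization. A simplified sketch (names and semantics illustrative only, not the llama.cpp implementation):

```python
from enum import Flag, auto

class TokenAttr(Flag):
    """Illustrative per-token attribute bits."""
    NONE   = 0
    LSTRIP = auto()  # strip whitespace on the left of this token's text
    RSTRIP = auto()  # strip whitespace on the right

def apply_attrs(text: str, attrs: TokenAttr) -> str:
    """Apply whitespace-stripping attributes to a token's surface text."""
    if attrs & TokenAttr.LSTRIP:
        text = text.lstrip()
    if attrs & TokenAttr.RSTRIP:
        text = text.rstrip()
    return text

print(apply_attrs("  <eot>  ", TokenAttr.LSTRIP | TokenAttr.RSTRIP))  # <eot>
```

Flag-style attributes generalize the old single-valued llama_token_type: a token can carry several independent behaviors at once.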
6d1616944d ggml : prevent builds with -ffinite-math-only (#7726)
This enforces a check that -fno-finite-math-only was set and that the compiler
is not operating in finite-math mode. During the rewrite of silu and softmax
for CPU in #7154, the result observed with >1 slot turned out to be
nondeterministic, as found by @JohannesGaessler.

@LostRuins narrowed the problem down to -ffinite-math-only, theorised to cause
SiLU to return NaN or some other garbage instead of flushing small values to 0.
@jart proposed a fix that @ggerganov then implemented in this commit.

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
b3079
2024-06-04 17:01:09 +10:00