Commit Graph

3122 Commits

SHA1 Message Date
fe93cc96cc Merge remote-tracking branch 'origin/master' into bins 2024-06-08 12:04:52 +01:00
7a16ce7db2 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
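A minimal sketch of the slot-selection idea described above, with hypothetical names (the server's actual code works on token sequences, not raw strings):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common prefix of two strings.
static size_t common_prefix_len(const std::string & a, const std::string & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// Pick the slot whose cached prompt shares the longest prefix with the
// incoming prompt, so the largest part of the KV cache can be reused.
static int select_slot(const std::vector<std::string> & slot_caches, const std::string & prompt) {
    int    best     = -1;
    size_t best_len = 0;
    for (int i = 0; i < (int) slot_caches.size(); i++) {
        const size_t len = common_prefix_len(slot_caches[i], prompt);
        if (len > best_len) {
            best_len = len;
            best     = i;
        }
    }
    return best; // -1 -> no overlap; fall back to any free slot
}
```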
da799b4189 vulkan : reuse parent extra for views (#7806)
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-06-07 19:47:49 +02:00
c00fad71e5 gguf-split : change binary multi-byte units to decimal (#7803) 2024-06-07 15:56:01 +03:00
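For context, a tiny illustration of the difference between binary and decimal multi-byte units; the byte count is made up:

```cpp
#include <cstdio>

int main() {
    const long long n = 7000000000LL; // bytes in a hypothetical split file
    std::printf("%.2f GB  (decimal, 10^9)\n", n / 1e9);          // 7.00 GB
    std::printf("%.2f GiB (binary,  2^30)\n", n / 1073741824.0); // 6.52 GiB
    return 0;
}
```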
27615f5ab2 cmake : fix BUILD_SHARED_LIBS=ON build (#7784)
common depends on pthreads on Linux
2024-06-07 15:15:07 +03:00
0dba58269f Update server-llm.sh 2024-06-07 11:52:40 +01:00
7027b27d76 server: update cache_prompt documentation [no ci] (#7745) 2024-06-07 11:15:49 +02:00
af8f0169da Update .gitignore 2024-06-07 10:14:03 +01:00
7fbe6006c9 update straggling refs 2024-06-07 09:42:21 +01:00
99df4cc091 rm accidentally checked in bins 2024-06-07 09:40:09 +01:00
a5cabd7649 server : do not get prompt in infill mode (#7286)
* avoid getting the prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
d5c938cd77 [SYCL] fix softmax r2r result wrong issue (#7811) 2024-06-07 14:28:26 +08:00
c9ee7118d5 check for nans in imatrix and quantize (#7807)
* imatrix : detect nan/inf values

* quantize : check imatrix for nan/inf values
2024-06-07 09:01:29 +03:00
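A minimal sketch of this kind of validation, assuming a plain float buffer (the real checks live in the imatrix and quantize tools):

```cpp
#include <cmath>
#include <cstddef>

// Returns true if any value in the importance-matrix data is NaN or +/-Inf.
static bool has_bad_values(const float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!std::isfinite(data[i])) { // catches both NaN and infinities
            return true;
        }
    }
    return false;
}
```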
fbd83131f5 Merge remote-tracking branch 'origin/master' into bins 2024-06-07 00:51:31 +01:00
a0a7f2b031 Update build.yml 2024-06-07 00:38:05 +01:00
8695baebc0 update more names 2024-06-07 00:21:01 +01:00
ee459f40f6 server : fix --threads-http arg (#7801) 2024-06-06 19:19:59 +03:00
9a03341094 main/server: fix targets 2024-06-06 15:53:25 +01:00
8b7c734473 main: update refs -> llama
fix examples/main ref
2024-06-06 15:44:51 +01:00
f5f19a236f server: simplify nix package 2024-06-06 15:44:40 +01:00
f298cc63d2 server: update refs -> llama-server
gitignore llama-server
2024-06-06 15:44:40 +01:00
849842916d main/server: rename to llama / llama-server for consistency w/ homebrew 2024-06-06 15:28:27 +01:00
f83351f9a6 imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
ad675e1c67 Added support for . (any character) token in grammar engine. (#6467)
* Added support for . (any character) token in grammar engine.

* Add integration tests for any-character symbol.
2024-06-06 06:08:52 -07:00
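A hypothetical GBNF snippet (not taken from the PR) showing the new `.` token; the rule names are made up:

```cpp
#include <cstdio>

// Hypothetical grammar: `.` matches any single character, so `any*` accepts
// arbitrary content between fixed delimiters.
static const char * k_grammar = R"GBNF(
root ::= "<answer>" any* "</answer>"
any  ::= .
)GBNF";

int main() { std::puts(k_grammar); }
```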
a143c04375 README minor fixes (#7798) [no ci]
derievatives --> derivatives
2024-06-06 22:17:54 +10:00
55b2d0849d grammars: x{min,max} repetition operator (#6640)
* grammars: x{min,max} repetition operator + tweak +/*/? to avoid duplication of original over alternates

* grammars: handle `x{n}` and fix `x{n,n}`

* grammars: document new repetition operators

* grammars: uniform use of int for min & max

* grammars: refactor parser test

* grammar: parsing tests w/ natural pretty print of updated expectations

* grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all)

* grammars: improve test pretty print again

* grammars: pretty print rules and chars

* grammars: fix copy rule skipping

* grammars: disallow `a{,}` (not allowed in regexps)

* Update common/grammar-parser.cpp

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: fix copy rule skipping (again) & display of expectations

* grammars: more test cases

* grammars: update reps parsing to bring ? / * / + closer to before

* json: use new GBNF repetitions{m,n} syntax

* grammars: update performance gotchas w/ repetition advice

* Update examples/json_schema_to_grammar.py

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: comment on rule repetitions

* grammars: ensure unambiguous number alternatives

* grammar: nit typo switched error msgs

* grammar: nit numbering in comment

* json: update numeric rule to be unambiguous

* Apply suggestions from code review

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* json: fix integral-part

* grammar: add repetition tests

---------

Co-authored-by: Clint Herron <hanclinto@gmail.com>
2024-06-06 10:07:06 +01:00
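A hypothetical GBNF snippet (not from the PR) using the new repetition operators; per the commits above, `x{n}` means exactly n, `x{m,n}` between m and n, and `a{,}` is rejected:

```cpp
#include <cstdio>

// Hypothetical grammar: a loose IPv4-like pattern built with {n} and {m,n}.
static const char * k_grammar = R"GBNF(
root  ::= octet ("." octet){3}
octet ::= [0-9]{1,3}
)GBNF";

int main() { std::puts(k_grammar); }
```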
f5d7b268ec llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
2d08b7fbb4 docker : build only main and server in their images (#7782)
* add openmp lib to dockerfiles

* build only main and server in their docker images
2024-06-06 08:19:49 +03:00
d67caea0d6 docker : add openmp lib (#7780) 2024-06-06 08:17:21 +03:00
7672adeec7 Fix encoding in python scripts (#7733) 2024-06-06 03:07:24 +10:00
7d1a378b8f CUDA: refactor mmq, dmmv, mmvq (#7716)
* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits
b3092
2024-06-05 16:53:00 +02:00
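A hedged sketch of the type-traits idea named in the last bullet: per-type quantization constants (block size qk, quant ratio qr, ints per block qi) collected into one compile-time struct. The enum and names below are illustrative, not the actual CUDA code; the Q4_0 values mirror ggml's QK4_0/QR4_0/QI4_0 constants:

```cpp
// Illustrative stand-in for ggml_type; not the real enum.
enum example_type { EX_Q4_0 };

template <example_type T> struct quant_type_traits;

template <> struct quant_type_traits<EX_Q4_0> {
    static constexpr int qk = 32; // weights per quantization block
    static constexpr int qr = 2;  // quant values packed per byte
    static constexpr int qi = 4;  // 32-bit ints per block (qk / (4*qr))
};

static_assert(quant_type_traits<EX_Q4_0>::qk == 32, "sanity check");
```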
2b3389677a ggml : refactor rope norm/neox (#7634)
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
b3091
2024-06-05 11:29:20 +03:00
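A hedged sketch of the two RoPE layouts being unified: "norm" rotates adjacent pairs (x[2i], x[2i+1]) while "neox" rotates split halves (x[i], x[i+n/2]); theta follows the standard pos * base^(-2i/d) schedule. Illustrative code, not the refactored kernels:

```cpp
#include <cmath>

// Rotate one (x0, x1) pair by angle theta.
static void rope_pair(float & x0, float & x1, float theta) {
    const float c = std::cos(theta), s = std::sin(theta);
    const float t0 = x0*c - x1*s;
    const float t1 = x0*s + x1*c;
    x0 = t0; x1 = t1;
}

// "norm" pairs adjacent elements; "neox" pairs element i with i + n/2.
static void rope(float * x, int n, int pos, float freq_base, bool neox) {
    for (int i = 0; i < n/2; i++) {
        const float theta = pos * std::pow(freq_base, -2.0f*i/n);
        if (neox) rope_pair(x[i],   x[i + n/2], theta);
        else      rope_pair(x[2*i], x[2*i + 1], theta);
    }
}
```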
9973e81c5c readme : remove -ins (#7759)
-ins and --instruct were removed in https://github.com/ggerganov/llama.cpp/pull/7675

I have adjusted the README accordingly.
There was no trace of --chatml in the README.
2024-06-05 09:40:49 +03:00
c90dbe026b Fix per token attributes bits (#7749) b3089 2024-06-05 01:26:14 +02:00
b90dc566c1 Allow number of nodes in CUDA graph to change (#7738)
Previously the code failed to cope when the number of nodes changed in an
existing CUDA graph. This fixes the issue by removing an unnecessary
conditional.
b3088
2024-06-04 22:06:49 +02:00
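A hedged sketch of the general update-or-reinstantiate pattern for captured CUDA graphs (CUDA 11-era runtime API assumed; this is not the PR's diff):

```cpp
#include <cuda_runtime.h>

// Try an in-place update of the executable graph; if the topology changed
// (e.g. the node count), destroy and re-instantiate instead.
static void refresh_graph(cudaGraphExec_t & exec, cudaGraph_t graph) {
    cudaGraphNode_t           error_node = nullptr;
    cudaGraphExecUpdateResult update_result;
    cudaGraphExecUpdate(exec, graph, &error_node, &update_result);
    if (update_result != cudaGraphExecUpdateSuccess) {
        cudaGetLastError(); // clear the sticky error from the failed update
        cudaGraphExecDestroy(exec);
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    }
}
```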
1442677f92 common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remove --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
b3087
2024-06-04 21:23:39 +03:00
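A hedged sketch of how an example consumes the shared parser after this refactor; gpt_params_parse existed in common.h of this era, and gpt_params_print_usage is assumed from the "rework print_usage" bullet:

```cpp
#include "common.h"

int main(int argc, char ** argv) {
    gpt_params params;
    if (!gpt_params_parse(argc, argv, params)) {
        gpt_params_print_usage(argc, argv, params);
        return 1;
    }
    // ... use params.model, params.prompt, params.n_ctx, ...
    return 0;
}
```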
554c247caf ggml : remove OpenCL (#7735)
ggml-ci
b3086
2024-06-04 21:23:20 +03:00
0cd6bd3483 llama : remove beam search (#7736) b3085 2024-06-04 21:23:05 +03:00
5ca0944a15 readme : remove obsolete Zig instructions (#7471) b3084 2024-06-04 19:43:01 +03:00
adc9ff3841 llama-bench : allow using a different printer for stderr with -oe (#7722)
compare-commits.sh : hide stdout, use -oe to print markdown
b3083
2024-06-04 14:32:42 +02:00
987d743d6b Improve hipBLAS support in CMake (#7696)
* Improve hipBLAS support in CMake

This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.

* Set ROCM_PATH correctly
b3082
2024-06-04 14:09:15 +02:00
b226c1227b refine .gitignore (#7688)
This adds tags and the Android NDK to the ignore list
2024-06-04 21:21:26 +10:00
3b38d48609 Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
b3080
2024-06-04 09:17:17 +02:00
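A hedged sketch of the bit-flag idea behind per-token attributes; the names below are illustrative, not the enum actually added to llama.h:

```cpp
// Each token carries a set of attributes instead of a single type.
enum token_attr : int {
    TOKEN_ATTR_NONE   = 0,
    TOKEN_ATTR_LSTRIP = 1 << 0, // strip whitespace to the left of the token
    TOKEN_ATTR_RSTRIP = 1 << 1, // strip whitespace to the right of the token
};

inline bool has_attr(int attrs, token_attr a) {
    return (attrs & a) != 0;
}
```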
6d1616944d ggml : prevent builds with -ffinite-math-only (#7726)
This enforces a check that -fno-finite-math-only was set and that the
compiler is not building in finite-math mode. During the CPU rewrite of
silu and softmax in #7154, @JohannesGaessler found that results were
nondeterministic when more than one slot was in use.

@LostRuins narrowed the problem down to -ffinite-math-only, which was
theorised to make SiLU return NaN or other garbage instead of flushing
small values to 0. @jart proposed the fix that @ggerganov then implemented here.

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
b3079
2024-06-04 17:01:09 +10:00
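A hedged sketch of such a guard: GCC and Clang predefine __FINITE_MATH_ONLY__ to 1 under -ffinite-math-only, so an offending build can be rejected at compile time:

```cpp
// Refuse to build when finite-math-only semantics would break NaN handling.
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
#error "finite-math-only builds are not supported: NaN/Inf semantics are required"
#endif
```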
bde7cd3cd9 llama : offload to RPC in addition to other backends (#7640)
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
b3078
2024-06-03 20:03:26 +03:00
a5735e4426 ggml : use OpenMP as a thread pool (#7606)
* ggml: Added OpenMP for multi-threaded processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3077
2024-06-03 17:14:15 +02:00
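A hedged sketch of the dispatch pattern: a single OpenMP parallel region serves as the thread pool and each thread takes a strided slice by thread id. Illustrative function, not ggml's actual compute loop:

```cpp
#include <omp.h>

static void compute_forward(int n_threads, int n, float * data) {
    #pragma omp parallel num_threads(n_threads)
    {
        const int ith = omp_get_thread_num();
        const int nth = omp_get_num_threads(); // may be fewer than requested
        for (int i = ith; i < n; i += nth) {
            data[i] *= 2.0f; // stand-in for the real per-tensor work
        }
    }
}
```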
0b832d53ba make: fix debug options not being applied to NVCC (#7714) b3076 2024-06-03 16:28:58 +02:00
3d7ebf6312 Vulkan Mixture of Experts (MoE) support (#7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
b3075
2024-06-03 10:59:14 +02:00
a10cda58d3 cmake : add pkg-config spec file for llama.cpp (#7702) b3074 2024-06-03 11:06:24 +03:00
6f28a333c1 llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
b3073
2024-06-03 10:49:30 +03:00
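A hedged sketch of what tied embeddings mean for the final projection: when a model ships no separate lm_head (output) tensor, the token-embedding matrix is reused. Toy types below, not llama.cpp's loader:

```cpp
// Toy structures; the real loader works on ggml tensors.
struct toy_model {
    const float * tok_embd = nullptr; // token embeddings [n_vocab x n_embd]
    const float * output   = nullptr; // lm_head; null when weights are tied
};

// With tied embeddings, the output projection falls back to tok_embd.
static const float * output_weights(const toy_model & m) {
    return m.output ? m.output : m.tok_embd;
}
```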