Commit Graph

4489 Commits

Author SHA1 Message Date
d39e26741f examples : flush log upon ctrl+c (#9559) b3789 2024-09-20 11:46:56 +03:00
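The flush-on-Ctrl+C change above is about making sure buffered log output is not lost when a run is interrupted. A minimal sketch of that pattern (illustrative only, not the llama.cpp handler; the flag name is made up):

```cpp
#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdio>
#include <thread>

static std::atomic<bool> g_interrupted{false};

static void sigint_handler(int /*signum*/) {
    // only set a flag here: fflush is not async-signal-safe
    g_interrupted.store(true);
}

int main() {
    std::signal(SIGINT, sigint_handler);
    while (!g_interrupted.load()) {
        std::printf("working...\n");                          // buffered log output
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    std::fflush(stdout);                                      // flush pending log lines before exiting
    return 0;
}
```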
722ec1eb51 perplexity : do not escape input data by default (#9548) b3788 2024-09-20 09:38:10 +03:00
6026da52d6 server : clean-up completed tasks from waiting list (#9531) b3787 2024-09-19 12:44:53 +03:00
ggml-ci
eca0fab44e imatrix : disable prompt escape by default (#9543) b3786 2024-09-19 10:58:14 +03:00
64c6af3195 ggml : fix n_threads_cur initialization with one thread (#9538) b3785 2024-09-18 10:13:08 -07:00
* ggml : fix n_threads_cur initialization with one thread

* Update ggml/src/ggml.c

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
0d2f22e45c scripts : verify py deps at the start of compare (#9520) 2024-09-18 18:34:32 +03:00
6443ddd985 llama : use reserve/emplace_back in sampler_sample (#9534) b3783 2024-09-18 14:42:36 +03:00
This commit updates the llama_sampler_sample function to use reserve and
emplace_back for the vector of llama_token_data structs.

The motivation for this change is to avoid the creation of n_vocab
default-constructed llama_token_data structs which are then
immediately overwritten.
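The reserve/emplace_back change described above avoids paying for n_vocab default constructions. A small sketch of the pattern, using a simplified stand-in for llama_token_data rather than the real struct:

```cpp
#include <cstdint>
#include <vector>

struct token_data {   // simplified stand-in for llama_token_data
    int32_t id;
    float   logit;
    float   p;
};

int main() {
    const int32_t      n_vocab = 32000;
    std::vector<float> logits(n_vocab, 0.0f);   // dummy logits for illustration

    std::vector<token_data> cur;
    cur.reserve(n_vocab);                        // single allocation up front
    for (int32_t id = 0; id < n_vocab; ++id) {
        // construct each element in place instead of default-constructing
        // n_vocab elements and then overwriting them immediately
        cur.emplace_back(token_data{id, logits[id], 0.0f});
    }
    return 0;
}
```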
8a308354f6 server : match OAI structured output response (#9527) b3782 2024-09-18 09:50:34 +03:00
f799155ab8 server : fix OpenSSL build (remove obsolete LOG_INFO) (#9529) b3781 2024-09-18 09:28:20 +03:00
faf67b3de4 [SYCL] set context default value to avoid memory issue, update guide (#9476) 2024-09-18 08:30:31 +08:00
* set context default to avoid memory issue, update guide

* Update docs/backend/SYCL.md

Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
7be099fa81 llama-bench: correct argument parsing error message (#9524) b3779 2024-09-17 22:41:38 +02:00
8b836ae731 arg : add env variable for parallel (#9513) b3778 2024-09-17 16:35:38 +03:00
* add env variable for parallel

* Update README.md with env:  LLAMA_ARG_N_PARALLEL
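An environment-variable default for a CLI parameter, as in the LLAMA_ARG_N_PARALLEL change above, typically looks like the following sketch (the helper below is hypothetical and not the actual common/arg.cpp code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <stdexcept>
#include <string>

static int get_n_parallel_default(int fallback) {
    const char * env = std::getenv("LLAMA_ARG_N_PARALLEL");
    if (env == nullptr || *env == '\0') {
        return fallback;               // variable not set: keep the built-in default
    }
    try {
        return std::stoi(env);         // variable set: it supplies the default
    } catch (const std::exception &) {
        std::fprintf(stderr, "invalid LLAMA_ARG_N_PARALLEL value: %s\n", env);
        return fallback;
    }
}

int main() {
    const int n_parallel = get_n_parallel_default(/*fallback =*/ 1);
    std::printf("n_parallel = %d\n", n_parallel);
    // an explicit command-line flag would still override this default
    return 0;
}
```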
8344ef58f8 llama : fix n_vocab init for 'no_vocab' case (#9511) b3777 2024-09-17 13:18:22 +03:00
* llama: fixed n_vocab for `no_vocab` models

* llama: updated error output for `llama_decode_internal` and `llama_encode_internal`

* llama: log warning if there's no vocab_size in metadata

* llama: correct vocab size for logging

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
0226613853 threadpool : skip polling for unused threads (#9461) 2024-09-17 11:19:46 +03:00
* threadpool: skip polling for unused threads

Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).

n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written
from one thread and read from other threads (not a race condition).

* threadpool: further simplify and improve ggml_barrier

Avoid using strict memory order while polling, yet make sure that all threads go through a
full memory barrier (memory fence) on ggml_barrier entrance and exit.

* threads: add simple barrier test

This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.

* threadpool: improve thread sync for new-graphs

Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient, once the new graph is detected we do full fence using read-modify-write
with strict memory order.

* threadpool: improve abort handling

Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.

Instead introduce atomic threadpool->abort flag used for this. This is consistent with
how we handle threadpool->stop or pause.

While at it add an explicit atomic_load for n_threads_cur for consistency.

* test-barrier: release threadpool before releasing the context

Fixes a use-after-free detected by the GCC thread sanitizer on x86-64; for some reason the
LLVM sanitizer does not detect this issue.
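The two ideas in the threadpool commit above (threads with ith >= n_threads_cur skip the polling loop, and an atomic abort flag replaces a non-atomic error code for leaving the compute loop) can be illustrated with this simplified sketch; it is not the ggml threadpool, and all names are made up:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct threadpool_sketch {
    std::atomic<int>  n_threads_cur{1};   // how many threads the current graph uses
    std::atomic<bool> new_work{false};    // signals that new work is ready
    std::atomic<bool> abort{false};       // signals workers to exit the compute loop
};

static void worker(threadpool_sketch & tp, int ith) {
    while (!tp.abort.load(std::memory_order_relaxed)) {
        // unused threads do not spin on the work flag at all
        if (ith >= tp.n_threads_cur.load(std::memory_order_relaxed)) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            continue;
        }
        if (tp.new_work.exchange(false, std::memory_order_acq_rel)) {
            std::printf("thread %d: picked up new work\n", ith);
        }
    }
}

int main() {
    threadpool_sketch tp;
    std::vector<std::thread> threads;
    for (int ith = 0; ith < 4; ++ith) {
        threads.emplace_back(worker, std::ref(tp), ith);
    }

    tp.n_threads_cur.store(2);            // only two threads are needed for this "graph"
    tp.new_work.store(true);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));

    tp.abort.store(true);                 // consistent with how stop/pause would be handled
    for (auto & t : threads) t.join();
    return 0;
}
```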
503147a9f9 unicode : add <algorithm> (#9508) b3775 2024-09-17 09:51:15 +03:00
0d2ec43833 llama : support IBM Granite architecture (#9412) b3774 2024-09-17 09:44:58 +03:00
* feat(gguf-py): Add Granite model and params to gguf-py

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add registration and param setup for Granite

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Add config parsing for Granite multiplier params

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): First pass at full port of granite deviations from llama

Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Determine granite language 3b instruct by vocab size

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel

The defaults in LlamaModel are needed for Granite as well

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Switch Granite param names to use _scale for consistency

Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale

The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
37f3a3810e llama : add llama_n_head() (#9512) 2024-09-17 09:23:30 +03:00
23e0d70bac ggml : move common CPU backend impl to new header (#9509) b3772 2024-09-16 16:22:07 +02:00
acb2c32c33 llama : rename n_embed to n_embd in rwkv6_time_mix (#9504) b3771 2024-09-16 14:07:13 +03:00
This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix.

The motivation for this change is consistency with the other rwkv6
functions like build_rwkv6 (and other parts of the code base).
a6a3a5c531 ggml : link MATH_LIBRARY not by its full path (#9339) b3770 2024-09-16 14:06:50 +03:00
d54c21df7e convert : identify missing model files (#9397) b3769 2024-09-16 10:30:22 +03:00
19514d632e cmake : do not hide GGML options + rename option (#9465) 2024-09-16 10:27:50 +03:00
* cmake : do not hide GGML options

ggml-ci

* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS

for consistency

ggml-ci
Eve 5c3d0f1824 ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422) b3767 2024-09-16 09:48:24 +03:00
* squashed

re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049

have ggml_vec_dot_q4_0 do two blocks per loop for AVX

tried an F16C ggml_vec_dot_iq4_nl, but it's not really faster; as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue

* shuffle

* remove the F16C iq4_nl variant, as I can't make it faster than before
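The "two blocks per loop" optimization mentioned above is a form of loop unrolling with independent accumulators. A plain scalar illustration of the idea, not the actual AVX ggml_vec_dot_q4_0 kernel:

```cpp
#include <cstdio>
#include <vector>

constexpr int QK = 32;   // elements per block (Q4_0 uses blocks of 32)

static float dot_two_blocks_per_iter(const float * x, const float * y, int n) {
    const int nb = n / QK;          // number of blocks
    float sum0 = 0.0f, sum1 = 0.0f; // independent accumulators

    int ib = 0;
    for (; ib + 1 < nb; ib += 2) {  // two blocks per loop iteration
        for (int k = 0; k < QK; ++k) {
            sum0 += x[(ib + 0)*QK + k] * y[(ib + 0)*QK + k];
            sum1 += x[(ib + 1)*QK + k] * y[(ib + 1)*QK + k];
        }
    }
    for (; ib < nb; ++ib) {         // tail for an odd block count
        for (int k = 0; k < QK; ++k) {
            sum0 += x[ib*QK + k] * y[ib*QK + k];
        }
    }
    return sum0 + sum1;
}

int main() {
    std::vector<float> x(QK * 5, 1.0f), y(QK * 5, 2.0f);
    std::printf("dot = %f\n", dot_two_blocks_per_iter(x.data(), y.data(), QK * 5));
    return 0;
}
```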
0aadac10c7 llama : support OLMoE (#9462) b3766 2024-09-16 09:47:37 +03:00
95ca85168b llama : support MiniCPM3 (#9322) b3765 2024-09-16 09:45:20 +03:00
Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>
441b72b91f main : option to disable context shift (#9484) b3764 2024-09-16 09:20:01 +03:00
* added cli arg to disable context shift

* reverted precommit

* updated README.md for main

* white space

* allow disabling context shift in the server

* Update common/arg.cpp

no-context-shift only works for main example

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added server example to --no-context-shift args

* removed server changes

* white space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
c4965a64f7 metal : handle zero-sized allocs (#9466) b3763 2024-09-16 09:05:56 +03:00
90a2fff0e7 flake.lock: Update (#9488) 2024-09-15 19:14:23 -07:00
6262d13e0b common : reimplement logging (#9418) b3761 2024-09-15 20:46:12 +03:00
https://github.com/ggerganov/llama.cpp/pull/9418
e6deac31f7 gguf-split : add basic checks (#9499) b3760 2024-09-15 19:02:27 +02:00
* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed
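The first gguf-split check above ("do not overwrite existing files when merging") boils down to testing for the output file before writing. A minimal sketch, assuming a hypothetical write_output helper and output name:

```cpp
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <string>

static bool write_output(const std::string & path, const std::string & data) {
    if (std::filesystem::exists(path)) {
        // refuse to clobber an existing file instead of silently overwriting it
        std::fprintf(stderr, "error: output file '%s' already exists\n", path.c_str());
        return false;
    }
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        std::fprintf(stderr, "error: cannot open '%s' for writing\n", path.c_str());
        return false;
    }
    out.write(data.data(), (std::streamsize) data.size());
    return true;
}

int main() {
    write_output("merged.gguf", "dummy data");   // hypothetical output name
    return 0;
}
```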
6988da94a2 cmake : correct order of sycl flags (#9497) b3759 2024-09-15 19:55:52 +03:00
3c7989fd29 py : add "LLaMAForCausalLM" conversion support (#9485) b3758 2024-09-15 10:48:25 +03:00
Co-authored-by: Csaba Kecskemeti <csabakecskemeti@Csabas-Mac-Pro.local>
d6b37c881f readme : update tools list (#9475) b3757 2024-09-15 10:36:53 +03:00
* Added link to proprietary wrapper for Unity3d into README.md

The wrapper has a prebuilt library and was tested on the iOS, Android, WebGL, PC, and Mac platforms; it has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/).

* Update README.md

Fixes upon review
7596487beb cmake : try to fix sycl+intel build (#9487) b3756 2024-09-15 10:06:38 +03:00
822b6322de ggml : ggml_type_name return "NONE" for invalid values (#9458) b3755 2024-09-14 12:54:37 +03:00
When running on Windows, the quantization utility attempts to print the types that are not set, which leads to a crash.
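The fix above amounts to a bounds-checked table lookup that falls back to "NONE". A toy sketch of that behaviour (the enum and name table below are illustrative, not ggml's real ones):

```cpp
#include <cstdio>

enum toy_type { TOY_TYPE_F32 = 0, TOY_TYPE_F16 = 1, TOY_TYPE_Q4_0 = 2, TOY_TYPE_COUNT };

static const char * const toy_type_names[TOY_TYPE_COUNT] = { "f32", "f16", "q4_0" };

static const char * toy_type_name(int type) {
    if (type < 0 || type >= TOY_TYPE_COUNT) {
        return "NONE";                     // invalid/unset value: never index into the table
    }
    return toy_type_names[type];
}

int main() {
    std::printf("%s\n", toy_type_name(TOY_TYPE_Q4_0)); // q4_0
    std::printf("%s\n", toy_type_name(-1));            // NONE
    std::printf("%s\n", toy_type_name(999));           // NONE
    return 0;
}
```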
dcdcee3a74 server: add data: [DONE] to /chat/completions stream response (#9459) b3754 2024-09-14 11:36:44 +02:00
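The [DONE] change above follows the OpenAI streaming convention: chunks go out as server-sent events and the stream ends with an explicit data: [DONE] line. A tiny sketch, with stdout standing in for the HTTP response stream:

```cpp
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> chunks = {
        R"({"choices":[{"delta":{"content":"Hel"}}]})",
        R"({"choices":[{"delta":{"content":"lo"}}]})",
    };
    for (const auto & chunk : chunks) {
        std::printf("data: %s\n\n", chunk.c_str());   // one SSE event per streamed chunk
    }
    std::printf("data: [DONE]\n\n");                  // explicit end-of-stream marker
    return 0;
}
```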
1f4111e540 cmake : use list(APPEND ...) instead of set() + dedup linker (#9463) b3753 2024-09-14 10:55:05 +03:00
* cmake : use list(APPEND ...) instead of set() + dedup linker

ggml-ci

* cmake : try fix sycl

* cmake : try to fix sycl 2

* cmake : fix sycl build (#9469)

* try fix sycl build

* use CMAKE_CXX_FLAGS as a string variable

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* one more CMAKE_CXX_FLAGS fix (#9471)

---------

Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
befaf1197f llama : make cell_id const in inp_s_mask block (#9470) b3752 2024-09-14 10:50:12 +03:00
This commit makes the cell_id variable const in the inp_s_mask block.

The motivation for this change is consistency with the code in the
inp_s_copy block.
feff4aa846 server : add loading html page while model is loading (#9468) b3751 2024-09-13 14:23:11 +02:00
* Adding loading page for '/' server requests

* set content when model is loading

* removed loading html file

* updated cmakelist

* updated makefile

* cleaned up whitespace

* cleanup for PR removed error

* updated server test to handle 503 HTML

* updated server test to handle 503 HTML

* catch 503 before parsing JSON

* revert test

* account for both api and web browser requests

* precommit corrections

* eol fix

* revert changes to pre-commit

* removed print statement

* made loading message more descriptive

* also support .html files

---------

Co-authored-by: VJHack <flymyplane21@gmail.com>
Co-authored-by: Vinesh Janarthanan <36610342+VJHack@users.noreply.github.com>
0abc6a2c25 llama : llama_perf + option to disable timings during decode (#9355) b3750 2024-09-13 09:53:38 +03:00
* llama : llama_perf + option to disable timings during decode

ggml-ci

* common : add llama_arg

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* perf : separate functions in the API

ggml-ci

* perf : safer pointer handling + naming update

ggml-ci

* minor : better local var name

* perf : abort on invalid sampler pointer

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
bd35cb0ae3 feat: remove a sampler from a chain (#9445) b3749 2024-09-13 03:54:49 +02:00
* feat: remove a sampler from a chain

* fix: return removed sampler

* fix: safer casting
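Removing a sampler from an owning chain and returning it to the caller (per "fix: return removed sampler" above) can be sketched as follows; the types here are illustrative and not the actual llama_sampler API:

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct sampler { std::string name; };

struct sampler_chain {
    std::vector<std::unique_ptr<sampler>> samplers;

    // Detach the i-th sampler; the caller now owns it. Returns nullptr if the
    // index is out of range instead of aborting.
    std::unique_ptr<sampler> remove(size_t i) {
        if (i >= samplers.size()) {
            return nullptr;
        }
        std::unique_ptr<sampler> removed = std::move(samplers[i]);
        samplers.erase(samplers.begin() + static_cast<std::ptrdiff_t>(i));
        return removed;
    }
};

int main() {
    sampler_chain chain;
    chain.samplers.push_back(std::make_unique<sampler>(sampler{"top_k"}));
    chain.samplers.push_back(std::make_unique<sampler>(sampler{"temp"}));

    auto removed = chain.remove(0);
    std::printf("removed: %s, remaining: %zu\n",
                removed ? removed->name.c_str() : "(none)", chain.samplers.size());
    return 0;
}
```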
78203641fe server : Add option to return token pieces in /tokenize endpoint (#9108) b3748 2024-09-12 22:30:11 +02:00
* server : added with_pieces functionality to /tokenize endpoint

* server : Add tokenize with pieces tests to server.feature

* Handle the case where the tokenizer splits along UTF-8 continuation bytes

* Add example of token splitting

* Remove trailing ws

* Fix trailing ws

* Maybe fix CI

* Maybe this fixes Windows CI?

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
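The UTF-8 continuation-byte issue noted above arises because a token's byte piece can end in the middle of a multi-byte sequence; such pieces cannot be returned as a plain string and need a raw-bytes fallback. A simplified validity check (no overlong or surrogate handling):

```cpp
#include <cstdio>
#include <string>

static bool is_valid_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = (unsigned char) s[i];
        size_t len = 0;
        if      (c < 0x80)           len = 1;  // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) len = 3;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) len = 4;  // 4-byte sequence
        else return false;                     // stray continuation byte
        if (i + len > s.size()) return false;  // sequence truncated at the end
        for (size_t k = 1; k < len; ++k) {
            if (((unsigned char) s[i + k] & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

int main() {
    const std::string whole = "\xC3\xB6";          // "ö" encoded as two UTF-8 bytes
    const std::string half  = whole.substr(0, 1);  // piece split mid-sequence

    std::printf("whole piece valid: %d\n", is_valid_utf8(whole)); // 1
    std::printf("half piece valid:  %d\n", is_valid_utf8(half));  // 0 -> return raw bytes
    return 0;
}
```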
e6b7801bd1 cann: Add host buffer type for Ascend NPU (#9406) b3747 2024-09-12 19:46:43 +08:00
* feat: Add host buffer type for Ascend NPU (CANN backend)

* fix some checking errors

* Add a few comments
e665744317 llava : fix the script error in MobileVLM README (#9054) b3746 2024-09-12 14:34:22 +03:00
Signed-off-by: Erhu Feng <2748250768@qq.com>
d4c3c10fad lora : raise error if lm_head is ignored (#9103) 2024-09-12 14:33:57 +03:00
* lora : raise error if lm_head is ignored

* fix style

* clarify comment
2a825116b6 cmake : fix for builds without GGML_CDEF_PUBLIC (#9338) b3744 2024-09-12 14:30:01 +03:00
* `GGML_TARGET_DEFINES-NOTFOUND` fix for builds without `GGML_CDEF_PUBLIC`

* Update CMakeLists.txt, spaces fix
4dc4f5f14a ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329) b3743 2024-09-12 14:28:43 +03:00
c837981bba py : add Phi-1.5/Phi-2 tokenizer (#9361) 2024-09-12 14:28:20 +03:00
* add phi2 tokenizer

* add phi name to convert_hf_to_gguf_update.py

* make tokenizer_pre consistent; llama.cpp work
3c26a1644d ci : bump actions/checkout to v4 (#9377) 2024-09-12 14:27:45 +03:00
ff76e18516 cmake : fixed the order of linking libraries for llama-quantize (#9450) b3740 2024-09-12 14:27:14 +03:00