Commit Graph

5569 Commits

SHA1 Message Date
d3a2eb592d disable on windows 2025-05-31 23:17:18 +02:00
7210ebe230 revert build changes 2025-05-31 23:16:56 +02:00
05f94a0e90 add arch to matrix 2025-05-31 22:54:37 +02:00
f9a27178e5 download in batches 2025-05-31 22:35:26 +02:00
de8ec1348b Merge branch 'master' into cisc/test-tokenizers-remote 2025-05-31 21:25:34 +02:00
8e1125a8db copy curl dll for tests 2025-05-31 21:22:37 +02:00
b3a89c3d9e docs : Note about necessity of having libcurl installed for standard build. (#13945)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2025-05-31 18:58:35 +02:00
e15898d1c7 server: allow unclosed thinking tags (#13931) b5556 2025-05-31 08:26:10 -07:00
803f8baf4f llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
b5555
2025-05-31 15:58:33 +03:00
3600cc2886 llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sequence SWA contexts
b5554
2025-05-31 15:57:44 +03:00
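A rough sketch of the sizing rule described in the entry above (3600cc2886); the helper name is hypothetical, not the upstream API. The idea: a sliding-window-attention cache only needs to hold the attention window itself plus the tokens of one in-flight micro-batch being appended before old cells can be pruned.

```cpp
#include <cstdint>

// Hypothetical helper illustrating the sizing rule from commit 3600cc2886:
// an SWA cache needs n_swa cells for the attention window plus n_ubatch
// cells for the micro-batch currently being decoded into it.
static uint32_t swa_cache_cells(uint32_t n_swa, uint32_t n_ubatch) {
    return n_swa + n_ubatch;
}
```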
c7e0a2054b webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00
3f55f781f1 llama : auto-batch preparation (#13845)
* llama : auto-batch

ggml-ci

* context : simplify if branching
b5552
2025-05-31 12:55:57 +03:00
51fa76f172 mtmd : drop _shared from libmtmd name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir
b5551
2025-05-31 10:14:29 +02:00
12d0188c0d kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
eb3949938e CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU from dGPU (#13856) (#13895)
* 1. Add "integrated" to ggml_cuda_device_info to distinguish integrated GPUs from discrete GPUs
2. Adjust ggml_backend_cuda_device_supports_buft for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* add a host_buft assert for the integrated CUDA device case in evaluate_and_capture_cuda_graph()

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* revert the suggested commit changes since they are not applicable on Jetson devices

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix CI bug: add a space

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 08:48:04 +02:00
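A minimal sketch of the idea in the entry above, assuming the field is named `integrated` as the commit message states; the struct and helper shown here are illustrative, not the exact upstream code. The CUDA runtime already exposes `cudaDeviceProp::integrated`, which is 1 for iGPUs that share physical memory with the host, and that sharing is presumably why ggml_backend_cuda_device_supports_buft needed adjusting.

```cpp
#include <cuda_runtime.h>

// Illustrative per-device entry; the real ggml_cuda_device_info in ggml-cuda
// holds many more fields than shown here.
struct device_info_entry {
    bool integrated; // true for an iGPU (e.g. Jetson), false for a dGPU
};

// Query the CUDA runtime: cudaDeviceProp::integrated is nonzero when the GPU
// shares physical memory with the host, which is what the new prop records.
static bool device_is_integrated(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false; // conservatively treat unknown devices as discrete
    }
    return prop.integrated != 0;
}
```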
e562eece7c CUDA: fix typo in FlashAttention code (#13926) b5548 2025-05-30 21:22:03 +02:00
b47ab7b8e9 sched : avoid changing cur_copy when a graph is already allocated (#13922) b5547 2025-05-30 18:56:19 +02:00
dd665cc9d4 parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
b5546
2025-05-30 19:38:07 +03:00
df0c0c7d02 cuda : prevent using split buffers with 3d/4d matrices (#13919) b5545 2025-05-30 16:37:18 +02:00
b49a8ff96b SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
b5544
2025-05-30 19:40:57 +05:30
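A simplified sketch of the vectorization described in the entry above (b49a8ff96b), assuming the rotated pair sits at adjacent indices as in `rope_norm`; the helper name is hypothetical, and suitable alignment of the pointers is assumed so the 2-wide access is valid.

```cpp
#include <sycl/sycl.hpp>

// Hypothetical device-side helper: instead of two scalar loads and two scalar
// stores per rotated pair, read and write the pair as one sycl::float2,
// halving the number of memory transactions.
static inline void rope_norm_pair(const float * src, float * dst, int i0,
                                  float cos_theta, float sin_theta) {
    // assumes src + i0 and dst + i0 are aligned for a 2-wide vector access
    const sycl::float2 x = *reinterpret_cast<const sycl::float2 *>(src + i0);
    sycl::float2 y;
    y[0] = x[0] * cos_theta - x[1] * sin_theta; // standard RoPE rotation
    y[1] = x[0] * sin_theta + x[1] * cos_theta;
    *reinterpret_cast<sycl::float2 *>(dst + i0) = y;
}
```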
53f925074d sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
b5543
2025-05-30 16:25:45 +03:00
db38704f01 convert : fix rwkv bos/eos token (#13844) 2025-05-30 14:50:43 +02:00
07e4351ce6 convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
b5541
2025-05-30 12:24:37 +02:00
291f2b6913 llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5540
2025-05-30 11:56:02 +02:00
4b4843adf3 windows builds: add build type to runtime output 2025-05-30 11:51:46 +02:00
2c90da4c7e llama : use llm_build_granite for minicpm (#13911) b5539 2025-05-30 10:31:48 +02:00
ec9e0301fe cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) b5538 2025-05-30 01:28:54 +02:00
e83ba3e460 llama : add support for jina-reranker-v2 (#13900) b5537 2025-05-29 21:42:31 +02:00
2b131621e6 gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561) gguf-v0.17.0 2025-05-29 15:36:05 +02:00
54a2c7a8cd arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves the q4_k_q8_k GEMM kernel using the arm64 i8mm instructions.

Tested on Neoverse-N2 with a Llama 3 8B model quantized as Q4_K_M.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
b5535
2025-05-29 14:39:20 +03:00
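A minimal sketch of the i8mm building block the PR above (54a2c7a8cd) relies on; the wrapper name is hypothetical, and this is not the upstream kernel, which additionally handles the q4_k/q8_k block layout, scales, and mins.

```cpp
#include <arm_neon.h>

// Requires a target with FEAT_I8MM (e.g. -march=armv8.6-a+i8mm).
// One SMMLA computes a 2x2 tile of int32 dot products from two 16-byte
// operands: 'a' holds a 2x8 int8 tile and 'b' holds a 2x8 int8 tile
// interpreted as the transposed right-hand side, so each instruction does
// roughly twice the multiply-accumulate work of a pairwise vdotq_s32.
static inline int32x4_t gemm_tile_2x2(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b); // hypothetical wrapper around the intrinsic
}
```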
21fcc21ad5 cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
b5534
2025-05-29 12:50:25 +02:00
dd8ba93416 ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5533
2025-05-29 12:18:43 +03:00
66c92061f5 tests : remove json.hpp from a test (#13880)
ggml-ci
b5532
2025-05-29 12:17:16 +03:00
5ca82fc1d7 convert : workaround for AutoConfig dummy labels (#13881) 2025-05-29 10:00:57 +02:00
6385b843a8 llama : add RobertaForSequenceClassification reranker support (#13875) b5530 2025-05-29 08:15:01 +02:00
1b8fb8152d ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
b5529
2025-05-29 09:01:33 +03:00
53ae30640e gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841) 2025-05-28 23:50:20 +02:00
763d06edb7 llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl

* add ref to the PR
b5527
2025-05-28 22:35:31 +02:00
10961339b2 mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
b5526
2025-05-28 22:35:22 +02:00
d98f2a35fc ci: disable LLAMA_CURL for Linux cross-builds (#13871) 2025-05-28 15:46:47 -03:00
e0e3aa231d llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5524
2025-05-28 19:01:58 +02:00
aa6dff05be convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
c962ae3382 server: fix removal of 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal model mode (#13853)
[fix]: effectively remove 'image_url'/'input_audio' from 'llama_params' in multimodal model mode
b5522
2025-05-28 16:33:54 +02:00
a3938fb53d convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion

* fix typo
2025-05-28 16:12:35 +02:00
f7873fc698 tests : change umlaut test (#11600) 2025-05-28 15:49:28 +02:00
a68247439b CUDA: fix FA tg at long context for CC >= 8.9 (#13852) b5519 2025-05-28 13:33:37 +02:00
d97b9ade51 correct working directory for all builds
...and change the cache file name as per suggestion.
2025-05-28 12:49:36 +02:00
0fe7183ae4 fix prototype for non-curl builds 2025-05-28 11:11:02 +02:00
ecbc92acd0 correct working directory 2025-05-28 10:16:34 +02:00
26b79b6cb3 convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision

* add comment
2025-05-28 10:05:54 +02:00