Commit Graph

5549 Commits

199a838422 threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
We discussed adding a LOW priority for GGML threads in the original threadpool PR;
it can be useful in some cases to avoid contention.

The latest Windows ARM64 releases started parking (offlining) CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that, we now disable Power Throttling for our threads at NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-30 17:15:38 -07:00
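For reference, a minimal sketch of how a thread can opt out of Power Throttling (EcoQoS) on Windows via the documented `SetThreadInformation` API; how this is wired into the ggml threadpool is not shown here, and the helper name is illustrative.

```cpp
// Sketch only: disable Power Throttling for the calling thread on Windows.
#if defined(_WIN32)
#include <windows.h>

static bool disable_power_throttling_for_current_thread() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED; // take control of throttling
    state.StateMask   = 0;                                       // 0 = do not throttle this thread

    return SetThreadInformation(GetCurrentThread(),
                                ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
}
#endif
```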
e562eece7c CUDA: fix typo in FlashAttention code (#13926) b5548 2025-05-30 21:22:03 +02:00
b47ab7b8e9 sched : avoid changing cur_copy when a graph is already allocated (#13922) b5547 2025-05-30 18:56:19 +02:00
dd665cc9d4 parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
b5546
2025-05-30 19:38:07 +03:00
df0c0c7d02 cuda : prevent using split buffers with 3d/4d matrices (#13919) b5545 2025-05-30 16:37:18 +02:00
b49a8ff96b SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
b5544
2025-05-30 19:40:57 +05:30
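To illustrate the paired load/store idea from the commit above, here is a small self-contained sketch, not the actual rope kernel: the buffer name and the scaling stand in for the real rotation math, and each work-item touches two adjacent floats through `sycl::vec<float, 2>`.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    const size_t n = 1024;                               // element count, assumed even
    float * x = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) x[i] = float(i);

    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> i) {
        auto * v2 = reinterpret_cast<sycl::vec<float, 2> *>(x);
        sycl::vec<float, 2> v = v2[i];   // one access loads two adjacent elements
        v = v * 2.0f;                    // stand-in for the actual rotation math
        v2[i] = v;                       // one access stores both lanes back
    }).wait();

    sycl::free(x, q);
    return 0;
}
```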
53f925074d sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
b5543
2025-05-30 16:25:45 +03:00
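The switch to `json_fwd.hpp` follows the usual nlohmann/json pattern: headers only forward-declare the JSON type, and only translation units that actually construct JSON values include the full `json.hpp`. A hedged sketch with illustrative file and function names:

```cpp
// settings.h -- illustrative header: the forward declaration is all that is needed here
#pragma once
#include <nlohmann/json_fwd.hpp>
#include <string>

std::string describe(const nlohmann::json & params);

// settings.cpp -- only this translation unit pays for the full json.hpp include
#include <nlohmann/json.hpp>

std::string describe(const nlohmann::json & params) {
    return params.dump(2); // pretty-print with 2-space indent
}
```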
db38704f01 convert : fix rwkv bos/eos token (#13844) 2025-05-30 14:50:43 +02:00
07e4351ce6 convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
b5541
2025-05-30 12:24:37 +02:00
291f2b6913 llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5540
2025-05-30 11:56:02 +02:00
2c90da4c7e llama : use llm_build_granite for minicpm (#13911) b5539 2025-05-30 10:31:48 +02:00
ec9e0301fe cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) b5538 2025-05-30 01:28:54 +02:00
e83ba3e460 llama : add support for jina-reranker-v2 (#13900) b5537 2025-05-29 21:42:31 +02:00
2b131621e6 gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561) gguf-v0.17.0 2025-05-29 15:36:05 +02:00
54a2c7a8cd arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on Neoverse-N2 with a Llama 3 8B Q4_K_M quantized model:
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch sizes 4 and above

Perplexity does not change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
b5535
2025-05-29 14:39:20 +03:00
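As background on what i8mm provides, a minimal sketch of the `smmla` building block via the `vmmlaq_s32` intrinsic; this is not the actual q4_k_q8_k kernel, and the array names and build flag are illustrative. One instruction accumulates a 2x2 int32 tile from two 2x8 int8 operands, the second treated as transposed.

```cpp
// Requires a compiler targeting i8mm, e.g. -march=armv8.6-a+i8mm
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    // a, b: 2x8 int8 matrices, row-major within each 16-byte vector
    int8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = int8_t(i); b[i] = int8_t(1); }

    int32x4_t acc = vdupq_n_s32(0);                    // 2x2 int32 accumulator tile
    acc = vmmlaq_s32(acc, vld1q_s8(a), vld1q_s8(b));   // acc += a(2x8) * b(2x8)^T

    int32_t c[4];
    vst1q_s32(c, acc);
    // c = { dot(a0,b0), dot(a0,b1), dot(a1,b0), dot(a1,b1) }
    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```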
21fcc21ad5 cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
b5534
2025-05-29 12:50:25 +02:00
dd8ba93416 ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5533
2025-05-29 12:18:43 +03:00
66c92061f5 tests : remove json.hpp from a test (#13880)
ggml-ci
b5532
2025-05-29 12:17:16 +03:00
5ca82fc1d7 convert : workaround for AutoConfig dummy labels (#13881) 2025-05-29 10:00:57 +02:00
6385b843a8 llama : add RobertaForSequenceClassification reranker support (#13875) b5530 2025-05-29 08:15:01 +02:00
1b8fb8152d ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
b5529
2025-05-29 09:01:33 +03:00
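For context, a minimal sketch of an SVE F32 vector loop — a simple `y += scale * x`, illustrative rather than the actual ggml kernels — using predication so the tail iteration needs no scalar fallback:

```cpp
// Build for an SVE target, e.g. -march=armv8-a+sve
#include <arm_sve.h>
#include <cstdio>

// y[i] += scale * x[i] for i in [0, n), one predicated SVE step per iteration
static void vec_mad_f32(float * y, const float * x, float scale, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t    pg = svwhilelt_b32(i, n);          // active lanes for this step
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, scale);         // vy += vx * scale
        svst1_f32(pg, y + i, vy);
    }
}

int main() {
    float x[100], y[100];
    for (int i = 0; i < 100; ++i) { x[i] = 1.0f; y[i] = float(i); }
    vec_mad_f32(y, x, 0.5f, 100);
    printf("%f %f\n", y[0], y[99]);
    return 0;
}
```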
53ae30640e gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841) 2025-05-28 23:50:20 +02:00
763d06edb7 llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl

* add ref to the PR
b5527
2025-05-28 22:35:31 +02:00
10961339b2 mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
b5526
2025-05-28 22:35:22 +02:00
d98f2a35fc ci: disable LLAMA_CURL for Linux cross-builds (#13871) 2025-05-28 15:46:47 -03:00
e0e3aa231d llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5524
2025-05-28 19:01:58 +02:00
aa6dff05be convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
c962ae3382 server: correctly remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal-model mode (#13853)
[fix]: correctly remove 'image_url'/'input_audio' from 'llama_params' in multimodal-model mode
b5522
2025-05-28 16:33:54 +02:00
a3938fb53d convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion

* fix typo
2025-05-28 16:12:35 +02:00
f7873fc698 tests : change umlaut test (#11600) 2025-05-28 15:49:28 +02:00
a68247439b CUDA: fix FA tg at long context for CC >= 8.9 (#13852) b5519 2025-05-28 13:33:37 +02:00
26b79b6cb3 convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision

* add comment
2025-05-28 10:05:54 +02:00
1e8659e65a CANN: Add SOC TYPE printing in cmake configuration (#13837) b5517 2025-05-28 11:54:20 +08:00
a3c30846e4 opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm (#13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
b5516
2025-05-27 12:56:08 -07:00
1701d4c54f opencl: mark mul_mat f32f32 as supporting non-contiguous tensors (#13790) b5515 2025-05-27 12:53:14 -07:00
bef8176387 vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than a CMake flag
b5514
2025-05-27 18:39:07 +02:00
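Moving the control from a CMake flag to an env var means the check happens at runtime; a hedged sketch of the gating pattern (the actual parsing in the Vulkan backend may differ):

```cpp
#include <cstdlib>

// Returns true when GGML_VULKAN_PERF is set to a non-empty, non-"0" value.
// Illustrative only; the real backend may interpret the variable differently.
static bool vk_perf_logging_enabled() {
    const char * v = std::getenv("GGML_VULKAN_PERF");
    return v != nullptr && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}
```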
34b7c0439e cmake : add llama-cparams.cpp to build (#13832) b5513 2025-05-27 19:08:44 +03:00
f3101a8cc6 SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
b5512
2025-05-27 20:52:59 +05:30
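The exact (erf-based) GELU this kernel computes is GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))); a scalar C++ reference for sanity-checking the SYCL kernel output:

```cpp
#include <cmath>
#include <cstdio>

// Exact GELU using the error function: 0.5 * x * (1 + erf(x / sqrt(2)))
static float gelu_erf(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); // 0.70710678 ~= 1/sqrt(2)
}

int main() {
    for (float x : {-2.0f, -1.0f, 0.0f, 1.0f, 2.0f}) {
        std::printf("gelu_erf(% .1f) = % .6f\n", x, gelu_erf(x));
    }
    return 0;
}
```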
1c49c70d07 sync : ggml 2025-05-27 18:05:33 +03:00
a8ea03d8ad ggml : add ggml_repeat_4d (#13824) b5510 2025-05-27 15:53:55 +02:00
05f6ac6283 ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
b5509
2025-05-27 16:21:36 +03:00
bc583e3c63 mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* more strict validate of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
b5508
2025-05-27 14:06:10 +02:00
72b090da2c docs: remove link for llama-cli function calling (#13810) 2025-05-27 08:52:40 -03:00
7fe03e7446 ggml-cpu: x86 feature detection is specific to x86 (#13811) b5506 2025-05-27 13:18:39 +02:00
952f3953c1 ggml : allow CUDA graphs when using pipeline parallelism (#13814) b5505 2025-05-27 13:05:18 +02:00
81713121ee kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions

ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

ggml-ci

* kv-cells : add comments

ggml-ci
b5504
2025-05-27 13:49:41 +03:00
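A hedged sketch of the bookkeeping idea with hypothetical names, not the actual llama.cpp structures: keep a running min/max position per sequence so shifts and pruning can be bounded without scanning every cell.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Hypothetical illustration of per-sequence position tracking; the real
// llama.cpp kv-cells bookkeeping differs in layout and naming.
struct seq_pos_range {
    int32_t pos_min = INT32_MAX;
    int32_t pos_max = -1;
};

struct seq_pos_tracker {
    std::map<int32_t, seq_pos_range> ranges; // seq_id -> observed position range

    void on_cell_used(int32_t seq_id, int32_t pos) {
        auto & r = ranges[seq_id];
        r.pos_min = std::min(r.pos_min, pos);
        r.pos_max = std::max(r.pos_max, pos);
    }

    // After a position shift of `delta`, the whole tracked range moves with it.
    void on_seq_shift(int32_t seq_id, int32_t delta) {
        auto it = ranges.find(seq_id);
        if (it != ranges.end()) {
            it->second.pos_min += delta;
            it->second.pos_max += delta;
        }
    }
};
```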
f9cd68398b sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
b5503
2025-05-27 12:07:52 +03:00
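An illustrative min-p filter (not llama.cpp's actual sampler code; names are made up) that honors the "at least one token" guarantee described above, even when min_keep == 0:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct token_prob { int id; float p; };

// Keep tokens with p >= min_p * max_p, but never drop everything:
// the top token always survives, regardless of min_keep.
static void min_p_filter(std::vector<token_prob> & cand, float min_p, size_t min_keep) {
    if (cand.empty()) return;
    std::sort(cand.begin(), cand.end(),
              [](const token_prob & a, const token_prob & b) { return a.p > b.p; });

    const float threshold = min_p * cand.front().p;
    size_t keep = 1;                                   // guarantee at least one token
    while (keep < cand.size() && (cand[keep].p >= threshold || keep < min_keep)) {
        ++keep;
    }
    cand.resize(keep);
}
```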
4f81b33e32 llama : validate seq id batch input (#13809)
* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci
b5502
2025-05-27 09:40:59 +03:00
cdf94a1802 server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5501
2025-05-26 22:34:27 +01:00
a26c4cc11e scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00