199a838422
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
...
We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful in some cases to avoid contention.
Recent Windows ARM64 releases started parking (offlining) CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that, we now disable Power Throttling for our threads at NORMAL
and higher priorities.
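For reference, opting a thread out of Power Throttling on Windows looks roughly
like the sketch below; it uses the documented Win32 API but is not the actual
ggml threadpool code, and the LOW-priority mapping in the trailing comment is
an assumption.
```
// Minimal sketch (not the actual ggml code): opt the current thread out of
// Windows Power Throttling (EcoQoS), available since Windows 10 1709.
#include <windows.h>

static bool thread_disable_power_throttling(void) {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED; // take manual control
    state.StateMask   = 0;                                       // 0 = throttling disabled

    return SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
}

// GGML_SCHED_PRIO_LOW would plausibly map to something like:
//   SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
```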
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-30 17:15:38 -07:00
e562eece7c
CUDA: fix typo in FlashAttention code ( #13926 )
b5548
2025-05-30 21:22:03 +02:00
b47ab7b8e9
sched : avoid changing cur_copy when a graph is already allocated ( #13922 )
b5547
2025-05-30 18:56:19 +02:00
dd665cc9d4
parallel : increase the variability of the prompt lengths ( #13927 )
...
ggml-ci
b5546
2025-05-30 19:38:07 +03:00
df0c0c7d02
cuda : prevent using split buffers with 3d/4d matrices ( #13919 )
b5545
2025-05-30 16:37:18 +02:00
b49a8ff96b
SYCL: Add mrope kernel ( #13755 )
...
* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization
Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution (sketched below).
* Use ceil_div
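A hypothetical SYCL kernel illustrating the `sycl::vec` pairing above (a sketch
only: `rotate_pairs` and its parameters are made-up names, `x` is assumed to be
USM memory, and the real rope kernels compute per-position angles):
```
// Illustrative only -- not the PR's kernel. Rotates adjacent (even, odd) pairs
// by a fixed angle, loading both elements of each pair through one sycl::vec.
#include <sycl/sycl.hpp>

void rotate_pairs(sycl::queue & q, float * x, size_t n, float cos_t, float sin_t) {
    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> it) {
        const size_t i = 2 * it[0];
        sycl::float2 v(x[i], x[i + 1]); // one paired load instead of two scalar loads
        x[i]     = v.x() * cos_t - v.y() * sin_t;
        x[i + 1] = v.x() * sin_t + v.y() * cos_t;
    }).wait();
}
```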
b5544
2025-05-30 19:40:57 +05:30
53f925074d
sync : vendor ( #13901 )
...
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
b5543
2025-05-30 16:25:45 +03:00
db38704f01
convert : fix rwkv bos/eos token ( #13844 )
2025-05-30 14:50:43 +02:00
07e4351ce6
convert : allow partial update to the chkhsh pre-tokenizer list ( #13847 )
...
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
b5541
2025-05-30 12:24:37 +02:00
291f2b6913
llama : add support for DistilBert ( #13907 )
...
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5540
2025-05-30 11:56:02 +02:00
2c90da4c7e
llama : use llm_build_granite for minicpm ( #13911 )
b5539
2025-05-30 10:31:48 +02:00
ec9e0301fe
cmake: Guard GGML_CPU_ALL_VARIANTS by architecture ( #13890 )
b5538
2025-05-30 01:28:54 +02:00
e83ba3e460
llama : add support for jina-reranker-v2 ( #13900 )
b5537
2025-05-29 21:42:31 +02:00
2b131621e6
gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method ( #13561 )
gguf-v0.17.0
2025-05-29 15:36:05 +02:00
54a2c7a8cd
arm64: optimize q4_k_q8_k kernel with i8mm ( #13886 )
...
This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instructions.
Tested on neoverse-n2 with a llama3 8b q4_k_m quantized model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR. (A sketch of the core i8mm pattern follows the benchmark below.)
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 110.12 | 147.83 | 24.36 | 24.28 |
| 128 | 128 | 2 | 121.16 | 172.42 | 46.36 | 47.93 |
| 128 | 128 | 4 | 120.15 | 169.75 | 74.68 | 84.00 |
| 128 | 128 | 8 | 130.97 | 196.81 | 91.04 | 114.74 |
| 128 | 128 | 16 | 131.01 | 196.88 | 101.43 | 135.79 |
| 128 | 128 | 32 | 130.85 | 196.51 | 106.97 | 147.29 |
---------------------------------------------------------------------
```
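The core of the i8mm approach is the SMMLA instruction, exposed as the
`vmmlaq_s32` intrinsic. A minimal sketch of the accumulation pattern (not the
actual ggml kernel, which also handles q4_k dequantization and block scales):
```
// Illustrative i8mm inner loop (not the actual ggml kernel).
// vmmlaq_s32 (SMMLA) multiplies a 2x8 int8 tile by an 8x2 int8 tile and
// accumulates into a 2x2 int32 tile -- one instruction per 32 multiply-adds.
#include <arm_neon.h>   // build with e.g. -march=armv8.2-a+i8mm

int32x4_t int8_tile_dot(const int8_t * a, const int8_t * b, int k) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < 2 * k; i += 16) {
        int8x16_t va = vld1q_s8(a + i); // two rows of 8 int8 values
        int8x16_t vb = vld1q_s8(b + i); // two columns of 8 values (rows of B^T)
        acc = vmmlaq_s32(acc, va, vb);  // 2x2 int32 partial sums
    }
    return acc;
}
```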
b5535
2025-05-29 14:39:20 +03:00
21fcc21ad5
cmake: Factor out CPU architecture detection ( #13883 )
...
* cmake: Define function for querying architecture
The tests and results match exactly those of ggml/src/CMakeLists.txt
* Switch arch detection over to new function
b5534
2025-05-29 12:50:25 +02:00
dd8ba93416
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm ( #13882 )
...
* F32-Mamba-Seq_Scan-SVE
* Fix formatting
* ggml : missing space
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5533
2025-05-29 12:18:43 +03:00
66c92061f5
tests : remove json.hpp from a test ( #13880 )
...
ggml-ci
b5532
2025-05-29 12:17:16 +03:00
5ca82fc1d7
convert : workaround for AutoConfig dummy labels ( #13881 )
2025-05-29 10:00:57 +02:00
6385b843a8
llama : add RobertaForSequenceClassification reranker support ( #13875 )
b5530
2025-05-29 08:15:01 +02:00
1b8fb8152d
ggml: aarch64: Implement SVE F32 kernels for vector functions ( #13843 )
...
* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
b5529
2025-05-29 09:01:33 +03:00
53ae30640e
gguf-py : fix SafetensorRemote return on undefined size (< 0) ( #13841 )
2025-05-28 23:50:20 +02:00
763d06edb7
llama : fix KV shift for qwen2vl ( #13870 )
...
* llama : fix KV shift for qwen2vl
* add ref to the PR
b5527
2025-05-28 22:35:31 +02:00
10961339b2
mtmd : move helpers to dedicated library ( ⚠️ breaking change) ( #13866 )
...
* mtmd : move helpers to dedicated library
* fix server build
* rm leftover cmakelist code
b5526
2025-05-28 22:35:22 +02:00
d98f2a35fc
ci: disable LLAMA_CURL for Linux cross-builds ( #13871 )
2025-05-28 15:46:47 -03:00
e0e3aa231d
llama : add support for BertForSequenceClassification reranker ( #13858 )
...
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5524
2025-05-28 19:01:58 +02:00
aa6dff05be
convert: small addition to support LlamaModel ( #13838 )
...
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
c962ae3382
server: effectively remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal-model mode ( #13853 )
...
[fix]: effectively remove 'image_url'/'input_audio' from 'llama_params' in multimodal-model mode
b5522
2025-05-28 16:33:54 +02:00
a3938fb53d
convert : fix qwen omni conversion ( #13859 )
...
* convert : fix qwen omni conversion
* fix typo
2025-05-28 16:12:35 +02:00
f7873fc698
tests : change umlaut test ( #11600 )
2025-05-28 15:49:28 +02:00
a68247439b
CUDA: fix FA tg at long context for CC >= 8.9 ( #13852 )
b5519
2025-05-28 13:33:37 +02:00
26b79b6cb3
convert : fix tensor naming conflict for llama 4 vision ( #13836 )
...
* convert : fix tensor naming conflict for llama 4 vision
* add comment
2025-05-28 10:05:54 +02:00
1e8659e65a
CANN: Add SOC TYPE printing in cmake configuration ( #13837 )
b5517
2025-05-28 11:54:20 +08:00
a3c30846e4
opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm ( #13787 )
...
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
b5516
2025-05-27 12:56:08 -07:00
1701d4c54f
opencl: mark mul_mat
f32f32
as supporting non-contiguous tensors ( #13790 )
b5515
2025-05-27 12:53:14 -07:00
bef8176387
vulkan: use timestamp queries for GGML_VULKAN_PERF ( #13817 )
...
Also change it to be controlled by an env var rather than a cmake flag
b5514
2025-05-27 18:39:07 +02:00
34b7c0439e
cmake : add llama-cparams.cpp to build ( #13832 )
b5513
2025-05-27 19:08:44 +03:00
f3101a8cc6
SYCL: add gelu_erf kernel ( #13749 )
...
* SYCL: add gelu_erf kernel
* refactor code
Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
* Use scope_op_debug_print
---------
Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
b5512
2025-05-27 20:52:59 +05:30
1c49c70d07
sync : ggml
2025-05-27 18:05:33 +03:00
a8ea03d8ad
ggml : add ggml_repeat_4d ( #13824 )
b5510
2025-05-27 15:53:55 +02:00
05f6ac6283
ggml : riscv: add xtheadvector support ( #13720 )
...
* ggml : riscv: add xtheadvector support
* ggml : clean up some macro usage
b5509
2025-05-27 16:21:36 +03:00
bc583e3c63
mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) ( #13784 )
...
* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding
* first working version
* fix style
* more strict validate of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
b5508
2025-05-27 14:06:10 +02:00
72b090da2c
docs: remove link for llama-cli function calling ( #13810 )
2025-05-27 08:52:40 -03:00
7fe03e7446
ggml-cpu: x86 feature detection is specific to x86 ( #13811 )
b5506
2025-05-27 13:18:39 +02:00
952f3953c1
ggml : allow CUDA graphs when using pipeline parallelism ( #13814 )
b5505
2025-05-27 13:05:18 +02:00
81713121ee
kv-cells : track min/max used cells and per-sequence positions ( #13808 )
...
* kv-cells : track min/max used cells and per-sequence positions (illustrated below)
ggml-ci
* kv-cells : fix pos-modification updates for seq_pos
ggml-ci
* kv-cells : add comments
ggml-ci
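Illustrative sketch of the bookkeeping this enables (made-up types, not the
actual llama.cpp kv-cells implementation):
```
// Track the [min, max] of used cell indices and of positions per sequence,
// updated whenever a cell is assigned. Illustrative only.
#include <algorithm>
#include <cstdint>
#include <map>

struct pos_range {
    int64_t pos_min = INT64_MAX;
    int64_t pos_max = INT64_MIN;
    void add(int64_t pos) {
        pos_min = std::min(pos_min, pos);
        pos_max = std::max(pos_max, pos);
    }
};

struct kv_cells_stats {
    int32_t used_min = INT32_MAX;         // lowest used cell index
    int32_t used_max = -1;                // highest used cell index
    std::map<int32_t, pos_range> seq_pos; // seq_id -> tracked position range

    void on_assign(int32_t cell, int32_t seq_id, int64_t pos) {
        used_min = std::min(used_min, cell);
        used_max = std::max(used_max, cell);
        seq_pos[seq_id].add(pos);
    }
};
```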
b5504
2025-05-27 13:49:41 +03:00
f9cd68398b
sampling : make sure samplers return at least 1 token ( #13822 )
...
* sampling : min-p should always return at least one token (sketched below)
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
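A minimal sketch of such a guard for min-p (hypothetical names; the actual
llama.cpp sampler code differs):
```
// Illustrative min-p filter that never removes every candidate.
#include <algorithm>
#include <vector>

struct token_prob { int id; float p; };

void min_p_filter(std::vector<token_prob> & cand, float min_p, size_t min_keep) {
    if (cand.empty()) return;
    std::sort(cand.begin(), cand.end(),
              [](const token_prob & a, const token_prob & b) { return a.p > b.p; });
    const float threshold = cand.front().p * min_p;
    size_t keep = 0;
    while (keep < cand.size() && cand[keep].p >= threshold) {
        keep++;
    }
    keep = std::max(keep, std::max<size_t>(min_keep, 1)); // always keep >= 1 token
    cand.resize(std::min(keep, cand.size()));
}
```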
b5503
2025-05-27 12:07:52 +03:00
4f81b33e32
llama : validate seq id batch input ( #13809 )
...
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
b5502
2025-05-27 09:40:59 +03:00
cdf94a1802
server: --offline mode ( #13804 )
...
* server: --offline mode (env: LLAMA_OFFLINE)
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5501
2025-05-26 22:34:27 +01:00
a26c4cc11e
scripts : add option to compare commits in Debug ( #13806 )
...
* scripts : add option to compare commits in Debug
* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00