Commit Graph

5542 Commits

Author SHA1 Message Date
db38704f01 convert : fix rwkv bos/eos token (#13844) 2025-05-30 14:50:43 +02:00
07e4351ce6 convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
b5541
2025-05-30 12:24:37 +02:00
291f2b6913 llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5540
2025-05-30 11:56:02 +02:00
2c90da4c7e llama : use llm_build_granite for minicpm (#13911) b5539 2025-05-30 10:31:48 +02:00
ec9e0301fe cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) b5538 2025-05-30 01:28:54 +02:00
e83ba3e460 llama : add support for jina-reranker-v2 (#13900) b5537 2025-05-29 21:42:31 +02:00
2b131621e6 gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561) gguf-v0.17.0 2025-05-29 15:36:05 +02:00
54a2c7a8cd arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on Neoverse-N2 with a Llama 3 8B Q4_K_M quantized model.
- 34% ~ 50% S_PP (prompt processing) uplift for all batch sizes
- 12% ~ 37% S_TG (text generation) uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
b5535
2025-05-29 14:39:20 +03:00
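For background, the i8mm extension's smmla instruction (exposed as the `vmmlaq_s32` intrinsic) multiplies a 2x8 int8 tile by another 2x8 int8 tile (transposed) and accumulates into a 2x2 int32 tile, which is what lets such kernels process two rows per instruction. A minimal, self-contained sketch of that building block — not the kernel from this PR:

```
// Sketch of the i8mm building block: vmmlaq_s32 computes C += A * B^T
// where A and B are 2x8 int8 tiles and C is a 2x2 int32 tile.
// Compile with e.g. -march=armv8.2-a+i8mm.
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    int8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = (int8_t)(i + 1); b[i] = 1; }

    int8x16_t va  = vld1q_s8(a);      // rows 0..1 of A, 8 columns each
    int8x16_t vb  = vld1q_s8(b);      // rows 0..1 of B, 8 columns each
    int32x4_t acc = vdupq_n_s32(0);   // 2x2 accumulator tile

    acc = vmmlaq_s32(acc, va, vb);    // acc += A (2x8) * B^T (8x2)

    int32_t out[4];
    vst1q_s32(out, acc);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```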
21fcc21ad5 cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
b5534
2025-05-29 12:50:25 +02:00
dd8ba93416 ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5533
2025-05-29 12:18:43 +03:00
66c92061f5 tests : remove json.hpp from a test (#13880)
ggml-ci
b5532
2025-05-29 12:17:16 +03:00
5ca82fc1d7 convert : workaround for AutoConfig dummy labels (#13881) 2025-05-29 10:00:57 +02:00
6385b843a8 llama : add RobertaForSequenceClassification reranker support (#13875) b5530 2025-05-29 08:15:01 +02:00
1b8fb8152d ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
b5529
2025-05-29 09:01:33 +03:00
53ae30640e gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841) 2025-05-28 23:50:20 +02:00
763d06edb7 llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl

* add ref to the PR
b5527
2025-05-28 22:35:31 +02:00
10961339b2 mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
b5526
2025-05-28 22:35:22 +02:00
d98f2a35fc ci: disable LLAMA_CURL for Linux cross-builds (#13871) 2025-05-28 15:46:47 -03:00
e0e3aa231d llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5524
2025-05-28 19:01:58 +02:00
aa6dff05be convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
Sky c962ae3382 server: effectively remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal-model mode (#13853)
[fix]: effectively remove 'image_url'/'input_audio' from 'llama_params' in multimodal-model mode
b5522
2025-05-28 16:33:54 +02:00
a3938fb53d convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion

* fix typo
2025-05-28 16:12:35 +02:00
f7873fc698 tests : change umlaut test (#11600) 2025-05-28 15:49:28 +02:00
a68247439b CUDA: fix FA tg at long context for CC >= 8.9 (#13852) b5519 2025-05-28 13:33:37 +02:00
26b79b6cb3 convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision

* add comment
2025-05-28 10:05:54 +02:00
1e8659e65a CANN: Add SOC TYPE printing in cmake configuration (#13837) b5517 2025-05-28 11:54:20 +08:00
a3c30846e4 opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm (#13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
b5516
2025-05-27 12:56:08 -07:00
1701d4c54f opencl: mark mul_mat f32f32 as supporting non-contiguous tensors (#13790) b5515 2025-05-27 12:53:14 -07:00
bef8176387 vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than a cmake flag
b5514
2025-05-27 18:39:07 +02:00
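For context, moving a build-time toggle to a run-time one means a single binary serves both cases. A minimal sketch of the pattern — only the variable name GGML_VULKAN_PERF comes from the PR, the surrounding code is illustrative:

```
// Sketch of run-time gating via an env var instead of a cmake flag.
// Illustrative only; not the actual ggml-vulkan implementation.
#include <cstdlib>

static bool vk_perf_enabled() {
    // Evaluate once; any non-empty value enables timestamp collection.
    static const bool enabled = std::getenv("GGML_VULKAN_PERF") != nullptr;
    return enabled;
}

void record_timestamps_if_enabled(/* VkCommandBuffer cmd, ... */) {
    if (!vk_perf_enabled()) {
        return;
    }
    // ... write vkCmdWriteTimestamp queries around the dispatch here ...
}
```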
34b7c0439e cmake : add llama-cparams.cpp to build (#13832) b5513 2025-05-27 19:08:44 +03:00
f3101a8cc6 SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
b5512
2025-05-27 20:52:59 +05:30
1c49c70d07 sync : ggml 2025-05-27 18:05:33 +03:00
a8ea03d8ad ggml : add ggml_repeat_4d (#13824) b5510 2025-05-27 15:53:55 +02:00
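For intuition, a repeat-to-4D-shape op tiles the source tensor along each dimension until it reaches the requested extents. A scalar sketch of that semantic, independent of the actual ggml tensor API:

```
// Scalar sketch of 4-D repeat semantics: each source dimension is tiled
// until it reaches the target extent. Illustrative only; the real
// ggml_repeat_4d operates on ggml tensors, not raw vectors.
#include <cstdint>
#include <vector>

std::vector<float> repeat_4d(const std::vector<float> & src,
                             const int64_t se[4],   // source extents
                             const int64_t de[4]) { // destination extents
    std::vector<float> dst(de[0] * de[1] * de[2] * de[3]);
    for (int64_t i3 = 0; i3 < de[3]; ++i3)
    for (int64_t i2 = 0; i2 < de[2]; ++i2)
    for (int64_t i1 = 0; i1 < de[1]; ++i1)
    for (int64_t i0 = 0; i0 < de[0]; ++i0) {
        // Wrap each destination index back into the source extent.
        const int64_t s = ((i3 % se[3]) * se[2] + (i2 % se[2])) * se[1] * se[0]
                        + (i1 % se[1]) * se[0] + (i0 % se[0]);
        dst[((i3 * de[2] + i2) * de[1] + i1) * de[0] + i0] = src[s];
    }
    return dst;
}
```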
05f6ac6283 ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
b5509
2025-05-27 16:21:36 +03:00
bc583e3c63 mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* stricter validation of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
b5508
2025-05-27 14:06:10 +02:00
72b090da2c docs: remove link for llama-cli function calling (#13810) 2025-05-27 08:52:40 -03:00
7fe03e7446 ggml-cpu: x86 feature detection is specific to x86 (#13811) b5506 2025-05-27 13:18:39 +02:00
952f3953c1 ggml : allow CUDA graphs when using pipeline parallelism (#13814) b5505 2025-05-27 13:05:18 +02:00
81713121ee kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions

ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

ggml-ci

* kv-cells : add comments

ggml-ci
b5504
2025-05-27 13:49:41 +03:00
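The point of tracking the min/max used cells is that operations can scan only the occupied span of the cache instead of every cell. A simplified sketch of the bookkeeping — names are hypothetical, and the real llama.cpp structure also tracks per-sequence position ranges:

```
// Simplified sketch of min/max bookkeeping over used KV cells.
// Hypothetical names; not the actual llama.cpp kv-cells structure.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct kv_cells_sketch {
    std::vector<bool> used;
    uint32_t used_min = std::numeric_limits<uint32_t>::max();
    uint32_t used_max = 0;

    explicit kv_cells_sketch(uint32_t n) : used(n, false) {}

    void set_used(uint32_t i) {
        used[i] = true;
        used_min = std::min(used_min, i);
        used_max = std::max(used_max, i);
    }

    // Freeing a boundary cell invalidates the cached bounds, so they
    // are recomputed by rescanning; cheap relative to the cache size.
    void set_free(uint32_t i) {
        used[i] = false;
        if (i == used_min || i == used_max) {
            recompute_bounds();
        }
    }

    void recompute_bounds() {
        used_min = std::numeric_limits<uint32_t>::max();
        used_max = 0;
        for (uint32_t i = 0; i < used.size(); ++i) {
            if (used[i]) {
                used_min = std::min(used_min, i);
                used_max = std::max(used_max, i);
            }
        }
    }
};
```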
f9cd68398b sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
b5503
2025-05-27 12:07:52 +03:00
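The invariant being enforced: no matter how aggressive the threshold, the top candidate must survive, or downstream sampling is left with an empty distribution. A standalone sketch of min-p filtering with that guarantee, assuming candidates sorted by descending probability — not the actual llama.cpp sampler code:

```
// Standalone sketch of min-p filtering that always keeps >= 1 token.
// Probabilities are assumed sorted in descending order; illustrative,
// not the actual llama.cpp sampler implementation.
#include <cstddef>
#include <vector>

// Keep tokens whose probability is at least p_min * max_prob,
// but never drop everything: the top token always survives.
void min_p_filter(std::vector<float> & probs, float p_min) {
    if (probs.empty()) {
        return;
    }
    const float threshold = p_min * probs[0]; // probs[0] is the max
    size_t keep = 1;                          // invariant: at least 1 token
    while (keep < probs.size() && probs[keep] >= threshold) {
        ++keep;
    }
    probs.resize(keep);
}
```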
4f81b33e32 llama : validate seq id batch input (#13809)
* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci
b5502
2025-05-27 09:40:59 +03:00
cdf94a1802 server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5501
2025-05-26 22:34:27 +01:00
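The pattern is a CLI flag with an environment-variable fallback, so offline mode can be forced per invocation or per deployment. A generic sketch — hypothetical helper, not the actual llama-server argument parser:

```
// Generic sketch of a flag with an env-var fallback, as in
// --offline / LLAMA_OFFLINE. Hypothetical helper; not the actual
// llama-server argument parser.
#include <cstdlib>
#include <cstring>

bool resolve_offline(int argc, char ** argv) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--offline") == 0) {
            return true;
        }
    }
    // Fall back to the environment so deployments can opt in globally.
    return std::getenv("LLAMA_OFFLINE") != nullptr;
}
```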
a26c4cc11e scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00
4265a87b59 cuda : avoid cuGetErrorString (#13791)
ggml-ci
b5499
2025-05-26 22:14:52 +03:00
6f180b915c SYCL: Add non-contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non-contiguous input support to the norm kernel

* refactor and add RMS_NORM non-contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore changes that were not required
ggml-ci

* address review comments: make it more SYCL-like

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div
b5498
2025-05-26 21:10:36 +05:30
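Non-contiguous support generally reduces to computing element offsets from per-dimension byte strides rather than assuming a packed layout. A generic sketch of such an offset helper, together with the ceil_div used for grid sizing — illustrative, not the actual SYCL code:

```
// Generic sketch of stride-based offset calculation for non-contiguous
// tensors, plus the ceil_div used when sizing launch grids.
#include <cstddef>
#include <cstdint>

// Byte offset of element (i0, i1, i2, i3) given per-dimension byte
// strides nb[]; contiguity is never assumed.
inline size_t calculate_offset(const size_t nb[4],
                               int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return i0 * nb[0] + i1 * nb[1] + i2 * nb[2] + i3 * nb[3];
}

// Round up a/b; used to size thread grids that must cover all rows.
inline int64_t ceil_div(int64_t a, int64_t b) {
    return (a + b - 1) / b;
}
```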
03f582ae8f server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs. <function= parsing for the functionary v3.1 format; still not ideal but hopefully less prone to crashes
b5497
2025-05-26 16:03:57 +01:00
88c125f2ac examples/training: Fix file name in README (#13803)
This patch fixes binary file names in README.md.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2025-05-26 16:55:24 +02:00
d74e94c1b3 server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
b5495
2025-05-26 14:56:49 +01:00
f13847cfb5 server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
b5494
2025-05-26 14:16:37 +01:00
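The mechanism: while streaming, any suffix of the emitted text that is still a prefix of a stop word is withheld rather than erased, and only a full match is removed. A standalone sketch of the partial-match check — not the actual server code:

```
// Standalone sketch of partial-stop-word handling for streaming: the
// longest text suffix that is also a stop-string prefix is held back
// rather than emitted or erased. Not the actual server implementation.
#include <algorithm>
#include <cstddef>
#include <string>

// Returns how many trailing chars of `text` must be withheld because
// they may still grow into `stop`.
size_t partial_stop_len(const std::string & text, const std::string & stop) {
    if (stop.empty()) {
        return 0;
    }
    const size_t max_len = std::min(text.size(), stop.size() - 1);
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return len; // suffix of text == prefix of stop
        }
    }
    return 0;
}
```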
79c137f776 examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
b5493
2025-05-26 14:03:54 +03:00