Commit Graph

5930 Commits

f9a31eea06 CUDA: set_rows + cpy.cu refactor (#14712) b5930 2025-07-18 14:54:18 +08:00
8f974bc1e9 graph : refactor context to not pass gf explicitly (#14629)
ggml-ci
b5929 2025-07-18 08:29:28 +03:00
09651d09ff graph : Pass the graph placeholder message in debug mode (#14748)
Without that condition, this debug log clutters the screen for every batch processed during prompt processing, and for every token generated in Kobold.cpp.
b5928 2025-07-18 07:25:54 +03:00
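The condition described here is the usual pattern of gating a noisy per-token log behind a debug check. Below is a minimal sketch of that pattern only; the names (LLAMA_GRAPH_DEBUG, log_graph_placeholder) are illustrative and not the actual llama.cpp identifiers.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: enable verbose graph logging only when an env var is set.
static bool graph_debug_enabled() {
    static const bool on = std::getenv("LLAMA_GRAPH_DEBUG") != nullptr;
    return on;
}

// Without the early return, this message would print once per ubatch during
// prompt processing and once per generated token.
static void log_graph_placeholder(int n_nodes) {
    if (!graph_debug_enabled()) {
        return;
    }
    std::fprintf(stderr, "graph: using placeholder graph (%d nodes)\n", n_nodes);
}
```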
349ea79fce use max work group size for device to replace the magic number (#14732) b5927 2025-07-18 10:23:14 +08:00
670e1360cd convert : fix Ernie4.5 MoE without shared experts (#14746) 2025-07-18 01:17:16 +02:00
760b4484e3 nix : use optionalAttrs for env mkDerivation attrset argument (#14726) 2025-07-17 15:18:16 -07:00
cb887f1bc1 model: add Ernie 4.5 MoE support (#14658)
* Add Ernie4.5 MoE

* Fix Flake errors.

* Properly encode/decode MoE layer step

* Correct tensor mappings (.weight)

* Pass and read n_ff_exp

* n_ff_shexp calculation and further minor changes

* Rope fixes.

* .gitignore fix

* Add uint32 cast for Linux builds

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes from code review

* Fix trailing whitespace

* Reenable missing experts error

* Code style from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix non-MoE regression

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b5924 2025-07-17 23:15:32 +02:00
d6fb3f6b49 kv-cache : fix k-shift for multiple streams (#14742)
ggml-ci
b5923 2025-07-17 20:52:33 +03:00
01612b7409 llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
b5922 2025-07-17 19:08:33 +03:00
086cf81e88 llama : fix parallel processing for lfm2 (#14705) b5921 2025-07-17 09:22:11 +02:00
d9b691081c kv-cache : opt mask set input (#14600)
ggml-ci
b5920 2025-07-17 09:49:15 +03:00
ad57d3edd2 batch : fix uninitialized has_cpl flag (#14733)
ggml-ci
b5919 2025-07-17 09:45:54 +03:00
1ba45d4982 ci : disable failing vulkan crossbuilds (#14723) 2025-07-16 20:52:08 -03:00
19e5943d9e convert : make hf token optional (#14717)
* make hf token optional

* fail if we can't get necessary tokenizer config
2025-07-16 23:17:43 +02:00
496957e1cb llama : fix parameter order for hybrid memory initialization (#14725) b5916 2025-07-16 21:17:25 +02:00
21c021745d ggml: Add initial WebGPU backend (#14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
b5915 2025-07-16 18:18:51 +03:00
b0f0ecc3dc model : support output bias for qwen2 (#14711)
Co-authored-by: qwaqrm <qwaqrm@126.com>
b5914 2025-07-16 18:02:06 +03:00
225e7a1438 llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5913 2025-07-16 16:35:42 +03:00
ab14019821 Support diffusion models: Add Dream 7B (#14644)
* Support diffusion models: Add Dream 7B

* Move diffusion to examples

* Move stuff to examples. Add patch to not use kv-cache

* Address review comments

* Make sampling fast

* llama: remove diffusion functions

* Add basic timings + cleanup

* More cleanup

* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length

* fixup!

* Review: move everything to diffusion-cli for now
b5912 2025-07-16 20:03:51 +08:00
64978340b0 ggml : add asserts (#14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5911 2025-07-16 14:43:32 +03:00
6ffd4e9c44 server : pre-calculate EOG logit biases (#14721)
ggml-ci
b5910 2025-07-16 14:04:12 +03:00
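Pre-calculating the biases means the end-of-generation adjustments are built once per request instead of being recomputed for every sampled token. A rough sketch of that idea, with illustrative names only (not the server's actual data structures):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using llama_token = int32_t;

// Build the EOG bias table once, when the request starts.
std::unordered_map<llama_token, float> make_eog_biases(const std::vector<llama_token> & eog_tokens, float bias) {
    std::unordered_map<llama_token, float> biases;
    for (llama_token tok : eog_tokens) {
        biases[tok] = bias; // e.g. a large negative value when EOG should be suppressed
    }
    return biases;
}

// Reuse the pre-computed table on every sampling step instead of re-scanning the vocab.
void apply_eog_biases(std::vector<float> & logits, const std::unordered_map<llama_token, float> & biases) {
    for (const auto & [tok, b] : biases) {
        logits[tok] += b;
    }
}
```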
e4841d24d3 llama : fix parallel processing for plamo2 (#14716) b5909 2025-07-16 12:12:22 +02:00
538cc77f7f server : fix handling of the ignore_eos flag (#14710)
ggml-ci
b5908 2025-07-16 12:13:57 +03:00
5cae766541 scripts: synthetic prompt mode for server-bench.py (#14695) 2025-07-16 09:33:28 +02:00
4b91d6f71f convert : only check for tokenizer folder if we need it (#14704) 2025-07-16 08:52:04 +02:00
cf91f217f1 convert : add pre-computed hashes first to prevent order mishaps (#14701) 2025-07-16 08:51:12 +02:00
79e0b68c17 llama: add LLAMA_API to deprecated llama_kv_self_seq_div (#14708)
Add LLAMA_API to fix the run-time error with llama-cpp-python in a Windows env:
AttributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?

Although llama_kv_self_seq_div() has been marked deprecated, it still
needs to be exported to keep llama-cpp-python happy.

Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
b5904 2025-07-16 07:00:42 +03:00
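The underlying issue is the Windows DLL export model: a function that lacks the export macro is omitted from the DLL's export table, so ctypes-based bindings such as llama-cpp-python cannot look the symbol up by name even though the deprecated implementation still exists. A simplified sketch of the pattern (abbreviated, not a verbatim copy of llama.h):

```cpp
// Simplified sketch of the export-macro pattern; not a verbatim copy of llama.h.
#if defined(_WIN32) && defined(LLAMA_BUILD)
#    define LLAMA_API __declspec(dllexport)
#elif defined(_WIN32)
#    define LLAMA_API __declspec(dllimport)
#else
#    define LLAMA_API __attribute__((visibility("default")))
#endif

struct llama_context; // opaque handle, as in the real API

// Even though the function is deprecated, it must keep the LLAMA_API marker on
// Windows/MSVC, otherwise the symbol is dropped from the DLL export table and
// ctypes-based bindings cannot resolve it at run time.
LLAMA_API void llama_kv_self_seq_div(struct llama_context * ctx /*, ... remaining parameters omitted */);
```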
c81f4192f9 gguf-py : dump bpw per layer and model in markdown mode (#14703) 2025-07-16 00:04:42 +02:00
4a4f426944 model : add Kimi-K2 support (#14654)
* Kimi-K2 conversion

* add Kimi_K2  pre type

* Kimi-K2

* Kimi-K2 unicode

* Kimi-K2

* LLAMA_MAX_EXPERTS 384

* fix vocab iteration

* regex space fix

* add kimi-k2 to pre_computed_hashes

* Updated with kimi-k2 get_vocab_base_pre hash

* fix whitespaces

* fix flake errors

* remove more unicode.cpp whitespaces

* change set_vocab() flow

* add moonshotai-Kimi-K2.jinja to /models/templates/

* update moonshotai-Kimi-K2.jinja

* add kimi-k2 chat template

* add kimi-k2

* update NotImplementedError

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* except Exception

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b5902 2025-07-15 21:54:22 +02:00
ba1ceb3456 vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
b5901 2025-07-15 21:51:09 +02:00
10a0351a97 vulkan: add RTE variants for glu/add/sub/mul/div (#14653) b5900 2025-07-15 21:32:11 +02:00
68e37a61a7 model : add PLaMo-2 support (#14560)
* Add PLaMo-2 model using hybrid memory module

* Fix z shape

* Add cmath to include from llama-vocab.h

* Explicitly dequantize normalization weights before RoPE apply

* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing

* Use ATTN_K/Q_NORM for k,q weights to prevent quantization

* Remove SSM_BCDT that is not used from anywhere

* Do not duplicate embedding weights for output.weight

* Fix tokenizer encoding problem for multibyte strings

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up

* Remove unnecessary part for Grouped Query Attention

* Fix how to load special token id to gguf

* Remove unused tensor mapping

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix plamo2 tokenizer session to prevent multiple calls of build()

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5899 2025-07-15 18:11:42 +02:00
cbc68be51d cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5898 2025-07-15 15:28:53 +08:00
bdca38376f sycl: Hotfix for non dnnl codepath (#14677) b5897 2025-07-14 18:12:42 +01:00
55c509daf5 ggml : refactor llamafile_sgemm PPC code (#14673)
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions

2-7% improvement in Q8 models
15-50% improvement in Q4 models

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b5896 2025-07-14 16:16:42 +03:00
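The "replace repetitive code with inline functions" part of this refactor follows a common pattern: hoist a packing loop that was duplicated in every branch of a nested if-else dispatch into one inline helper. A generic sketch with illustrative names (not the actual llamafile_sgemm code):

```cpp
#include <cstddef>

// Illustrative helper: copy a rows x cols tile from a strided source into a
// contiguous buffer, so each dispatch branch calls this instead of repeating
// the loop nest inline.
template <typename T>
static inline void pack_tile(const T * src, T * dst, std::size_t rows, std::size_t cols, std::size_t ld) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[r * cols + c] = src[r * ld + c];
        }
    }
}
```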
9c9e4fc635 llama-context: add ability to get logits (#14672) b5895 2025-07-14 21:01:41 +08:00
494c5899cb scripts: benchmark for HTTP server throughput (#14668)
* scripts: benchmark for HTTP server throughput

* fix server connection reset
b5894 2025-07-14 13:14:30 +02:00
0f4c6ec0f1 SYCL: use 1D kernel for set_rows (#14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
b5893 2025-07-14 10:37:55 +01:00
65a3ebb0aa sycl: Batched mulmat rework for oneDNN dispatch (#14617) b5892 2025-07-14 10:37:35 +01:00
0d9226763c llama : add jinja template for rwkv-world (#14665)
* llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b5891 2025-07-14 07:43:43 +08:00
982e347255 quantize : fix minor logic flaw in --tensor-type (#14572) b5890 2025-07-13 18:02:17 +02:00
923e3ea2e3 cuda : add set rows for bf16 (#14664) b5889 2025-07-13 15:01:24 +02:00
e743cddb60 cuda : add ELU support (#14657) b5888 2025-07-13 11:33:16 +02:00
05fec5bd29 ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
b5887 2025-07-13 10:36:33 +03:00
dcf7f2ea3c metal : Add missing unary ops Metal support (#14660) b5886 2025-07-13 08:38:13 +03:00
84b396e051 cmake : Add CMake presets for Linux and GCC (#14656) 2025-07-13 08:12:36 +03:00
c31e60647d tests : cover lfm2 cases in test_ssm_conv (#14651) b5884 2025-07-12 19:10:14 +02:00
67eade1bf9 docs : add LFM2 to models section (#14650)
* readme : add LFM2 to models section

* fix copy paste...
2025-07-12 19:07:08 +02:00
7de5c7cab6 CUDA: add set rows for f32 and f16 (#14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
b5882 2025-07-12 16:31:38 +03:00
8eff95544e sync : ggml 2025-07-12 16:13:27 +03:00