Commit Graph

5739 Commits

Author SHA1 Message Date
c46503014d cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
* Remove step-targets from vulkan-shaders-gen

* Unset DESTDIR when building vulkan-shaders-gen
b5689
2025-06-17 22:33:25 +02:00
860a9e4eef ggml-cpu : remove the weak alias trick (#14221) b5688 2025-06-17 12:58:32 +03:00
fe9d60e74a musa: fix build warning (unused variable) (#14231)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5687
2025-06-17 17:48:08 +08:00
e434e69183 common : suggest --jinja when autodetection fails (#14222) b5686 2025-06-16 21:58:42 +02:00
89fea80d29 server : fix incorrect usage of llama_get_embeddings() (#14225)
* server : fix incorrect usage of llama_get_embeddings()

ggml-ci

* cont : fix the fix

ggml-ci
b5685
2025-06-16 22:33:27 +03:00
6adc3c3ebc llama : add thread safety test (#14035)
* llama : add thread safety test

* llamafile : remove global state

* llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0, GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5684
2025-06-16 08:11:43 -07:00
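A hedged sketch of the device-selection behavior described in the commit above; the function and variable names are illustrative, not the actual llama.cpp symbols:

```
#include <vector>

// With LLAMA_SPLIT_MODE_NONE everything runs on a single device:
// main_gpu >= 0 selects that GPU, main_gpu < 0 means no GPU devices are used.
std::vector<int> select_devices_split_none(int main_gpu, int n_gpus) {
    std::vector<int> devices;
    if (main_gpu >= 0 && main_gpu < n_gpus) {
        devices.push_back(main_gpu);
    }
    return devices; // empty => CPU only
}
```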
0dbcabde8c cmake: clean up external project logic for vulkan-shaders-gen (#14179)
* Remove install step for vulkan-shaders-gen

* Add install step to normalize msvc with make

* Regenerate modified shaders at build-time
b5683
2025-06-16 10:32:13 -03:00
ad590be98c model : add NeoBERT (#14164)
* convert neobert model to gguf

* add inference graph

* fix flake8 lint

* followed reviewer suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* follow reviewers suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* override NeoBERT feed-forward length

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5682
2025-06-16 14:53:41 +02:00
7d6d91babf HIP: disable rocwmma on gfx12 by default until rocm 7.0 (#14202) b5681 2025-06-16 13:47:38 +02:00
d3e64b9f49 llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
3ba0d843c6 ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206) b5679 2025-06-16 11:47:57 +02:00
0bf49eb668 convert : remove arcee change in convert_hf_to_gguf_update.py (#14207) 2025-06-16 10:16:06 +02:00
4ad243677b gguf-py : allow key override when adding value to GGUFWriter (#14194)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-06-16 09:20:59 +02:00
c89c2d1ab9 vulkan: mutex around vkQueueSubmit (#14127)
This fixes the remaining crash in test-thread-safety on my system.
b5676
2025-06-16 08:21:08 +02:00
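A minimal sketch of the approach, assuming a single process-wide mutex serializing queue submission (an illustrative wrapper, not the actual ggml-vulkan code):

```
#include <mutex>
#include <vulkan/vulkan.h>

// vkQueueSubmit on the same VkQueue must not be called concurrently from
// multiple threads, so guard every submission with one shared mutex.
static std::mutex g_queue_mutex;

static VkResult queue_submit_locked(VkQueue queue, uint32_t submit_count,
                                    const VkSubmitInfo * submits, VkFence fence) {
    std::lock_guard<std::mutex> lock(g_queue_mutex);
    return vkQueueSubmit(queue, submit_count, submits, fence);
}
```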
3555b3004b ggml-cpu : rework weak alias on apple targets (#14146)
* ggml-cpu : rework weak alias on apple targets

* fix powerpc detection

* fix ppc detection

* fix powerpc detection on darwin
b5675
2025-06-16 13:54:15 +08:00
d7da8dc83a model : Add support for Arcee AI's upcoming AFM model (#14185)
* Add Arcee AFM support

* Add draft update code

* Fix linter and update URL, may still not be final

* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Remove accidental blank line

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5674
2025-06-16 01:04:06 +02:00
cd355eda7d server : When listening on a unix domain socket don't print http:// and port (#14180)
Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b5673
2025-06-15 23:36:22 +02:00
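A rough sketch of the message selection, with a hypothetical is_unix_socket flag (the real server code is structured differently):

```
#include <cstdio>
#include <string>

// Print the socket path as-is for unix domain sockets; only TCP listeners
// get the http://host:port form.
void print_listening_message(const std::string & addr, int port, bool is_unix_socket) {
    if (is_unix_socket) {
        printf("main: server is listening on %s - starting the main loop\n", addr.c_str());
    } else {
        printf("main: server is listening on http://%s:%d - starting the main loop\n", addr.c_str(), port);
    }
}
```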
30e5b01de2 quantize : change int to unsigned int for KV overrides (#14197) b5672 2025-06-15 18:53:45 +02:00
e54b394082 CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (#14196) b5671 2025-06-15 17:30:13 +02:00
2c2caa4443 HIP: Replace usage of deprecated preprocessor macro __AMDGCN_WAVEFRONT_SIZE__ (#14183) b5670 2025-06-15 15:45:27 +02:00
5fce5f948d kv-cache : fix use-after-move of defrag info (#14189)
ggml-ci
b5669
2025-06-15 10:52:11 +03:00
9ae4143bc6 model : add dots.llm1 architecture support (#14044) (#14118)
Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The model architecture is called "dots.llm1" (I decided to shorten it to
dots1 or DOTS1 in the code).

The only models that follow this architecture, as of the writing of this
commit, are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

ffe12627b4/src/transformers/models/dots1/modular_dots1.py
b5668
2025-06-15 09:52:06 +02:00
c311ac664d cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
ggml-ci
b5667
2025-06-15 10:08:58 +03:00
b9912ac570 batch : auto-gen positions + verify multi-sequence input (#14177)
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
b5666
2025-06-15 09:18:37 +03:00
00ba772610 docs : remove WIP since PR has been merged (#13912) 2025-06-15 08:06:37 +02:00
3cb203c89f llama-chat : Do not throw when tool parsing fails (#14012)
Currently, when a model generates output that looks like a tool call but is
invalid, an exception is thrown and not handled, causing the CLI or
llama-server to bail. Instead, handle the chat parser exception and simply
return the generated text in such cases.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
b5664
2025-06-14 17:25:15 +01:00
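A hedged sketch of the fallback behavior, using a hypothetical parse_tool_calls() helper rather than the actual chat-parser API:

```
#include <stdexcept>
#include <string>

// Hypothetical parser: throws when the output looks like a tool call but cannot be parsed.
static std::string parse_tool_calls(const std::string & text) {
    if (text.rfind("<tool_call>", 0) == 0 && text.find("</tool_call>") == std::string::npos) {
        throw std::runtime_error("malformed tool call");
    }
    return text;
}

// Catch the parser exception and return the generated text unchanged instead
// of letting it propagate and take down the CLI or llama-server.
std::string parse_or_passthrough(const std::string & generated) {
    try {
        return parse_tool_calls(generated);
    } catch (const std::exception &) {
        return generated;
    }
}
```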
2e42be42bd compare-llama-bench: add option to plot (#14169)
* compare llama-bench: add option to plot

* Address review comments: convert case + add type hints

* Add matplotlib to requirements

* fix tests

* Improve comment and fix assert condition for test

* Add back default test_name, add --plot_log_scale

* use log_scale regardless of x_values
2025-06-14 10:34:20 +02:00
fb85a288d7 vocab : fix build (#14175)
ggml-ci
b5662
2025-06-13 20:03:05 +03:00
40643edb86 sycl: fix docker image (#14144) 2025-06-13 18:32:56 +02:00
3cfbbdb44e Merge commit from fork
* vocab : prevent integer overflow during load

* Add static cast and GGML_ABORT

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-13 19:20:25 +03:00
80709b70a2 batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
* batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

* cont : improve seq_id display
b5659
2025-06-13 18:35:00 +03:00
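A minimal sketch of how such a debug switch is typically read from the environment; the exact levels and output in llama.cpp may differ:

```
#include <cstdio>
#include <cstdlib>

// 0 = off (default); higher values enable progressively more verbose batch dumps.
static int batch_debug_level() {
    const char * val = std::getenv("LLAMA_BATCH_DEBUG");
    return val ? std::atoi(val) : 0;
}

int main() {
    const int level = batch_debug_level();
    if (level > 0) {
        printf("batch debug enabled (level %d)\n", level);
    }
    return 0;
}
```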
26ff3685bf docs : Update multimodal.md (#14122)
* Update multimodal.md

* Update multimodal.md
2025-06-13 15:17:53 +02:00
60c666347b batch : rework llama_batch_allocr (#14153)
* batch : rework llama_batch_allocr

ggml-ci

* cont : move validation inside class

ggml-ci

* cont : move output counting to class

ggml-ci

* cont : minor

ggml-ci

* batch : add TODOs

ggml-ci
b5657
2025-06-13 13:47:55 +03:00
b7cc7745e3 readme : remove survey link (#14168) 2025-06-13 11:55:44 +03:00
cc8d081879 cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT

* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
b5655
2025-06-13 10:38:52 +02:00
d714dadb57 pooling : make cls_b and cls_out_b optional (#14165)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5654
2025-06-13 11:34:08 +03:00
ffad043973 server : fix SWA condition for full context reprocess (#14163)
ggml-ci
b5653
2025-06-13 11:18:25 +03:00
0889eba570 sycl: Adding additional cpy dbg print output (#14034) b5652 2025-06-13 08:51:39 +01:00
c61285e739 SYCL: Bump oneMath commit (#14152)
Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669
which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.

```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
b5651
2025-06-13 08:45:37 +01:00
09cf2c7c65 cmake : Improve build-info.cpp generation (#14156)
* cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp still gets triggered when .git/index
changes.

* cmake: generate build-info.cpp in build dir
b5650
2025-06-13 09:51:34 +03:00
c33fe8b8c4 vocab : prevent heap overflow when vocab is too small (#14145)
ggml-ci
b5649
2025-06-13 08:03:54 +03:00
ed52f3668e sycl: Remove not needed copy f16->f32 for dnnl mul mat (#14125) b5648 2025-06-12 15:15:11 +02:00
a681b4ba83 readme : remove project status link (#14149) 2025-06-12 14:43:09 +03:00
7d516443dd server : re-enable SWA speculative decoding (#14131)
ggml-ci
b5646
2025-06-12 11:51:38 +03:00
f6e1a7aa87 context : simplify output counting logic during decode (#14142)
* batch : remove logits_all flag

ggml-ci

* context : simplify output counting logic during decode

ggml-ci

* cont : fix comments
b5645
2025-06-12 11:50:01 +03:00
c3ee46fab4 batch : remove logits_all flag (#14141)
ggml-ci
b5644
2025-06-12 11:49:26 +03:00
e2c0b6e46a cmake : handle whitespaces in path during metal build (#14126)
* cmake : handle whitespaces in path during metal build

ggml-ci

* cont : proper fix

ggml-ci

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-06-12 10:14:24 +03:00
9596506965 kv-cache : fix split_equal handling in unified implementation (#14130)
ggml-ci
b5642
2025-06-12 10:02:15 +03:00
a20b2b05bc context : round n_tokens to next multiple of n_seqs when reserving (#14140)
This fixes RWKV inference, which otherwise failed when the worst-case
ubatch.n_seq_tokens rounded to 0.
b5641
2025-06-12 02:56:04 -04:00
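A sketch of the rounding, with illustrative names; rounding n_tokens up to a multiple of n_seqs keeps the per-sequence token count from truncating to 0 in the worst-case reservation:

```
#include <cassert>
#include <cstdint>

// Round n_tokens up to the next multiple of n_seqs so that
// n_tokens / n_seqs (tokens per sequence) is never 0.
static uint32_t round_up_to_multiple(uint32_t n_tokens, uint32_t n_seqs) {
    assert(n_seqs > 0);
    return ((n_tokens + n_seqs - 1) / n_seqs) * n_seqs;
}
// Example: 3 tokens across 4 sequences reserves 4 tokens (1 per sequence)
// instead of 3 / 4 == 0 tokens per sequence without rounding.
```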
2e89f76b7a common: fix issue with regex_escape routine on windows (#14133) b5640 2025-06-11 17:19:44 -03:00