Commit Graph

4999 Commits

Author SHA1 Message Date
2f21123c1d vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258) b4899 2025-03-17 10:35:00 +01:00
374101fd74 cmake : enable building llama.cpp using system libggml (#12321)
* cmake: Factor out compiler flag function from ggml

llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
b4898
2025-03-17 11:05:23 +02:00
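As a rough illustration of the system-ggml build described above, a configure step along these lines should work; this is a minimal sketch, and the exact CMake option name (assumed here to be LLAMA_USE_SYSTEM_GGML) is not confirmed by this log.

```console
# Sketch: configure llama.cpp against a system-installed libggml
# (option name LLAMA_USE_SYSTEM_GGML is an assumption)
cmake -B build -DLLAMA_USE_SYSTEM_GGML=ON
cmake --build build --config Release
```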
b3c9a65673 SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)
* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface
b4897
2025-03-17 09:45:12 +08:00
8ba95dca20 llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400) b4896 2025-03-16 19:46:36 +02:00
dc079cfdff context : fix init of n_outputs (#12397)
ggml-ci
b4895
2025-03-16 19:29:36 +02:00
7b61bcc87c ci : add --symlinks to xcframework zip command (#12409)
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary for symlinks to be preserved
in the zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file rather than as a symlink, which
causes the following error in Xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```

Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
2025-03-16 18:22:05 +01:00
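For reference, the packaging step described above looks roughly like the following; the paths are illustrative placeholders, while --symlinks is the standard zip flag for storing symlinks instead of following them.

```console
# Sketch: zip the xcframework while preserving the Versions symlinks
# (paths are illustrative, not the exact ones used by the build script)
cd build-apple
zip --symlinks -r llama.xcframework.zip llama.xcframework
```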
f4c3dd5daa llama-tts : add '-o' option (#12398)
* added -o option to specify an output file name

* llama-tts returns ENOENT in case of file write error

note : PR #12042 is closed as superseded by this one.
b4893
2025-03-15 17:23:11 +01:00
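A usage sketch for the new -o option; the model, vocoder, and prompt values are placeholders, and the -m/-mv/-p flags are assumed from common llama.cpp conventions rather than confirmed by this log.

```console
# Sketch: write the synthesized audio to a chosen file via the new -o option
# (model, vocoder, and prompt values are placeholders)
llama-tts -m outetts.gguf -mv wavtokenizer.gguf -p "Hello from llama.cpp" -o hello.wav
```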
3d35d87b41 SYCL: Delete redundant plus sign and space (#12391) b4892 2025-03-15 15:49:03 +01:00
b19bd064c0 SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)
* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b4891
2025-03-15 22:19:30 +08:00
92a391327e [CANN] MUL_MAT optimization (#12382) 2025-03-15 09:31:08 +08:00
9f2250ba72 Add CLI arg to llama-run to adjust the number of threads used (#12370)
We default to 4; sometimes we want to adjust this manually.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4889
2025-03-14 16:41:20 +00:00
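A sketch of the new thread control, assuming the flag follows the usual -t/--threads convention used by other llama.cpp tools; the model path and prompt are placeholders.

```console
# Sketch: override the default of 4 threads
# (flag name -t/--threads is assumed; model and prompt are placeholders)
llama-run -t 8 model.gguf "Explain KV caches briefly"
```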
774973b8f3 main : add -sysf / --system-prompt-file (#12249) (#12250)
* add system_prompt_file

* add -sysf / --system-prompt-file

* remove system_prompt_file
b4888
2025-03-14 16:57:05 +01:00
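A sketch of the new option in use, assuming the llama-cli binary built from the main example; the model path, system prompt file, and prompt are placeholders.

```console
# Sketch: read the system prompt from a file rather than passing it inline
# (model path and prompt are placeholders)
llama-cli -m model.gguf -sysf system_prompt.txt -p "What can you do?"
```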
8fcb563613 Load all MoE experts during warmup (#11571)
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-14 13:47:05 +01:00
add2a3aa5a server: fix "--grammar-file" parameter (#12285) b4886 2025-03-14 11:21:17 +01:00
c522ce4143 graph : simplify attn input build for unified KV cache (#12381)
ggml-ci
b4885
2025-03-14 10:47:44 +02:00
081bee8c64 hparams : add SWA rope parameters (#12374)
ggml-ci
b4884
2025-03-14 09:03:24 +02:00
84d5475541 llama : fix Gemma3 SWA KV cache shift (#12373)
* llama : fix Gemma3 SWA KV cache shift

ggml-ci

* hparams : add comment [no ci]
2025-03-13 19:08:07 +02:00
be7c303410 arg : no n_predict = -2 for examples except for main and infill (#12364) b4882 2025-03-13 12:34:54 +01:00
e0dbec0bc6 llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context

ggml-ci

* graph : don't mutate the KV cache during defrag

ggml-ci

* context : reduce virtuals + remove test function

ggml-ci

* context : move interface implementation to source file + factory

ggml-ci

* graph : move KV cache build functions to llama_context impl

ggml-ci

* graph : remove model reference from build_pooling

ggml-ci

* graph : remove llama_model reference

ggml-ci

* kv_cache : provide rope factors

ggml-ci

* graph : rework inputs to use only unique_ptr, remove attn input abstraction

ggml-ci

* context : remove llama_context_i abstraction

ggml-ci

* context : clean-up

ggml-ci

* graph : clean-up

ggml-ci

* llama : remove redundant keywords (struct, enum)

ggml-ci

* model : adapt gemma3

ggml-ci

* graph : restore same attention ops as on master

ggml-ci

* llama : remove TODO + fix indent

ggml-ci
2025-03-13 12:35:44 +02:00
2048b5913d server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338)
* Fix DOS index bug

* Remove new APIs

* remove extra line

* Remove from API

* Add extra newline

* Update examples/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b4880
2025-03-13 11:10:05 +01:00
f08f4b3187 Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (#12301) b4879 2025-03-12 20:06:58 +01:00
80a02aa858 llama.swiftui : fix xcframework dir in README [no ci] (#12353)
This commit fixes the path to the xcframework in the README file, which I
had forgotten to change after renaming the build directory.
2025-03-12 13:45:32 +01:00
363f8c5d67 sycl : variable sg_size support for mmvq kernels (#12336) b4877 2025-03-12 09:57:32 +00:00
34c961b181 CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315)
When fattn-wmma was ported over to warp64, various bits that also touch fattn-vec were converted to
a selectable warp size; however, the fattn-vec kernels don't work with 64-wide warps for now, so we need
to avoid launching them with parameters for warp64.
b4876
2025-03-12 10:14:11 +01:00
7841fc723e llama : Add Gemma 3 support (+ experimental vision capability) (#12343)
* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
b4875
2025-03-12 09:30:24 +01:00
bf69cfe62f vulkan: fix bug in coopmat1 mul_mat_id (#12316)
* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id
b4874
2025-03-12 06:59:19 +01:00
10f2e81809 CUDA/HIP: refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177)
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b4873
2025-03-11 20:16:03 +01:00
ba7654380a ggml-backend : fix backend search path (#12330)
* Fix backend search path

* replace .native() with '/'

* reverted .native()
b4872
2025-03-11 14:25:17 +01:00
6ab2e4765a metal : Cache the Metal library at the device context level (#12265) b4871 2025-03-11 13:45:02 +02:00
96e1280839 clip : bring back GPU support (#12322)
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
b4870
2025-03-11 09:20:16 +01:00
Eve 2c9f833d17 mat vec double buffer (#12188) b4869 2025-03-10 19:28:11 +00:00
251364549f musa: support new arch mp_31 and update doc (#12296)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4868
2025-03-10 18:18:25 +01:00
8acdacb3ea opencl: use OpenCL C standard supported by the device (#12221)
This patch nudges llama.cpp a bit so it can be supported on PoCL, which
doesn't support OpenCL C 2.0. The issue is solved by querying the device
for the supported OpenCL C versions and using the highest one available.
b4867
2025-03-10 09:57:00 -07:00
89b2b56e86 readme: added Sidekick to available UIs (#12311) 2025-03-10 16:13:09 +02:00
e128a1bf5b tests : fix test-quantize-fns to init the CPU backend (#12306)
ggml-ci
b4865
2025-03-10 14:07:15 +02:00
6ef79a67ca common : refactor '-o' option (#12278)
As discussed in PR 'llama-tts : add -o option' (#12042):

* common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option.

* cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.
b4864
2025-03-10 13:34:13 +02:00
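An illustrative invocation of one of the affected tools; after this change the default output filename lives in the example's main(), and -o overrides it. The flags other than -o and the file names below are assumptions based on common llama.cpp usage.

```console
# Sketch: name the imatrix output explicitly via the shared -o option
# (model and calibration file paths are placeholders)
llama-imatrix -m model.gguf -f calibration.txt -o custom-imatrix.dat
```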
4e39a3c332 server: extract <think> tags from qwq outputs (#12297)
* extract <think> tags from qwq outputs

* const for all static regexes in chat.cpp
b4863
2025-03-10 10:59:03 +00:00
be421fc429 tool-call: ensure there's always a non-empty tool call id (#12292) 2025-03-10 09:45:29 +00:00
87c2630546 allow missing content in message if tool_calls provided (#12293) b4861 2025-03-10 09:45:07 +00:00
2b3a25c212 sampler: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291)
* Fix typo in lazy grammar handling (fixes trigger tokens)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4860
2025-03-10 09:44:42 +00:00
8352cdc87b llava : fix bug in minicpm-v code (#11513)
* fix bug in minicpm-v code

* update readme of minicpm-v
b4859
2025-03-10 10:33:24 +02:00
1e2f78a004 server : add speculative decoding presets for FIM (#12287) 2025-03-09 19:08:20 +02:00
0fd7ca7a21 authors : update (#12271) 2025-03-08 18:26:00 +02:00
6fefc05a7a ggml-backend : make path_str compatible with C++20 (#12269) b4856 2025-03-08 17:02:39 +01:00
7ab364390f server : infill gen ends on new line (#12254) b4855 2025-03-07 20:54:30 +02:00
7c7f3b7f43 ggml : skip intermediate .air file when compiling .metallib (#12247)
This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.

The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.
b4854
2025-03-07 14:15:27 +01:00
102ac1891d sync : ggml
ggml-ci
b4853
2025-03-07 14:49:44 +02:00
d6ae2fa061 ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118)
* ggml_compute_forward_concat() for arbitrary tensor type

* Check that tensors' type match

* ggml-cpu.c: check type of source tensors

* ggml-cpu.c: move tensor type check to ggml_compute_forward_concat()

* ggml.c: check concatenated tensor type

* Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c

..., as it was moved to ggml.c.
2025-03-07 14:49:44 +02:00
68d0027f3d ggml-cpu: faster AVX2 variant for IQ1_M (#12216) b4851 2025-03-07 13:54:22 +02:00
ea002810a2 ci : fix save-load test invocations (#12245) 2025-03-07 12:19:31 +02:00