Commit Graph

4959 Commits

SHA1 Message Date
b8b173274d server : remove old commented code [no ci] 2025-03-20 18:20:54 +02:00
8a23b4a54a server : avoid common_batch
ggml-ci
2025-03-20 16:52:24 +02:00
76fd7d6f5b perplexity : avoid common_batch
ggml-ci
2025-03-20 12:28:39 +02:00
8b80d68338 embedding : avoid common_batch
ggml-ci
2025-03-19 14:29:04 +02:00
6f54ee660c retrieval : avoid common_batch
ggml-ci
2025-03-19 13:50:15 +02:00
32c2c41d5e android : fix permission 2025-03-19 10:49:30 +01:00
96ca6e8d23 swift : adapt to new API 2025-03-19 10:48:42 +02:00
b0db7fc2c6 android : adapt to new API 2025-03-19 10:16:55 +02:00
23d7407314 Merge pull request #15 from ggml-org/xsn/private_batch_api
speculative : adapt to new llama API
2025-03-19 09:15:09 +01:00
7a3c178d78 speculative : adapt to new llama API
ggml-ci
2025-03-18 22:05:44 +02:00
dc4bb64290 Merge branch 'master' into xsn/private_batch_api 2025-03-18 15:45:22 +01:00
8551c44d84 context : always use non-causal attention for encoder graphs (#12447)
* context : always use non-causal attention for encoder graphs

ggml-ci

* context : move the change to llama_context::encode()

ggml-ci
b4914
2025-03-18 13:05:49 +02:00
35cae5ba05 SYCL: using graphs is configurable by environment variable and compile option (#12371)
* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>

* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
b4913
2025-03-18 11:16:31 +01:00
810e0af3f5 server : fix warmup draft cache type (#12446)
ggml-ci
b4912
2025-03-18 12:05:42 +02:00
eba92d64c3 cmake : fix PowerPC build (#12241)
Closes #12240
b4911
2025-03-18 11:37:33 +02:00
d9a14523bb ggml : add SVE support for q6_K_q8_K (#12361) b4910 2025-03-18 10:14:39 +02:00
fd123cfead Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434) b4909 2025-03-18 07:21:40 +01:00
a53f7f7b88 fixed compilation warnings in ggml-sycl (#12424) b4908 2025-03-18 08:51:25 +08:00
7dfad387e3 llama: Add support for RWKV v7 architecture (#12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
b4907
2025-03-18 07:27:50 +08:00
60c902926c docs : bring llama-cli conversation/template docs up-to-date (#12426) 2025-03-17 21:14:32 +01:00
b1b132efcb cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)
* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API was changed in 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support for CTK versions < 12.x by using the older API when CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
b4905
2025-03-17 20:25:13 +02:00
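
The commit above boils down to compile-time gating on the CUDA toolkit version: CUDA 12.x changed the `cudaGraphExecUpdate` signature, so the older form has to be used when building against an older toolkit. A minimal sketch of such a guard, assuming the pre-12.x and 12.x signatures of `cudaGraphExecUpdate` (the wrapper name is illustrative, not taken from the patch):

```cpp
#include <cuda_runtime.h>

// Pick the cudaGraphExecUpdate signature that matches the toolkit in use.
// CUDA 12.x replaced the error-node / result output parameters with a
// single cudaGraphExecUpdateResultInfo struct.
static cudaError_t graph_exec_update(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info);
#else
    cudaGraphNode_t error_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &error_node, &result);
#endif
}
```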
01e8f2138b ggml-vulkan: remove unused find_program(glslc) (#12416)
It's already found by FindVulkan.cmake in the parent CMakeLists
2025-03-17 13:35:43 -03:00
484a8ab513 vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312) b4903 2025-03-17 09:26:18 -05:00
cf2270e4d3 vulkan: subgroup size tuning (#12087)
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <picard12@live.de>
b4902
2025-03-17 12:42:33 +01:00
eab5606d7b Apply suggestions from code review 2025-03-17 12:17:14 +01:00
de788e071b Update examples/tts/tts.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-17 12:05:23 +01:00
f07690c930 vulkan: use fp32 in coopmat2 q4_k dequant function (#12309) b4901 2025-03-17 10:43:35 +01:00
891c63956d vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273)
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
b4900
2025-03-17 10:41:59 +01:00
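
Padding the N dimension up to a multiple of the tile size is plain round-up arithmetic; a generic sketch of the idea (names are illustrative, not taken from the shader code):

```cpp
#include <cstdint>

// Round n up to the next multiple of `multiple`, so the tiled matrix
// multiply never has to bounds-check the last partial tile.
constexpr uint32_t round_up(uint32_t n, uint32_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

static_assert(round_up(1000, 16) == 1008, "1000 padded to a 16-wide tile");
```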
2f21123c1d vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258) b4899 2025-03-17 10:35:00 +01:00
374101fd74 cmake : enable building llama.cpp using system libggml (#12321)
* cmake: Factor out compiler flag function from ggml

llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
b4898
2025-03-17 11:05:23 +02:00
b3c9a65673 SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)
* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface
b4897
2025-03-17 09:45:12 +08:00
8ba95dca20 llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400) b4896 2025-03-16 19:46:36 +02:00
dc079cfdff context : fix init of n_outputs (#12397)
ggml-ci
b4895
2025-03-16 19:29:36 +02:00
7b61bcc87c ci : add --symlinks to xcframework zip command (#12409)
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to preserve symlinks in the
zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file rather than as a symlink, which
causes the following error in Xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```

Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
2025-03-16 18:22:05 +01:00
f4c3dd5daa llama-tts : add '-o' option (#12398)
* added -o option to specify an output file name

* llama-tts returns ENOENT in case of file write error

note : PR #12042 is closed as superseded by this one.
b4893
2025-03-15 17:23:11 +01:00
3d35d87b41 SYCL: Delete redundant plus sign and space (#12391) b4892 2025-03-15 15:49:03 +01:00
b19bd064c0 SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)
* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b4891
2025-03-15 22:19:30 +08:00
92a391327e [CANN]MUL_MAT optimization (#12382) 2025-03-15 09:31:08 +08:00
624a683c6f fix compile 2025-03-14 22:30:29 +01:00
116b9a1662 rename to init_from_text 2025-03-14 22:17:07 +01:00
9f2250ba72 Add CLI arg to llama-run to adjust the number of threads used (#12370)
We default to 4; sometimes we want to adjust this manually.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4889
2025-03-14 16:41:20 +00:00
eaffba0f2e llama_batch_ext_ptr::from_text/embd 2025-03-14 17:12:03 +01:00
774973b8f3 main : add -sysf / --system-prompt-file (#12249) (#12250)
* add system_prompt_file

* add -sysf / --system-prompt-file

* remove system_prompt_file
b4888
2025-03-14 16:57:05 +01:00
8fcb563613 Load all MoE experts during warmup (#11571)
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-14 13:47:05 +01:00
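
A minimal sketch of how a caller might use the new warmup control, assuming the `llama_set_warmup(ctx, bool)` call introduced here and the standard `llama_decode()` entry point (batch setup is omitted, and the helper name is hypothetical):

```cpp
#include "llama.h"

// Sketch: turn on warmup mode so the compute graph touches all MoE experts,
// run one throwaway decode, then switch back to normal inference.
void warmup_model(llama_context * ctx, llama_batch & batch) {
    llama_set_warmup(ctx, true);   // assumed new API call from this commit
    llama_decode(ctx, batch);      // outputs of the warmup pass are discarded
    llama_set_warmup(ctx, false);
}
```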
8e7714fa77 fix compile 2025-03-14 11:28:15 +01:00
a363251fac qwen2vl: use llama_batch_ext_set_pos 2025-03-14 11:25:36 +01:00
add2a3aa5a server: fix "--grammar-file" parameter (#12285) b4886 2025-03-14 11:21:17 +01:00
ba79369615 fix llama_batch_ext_init_from_embd 2025-03-14 11:17:22 +01:00
07d84fa3c2 fix missing n_past in various places
this is actually a revert of cda0e4b648
2025-03-14 10:47:08 +01:00
32940369d3 fix gemma3-cli 2025-03-14 10:33:28 +01:00