b8b173274d
server : remove old commented code [no ci]
2025-03-20 18:20:54 +02:00
8a23b4a54a
server : avoid common_batch
...
ggml-ci
2025-03-20 16:52:24 +02:00
76fd7d6f5b
perplexity : avoid common_batch
...
ggml-ci
2025-03-20 12:28:39 +02:00
8b80d68338
embedding : avoid common_batch
...
ggml-ci
2025-03-19 14:29:04 +02:00
6f54ee660c
retrieval : avoid common_batch
...
ggml-ci
2025-03-19 13:50:15 +02:00
32c2c41d5e
android : fix permission
2025-03-19 10:49:30 +01:00
96ca6e8d23
swift : adapt to new API
2025-03-19 10:48:42 +02:00
b0db7fc2c6
android : adapt to new API
2025-03-19 10:16:55 +02:00
23d7407314
Merge pull request #15 from ggml-org/xsn/private_batch_api
...
speculative : adapt to new llama API
2025-03-19 09:15:09 +01:00
7a3c178d78
speculative : adapt to new llama API
...
ggml-ci
2025-03-18 22:05:44 +02:00
dc4bb64290
Merge branch 'master' into xsn/private_batch_api
2025-03-18 15:45:22 +01:00
8551c44d84
context : always use non-causal attention for encoder graphs ( #12447 )
...
* context : always use non-causal attention for encoder graphs
ggml-ci
* context : move the change to llama_context::encode()
ggml-ci
b4914
2025-03-18 13:05:49 +02:00
35cae5ba05
SYCL: using graphs is configurable by environment variable and compile option ( #12371 )
...
* alberto changes
* enable sycl graphs by env variable
* fixed compilation warnings in ggml-sycl.cpp
* renamed graph variables
* fix markdown in docs/backend/SYCL.md
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
* fix markdown in docs/backend/SYCL.md again
* compiling graphs by default, renamed graph_enable to graph_disable
---------
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
b4913
2025-03-18 11:16:31 +01:00
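For orientation, a minimal sketch of the gating this commit describes: graphs are compiled in by default but can be removed via a compile option, with a runtime environment-variable opt-out. The macro and variable names (GGML_SYCL_GRAPH, GGML_SYCL_DISABLE_GRAPH) are assumptions for illustration, not necessarily the identifiers used in ggml-sycl.cpp.
```cpp
// Sketch only: compile-time and runtime gating for SYCL graph usage.
// GGML_SYCL_GRAPH / GGML_SYCL_DISABLE_GRAPH are assumed names, not verified
// against the actual ggml-sycl sources.
#include <cstdlib>

static bool sycl_graphs_enabled() {
#ifndef GGML_SYCL_GRAPH
    return false;                        // graphs compiled out entirely
#else
    const char * v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
    return v == nullptr || v[0] == '0';  // on by default, env var opts out
#endif
}
```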
810e0af3f5
server : fix warmup draft cache type ( #12446 )
...
ggml-ci
b4912
2025-03-18 12:05:42 +02:00
eba92d64c3
cmake : fix PowerPC build ( #12241 )
...
Closes #12240
b4911
2025-03-18 11:37:33 +02:00
d9a14523bb
ggml : add SVE support for q6_K_q8_K ( #12361 )
b4910
2025-03-18 10:14:39 +02:00
fd123cfead
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues ( #12434 )
b4909
2025-03-18 07:21:40 +01:00
a53f7f7b88
fixed compilation warnings in ggml-sycl ( #12424 )
b4908
2025-03-18 08:51:25 +08:00
7dfad387e3
llama: Add support for RWKV v7 architecture ( #12412 )
...
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
b4907
2025-03-18 07:27:50 +08:00
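Of the new operators listed above, l2_norm is the simplest to show. A minimal sketch of calling it from graph-building code follows, assuming the (ctx, tensor, eps) signature implied by the op name; check ggml.h from this PR for the authoritative declaration.
```cpp
// Minimal sketch, assuming ggml_l2_norm(ctx, tensor, eps); verify against ggml.h.
#include "ggml.h"

static struct ggml_tensor * build_l2_norm(struct ggml_context * ctx, struct ggml_tensor * x) {
    const float eps = 1e-12f;          // small epsilon for numerical stability
    return ggml_l2_norm(ctx, x, eps);  // normalize x to unit L2 norm
}
```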
60c902926c
docs : bring llama-cli conversation/template docs up-to-date ( #12426 )
2025-03-17 21:14:32 +01:00
b1b132efcb
cuda : enable CUDA Graph on CUDA Toolkit < 12.x ( #12394 )
...
* Enable CUDA Graph on CTK < 12.x
The `cudaGraphExecUpdate` API changed in 12.x. For this reason, CUDA Graph support was disabled on older CUDA toolkits. This change enables CUDA Graph support for CTK versions < 12.x by falling back to the older API when CTK < 12.x.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA
b4905
2025-03-17 20:25:13 +02:00
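A hedged sketch of the version dispatch the commit describes: the same update call, selected at compile time via CUDART_VERSION (12000 corresponds to CTK 12.0). This mirrors the public CUDA runtime signatures, not the exact code in ggml-cuda.
```cpp
// Sketch: use the pre-12.x cudaGraphExecUpdate signature when building with
// an older toolkit, and the 12.x signature otherwise.
#include <cuda_runtime_api.h>

static cudaError_t update_graph_exec(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info);               // 12.x API
#else
    cudaGraphNode_t           err_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &err_node, &result);  // pre-12.x API
#endif
}
```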
01e8f2138b
ggml-vulkan: remove unused find_program(glslc) ( #12416 )
...
It's already found by FindVulkan.cmake in the parent CMakeLists
2025-03-17 13:35:43 -03:00
484a8ab513
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader ( #12312 )
b4903
2025-03-17 09:26:18 -05:00
cf2270e4d3
vulkan: subgroup size tuning ( #12087 )
...
* vulkan: subgroup size test
* Vulkan: Add device architecture enum and logic to recognize AMD generations
* vulkan: use new architecture logic to specify subgroup size
* Initial vulkan subgroup size tuning for RDNA3
* vulkan: commonize RDNA subgroup tuning
* vulkan: override subgroup size if required_subgroup_size = 0
* vulkan: disable warp 32 for RDNA3
* vulkan: fine tuned RDNA1 subgroup sizes
* vulkan: adjusted subgroup size map
* vulkan: fixed RDNA2 subgroup map
---------
Co-authored-by: 0cc4m <picard12@live.de>
b4902
2025-03-17 12:42:33 +01:00
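The bullet list above boils down to: detect the AMD generation, and only apply a tuned subgroup size when the shader does not already require one (required_subgroup_size == 0). A purely illustrative C++ sketch, with an assumed enum and assumed values rather than the real tables in ggml-vulkan:
```cpp
// Illustration only: architecture-based subgroup size selection.
#include <cstdint>

enum class vk_device_arch { OTHER, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3 };

static uint32_t pick_subgroup_size(vk_device_arch arch, uint32_t required_subgroup_size) {
    if (required_subgroup_size != 0) {
        return required_subgroup_size;   // the pipeline's requirement always wins
    }
    switch (arch) {                      // hypothetical tuned values per generation
        case vk_device_arch::AMD_RDNA1:
        case vk_device_arch::AMD_RDNA2:
        case vk_device_arch::AMD_RDNA3: return 64;  // warp 32 disabled on RDNA3 per the commit
        default:                        return 32;
    }
}
```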
eab5606d7b
Apply suggestions from code review
2025-03-17 12:17:14 +01:00
de788e071b
Update examples/tts/tts.cpp
...
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-17 12:05:23 +01:00
f07690c930
vulkan: use fp32 in coopmat2 q4_k dequant function ( #12309 )
b4901
2025-03-17 10:43:35 +01:00
891c63956d
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking ( #12273 )
...
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
b4900
2025-03-17 10:41:59 +01:00
2f21123c1d
vulkan: Adjust coopmat2 tile sizes and selection heuristic ( #12258 )
b4899
2025-03-17 10:35:00 +01:00
374101fd74
cmake : enable building llama.cpp using system libggml ( #12321 )
...
* cmake: Factor out compiler flag function from ggml
llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).
* cmake: Enable building against system ggml
This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
b4898
2025-03-17 11:05:23 +02:00
b3c9a65673
SYCL: set extras only on GGML_TYPE_Q4_0 ( #12366 )
...
* SYCL: set extras only on GGML_TYPE_Q4_0
* release tensor_extras in reset buffer interface
b4897
2025-03-17 09:45:12 +08:00
8ba95dca20
llama : fix OLMo-2-0325-32B-Instruct K-norm size ( #12400 )
b4896
2025-03-16 19:46:36 +02:00
dc079cfdff
context : fix init of n_outputs ( #12397 )
...
ggml-ci
b4895
2025-03-16 19:29:36 +02:00
7b61bcc87c
ci : add --symlinks to xcframework zip command ( #12409 )
...
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to preserve symlinks in the
zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file, rather than as a symlink, which
causes the following error in Xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```
Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
2025-03-16 18:22:05 +01:00
f4c3dd5daa
llama-tts : add '-o' option ( #12398 )
...
* added -o option to specify an output file name
* llama-tts returns ENOENT in case of file write error
note : PR #12042 is closed as superseded by this one.
b4893
2025-03-15 17:23:11 +01:00
3d35d87b41
SYCL: Delete redundant plus sign and space ( #12391 )
b4892
2025-03-15 15:49:03 +01:00
b19bd064c0
SYCL : support non-contiguous tensors in binary ops (add, sub, etc) ( #12399 )
...
* sycl : support non-contiguous tensors in binary ops
* sycl : silence unused variable warning
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b4891
2025-03-15 22:19:30 +08:00
92a391327e
[CANN] MUL_MAT optimization ( #12382 )
2025-03-15 09:31:08 +08:00
624a683c6f
fix compile
2025-03-14 22:30:29 +01:00
116b9a1662
rename to init_from_text
2025-03-14 22:17:07 +01:00
9f2250ba72
Add CLI arg to llama-run to adjust the number of threads used ( #12370 )
...
We default to 4; sometimes we want to adjust this manually.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4889
2025-03-14 16:41:20 +00:00
eaffba0f2e
llama_batch_ext_ptr::from_text/embd
2025-03-14 17:12:03 +01:00
774973b8f3
main : add -sysf / --system-prompt-file ( #12249 ) ( #12250 )
...
* add system_prompt_file
* add -sysf / --system-prompt-file
* remove system_prompt_file
b4888
2025-03-14 16:57:05 +01:00
8fcb563613
Load all MoE experts during warmup ( #11571 )
...
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-14 13:47:05 +01:00
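A short sketch of how a caller would use the new llama_set_warmup() call around the warmup decode, assuming an existing llama_context and a prepared llama_batch; this follows the commit description rather than reproducing common.cpp.
```cpp
// Sketch: enable warmup mode so the decode routes tokens through all MoE
// experts (loading every expert's weights), then switch back to normal routing.
#include "llama.h"

static void warmup_moe(llama_context * ctx, llama_batch batch) {
    llama_set_warmup(ctx, true);    // all experts active during warmup
    llama_decode(ctx, batch);       // a single decode touches the expert weights
    llama_set_warmup(ctx, false);   // restore normal top-k expert routing
}
```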
8e7714fa77
fix compile
2025-03-14 11:28:15 +01:00
a363251fac
qwen2vl: use llama_batch_ext_set_pos
2025-03-14 11:25:36 +01:00
add2a3aa5a
server: fix "--grammar-file" parameter ( #12285 )
b4886
2025-03-14 11:21:17 +01:00
ba79369615
fix llama_batch_ext_init_from_embd
2025-03-14 11:17:22 +01:00
07d84fa3c2
fix missing n_past in various places
...
this is actually a revert of cda0e4b648
2025-03-14 10:47:08 +01:00
32940369d3
fix gemma3-cli
2025-03-14 10:33:28 +01:00