llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-14 04:17:53 -04:00

Author	SHA1	Message	Date
Oliver Simons	6028bf7435	CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (#15132 ) * Factor out `reduce_rows_f32` from common.cuh This increases iteration cycle speed by not having to recompile every kernel all the time * Hide memory-latency by loop unrolling in reduce_rows_f32 * Further optimizations to `reduce_rows_f32` 1. Increase threadblock size to better hide latency of memory requests. As a consequence of bigger threadblocks, do 2-step summation, using shared memory to communicate results between invocations 2. Use sum_temp array to reduce waits on sum 3. Adjust num_unroll to reflext bigger threadblock 4. Improve default block_dims, increase support for more block_dims * Add perf tests for `reduce_rows_f32` kernel * Add heuristic to toggle 128/512 threads based on sm count Break even point was the minimum of the following multiples. \| GPU Model \| Nrow SM Count Multiple \| \| ----------- \| ----------- \| \| RTX 4000 SFF ADA \| 2.0x \| \| RTX 6000 ADA \| 2.5x \| \| RTX PRO 6000 Blackwell Max-Q \| 3.04x \| \| RTX PRO 4500 Blackwell \| 3.15x \| * Ensure perf gains also for small ncols and large nrows Alternative to this, one could have also made the number of unrollings template-able, but that would require compiling the kernel multiple times, increasing binary size unnecessarily * Modify perf and unit-tests * Apply auto-formatting by clang * Fix CI build failure See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486 Building with VS generator worked though. * Remove sm_count property from `ggml_backend_cuda_context` Requested by @JohannesGaessler, and should fix remaining CI issues as a side-effect * Add CUB-based implementation for GGML_OP_MEAN Currently this branch is only executed for nrows==1 * Add heuristics to execute CUB branch only when it brings perf Heuristics were determined on the following HW: * RTX 4000 SFF ADA * RTX 6000 ADA * RTX PRO 6000 Blackwell Max-Q * RTX PRO 4500 Blackwell * Add unit-test for CUB-based mean Tests should run with CUDA Graphs enabled per default on NVGPUs * Rename `USE_CUB` to `GGML_CUDA_USE_CUB` Suggested by @JohannesGaessler * Unindent Preprocessor directives See https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506	2025-08-13 10:04:46 +02:00
Sachin Desai	3db4da56a5	chat : support Granite model reasoning and tool call (#14864 )	2025-08-06 20:27:30 +02:00
Sigbjørn Skjæret	65c797c4fa	chat : fix yandex chat template (#15116 )	2025-08-06 13:26:49 +02:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Sigbjørn Skjæret	f324a3b715	chat : only remove double bos/eos if added (#15086 ) * only remove double bos/eos if added * fix tests	2025-08-05 20:43:36 +02:00
Jhen-Jie Hong	f738989dcb	chat : fix multiple tool_calls on hermes-2-pro (#14962 )	2025-08-02 18:04:48 +08:00
Jeff Bolz	ec0b18802c	vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015 )	2025-08-02 10:48:30 +02:00
Georgi Gerganov	00131d6eaf	tests : update for LLAMA_SET_ROWS=1 (#14961 ) * test-thread-safety : each context uses a single sequence * embedding : handle --parallel argument ggml-ci * save-load : handle -np 1 ggml-ci * thread-safety : avoid overriding threads, reduce test case arg ggml-ci	2025-07-30 15:12:02 +03:00
Sigbjørn Skjæret	138b288b59	cuda : add softcap fusion (#14907 )	2025-07-29 14:22:03 +02:00
Leonard Mosescu	bda62193b2	test-backend-ops : extend test case filtering (#14865 ) * Extend test case filtering 1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This can be convenient when working on a set of ops, when you'd want to test them together (but without having to run every single op). For example: `test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"` 2. Support full test-case variation string in addition to basic op names. This would make it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (ex. a CUDA kernel), for example: `test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"` These two can be combined. As the current `-o`, this change doesn't try to detect/report an error if an filter doesn't name existing ops (ex. misspelled) * Updating the usage help text * Update tests/test-backend-ops.cpp	2025-07-28 18:04:27 +02:00
Erik Scholz	89d1029559	vulkan : add fp16 support for the conv_2d kernel (#14872 ) * add f16 to conv_2d testing * weaken conv2d test error threshold	2025-07-27 12:04:33 +02:00
Aman Gupta	446595b9b3	Docs: add instructions for adding backends (#14889 )	2025-07-27 09:36:43 +08:00
Georgi Gerganov	18f3b5ff9e	tests : add non-cont K,V FA tests ggml-ci	2025-07-23 14:08:09 +03:00
Aman Gupta	8c988fa41d	CUDA: add fused rms norm (#14800 )	2025-07-23 09:25:42 +08:00
Jeff Bolz	c2e058f1b4	vulkan/cuda: Fix im2col when KW!=KH (#14789 ) The tid is decomposed into "ow + kyOW + kxOW*KH". Change "ksize" to match.	2025-07-21 13:35:40 +02:00
Ervin Áron Tasnádi	a979ca22db	ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316 ) * ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan * ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with gemm (no need for im2col), * test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs, adds tests * * Performance fixes: minimized branch divergence, uses collectives to eliminate redundant calculation, macros removed. * Kernel shared memory size check * Updates test-backend-ops to support graphs for performance measurement. * * Apple/Win32 compile errors fixed * Subgroup size used to determine tile size -> fixes llvmpipe errors. * Collectives disabled by default. * Intel support is disabled as the performance is poor. * Conv2d enabled for Intel with disabled collectives, disabled for Apple * test-backend-ops modifications are reverted * Trailing spaces and missing override fixed. * Triggering pipeline relaunch. * Code formatted with .clang-format.	2025-07-19 21:59:08 +02:00
Georgi Gerganov	bf9087f59a	metal : fuse add, mul + add tests (#14596 ) ggml-ci	2025-07-18 20:37:26 +03:00
Georgi Gerganov	225e7a1438	llama : add high-throughput mode (#14363 ) * kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-16 16:35:42 +03:00
Tarek Dakhran	c31e60647d	tests : cover lfm2 cases in test_ssm_conv (#14651 )	2025-07-12 19:10:14 +02:00
Acly	3e303b1107	vulkan : implement ggml_roll (ggml/1290) ggml-ci	2025-07-12 14:25:44 +03:00
Aman Gupta	11ee0fea2a	Docs: script to auto-generate ggml operations docs (#14598 ) * Docs: script to auto-generate ggml operations docs * Review: formatting changes + change github action * Use built-in types instead of typing * docs : add BLAS and Metal ops --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-10 23:29:01 +08:00
compilade	a57d1bcb3c	cuda : support Falcon-H1 state size for SSM_SCAN (#14602 )	2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen	98bab638fb	ggml : add ggml_scale_bias (#14417 ) * ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code looks more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32	2025-07-09 18:16:12 +02:00
Georgi Gerganov	4d0dcd4a06	cuda : fix rope with partial rotation and non-cont src (#14580 ) * cuda : fix rope non-cont ggml-ci * cont : fix multi-rope + add test ggml-ci * sycl : try fix ggml-ci * cont : fix sycl + clean-up cuda ggml-ci	2025-07-08 10:15:21 +03:00
Jeff Bolz	e592be1575	vulkan: fix rms_norm+mul fusion (#14545 ) The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.	2025-07-06 10:08:16 +02:00
R0CKSTAR	b81510a7b7	test-backend-ops: add support for specifying output format (#14368 ) * test-backend-ops: add support for specifying output format Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Add build_commit and build_number in test_result Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * refactor Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Get build commit from ggml_commit() Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Merge errors into test_operation_info && address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * remove visitor nonsense * remove visitor comment Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> * Address review comments Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> --------- Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2025-07-05 12:10:53 +08:00
Johannes Gäßler	c8c4495b8d	ggml: backward pass for split swiglu (#14483 )	2025-07-03 17:05:18 +02:00
Georgi Gerganov	9067487c44	ggml : fix FA mask dim 2 and 3 (#14505 ) * ggml : fix FA mask dim 2 and 3 ggml-ci * backends : unsupport batched FA in CUDA and Vulkan ggml-ci * vulkan : disable FA for mask->ne[2] != 1	2025-07-03 10:46:57 +03:00
Georgi Gerganov	d4cdd9c1c3	ggml : remove kompute backend (#14501 ) ggml-ci	2025-07-03 07:48:32 +03:00
Aman Gupta	55c2646b45	CUDA: add dynamic shared mem to softmax, refactor general usage (#14497 )	2025-07-03 07:45:11 +08:00
compilade	5d46babdc2	llama : initial Mamba-2 support (#9126 ) * llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner. * llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when runnning Mamba-(1\|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON * cuda : graceful fallback for Mamba-1 models with weird embd size	2025-07-02 13:10:24 -04:00
Georgi Gerganov	ec68e84c32	ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435 ) ggml-ci	2025-07-02 15:48:33 +03:00
Jeff Bolz	6a746cf9c4	vulkan: Split large mul_mat_id to fit in shared memory (#14451 )	2025-07-01 10:43:08 +02:00
Acly	431b2c24f3	ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285) * add "align corners" mode for bilinear upscale, and allow downscaling * add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag * test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners	2025-07-01 11:06:39 +03:00
Diego Devesa	eb3fa2913e	test-backend-ops : disable llama test (#14461 )	2025-06-30 12:43:15 +02:00
Vedran Miletić	e9b6350e61	scripts : make the shell scripts cross-platform (#14341 )	2025-06-30 10:17:18 +02:00
Sigbjørn Skjæret	a0535ffa0d	ggml : implement REGLU/GEGLU/SWIGLU ops (#14158 ) * implement unary REGLU/GEGLU/SWIGLU cpu ops * relax constraints * duplicate shape of source * fix ggml_vec_geglu_f16 * special case gated ops * implement unary REGLU/GEGLU/SWIGLU cuda ops * tighten constraints again * refactor into GGML_GLU_OP * metal : add glu kernels ggml-ci * add CUDA_GLU_BLOCK_SIZE [no ci] * more constraints and use 64bit ints ggml-ci * 64bit multiplication [no ci] * implement swapped variants (cpu/cuda) * update comment [no ci] ggml-ci * Vulkan: Add GLU ops and shaders * SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate * ggml : implement GLU for split up/gate (#14181) * implement GLU for split up/gate * add tests for ggml_glu_split * Vulkan: Implement glu_split logic and shader support * add split to logging [no ci] * SYCL: refactor element_size ops and add split up and gate support to gated kernels * SYCL: switch GEGLU to use tanh approximation --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> * GGML: increase OP count in assertion * Refactor: Optimize SYCL element-wise operations with unary function inlining This commit refactors the SYCL element-wise operations to improve performance by: - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead. - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic. - Replacing direct kernel calls with calls to these inlined functions. - Using `__dpct_inline__` to encourage compiler inlining. - Minor code cleanup and consistency improvements. The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices. * vulkan: Increase workgroup size for GLU, for performance (#14345) * vulkan: Increase workgroup size for GLU, for performance * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup * merge fix * metal : add support for split and swap ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-06-29 11:04:10 +02:00
Jeff Bolz	bd9c981d72	vulkan: Add fusion support for RMS_NORM+MUL (#14366 ) * vulkan: Add fusion support for RMS_NORM+MUL - Add a use_count to ggml_tensor, so we can detect if an output is used more than once. - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor. - Add detection logic and basic fusion logic in ggml-vulkan. - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test. * extract some common fusion logic * fix -Winconsistent-missing-override * move ggml_can_fuse to a common function * build fix * C and C++ versions of can_fuse * move use count to the graph to avoid data races and double increments when used in multiple threads * use hash table lookup to find node index * change use_counts to be indexed by hash table slot * minimize hash lookups style fixes * last node doesn't need single use. fix type. handle mul operands being swapped. * remove redundant parameter --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-06-29 09:43:36 +02:00
Aman Gupta	27208bf657	CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361 ) * CUDA: add bf16 and f32 support to cublas_mul_mat_batched * Review: add type traits and make function more generic * Review: make check more explicit, add back comments, and fix formatting * Review: fix formatting, remove useless type conversion, fix naming for bools	2025-06-29 01:30:53 +08:00
Radoslav Gerganov	8d94219a4a	ggml : add ggml_set_rows (#14274 ) * ggml : add ggml_set_rows Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366 * use I64 for indices * ggml : add repeat impl for i64 * ggml : add ggml_is_contiguous_rows * ggml : ggml_set_rows support broadcast * ggml : ggml_set_rows support quantized dst ggml-ci * ggml : support GGML_TYPE_F32 ".from_float" trait * ggml : ggml_set_rows update comment + better index name * tests : add ggml_set_rows * metal : add ggml_set_rows implementation ggml-ci * ggml : simplify forward_dup_f32 * ggml : fix supports_op * tests : add comment to set_rows * ggml : leave the repeat_i64 for a separate PR ggml-ci * ggml : set_rows use std::min instead of MIN * ggml : better error message for set_rows unsupported type * metal : perform op->type check only once * tests : more consistent implementation + more tests ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-27 16:41:40 +03:00
Georgi Gerganov	e8215dbb96	metal : add special-case mat-vec mul for ne00 == 4 (#14385 ) ggml-ci	2025-06-26 15:51:19 +03:00
Aman Gupta	aa064b2eb7	CUDA: add mean operation (#14313 ) * CUDA: add mean operation * add back sum_rows_f32_cuda * Review: early exit if col!=0	2025-06-22 12:39:54 +08:00
Aman Gupta	c959f462a0	CUDA: add conv_2d_transpose (#14287 ) * CUDA: add conv_2d_transpose * remove direct include of cuda_fp16 * Review: add brackets for readability, remove ggml_set_param and add asserts	2025-06-20 22:48:24 +08:00
Diego Devesa	6adc3c3ebc	llama : add thread safety test (#14035 ) * llama : add thread safety test * llamafile : remove global state * llama : better LLAMA_SPLIT_MODE_NONE logic when main_gpu < 0 GPU devices are not used --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-16 08:11:43 -07:00
Sigbjørn Skjæret	d4e0d95cf5	chore : clean up relative source dir paths (#14128 )	2025-06-11 19:04:23 +02:00
Sigbjørn Skjæret	cc66a7f78f	tests : add test-tokenizers-repo (#14017 )	2025-06-11 17:16:32 +02:00
Diego Devesa	7f4fbe5183	llama : allow building all tests on windows when not using shared libs (#13980 ) * llama : allow building all tests on windows when not using shared libraries * add static windows build to ci * tests : enable debug logs for test-chat --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-09 20:03:09 +02:00
Ervin Áron Tasnádi	0d3984424f	ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813 ) * * ggml-vulkan: adds op CONV_TRANSPOSE_1D * test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D * Missing barrier added to shader. Number of additional tests reduced to 108. * * Fixes typo in variable name. * Removes extra whitespaces. * Adds int64->int32 casts to prevent possible warnings. * Problem size reduced in tests to pass tests with llvmpipe. * supports_op condition moved from unintended position	2025-06-04 22:02:00 +02:00
Olivier Chafik	c9bbc77931	`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933 ) * server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat * update unit/test_tool_call.py::test_thoughts	2025-06-02 10:15:44 -07:00
Johannes Gäßler	7675c555a1	gguf: fix failure on version == 0 (#13956 )	2025-06-01 18:08:05 +02:00

1 2 3 4 5 ...

437 Commits