Commit Graph

5752 Commits

SHA1  Message  Build  Date
62af464227 batch : fix check for empty sequences in memory (#14364)
* batch : fix check for empty sequences in memory

ggml-ci

* cont : reuse the var

ggml-ci
b5752
2025-06-24 18:26:30 +03:00
c148cf1946 cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362) b5751 2025-06-24 15:05:31 +02:00
1b809cee22 server : move no API key doc to /health (#14352) 2025-06-24 10:59:11 +02:00
abf241045d main : honor --verbose-prompt on interactive prompts (#14350) b5749 2025-06-24 09:31:00 +02:00
901e20bbe5 jinja : Add Mistral-Small-3.2-24B-Instruct-2506.jinja (#14349)
This allows the use of tools with llama-server.
2025-06-24 09:17:58 +03:00
0142961a2e CUDA/HIP: optimize mmv paths taken for HIP devices (#14324)
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5747
2025-06-24 01:12:56 +02:00
ce82bd0117 ci: add workflow for relocatable cmake package (#14346) 2025-06-23 15:30:51 -03:00
bf2a99e3cb vulkan: update windows SDK in release.yml (#14344) b5745 2025-06-23 15:44:48 +02:00
72c6bc3f3d llama : better rwkv chat template and add missing inputs.use_jinja setting (#14336)
* llama-cli : add missing `inputs.use_jinja` setting

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama : better legacy chat template for rwkv

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
b5744
2025-06-23 19:56:19 +08:00
defe2158dd CUDA: mul_mat_v support for batch sizes > 1 (#14262)
* CUDA: mul_mat_v support for batch sizes > 1

* use 64 bit math for initial offset calculation
b5743
2025-06-23 13:11:31 +02:00
7b50d589a8 kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
b5742
2025-06-23 12:27:35 +03:00
3a9457df96 vulkan: update windows SDK in CI (#14334) 2025-06-23 10:19:24 +02:00
fa4a9f2a1c quantize : handle user-defined pruning of whole layers (blocks) (#13037) b5740 2025-06-22 23:16:26 +02:00
238005c2dc gguf-py : fix SpecialVocab parsing when post_processor is null (#14330) 2025-06-22 19:46:17 +02:00
66aba7aca9 run : avoid double tokenization (#14327)
* run : avoid double tokenization by adopting common_tokenize heuristic

* build : fix windows gcc and clang warnings

* lint : fixed trailing whitespace

* run : fix is_first flag
b5738
2025-06-23 01:28:06 +08:00
f1f5e82df6 examples : fix is_first logic for tokenization (#14329)
ggml-ci
b5737
2025-06-22 20:10:07 +03:00
af3373f1ad HIP: enable vec fattn on RDNA4 (#14323) b5736 2025-06-22 16:51:23 +02:00
5d5c066de8 mtmd : fix Pixtral OOM with large images by capping image_size to 1024 (#14326)
Mistral Small 2506 models using the Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels, because
memory use grows steeply when the image size is left unbounded.

This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.
b5735
2025-06-22 14:44:57 +02:00
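The cap described in the commit message above can be illustrated with a small aspect-ratio-preserving clamp. This is only a sketch of the idea; the function name, rounding, and default cap below are hypothetical, not the actual mtmd implementation.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical sketch: clamp an image so its longest side does not exceed
// a fixed cap (1024 px here), preserving aspect ratio. The real mtmd code
// differs; this only illustrates the idea described in the commit message.
static std::pair<int, int> cap_image_size(int w, int h, int cap = 1024) {
    const int longest = std::max(w, h);
    if (longest <= cap) {
        return {w, h}; // already within the limit
    }
    const double scale = static_cast<double>(cap) / longest;
    return {
        std::max(1, static_cast<int>(w * scale)),
        std::max(1, static_cast<int>(h * scale)),
    };
}
```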
40bfa04c95 common : use std::string_view now that we target C++17 (#14319) b5734 2025-06-22 08:37:43 +03:00
aa064b2eb7 CUDA: add mean operation (#14313)
* CUDA: add mean operation

* add back sum_rows_f32_cuda

* Review: early exit if col!=0
b5733
2025-06-22 12:39:54 +08:00
aa0ef5c578 gguf-py : fix Qwen3-Embedding eos token (#14314) 2025-06-21 18:12:05 +02:00
bb16041cae Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. As a first step, compute pipelines are labeled.

* remove #ifdef for debug utils and add queue marker.
b5731
2025-06-21 08:17:12 +02:00
58cba76a9a gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312) 2025-06-21 07:33:21 +02:00
67ae5312e2 metal : fix thread-safety (#14300)
ggml-ci
b5729
2025-06-21 08:04:18 +03:00
692e3cdd0a memory : rename interface to llama_memory_context_i (#14296)
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
b5728
2025-06-21 08:03:46 +03:00
b23fa0b3f4 convert : fix Llama 4 conversion (#14311) 2025-06-21 06:32:01 +02:00
06cbedfca1 sync : ggml
ggml-ci
b5726
2025-06-20 21:02:47 +03:00
b7147673f2 Add ggml_roll (ggml/1274)
* ggml : add ggml_roll

* use set/get_op_params & std::min
2025-06-20 21:02:47 +03:00
d860dd99a4 docs : fix the link to llama.h (#14293) 2025-06-20 19:43:35 +02:00
c959f462a0 CUDA: add conv_2d_transpose (#14287)
* CUDA: add conv_2d_transpose

* remove direct include of cuda_fp16

* Review: add brackets for readability, remove ggml_set_param and add asserts
b5723
2025-06-20 22:48:24 +08:00
22015b2092 lint : remove trailing whitespace (#14304) b5722 2025-06-20 16:37:44 +02:00
dd6e6d0b6a vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : return INT32_MIN from llama_tokenize on overflow
b5721
2025-06-20 07:13:06 -07:00
8308f98c7f sycl: add usage of enqueue_functions extension (#14244)
* Add header and namespace to use enqueue_functions extension

* Convert submit and parallel_for to use new extension in convert.cpp

* Convert submit and parallel_for to use extension in ggml-sycl.cpp

* Convert submit and parallel_for to use extension in gla.cpp

* Convert submit and parallel_for in mmq.cpp

* Convert submit and parallel_for in mmvq.cpp

* Convert submit and parallel_for in remaining files

* Convert all simple parallel_for to nd_launch from enqueue_functions
extension

* Wrapping extension in general function

Create a general function that enables the enqueue_functions extension if
it is enabled in the compiler, and otherwise calls the generic SYCL function
to launch kernels.

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5720
2025-06-20 15:07:21 +02:00
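The wrapping pattern from the last bullet above is the usual compile-time dispatch idiom: one helper that takes the extension path when the compiler advertises it, and the portable path otherwise. The macro and function names below are invented stand-ins, not the real SYCL symbols.

```cpp
// Hypothetical sketch of the "wrap the extension in a general function"
// pattern: callers always use launch_kernel(); the fast path is compiled
// in only when the (invented) feature macro is defined by the compiler.
#ifdef HYPOTHETICAL_ENQUEUE_EXT
static int launch_kernel(int work_items) {
    return ext_nd_launch(work_items); // extension path (not defined here)
}
#else
static int launch_kernel(int work_items) {
    return work_items; // portable fallback launch (stand-in body)
}
#endif
```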
6369be0735 Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
* Add PowerPC feature detection and scoring

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC

* ggml-cpu: Delay some initializations until function is called

When using GGML_BACKEND_DL=ON, these initializations might use
instructions that are not supported by the current CPU.

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5719
2025-06-20 14:17:32 +02:00
88fc854b4b llama : improve sep token handling (#14272) b5718 2025-06-20 14:04:09 +02:00
e28c1b93fd cuda : synchronize graph capture and cublas handle destruction (#14288)
Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
b5717
2025-06-20 13:57:36 +02:00
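The race described above is the classic capture-vs-teardown interleaving, and one way to express the fix is a single lock serializing handle destruction against graph capture. The registry below is a hypothetical illustration; the real change lives in the CUDA backend and uses CUDA/cuBLAS calls.

```cpp
#include <mutex>

// Hypothetical sketch: one mutex serializes handle destruction against
// graph capture, so a handle cannot be torn down mid-capture from another
// thread. All names are invented for illustration.
struct handle_registry {
    std::mutex mtx;
    int live_handles = 0;

    void create()  { std::lock_guard<std::mutex> lk(mtx); ++live_handles; }
    void destroy() { std::lock_guard<std::mutex> lk(mtx); --live_handles; }

    // capture runs with the same lock held, so no destroy can interleave
    template <typename F> void capture(F && f) {
        std::lock_guard<std::mutex> lk(mtx);
        f(live_handles);
    }
};
```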
d27b3ca175 ggml : fix repack work size for mul_mat_id (#14292)
ggml-ci
b5716
2025-06-20 11:19:15 +03:00
9230dbe2c7 ggml: Update KleidiAI to v1.9.0 (#14277) b5715 2025-06-20 10:51:01 +03:00
812939a9e9 model : more uniform output id handling (#14275)
* model : more uniform output id handling

ggml-ci

* cont : revert n_outputs < n_tokens optimization

ggml-ci

* cont : fix out_ids initialization

ggml-ci
b5714
2025-06-20 10:50:27 +03:00
4c9fdfbe15 ubatch : new splitting logic (#14217)
ggml-ci
b5713
2025-06-20 10:14:14 +03:00
9eaa51e7f0 CUDA: add conv_2d_dw (#14265)
* CUDA: add conv_2d_dw

* better naming

* simplify using template

* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
b5712
2025-06-20 09:50:24 +08:00
8f71d0f3e8 ggml-cpu : remove unnecessary arm feature detection (#14281)
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old, largely non-functional code.
b5711
2025-06-19 21:24:14 +02:00
381174bbda gguf-py : make sentencepiece optional (#14200)
* Make sentencepiece optional

* Bump to 0.18.0

* Bump patch instead of minor

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
gguf-v0.17.1
2025-06-19 15:56:12 +02:00
d67341dc18 server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
b5709
2025-06-19 16:01:03 +03:00
456af35eb7 build : suppress gcc15 compile warnings (#14261)
* Change _contains_any() substrs to std::string_view and fix the find comparison logic.
b5708
2025-06-19 14:49:48 +02:00
600e3e9b50 sycl: Cleanup codepaths in Get Rows in sycl backend (#14215)
Cleans up an unused reorder path
b5707
2025-06-19 11:40:21 +01:00
fffcce535e llama-bench : add --no-warmup flag (#14224) (#14270)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

- Add no_warmup boolean field to cmd_params struct

- Add --no-warmup command-line argument parsing

- Add help text documentation for the new flag

- Wrap existing warmup logic in conditional check

- Maintain full backward compatibility (warmup enabled by default)

Addresses #14224
b5706
2025-06-19 12:24:12 +02:00
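The four bullets above describe a standard flag-wiring change: a struct field, a CLI parser case, and the warmup wrapped in a conditional. A minimal sketch of that shape, with llama-bench's real cmd_params reduced to a single invented field:

```cpp
#include <cstring>

// Hypothetical sketch of the --no-warmup wiring: warmup stays enabled by
// default, and the flag only flips a boolean that gates the warmup run.
struct cmd_params {
    bool no_warmup = false; // backward compatible: warmup on by default
};

static cmd_params parse_args(int argc, const char ** argv) {
    cmd_params p;
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--no-warmup") == 0) {
            p.no_warmup = true;
        }
    }
    return p;
}

// Returns the number of runs performed (warmup + timed), as a stand-in
// for the real benchmark loop.
static int run_bench(const cmd_params & p) {
    int runs = 0;
    if (!p.no_warmup) {
        ++runs; // warmup run, skipped when --no-warmup is given
    }
    ++runs; // timed run
    return runs;
}
```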
5fc7856815 convert : fix remote option in Windows (#14100) 2025-06-19 12:21:40 +02:00
faed5a5f5d llamafile : support s390x SIMD instruction set (#14273) b5704 2025-06-19 11:48:54 +02:00
10bb545c5b Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249) b5703 2025-06-19 09:15:42 +02:00