36f8e20d08
kv-cache : utilize ggml_set_rows broadcast
ggml-ci
2025-06-23 13:22:51 +03:00
332f073589
cont : support non-contiguous slots
ggml-ci
2025-06-23 13:22:47 +03:00
39d0b1e8df
cont : kv-cells cp/set for non-cont slots
ggml-ci
2025-06-23 13:21:37 +03:00
f875d6cb72
cont : migrate to using set of indices instead of slot head
ggml-ci
2025-06-23 13:21:36 +03:00
db2bb378b1
cont : gate the ggml_set_rows usage with env var
ggml-ci
2025-06-23 13:21:36 +03:00
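A minimal sketch of what such env-var gating typically looks like; the variable name LLAMA_SET_ROWS is an assumption for illustration, not confirmed by this log:

```cpp
// Hedged sketch: gate the new ggml_set_rows path behind an environment
// variable so it can be toggled without rebuilding. Name is hypothetical.
#include <cstdlib>

static bool set_rows_enabled() {
    const char * v = std::getenv("LLAMA_SET_ROWS"); // hypothetical name
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}
```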
79dac3c861
kv-cache : use ggml_set_rows
ggml-ci
2025-06-23 13:21:36 +03:00
1f647b5992
ggml : fix supports_op
2025-06-23 13:21:36 +03:00
eba97574da
ggml : simplify forward_dup_f32
2025-06-23 13:21:36 +03:00
c0cfc2f78b
metal : add ggml_set_rows implementation
ggml-ci
2025-06-23 13:21:36 +03:00
828e5d2fcd
tests : add ggml_set_rows
2025-06-23 13:21:35 +03:00
e73690a69d
ggml : ggml_set_rows update comment + better index name
2025-06-23 13:21:35 +03:00
e89709721b
ggml : support GGML_TYPE_F32 ".from_float" trait
2025-06-23 13:21:35 +03:00
630c84a2bd
ggml : ggml_set_rows support quantized dst
ggml-ci
2025-06-23 13:21:35 +03:00
df71c803b4
ggml : ggml_set_rows support broadcast
2025-06-23 13:21:35 +03:00
313a444b22
ggml : add ggml_is_contiguous_rows
2025-06-23 13:21:35 +03:00
695b6b7025
ggml : add repeat impl for i64
2025-06-23 13:21:34 +03:00
f2cd962fe2
use I64 for indices
2025-06-23 13:21:34 +03:00
c1a581a10b
ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.
ref: #8366
2025-06-23 13:21:32 +03:00
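A hedged usage sketch of the op as described above, with I64 indices per the "use I64 for indices" commit; the tensor shapes are illustrative:

```cpp
#include "ggml.h"

// Scatter 2 source rows into an 8-row destination: the result is a with row
// c[i] replaced by b[i]. Graph building and computation are omitted.
static struct ggml_tensor * example_set_rows(struct ggml_context * ctx) {
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8); // dst rows
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 2); // src rows
    struct ggml_tensor * c = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, 2);     // row indices

    // after the graph is computed, row b[i] lands at a[c[i]]
    return ggml_set_rows(ctx, a, b, c);
}
```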
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
b5742
2025-06-23 12:27:35 +03:00
3a9457df96
vulkan: update windows SDK in CI ( #14334 )
2025-06-23 10:19:24 +02:00
fa4a9f2a1c
quantize : handle user-defined pruning of whole layers (blocks) ( #13037 )
b5740
2025-06-22 23:16:26 +02:00
238005c2dc
gguf-py : fix SpecialVocab parsing when post_processor is null ( #14330 )
2025-06-22 19:46:17 +02:00
66aba7aca9
run : avoid double tokenization ( #14327 )
* run : avoid double tokenization by adopting common_tokenize heuristic
* build : fix windows gcc and clang warnings
* lint : fixed trailing whitespace
* run : fix is_first flag
b5738
2025-06-23 01:28:06 +08:00
f1f5e82df6
examples : fix is_first logic for tokenization ( #14329 )
ggml-ci
b5737
2025-06-22 20:10:07 +03:00
af3373f1ad
HIP: enable vec fattn on RDNA4 ( #14323 )
b5736
2025-06-22 16:51:23 +02:00
5d5c066de8
mtmd : fix Pixtral OOM with large images by capping image_size to 1024 ( #14326 )
Mistral Small 2506 models using the Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels, due to
unbounded memory growth with image size.
This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.
b5735
2025-06-22 14:44:57 +02:00
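The shape of such a cap, as an illustrative sketch (the helper name is invented; the actual clamping lives in the mtmd/clip code):

```cpp
#include <algorithm>
#include <cstdint>

// Scale the longer image side down to `cap` while preserving aspect ratio;
// a hypothetical stand-in for the 1024x1024 limit described above.
static void clamp_image_size(int64_t & w, int64_t & h, int64_t cap = 1024) {
    const int64_t longest = std::max(w, h);
    if (longest > cap) {
        w = std::max<int64_t>(1, w * cap / longest);
        h = std::max<int64_t>(1, h * cap / longest);
    }
}
```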
40bfa04c95
common : use std::string_view now that we target c++17 ( #14319 )
b5734
2025-06-22 08:37:43 +03:00
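The kind of change this enables, sketched on an invented helper: read-only string parameters no longer force a std::string allocation at call sites:

```cpp
#include <string_view>

// before: static bool starts_with(const std::string & s, const std::string & prefix);
static bool starts_with(std::string_view s, std::string_view prefix) {
    return s.substr(0, prefix.size()) == prefix; // substr clamps; no allocation
}
```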
aa064b2eb7
CUDA: add mean operation ( #14313 )
* CUDA: add mean operation
* add back sum_rows_f32_cuda
* Review: early exit if col!=0
b5733
2025-06-22 12:39:54 +08:00
aa0ef5c578
gguf-py : fix Qwen3-Embedding eos token ( #14314 )
2025-06-21 18:12:05 +02:00
bb16041cae
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. ( #13792 )
* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. As a first step, compute pipelines are labeled.
* remove #ifdef for debug utils and add queue marker.
b5731
2025-06-21 08:17:12 +02:00
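A hedged sketch of the labeling mechanism (standard VK_EXT_debug_utils usage; the actual helper in ggml-vulkan may be structured differently):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Attach a human-readable name to a compute pipeline so it shows up in
// validation-layer output and tools such as RenderDoc.
static void label_pipeline(VkInstance instance, VkDevice device,
                           VkPipeline pipeline, const char * name) {
    auto pfn = (PFN_vkSetDebugUtilsObjectNameEXT)
        vkGetInstanceProcAddr(instance, "vkSetDebugUtilsObjectNameEXT");
    if (pfn == nullptr) {
        return; // VK_EXT_debug_utils not enabled
    }
    VkDebugUtilsObjectNameInfoEXT info = {};
    info.sType        = VK_STRUCTURE_TYPE_DEBUG_UTILS_OBJECT_NAME_INFO_EXT;
    info.objectType   = VK_OBJECT_TYPE_PIPELINE;
    info.objectHandle = (uint64_t) pipeline;
    info.pObjectName  = name;
    pfn(device, &info);
}
```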
58cba76a9a
gguf-py : fix TemplateProcessing pair when bos/eos is missing ( #14312 )
2025-06-21 07:33:21 +02:00
67ae5312e2
metal : fix thread-safety ( #14300 )
ggml-ci
b5729
2025-06-21 08:04:18 +03:00
692e3cdd0a
memory : rename interface to llama_memory_context_i ( #14296 )
* memory : rename interface to llama_memory_context_i
ggml-ci
* cont : fix comments
* cont : use "mctx" for referencing a memory context
ggml-ci
b5728
2025-06-21 08:03:46 +03:00
b23fa0b3f4
convert : fix Llama 4 conversion ( #14311 )
2025-06-21 06:32:01 +02:00
06cbedfca1
sync : ggml
ggml-ci
b5726
2025-06-20 21:02:47 +03:00
b7147673f2
Add ggml_roll (ggml/1274)
* ggml : add ggml_roll
* use set/get_op_params & std::min
2025-06-20 21:02:47 +03:00
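A one-line usage sketch; the four per-dimension-shift signature is assumed from the ggml header, with elements wrapping around as in np.roll:

```cpp
#include "ggml.h"

// Rotate a tensor by one element along dim 0, wrapping at the boundary.
// Signature assumed: ggml_roll(ctx, a, shift0, shift1, shift2, shift3).
static struct ggml_tensor * roll_dim0(struct ggml_context * ctx, struct ggml_tensor * a) {
    return ggml_roll(ctx, a, 1, 0, 0, 0);
}
```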
d860dd99a4
docs : fix the link to llama.h ( #14293 )
2025-06-20 19:43:35 +02:00
c959f462a0
CUDA: add conv_2d_transpose ( #14287 )
* CUDA: add conv_2d_transpose
* remove direct include of cuda_fp16
* Review: add brackets for readability, remove ggml_set_param and add asserts
b5723
2025-06-20 22:48:24 +08:00
22015b2092
lint : remove trailing whitespace ( #14304 )
b5722
2025-06-20 16:37:44 +02:00
dd6e6d0b6a
vocab : prevent tokenizer overflow ( #14301 )
* vocab : prevent stack overflow in tokenize
* vocab : return error instead of aborting on oversized token count
* vocab : return INT32_MIN from llama_tokenize on overflow
b5721
2025-06-20 07:13:06 -07:00
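A hedged caller-side sketch of the new contract: a negative return reports the required token count (negated), while INT32_MIN signals a count that cannot be represented; the llama_tokenize signature is assumed from the current llama.h:

```cpp
#include "llama.h"
#include <climits>
#include <string>
#include <vector>

static bool tokenize_safe(const llama_vocab * vocab, const std::string & text,
                          std::vector<llama_token> & out) {
    int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                               out.data(), (int32_t) out.size(),
                               /*add_special*/ true, /*parse_special*/ false);
    if (n == INT32_MIN) {
        return false; // token count overflows int32_t - treat as a hard error
    }
    if (n < 0) {
        out.resize((size_t) -n); // buffer too small: grow and retry once
        n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                           out.data(), (int32_t) out.size(), true, false);
    }
    if (n < 0) {
        return false;
    }
    out.resize((size_t) n);
    return true;
}
```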
8308f98c7f
sycl: add usage of enqueue_functions extension ( #14244 )
* Add header and namespace to use enqueue_functions extension
* Convert submit and parallel_for to use new extension in convert.cpp
* Convert submit and parallel_for to use extension in ggml-sycl.cpp
* Convert submit and parallel_for to use extension in gla.cpp
* Convert submit and parallel_for in mmq.cpp
* Convert submit and parallel_for in mmvq.cpp
* Convert submit and parallel_for in remaining files
* Convert all simple parallel_for to nd_launch from enqueue_functions
extension
* Wrap the extension in a general function
Create a general function that uses the enqueue_functions extension if it
is enabled in the compiler, and otherwise calls the standard SYCL function
to launch kernels.
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5720
2025-06-20 15:07:21 +02:00
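The "general function" described above, sketched under the assumption that the compiler exposes the sycl_ext_oneapi_enqueue_functions feature-test macro and nd_launch entry point:

```cpp
#include <sycl/sycl.hpp>

// Use the enqueue_functions extension when available, otherwise fall back to
// the standard parallel_for. Macro and namespace names are assumptions based
// on the extension proposal.
template <int D, typename K>
static void launch_kernel(sycl::queue & q, sycl::nd_range<D> range, K kernel) {
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS
    sycl::ext::oneapi::experimental::nd_launch(q, range, kernel);
#else
    q.parallel_for(range, kernel);
#endif
}
```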
6369be0735
Implement GGML_CPU_ALL_VARIANTS for PowerPC ( #14286 )
* Add PowerPC feature detection and scoring
* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
* ggml-cpu: Delay some initializations until the function is called
When using GGML_BACKEND_DL=ON, these initializations might use
instructions that are not supported by the current CPU.
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5719
2025-06-20 14:17:32 +02:00
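An illustrative sketch of the delayed-initialization point (names are invented): with GGML_BACKEND_DL=ON, building dispatch tables at library-load time could execute instructions the running CPU lacks, so construction moves to first use:

```cpp
struct kernel_table {
    // function pointers selected with ISA-specific code at build time
};

static kernel_table build_kernel_table() {
    return kernel_table{}; // placeholder body for the sketch
}

// Function-local static: runs on the first call, i.e. only after the backend
// loader has already confirmed this variant's features are supported.
static const kernel_table & get_kernel_table() {
    static const kernel_table table = build_kernel_table();
    return table;
}
```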
88fc854b4b
llama : improve sep token handling ( #14272 )
b5718
2025-06-20 14:04:09 +02:00
e28c1b93fd
cuda : synchronize graph capture and cublas handle destruction ( #14288 )
Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
b5717
2025-06-20 13:57:36 +02:00
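One plausible shape for such a workaround, sketched with an invented mutex; the actual synchronization in ggml-cuda may be structured differently:

```cpp
#include <mutex>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static std::mutex g_graph_mutex; // invented guard for the sketch

static void capture_graph(cudaStream_t stream, cudaGraph_t & graph) {
    std::lock_guard<std::mutex> lock(g_graph_mutex);
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    // ... enqueue the cuBLAS/ggml work to be captured ...
    cudaStreamEndCapture(stream, &graph);
}

static void destroy_cublas_handle(cublasHandle_t handle) {
    std::lock_guard<std::mutex> lock(g_graph_mutex);
    cublasDestroy(handle); // can no longer race an in-flight capture
}
```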
d27b3ca175
ggml : fix repack work size for mul_mat_id ( #14292 )
ggml-ci
b5716
2025-06-20 11:19:15 +03:00
9230dbe2c7
ggml: Update KleidiAI to v1.9.0 ( #14277 )
b5715
2025-06-20 10:51:01 +03:00
812939a9e9
model : more uniform output id handling ( #14275 )
* model : more uniform output id handling
ggml-ci
* cont : revert n_outputs < n_tokens optimization
ggml-ci
* cont : fix out_ids initialization
ggml-ci
b5714
2025-06-20 10:50:27 +03:00
4c9fdfbe15
ubatch : new splitting logic ( #14217 )
ggml-ci
b5713
2025-06-20 10:14:14 +03:00
9eaa51e7f0
CUDA: add conv_2d_dw ( #14265 )
* CUDA: add conv_2d_dw
* better naming
* simplify using template
* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
b5712
2025-06-20 09:50:24 +08:00
8f71d0f3e8
ggml-cpu : remove unnecessary arm feature detection ( #14281 )
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
b5711
2025-06-19 21:24:14 +02:00