llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-01 15:09:32 -04:00

Author	SHA1	Message	Date
Xuan-Son Nguyen	7c727fbe39	arg : add --no-mmproj-offload (#13093 ) * arg : add --no-mmproj-offload * Update common/arg.cpp b5178	2025-04-24 14:04:14 +02:00
Xuan-Son Nguyen	80982e815e	arg : clean up handling --mmproj with -hf (#13082 ) * arg : clean up handling --mmproj with -hf * rm change about no_mmproj * Revert "rm change about no_mmproj" This reverts commit `2cac8e0efb`. * handle no_mmproj explicitly * skip download mmproj on examples not using it b5177	2025-04-24 12:14:13 +02:00
Georgi Gerganov	7604a7d6b8	metal : fix floating-point range of attention scores in FA kernels (#13090 ) ggml-ci b5176	2025-04-24 10:38:30 +03:00
Eve	b3b6d862cf	vulkan: matmul gcn tuning (#13016 ) * tune matmul for gcn * this one is more power efficient * Update ggml/src/ggml-vulkan/ggml-vulkan.cpp Co-authored-by: 0cc4m <picard12@live.de> * disable this tune for the proprietary driver --------- Co-authored-by: 0cc4m <picard12@live.de> b5175	2025-04-24 09:18:33 +02:00
pl752	5630406959	llama-mtmd-cli: Sigint rework in mtmd vision example (#13080 ) * Sigint rework in mtmd vision example * Applied suggestions on mtmd-cli PR * Forgot to invert one of the conditions * Update examples/llava/mtmd-cli.cpp * Removed redundant exit check --------- Co-authored-by: pl752 <maximpl752@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b5174	2025-04-23 23:32:35 +02:00
Xuan-Son Nguyen	ecda2ec4b3	mtmd : Support Pixtral 12B (#13065 ) * add pixtral text model (vision is wip) * cgraph ok, just missing 2D RoPE * fix bad rebase * first working version * fix problem with img_break token * support dynamic image size * update docs * update test script b5173	2025-04-23 20:21:59 +02:00
piDack	eb1776b15a	convert : Append mult-eos,half-rope,bos to GLM4-0414 and Z (#13021 ) * append mult-eos,half-rope,bos to GLM4-0414 * remove unset var	2025-04-23 16:59:14 +02:00
Radoslav Gerganov	2cca6c01e4	rpc : add command line option for number of threads for the CPU backend (#13060 ) closes #13051 b5171	2025-04-23 10:32:49 +03:00
Johannes Gäßler	658987cfc9	CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014 ) * CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID * fix logic for RoPE support, CUDA graphs b5170	2025-04-22 21:27:40 +02:00
Xuan-Son Nguyen	dc39a5e7a8	mtmd : support SmolVLM (version 1 and 2) (#13050 ) * mtmd : support SmolVLM (version 1 and 2) * correct chat template * fix n_patches * scale_factor is an int * add more models to test b5169	2025-04-22 16:24:54 +02:00
Georgi Gerganov	ab47dec3d3	security : add note about RPC and server functionality (#13061 ) * security : add note about RPC functionality * security : add note about llama-server	2025-04-22 16:16:10 +03:00
Georgi Gerganov	7b53389c24	metal : add memory pool for temp allocs (#12850 ) * metal : add memory pool for temp allocs (wip) [no ci] * cont : free buffers from the heap * cont : resize heap [no ci] * cont : refactor heap [no ci] * cont : heap for each cmd buffer [no ci] * cont : fix free * wip * cont : fix alignment [no ci] * cont : not working .. [no ci] * cont : heap allocation now works [no ci] * cont : use MTLHeapTypePlacement ggml-ci * metal : use dynamic MTLHeap allocations ggml-ci * metal : add comments * metal : disable softmax use of mem_pool ggml-ci * metal : final touches	2025-04-22 16:15:51 +03:00
Xuan-Son Nguyen	243453533e	llava : update documentations (#13055 ) * llava : update documentations * fix typo b5166	2025-04-22 10:37:00 +02:00
Diego Devesa	1d735c0b4f	ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871 ) * ggml : add SSE 4.2 variant for CPUs without AVX * ggml : add x64 base ABI variant b5165	2025-04-21 18:13:51 +02:00
Akarshan Biswas	5368ddda7a	SYCL: Add non-contiguous support in ROPE (#12993 ) ggml-ci b5164	2025-04-21 19:13:30 +05:30
Xuan-Son Nguyen	84a9bf2fc2	mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012 ) * mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli` * support for minicpmv * remove cpp files of llava and minicpmv * update hot topics * mtmd : add not supported msg for qwen2vl * Update examples/llava/mtmd.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b5163	2025-04-21 15:32:58 +02:00
Xuan-Son Nguyen	2016f07bd1	convert : experimental support for `--mmproj` flag (#13023 ) * convert : experimental support for `--mmproj` flag * fix bad ctrl+f replace * fix style * split into subclasses TextModel and VisionModel * rename Mode --> ModelBase * small fix * correct CLIP_VISION arch name (because existing GGUF already use it) * Apply suggestions from code review Co-authored-by: compilade <git@compilade.net> * fix Mistral3Model * fix typo Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net> b5162	2025-04-20 23:29:36 +02:00
Jeffrey Morgan	6602304814	llava: fix errors in clip.h on certain compilers (#13030 ) b5161	2025-04-20 12:15:41 +02:00
Jeff Bolz	66168204be	vulkan: support noncontiguous rms_norm (#13031 ) b5160	2025-04-20 10:50:02 +02:00
Jeffrey Morgan	4ba9d711ba	metal: add neg operator (#13029 ) b5159	2025-04-20 08:28:40 +03:00
bandoti	00137157fc	Disable CI cross-compile builds (#13022 ) b5158	2025-04-19 18:05:03 +02:00
Sigbjørn Skjæret	fb28f4f80e	gguf-py : fix upload python package workflow (#13020 ) gguf-v0.16.2	2025-04-19 16:26:38 +02:00
Xuan-Son Nguyen	37b9f0d29d	clip : refactor, add `image_manipulation` and `llava_uhd` classes (#13011 ) * clip : refactor, add `image_manipulation` and `llava_uhd` * refactor llava-1.6 preprocessing * simplify logic for llava-1.5 * missing include b5156	2025-04-19 09:15:45 +02:00
Daniel Tang	6408210082	main : Fix Ctrl+D/newline handling (#12951 ) This restores the behavior from #491. This does not affect Ctrl+D's ability to terminate --multiline-input lines (#1040). This also actually implements #587: "If the user wants the text to end in a newline, this should be accomplished by explicitly adding a newline by using \ followed by return, then returning control by pressing return again." Fixes #12949 b5155	2025-04-18 22:02:55 +02:00
Chris Thompson	aff9d107b0	gguf-py : GGUF Editor GUI - Python + Qt6 (#12930 ) gguf-v0.16.1	2025-04-18 20:30:41 +02:00
Xuan-Son Nguyen	35370ba945	server : use std::move whenever possible (#12936 ) * server : use std::move whenever possible * use r-value ref * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * make task creation scoped * restore std::move * fix task_id not set correctly * apply changes from suggestion Co-authored-by: ggerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b5153	2025-04-18 19:58:12 +02:00
Akarshan Biswas	8d66005763	SYCL: Refactor and enable FP16 in binary broadcast OPs (#12975 ) * SYCL: refactor move to a separate file * Fix binbcast * Remove duplicates * fix include formatting * fix typo b5152	2025-04-18 15:57:56 +02:00
Xuan-Son Nguyen	b9154ecff9	mtmd : add methods to access `mtmd_image_tokens` (#12906 ) * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * fix prompt_modified * rm redundant data member b5151	2025-04-18 10:04:51 +02:00
Radoslav Gerganov	2db9ba1464	rpc : add RPC_CMD_HELLO (#12955 ) Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org. Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like #12465 b5150	2025-04-18 10:13:42 +03:00
Georgi Gerganov	2f74c354c0	graph : make FA compatible with MLA + add initial Metal kernels (#12953 ) * graph : make mla compatible with FA * metal : add exp FA kernels for DeepSeek models ggml-ci * llama : minor naming updates ggml-ci * ggml : disable FA for DS head sizes * tests : add FA tests for MLA shapes ggml-ci b5149	2025-04-17 18:16:36 +03:00
Alan Gray	207c22ec2d	ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970 ) b5148	2025-04-17 15:19:42 +02:00
hipudding	7a395f67a7	CANN: Add support for async operator submission (#12864 ) Submit operators using asynchronous threads to improve performance. Use the environment variable GGML_CANN_ASYNC_MODE to control whether asynchronous submission is enabled. It is disabled by default. Testing shows a 10%–20% performance improvement in scenarios with small parameter sizes, especially in quantized models. b5147	2025-04-17 20:34:16 +08:00
Mikko Juola	971f245b3b	llama : recognize IBM Granite 3.3 FIM tokens (#12988 ) The Granite's FIM tokens are very similar to Qwen's; it's just that they use underscore instead of a dash. So <fim_middle> for example instead of <fim-middle>. Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base shows: ``` "<fim_prefix>", "<fim_middle>", "<fim_suffix>", "<fim_pad>", ... "<reponame>", ``` b5146	2025-04-17 11:37:05 +03:00
kimminsu	12b17501e6	opencl: fix incorrect local_size index in profiling log (#12868 ) b5145	2025-04-16 14:25:57 -07:00
Jeff Bolz	015022bb53	vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931 ) The grouped query attention optimization doesn't require a power of two ratio, the only thing relying on it was the modulo operation written as bitwise &. split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting. b5144	2025-04-16 20:37:25 +02:00
Chenguang Li	b43d89e311	CANN: Add 310P operator support check (#12962 ) b5143	2025-04-16 16:21:05 +08:00
lhez	80f19b4186	opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886 ) * opencl: refactor - split the kernel files --------- Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> * opencl: split more kernels into separate files * opencl: specify subgroup size instead of querying it * opencl: refine Adreno cl compiler version parsing * opencl: skip some kernels not used by Adreno on old compilers * opencl: refine logic for selecting Adreno kernels * opencl: refine Adreno cl compiler version * opencl: cleanup preprocessor for kernels * opencl: consider Adreno CL compiler on Windows * opencl: add final newline for `mul_mv_f16_f16.cl` --------- Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> b5142	2025-04-15 12:26:00 -07:00
Georgi Gerganov	f8f820cc4d	metal : add FA-vec kernels for head size 96 (#12952 ) ggml-ci b5141	2025-04-15 14:45:05 +03:00
hipudding	54a7272043	CANN: Add x86 build ci (#12950 ) * CANN: Add x86 build ci * CANN: fix code format b5140	2025-04-15 12:08:55 +01:00
David Huang	84778e9770	CUDA/HIP: Share the same unified memory allocation logic. (#12934 ) Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.	2025-04-15 11:20:38 +02:00
Akarshan Biswas	510676475f	SYCL: Add ROPE vision kernel (#12887 ) * SYCL: Add ROPE vision kernel * Add comment about rope mode b5138	2025-04-15 10:37:42 +02:00
Juk Armstrong	daa422881a	llama : DeepSeek V2/V3 MLA implementation (#12801 ) * Merged using squash to remove all noise commit messages * Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large * Removed 3 conts (2x RoPE and 1x RMS-norm) * Changed to use `<cmath>` instead of `<math.h>` * Reverted removal of the 3 conts * Used `reshape` in `llm_graph_context::build_attn_mha()` * Use `k_pe = ggml_reshape` * Removed the 3 conts again * Removed the 3D views of `wk_b` and `wv_b`, and just save and 3D in GGUF * Removed MQA optimisation from `build_attn_mha()` as no gains now * Simplified `is_mla` branch in `llm_build_deepseek2()` * Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls * Fixed call to `build_attn` in `llm_build_t5_enc` b5137	2025-04-15 09:49:57 +03:00
Srihari-mcw	eccc7a1602	ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (#12829 ) * Add AVX512 implementation of GEMM - q4kx8 * Update changes to remove unnecessary whitespaces b5136	2025-04-15 09:22:36 +03:00
Chenguang Li	0019279bb5	CANN: Opt ROPE optimization (#12865 ) * [CANN]Opt ROPE optimization * [CANN]Codestyle adjustment * [CANN]Fix the ROPE precision issue * [CANN]codestyle fix * [CANN]add rope unsupport case Signed-off-by: noemotiovon <noemotiovon@gmail.com> b5135	2025-04-15 10:09:35 +08:00
Xinpeng Dou	b0c75ac9f9	CANN: Optimize CANN buffer pool memory management (#12875 ) Multiple optional memory pools are provided for CANN, including VMM, priority queue-based, and traditional memory pools. 1. When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL is not defined, the VMM pool is selected by default. 2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined, the priority queue-based memory pool is used. 3. If neither condition is met, the default memory pool is used. b5134	2025-04-15 10:04:24 +08:00
Russyyds	d6d2c2ab8c	Add performance print for gemma3 in example (#12929 ) b5133	2025-04-14 19:18:20 +02:00
Akarshan Biswas	75afa0ae31	SYCL: Fix im2col (#12910 ) * SYCL: Fix im2col * restore local workgroup size adjustments for large inputs * restore format b5132	2025-04-14 14:23:53 +02:00
Radoslav Gerganov	c772d54926	rpc : use ggml_context_ptr (#12938 ) b5131	2025-04-14 13:59:34 +03:00
Neo Zhang Jianyu	81c7e64fc2	disable curl lib check; this action was missed by commit `bd3f59f812` (#12761 ) (#12937 )	2025-04-14 18:19:07 +08:00
Georgi Gerganov	526739b879	sync : ggml ggml-ci b5129	2025-04-14 09:26:15 +03:00
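
Note on commit 015022bb53 (vulkan coopmat2 FA): the message says the only thing tying the grouped query attention path to a power of two ratio was a modulo written as bitwise &. The sketch below is not the shader code from that commit; it is a minimal standalone C++ illustration, with hypothetical helper names (is_pow2, mod_mask, mod_general), of why the masked form only matches the true remainder when the divisor is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; none of these names come from llama.cpp.

// True when n is a non-zero power of two.
constexpr bool is_pow2(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Masked form: equals i % n only when n is a power of two.
constexpr uint32_t mod_mask(uint32_t i, uint32_t n) { return i & (n - 1); }

// General form: valid for any non-zero n.
constexpr uint32_t mod_general(uint32_t i, uint32_t n) { return i % n; }

// Power-of-two divisor (4): both forms agree.
static_assert(mod_mask(10, 4) == mod_general(10, 4), "mask matches % for n = 4");

// Non-power-of-two divisor (6): the mask gives a different, wrong remainder.
static_assert(mod_mask(7, 6) != mod_general(7, 6), "mask differs from % for n = 6");
```

Replacing the mask with a plain remainder is the kind of change that lifts a power of two restriction, at the cost of a slightly more expensive operation.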

5178 Commits