233461f812
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) ( #13264 )
...
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering
* revert: sampler ordering
* revert: VS' crappy auto-formatting
* revert: VS' crappy auto-formatting pt.2
* revert: my crappy eye sight...
* sampling: add XTC to Top-nσ sampler chain
* sampling: add Dyna. Temp. to Top-nσ sampler chain
* sampling: actually remove Top-nσ from sampler (oops)
* Integrate top_n_sigma into main sampler chain
* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA
* Formatting
* Lint
* Exit early in the sampler if nsigma < 0
---------
Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
b5286
2025-05-05 22:12:19 +02:00
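For reference, the Top-nσ sampler integrated above keeps only candidates whose logits lie within n standard deviations of the maximum logit. Below is a minimal, standalone C++ illustration of that idea (the function name and layout are hypothetical, not the llama.cpp sampler code); it also mirrors the "exit early if nsigma < 0" behaviour noted in the commit.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative sketch of Top-nσ filtering (assumes a non-empty logits vector):
// keep only tokens whose logit is within n standard deviations of the max logit.
static std::vector<int> top_n_sigma_filter(const std::vector<float> & logits, float n) {
    std::vector<int> kept;
    if (n < 0.0f) { // mirror the "exit early if nsigma < 0" behaviour: filter disabled
        for (int i = 0; i < (int) logits.size(); ++i) kept.push_back(i);
        return kept;
    }
    const float max_l = *std::max_element(logits.begin(), logits.end());
    float mean = 0.0f;
    for (float l : logits) mean += l;
    mean /= (float) logits.size();
    float var = 0.0f;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / (float) logits.size());
    for (int i = 0; i < (int) logits.size(); ++i) {
        if (logits[i] >= max_l - n * sigma) kept.push_back(i); // within n·σ of the top logit
    }
    return kept;
}

int main() {
    const std::vector<float> logits = {2.0f, 1.8f, 0.3f, -1.0f, -5.0f};
    for (int id : top_n_sigma_filter(logits, 1.0f)) std::printf("kept token %d\n", id);
}
```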
b34c859146
server : Webui - change setText command from parent window to also send the message. ( #13309 )
...
* setText command from parent window for llama-vscode now sends the message automatically.
* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.
* Fix code formatting.
* Add index.html.gz changes.
* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."
This reverts commit 67687b7fda.
* easier approach
* add setTimeout
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-05 16:03:31 +02:00
9b61acf060
mtmd : rename llava directory to mtmd ( #13311 )
...
* mv llava to mtmd
* change ref everywhere
b5284
2025-05-05 16:02:55 +02:00
5215b91e93
clip : fix confused naming ffn_up and ffn_down ( #13290 )
...
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
b5283
2025-05-05 12:54:44 +02:00
ae803bfc3d
convert : bailingmoe : set yarn metadata if present ( #13312 )
2025-05-05 12:34:26 +02:00
66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b ( #13308 )
...
ggml-ci
b5281
2025-05-05 13:39:10 +05:30
27aa259532
mtmd : add C public API ( #13184 )
...
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* add const to various places
* add warning about breaking changes
* helper: use mtmd_image_tokens_get_n_pos
b5280
2025-05-04 23:43:42 +02:00
9fdfcdaedd
rpc : use backend registry, support dl backends ( #13304 )
b5279
2025-05-04 21:25:43 +02:00
6eb7d25c70
ggml : activate s390x simd for Q3_K ( #13301 )
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5278
2025-05-04 19:49:12 +02:00
86bd60d3fe
llava/mtmd : fixes to fully support dl backends ( #13303 )
b5277
2025-05-04 17:05:20 +02:00
9f2da5871f
llama : build windows releases with dl backends ( #13220 )
b5276
2025-05-04 14:20:49 +02:00
93c4e23905
CUDA: fix race condition in MMQ stream-k fixup ( #13299 )
b5275
2025-05-04 14:16:39 +02:00
8afbd96818
CUDA: fix race condition in MMQ ids_dst ( #13294 )
b5274
2025-05-04 13:58:38 +02:00
8ae5ebcf85
vulkan: Additional type support for unary, binary, and copy ( #13266 )
...
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
b5273
2025-05-04 07:17:16 +02:00
3e959f0976
imatrix: fix oob writes if src1 is not contiguous ( #13286 )
b5272
2025-05-04 00:50:37 +02:00
36667c8edc
clip : revert the change of BOI/EOI token for GLM-edge ( ⚠️ breaking change) ( #13259 )
b5271
2025-05-03 20:07:54 +02:00
3bf785f3ef
llama : Llama-3_1-Nemotron-Ultra-253B-v1 support ( #12843 )
b5270
2025-05-03 17:39:51 +02:00
1d36b3670b
llama : move end-user examples to tools directory ( #13249 )
...
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b5269
2025-05-02 20:27:13 +02:00
b34443923c
sync : ggml ( #13268 )
...
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
* review: remove src_x/y < 0 checks; add performance tests
* sync : ggml
ggml-ci
* vulkan : fix lint (#0)
---------
Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
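As a reference for what the CONV_2D_DW kernels compute: in a depthwise 2D convolution every channel is convolved with its own kernel, with no mixing across channels. The following is a naive, standalone C++ reference (stride 1, no padding; the layouts and helper name are illustrative assumptions, not ggml's or the Vulkan shaders' code).

```cpp
#include <vector>

// Naive depthwise 2D convolution reference (stride 1, no padding).
// src    : [C][H][W]  flattened
// kernel : [C][KH][KW] flattened, one kernel per channel
// returns: [C][OH][OW] flattened, OH = H-KH+1, OW = W-KW+1
static std::vector<float> conv_2d_dw_ref(
        const std::vector<float> & src,
        const std::vector<float> & kernel,
        int C, int H, int W, int KH, int KW) {
    const int OH = H - KH + 1;
    const int OW = W - KW + 1;
    std::vector<float> dst(C * OH * OW, 0.0f);
    for (int c = 0; c < C; ++c) {
        for (int oy = 0; oy < OH; ++oy) {
            for (int ox = 0; ox < OW; ++ox) {
                float sum = 0.0f;
                for (int ky = 0; ky < KH; ++ky) {
                    for (int kx = 0; kx < KW; ++kx) {
                        sum += src[(c * H + oy + ky) * W + ox + kx] *
                               kernel[(c * KH + ky) * KW + kx];
                    }
                }
                dst[(c * OH + oy) * OW + ox] = sum; // each channel stays independent
            }
        }
    }
    return dst;
}
```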
a75cb30dc9
context : fix reorder logic ( #13267 )
...
ggml-ci
b5267
2025-05-02 20:54:13 +03:00
3f3769ba76
ggml : Enable MMA for BF16 in llamafile_sgemm ( #13148 )
...
This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.
This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens / total time) across various batch sizes, tested using the llama-batched-bench benchmark.
The patch was tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated with llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine.
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b5266
2025-05-02 19:53:12 +03:00
2f567611c0
llama-model : support Qwen2 embedding models and pooling_mode_lasttoken ( #13245 )
b5265
2025-05-02 11:42:30 -04:00
7d2123484e
convert : use correct context length for nomic-embed-text-v2 ( #13216 )
2025-05-02 11:41:54 -04:00
074e42ab31
convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf ( #13209 )
...
* wip
* qwen2.5vl ok
* vision: fix models missing "text_config"
* add test
* fix test repo name
* fix 32B model
* Revert "fix 32B model"
This reverts commit 651752f1ae.
* clarify about 32B
* rm qwen surgery script
* update llava/readme
* move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel
2025-05-02 17:17:15 +02:00
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl ( #12799 )
...
* kv-cache : separate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> constructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
cb06a3c363
llama : orion rope type is neox ( #13261 )
b5261
2025-05-02 12:44:24 +02:00
626083faf7
llama : plamo rope type is neox ( #13260 )
b5260
2025-05-02 12:40:56 +02:00
2af6880178
llama-chat : reset glmedge chat template ( #13253 )
...
* reset glmedge chat template
* fix glmedge chat template
b5259
2025-05-02 11:06:09 +02:00
e84773ab60
mtmd-cli : fix out_of_range when input image path is empty ( #13244 )
...
* fix out_of_range error to keep the chat loop running
* Update examples/llava/mtmd-cli.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mtmd-cli : load image right away
* add a new line for readability
* rm printf
* Update examples/llava/mtmd-cli.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update examples/llava/mtmd-cli.cpp
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5258
2025-05-02 10:20:27 +02:00
fab647e884
server : add cache reuse card link to help ( #13230 )
...
* server : add cache reuse card link to help
* args : use short url
b5257
2025-05-02 09:48:31 +03:00
dcf886007d
convert : explicitly disable trust_remote_code for AutoConfig ( #13246 )
2025-05-02 08:45:10 +02:00
d24d592808
ci: fix cross-compile sync issues ( #12804 )
b5255
2025-05-01 19:06:39 -03:00
8efbdadc61
rpc : avoid uninitialized memory in serialize_tensor ( #13210 )
...
Zero out the name and padding buffers.
b5254
2025-05-01 23:32:11 +02:00
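The fix above follows the general pattern of zero-initializing fixed-size name and padding fields before a struct is copied byte-for-byte into a message, so stale memory never leaks over the wire. A hedged sketch of that pattern (the wire_tensor layout here is hypothetical, not the actual rpc_tensor struct):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical wire struct, for illustration only.
struct wire_tensor {
    uint64_t id;
    char     name[64];
    char     padding[8];
};

// Zero the whole struct before filling it, so the unused tail of `name` and
// the `padding` bytes never carry leftover (uninitialized) memory when the
// struct is copied verbatim into an RPC message.
static wire_tensor serialize_tensor_safe(uint64_t id, const char * name) {
    wire_tensor out;
    std::memset(&out, 0, sizeof(out));
    out.id = id;
    std::strncpy(out.name, name, sizeof(out.name) - 1); // stays NUL-terminated
    return out;
}
```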
f057808ffa
ggml: Don't assert fail when tensor data changes ( #13222 )
...
The following scenario will cause an assertion failure in the graph
allocator:
- Build and allocate a graph containing a tensor with a non-NULL data
pointer
- Build and allocate a new graph where that data is NULL
Result:
ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed
This happens during revalidation because we think that memory should
have been previously allocated based on the current graph but in
reality the previous graph was different. In this situation, we
should do a full reallocation pass.
b5253
2025-05-01 22:46:10 +02:00
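A minimal sketch of the decision described above, using hypothetical types (this is not the ggml-alloc code): when revalidation sees that a node's current data pointer no longer matches what the previous allocation pass recorded, the graph has changed, so a full reallocation pass should be triggered instead of asserting.

```cpp
// Hypothetical bookkeeping for one graph node (illustrative only).
struct node_alloc_info {
    bool   allocated;     // was this node allocated by a previous pass?
    void * expected_data; // data pointer recorded at that time
};

// Decide whether the allocator can reuse the previous layout, or must redo a
// full reallocation pass because the graph no longer matches it.
static bool needs_full_realloc(const node_alloc_info & prev, void * current_data) {
    if (!prev.allocated) {
        return true; // nothing recorded: allocate from scratch
    }
    // Previously this mismatch tripped GGML_ASSERT(talloc->buffer_id >= 0);
    // with the fix, it is treated as "the graph changed, reallocate fully".
    return prev.expected_data != current_data;
}
```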
d7a14c42a1
build : fix build info on windows ( #13239 )
...
* build : fix build info on windows
* fix cuda host compiler msg
b5252
2025-05-01 21:48:08 +02:00
b6e4ff69b8
clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP image size ( #13237 )
2025-05-01 21:32:21 +02:00
e0f572c846
llama-chat : update GLM4 chat template ( #13238 )
...
* update GLM4 chat template
* Update chat template
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5250
2025-05-01 21:16:38 +02:00
79f26e9e12
vulkan: Add bfloat16 support ( #12554 )
...
* vulkan: Add bfloat16 support
This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.
It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.
The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.
* vulkan: Support bf16 tensors without the bf16 extension or coopmat support
Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.
* vulkan: bfloat16 fixes (really works without bfloat16 support now)
* vulkan: fix spirv-val failure and reenable -O
b5249
2025-05-01 20:49:39 +02:00
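For context on the "promote the bf16 values to float" fallback mentioned above: a bf16 value is the upper 16 bits of an IEEE-754 binary32, so promotion is just a 16-bit shift into the high half of a 32-bit word. A minimal host-side C++ illustration (not the actual shader code):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Promote a bf16 bit pattern to fp32 by placing its 16 bits in the high half
// of a 32-bit word; the low mantissa bits are simply zero.
static float bf16_to_f32(uint16_t bf16_bits) {
    const uint32_t bits = (uint32_t) bf16_bits << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    // 0x3FC0 is bf16 for 1.5 (sign 0, exponent 0x7F, mantissa 0b1000000)
    std::printf("%f\n", bf16_to_f32(0x3FC0));
}
```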
fc727bcdd5
vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader ( #13191 )
...
* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader
b5248
2025-05-01 20:19:31 +02:00
b0ecbd434b
test: non-cont. b in test-backend-ops -o MUL_MAT ( #13187 )
2025-05-01 20:18:56 +02:00
b1dd4d08e8
sync : ggml
...
ggml-ci
b5246
2025-05-01 20:15:34 +03:00
99881f77d8
whisper : add check that target name exists (whisper/3103)
...
This commit adds a check to make sure that the target exists before
trying to add compile options to ignore warnings when using MSVC.
The motivation for this is that the build is currently broken depending on
the CMake options provided. With this fix it should be possible to build
even if the targets are not actually available.
Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104
2025-05-01 20:15:34 +03:00
b5769d92b4
ggml : suppress Windows compiler warnings (whisper/3075)
...
* whisper: suppress Windows compiler warnings
This commit disables compiler warnings on Windows when using MSVC.
The motivation for these changes is that some compilers, for example
Windows MSVC, generate warnings for these conversions, and there are
quite a few of them. This makes it a little difficult to spot new
warnings that may be introduced, and it can also be difficult for
users/embedders of ggml, where these warnings are hard to separate
from their own warnings.
* squash! whisper: suppress Windows compiler warnings
Move ggml-related warnings into ggml. This commit also fixes the
indentation and adds a missing whitespace to the if statement.
2025-05-01 20:15:34 +03:00
8936784f7a
mtmd : add **vision** support for Mistral Small 3.1 ( #13231 )
...
* convert ok
* load ok, missing patch merger
* ah sheet it works
* update llava/readme
* add test
* fix test
b5243
2025-05-01 17:05:42 +02:00
13c9a3319b
arg : remove CURLINFO_EFFECTIVE_METHOD ( #13228 )
b5242
2025-05-01 10:23:25 +02:00
a70183eb00
llama-model : fix the reported size class for nomic-embed-text-v2-moe ( #13223 )
b5241
2025-05-01 10:09:41 +03:00
8d33d740c3
sync : ggml
2025-05-01 10:00:39 +03:00
4254bb4951
ggml : fix ggml_gallocr_ptr type (ggml/1205)
b5239
2025-05-01 09:58:44 +03:00
9998540149
cuda : fix unused variable compile warning (whisper/0)
...
ggml-ci
2025-05-01 09:58:44 +03:00
e1e8e0991f
CUDA: batched+noncont MMQ, refactor bs>1 MoE code ( #13199 )
b5237
2025-04-30 23:12:59 +02:00