Commit Graph

5336 Commits

Author SHA1 Message Date
6f67cf1f48 arg : -hf do not fail if url mismatch (#13219)
* arg : -hf do not fail if url mismatch

* do not return if metadata json cannot be parsed
b5236
2025-04-30 21:29:15 +01:00
16a457facd fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221) b5235 2025-04-30 21:28:43 +01:00
3e168bede4 convert : improve model arch handling (#13122)
* convert : improve model arch handling

* use AutoConfig

* rm trust_remote_code

* Update convert_hf_to_gguf.py

* fix self.block_count for vision

* fix NomicBertModel
2025-04-30 16:56:24 +02:00
ceda28ef8e llava : remove duplicate include (#13207) b5233 2025-04-30 15:25:20 +02:00
3b127c7385 common : add -jf / --json-schema-file flag (#12011) b5232 2025-04-30 14:52:35 +02:00
e5007a5edf vulkan: use uint array index to avoid glslang bug (#13193) b5231 2025-04-30 14:38:37 +02:00
416313773b ggml : fix ppc64le build (#13176)
The build fails with a compilation error on PowerPC; this patch fixes it.

Tested with unit tests run via
 cmake --build <build_dir> && cd <build_dir> && make test

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b5230
2025-04-30 13:17:08 +02:00
07c2e2f76c convert : correct typo image_mean --> image_std (#13208) 2025-04-30 13:06:15 +02:00
44cd8d91ff feat(ggml-cpu): enable z17 compile (#13182)
z17 compilation requires GCC 15.1.0 or later

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5228
2025-04-30 10:47:35 +01:00
5933e6fdc9 arg : allow using -hf offline (#13202)
* arg : allow using -hf offline

* add more comments in code [no ci]
2025-04-30 10:46:32 +02:00
da84c04d8f docker : do not build tests (#13204)
* docker : do not build tests

* include "ggml-cpu.h"
b5226
2025-04-30 10:44:07 +02:00
a0f7016d17 rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
b5225
2025-04-30 09:29:22 +03:00
19e899ce21 scripts: n_depth for compare-llama-bench [no ci] (#13201) 2025-04-29 23:32:04 +02:00
e2e1ddb93a server : Prefilling assistant message in OpenAI-compatible API (#13174)
* Prefilling assistant message in OpenAI-compatible API

* fixed indentation

* fixed code convention

* simplify method usage

* no more than one assistant message at end of messages

* merge checks into prefill code

* Update examples/server/utils.hpp

---------

Co-authored-by: matteo <matteo@naspc.lan>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5223
2025-04-29 20:33:10 +02:00
d9d398f84f sampling : when top-k <= 0 -> noop (#13173)
ggml-ci
b5222
2025-04-29 20:22:57 +03:00
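As a hedged illustration of the behavior d9d398f84f describes (this is not the actual llama.cpp sampler code; `token_cand` and the helper name are made up for this sketch), a non-positive k simply leaves the candidate list untouched:

```cpp
// Sketch of "top-k <= 0 -> noop": non-positive k keeps all candidates.
#include <algorithm>
#include <vector>

struct token_cand { int id; float logit; };

static void top_k_sketch(std::vector<token_cand> & cands, int k) {
    if (k <= 0 || (size_t) k >= cands.size()) {
        return; // noop: leave the candidate list untouched
    }
    // keep only the k highest-logit candidates
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
                      [](const token_cand & a, const token_cand & b) { return a.logit > b.logit; });
    cands.resize(k);
}
```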
5a63980117 llama-bench: fixed size of fields to correctly map to values (#13183) b5221 2025-04-29 17:24:36 +02:00
cdf76586b2 CUDA: fix non-cont. inputs for batched mat mul (#13155) b5220 2025-04-29 16:00:27 +02:00
7d3af70b08 llama : llm_type order by size (#13177) b5219 2025-04-29 13:25:53 +02:00
00e3e5a194 mtmd : add qwen2vl and qwen2.5vl (#13141)
* llava : add clip_n_output_tokens, deprecate clip_n_patches

* mtmd : add qwen2vl and qwen2.5vl

* decode_embd_batch::set_position_...

* working version

* deprecate llama-qwen2vl-cli

* correct order W, H of clip_embd_nbytes_by_img

* edit existing line in hot topics
b5218
2025-04-29 11:47:04 +02:00
e98b3692be llama : set qwen3 model type sizes (#13175) b5217 2025-04-29 11:00:31 +02:00
b6ce7430b7 llama-graph : fix text position for mrope (#13159)
* llama-graph : fix text position for mrope

* fix typo

* explicitly set 4th dim in the loop
b5216
2025-04-29 09:45:49 +03:00
5f5e39e1ba model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture (#12466)
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture

- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.

https://www.nomic.ai/blog/posts/nomic-embed-text-v2

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

* fix tokenizer

* don't rename this tensor

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
b5215
2025-04-28 22:52:15 +03:00
eaea325324 clip : fix model size display (#13153) b5214 2025-04-28 21:23:19 +02:00
43ddab6eee fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message, added earlier in this branch, that was
emitted when a tensor ID is not found in the provided map.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
The primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

The check breaks CI on ubuntu-cpu-make; the tensor type is uint32_t, so
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
b5213
2025-04-28 21:00:20 +03:00
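A minimal sketch of the validation pattern described in 43ddab6eee, assuming a simplified `rpc_tensor` wire struct (the real rpc-server message format differs); `ggml_new_tensor_4d` and `GGML_TYPE_COUNT` are the ggml APIs named in the commit:

```cpp
// Hedged sketch of the type/bounds validation described above.
// `rpc_tensor` is a simplified stand-in for the real wire format.
#include "ggml.h"
#include <cstdint>
#include <cstdio>

struct rpc_tensor {
    uint32_t type;   // enum ggml_type as sent by the client
    uint64_t ne[4];  // dimensions
};

// Reject out-of-range types before touching the ggml API; return nullptr
// instead of hitting an assertion inside ggml_new_tensor_4d.
static ggml_tensor * deserialize_tensor_sketch(ggml_context * ctx, const rpc_tensor & t) {
    if (t.type >= GGML_TYPE_COUNT) {
        fprintf(stderr, "invalid tensor type %u\n", t.type);
        return nullptr;
    }
    return ggml_new_tensor_4d(ctx, (ggml_type) t.type,
                              (int64_t) t.ne[0], (int64_t) t.ne[1],
                              (int64_t) t.ne[2], (int64_t) t.ne[3]);
}

// Bounds check in the spirit of the set_tensor handler: fail gracefully
// instead of GGML_ABORT when client-provided offset/size do not fit.
static bool check_in_bounds(size_t buf_size, uint64_t offset, uint64_t size) {
    return offset <= buf_size && size <= buf_size - offset; // also guards against overflow
}
```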
1831f538f7 llama-bench: add -d depth arg (#13096)
* add depth param

* update llama-bench README and add depth param

* llama-bench: default params for depth arg for faster execution

* Update examples/llama-bench/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix buffer print ub

* use user provided args

* remove extra whitespaces

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5212
2025-04-28 16:50:39 +02:00
4e87962e34 mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count

* fix chat template

* temporary disable GLMEdge test chat tmpl
b5211
2025-04-28 16:12:56 +02:00
fb0471d175 context : do not clear output buffer on reserve (#13152)
Co-authored-by: pockers21 <liyang2@uniontech.com>
b5210
2025-04-28 16:45:40 +03:00
d2b2031e5f llama : (mrope) allow using normal 1D position for text token (#13138)
* llama : (mrope) use normal position for text token

* rm n_pos_per_embd from llm_graph_input_attn_temp
b5209
2025-04-28 14:20:56 +02:00
5fa9e63be8 clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
* clip : refactor set input for cgraph

* more strict assert

* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere

* split qwen2 and qwen2.5 code blocks

* minor style fix
b5208
2025-04-28 12:18:59 +02:00
a4c340f974 SYCL: Add all missing unary kernels (#13074)
* SYCL: Add all missing unary kernels

ggml-ci

* decouple kernel launch range from data size using strided loop

* use ceil_div helper for num_blocks
ggml-ci

* clean auto imported header files
b5207
2025-04-28 11:33:25 +02:00
d0a417f3c7 readme : update hot topics (#13150) 2025-04-28 12:10:18 +03:00
43f2b07193 common : fix noreturn compile warning (#13151)
ggml-ci
b5205
2025-04-28 11:57:19 +03:00
e5d6c2554e llama-chat : fix typo GML --> GLM (#13143) b5204 2025-04-28 10:11:58 +02:00
f0dd6a1926 musa: fix typo in cc control (#13144)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-28 09:33:28 +02:00
69699be48a CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137) b5202 2025-04-28 09:29:26 +02:00
85f36e5e71 arg : fix unused variable (#13142) b5201 2025-04-28 08:16:59 +03:00
c0a97b762e llama-bench : Add --override-tensors arg (#12922)
* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split and appear in the test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
b5200
2025-04-27 23:48:26 +02:00
ced44be342 llama-chat : fix wrong template in GLM4-0414 (#13140)
* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check
b5199
2025-04-27 21:57:32 +02:00
e291450b76 musa: fix build warning (#13129)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5198
2025-04-27 13:22:49 +02:00
59e991c23c Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402, as the has_qwen2vl_merger migration was incomplete (#13133) b5197 2025-04-27 12:43:37 +02:00
ca2bb89eac clip : Add Qwen2.5VL support (#12402)
* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix a few incorrect tensor memory layouts

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b5196
2025-04-27 10:10:34 +02:00
2d451c8059 common : add common_remote_get_content (#13123)
* common : add common_remote_get_content

* support max size and timeout

* add tests
b5195
2025-04-26 22:58:12 +02:00
4753791e70 clip : improve projector naming (#13118)
* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused
b5194
2025-04-26 22:39:47 +02:00
77d5e9a76a ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)
* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
b5193
2025-04-26 16:05:31 +02:00
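A hedged usage sketch for the conversion APIs exported in 77d5e9a76a; the exact `ggml_cpu_fp16_to_fp32` signature shown here is an assumption based on the commit message:

```cpp
// Assumes ggml-cpu.h declares roughly:
//   void ggml_cpu_fp16_to_fp32(const ggml_fp16_t * x, float * y, int64_t n);
#include "ggml.h"
#include "ggml-cpu.h"
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> src = {0.5f, -1.25f, 3.0f};
    std::vector<ggml_fp16_t> half(src.size());
    std::vector<float> back(src.size());

    // pack to fp16, then expand back to fp32 via the CPU-backend routine
    for (size_t i = 0; i < src.size(); ++i) {
        half[i] = ggml_fp32_to_fp16(src[i]);
    }
    ggml_cpu_fp16_to_fp32(half.data(), back.data(), (int64_t) back.size());

    for (float v : back) {
        printf("%f\n", v);
    }
}
```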
d5fe4e81bd grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
b5192
2025-04-26 10:10:20 +02:00
295354ea68 llama : fix K-shift with quantized K and BLAS backend (#13113) b5191 2025-04-25 19:40:11 +02:00
558a764713 Force FP32 compute in GLM4 FFN Down (#13101)
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872732.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b5190
2025-04-25 14:38:34 +02:00
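For context on 558a764713, a hedged sketch of how F32 accumulation is typically forced for a single mat-mul in ggml; whether the PR wires `ggml_mul_mat_set_prec` into the GLM4 ffn_down exactly like this is an assumption:

```cpp
// Hedged sketch: forcing F32 accumulation for one matrix multiply in ggml.
// The GLM4-specific wiring here is illustrative, not copied from the PR.
#include "ggml.h"

static ggml_tensor * build_ffn_down_f32(ggml_context * ctx,
                                        ggml_tensor * ffn_down_w,
                                        ggml_tensor * cur) {
    ggml_tensor * out = ggml_mul_mat(ctx, ffn_down_w, cur);
    // Request full FP32 compute for this op so backends (e.g. cuBLAS) do not
    // use reduced-precision accumulation that can overflow here.
    ggml_mul_mat_set_prec(out, GGML_PREC_F32);
    return out;
}
```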
edb18b6e8f clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
b5189
2025-04-25 14:31:42 +02:00
514c45608f move the tensor reorder from init to the execute OP (#13003) b5188 2025-04-25 17:37:51 +08:00
553a5c3a9f rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
b5187
2025-04-25 10:08:08 +03:00
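The optimization in 553a5c3a9f can be sketched as below; the framing (command byte + length + payload) is a simplification and not the actual llama.cpp RPC wire format:

```cpp
// Hedged sketch of the fire-and-forget optimization for RPC_CMD_SET_TENSOR.
#include <sys/socket.h>
#include <cstdint>
#include <vector>

static bool send_all(int fd, const void * buf, size_t n) {
    const uint8_t * p = static_cast<const uint8_t *>(buf);
    while (n > 0) {
        ssize_t k = send(fd, p, n, 0);
        if (k <= 0) return false;
        p += k; n -= (size_t) k;
    }
    return true;
}

static bool recv_all(int fd, void * buf, size_t n) {
    uint8_t * p = static_cast<uint8_t *>(buf);
    while (n > 0) {
        ssize_t k = recv(fd, p, n, 0);
        if (k <= 0) return false;
        p += k; n -= (size_t) k;
    }
    return true;
}

// Before: wait for the (always empty) response -> one extra network round trip
// per call, and SET_TENSOR is issued several times per generated token.
static bool set_tensor_blocking(int fd, uint8_t cmd, const std::vector<uint8_t> & payload) {
    uint64_t len = payload.size();
    if (!send_all(fd, &cmd, 1) || !send_all(fd, &len, 8) ||
        !send_all(fd, payload.data(), payload.size())) return false;
    uint64_t resp_len = 0;
    return recv_all(fd, &resp_len, 8) && resp_len == 0;
}

// After: the response carries no information, so just send and return.
static bool set_tensor_nowait(int fd, uint8_t cmd, const std::vector<uint8_t> & payload) {
    uint64_t len = payload.size();
    return send_all(fd, &cmd, 1) && send_all(fd, &len, 8) &&
           send_all(fd, payload.data(), payload.size());
}
```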