llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-13 03:47:46 -04:00

Author	SHA1	Message	Date
Diego Devesa	482548716f	releases : use dl backend for linux release, remove arm64 linux release (#13996 ) b5587	2025-06-04 13:15:54 +02:00
Xuan-Son Nguyen	3ac67535c8	llama-graph : use ggml_repeat_4d (#13998 ) b5586	2025-06-04 10:11:26 +02:00
Johannes Gäßler	0b4be4c435	CUDA: fix FTZ in FA for Gemma 3 (#13991 ) b5585	2025-06-04 08:57:05 +02:00
Georgi Gerganov	e0e806f52e	kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985 ) ggml-ci b5584	2025-06-04 09:50:32 +03:00
Jeff Bolz	7e00e60ef8	vulkan: fix warnings in perf logger querypool code (#13937 )	2025-06-03 20:30:22 +02:00
Xuan-Son Nguyen	ea1431b0fa	docs : add "Quick start" section for new users (#13862 ) * docs : add "Quick start" section for non-technical users * rm flox * Update README.md	2025-06-03 13:09:36 +02:00
lhez	71e74a3ac9	opencl: add `backend_synchronize` (#13939 ) * This is not needed by the normal use where the result is read using `tensor_get`, but it allows perf mode of `test-backend-ops` to properly measure performance. b5581	2025-06-02 16:54:58 -07:00
rmatif	bfb1e012a0	OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840 ) * add concat, pad, repeat, tsembd, tanh, upscale * small fixes b5580	2025-06-02 16:53:36 -07:00
Georgi Gerganov	3637576288	server : disable speculative decoding for SWA models (#13970 ) * server : use swa-full for draft context ggml-ci * server : disable speculative decoding for SWA models b5579	2025-06-02 21:34:40 +03:00
Georgi Gerganov	ea394d7ab1	metal : use F32 accumulators in FA kernels (#13975 ) ggml-ci b5578	2025-06-02 21:33:40 +03:00
Georgi Gerganov	5582c49c39	gemma : more consistent attention scaling for v2 and v3 (#13951 ) * gemma : fix attn scale for 27B * cont : apply scale before attn * cont : consistent attention scaling b5577	2025-06-02 20:54:26 +03:00
Olivier Chafik	c9bbc77931	`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933 ) * server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat * update unit/test_tool_call.py::test_thoughts b5576	2025-06-02 10:15:44 -07:00
Xuan-Son Nguyen	bfd322796c	mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961 ) * mtmd : fix memory in mtmd_helper_eval_chunk_single * mtmd-cli : fix mem leak * Update tools/mtmd/mtmd-cli.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b5575	2025-06-02 16:29:28 +02:00
shalinib-ibm	093e3f1feb	cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966 ) Some systems report the CPU implementation as "Power11" instead of "POWER11". The existing CMake logic uses a case-sensitive regular expression to extract the CPU generation, which fails when the casing doesn't exactly match "POWER". This patch provides a fix by first converting the string to uppercase before applying the regex. Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com> Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com> b5574	2025-06-02 15:18:36 +03:00
Atharva Dubey	663445b0de	sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826 ) * [WIP]: fuse q8 quantization and reorder * wip2: fuse q8 quantization and reorder * working q8 reorder commit * restored common.hpp * remove debug prints * remove unnecessary headers and remove trailing whitespace * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com> --------- Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com> b5573	2025-06-02 10:12:20 +01:00
Johannes Gäßler	7675c555a1	gguf: fix failure on version == 0 (#13956 ) b5572	2025-06-01 18:08:05 +02:00
Sigbjørn Skjæret	5e1c3aed40	convert : fix nomic-bert-moe mask token (#13757 ) b5571	2025-06-01 18:07:21 +02:00
Sigbjørn Skjæret	c496fe0b1d	convert : fix vocab padding code for bert models (#13954 )	2025-06-01 17:23:11 +02:00
Aaron Teo	e57bb87ced	ggml: check if non-native endian model is being loaded (#13943 ) * gguf: prevent non-native endian models from being loaded Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * gguf: update error message Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * gguf: make the non-native endian check more verbose Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_assert location Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: reword the endianness check error message Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b5569	2025-06-01 16:53:57 +02:00
Georgi Gerganov	f3a4b1659c	sync : ggml ggml-ci b5568	2025-06-01 13:43:57 +03:00
Kai Pastor	108009f5c7	vulkan : Remove unexpected ; (ggml/1253)	2025-06-01 13:43:57 +03:00
Kai Pastor	d337252acf	cmake : Fix broken CMake error messages (ggml/1252)	2025-06-01 13:43:57 +03:00
Radoslav Gerganov	af6f91db47	ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247) The implementation is already deleted with commit 9d0762e. closes: #1235	2025-06-01 13:43:57 +03:00
Georgi Gerganov	a7b8d35f78	sync : whisper.cpp (ggml/1250) * ggml : Fix backtrace breaking Windows build (whisper/3203) * sync : whisper.cpp ggml-ci --------- Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>	2025-06-01 13:43:57 +03:00
Radoslav Gerganov	6eba72b71c	ggml : install dynamic backends (ggml/1240) * ggml : install dynamic backends Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR	2025-06-01 13:43:57 +03:00
Daniel Tang	fedf034a98	ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) The goal is to have what users call "full logs" contain the backtrace. This is registered upon ggml_init. Also fixes a minor fd leak on Linux.	2025-06-01 13:43:57 +03:00
ddh0	8726392d3d	readme : update bindings (#13950 )	2025-06-01 11:44:30 +03:00
Georgi Gerganov	c04621711a	parallel : fix n_junk == 0 (#13952 ) b5560	2025-06-01 11:42:16 +03:00
Georgi Gerganov	0fc16b42e8	kv-cache : split implementation in separate sources (#13920 ) ggml-ci b5559	2025-06-01 11:39:27 +03:00
Max Krasnyansky	053b1539c0	threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995 ) * threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention. Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggressively, which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities. Co-authored-by: Diego Devesa <slarengh@gmail.com> * threading: disable SetThreadInfo() calls for older Windows versions * Update tools/llama-bench/llama-bench.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> b5558	2025-05-31 15:39:19 -07:00
Jiří Podivín	b3a89c3d9e	docs : Note about necessity of having libcurl installed for standard build. (#13945 ) Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2025-05-31 18:58:35 +02:00
Olivier Chafik	e15898d1c7	server: allow unclosed thinking tags (#13931 ) b5556	2025-05-31 08:26:10 -07:00
Georgi Gerganov	803f8baf4f	llama : deprecate explicit kv_self defrag/update calls (#13921 ) ggml-ci b5555	2025-05-31 15:58:33 +03:00
Georgi Gerganov	3600cc2886	llama : use n_swa + n_ubatch cells for SWA cache (#13833 ) * llama : use n_swa + n_ubatch cells for SWA cache ggml-ci * llama : add warning about multi-sequence SWA contexts b5554	2025-05-31 15:57:44 +03:00
igardev	c7e0a2054b	webui : Replace alert and confirm with custom modals. (#13711 ) * Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons. * use Modal Provider to simplify the use of confirm and alert modals. * Increase the z index of the modal dialogs. * Update index.html.gz * also add showPrompt * rebuild --------- Co-authored-by: igardev <ivailo.gardev@akros.ch> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-31 11:56:08 +02:00
Georgi Gerganov	3f55f781f1	llama : auto-batch preparation (#13845 ) * llama : auto-batch ggml-ci * context : simplify if branching b5552	2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen	51fa76f172	mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917 ) * mtmd : fix missing public header * no object * apply suggestion from Georgi * rm mtmd-helper, merge it to mtmd * missing vendor include dir b5551	2025-05-31 10:14:29 +02:00
Georgi Gerganov	12d0188c0d	kv-cache : refactor + add llama_memory_state_i (#13746 ) * kv-cache : simplify the "struct llama_kv_cache" interface ggml-ci * kv-cache : revert the (n_swa + n_ubatch) change (for next PR) ggml-ci * kv-cache : some comments ggml-ci * context : fix graph reserve for multiple sequences ggml-ci * kv-cache : fix typo [no ci] * kv-cache : fix find_slot() logic for free slots ggml-ci * llama : add TODO for deprecating the defrag API in the future * kv-cache : improve find_slot() using min/max seq pos info ggml-ci * llama : handle aborts and compute errors ggml-ci * memory : extract state into llama_memory_state ggml-ci * kv-cache : add comments ggml-ci * server : update batching logic to reset n_batch on successful decode * server : upon full re-processing, remove the sequence from the cache * kv-cache : add TODO for doing split_equal when split_simple fails ggml-ci	2025-05-31 10:24:04 +03:00
Shawn yang	eb3949938e	CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU from dGPU in cuda (#13856 ) (#13895 ) * 1. add "integrated" in ggml_cuda_device_info to distinguish whether it is integrated_gpu or discrete_gpu 2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted code indentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fixed incorrect setting of variable types Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the judgment logic Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()' * Update ggml/src/ggml-cuda/ggml-cuda.cu Add a defensive security assert Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the support judgment logic. Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * revoke the suggested commit changes because they are not applicable on jetson_device * Update ggml/src/ggml-cuda/ggml-cuda.cu Add parentheses to enforce operator precedence Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fix ci bug: add a space Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: yangxiao <yang_xl@tju.edu.cn> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: yangxiao <yangxl_zz@qq.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-05-31 08:48:04 +02:00
Johannes Gäßler	e562eece7c	CUDA: fix typo in FlashAttention code (#13926 ) b5548	2025-05-30 21:22:03 +02:00
Diego Devesa	b47ab7b8e9	sched : avoid changing cur_copy when a graph is already allocated (#13922 ) b5547	2025-05-30 18:56:19 +02:00
Georgi Gerganov	dd665cc9d4	parallel : increase the variability of the prompt lengths (#13927 ) ggml-ci b5546	2025-05-30 19:38:07 +03:00
Diego Devesa	df0c0c7d02	cuda : prevent using split buffers with 3d/4d matrices (#13919 ) b5545	2025-05-30 16:37:18 +02:00
Akarshan Biswas	b49a8ff96b	SYCL: Add mrope kernel (#13755 ) * SYCL: Add mrope kernel * feat: Optimize rope operations with vectorization Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution. * Use ceil_div b5544	2025-05-30 19:40:57 +05:30
Georgi Gerganov	53f925074d	sync : vendor (#13901 ) * sync : vendor ggml-ci * cont : fix httplib version ggml-ci * cont : fix lint * cont : fix lint * vendor : move to common folder /vendor ggml-ci * cont : fix lint * cont : move httplib to /vendor + use json_fwd.hpp ggml-ci * cont : fix server build ggml-ci * cont : add missing headers ggml-ci * cont : header clean-up ggml-ci b5543	2025-05-30 16:25:45 +03:00
Sigbjørn Skjæret	db38704f01	convert : fix rwkv bos/eos token (#13844 )	2025-05-30 14:50:43 +02:00
Xuan-Son Nguyen	07e4351ce6	convert : allow partial update to the chkhsh pre-tokenizer list (#13847 ) * convert : allow partial update to the chkhsh pre-tokenizer list * code style * update tokenizer out * rm inp/out files for models not having gguf * fixed hash for glm * skip nomic-bert-moe test * Update convert_hf_to_gguf_update.py * fix minerva-7b hash * rm redundant import b5541	2025-05-30 12:24:37 +02:00
Đinh Trọng Huy	291f2b6913	llama : add support for DistilBert (#13907 ) * add distilbert * small fixes * add note for LLM_ARCH_DISTIL_BERT * Use MODEL_ARCH.BERT for DistilBert --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp> b5540	2025-05-30 11:56:02 +02:00
zhangkaihuo	2c90da4c7e	llama : use llm_build_granite for minicpm (#13911 ) b5539	2025-05-30 10:31:48 +02:00
Christian Kastner	ec9e0301fe	cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890 ) b5538	2025-05-30 01:28:54 +02:00


5587 Commits
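
Commit 093e3f1feb in the log above describes converting the reported CPU name to uppercase before a case-sensitive regex extracts the POWER generation, so that "Power11" and "POWER11" are treated the same. The following is a minimal Python sketch of that idea only. The actual fix lives in the project's CMake logic, and the function name `power_generation` and the pattern used here are illustrative assumptions, not the project's code.

```python
import re

# Hypothetical pattern: the literal "POWER" followed by the generation number.
_POWER_RE = re.compile(r"POWER(\d+)")


def power_generation(cpu_name):
    """Return the POWER generation as an int, or None if it is not found."""
    # Uppercase first so "Power11" and "POWER11" both match the
    # case-sensitive pattern above.
    match = _POWER_RE.search(cpu_name.upper())
    return int(match.group(1)) if match else None
```

Normalizing the input once keeps the pattern itself simple; an alternative would be a case-insensitive match, which this sketch does not use because the commit message describes the uppercase conversion.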