llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-07-28 03:55:06 -04:00

Author	SHA1	Message	Date
chen fan	ef6198b5a5	CANN: weight format to NZ for Ascend310P3 (#14407 ) * weight format to nz for 310p * remove quant weight format to nz * clean code * fix * make the conditions for converting weights to NZ format consistent * clean code	2025-07-25 21:24:50 +08:00
Aman Gupta	1e55890e40	CUDA: add fused rms norm (#14800 )	2025-07-25 21:24:50 +08:00
Csaba Kecskemeti	9b5125679c	ggml : model card yaml tab->2xspace (#14819 )	2025-07-25 21:24:50 +08:00
Jeff Bolz	44d4801a25	vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817 )	2025-07-25 21:24:50 +08:00
Molly Sophia	10a676558d	llama : add model type detection for rwkv7 7B&14B (#14816 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-07-25 21:24:50 +08:00
Ed Addario	45fc00e2c0	imatrix: add option to display importance score statistics for a given imatrix file (#12718 ) * Add --show-statistics option * Add --show-statistics logic * Add tensor name parsing * Tidy output format * Fix typo in title * Improve tensor influence ranking * Add better statistics * Change statistics' sort order * Add Cosine Similarity * Add header search path * Change header search path to private * Add weighted statistics per layer * Update report title * Refactor compute_statistics out of main * Refactor compute_cossim out of load_imatrix * Refactor compute_statistics out of load_imatrix * Move imatrix statistics calculation into its own functions * Add checks and validations * Remove unnecessary include directory * Rename labels * Add m_stats getter and refactor compute_statistics out of load_imatrix * Refactor variable names * Minor cosmetic change * Retrigger checks (empty commit) * Rerun checks (empty commit) * Fix unnecessary type promotion Co-authored-by: compilade <git@compilade.net> * Reverting change to improve code readability * Rerun checks (empty commit) * Rerun checks (empty commit) * Rerun checks - third time's the Charm 🤞 (empty commit) * Minor cosmetic change * Update README * Fix typo * Update README * Rerun checks (empty commit) * Re-implement changes on top of #9400 * Update README.md * Update README * Update README.md Co-authored-by: compilade <git@compilade.net> * Update README.md Co-authored-by: compilade <git@compilade.net> * Update README.md * Remove duplicate option in print_usage() * Update README.md * Update README.md Co-authored-by: compilade <git@compilade.net> * Update README.md Co-authored-by: compilade <git@compilade.net> * Remove input check * Remove commented out code --------- Co-authored-by: compilade <git@compilade.net>	2025-07-25 21:24:50 +08:00
stduhpf	888b75ba61	Mtmd: add a way to select device for vision encoder (#14236 ) * Mtmd: add a way to select device for vision encoder * simplify * format * Warn user if manual device selection failed * initialize backend to nullptr	2025-07-25 21:24:50 +08:00
Sigbjørn Skjæret	4c94f27ab7	cuda : implement bf16 cpy ops and enable bf16 cont (#14763 ) * implement bf16 cpy ops and enable bf16 cont * deduplicate copy functions * deduplicate checks	2025-07-25 21:24:50 +08:00
lhez	1e54562db3	opencl: remove unreachable `return` (#14806 )	2025-07-25 21:24:50 +08:00
Molly Sophia	0dd3cd5540	server : allow setting `--reverse-prompt` arg (#14799 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-07-25 21:24:50 +08:00
R0CKSTAR	9e500e2355	cuda: remove linking to cublasLt (#14790 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-25 21:24:50 +08:00
Sigbjørn Skjæret	e77f241b84	opencl: fix `im2col` when `KW!=KH` (#14803 )	2025-07-25 21:24:50 +08:00
rmatif	120add9ef4	opencl: add conv2d kernel (#14403 ) * add conv2d kernel * fix trailing whitespace * whitespace fixe * handle f16 input and f16 kernel, more opt * resolve conflicts * use enqueue_ndrange_kernel	2025-07-25 21:24:50 +08:00
Romain Biessy	f04095bde9	sycl: Fix im2col (#14797 )	2025-07-25 21:24:50 +08:00
Charles Xu	549f9eb1b5	kleidiai: add support for get_rows (#14676 ) * kleidiai: add support for get_rows * apply fixes based on code review * apply more fixes based on code review	2025-07-25 21:24:50 +08:00
Radoslav Gerganov	ae77ded2c2	docs : fix backends table in README.md (#14796 )	2025-07-25 21:24:50 +08:00
Jeff Bolz	a2cdf559c2	vulkan/cuda: Fix im2col when KW!=KH (#14789 ) The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.	2025-07-25 21:24:49 +08:00
Aaron Teo	8410b085ea	docs: update huggingface links + reword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-07-21 18:31:18 +08:00
Aaron Teo	e086c5e3a7	docs: update s390x document for sentencepiece Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-07-21 18:21:39 +08:00
Molly Sophia	c82d48ec23	llama : fix `--reverse-prompt` crashing issue (#14794 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com> b5949	2025-07-21 17:38:36 +08:00
IsaacDynamo	b4efd77f8a	server : add parse_special option to /tokenize endpoint (#14783 )	2025-07-21 10:24:51 +03:00
Aman Gupta	2be60cbc27	docs : fix link for tools/perplexity in README.md (#14780 )	2025-07-20 20:13:47 +02:00
rspOverflow	b526ad2668	Documentation: Further revisions to the Vulkan section in build.md (#14785 ) * Documentation: Revised and further improved the Vulkan instructions for Linux users in build.md. * Minor: Revise step 2 of the Vulkan instructions for Linux users in build.md	2025-07-20 18:55:32 +02:00
Aman Gupta	938b785764	Clang-format: local files first + fix BinPacking (#14779 )	2025-07-20 19:42:34 +08:00
0cc4m	36c153248f	Contrib: add 0cc4m as codeowner for Vulkan backend (#14775 )	2025-07-19 23:47:21 +03:00
Ervin Áron Tasnádi	a979ca22db	ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316 ) * ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan * ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with gemm (no need for im2col), * test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs, adds tests * * Performance fixes: minimized branch divergence, uses collectives to eliminate redundant calculation, macros removed. * Kernel shared memory size check * Updates test-backend-ops to support graphs for performance measurement. * * Apple/Win32 compile errors fixed * Subgroup size used to determine tile size -> fixes llvmpipe errors. * Collectives disabled by default. * Intel support is disabled as the performance is poor. * Conv2d enabled for Intel with disabled collectives, disabled for Apple * test-backend-ops modifications are reverted * Trailing spaces and missing override fixed. * Triggering pipeline relaunch. * Code formatted with .clang-format. b5943	2025-07-19 21:59:08 +02:00
compilade	90083283ec	imatrix : use GGUF to store importance matrices (#9400 ) * imatrix : allow processing multiple chunks per batch * perplexity : simplify filling the batch * imatrix : fix segfault when using a single chunk per batch * imatrix : use GGUF to store imatrix data * imatrix : fix conversion problems * imatrix : use FMA and sort tensor names * py : add requirements for legacy imatrix convert script * perplexity : revert changes * py : include imatrix converter requirements in toplevel requirements * imatrix : avoid using designated initializers in C++ * imatrix : remove unused n_entries * imatrix : allow loading mis-ordered tensors Sums and counts tensors no longer need to be consecutive. * imatrix : more sanity checks when loading multiple imatrix files * imatrix : use ggml_format_name instead of std::string concatenation Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * quantize : use unused imatrix chunk_size with LLAMA_TRACE * common : use GGUF for imatrix output by default * imatrix : two-way conversion between old format and GGUF * convert : remove imatrix to gguf python script * imatrix : use the function name in more error messages * imatrix : don't use FMA explicitly This should make comparisons between the formats easier because this matches the behavior of the previous version. * imatrix : avoid returning from void function save_imatrix * imatrix : support 3d tensors with MUL_MAT * quantize : fix dataset name loading from gguf imatrix * common : move string_remove_suffix from quantize and imatrix Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * imatrix : add warning when legacy format is written * imatrix : warn when writing partial data, to help guess dataset coverage Also make the legacy format store partial data by using neutral values for missing data. This matches what is done at read-time for the new format, and so should get the same quality in case the old format is still used. * imatrix : avoid loading model to convert or combine imatrix * imatrix : avoid using imatrix.dat in README --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b5942	2025-07-19 12:51:22 -04:00
Peter0x44	d4b91ea7b2	vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274 ) (#14707 ) b5941	2025-07-19 17:58:03 +02:00
0cc4m	83f5872404	Vulkan: Fix fprintf format-security warning (#14770 ) b5940	2025-07-19 17:47:53 +02:00
rspOverflow	f0d4d176df	Documentation: Update build.md's Vulkan section (#14736 ) * Documentation: Rewrote and updated the "Without docker" portion of the Vulkan backend build documentation. * Documentation: Reorganize build.md's Vulkan section.	2025-07-19 12:18:36 +02:00
Georgi Gerganov	b17230917c	sync : ggml	2025-07-19 11:46:50 +03:00
Georgi Gerganov	bf9087f59a	metal : fuse add, mul + add tests (#14596 ) ggml-ci b5937	2025-07-18 20:37:26 +03:00
Georgi Gerganov	9fb1042ce6	graph : fix graph reuse reset of params (#14760 ) ggml-ci b5936	2025-07-18 20:08:33 +03:00
Georgi Gerganov	2adf8d83ac	parallel : add option for different RNG seeds (#14757 ) ggml-ci b5935	2025-07-18 17:33:41 +03:00
Oliver Simons	021cc28bef	cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (#14741 ) * Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs Gemma3n uses Matrix-Matrix addition as part of their input processing, wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size of 1 is used. * Exclude `project_per_layer_input` by matching node names This ensures that all other graphs which don't exhibit this pattern do not have their behavior changed. * Revert unnecessary formatting changes b5934	2025-07-18 04:35:32 -07:00
Georgi Gerganov	d498af3d5a	graph : avoid huge warm-up graphs for MoE models (#14753 ) * graph : avoid huge warm-up graphs for MoE models ggml-ci * cont : bump max nodes to 8x model tensors b5933	2025-07-18 14:31:15 +03:00
Georgi Gerganov	eacdeb5bfc	model : fix build after merge conflict (#14754 ) b5932	2025-07-18 11:53:55 +03:00
lgai-exaone	e0cb5c5cb8	model : add EXAONE 4.0 support (#14630 )	2025-07-18 10:45:49 +02:00
Aman Gupta	f9a31eea06	CUDA: set_rows + cpy.cu refactor (#14712 ) b5930	2025-07-18 14:54:18 +08:00
Georgi Gerganov	8f974bc1e9	graph : refactor context to not pass gf explicitly (#14629 ) ggml-ci b5929	2025-07-18 08:29:28 +03:00
Nexes the Elder	09651d09ff	graph : Pass the graph placeholder message in debug mode (#14748 ) Without that condition, this debug log clutters the screen every batch treated in the prompt processing, or every token generated in Kobold.cpp. b5928	2025-07-18 07:25:54 +03:00
Neo Zhang Jianyu	349ea79fce	use max work group size for device to replace the magic number (#14732 ) b5927	2025-07-18 10:23:14 +08:00
Piotr Wilkin (ilintar)	670e1360cd	convert : fix Ernie4.5 MoE without shared experts (#14746 )	2025-07-18 01:17:16 +02:00
Wroclaw	760b4484e3	nix : use optionalAttrs for env mkDerivation attrset argument (#14726 )	2025-07-17 15:18:16 -07:00
Piotr Wilkin (ilintar)	cb887f1bc1	model: add Ernie 4.5 MoE support (#14658 ) * Add Ernie4.5 MoE * Fix Flake errors. * Properly encode/decode MoE layer step * Correct tensor mappings (.weight) * Pass and read n_ff_exp * n_ff_shexp calculation and further minor changes * Rope fixes. * .gitignore fix * Add unit32 cast for Linux builds * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Further fixes from code review * Fix trailing whitespace * Reenable missing experts error * Code style from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix non-MoE regression Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b5924	2025-07-17 23:15:32 +02:00
Georgi Gerganov	d6fb3f6b49	kv-cache : fix k-shift for multiple streams (#14742 ) ggml-ci b5923	2025-07-17 20:52:33 +03:00
Georgi Gerganov	01612b7409	llama : reuse compute graphs (#14482 ) * llama : reuse compute graphs ggml-ci * llama-bench : add graph reuse parameter ggml-ci * cont : remove the parameter and the sched resets ggml-ci * graph : rename update() to can_reuse() ggml-ci * params : remove is_same() ggml-ci * graph : set res->params in llm_graph_context constructor ggml-ci * graph : avoid set_max_nodes in llm_graph_result ggml-ci * kv-cache : reuse llama_context's graph result instance ggml-ci * context : reset the previous graph result upon memory updates ggml-ci * batch : llama_ubatch now carries its data instead of pointing to balloc ggml-ci * merge : fix build ggml-ci * graph : fix can_reuse() checks when flash-attention is disabled * graph : move llm_graph_result impl in source file + debug env ggml-ci b5922	2025-07-17 19:08:33 +03:00
Tarek Dakhran	086cf81e88	llama : fix parallel processing for lfm2 (#14705 ) b5921	2025-07-17 09:22:11 +02:00
Georgi Gerganov	d9b691081c	kv-cache : opt mask set input (#14600 ) ggml-ci b5920	2025-07-17 09:49:15 +03:00
Georgi Gerganov	ad57d3edd2	batch : fix uninitialized has_cpl flag (#14733 ) ggml-ci b5919	2025-07-17 09:45:54 +03:00


5968 Commits