Commit Graph

5549 Commits

199a838422 threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
We discussed adding a LOW priority for GGML threads in the original threadpool PR;
it can be useful in some cases to avoid contention.

The latest Windows ARM64 releases started parking (offlining) CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that, we now disable Power Throttling for our threads at NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-30 17:15:38 -07:00
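For reference, a minimal sketch of how a thread can opt out of Power Throttling (EcoQoS) on Windows via the documented `SetThreadInformation` API; how this is wired into the ggml threadpool is not shown here, and the helper name is illustrative.

```cpp
// Sketch only: disable Power Throttling for the calling thread on Windows.
#if defined(_WIN32)
#include <windows.h>

static bool disable_power_throttling_for_current_thread() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED; // take control of throttling
    state.StateMask   = 0;                                       // 0 = do not throttle this thread

    return SetThreadInformation(GetCurrentThread(),
                                ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
}
#endif
```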
e562eece7c CUDA: fix typo in FlashAttention code (#13926) b5548 2025-05-30 21:22:03 +02:00
b47ab7b8e9 sched : avoid changing cur_copy when a graph is already allocated (#13922) b5547 2025-05-30 18:56:19 +02:00
dd665cc9d4 parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
b5546
2025-05-30 19:38:07 +03:00
df0c0c7d02 cuda : prevent using split buffers with 3d/4d matrices (#13919) b5545 2025-05-30 16:37:18 +02:00
b49a8ff96b SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
b5544
2025-05-30 19:40:57 +05:30
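To illustrate the paired load/store idea from the commit above, here is a small self-contained sketch, not the actual rope kernel: the buffer name and the scaling stand in for the real rotation math, and each work-item touches two adjacent floats through `sycl::vec<float, 2>`.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    const size_t n = 1024;                               // element count, assumed even
    float * x = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) x[i] = float(i);

    q.parallel_for(sycl::range<1>(n / 2), [=](sycl::id<1> i) {
        auto * v2 = reinterpret_cast<sycl::vec<float, 2> *>(x);
        sycl::vec<float, 2> v = v2[i];   // one access loads two adjacent elements
        v = v * 2.0f;                    // stand-in for the actual rotation math
        v2[i] = v;                       // one access stores both lanes back
    }).wait();

    sycl::free(x, q);
    return 0;
}
```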
53f925074d sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
b5543
2025-05-30 16:25:45 +03:00
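The switch to `json_fwd.hpp` follows the usual nlohmann/json pattern: headers only forward-declare the JSON type, and only translation units that actually construct JSON values include the full `json.hpp`. A hedged sketch with illustrative file and function names:

```cpp
// settings.h -- illustrative header: the forward declaration is all that is needed here
#pragma once
#include <nlohmann/json_fwd.hpp>
#include <string>

std::string describe(const nlohmann::json & params);

// settings.cpp -- only this translation unit pays for the full json.hpp include
#include <nlohmann/json.hpp>

std::string describe(const nlohmann::json & params) {
    return params.dump(2); // pretty-print with 2-space indent
}
```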
db38704f01 convert : fix rwkv bos/eos token (#13844) 2025-05-30 14:50:43 +02:00
07e4351ce6 convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
b5541
2025-05-30 12:24:37 +02:00
291f2b6913 llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5540
2025-05-30 11:56:02 +02:00
2c90da4c7e llama : use llm_build_granite for minicpm (#13911) b5539 2025-05-30 10:31:48 +02:00
ec9e0301fe cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) b5538 2025-05-30 01:28:54 +02:00
e83ba3e460 llama : add support for jina-reranker-v2 (#13900) b5537 2025-05-29 21:42:31 +02:00
2b131621e6 gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561) gguf-v0.17.0 2025-05-29 15:36:05 +02:00
54a2c7a8cd arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves the q4_k_q8_k GEMM kernel with the arm64 i8mm instruction.

Tested on Neoverse-N2 with a Llama 3 8B Q4_K_M quantized model:
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch sizes 4 and above

Perplexity does not change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
b5535
2025-05-29 14:39:20 +03:00
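As background on what i8mm provides, a minimal sketch of the `smmla` building block via the `vmmlaq_s32` intrinsic; this is not the actual q4_k_q8_k kernel, and the array names and build flag are illustrative. One instruction accumulates a 2x2 int32 tile from two 2x8 int8 operands, the second treated as transposed.

```cpp
// Requires a compiler targeting i8mm, e.g. -march=armv8.6-a+i8mm
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    // a, b: 2x8 int8 matrices, row-major within each 16-byte vector
    int8_t a[16], b[16];
    for (int i = 0; i < 16; ++i) { a[i] = int8_t(i); b[i] = int8_t(1); }

    int32x4_t acc = vdupq_n_s32(0);                    // 2x2 int32 accumulator tile
    acc = vmmlaq_s32(acc, vld1q_s8(a), vld1q_s8(b));   // acc += a(2x8) * b(2x8)^T

    int32_t c[4];
    vst1q_s32(c, acc);
    // c = { dot(a0,b0), dot(a0,b1), dot(a1,b0), dot(a1,b1) }
    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```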
21fcc21ad5 cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
b5534
2025-05-29 12:50:25 +02:00
dd8ba93416 ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5533
2025-05-29 12:18:43 +03:00
66c92061f5 tests : remove json.hpp from a test (#13880)
ggml-ci
b5532
2025-05-29 12:17:16 +03:00
5ca82fc1d7 convert : workaround for AutoConfig dummy labels (#13881) 2025-05-29 10:00:57 +02:00
6385b843a8 llama : add RobertaForSequenceClassification reranker support (#13875) b5530 2025-05-29 08:15:01 +02:00
1b8fb8152d ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
b5529
2025-05-29 09:01:33 +03:00
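For context, a minimal sketch of an SVE F32 vector loop — a simple `y += scale * x`, illustrative rather than the actual ggml kernels — using predication so the tail iteration needs no scalar fallback:

```cpp
// Build for an SVE target, e.g. -march=armv8-a+sve
#include <arm_sve.h>
#include <cstdio>

// y[i] += scale * x[i] for i in [0, n), one predicated SVE step per iteration
static void vec_mad_f32(float * y, const float * x, float scale, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t    pg = svwhilelt_b32(i, n);          // active lanes for this step
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, scale);         // vy += vx * scale
        svst1_f32(pg, y + i, vy);
    }
}

int main() {
    float x[100], y[100];
    for (int i = 0; i < 100; ++i) { x[i] = 1.0f; y[i] = float(i); }
    vec_mad_f32(y, x, 0.5f, 100);
    printf("%f %f\n", y[0], y[99]);
    return 0;
}
```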
53ae30640e gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841) 2025-05-28 23:50:20 +02:00
763d06edb7 llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl

* add ref to the PR
b5527
2025-05-28 22:35:31 +02:00
10961339b2 mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
b5526
2025-05-28 22:35:22 +02:00
d98f2a35fc ci: disable LLAMA_CURL for Linux cross-builds (#13871) 2025-05-28 15:46:47 -03:00
e0e3aa231d llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5524
2025-05-28 19:01:58 +02:00
aa6dff05be convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
c962ae3382 server: correctly remove 'image_url'/'input_audio' JSON objects from 'llama_params' in multimodal-model mode (#13853)
[fix]: correctly remove 'image_url'/'input_audio' from 'llama_params' in multimodal-model mode
b5522
2025-05-28 16:33:54 +02:00
a3938fb53d convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion

* fix typo
2025-05-28 16:12:35 +02:00
f7873fc698 tests : change umlaut test (#11600) 2025-05-28 15:49:28 +02:00
a68247439b CUDA: fix FA tg at long context for CC >= 8.9 (#13852) b5519 2025-05-28 13:33:37 +02:00
26b79b6cb3 convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision

* add comment
2025-05-28 10:05:54 +02:00
1e8659e65a CANN: Add SOC TYPE printing in cmake configuration (#13837) b5517 2025-05-28 11:54:20 +08:00
a3c30846e4 opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm (#13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
b5516
2025-05-27 12:56:08 -07:00
1701d4c54f opencl: mark mul_mat f32f32 as supporting non-contiguous tensors (#13790) b5515 2025-05-27 12:53:14 -07:00
bef8176387 vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than a CMake flag
b5514
2025-05-27 18:39:07 +02:00
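Moving the control from a CMake flag to an env var means the check happens at runtime; a hedged sketch of the gating pattern (the actual parsing in the Vulkan backend may differ):

```cpp
#include <cstdlib>

// Returns true when GGML_VULKAN_PERF is set to a non-empty, non-"0" value.
// Illustrative only; the real backend may interpret the variable differently.
static bool vk_perf_logging_enabled() {
    const char * v = std::getenv("GGML_VULKAN_PERF");
    return v != nullptr && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}
```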
34b7c0439e cmake : add llama-cparams.cpp to build (#13832) b5513 2025-05-27 19:08:44 +03:00
f3101a8cc6 SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
b5512
2025-05-27 20:52:59 +05:30
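The exact (erf-based) GELU this kernel computes is GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))); a scalar C++ reference for sanity-checking the SYCL kernel output:

```cpp
#include <cmath>
#include <cstdio>

// Exact GELU using the error function: 0.5 * x * (1 + erf(x / sqrt(2)))
static float gelu_erf(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); // 0.70710678 ~= 1/sqrt(2)
}

int main() {
    for (float x : {-2.0f, -1.0f, 0.0f, 1.0f, 2.0f}) {
        std::printf("gelu_erf(% .1f) = % .6f\n", x, gelu_erf(x));
    }
    return 0;
}
```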
1c49c70d07 sync : ggml 2025-05-27 18:05:33 +03:00
a8ea03d8ad ggml : add ggml_repeat_4d (#13824) b5510 2025-05-27 15:53:55 +02:00
05f6ac6283 ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
b5509
2025-05-27 16:21:36 +03:00
bc583e3c63 mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* more strict validate of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
b5508
2025-05-27 14:06:10 +02:00
72b090da2c docs: remove link for llama-cli function calling (#13810) 2025-05-27 08:52:40 -03:00
7fe03e7446 ggml-cpu: x86 feature detection is specific to x86 (#13811) b5506 2025-05-27 13:18:39 +02:00
952f3953c1 ggml : allow CUDA graphs when using pipeline parallelism (#13814) b5505 2025-05-27 13:05:18 +02:00
81713121ee kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions

ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

ggml-ci

* kv-cells : add comments

ggml-ci
b5504
2025-05-27 13:49:41 +03:00
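A hedged sketch of the bookkeeping idea with hypothetical names, not the actual llama.cpp structures: keep a running min/max position per sequence so shifts and pruning can be bounded without scanning every cell.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Hypothetical illustration of per-sequence position tracking; the real
// llama.cpp kv-cells bookkeeping differs in layout and naming.
struct seq_pos_range {
    int32_t pos_min = INT32_MAX;
    int32_t pos_max = -1;
};

struct seq_pos_tracker {
    std::map<int32_t, seq_pos_range> ranges; // seq_id -> observed position range

    void on_cell_used(int32_t seq_id, int32_t pos) {
        auto & r = ranges[seq_id];
        r.pos_min = std::min(r.pos_min, pos);
        r.pos_max = std::max(r.pos_max, pos);
    }

    // After a position shift of `delta`, the whole tracked range moves with it.
    void on_seq_shift(int32_t seq_id, int32_t delta) {
        auto it = ranges.find(seq_id);
        if (it != ranges.end()) {
            it->second.pos_min += delta;
            it->second.pos_max += delta;
        }
    }
};
```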
f9cd68398b sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
b5503
2025-05-27 12:07:52 +03:00
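An illustrative min-p filter (not llama.cpp's actual sampler code; names are made up) that honors the "at least one token" guarantee described above, even when min_keep == 0:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct token_prob { int id; float p; };

// Keep tokens with p >= min_p * max_p, but never drop everything:
// the top token always survives, regardless of min_keep.
static void min_p_filter(std::vector<token_prob> & cand, float min_p, size_t min_keep) {
    if (cand.empty()) return;
    std::sort(cand.begin(), cand.end(),
              [](const token_prob & a, const token_prob & b) { return a.p > b.p; });

    const float threshold = min_p * cand.front().p;
    size_t keep = 1;                                   // guarantee at least one token
    while (keep < cand.size() && (cand[keep].p >= threshold || keep < min_keep)) {
        ++keep;
    }
    cand.resize(keep);
}
```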
4f81b33e32 llama : validate seq id batch input (#13809)
* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci
b5502
2025-05-27 09:40:59 +03:00
cdf94a1802 server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5501
2025-05-26 22:34:27 +01:00
a26c4cc11e scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00