Commit Graph

4544 Commits

SHA1 Message Date
4decf2c4df TMP : push artifacts 2025-01-24 14:54:24 +02:00
3a35bfe1f7 cmake : put libs in /bin 2025-01-24 14:42:46 +02:00
ff4cb6ef4c release : pack /lib and /include in the packages 2025-01-24 13:28:37 +02:00
01f37edf1a Update llama-run README.md (#11386)
For consistency

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-24 09:39:24 +00:00
c07e87f38b server : (webui) put DeepSeek R1 CoT in a collapsible <details> element (#11364)
* webui : put DeepSeek R1 CoT in a collapsible <details> element

* webui: refactor split

* webui: don't use regex to split cot and response

* webui: format+qol

* webui: no loading icon if the model isn't generating

* ui fix, add configs

* add jsdoc types

* only filter </think> for assistant msg

* build

* update build

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-24 09:02:38 +01:00
564804b79b tests: fix some mul_mat test gaps (#11375)
Now that we have batched mat-vec mul Vulkan shaders for up to n==8,
these tests weren't actually exercising the mat-mat mul path. Test
n==9 as well. Also, change to use all_types.
b4539
2025-01-23 14:51:24 -06:00
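The reasoning behind the n==9 case, sketched schematically: the Vulkan backend picks a shader family based on the batch width n, so a test sweep that never exceeds n==8 only ever hits the batched mat-vec shaders. The dispatch below is illustrative (the names are hypothetical, not the actual backend code):

```cpp
// Schematic of the dispatch this test targets: batched mat-vec shaders
// cover n (number of result columns) up to 8, so a sweep stopping at
// n == 8 never reaches the mat-mat path. Names are illustrative.
static const char * pick_mul_mat_path(int n) {
    constexpr int mat_vec_max_n = 8; // batched mat-vec shaders exist for n <= 8
    return n <= mat_vec_max_n ? "batched mat-vec" : "mat-mat";
}
// pick_mul_mat_path(8) -> "batched mat-vec"; pick_mul_mat_path(9) -> "mat-mat",
// which is why n == 9 had to be added to the tests.
```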
05f63cc9ee Update documentation (#11373)
To show that -n, -ngl, and --ngl are all acceptable.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4538
2025-01-23 20:04:31 +00:00
f7fb43cd0b Add -ngl (#11372)
Most other llama.cpp cli tools accept -ngl with a single dash.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4537
2025-01-23 16:16:18 +00:00
5845661640 server : add more clean up when cancel_tasks is called (#11340)
* server : add more clean up when cancel_tasks is called

* fix recv_with_timeout

* std::remove_if

* fix std::remove_if
b4536
2025-01-23 13:56:05 +01:00
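The std::remove_if fixes in this change point at the classic erase-remove pitfall: std::remove_if only shuffles the kept elements to the front and returns the new logical end, so the container must still be erased up to that iterator. A minimal sketch of the correct pattern, using a hypothetical task type rather than the server's actual one:

```cpp
#include <algorithm>
#include <vector>

struct server_task { int id; };

// Remove all queued tasks matching a cancelled id. std::remove_if alone
// does NOT shrink the vector -- it returns an iterator to the new logical
// end, which must then be passed to erase().
static void cancel_tasks(std::vector<server_task> & queue, int id_target) {
    queue.erase(
        std::remove_if(queue.begin(), queue.end(),
            [id_target](const server_task & t) { return t.id == id_target; }),
        queue.end());
}
```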
f211d1dc10 Treat hf.co/ prefix the same as hf:// (#11350)
ollama uses the hf.co/ prefix to specify Hugging Face models, like RamaLama
uses hf://.

Treat them similarly.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4535
2025-01-23 10:38:20 +00:00
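A hedged sketch of this kind of prefix normalization, with an illustrative function name rather than the actual llama.cpp code:

```cpp
#include <string>

// Map both "hf://<repo>" (RamaLama style) and "hf.co/<repo>" (ollama style)
// to the same canonical "<repo>" form. Illustrative only.
static std::string strip_hf_prefix(const std::string & url) {
    const std::string prefixes[] = { "hf://", "hf.co/", "https://hf.co/" };
    for (const auto & p : prefixes) {
        if (url.rfind(p, 0) == 0) { // starts_with
            return url.substr(p.size());
        }
    }
    return url;
}
```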
955a6c2d91 Vulkan-run-test: fix mmq_wg_denoms (#11343)
There was a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of
wg_denoms.
b4534
2025-01-23 08:14:28 +01:00
1971adf55e vulkan: sort shaders for more deterministic binary (#11315)
Fixes #11306.
b4533
2025-01-23 08:07:50 +01:00
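The determinism fix comes down to ordering: emitting embedded shaders in filesystem or thread-completion order makes consecutive builds differ byte-for-byte. A minimal illustration of sorting by name before code generation (the struct and function here are hypothetical):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct shader_entry { std::string name; std::vector<uint8_t> spirv; };

// Sorting by name gives a stable emission order, so the generated
// header (and the final binary) is reproducible across builds.
static void sort_shaders(std::vector<shader_entry> & shaders) {
    std::sort(shaders.begin(), shaders.end(),
              [](const shader_entry & a, const shader_entry & b) {
                  return a.name < b.name;
              });
}
```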
5245729e33 vulkan: fix diag_mask_inf (#11323)
With robustBufferAccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgroup dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
b4532
2025-01-23 08:01:17 +01:00
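For context, diag_mask_inf implements the causal attention mask: every element that would let a position attend to a future token is set to -inf before the softmax. A reference formulation of the op's semantics on a row-major matrix (this shows the math, not the Vulkan shader):

```cpp
#include <cmath>
#include <vector>

// Reference semantics of diag_mask_inf on a row-major [rows x cols]
// matrix: mask out "future" columns so softmax assigns them zero weight.
// n_past shifts the diagonal for cached-prefix decoding.
static void diag_mask_inf(std::vector<float> & m, int rows, int cols, int n_past) {
    for (int r = 0; r < rows; ++r) {
        for (int c = n_past + r + 1; c < cols; ++c) {
            m[r * cols + c] = -INFINITY;
        }
    }
}
```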
6152129d05 main : update README documentation for batch size (#11353)
* main : update README documentation for batch size

* fix formatting

* minor
2025-01-22 19:22:20 +01:00
16d3df7ab0 readme : add plugin links (#11355) 2025-01-22 19:44:26 +02:00
12c2bdf2de server : fix draft context not being released (#11354) b4529 2025-01-22 17:44:40 +01:00
c64d2becb1 minja: sync at 0f5f7f2b37 (#11352) b4528 2025-01-22 16:16:27 +00:00
96f4053934 Adding logprobs to /v1/completions (#11344)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
b4527
2025-01-22 12:51:32 +01:00
a94f3b2727 common: utils to split / join / repeat strings (from json converter) (#11342)
* Factor string_join, string_split, string_repeat into common

* json: refactor to surface a versatile builder

* Update common.cpp
b4526
2025-01-22 09:51:44 +00:00
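A hedged sketch of what such helpers typically look like; the signatures are illustrative and may differ from the versions that landed in common:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split on a single-character separator.
static std::vector<std::string> string_split(const std::string & s, char sep) {
    std::vector<std::string> parts;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, sep)) {
        parts.push_back(item);
    }
    return parts;
}

// Join with a separator string.
static std::string string_join(const std::vector<std::string> & parts, const std::string & sep) {
    std::string out;
    for (size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) out += sep;
        out += parts[i];
    }
    return out;
}

// Repeat a string n times.
static std::string string_repeat(const std::string & s, size_t n) {
    std::string out;
    out.reserve(s.size() * n);
    for (size_t i = 0; i < n; ++i) out += s;
    return out;
}
```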
3e3357fd77 llava : support Minicpm-omni (#11289)
* init

* add readme

* update readme

* don't use make

* update readme

* update fix code

* fix editorconfig-checker

* don't change convert py

* use clip_image_u8_free
b4525
2025-01-22 09:35:48 +02:00
6171c9d258 Add Jinja template support (#11016)
* Copy minja from 58f0ca6dd7

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (https://github.com/google/minja/pull/22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to b8437df626

* Update minja to https://github.com/google/minja/pull/25

* Update minja from https://github.com/google/minja/pull/27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4524
2025-01-21 13:18:51 +00:00
e28245f35f export-lora : fix tok_embd tensor (#11330) b4523 2025-01-21 14:07:12 +01:00
6da5bec81c rpc : better caching of the base buffer pointer (#11331)
There is no need to use a map; just store the base pointer in the buffer
context.
b4522
2025-01-21 15:06:41 +02:00
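A schematic before/after of the caching change, using illustrative types rather than the actual RPC backend structs:

```cpp
#include <cstdint>

// Before: every access paid a hash-map lookup, e.g.
//   std::unordered_map<uint64_t, void *> base_ptrs; // buffer id -> base
// After: the base pointer is resolved once when the buffer is created and
// stored directly in the buffer's own context, so reads become a plain
// member access.
struct rpc_buffer_context {
    uint64_t remote_handle; // identifies the buffer on the RPC server
    void *   base_ptr;      // cached once at buffer creation
};
```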
2e2f8f093c linenoise.cpp refactoring (#11301)
More RAII mainly

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4521
2025-01-21 09:32:35 +00:00
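RAII here means tying cleanup (restoring the terminal state, freeing history buffers) to object lifetime rather than to manual calls on every exit path. A generic POSIX sketch of the pattern, illustrative rather than the actual linenoise.cpp code:

```cpp
#include <termios.h>
#include <unistd.h>

// Put the terminal into raw mode for the lifetime of the object and
// guarantee restoration on every exit path, including exceptions.
// cfmakeraw is a common BSD/glibc extension.
class raw_mode_guard {
    termios saved_{};
    bool    active_ = false;
public:
    raw_mode_guard() {
        if (tcgetattr(STDIN_FILENO, &saved_) == 0) {
            termios raw = saved_;
            cfmakeraw(&raw);
            active_ = tcsetattr(STDIN_FILENO, TCSAFLUSH, &raw) == 0;
        }
    }
    ~raw_mode_guard() {
        if (active_) tcsetattr(STDIN_FILENO, TCSAFLUSH, &saved_);
    }
    raw_mode_guard(const raw_mode_guard &) = delete;
    raw_mode_guard & operator=(const raw_mode_guard &) = delete;
};
```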
2139667ec4 metal : fix out-of-bounds write (#11314)
ggml-ci
b4520
2025-01-21 08:48:13 +02:00
80d0d6b4b7 common : add -hfd option for the draft model (#11318)
* common : add -hfd option for the draft model

* cont : fix env var

* cont : more fixes
b4519
2025-01-20 22:29:43 +02:00
aea8ddd516 vulkan: fix coopmat2 validation failures (#11284)
mul mat and flash attention shaders were loading f32 types directly into
A/B matrices, which happens to work but is technically invalid usage. For
FA, we can load it as an Accumulator matrix and convert; this is not in the
inner loop and is cheap enough. For mul mat, it's more efficient to do the
conversion in a separate pass and have the input(s) be f16.

coopmat2 requires SPIR-V 1.6 (related to its use of LocalSizeId). LocalSizeId
requires maintenance4 to be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
b4518
2025-01-20 10:38:32 -06:00
9f7add1cde examples : fix add_special conditions (#11311) 2025-01-20 16:36:08 +02:00
90d987b105 mmap: add include for cerrno (#11296)
ggml-ci

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4516
2025-01-20 16:02:43 +02:00
a4251edd6f cmake: fix shell command quoting in build-info script (#11309) 2025-01-20 16:02:15 +02:00
ec7f3ac9ab llama : add support for Deepseek-R1-Qwen distill model (#11310)
* llama : add support for Deepseek-R1-Qwen distill model

* coding style
b4514
2025-01-20 14:35:07 +01:00
ef6dada60c cont : fix whitespaces (#11305) b4513 2025-01-20 09:29:32 +02:00
ae3c1db2f9 llama : re-add LLM_ARCH_PHIMOE (#11305)
Phi 3.5 MoE was partially removed during a refactor. The code was originally in llama.cpp and should be in llama-model.cpp after the refactor.
b4512
2025-01-20 09:21:01 +02:00
92bc493917 tests : increase timeout when sanitizers are enabled (#11300)
* tests : increase timeout when sanitizers are enabled

* tests : add DEFAULT_HTTP_TIMEOUT
2025-01-19 20:22:30 +02:00
b9daaffe02 simple-chat : fix BOS being added to each message (#11278) b4510 2025-01-19 18:12:09 +02:00
99487b57d4 SYCL: Introducing memory host pool (#11251)
* Implement host pool for matrix_info

Creates a new memory pool on the host to store the memory locations of the
matrix_info structs needed to launch gemm_batch from oneMKL/oneMath. Also
removes complex-number support from gemm_batch, since it is not used in
llama.cpp.

* Remove unnecessary headers and cast

* Reorder member variable to avoid warning on initialization

* Formatting

* Remove unused variable

* Address PR review feedback - remove warning

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b4509
2025-01-19 21:33:34 +08:00
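The core idea is to allocate one reusable host buffer up front instead of allocating per gemm_batch launch. A minimal, library-agnostic sketch of such a pool (illustrative only, not the SYCL backend's implementation):

```cpp
#include <cstddef>
#include <vector>

// Grow-only host pool: repeated launches reuse one allocation instead of
// malloc/free-ing per-call metadata (e.g. matrix_info arrays).
class host_pool {
    std::vector<char> buf_;
public:
    void * get(size_t bytes) {
        if (buf_.size() < bytes) buf_.resize(bytes);
        return buf_.data();
    }
};
```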
a1649cc13f Adding linenoise.cpp to llama-run (#11252)
This is a fork of linenoise that is C++17 compatible. I intend to add it
to llama-run so we can do things like traverse prompt history via the up
and down arrows:

https://github.com/ericcurtin/linenoise.cpp

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4508
2025-01-18 14:42:31 +00:00
4dd34ff831 cmake : add sanitizer flags for llama.cpp (#11279)
* cmake : add sanitizer flags for llama.cpp

ggml-ci

* tests : fix compile warnings

ggml-ci

* cmake : move sanitizer flags to llama_add_compile_flags

ggml-ci

* cmake : move llama.cpp compile flags to top level lists

ggml-ci

* cmake : apply only sanitizer flags at top level

ggml-ci

* tests : fix gguf context use in same_tensor_data

* gguf-test: tensor data comparison

* dummy : trigger ggml-ci

* unicode : silence gcc warnings

ggml-ci

* ci : use sanitizer builds only in Debug mode

ggml-ci

* cmake : add status messages [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-01-18 16:18:15 +02:00
f30f099228 server : implement cancellable request (#11285)
* server : implement cancellable request

* fix typo

* httplib 0.18.5

* fix i underflow
b4506
2025-01-18 14:12:05 +01:00
f26c874179 scripts : restore hf.sh (#11288)
ggml-ci
2025-01-18 13:18:32 +02:00
6390a998bf tts : add guide tokens support (#11186)
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.

* applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start
b4504
2025-01-18 12:20:57 +02:00
44e18ef939 vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281)
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid
a performance regression.

Add noncontiguous FA tests in test-backend-ops.

Fixes #11268.
b4503
2025-01-18 09:26:50 +01:00
3edfa7d375 llama.android: add field formatChat to control whether to parse special tokens when sending a message (#11270) b4502 2025-01-17 14:57:56 +02:00
667d72846c rpc : early register backend devices (#11262)
Register RPC devices early and do not propagate RPC specifics into the
llama model structures.

ref: #10609
b4501
2025-01-17 10:57:09 +02:00
a133566d34 vocab : fix double-eos check (#11273)
ggml-ci
b4500
2025-01-17 09:28:00 +02:00
960ec65273 llama : fix deprecation message: vocabable -> vocab (#11269) b4499 2025-01-17 08:12:01 +01:00
7a689c415e README : added kalavai to infrastructure list (#11216) 2025-01-17 01:10:49 +01:00
bd38ddea01 vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl

Shaders are based on cpy.cu.

* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32

* ggml: copy q->f32 assumes some contiguity in the destination
b4497
2025-01-16 22:47:10 +01:00
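For reference, the simplest of these targets is q8_0: each block of 32 floats stores one scale d = max|x|/127 plus 32 signed bytes q = round(x/d). A schematic of the per-block math, using a float scale for clarity where the real format stores d as fp16:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32; // values per block

struct block_q8_0_ref { // schematic; the real block stores d as fp16
    float  d;           // scale: amax / 127
    int8_t qs[QK8_0];   // quantized values: round(x / d)
};

static void quantize_block_q8_0(const float * x, block_q8_0_ref & b) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK8_0; ++i) {
        b.qs[i] = (int8_t) std::lround(x[i] * id);
    }
}
```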
466300fe14 vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (#11206)
Do masking on whole dwords, fetch all scales at once.
2025-01-16 22:23:49 +01:00
206bc53422 vulkan: optimize coopmat2 q2_k dequant function (#11130) 2025-01-16 22:16:39 +01:00