llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-07-28 13:20:27 -04:00

Author	SHA1	Message	Date
Xuan Son Nguyen	45abe0f74e	server : replace behave with pytest (#10416) * server : replace behave with pytest * fix test on windows * misc * add more tests * more tests * styling * log less, fix embd test * added all sequential tests * fix coding style * fix save slot test * add parallel completion test * fix parallel test * remove feature files * update test docs * no cache_prompt for some tests * add test_cache_vs_nocache_prompt	2024-11-26 16:20:18 +01:00
Georgi Gerganov	ab96610b1e	cmake : enable warnings in llama (#10474) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-11-26 14:18:08 +02:00
Georgi Gerganov	84e1c33cde	server : fix parallel speculative decoding (#10513) ggml-ci	2024-11-26 13:36:40 +02:00
Georgi Gerganov	811872a59d	speculative : simplify the implementation (#10504) ggml-ci	2024-11-26 12:29:38 +02:00
Eric Curtin	0cc63754b8	Introduce llama-run (#10291) It's like simple-chat but it uses smart pointers to avoid manual memory cleanups. Less memory leaks in the code now. Avoid printing multiple dots. Split code into smaller functions. Uses no exception handling. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-11-25 22:56:24 +01:00
Georgi Gerganov	47f931c8f9	server : enable cache_prompt by default (#10501) ggml-ci	2024-11-25 21:50:07 +02:00
Diego Devesa	10bce0450f	llama : accept a list of devices to use to offload a model (#10497) * llama : accept a list of devices to use to offload a model * accept `--dev none` to completely disable offloading * fix dev list with dl backends * rename env parameter to LLAMA_ARG_DEVICE for consistency	2024-11-25 19:30:06 +01:00
brucepro	a9a678a6b2	Add download chat feature to server chat (#10481) * Add download chat feature to server chat Add a download feature next to the delete chat feature in the server vue chat interface. * code style --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-25 17:11:55 +01:00
Georgi Gerganov	9ca2e67762	server : add speculative decoding support (#10455) * server : add speculative decoding support ggml-ci * server : add helper function slot.can_speculate() ggml-ci	2024-11-25 16:31:38 +02:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Georgi Gerganov	d9d54e498d	speculative : refactor and add a simpler example (#10362) * speculative : refactor and add a simpler example ggml-ci * speculative : clean-up and add comments and TODOs [no ci] * speculative : manage context in common_speculative ggml-ci * speculative : simplify ggml-ci * speculative : simplify (cont) ggml-ci * speculative : add --draft-min CLI arg * speculative : minor fixup * make : build fixes * speculative : do not redraft previous drafts ggml-ci * speculative : fix the draft sampling ggml-ci * speculative : fix compile warning * common : refactor args ggml-ci * common : change defaults [no ci] * common : final touches ggml-ci	2024-11-25 09:58:41 +02:00
Diego Devesa	fab5d30ff6	llama : add .clang-format file (#10415)	2024-11-20 12:57:53 +01:00
Johannes Gäßler	4e54be0ec6	llama/ex: remove --logdir argument (#10339)	2024-11-16 23:00:41 +01:00
MaggotHATE	bcdb7a2386	server: (web UI) Add samplers sequence customization (#10255) * Samplers sequence: simplified and input field. * Removed unused function * Modify and use `settings-modal-short-input` * rename "name" --> "label" --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-16 14:26:54 +01:00
Xuan Son Nguyen	9901068ac7	server : (web UI) add copy button for code block, fix api key (#10242) * server : (web ui) add copy btn for code blocks * fix problem with api key * use settings-modal-short-input component * always show copy btn for code snippet	2024-11-15 10:48:49 +01:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
Georgi Gerganov	2a82891a85	speculative : fix out-of-bounds access (#10289)	2024-11-14 11:44:15 +02:00
Alexey Parfenov	ff7fb670d0	server : add missing docs (#10269)	2024-11-13 13:16:30 +02:00
Jhen-Jie Hong	0e712a5acb	server : fix incorrect res in validate_model_chat_template (#10272) * server : fix validate_model_chat_template * server : fix chat res	2024-11-13 13:15:23 +02:00
Brian	a0ec17b32e	metadata: Detailed Dataset Authorship Metadata (#8875) Converter script can now read these two fields as a detailed base model and dataset source. This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed. - base_model_sources (List[dict], optional) - dataset_sources (List[dict], optional) Dataset now represented as: - general.dataset.count - general.dataset.{id}.name - general.dataset.{id}.author - general.dataset.{id}.version - general.dataset.{id}.organization - general.dataset.{id}.description - general.dataset.{id}.url - general.dataset.{id}.doi - general.dataset.{id}.uuid - general.dataset.{id}.repo_url This also adds to base model these metadata: - general.base_model.{id}.description	2024-11-13 21:10:38 +11:00
Georgi Gerganov	b141e5f6ef	server : enable KV cache defrag by default (#10233) ggml-ci	2024-11-11 08:38:43 +02:00
MaggotHATE	505f33274d	server : (web UI) Add back sampler settings (#10239) * Add back samplers to server * Added tooltips with basic information * Fixed stretching of input fields. * use component for settings input, move help msg to tooltips --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2024-11-10 15:42:25 -04:00
haopeng	8fc393f246	scripts : fix pattern and get n_tokens in one go (#10221)	2024-11-09 09:06:54 +02:00
Georgi Gerganov	841f27abdb	metal : optimize FA kernels (#10171) * ggml : add ggml_flash_attn_ext_get_prec * metal : use F16 precision in FA kernels ggml-ci * metal : minor clean-up * metal : compile-guard bf16 FA kernels ggml-ci * build : remove obsolete compile flag [no ci] * metal : prevent int overflows [no ci] * cuda : disable BF16 FA ggml-ci * metal : fix BF16 requirement for FA kernels ggml-ci * make : clean-up [no ci]	2024-11-08 13:47:22 +02:00
Xuan Son Nguyen	76c6e7f105	server : minor UI fix (#10207)	2024-11-07 18:44:38 -04:00
Xuan Son Nguyen	a71d81cf8c	server : revamp chat UI with vuejs and daisyui (#10175) * server : simple chat UI with vuejs and daisyui * move old files to legacy folder * embed deps into binary * basic markdown support * add conversation history, save to localStorage * fix bg-base classes * save theme preferences * fix tests * regenerate, edit, copy buttons * small fixes * docs: how to use legacy ui * better error handling * make CORS preflight more explicit * add GET method for CORS * fix tests * clean up a bit * better auto scroll * small fixes * use collapse-arrow * fix closeAndSaveConfigDialog * small fix * remove console.log * fix style for <pre> element * lighter bubble color (less distract when reading)	2024-11-07 17:31:10 -04:00
Georgi Gerganov	b11f9ba9b8	server : remove hack for extra parallel slot (#10187) ggml-ci	2024-11-06 13:29:01 +02:00
Xuan Son Nguyen	9e0ecfb697	server : clarify /slots endpoint, add is_processing (#10162) * server : clarify /slots endpoint, add is_processing * fix tests	2024-11-04 16:33:29 +01:00
Diego Devesa	9f40989351	ggml : move CPU backend to a separate file (#10144)	2024-11-03 19:34:08 +01:00
sasha0552	42cadc74bd	server : fix slot selection by lru (#10126) * server : fix slot selection by lru, migrate lcs to `size_t` * minor debug log fix	2024-11-02 18:34:56 +02:00
Georgi Gerganov	45950415ed	server : fix endpoint checks (#10135) ggml-ci	2024-11-02 18:34:00 +02:00
Diego Devesa	b634f8a26f	simple-chat : only add bos on first prompt (#10129)	2024-11-02 13:08:53 +01:00
Diego Devesa	a6744e43e8	llama : add simple-chat example (#10124) * llama : add simple-chat example --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-11-01 23:50:59 +01:00
sasha0552	d865d1478c	server : fix smart selection of available slot (#10120) * Fix smart selection of available slot * minor fix * replace vectors of tokens with shorthands	2024-11-01 14:33:14 +01:00
Kevin Gibbons	0a683e8088	server : include scheme when printing URL (#10106)	2024-10-31 14:02:35 +01:00
Rich Dougherty	6763f713bb	readme : more lora detail in main example readme (#10064)	2024-10-30 13:22:39 +01:00
Diego Devesa	c5b0f4b5d9	llama : refactor model loader with backend registry (#10026)	2024-10-30 02:01:23 +01:00
Georgi Gerganov	8d8ff71536	llama : remove Tail-Free sampling (#10071) ggml-ci	2024-10-29 10:42:05 +02:00
Georgi Gerganov	8125e6cbfc	server : don't overfill the batch during infill (#10018) ggml-ci	2024-10-28 08:49:32 +02:00
wwoodsTM	ff252ea48e	llama : add DRY sampler (#9702) * sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>	2024-10-25 19:07:34 +03:00
Michael Podvitskiy	d80fb71f8b	llama: string_split fix (#10022) * llama: Refactor string_split to use template specialization, fixes parsing strings with spaces * llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string	2024-10-25 17:57:54 +02:00
Georgi Gerganov	bc5ba007b2	server : check that the prompt fits in the slot's context (#10030) ggml-ci	2024-10-25 10:13:46 +03:00
Xuan Son Nguyen	958367bf53	server : refactor slot input data, move tokenizer to HTTP thread (#10023) * server : refactor slot input data, move tokenizer to HTTP thread * move prompt_tokens.empty() check * fix incorrect if branch * fix infinite generation loop * bring back infill validation * add infill test * try fixing format_infill * fix test * remove redundant code * rename completion to inference * update docs * use llama_tokens everywhere	2024-10-24 21:51:22 +02:00
wwoodsTM	0a1c750c80	server : samplers accept the prompt correctly (#10019)	2024-10-23 22:27:51 +03:00
Georgi Gerganov	2d3aba9ee8	llama.vim : bump generation time limit to 3s [no ci]	2024-10-23 17:16:56 +03:00
Michael Coppola	ac113a0fee	llama.vim : add classic vim support (#9995) * added classic vim support * fixed ring update, removed blank line * minor * minor * minor doc update * removed uneeded var * minor * minor * fixed job_start creating new scratch buffers * fixed job_start creating new scratch buffers * fixed ghost text indenting when expandtab is on * removed unused code * minor * unified fim_on_exit * minor * vim ghost text rendering now uses pos_x and pos_y parameters * renamed _hlgroup to hlgroup_ * renamed _ghost_text to ghost_text_, moved nvim/vim detection to llama#init() * minor --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-10-23 14:09:26 +03:00
Georgi Gerganov	e94a138d64	llama.vim : fix info text display [no ci] (#9787)	2024-10-22 00:37:55 +03:00
Georgi Gerganov	e01c67affe	llama.vim : move info to the right of screen [no ci] (#9787) 'eol' messes up the rendering with nvim v0.10.2 for some reason	2024-10-21 22:53:18 +03:00
Georgi Gerganov	dbd5f2f573	llama.vim : plugin for Neovim (#9787)	2024-10-21 20:25:02 +03:00
Georgi Gerganov	55e47786e3	llama : default sampling changes + greedy update (#9897) * llama : deprecate softmax sampler + fix dist sampler ggml-ci * tests : replace macros with functions ggml-ci * sampling : change temperature sampler logic For t <= 0.0f, keep the max logit intact and set the rest to -inf * cont : no need for special "greedy" logic top-k == 1 is the same * tests : init prob correctly * llama : handle temp <= 0.0 in the temp_ext sampler too ggml-ci * cont : avoid extra loop in temperature sampler for sub-zero temp ggml-ci	2024-10-21 09:46:40 +03:00

1477 Commits