llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-07-20 17:49:18 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	6b0a7420d0	llama : KV cache view API + better KV cache management (#4170 ) * llama : keep track of used KV cells + better KV cache management * llama : zero KV cache used upon clear ggml-ci * llama : allow exporting a view of the KV cache (#4180) * Allow exporting a view of the KV cache * Allow dumping the sequences per cell in common * Track max contiguous cells value and position as well * Fix max contiguous empty cells index calculation Make dump functions deal with lengths or sequences counts > 10 better * Fix off by one error in dump_kv_cache_view * Add doc comments for KV cache view functions Eliminate cell sequence struct; use llama_seq_id directly Minor cleanups * common : add -dkvc arg for enabling kv cache dumps --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> b1554	2023-11-23 19:07:56 +02:00
Georgi Gerganov	d103d935c0	readme : update hot topics	2023-11-23 13:51:22 +02:00
Daniel Bevenius	9d5949f04b	examples : fix typo in parallel example doc comment (#4181 ) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> b1552	2023-11-23 13:34:20 +02:00
Georgi Gerganov	ff8238f71d	docs : add llama-star arch idea	2023-11-23 11:35:04 +02:00
Galunid	8e672efe63	stablelm : simplify + speedup generation (#4153 ) b1550	2023-11-21 16:22:30 +01:00
Galunid	0b871f1a04	finetune - update readme to mention llama support only (#4148 )	2023-11-20 19:30:00 +01:00
Aaryaman Vasishta	dfc7cd48b1	readme : update ROCm Windows instructions (#4122 ) * Update README.md * Update README.md Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2023-11-20 17:02:46 +02:00
Seb C	881800d1f0	main : Add ChatML functionality to main example (#4046 ) Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com> b1547	2023-11-20 14:56:59 +01:00
Galunid	f23c0359a3	ci : add flake8 to github actions (python linting) (#4129 ) Disabled rules: * E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned * E211 Whitespace before '(' (E211) - disabled because we often use 'C' Style where values are aligned * E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned * E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard * E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned * E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned * E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard * E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard * E266 Too many leading '#' for block comment - sometimes used as "section" separator * E501 Line too long - disabled because it's broken so often it seems like a standard * E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use # noqa instead) * E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use # noqa instead) b1546	2023-11-20 11:35:47 +01:00
Branden Butler	40a34fe8d0	speculative : fix prompt tokenization in speculative example (#4025 ) * Support special tokens and not adding BOS to prompt in speculative * Adapt to new should_add_bos function * Ensure tgt and dft have same add_bos setting b1545	2023-11-20 11:50:04 +02:00
Georgi Gerganov	dae06c06e5	Revert "finetune : add --n-gpu-layers flag info to --help (#4128 )" This reverts commit `05e8301e45`. b1544	2023-11-19 19:16:07 +02:00
Clark Saben	05e8301e45	finetune : add --n-gpu-layers flag info to --help (#4128 ) b1543	2023-11-19 18:56:38 +02:00
SoftwareRenderer	936c79b227	server : relay error messages (#4131 ) b1542	2023-11-19 18:54:10 +02:00
kchro3	262005ad9d	common : comma should be semicolon (#4137 ) b1541	2023-11-19 18:52:57 +02:00
Georgi Gerganov	35985acffa	gitignore : tokenize	2023-11-19 18:50:49 +02:00
slaren	e937066420	gguf-py : export chat templates (#4125 ) * gguf-py : export chat templates * llama.cpp : escape new lines in gguf kv info prints * gguf-py : bump version * gguf-py : check chat_template type * gguf-py : initialize chat_template b1539	2023-11-19 11:10:52 +01:00
Kerfuffle	28a2e6e7d4	tokenize example: Respect normal add BOS token behavior (#4126 ) Allow building with Makefile b1538	2023-11-18 14:48:17 -07:00
Galunid	0b5c3b0457	scripts : Remove missed baichuan convert script (#4127 )	2023-11-18 21:08:33 +01:00
Kerfuffle	2923f17f6f	Clean up ggml-cuda.cu warnings when compiling with clang (for ROCM) (#4124 ) * ggml-cuda.cu: Clean up warnings when compiling with clang * ggml-cuda.cu: Move static items into anonymous namespace * ggml-cuda.cu: Fix use of namespace start macro * Revert "ggml-cuda.cu: Fix use of namespace start macro" This reverts commit `26c1149026`. * Revert "ggml-cuda.cu: Move static items into anonymous namespace" This reverts commit `e29757e0f7`. b1536	2023-11-18 08:11:18 -07:00
slaren	bbecf3f415	llama : increase max nodes (#4115 ) b1535	2023-11-17 21:39:11 +02:00
Roger Meier	8e9361089d	build : support ppc64le build for make and CMake (#3963 ) * build: support ppc64le build for make and CMake * build: keep __POWER9_VECTOR__ ifdef and extend with __powerpc64__ Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b1534	2023-11-17 18:11:23 +02:00
Georgi Gerganov	5ad387e994	tokenize : fix trailing whitespace b1533	2023-11-17 18:01:38 +02:00
zakkor	2fa02b4b3d	examples : add tokenize (#4039 ) b1532	2023-11-17 17:36:44 +02:00
Don Mahurin	2ab0707acb	convert : use 'model' value if it exists. This allows karpathy/tinyllamas to load (#4089 ) Co-authored-by: Don Mahurin <@>	2023-11-17 17:32:34 +02:00
John	11173c92d6	py : Falcon HF compatibility (#4104 ) Falcon HF compatibility	2023-11-17 17:24:30 +02:00
Jannis Schönleber	9e87ef60e1	common : improve yaml log escaping (#4080 ) * logging: improve escaping in yaml output * logging: include review feedback b1529	2023-11-17 17:24:07 +02:00
Huawei Lin	c7cce1246e	llava : fix compilation warning that fread return value is not used (#4069 ) b1528	2023-11-17 17:22:56 +02:00
Jiří Podivín	f7d5e97542	py : remove superfluous import statements (#4076 ) Signed-off-by: Jiri Podivin <jpodivin@gmail.com> Co-authored-by: Jiri Podivin <jpodivin@redhat.com>	2023-11-17 17:20:53 +02:00
Jiří Podivín	ba4cf5c0bf	train : move number of gpu layers argument parsing to common/train.cpp (#4074 ) - introduces help entry for the argument - cuts '--gpu-layers' form in order to simplify usage and documentation. Signed-off-by: Jiri Podivin <jpodivin@gmail.com> Co-authored-by: Jiri Podivin <jpodivin@redhat.com> b1526	2023-11-17 17:19:16 +02:00
slaren	e85bb1a8e7	llama : add functions to get the model's metadata (#4013 ) * llama : add functions to get the model's metadata * format -> std::to_string * better documentation b1525	2023-11-17 17:17:37 +02:00
gwjr	3e916a07ac	finetune : speed-up ggml_compute_forward_out_prod_f32 via BLAS (#4079 ) * Remove logically superfluous assertions and order by dimension * Use cblas_sgemm() to implement ggml_compute_forward_out_prod() * Remove ggml_compute_forward_out_prod_use_blas(), fix compiling errors on cmake/zig, remove trailing whitespace * Add openBLAS support for sgemm() in compute_forward_out_prod() b1524	2023-11-17 16:48:19 +02:00
Andrew Godfrey	947f64f163	finetune : zero the loraB initial vectors (#4082 ) * finetune : zero the loraB initial vectors Without this, the first iteration is starting out far from the base model, instead of exactly on it. Zeroing loraB is what the paper recommends. loralib also zeroes at least one of the init vector pairs (though it departs from the paper in using a different distribution for the other vector, in some cases). * tabs to spaces * Use ggml_set_zero instead of adding a new function b1523	2023-11-17 11:23:11 +01:00
Andrew Godfrey	b83e149ec6	cuda : get_row_rounding F32 (#4095 ) * Fix #4017 * Update ggml-cuda.cu Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Update ggml-cuda.cu Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> b1522	2023-11-17 10:01:15 +02:00
Georgi Gerganov	4f447a4833	llama : fix data units (#4101 ) * llama : fix data units ggml-ci * Revert "llama : fix data units" This reverts commit `f5feac831f`. * llama : disambiguate data units ggml-ci b1521	2023-11-17 10:00:15 +02:00
Kerfuffle	91f6499393	Respect tokenizer.ggml.add_bos_token value when tokenizing (#4040 ) * gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time b1520	2023-11-16 19:14:37 -07:00
texmex76	8da46278e1	gguf : fix potential infinite loops while parsing (#4100 ) Co-authored-by: Bernhard Gstrein <gstrein@cs.uni-freiburg.de> b1519	2023-11-16 17:01:48 +02:00
Jared Van Bortel	a6fc554e26	llama : restore prefix space in llama tokenizer (#4081 ) b1518	2023-11-15 11:34:47 -05:00
slaren	1cf2850d52	ggml-cuda : increase max graph size (#4084 ) b1517	2023-11-15 14:58:13 +02:00
Michael Potter	6bb4908a17	Fix MacOS Sonoma model quantization (#4052 ) Co-authored-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b1516	2023-11-14 12:34:41 -05:00
Galunid	36eed0c42c	stablelm : StableLM support (#3586 ) * Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers b1515	2023-11-14 11:17:12 +01:00
afrideva	b46d12f86d	convert.py: also look for plain model.safetensors (#4043 ) * add safetensors to convert.py help message * Check for single-file safetensors model * Update convert.py "model" option help message * revert convert.py help message change	2023-11-13 18:03:40 -07:00
M. Yusuf Sarıgöz	bd90eca237	llava : fix regression for square images in #3613 (#4056 ) b1513	2023-11-13 18:20:52 +03:00
Georgi Gerganov	3d68f364f1	ggml : sync (im2col, GPU conv, 32-bit arm compat) (#4060 ) ggml-ci b1512	2023-11-13 16:55:52 +02:00
Georgi Gerganov	c049b37d7b	readme : update hot topics	2023-11-13 14:18:08 +02:00
Georgi Gerganov	4760e7cc0b	sync : ggml (backend v2) (#3912 ) * sync : ggml (backend v2) (wip) * sync : migrate examples and llama.cpp to dynamic graphs (wip) * sync : update tests + fix max op params to 64 ggml-ci * sync : ggml-cuda ggml-ci * llama : fix save/load state context size ggml-ci * sync : try to fix build on tvOS * sync : pass custom graph sizes in training examples * sync : update graph copies to new ggml API * sync : update sync-ggml.sh with new files * scripts : fix header in sync script * train : fix context size calculations * llama : increase inference graph size up to 4096 nodes * train : allocate grads for backward graphs * train : allocate grads for gb_tmp b1510	2023-11-13 14:16:23 +02:00
Kerfuffle	bb50a792ec	Add ReLU and SQR CUDA ops to (partially) fix Persimmon offloading (#4041 ) * Add ReLU and SQR CUDA ops to fix Persimmon offloading * Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers b1509	2023-11-13 01:58:15 -07:00
Kerfuffle	21fd874c8d	gguf-py: gguf_writer: Use bytearray to build metadata (#4051 ) * gguf-py: gguf_writer: Use BytesIO to build metadata * Use bytearray instead Bump gguf-py package version	2023-11-12 16:39:37 -07:00
Richard Kiss	532dd74e38	Fix some documentation typos/grammar mistakes (#4032 ) * typos * Update examples/parallel/README.md Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com> --------- Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>	2023-11-11 23:04:58 -07:00
M. Yusuf Sarıgöz	e86fc56f75	Fix gguf-convert-endian script (#4037 ) * Fix gguf-convert-endian script * Bump version and update description	2023-11-11 08:35:31 -07:00
Alexey Parfenov	d96ca7ded7	server : fix crash when prompt exceeds context size (#3996 ) b1505	2023-11-10 23:48:21 -06:00

4504 Commits