llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-14 04:17:53 -04:00

Author	SHA1	Message	Date
Xuan Son Nguyen	cda0e4b648	llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745 ) * refactor llama_batch_get_one * adapt all examples * fix simple.cpp * fix llama_bench * fix * fix context shifting * free batch before return * use common_batch_add, reuse llama_batch in loop * null terminated seq_id list * fix save-load-state example * fix perplexity * correct token pos in llama_batch_allocr b3943	2024-10-18 23:18:01 +02:00
Radoslav Gerganov	afd9909a64	rpc : backend refactoring (#9912 ) * rpc : refactor backend Use structs for RPC request/response messages * rpc : refactor server b3942	2024-10-18 14:33:58 +03:00
Ouadie EL FAROUKI	87421a23e8	[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705 ) * implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp b3941	2024-10-18 06:46:16 +01:00
Ma Mingfei	60ce97c9d8	add amx kernel for gemm (#8998 ) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend b3940	2024-10-18 13:34:36 +08:00
Georgi Gerganov	8901755ba3	server : add n_indent parameter for line indentation requirement (#9929 ) ggml-ci b3939	2024-10-18 07:32:19 +03:00
Daniel Bevenius	6f55bccbb8	llama : rename batch_all to batch (#8881 ) This commit addresses the TODO in the code to rename the `batch_all` parameter to `batch` in `llama_decode_internal`. b3938	2024-10-18 01:41:51 +02:00
Georgi Gerganov	17bb928080	readme : remove --memory-f32 references (#9925 ) b3937	2024-10-17 23:43:05 +03:00
Georgi Gerganov	9f45fc1e99	llama : change warning to debug log b3936	2024-10-17 23:27:42 +03:00
Georgi Gerganov	99bd4ac28c	llama : infill sampling handle very long tokens (#9924 ) * llama : infill sampling handle very long tokens ggml-ci * cont : better indices ggml-ci b3935	2024-10-17 22:32:47 +03:00
Tim Wang	3752217ed5	readme : update bindings list (#9918 ) Co-authored-by: Tim Wang <tim.wang@ing.com>	2024-10-17 09:57:14 +03:00
Diego Devesa	f010b77a37	vulkan : add backend registry / device interfaces (#9721 ) * vulkan : add backend registry / device interfaces * llama : print devices used on model load b3933	2024-10-17 02:46:58 +02:00
Gilad S.	2194200278	fix: allocating CPU buffer with size `0` (#9917 ) b3932	2024-10-17 01:34:22 +02:00
Gilad S.	73afe681aa	fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875 ) * fix: use `vm_allocate` to allocate CPU backend buffer on macOS * fix: switch to `posix_memalign` to keep existing `free()` usages work * feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS * style: formatting * fix: move const outside of `#ifndef` * style: formatting * fix: unused var * fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h` * fix: unused var * fix: page align to `GGUF_DEFAULT_ALIGNMENT` * fix: page align to `TENSOR_ALIGNMENT` * fix: convert `TENSOR_ALIGNMENT` to a macro * fix: increase page size to `32` on iOS * fix: iOS page size * fix: `hbw_posix_memalign` alignment b3931	2024-10-17 00:36:51 +02:00
Daniel Bevenius	9e04102448	llama : suppress conversion from 'size_t' to 'int' (#9046 ) * llama : suppress conversion from 'size_t' to 'int' This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC: ```console src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data ``` This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX. The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows. * squash! llama : suppress conversion from 'size_t' to 'int' Move cast into for loop. b3930	2024-10-16 20:34:28 +03:00
Daniel Bevenius	dbf18e4de9	llava : fix typo in error message [no ci] (#9884 )	2024-10-16 20:24:05 +03:00
Joe Eli McIlvain	66c2c93082	grammar : fix JSON Schema for string regex with top-level alt. (#9903 ) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string: ``` thing ::= "\"" "A" | "B" | "C" | "D" "\"" space ``` Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both). This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct): ``` thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space ``` b3928	2024-10-16 19:03:24 +03:00
Molly Sophia	10433e8b45	llama : add tensor name for "result_norm" (#9907 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com> b3927	2024-10-16 13:10:21 +03:00
Alexey Parfenov	1f66b699c4	server : fix the disappearance of the end of the text (#9867 ) * server: fix the disappearance of the end of the text when streaming with stop strings * simplify "send text" checks b3926	2024-10-16 11:35:53 +03:00
Georgi Gerganov	0e41b300ed	sync : ggml b3925	2024-10-16 11:28:14 +03:00
Daniel Bevenius	cd60b88bf7	ggml-alloc : remove buffer_id from leaf_alloc (ggml/987) This commit removes the buffer_id field from the leaf_alloc struct. The motivation is that this field is only written to and never read/used as far as I can tell. Each tensor_alloc has a buffer_id field and this is what caused me to look into this more closely, to understand what the buffer_id in leaf_alloc was used for.	2024-10-16 11:28:01 +03:00
leo-pony	becfd387f6	[CANN] Fix cann compilation error (#9891 ) Fix cann compilation error after merging llama.cpp supports dynamically loadable backends. b3923	2024-10-16 08:51:46 +08:00
Georgi Gerganov	755a9b2bf0	llama : add infill sampler (#9896 ) ggml-ci b3922	2024-10-15 16:35:33 +03:00
Georgi Gerganov	223c25a72f	server : improve infill context reuse (#9894 ) ggml-ci b3921	2024-10-15 16:28:55 +03:00
MaggotHATE	fbc98b748e	sampling : add XTC sampler (#9742 ) * Initial XTC commit Adds XTC sampler, not activated by default, but recommended settings by default. * Cleanup * Simplified chances calculation To be more inline with the original implementation, chance is calculated once at the beginning. * First fixes by comments Still need to look into sorting * Fixed trailing backspaces * Fixed RNG to be reproducible Thanks to @slaren for directions * Fixed forgotten header * Moved `min_keep` Moved from conditions to a simple check at the end. * Fixed broken randomization Thanks to @slaren for explanation * Swapped sorting for a custom algorithm Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable. * Algorithm rework 1. Scan token from top till the first non-penalizable 2. Remove the last captured token (the least probable above threshold) 3. Shift all tokens to override the remaining penalizable 4. Penalize and put them at the bottom. * Added XTC to `test-sampling` * Simplified algorithm and more tests * Updated info in common and args * Merged back lost commits in common and arg * Update dump info in common * Fixed incorrect min_keep check * Added XTC to README * Renamed parameters, fixed info and defaults * probability is at 0 by default, but XTC is included in sampling queue * threshold higher than 0.5 switches XTC off * Initial server support * Added XTC to server UIs * Fixed labels in old server UI * Made algorithm safer and more readable * Removed xtc_threshold_max * Fixed arg after update * Quick fixes by comments * Simplified algorithm since threshold_max is removed * Renamed random distribution * Fixed tests and outdated README * Small fixes b3920	2024-10-15 12:54:55 +02:00
Georgi Gerganov	dcdd535302	server : update preact (#9895 )	2024-10-15 12:48:44 +03:00
Michał Tuszyński	4c42f93b22	readme : update bindings list (#9889 )	2024-10-15 11:20:34 +03:00
VoidIsVoid	a89f75e1b7	server : handle "logprobs" field with false value (#9871 ) Co-authored-by: Gimling <huangjl@ruyi.ai> b3917	2024-10-14 10:04:36 +03:00
agray3	13dca2a54a	Vectorize load instructions in dmmv f16 CUDA kernel (#9816 ) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b3916	2024-10-14 02:49:08 +02:00
Georgi Gerganov	d4c19c0f5c	server : accept extra_context for the infill endpoint (#9874 ) * server : accept extra_context for the infill endpoint ggml-ci * server : update readme [no ci] * server : use repo-level FIM pattern if possible ggml-ci	2024-10-13 21:31:35 +03:00
Georgi Gerganov	c7181bd294	server : reuse cached context chunks (#9866 ) ggml-ci b3914	2024-10-13 18:52:48 +03:00
Georgi Gerganov	92be9f1216	flake.lock: Update (#9870 ) Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04) → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-10-12 20:11:26 -07:00
Georgi Gerganov	edc265661c	server : add option to time limit the generation phase (#9865 ) ggml-ci b3912	2024-10-12 16:14:27 +03:00
Georgi Gerganov	1bde94dd02	server : remove self-extend features (#9860 ) * server : remove self-extend ggml-ci * server : fix context limit check to use slot.n_past ggml-ci b3911	2024-10-12 16:06:31 +03:00
Georgi Gerganov	95c76e8e92	server : remove legacy system_prompt feature (#9857 ) * server : remove legacy system_prompt feature ggml-ci * readme : update [no ci] * server : fix non-transformer logic + remove response from /props	2024-10-12 14:51:54 +03:00
Georgi Gerganov	11ac9800af	llama : improve infill support and special token detection (#9798 ) * llama : improve infill support ggml-ci * llama : add more FIM token strings ggml-ci * server : update prompt on slot restore (#9800) * gguf : deprecate old FIM token KVs b3909	2024-10-12 08:21:51 +03:00
R0CKSTAR	943d20b411	musa : update doc (#9856 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-12 08:09:53 +03:00
Diego Devesa	96776405a1	ggml : move more prints to the ggml log system (#9839 ) * ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print b3907	2024-10-11 15:34:45 +02:00
Diego Devesa	7eee341bee	common : use common_ prefix for common library functions (#9805 ) * common : use common_ prefix for common library functions --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b3906	2024-10-10 22:57:42 +02:00
Diego Devesa	0e9f760eb1	rpc : add backend registry / device interfaces (#9812 ) * rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server b3905	2024-10-10 20:14:55 +02:00
R0CKSTAR	cf8e0a3bb9	musa: add docker image support (#9685 ) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> b3904	2024-10-10 20:10:37 +02:00
Diego Devesa	c7499c557c	examples : do not use common library in simple example (#9803 ) * examples : do not use common library in simple example * add command line parser, simplify code b3903	2024-10-10 19:50:49 +02:00
Diego Devesa	c81f3bbb05	cmake : do not build common library by default when standalone (#9804 ) b3902	2024-10-09 18:49:52 +02:00
Georgi Gerganov	e7022064ab	perplexity : fix integer overflow (#9783 ) * perplexity : fix integer overflow ggml-ci * perplexity : keep n_vocab as int and make appropriate casts ggml-ci b3901	2024-10-09 17:00:18 +03:00
Georgi Gerganov	3dc48fe75a	examples : remove llama.vim An updated version will be added in #9787	2024-10-09 10:55:42 +03:00
Diego Devesa	dca1d4b58a	ggml : fix BLAS with unsupported types (#9775 ) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it b3899	2024-10-08 14:21:43 +02:00
Xuan Son Nguyen	458367a906	server : better security control for public deployments (#9776 ) * server : more explicit endpoint access settings * protect /props endpoint * fix tests * update server docs * fix typo * fix tests b3898	2024-10-08 13:27:04 +02:00
standby24x7	fa42aa6d89	scripts : fix spelling typo in messages and comments (#9782 ) Signed-off-by: Masanari Iida <standby24x7@gmail.com>	2024-10-08 09:19:53 +03:00
Diego Devesa	6374743747	ggml : add backend registry / device interfaces to BLAS backend (#9752 ) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers b3896	2024-10-07 21:55:08 +02:00
Andrew Minh Nguyen	f1af42fa8c	Update building for Android (#9672 ) * docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android b3895	2024-10-07 09:37:31 -07:00
Georgi Gerganov	6279dac039	flake.lock: Update (#9753 ) Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/bcef6817a8b2aa20a5a6dbb19b43e63c5bf8619a?narHash=sha256-HO4zgY0ekfwO5bX0QH/3kJ/h4KvUDFZg8YpkNwIbg1U%3D' (2024-09-12) → 'github:hercules-ci/flake-parts/3d04084d54bedc3d6b8b736c70ef449225c361b1?narHash=sha256-K5ZLCyfO/Zj9mPFldf3iwS6oZStJcU4tSpiXTMYaaL0%3D' (2024-10-01) • Updated input 'flake-parts/nixpkgs-lib': '`356624c120`.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01) → '`fb192fec7c`.tar.gz?narHash=sha256-0xHYkMkeLVQAMa7gvkddbPqpxph%2BhDzdu1XdGPJR%2BOs%3D' (2024-10-01) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26) → 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2024-10-07 09:35:42 -07:00
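
The grammar fix described in commit 66c2c93082 (#9903) comes down to operator precedence: alternation binds more loosely than concatenation, so without parentheses the quotes attach only to the first and last alternatives. The sketch below illustrates this with Python regular expressions, which follow the same precedence rule. It is an analogy, not the converter code from the repository; the names `UNGROUPED`, `GROUPED`, and `accepts` are hypothetical.

```python
import re

# Analogue of the rule generated before the fix. Alternation binds loosest,
# so this reads as: ("A) or (B) or (C) or (D").
UNGROUPED = re.compile(r'"A|B|C|D"')

# Analogue of the rule generated after the fix. The parentheses keep both
# quotes outside the alternation.
GROUPED = re.compile(r'"(A|B|C|D)"')


def accepts(rule, text):
    """Return True when the whole text matches the rule."""
    return rule.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ['"A"', '"A', 'B', 'D"', '"C"']:
        print(repr(candidate),
              "ungrouped:", accepts(UNGROUPED, candidate),
              "grouped:", accepts(GROUPED, candidate))
```

Under the ungrouped form no properly quoted string is accepted, which matches the commit's observation that the output always lacked the starting quote, the ending quote, or both.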


4143 Commits
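
The XTC sampler notes in commit fbc98b748e (#9742) describe the idea in steps: scan from the most probable token while tokens stay at or above the threshold, keep the least probable of those, and drop the ones above it. A minimal Python sketch of that idea follows, assuming the candidate list is already sorted by descending probability. It is not the repository's C++ implementation; the function name `xtc_filter` and its parameters are hypothetical, and it omits `min_keep` handling and renormalization.

```python
import random


def xtc_filter(candidates, threshold, probability, rng):
    """Drop the top choices, keeping the least probable one above threshold.

    candidates: list of (token, prob) pairs sorted by descending prob.
    Returns a new list and leaves the input untouched.
    """
    # Per the commit notes: probability 0 disables the sampler, and a
    # threshold above 0.5 switches it off (at most one token can exceed 0.5).
    if probability <= 0.0 or threshold > 0.5 or len(candidates) < 2:
        return list(candidates)

    # The chance is evaluated once, before any scanning.
    if rng.random() >= probability:
        return list(candidates)

    # Scan from the top until the first token below the threshold.
    last = -1
    for i, (_, p) in enumerate(candidates):
        if p >= threshold:
            last = i
        else:
            break

    # Fewer than two tokens above the threshold: nothing to exclude.
    if last < 1:
        return list(candidates)

    # Keep the least probable token above the threshold and everything after.
    return list(candidates[last:])


if __name__ == "__main__":
    cands = [("a", 0.50), ("b", 0.30), ("c", 0.15), ("d", 0.05)]
    print(xtc_filter(cands, threshold=0.1, probability=1.0,
                     rng=random.Random(0)))
```

Passing the random generator in explicitly keeps the sketch reproducible, in the spirit of the "Fixed RNG to be reproducible" note in the same commit.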