Commit Graph

1195 Commits

Author SHA1 Message Date
a73ccf1aa3 llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)
ggml-ci
master-a73ccf1
2023-08-17 10:47:09 +03:00
7cf54e1f74 tests : adds simple llama grammar tests (#2618)
* adds simple llama grammar tests

* fix lint and add Makefile

* zero-terminate code_points

* avoid dangling pointers in candidate cleanup

* cleanup grammar at end of test
master-7cf54e1
2023-08-17 10:41:01 +03:00
a872a2b28e ggml-alloc : fix discrepancy between measure & eval (#2639)
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.

This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
master-a872a2b
2023-08-17 10:35:53 +03:00
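The allocation rule described in this commit message can be pictured with a small best-fit sketch (hypothetical names, not the actual ggml-alloc code): the smallest free block that can hold the tensor wins, so an oversized measurement block is only chosen when nothing else fits, while a realistically sized final block can win during evaluation.

    #include <stddef.h>

    struct free_block { size_t size; };

    /* Return the index of the smallest free block that can hold the tensor,
     * or -1 if none fits. During measurement the last block is huge, so it is
     * only chosen when nothing else fits; during evaluation it has a normal
     * size and may be chosen instead, placing the tensor in a different region. */
    static int find_best_fit(const struct free_block * blocks, int n_blocks, size_t tensor_size) {
        int    best      = -1;
        size_t best_size = (size_t) -1;
        for (int i = 0; i < n_blocks; i++) {
            if (blocks[i].size >= tensor_size && blocks[i].size < best_size) {
                best      = i;
                best_size = blocks[i].size;
            }
        }
        return best;
    }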
0919a0f73d cmake : install ggml-meta.metal if LLAMA_METAL (#2449) master-0919a0f 2023-08-16 23:09:49 +03:00
ed53db86c3 metal : print error of load pipeline state (#2564)
* metal : print error of load pipeline state

* metal : return null if load pipeline failed
2023-08-16 23:09:03 +03:00
fc8ef549e5 metal : enable ggml-alloc (#2627)
* metal: enable ggml-alloc

Make ggml-alloc work with concurrent dispatch.

* style-fix

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-fc8ef54
2023-08-16 23:08:28 +03:00
bf83bff674 metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel

This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.

* metal: fix performance degradation from gqa

Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

I mixed up ne02 and nb02 in the previous commit.
master-bf83bff
2023-08-16 23:07:04 +03:00
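A hedged sketch of the divide trick mentioned above (variable names are hypothetical, not the kernel's actual identifiers): the quotient is computed with a 32-bit divide and only widened to 64 bits for the final byte offset, which is why a single matrix is limited to roughly 4 GB.

    #include <stdint.h>

    /* Slow path: a 64-bit divide the compiler cannot optimize out. */
    static uint64_t offset_div64(uint64_t row, uint64_t rows_per_group, uint64_t stride_bytes) {
        return (row / rows_per_group) * stride_bytes;
    }

    /* Faster path: do the divide in 32 bits, then widen for the byte offset.
     * Assumes row fits in 32 bits, i.e. the matrix stays under ~4 GB. */
    static uint64_t offset_div32(uint32_t row, uint32_t rows_per_group, uint64_t stride_bytes) {
        uint32_t group = row / rows_per_group;
        return (uint64_t) group * stride_bytes;
    }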
b5ffb2849d scripts : add helper script to get wikitext 2023-08-15 10:05:25 +03:00
3ebb00935f server : add missing /json-schema-to-grammar.mjs (#2616)
fixes #2611
master-3ebb009
2023-08-15 06:14:14 +08:00
d783f7982e metal : return null instead of exit(1) (#2573) master-d783f79 2023-08-14 16:37:39 +03:00
d75561df20 server : add --numa support (#2524) master-d75561d 2023-08-14 16:36:42 +03:00
348acf188c llama : add missing enum keyword in function signatures (#2610) master-348acf1 2023-08-14 16:35:16 +03:00
1cd06fa25e CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596) master-1cd06fa 2023-08-14 10:41:22 +02:00
2feb8934eb server : fix default grammar by using an empty string in the UI (#2604) master-2feb893 2023-08-14 16:20:17 +08:00
5517d6e692 server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588)
* server : implement json-schema-to-grammar.mjs by following the Python impl

* server : add grammar support in chat.mjs

* server : implement grammar param in the UI

* server : generate .hpp

* server : remove trailing whitespaces

* server : generate .hpp

* server : fix sort of prop pairs

* server : optimize regex & iteration
master-5517d6e
2023-08-14 15:16:54 +08:00
f31b539714 Enhance compatibility with Windows 7 and below. (#2592)
* Enhance Windows 7 compatibility.
* Clean away unnecessary preprocessor conditional
master-f31b539
2023-08-13 20:59:16 -07:00
ee77efea2a test : add simple grammar parsing tests (#2594)
* adds simple grammar parsing tests

* adds cassert header
master-ee77efe
2023-08-13 17:00:48 +03:00
f64d44a9b9 CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590) master-f64d44a 2023-08-13 00:24:45 +02:00
b19edd54d5 Adding support for llama2.c models (#2559) master-b19edd5 2023-08-12 01:17:25 +02:00
53dc399472 server: fixed wrong variable name in timing json (#2579)
* server: fixed wrong variable name in timing json

* remove redundant entry
master-53dc399
2023-08-12 00:35:14 +02:00
9ca4abed89 Handle ENABLE_VIRTUAL_TERMINAL_PROCESSING more gracefully on earlier versions of Windows. master-9ca4abe 2023-08-10 13:11:36 -07:00
e59fcb2bc1 Add --n-predict -2 for stopping generation on full context (#2565) master-e59fcb2 2023-08-10 16:28:27 +02:00
1638757767 Fix grammar-based sampling issue in server (#2566) master-1638757 2023-08-10 13:16:38 +03:00
916a9acdd0 ggml-alloc: Don't try to re-use buffers of external tensors (#2562)
* ggml-alloc: Don't try to re-use buffers of external tensors

They might be weights that came from another context, so we
have no control over them (and they might be re-used elsewhere
so writing to them would be a bad idea).

* ggml-alloc: >= when checking for out-of-bounds

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
master-916a9ac
2023-08-09 22:47:42 +02:00
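The rule this PR describes amounts to a simple ownership check; a minimal sketch with hypothetical names (not the actual ggml-alloc fields):

    #include <stdbool.h>

    struct tensor_buf {
        void * data;
        bool   owned_by_allocator; /* false for e.g. weights mapped from another context */
    };

    /* Only buffers the allocator itself owns may be handed out again;
     * external data may be re-used elsewhere, so writing to it is unsafe. */
    static bool can_reuse(const struct tensor_buf * t) {
        return t->owned_by_allocator;
    }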
ea04a4ca19 add log_callback to llama_context_params for custom logging. (#2234)
* add log_callback to llama_context_params for custom logging.

* Fix macro expansion on gcc

* Add struct llama_state for global variables and move log_callback there

* Turn log level into enum and some minor changes.

* Remove model_for_logging parameter (not needed anymore)

* Convert remaining fprintf(stderr, ...) calls to use new macros.

* Fix enum and initialize g_state

* Fix log calls after merge

* Fix missing static

* Add back all the new lines in the logging strings

* Add comment for llama_log_callback and replace remaining printf calls

---------

Co-authored-by: grahameth <->
Co-authored-by: Helmut <helmut.buhler@inf.h-brs.de>
master-ea04a4c
2023-08-09 22:46:40 +02:00
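A rough sketch of the callback-based logging shape this PR describes (simplified; the actual llama.cpp enum names, struct layout, and macro set may differ):

    #include <stdio.h>

    enum log_level { LOG_LEVEL_ERROR, LOG_LEVEL_WARN, LOG_LEVEL_INFO };

    typedef void (*log_callback_t)(enum log_level level, const char * text, void * user_data);

    /* Default callback keeps the old fprintf(stderr, ...) behaviour. */
    static void log_default(enum log_level level, const char * text, void * user_data) {
        (void) level; (void) user_data;
        fputs(text, stderr);
    }

    /* Global state replacing scattered fprintf calls. */
    static struct {
        log_callback_t callback;
        void *         user_data;
    } g_log = { log_default, NULL };

    #define LOG_INFO(text) g_log.callback(LOG_LEVEL_INFO, (text), g_log.user_data)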
25d43e0eb5 CUDA: tuned mul_mat_q kernels (#2546) master-25d43e0 2023-08-09 09:42:34 +02:00
f5bfea0580 Allow passing grammar to completion endpoint (#2532)
* Allow passing grammar to completion endpoint
master-f5bfea0
2023-08-08 16:29:19 +03:00
acfc5478ff CUDA: tighter VRAM scratch size for 65b/70b (#2551) master-acfc547 2023-08-08 14:38:16 +02:00
7ed8d1fe7f llm.vim : multiline autocompletion, get rid of "^@" (#2543) 2023-08-08 15:07:02 +03:00
e7f94d6fdc vim : bring back simple llm.vim example 2023-08-08 15:06:18 +03:00
2d7baaf50f vim : streaming and more (#2495)
* Update Vim plugin

* Remove getbufoneline usage, add input bind example.

getbufoneline() appears to be a recently added function and has been
replaced with getbufline for compatibility.

An additional example that explains how to add a keybind that works in
insert mode was added.
2023-08-08 14:44:48 +03:00
f3c3b4b167 Add --rope-scale parameter (#2544)
* common.cpp : Add --rope-scale parameter
* README.md : Add info about using linear rope scaling
master-f3c3b4b
2023-08-07 19:07:19 +02:00
93356bdb7a ggml : mul mat tweaks (#2372)
* ggml : mul mat wip

ggml-ci

* ggml : alternative thread distribution for mul_mat

ggml-ci

* ggml : mul_mat block tiling attempt

* ggml : mul_mat threads yield

ggml-ci
master-93356bd
2023-08-07 14:25:58 +03:00
60baff7c85 ggml : pad result of ggml_nbytes() master-60baff7 2023-08-07 14:24:42 +03:00
9082b5dfbf ggml : change params pointer (style change) (#2539)
ggml-ci
master-9082b5d
2023-08-07 13:55:18 +03:00
99d29c0094 ggml : sync (custom ops) (#2537)
ggml-ci
master-99d29c0
2023-08-07 13:20:09 +03:00
3d9a551816 Fixed mmap prefetch for GPU offloading (#2529) master-3d9a551 2023-08-07 10:09:40 +02:00
f6f9896ac3 metal : fix out-of-bounds access + inc concurrency nodes (#2416)
* metal : fix out-of-bounds access + style changes

* metal : increase concurrency nodes to 2*GGML_MAX_NODES
2023-08-07 10:52:57 +03:00
34a14b28ff [Makefile] Move ARM CFLAGS before compilation (#2536) master-34a14b2 2023-08-07 09:21:46 +03:00
7297128db8 [Zig] Rewrite build for Zig 0.11 (#2514)
* zig build fixes

* Disable LTO on Windows.
2023-08-07 08:35:53 +03:00
86c3219895 console : fix issue related to Windows 11 PowerShell console mode persistence (#2521) master-86c3219 2023-08-06 09:49:34 +03:00
2e8265ae17 convert.py : add missing abstract methods for quantized data (#2491) 2023-08-06 09:34:05 +03:00
f514d1b306 CUDA: faster k-quant mul_mat_q kernels (#2525) master-f514d1b 2023-08-05 18:20:44 +02:00
332311234a fix firefox autoscroll (#2519) master-3323112 2023-08-04 22:16:11 +02:00
182af739c4 server: regenerate completion.js.hpp (#2515) master-182af73 2023-08-04 21:00:57 +02:00
4329d1acb0 CUDA: use min compute capability of GPUs actually used (#2506) master-4329d1a 2023-08-04 17:35:22 +02:00
02f9d96a86 CUDA: check if event is NULL before cudaStreamWaitEvent (#2505)
Fixes #2503
master-02f9d96
2023-08-04 17:34:32 +02:00
3498588e0f Add --simple-io option for subprocesses and break out console.h and cpp (#1558) master-3498588 2023-08-04 08:20:12 -07:00
5f631c2679 Fixing race condition in server and partial stream handling in frontend. (#2391)
* Fixing race condition in server.cpp and partial stream handling in completion.js

* Reverting assert edits.

* Adding newline to eof
master-5f631c2
2023-08-04 13:37:24 +02:00
415e99fec2 Stream save llama context data to file instead of allocating entire buffer upfront (#2488)
* added streaming of context data to file to avoid allocating unnecessary amounts of memory

* generalised copying state data to file or buffer

* added comments explaining how copy_state_data works

* fixed trailing whitespaces

* fixed save load state example

* updated save load state to use public function in llama.cpp

* - restored the llama_copy_state_data API (fixing the earlier breakage)
- moved new logic for copying llama state data to internal function

* fixed function declaration order

* restored save load state example

* fixed whitespace

* removed unused llama-util.h include

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* Apply code review suggestions

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
master-415e99f
2023-08-04 13:29:52 +02:00
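The generalisation this PR describes can be pictured as a small writer abstraction (hypothetical names, not the actual llama.cpp internals): one write routine that either copies into a caller buffer or streams straight to a file, so the full state never has to sit in memory at once.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    struct state_writer {
        uint8_t * buf;    /* non-NULL: copy into this buffer */
        FILE    * file;   /* non-NULL: stream to this file   */
        size_t    offset; /* bytes written so far            */
    };

    static void state_write(struct state_writer * w, const void * src, size_t size) {
        if (w->buf != NULL) {
            memcpy(w->buf + w->offset, src, size);
        } else if (w->file != NULL) {
            fwrite(src, 1, size, w->file);
        }
        /* with buf == NULL and file == NULL this just measures the total size */
        w->offset += size;
    }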