Commit Graph

5076 Commits

SHA1 Message Date
2d43387daf ggml : fix builds, add ggml-quants-k.o (close #1712, close #1710) master-2d43387 2023-06-06 10:18:03 +03:00
7ad7750c5c gitignore : add .clang-tidy 2023-06-06 09:55:25 +03:00
7a74dee6b4 llama : temporary disable Q6_K output quantization (#1711) master-7a74dee 2023-06-06 09:39:38 +03:00
590250f7a9 metal : add checks for buffer size (#1706)
Co-authored-by: Spencer Sutton <Spencer.Sutton@precisely.com>
master-590250f
2023-06-06 06:28:17 +03:00
f4c55d3bd7 docs : add performance troubleshooting + example benchmark documentation (#1674)
* test anchor link

* test table

* add benchmarks

* Add performance troubleshoot & benchmark

* add benchmarks

* remove unneeded line

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-05 23:32:36 +03:00
f1465624c2 readme : fix typo (#1700)
Fix a typo in a command in README.md
2023-06-05 23:28:37 +03:00
c2df36d60d llama : consistently catch and throw only exceptions deriving from std::exception (#1599)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-c2df36d
2023-06-05 23:24:29 +03:00
9d0693bce3 metal : use shared buffers between CPU and GPU (#1696)
* Use MTLDevice.newBufferWithBytesNoCopy to share buffers between CPU and GPU

* Page-align buffers used by Metal

* Remove trailing whitespace

* Only import unistd.h for Metal builds

* metal : remove unnecessary copies

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-9d0693b
2023-06-05 23:24:04 +03:00
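A minimal sketch of the page-alignment step mentioned in the commit above (not the actual ggml-metal code): newBufferWithBytesNoCopy requires page-aligned memory, so buffer sizes are rounded up to a multiple of the page size using sysconf from unistd.h, which the commit imports only for Metal builds.

```cpp
#include <unistd.h>  // sysconf(_SC_PAGESIZE)
#include <cstddef>

// Round a buffer size up to a multiple of the page size, as required for the
// memory wrapped by MTLDevice newBufferWithBytesNoCopy.
static size_t round_up_to_page(size_t n) {
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    return ((n + page - 1) / page) * page;
}
```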
efe0507632 ggml : fix internal overflow in ggml_time_us on Windows (#1702)
Co-authored-by: grahameth <->
master-efe0507
2023-06-05 23:11:49 +03:00
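A hedged illustration of the kind of overflow the commit above fixes (the actual patch may differ): converting timer ticks to microseconds by multiplying the full tick count by 1,000,000 before dividing by the frequency can overflow int64_t, while splitting the conversion into whole seconds plus a remainder keeps the intermediate values small.

```cpp
#include <cstdint>

// Convert high-resolution timer ticks to microseconds without overflowing:
// handle whole seconds separately from the sub-second remainder (< freq).
static int64_t ticks_to_us(int64_t ticks, int64_t freq) {
    const int64_t sec = ticks / freq;
    const int64_t rem = ticks % freq;
    return sec * 1000000 + rem * 1000000 / freq;
}
```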
e7fe66e670 ci : disable auto tidy (#1705) master-e7fe66e 2023-06-05 23:05:05 +03:00
99009e72f8 ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684)
* Starting to add k-quantization to ggml

I think it is better to have quantization separate from
ggml. For now just adding the k-quants there, but it would be
better to also factor out the existing ggml quantizations.

* Adding Q3_K and Q8_K (de)-quantization

* Q3_K now working on CUDA and AVX2/scalar

CUDA is not ideal - ~50% slower than Q4_0 for
single token prediction, about the same in batch
mode (perplexity). CPU single token is ~55 ms
(on Ryzen 7950X).

* Some improvement for Q3_K on CUDA

It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0.

* Some more CUDA optimizations for Q3_K

Single token is now 20.5 ms/token (~20% slower than Q4_0).
Perplexity is on par with Q4_0.

* Adding Q4_K - scalar, AVX2, CUDA

Performance is the same or perhaps very slightly better than Q4_0 on the CPU.
On the GPU, single token prediction is ~10% better than Q4_0,
and batch mode (perplexity) is about the same.

* Adding Q6_K - scalar, AVX2, CUDA

Performance is ~40% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 6-bit model is ~44% larger than the 4-bit.
On the GPU, single token prediction is ~6% lower than Q4_0,
batch mode (perplexity) is even closer (but still slower).

* Adding Q5_K - scalar, AVX2, CUDA

Performance is ~20% lower compared to Q4_K on the CPU.
This is to be expected, considering that we are memory bound
on the CPU and the 5-bit model is ~22% larger than the 4-bit.
On the GPU, performance is about the same as Q4_0 for both
single token and batch prediction.

* Per convention, all QX_K quantizations use Q5_K for output.weight

* Adding quantization mixes

* Quantization mixes: didn't quite get what I wanted in the last commit

* Q4_K dot product for ARM_NEON

* Q6_K dot product for ARM_NEON

* Q5_K dot product for ARM_NEON

* Adding Q3_K dot for ARM_NEON

It is 22% slower than Q4_K, despite the smaller model size.
On x86_64, where we are memory bound, the Q3_K model is
quite a bit faster than Q4_K.

* A very slightly faster ARM_NEON Q3_K dot

* Adding Q2_K - just CUDA for now

Token prediction is pretty good - about 15.5 ms on an RTX 4080.
Perplexity is about the same as Q4_K.

* Adding scalar and AVX2 Q2_K dot

* Adding ARM_NEON Q2_K dot

About the same performance as Q4_K.

* A slightly faster ARM_NEON Q2_K dot

Single token prediction is now ~36 ms on M2 Max.
The code is much simpler too.

* Fixed bug in Q2_K CUDA dot product kernel

Strangely enough, for the few prompts I tried with the 7B model
the responses looked perfectly reasonable. I only realized something
was not quite right when I tried the larger models and started getting
nonsense back.

In any case, Q2_K single token evaluation times on an RTX 4080 in a Ryzen 7950X
box, using CUDA with the model fully loaded on the GPU, are
~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B.
The max number of layers that fit in VRAM for the 65B is 32.
With that, we get ~330 ms per token, which is not that much faster
than just running on the CPU (~470 ms per token).

* Don't print zeros/NaNs when no count histogram has been collected

* A 10% faster CUDA vector dot kernel for Q3_K

Q3_K is now running at ~18.5 ms / token on CUDA,
so the gap to Q4_0 is only 10%.
It seems memory access pattern is more important for
performance than the amount of computation the kernel
does.

* A slightly faster Q4_K AVX2 dot product

For perplexity, where we are less memory bound, time per
pass drops by ~5%. Barely measurable difference for single
token prediction.

* A slightly faster ARM_NEON Q4_K dot product

* Minor

* Fix quantization error test

We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit
quantization variants.

* Fix docker build

I have been sloppy with vector reinterpret casts on ARM_NEON.
It seems clang is very forgiving in that regard.

* Added forgotten ggml.o dependence on k_quants.h to the Makefile

* Had unintentionally committed the Makefile with -Ofast enabled

* ggml : rename k_quants -> ggml-quants-k, use lowercase in code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-99009e7
2023-06-05 22:56:18 +03:00
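A back-of-the-envelope sketch of the memory-bound reasoning used throughout the commit above; the bandwidth and model sizes below are assumed numbers, not taken from the commit. When token generation is limited by memory bandwidth, per-token time is roughly model bytes divided by bandwidth, which is why a 6-bit model that is ~44% larger than the 4-bit one ends up roughly 40% slower on the CPU.

```cpp
#include <cstdio>

int main() {
    const double bandwidth_gb_s = 60.0;      // assumed CPU memory bandwidth
    const double q4_gb = 3.8, q6_gb = 5.5;   // assumed 7B model sizes (~44% apart)
    // per-token time ~= model size / memory bandwidth when memory bound
    printf("Q4_K: ~%.0f ms/token, Q6_K: ~%.0f ms/token\n",
           1000.0 * q4_gb / bandwidth_gb_s,
           1000.0 * q6_gb / bandwidth_gb_s);
    return 0;
}
```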
5220a991a5 Increase 3B scratch buffers. (#1698)
The 128 MB was too optimistic.
Too bad it is not dynamically computed.
master-5220a99
2023-06-05 13:43:08 +03:00
d1f563a743 llama : fix Metal KV cache sync (close #1695) master-d1f563a 2023-06-05 10:19:03 +03:00
827f5eda91 readme : update hot topics 2023-06-04 23:38:19 +03:00
ecb217db4f llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + support scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
master-ecb217d
2023-06-04 23:34:30 +03:00
dcb2ed4826 OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)
* Use events instead of clFinish, where possible

* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel

* Reduce queueing overhead for contiguous tensors by using single mul kernel call

* Adapt to #1612 cl_mem malloc changes

* Reduce code duplication between cuda and opencl branches

* Improve implementation
master-dcb2ed4
2023-06-04 08:12:05 +02:00
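A hedged sketch of the "events instead of clFinish" idea from the commit above (hypothetical handles, not the actual ggml-opencl code): rather than flushing the whole queue after every transfer, commands are chained through cl_event objects so only the real dependency is waited on.

```cpp
#include <CL/cl.h>

// Upload a source buffer and run a kernel on it, synchronizing through an
// event instead of a queue-wide clFinish between the two commands.
static void upload_and_run(cl_command_queue queue, cl_kernel kernel,
                           cl_mem d_src, const void * h_src,
                           size_t nbytes, size_t global_size) {
    cl_event write_done;
    clEnqueueWriteBuffer(queue, d_src, CL_FALSE, 0, nbytes, h_src,
                         0, NULL, &write_done);               // asynchronous upload
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           1, &write_done, NULL);             // waits only on the upload
    clReleaseEvent(write_done);
    // a single clFinish(queue) is still needed before reading results back
}
```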
d8bd0013e8 Add info about CUDA_VISIBLE_DEVICES (#1682) 2023-06-03 16:35:20 +03:00
b5c85468a3 Docker: change to calling convert.py (#1641)
Deprecation disclaimer was added to convert-pth-to-ggml.py
2023-06-03 15:11:53 +03:00
136476e898 Fix prompt cache saving and chat-persistent rollover (#1678)
* Fix prompt cache saving and chat-persistent rollover (fixes #1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
master-136476e
2023-06-03 07:28:45 -04:00
ffb06a345e OpenLLaMA 3B support (#1588)
This adds support to llama.cpp to load the model.

Still missing are the changes required in convert.py to convert the model correctly; it needs to start reading the JSON configuration for HF models instead of deriving the values by guessing.

Co-authored-by: FNsi <125447286+FNsi@users.noreply.github.com>
master-ffb06a3
2023-05-30 21:24:22 +03:00
7552ac5863 ggml : sync cgraph import / export API master-7552ac5 2023-05-29 19:31:44 +03:00
5d1830b99d ggml : fix bug in ggml_alibi 2023-05-29 19:30:49 +03:00
248367605e Workaround for recalculating logits in cached prompts (Fixes #1585) (#1609)
* Workaround for recalculating logits in cached prompts
master-2483676
2023-05-29 05:13:40 -07:00
0e730dd23b Adding git to container package dependencies (#1621)
Git added to the build packages to provide version information in the Docker image

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2023-05-28 21:45:50 -07:00
3b126f654f LLAMA_DEBUG adds debug symbols (#1617) master-3b126f6 2023-05-28 21:01:02 +02:00
1b78ed2081 Only show -ngl option when relevant + other doc/arg handling updates (#1625)
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
master-1b78ed2
2023-05-28 11:48:57 -06:00
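A simplified sketch of the gating described in points 1 and 2 above (the real code lives in llama.h and the common example code; the help text here is approximate): the define is only set for GPU-enabled builds, and the -ngl / --n-gpu-layers help line is printed only when it is set.

```cpp
#include <cstdio>

#if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
#define LLAMA_SUPPORTS_GPU_OFFLOAD
#endif

static void print_gpu_offload_help() {
#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
    // only advertise the option when offloading can actually work
    fprintf(stderr, "  -ngl N, --n-gpu-layers N  number of layers to store in VRAM\n");
#endif
}
```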
337aea1139 examples : add --alias option to gpt_params to set a user-friendly model name (#1614) master-337aea1 2023-05-28 20:14:24 +03:00
bb051d9723 opencl : no need to allocate cl_mem on heap (#1612) master-bb051d9 2023-05-28 20:13:36 +03:00
ca74884f66 opencl : use strstr to check if fp16 supported (#1611)
* Use strstr to check if fp16 supported

* Ensure ext_buffer is null terminated
master-ca74884
2023-05-28 20:09:56 +03:00
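A hedged sketch of the fp16 check described above (not the actual ggml-opencl code): query the device extension string, make sure it is null-terminated, and look for cl_khr_fp16 with strstr.

```cpp
#include <CL/cl.h>
#include <cstring>

static bool device_supports_fp16(cl_device_id device) {
    char ext_buffer[1024] = {0};
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                    sizeof(ext_buffer) - 1, ext_buffer, NULL);
    ext_buffer[sizeof(ext_buffer) - 1] = '\0';   // ensure null termination
    return strstr(ext_buffer, "cl_khr_fp16") != NULL;
}
```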
a6704643b6 ggml : add support for the RISCV architecture (#1616) master-a670464 2023-05-27 23:03:25 +03:00
0df7d63e5b Include server in releases + other build system cleanups (#1610)
Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets built. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases.

Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default)

Fix issue where `vdot` binary wasn't removed when running `make clean`.

Fix compile warnings in `server` example.

Add `.hpp` files to trigger workflow (the server example has one).
master-0df7d63
2023-05-27 11:04:14 -06:00
97c9b77c4f Add documentation about CLBlast (#1604)
Installing, compiling and using.
2023-05-27 18:47:55 +03:00
0ecb1bbbeb [CI] Fix openblas (#1613)
* Fix OpenBLAS build

* Fix `LLAMA_BLAS_VENDOR` CMake variable that should be a string and not a boolean.
master-0ecb1bb
2023-05-27 17:24:06 +03:00
93618031c7 ggml : add ggml_tensor_overhead() master-9361803 2023-05-27 16:19:56 +03:00
83c54e6da5 [CI] CLBlast: Fix directory name (#1606) 2023-05-27 14:18:25 +02:00
bdbda1b17a ggml : sync ggml core (minor additions, e.g. ggml_get_tensor_by_name()) 2023-05-27 12:23:16 +03:00
66874d4fbc Some improvements to loading the session with --prompt-cache (#1550)
Improvements to loading the session with `--prompt-cache` in the `main` example.

1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.
master-66874d4
2023-05-25 20:18:01 -06:00
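A hypothetical sketch of the second change above, with invented names rather than the actual main example code: when the user gives no prompt but a cached session exists, default to the prompt that was cached instead of requiring it to be repeated.

```cpp
#include <vector>

// Pick the prompt tokens for this run: default to the cached session prompt
// when the user did not specify one on the command line.
static std::vector<int> pick_prompt(const std::vector<int> & user_tokens,
                                    const std::vector<int> & session_tokens) {
    if (user_tokens.empty() && !session_tokens.empty()) {
        return session_tokens;
    }
    return user_tokens;
}
```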
1fcdcc28b1 cuda : performance optimizations (#1530)
* xor hack

* block y dim

* loop unrolling

* Fixed cmake LLAMA_CUDA_BY option

* Removed hipblas compatibility code

* Define GGML_CUDA_DMMV_BLOCK_Y if not defined

* Fewer iters, more ops per iter

* Renamed DMMV X/Y compilation options
master-1fcdcc2
2023-05-26 00:07:29 +03:00
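A small sketch of the "define if not defined" change listed above, so the block-y dimension can be overridden at compile time while keeping a default; the default value shown is an assumption, and the option was later renamed.

```cpp
// Allow the block-y dimension to be overridden at compile time,
// e.g. with -DGGML_CUDA_DMMV_BLOCK_Y=2, while providing a default.
#ifndef GGML_CUDA_DMMV_BLOCK_Y
#define GGML_CUDA_DMMV_BLOCK_Y 1   // assumed default value
#endif
```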
ac7876ac20 Update CLBlast to 1.6.0 (#1580)
* Update CLBlast to 1.6.0
master-ac7876a
2023-05-24 10:30:09 +03:00
c31bbe934b readme : add docs for chat-persistent.sh (#1568)
* readme : add docs for chat-persistent.sh

* Update README.md
2023-05-24 09:24:01 +03:00
1359b6aba5 chat-persistent.sh : use bracket expressions in grep (#1564) 2023-05-24 09:16:22 +03:00
7d873811f3 Fix handling of "invalid property" when creating OpenCL command queue (#1565)
The `clCreateCommandQueue()` function will return the code
`CL_INVALID_QUEUE_PROPERTIES` when passed unsupported properties,
not `CL_INVALID_PROPERTY` as the original code was checking for.
master-7d87381
2023-05-23 19:01:15 +03:00
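A hedged sketch of the fix described above (the property requested here is illustrative): check for CL_INVALID_QUEUE_PROPERTIES, not CL_INVALID_PROPERTY, and retry without the optional properties.

```cpp
#include <CL/cl.h>

static cl_command_queue make_queue(cl_context ctx, cl_device_id dev) {
    cl_int err;
    cl_command_queue q = clCreateCommandQueue(ctx, dev,
            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    if (err == CL_INVALID_QUEUE_PROPERTIES) {
        // device does not support the requested properties; retry without them
        q = clCreateCommandQueue(ctx, dev, 0, &err);
    }
    return q;
}
```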
2e6cd4b025 OpenCL Token Generation Acceleration (#1459)
* Move back to C++ for OpenCL

* Refactor OpenCL code to work more like the CUDA code, add missing functions

* Deduplicate dequant kernels

* Add OpenCL compile options

* Use compile args for preprocessing constants

* Restore default platform + device selection by id behavior

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Henri Vasserman <henv@hot.ee>
master-2e6cd4b
2023-05-23 00:33:24 +03:00
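A hedged sketch of "use compile args for preprocessing constants" from the commit above: quantization block sizes are passed to the OpenCL compiler as -D options instead of being hard-coded in the kernel source; the constant name and value here are illustrative.

```cpp
#include <CL/cl.h>

static void build_program(cl_program program, cl_device_id device) {
    // constants injected at program build time rather than edited into the source
    const char * opts = "-cl-std=CL1.2 -DQK4_0=32";
    clBuildProgram(program, 1, &device, opts, NULL, NULL);
}
```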
7e4ea5beff examples : add server example with REST API (#1443)
* Added httplib support

* Added readme for server example

* fixed some bugs

* Fix the build error on Macbook

* changed json11 to nlohmann-json

* removed some whitespaces

* remove trailing whitespace

* added support for custom prompts and more functions

* some corrections and added it as a CMake option
master-7e4ea5b
2023-05-21 20:51:18 +03:00
7780e4f479 make : .PHONY clean (#1553) master-7780e4f 2023-05-21 17:03:44 +03:00
265db9834e ggml : output 3d sizes in ggml_graph_dump_dot() master-265db98 2023-05-21 11:56:23 +03:00
fab49c685e ggml : update WASM SIMD master-fab49c6 2023-05-20 20:00:41 +03:00
b8ee340abe feature : support BLIS and other BLAS implementations (#1536)
* feature: add blis support

* feature: allow any BLA_VENDOR to be assigned via CMake arguments, aligning with whisper.cpp PR 927

* fix: version detection for BLA_SIZEOF_INTEGER; restore the minimum required CMake version

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix: blas changes on ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-b8ee340
2023-05-20 17:58:31 +03:00
9ecb30f959 OpenCL: Fixes for older devices. (#1435)
* Remove `constant`

* Rewrite platform and device selection

* Fix Q8_0
master-9ecb30f
2023-05-20 17:57:39 +03:00
29cf5596fe llama : define magic numbers as integer constants (#1518) (#1520)
The underlying representation of multibyte character literals is
implementation-defined. This could, at least in principle, cause
cross-build data export/import issues independent of endianness.

Define magic numbers as integer literals to be on the safe side.

Signed-off-by: Juuso Alasuutari <juuso.alasuutari@gmail.com>
master-29cf559
2023-05-20 15:58:15 +03:00
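A hedged sketch of what the change above amounts to (the macro name and value are shown for illustration): a multibyte character literal such as 'ggjt' has an implementation-defined value, so the magic is spelled out as an explicit integer instead.

```cpp
// "ggjt" written as an explicit integer rather than a multibyte character
// literal, whose value would be implementation-defined.
#define LLAMA_FILE_MAGIC_GGJT 0x67676a74u
```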