Commit Graph

598 Commits

2510c1831f Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088)
* Add ggml-model-*.bin checksums for 7B, 13B, 30B
* Add ggml-model-*.bin checksums for 65B

---------

Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-04-20 23:56:44 +02:00
12b5900dbc ggml : sync ggml (add GPT-NeoX RoPE implementation) master-12b5900 2023-04-20 23:32:59 +03:00
9ff334f3c9 ggml : fix bug in ggml_compute_forward_dup_f32() master-9ff334f 2023-04-20 21:58:38 +03:00
2005469ea1 Add Q4_3 support to cuBLAS (#1086) master-2005469 2023-04-20 20:49:53 +02:00
8a1756abdf ggml : do not break cuBLAS build (Q4_3 is not yet implemented) master-8a1756a 2023-04-20 21:43:50 +03:00
66aab46079 ggml : fix Q4_3 quantization
Broke it during conflict resolution in last PR
master-66aab46
2023-04-20 20:44:05 +03:00
38de86a711 llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started warning me that I don't
need to capture it in the lambda, so I removed it from the
capture list. But that made the MSVC build fail, so I made
it a constexpr to keep every compiler happy (sketched below).

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-38de86a
2023-04-20 20:42:27 +03:00
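The constexpr fix above works because of how the three compilers treat constants in lambda captures. A minimal C++ sketch of the disagreement; the `chunk_size` value is illustrative, not the one from the PR:

```cpp
#include <cstdio>

int main() {
    // With `const int chunk_size = ...;`, clang and GCC warn that naming
    // chunk_size in the capture list is unnecessary (its value is a constant
    // usable without capture), while MSVC could fail to build when it was
    // *not* captured. A constexpr variable is usable inside the lambda body
    // without being captured, on all three compilers.
    constexpr int chunk_size = 32 * 512; // illustrative value

    auto worker = []() { // note: empty capture list
        std::printf("processing a chunk of %d elements\n", chunk_size);
    };
    worker();
    return 0;
}
```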
e0305ead3a ggml : add Q4_3 quantization (#1082) master-e0305ea 2023-04-20 20:35:53 +03:00
6a9661ea5a ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.

This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitizer build runs twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
master-6a9661e
2023-04-20 18:15:18 +03:00
5addcb120c fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080) master-5addcb1 2023-04-20 15:28:43 +02:00
c8c2c52482 AVX2 optimization for vec_dot_q4_2_q8_0 (#1068) master-c8c2c52 2023-04-20 08:45:41 +02:00
02d6988121 Improve cuBLAS performance by dequantizing on the GPU (#1065) master-02d6988 2023-04-20 03:14:14 +02:00
834695fe3a Minor: Readme fixed grammar, spelling, and misc updates (#1071) 2023-04-19 19:52:14 +00:00
f7d05095b4 Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) because it is
not multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-f7d0509
2023-04-19 20:20:14 +02:00
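The "rmse-optimized scale" above means choosing each block's scale to minimize reconstruction error rather than simply dividing by the maximum magnitude. A hedged C++ sketch of the idea only; the actual search in this PR is more elaborate, and `best_scale`, the grid width, and the step size are all hypothetical:

```cpp
#include <algorithm>
#include <cmath>

// Pick the block scale that minimizes squared reconstruction error instead
// of just using max(|x|)/8. Tries a small grid of candidate scales around
// the naive one; 4-bit quants are clamped to [-8, 7] as in Q4_2.
float best_scale(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.0f) return 0.0f;

    const float d0 = amax / 8.0f; // naive scale: the max maps to the edge
    float best_d = d0, best_err = HUGE_VALF;

    for (int step = -4; step <= 4; ++step) {
        const float d = d0 * (1.0f + 0.05f * step); // candidate scale
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int q = (int) std::lround(x[i] / d);
            q = std::min(7, std::max(-8, q));  // clamp to 4 bits
            const float r = x[i] - q * d;      // reconstruction error
            err += r * r;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```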
884e7d7a2b ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmlaq_n_f32

56 ms/token with Q4_1!

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
master-884e7d7
2023-04-19 20:10:08 +03:00
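The idea behind the 8-bit intermediate results: quantize the activations once to 8 bits, after which the hot loop of the dot product is pure integer math with a per-block float correction. A scalar C++ reference sketch, with block layouts modeled on ggml's Q4_1 (scale, offset, 16 packed nibbles) and Q8_0 (scale, 32 int8 quants); the function name is hypothetical and the real kernels are SIMD:

```cpp
#include <cstdint>

#define QK 32 // block size, as in ggml

struct block_q4_1 { float d; float m; uint8_t qs[QK/2]; }; // x_i ~ d*q + m, q in [0,15]
struct block_q8_0 { float d; int8_t qs[QK]; };             // y_i ~ d*q, q in [-128,127]

// sum x_i*y_i = sum over blocks of ( d4*d8 * sum q4_i*q8_i + m*d8 * sum q8_i )
float vec_dot_q4_1_q8_0_ref(int nb, const block_q4_1 * x, const block_q8_0 * y) {
    float sumf = 0.0f;
    for (int b = 0; b < nb; ++b) {
        int sumi = 0; // sum of q4*q8 products - integer only
        int sums = 0; // plain sum of the q8 quants, for the offset term
        for (int i = 0; i < QK/2; ++i) {
            const int q4l = x[b].qs[i] & 0x0F; // low nibble: even element
            const int q4h = x[b].qs[i] >> 4;   // high nibble: odd element
            sumi += q4l * y[b].qs[2*i + 0] + q4h * y[b].qs[2*i + 1];
            sums += y[b].qs[2*i + 0] + y[b].qs[2*i + 1];
        }
        sumf += x[b].d * y[b].d * sumi + x[b].m * y[b].d * sums;
    }
    return sumf;
}
```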
7cd5c4a3e9 readme : add warning about Q4_2 and Q4_3 2023-04-19 19:07:54 +03:00
f3d4edf504 ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
master-f3d4edf
2023-04-19 19:06:37 +03:00
8944a13296 Add NVIDIA cuBLAS support (#1044) master-8944a13 2023-04-19 11:22:45 +02:00
6667401238 Multi-threaded ggml_cpy (#1035)
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-6667401
2023-04-19 00:53:24 +02:00
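ggml parallelizes ops like this by handing each thread a contiguous slice of rows. A small C++ sketch of the partitioning arithmetic, assuming ggml's (ith, nth) thread parameters; the helper itself is hypothetical:

```cpp
#include <algorithm>

struct row_range { int ir0, ir1; }; // half-open [ir0, ir1)

// Thread ith of nth takes rows [dr*ith, dr*ith + dr), clamped to nr.
row_range rows_for_thread(int nr, int ith, int nth) {
    const int dr  = (nr + nth - 1) / nth;   // rows per thread, rounded up
    const int ir0 = std::min(dr * ith, nr); // first row for this thread
    const int ir1 = std::min(ir0 + dr, nr); // one past the last row
    return { ir0, ir1 };
}
```

Each thread then copies its own rows independently, so no locking is needed.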
77a73403ca ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
master-77a7340
2023-04-18 23:54:57 +03:00
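vmlaq_n_f32 fuses a multiply-by-scalar with an accumulate, saving an instruction per vector versus a separate vmulq_n_f32 and add. A C++ sketch of the pattern only, not the actual q4_2 kernel (vaddvq_f32 requires AArch64):

```cpp
#if defined(__ARM_NEON)
#include <arm_neon.h>

// Computes d * (x[0] + ... + x[n-1]) four lanes at a time, with n a
// multiple of 4. vmlaq_n_f32(acc, v, s) is acc + v*s in one intrinsic.
float scaled_sum(const float * x, float d, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4) {
        acc = vmlaq_n_f32(acc, vld1q_f32(x + i), d);
    }
    return vaddvq_f32(acc); // horizontal sum of the 4 lanes
}
#endif
```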
50a8a2af97 ggml : scratch that - vmlaq_n_f32 is always better
Had a background process that was messing with the timings
master-50a8a2a
2023-04-18 23:11:23 +03:00
4caebf6d40 gitignore : vdot 2023-04-18 23:00:08 +03:00
dcdd65e296 ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators master-dcdd65e 2023-04-18 22:59:17 +03:00
5ecff35151 Adding a simple program to measure speed of dot products (#1041)
On my Mac, the direct Q4_1 product is marginally slower
(~69 vs ~55 us for Q4_0). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
master-5ecff35
2023-04-18 19:00:14 +00:00
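For timings like the microsecond figures above, a minimal best-of-N harness in the spirit of the vdot program (an illustration, not the tool itself):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

template <typename F>
double best_us(F f, int iters = 100) {
    double best = 1e30;
    for (int i = 0; i < iters; ++i) {
        const auto t0 = std::chrono::high_resolution_clock::now();
        f();
        const auto t1 = std::chrono::high_resolution_clock::now();
        best = std::min(best, std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return best; // best-of-N filters out scheduling noise
}

int main() {
    std::vector<float> x(4096, 1.0f), y(4096, 2.0f);
    volatile float sink = 0.0f; // keep the loop from being optimized away
    const double us = best_us([&] {
        float s = 0.0f;
        for (size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
        sink = s;
    });
    std::printf("dot: %.2f us (best of 100)\n", us);
    return 0;
}
```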
7faa7460f0 readme : update hot topics about new LoRA functionality 2023-04-18 20:10:26 +03:00
5af8e32238 ci : do not run on drafts master-5af8e32 2023-04-18 19:57:06 +03:00
42747220b4 Do not close file after mmap (Windows version) (#1034) master-4274722 2023-04-18 03:15:50 +02:00
e9298af389 readme : add Ruby bindings (#1029) 2023-04-17 22:34:35 +03:00
4ad73137a1 add 4_0 to default outfile namestr dict (#1031)
this came up when trying to convert the gpt4all-lora-unfiltered-quantized.bin file
2023-04-17 20:26:23 +02:00
315a95a4d3 Add LoRA support (#820) master-315a95a 2023-04-17 17:28:55 +02:00
efd05648c8 llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with initialization on first use. This prevents undefined behavior at program startup, for example a crash in a Release build that works in a Debug build

* replaced use of auto with exact type to avoid using -std=c++14

* Made the accessor functions for static maps static const
master-efd0564
2023-04-17 17:41:53 +03:00
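The first-use initialization above is the classic function-local-static pattern: unlike namespace-scope statics, whose construction order across translation units is unspecified, a local static is constructed on first call (thread-safe since C++11). A sketch with illustrative names, not llama.cpp's actual maps:

```cpp
#include <map>
#include <string>

// Instead of `static std::map<int, std::string> k_names = {...};` at
// namespace scope, wrap the object in an accessor. The local static is
// initialized the first time type_names() runs, never before main().
static const std::map<int, std::string> & type_names() {
    static const std::map<int, std::string> k_names = {
        { 0, "F32" },
        { 1, "F16" },
    };
    return k_names;
}
```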
eb17a026fd quantize-stats : fix bug in --type argument master-eb17a02 2023-04-17 17:31:06 +03:00
69b740289f ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c master-69b7402 2023-04-17 16:16:23 +03:00
f266259ad9 Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933) master-f266259 2023-04-17 15:10:57 +02:00
47f61aaa5f Fix: do not close file on mmap (#1017) master-47f61aa 2023-04-16 21:27:38 +02:00
3173a62eb9 stdout : vertically align outputs for better readability master-3173a62 2023-04-16 13:59:27 +03:00
489537e6cf examples: add missing <ctime> include for time() (#1011) master-489537e 2023-04-16 10:13:00 +00:00
2d3481c721 Fix msys2 build error and warnings (#1009) master-2d3481c 2023-04-16 11:13:42 +02:00
74f5899df4 convert.py: Fix loading safetensors and ggml format on Windows (#991)
Calling `mmap.mmap` on Windows apparently resets the file offset of the
raw file object (and makes the BufferedReader return a *negative* file
offset).  For safetensors, avoid using the file offset after calling
mmap.  For GGML format, explicitly save and restore the offset.

Fixes #966.
2023-04-15 23:53:21 +02:00
2f7c8e014e Fix potential int8 overflow in non-SIMD vec_dot (#986) master-2f7c8e0 2023-04-15 18:28:56 +00:00
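The overflow hazard in miniature, assuming the usual quantized-dot shape (an illustration of the class of bug, not the exact line that was fixed): products of two int8 values range over [-16256, 16384], so they must be accumulated in a wider type.

```cpp
#include <cstdint>

int32_t dot_i8(const int8_t * x, const int8_t * y, int n) {
    int32_t sum = 0;                  // 32-bit accumulator, not int8/int16
    for (int i = 0; i < n; ++i) {
        sum += (int32_t) x[i] * y[i]; // widen before multiply-accumulate
    }
    return sum;
}
```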
0ad964631f Refactor ggml.c for future tensor types (#1001) master-0ad9646 2023-04-15 16:25:38 +00:00
e95b6554b4 ggml : add Q8_0 quantization for intermediate results (#951)
* ggml : add Q8_0 quantization for intermediate results

* quantize-stats : fix test + add it to Makefile default

* Q8: use int8_t, AVX/AVX2 optimizations

* ggml : fix quantize_row_q8_0() ARM_NEON rounding

* minor : updates after rebase to latest master

* quantize-stats : delete obsolete strings

* ggml : fix q4_1 dot func

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
master-e95b655
2023-04-15 17:53:22 +03:00
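Q8_0 stores each block of 32 floats as one float scale plus 32 int8 quants, with x_i ≈ d·q_i. A scalar C++ reference modeled on ggml's quantize_row_q8_0 reference path; the signature here is simplified and hypothetical:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

#define QK8 32 // Q8_0 block size

void quantize_row_q8_0_ref(const float * x, float * d_out, int8_t * qs, int k) {
    for (int b = 0; b < k / QK8; ++b) {
        float amax = 0.0f; // largest magnitude in the block
        for (int i = 0; i < QK8; ++i) {
            amax = std::max(amax, std::fabs(x[b*QK8 + i]));
        }
        const float d  = amax / 127.0f; // the max maps to +/-127
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        d_out[b] = d;
        for (int i = 0; i < QK8; ++i) {
            qs[b*QK8 + i] = (int8_t) std::round(x[b*QK8 + i] * id);
        }
    }
}
```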
aa485cee33 ggml : use posix_memalign on non-Windows env master-aa485ce 2023-04-15 14:25:45 +03:00
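A portable aligned-allocation wrapper in the spirit of that commit, hedged: ggml's actual wrapper may differ, and posix_memalign requires alignment to be a power-of-two multiple of sizeof(void*):

```cpp
#include <cstdlib>
#if defined(_WIN32)
#include <malloc.h>
#endif

void * aligned_alloc_portable(size_t alignment, size_t size) {
#if defined(_WIN32)
    return _aligned_malloc(size, alignment); // release with _aligned_free()
#else
    void * ptr = nullptr;
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr; // release with free()
#endif
}
```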
c12b14b77f benchmark : fix result validation in benchmark-q4_0-matmult (#987) master-c12b14b 2023-04-15 08:51:54 +03:00
106faaf297 cmake : add finding the OpenBLAS header file (#992) master-106faaf 2023-04-15 08:51:11 +03:00
c85e03d12e Revert "main : alternative instruct mode (Vicuna support, etc.) (#863)" (#982)
This reverts commit f4d277ae17.
master-c85e03d
2023-04-14 22:58:43 +03:00
489093548c py : bump sentencepiece to 0.1.98 to support Python 3.11 (#976) 2023-04-14 19:46:49 +00:00
93265e988a make : fix dependencies, use auto variables (#983) master-93265e9 2023-04-14 22:39:48 +03:00
c56b715269 Expose type name from ggml (#970)
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
master-c56b715
2023-04-14 20:05:37 +02:00
f4d277ae17 main : alternative instruct mode (Vicuna support, etc.) (#863)
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt

* Add multiline mode, update text input.

* bugfix

* update implementation

* typos

* Change --multiline implementation to be toggled by EOF.

* bugfix

* default multiline mode

* add more configs

* update formatting

* apply suggestions
master-f4d277a
2023-04-14 18:19:17 +03:00