Commit Graph

5076 Commits

SHA1 Message Date
dcdd65e296 ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators master-dcdd65e 2023-04-18 22:59:17 +03:00
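A minimal scalar C++ sketch of the accumulator idea (illustrative only, not the actual ggml kernel): splitting the sum across independent accumulators breaks the loop-carried dependency chain, so the FPU can overlap the multiply-adds; the same pattern extends to SIMD registers.

```cpp
// Illustrative sketch, not ggml code: two independent accumulators.
// Assumes n is even; x and y are hypothetical dequantized inputs.
float dot2(const float * x, const float * y, int n) {
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        sum0 += x[i + 0] * y[i + 0]; // dependency chain 0
        sum1 += x[i + 1] * y[i + 1]; // chain 1 overlaps with chain 0
    }
    return sum0 + sum1; // combine once at the end
}
```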
5ecff35151 Adding a simple program to measure speed of dot products (#1041)
On my Mac, the direct Q4_1 product is marginally slower than the
direct Q4_0 product (~69 vs ~55 us). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
master-5ecff35
2023-04-18 19:00:14 +00:00
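A hedged sketch of how such a per-call timing can be measured (hypothetical harness, not the benchmark program added in #1041):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical harness in the spirit of #1041: time a dot product over
// many repetitions and report the average cost per call in microseconds.
static float dot(const float * x, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

int main() {
    const int n = 4096, reps = 10000;
    std::vector<float> x(n, 1.0f), y(n, 0.5f);
    float sink = 0.0f; // keep results live so the loop is not optimized away
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < reps; ++r) sink += dot(x.data(), y.data(), n);
    const auto t1 = std::chrono::high_resolution_clock::now();
    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("%.3f us/call (sink=%f)\n", us / reps, (double) sink);
    return 0;
}
```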
7faa7460f0 readme : update hot topics about new LoRA functionality 2023-04-18 20:10:26 +03:00
5af8e32238 ci : do not run on drafts master-5af8e32 2023-04-18 19:57:06 +03:00
42747220b4 Do not close file after mmap (Windows version) (#1034) master-4274722 2023-04-18 03:15:50 +02:00
e9298af389 readme : add Ruby bindings (#1029) 2023-04-17 22:34:35 +03:00
4ad73137a1 add 4_0 to default outfile namestr dict (#1031)
This came up when trying to convert the gpt4all-lora-unfiltered-quantized.bin file
2023-04-17 20:26:23 +02:00
315a95a4d3 Add LoRA support (#820) master-315a95a 2023-04-17 17:28:55 +02:00
efd05648c8 llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with initialization on first use. This prevents undefined behavior at program startup (for example, a crash in the Release build that did not occur in the Debug build)

* Replaced use of auto with the exact type to avoid requiring -std=c++14

* Made the accessor functions for the static maps static const
master-efd0564
2023-04-17 17:41:53 +03:00
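The pattern described in the first bullet, as a minimal C++ sketch (names illustrative, not the actual llama.cpp code): a function-local static is constructed on first call, which gives a well-defined initialization order, unlike namespace-scope statics in different translation units.

```cpp
#include <map>
#include <string>

// Before (problematic): a namespace-scope static of class type, whose
// construction order across translation units is undefined.
//   static std::map<int, std::string> g_type_names = { {0, "f32"}, {1, "f16"} };

// After: construct on first use inside a static const accessor.
static const std::map<int, std::string> & type_names() {
    static const std::map<int, std::string> names = {
        {0, "f32"},
        {1, "f16"},
    };
    return names; // initialized exactly once, on the first call
}
```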
eb17a026fd quantize-stats : fix bug in --type argument master-eb17a02 2023-04-17 17:31:06 +03:00
69b740289f ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c master-69b7402 2023-04-17 16:16:23 +03:00
f266259ad9 Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933) master-f266259 2023-04-17 15:10:57 +02:00
47f61aaa5f Fix: do not close file on mmap (#1017) master-47f61aa 2023-04-16 21:27:38 +02:00
3173a62eb9 stdout : vertical align outputs for better readability master-3173a62 2023-04-16 13:59:27 +03:00
489537e6cf examples: add missing <ctime> include for time() (#1011) master-489537e 2023-04-16 10:13:00 +00:00
2d3481c721 Fix msys2 build error and warnings (#1009) master-2d3481c 2023-04-16 11:13:42 +02:00
74f5899df4 convert.py: Fix loading safetensors and ggml format on Windows (#991)
Calling `mmap.mmap` on Windows apparently resets the file offset of the
raw file object (and makes the BufferedReader return a *negative* file
offset).  For safetensors, avoid using the file offset after calling
mmap.  For GGML format, explicitly save and restore the offset.

Fixes #966.
2023-04-15 23:53:21 +02:00
2f7c8e014e Fix potential int8 overflow in non-SIMD vec_dot (#986) master-2f7c8e0 2023-04-15 18:28:56 +00:00
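The class of bug fixed here, sketched in C-style C++ (illustrative, not the actual patch): products of int8 values must be accumulated in a wider type, since 127 * 127 = 16129 already exceeds the int8 range.

```cpp
#include <stdint.h>

// Sketch: widen to int before multiplying and accumulating, so that
// partial products (up to 127 * 127) never pass through an int8 value.
int vec_dot_i8(const int8_t * x, const int8_t * y, int n) {
    int sum = 0; // 32-bit accumulator
    for (int i = 0; i < n; ++i) {
        sum += (int) x[i] * (int) y[i];
    }
    return sum;
}
```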
0ad964631f Refactor ggml.c for future tensor types (#1001) master-0ad9646 2023-04-15 16:25:38 +00:00
e95b6554b4 ggml : add Q8_0 quantization for intermediate results (#951)
* ggml : add Q8_0 quantization for intermediate results

* quantize-stats : fix test + add it to Makefile default

* Q8: use int8_t, AVX/AVX2 optimizations

* ggml : fix quantize_row_q8_0() ARM_NEON rounding

* minor : updates after rebase to latest master

* quantize-stats : delete obsolete strings

* ggml : fix q4_1 dot func

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
master-e95b655
2023-04-15 17:53:22 +03:00
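Q8_0 stores each block of values as one f32 scale plus signed bytes. A reference-style sketch, assuming the block layout of the time (QK = 32, a single float delta per block); the real ggml routines layer AVX/NEON paths on top:

```cpp
#include <math.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float  d;      // scale ("delta") for the block
    int8_t qs[QK]; // quantized values
} block_q8_0;

// Reference-style sketch: map each block of QK floats to int8 in [-127, 127].
void quantize_row_q8_0_ref(const float * x, block_q8_0 * y, int k) {
    for (int i = 0; i < k / QK; ++i) {
        float amax = 0.0f; // absolute max within the block
        for (int j = 0; j < QK; ++j) {
            amax = fmaxf(amax, fabsf(x[i*QK + j]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK; ++j) {
            y[i].qs[j] = (int8_t) roundf(x[i*QK + j] * id);
        }
    }
}
```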
aa485cee33 ggml : use posix_memalign on non-Windows env master-aa485ce 2023-04-15 14:25:45 +03:00
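A sketch of the portable aligned-allocation dispatch that this entry and the GGML_ALIGNED_MALLOC entry below (#884) describe; the helper name and exact platform guards are illustrative:

```cpp
#include <stdlib.h>

#define GGML_MEM_ALIGN 16

#if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h>
#define GGML_ALIGNED_MALLOC(size) _aligned_malloc(size, GGML_MEM_ALIGN)
#define GGML_ALIGNED_FREE(ptr)    _aligned_free(ptr)
#else
// posix_memalign reports failure via its return value, not errno
static void * ggml_aligned_malloc(size_t size) {
    void * ptr = NULL;
    if (posix_memalign(&ptr, GGML_MEM_ALIGN, size) != 0) {
        return NULL;
    }
    return ptr;
}
#define GGML_ALIGNED_MALLOC(size) ggml_aligned_malloc(size)
#define GGML_ALIGNED_FREE(ptr)    free(ptr)
#endif
```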
c12b14b77f benchmark : fix result validation in benchmark-q4_0-matmult (#987) master-c12b14b 2023-04-15 08:51:54 +03:00
106faaf297 cmake : add finding the OpenBLAS header file (#992) master-106faaf 2023-04-15 08:51:11 +03:00
c85e03d12e Revert "main : alternative instruct mode (Vicuna support, etc.) (#863)" (#982)
This reverts commit f4d277ae17.
master-c85e03d
2023-04-14 22:58:43 +03:00
489093548c py : bump sentencepiece to 0.1.98 to support Python 3.11 (#976) 2023-04-14 19:46:49 +00:00
93265e988a make : fix dependencies, use auto variables (#983) master-93265e9 2023-04-14 22:39:48 +03:00
c56b715269 Expose type name from ggml (#970)
Avoid duplication of type names in utils

Co-authored-by: Håkon H. Hitland <haakon@likedan.net>
master-c56b715
2023-04-14 20:05:37 +02:00
f4d277ae17 main : alternative instruct mode (Vicuna support, etc.) (#863)
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt

* Add multiline mode, update text input.

* bugfix

* update implementation

* typos

* Change --multiline implementation to be toggled by EOF.

* bugfix

* default multiline mode

* add more configs

* update formating

* update formatting

* apply suggestions
master-f4d277a
2023-04-14 18:19:17 +03:00
c9a59b70a5 ggml : add unary and binary map operations (#874)
* GGML map ops proof of concept.

* Various cleanups.

Add handling for task setting.

Add handling for ggml_compute_backward.

Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32

Fix compiler warnings related to casting function pointers and `void *`

Reorder functions and definitions based on the GGML op number.

Use typedefs for map op function pointer types.

* Fix position of map ops cases in ggml_compute_forward
master-c9a59b7
2023-04-14 17:43:55 +03:00
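A minimal usage sketch of the map-op API named in this message (callback signature per the typedef/rename bullets above; hedged, details may differ from the merged code):

```cpp
// Element-wise absolute value via the unary map op; the callback takes
// (n, dst, src) per the typedef'd function-pointer type.
static void my_abs(const int n, float * dst, const float * src) {
    for (int i = 0; i < n; ++i) {
        dst[i] = src[i] < 0.0f ? -src[i] : src[i];
    }
}

// ...during graph construction (ctx and a assumed to exist):
// struct ggml_tensor * b = ggml_map_unary_f32(ctx, a, my_abs);
```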
a32f7acc9f py : cleanup dependencies (#962)
After #545 we no longer need torch, tqdm and requests in the dependencies
2023-04-14 15:37:11 +02:00
43ffdefb74 py : fix flake8 and isort nitpicks (#960) 2023-04-14 14:23:21 +02:00
1623a6e9b4 ggml : minor master-1623a6e 2023-04-14 13:31:29 +03:00
c14e0d2f23 ggml : always allocate buffers with size multiple of GGML_MEM_ALIGN 2023-04-14 13:31:15 +03:00
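Rounding a requested size up to the next multiple of the alignment is a one-line idiom; a sketch, assuming GGML_MEM_ALIGN is a power of two:

```cpp
#include <stddef.h>

#define GGML_MEM_ALIGN 16

// Round a requested size up to the next multiple of GGML_MEM_ALIGN.
static size_t pad_to_align(size_t size) {
    return ((size + GGML_MEM_ALIGN - 1) / GGML_MEM_ALIGN) * GGML_MEM_ALIGN;
    // equivalent bit trick, since the alignment is a power of two:
    //   (size + GGML_MEM_ALIGN - 1) & ~((size_t) GGML_MEM_ALIGN - 1)
}
```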
723dac55fa py : new conversion script (#545)
Current status: Working, except for the latest GPTQ-for-LLaMa format
  that includes `g_idx`.  This turns out to require changes to GGML, so
  for now it only works if you use the `--outtype` option to dequantize it
  back to f16 (which is pointless except for debugging).

  I also included some cleanup for the C++ code.

  This script is meant to replace all the existing conversion scripts
  (including the ones that convert from older GGML formats), while also
  adding support for some new formats.  Specifically, I've tested with:

  - [x] `LLaMA` (original)
  - [x] `llama-65b-4bit`
  - [x] `alpaca-native`
  - [x] `alpaca-native-4bit`
  - [x] LLaMA converted to 'transformers' format using
        `convert_llama_weights_to_hf.py`
  - [x] `alpaca-native` quantized with `--true-sequential --act-order
        --groupsize 128` (dequantized only)
  - [x] same as above plus `--save_safetensors`
  - [x] GPT4All
  - [x] stock unversioned ggml
  - [x] ggmh

  There's enough overlap in the logic needed to handle these different
  cases that it seemed best to move to a single script.

  I haven't tried this with Alpaca-LoRA because I don't know where to find
  it.

  Useful features:

  - Uses multiple threads for a speedup in some cases (though the Python
    GIL limits the gain, and sometimes it's disk-bound anyway).

  - Combines split models into a single file (both the intra-tensor split
    of the original and the inter-tensor split of 'transformers' format
    files).  Single files are more convenient to work with and more
    friendly to future changes to use memory mapping on the C++ side.  To
    accomplish this without increasing memory requirements, it has some
    custom loading code which avoids loading whole input files into memory
    at once.

  - Because of the custom loading code, it no longer depends on PyTorch,
    which might make installing dependencies slightly easier or faster...
    although it still depends on NumPy and sentencepiece, so I don't know
    if there's any meaningful difference.  In any case, I also added a
    requirements.txt file to lock the dependency versions in case of any
    future breaking changes.

  - Type annotations checked with mypy.

  - Some attempts to be extra user-friendly:

      - The script tries to be forgiving with arguments, e.g. you can
        specify either the model file itself or the directory containing
        it.

      - The script doesn't depend on config.json / params.json, just in
        case the user downloaded files individually and doesn't have those
        handy.  But you still need tokenizer.model and, for Alpaca,
        added_tokens.json.

      - The script tries to give a helpful error message if
        added_tokens.json is missing.
2023-04-14 10:03:03 +03:00
0f07cacb05 ggml : fix q4_1 dot product types master-0f07cac 2023-04-14 09:45:42 +03:00
c5d70f5c9e ggml : optimize rope function to avoid call powf in the tight loop (#807) master-c5d70f5 2023-04-14 09:24:52 +03:00
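The speedup comes from the fact that RoPE angles form a geometric series, so one running multiply can replace a powf call per iteration. A hedged sketch of the transformation (variable names illustrative):

```cpp
#include <math.h>

// Before: a powf call on every iteration of the tight loop.
//   for (int i = 0; i < n_dims; i += 2) {
//       const float theta = p * powf(10000.0f, ((float) -i) / n_dims);
//       ...
//   }

// After: keep a running theta and multiply by a precomputed ratio.
void rope_angles(float p, int n_dims, float * out /* n_dims/2 entries */) {
    const float theta_scale = powf(10000.0f, -2.0f / n_dims);
    float theta = p;
    for (int i = 0; i < n_dims; i += 2) {
        out[i/2] = theta; // consumers compute cosf(theta), sinf(theta)
        theta *= theta_scale;
    }
}
```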
be87b6ed20 perplexity : add support for batch size to --perplexity (#407)
* Add support to batch size for perplexity

* Revert "Fix memory allocation issues and seg faults"

This reverts commit 4870e455b3.

* update from merge

* Remove perplexity from main

* updates

* Update batch size for efficiency
master-be87b6e
2023-04-14 00:50:42 +03:00
0e07e6a839 common : remove unnecessary includes (#947) master-0e07e6a 2023-04-13 18:39:25 +03:00
a3a2a0eda8 ggml : add GGML_DEFAULT_N_THREADS master-a3a2a0e 2023-04-13 18:36:48 +03:00
d990e3fffc ggml : speed-up ggml_vec_dot_q4_1() ARM_NEON + 32-bit ARM support (#900)
* ggml : speed-up q4_1 ARM_NEON by ~5%

* ggml : implement vaddvq when missing

* ggml : implement vminvq and vmaxvq when missing

* ggml : implement vzip when missing

* ggml : fix comment

* ggml : try to use correct ifdef
master-d990e3f
2023-04-13 18:32:36 +03:00
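32-bit ARM NEON lacks the across-lane helpers that AArch64 provides, hence the "implement ... when missing" bullets. A sketch of one such fallback (close in spirit to, but not necessarily identical to, the actual definitions):

```cpp
#if defined(__ARM_NEON) && !defined(__aarch64__)
#include <arm_neon.h>

// Horizontal add across the four lanes of a float32x4_t. AArch64 has a
// single instruction for this (vaddvq_f32); armv7 NEON does not, so a
// lane-by-lane fallback is provided under the same name.
inline static float vaddvq_f32(float32x4_t v) {
    return vgetq_lane_f32(v, 0) + vgetq_lane_f32(v, 1) +
           vgetq_lane_f32(v, 2) + vgetq_lane_f32(v, 3);
}
#endif
```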
9190e8eac8 llama : merge llama_internal.h into llama.h
Hide it behind an #ifdef
master-9190e8e
2023-04-13 18:04:45 +03:00
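A hedged sketch of the pattern (the guard macro name and the declaration shown are illustrative):

```cpp
// llama.h -- public API declarations above this point.

#ifdef LLAMA_API_INTERNAL
// Formerly in llama_internal.h: visible only to in-tree tools that
// define the guard macro before including llama.h.
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, struct ggml_tensor *>> &
llama_internal_get_tensor_map(struct llama_context * ctx);
#endif
```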
c85980acd0 gitignore : benchmark 2023-04-13 18:01:33 +03:00
6232f2d7fd ggml : optimize non-SIMD Q4_0 vector dot product (#703) master-6232f2d 2023-04-13 17:59:50 +03:00
6c248707f5 ggml : introduce GGML_ALIGNED_MALLOC/GGML_ALIGNED_FREE macros (#884)
which allows us to use the aligned_alloc or _aligned_malloc functions
master-6c24870
2023-04-13 17:08:32 +03:00
8cda5c981d fix whitespace (#944) master-8cda5c9 2023-04-13 16:03:57 +02:00
ec29272175 readme : remove python 3.10 warning (#929) 2023-04-13 16:59:53 +03:00
7e941b95eb readme : llama node binding (#911)
* chore: add nodejs binding

* chore: add nodejs binding
2023-04-13 16:54:27 +03:00
c729ff730a flake.nix: add all binaries from bin (#848) 2023-04-13 15:49:05 +02:00
4579af95e8 zig : update build.zig (#872)
* update

* update readme

* minimize the changes.

---------

Co-authored-by: zjli2019 <zhengji.li@ingchips.com>
2023-04-13 16:43:22 +03:00
8c3ffc2f04 ggml : update cblas_sgemm columns var to be more reasonable (#838) master-8c3ffc2 2023-04-13 16:24:30 +03:00