SOTA 2-bit quants (#4773)

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-08-27 02:28:19 -04:00

* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change this later; for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantize works, something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slightly faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in dequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion
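
On the Q8_K note in the scalar and AVX2 entry above: one plausible reason (an assumption, not stated in the commit message) that a symmetric -127...127 range helps is that flipping the sign of an 8-bit quant can then never overflow, whereas negating -128 does not fit in int8. The sketch below illustrates that idea only. `quantize_sym8` and `flip_sign` are hypothetical names, not ggml functions, and the scale convention is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a ggml function): quantize n floats to int8
 * so that every quant lies in [-127, 127]. Returns the scale d such that
 * x[i] is approximately d * q[i]. */
static float quantize_sym8(const float *x, int8_t *q, int n) {
    assert(n > 0);

    /* Largest magnitude in the input decides the scale. */
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    if (amax == 0.0f) {
        for (int i = 0; i < n; ++i) q[i] = 0;
        return 0.0f;
    }

    const float d = amax / 127.0f;
    for (int i = 0; i < n; ++i) {
        const float r = x[i] / d;
        /* Round to nearest, then clamp to the symmetric range. */
        int v = (int)(r < 0.0f ? r - 0.5f : r + 0.5f);
        if (v >  127) v =  127;
        if (v < -127) v = -127;
        q[i] = (int8_t)v;
    }
    return d;
}

/* Flip the sign of a quant. This is safe for every value produced above
 * because -127..127 is closed under negation; -128 would not be. */
static int8_t flip_sign(int8_t q) {
    assert(q != INT8_MIN);
    return (int8_t)(-q);
}
```

With inputs such as { -2.0f, -0.5f, 0.0f, 2.0f }, the extremes map to -127 and 127, so a later sign flip stays representable.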

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
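
To make the dequantization fix in the list above concrete (the missing 0.25f factor), here is a rough sketch of how such a factor could enter the scale computation. It assumes that each group of quants carries a 4-bit scale and that the effective scale is d * (0.5 + s) * 0.25. That formula, the 8-value group layout, and the names `iq2_group_scale` and `iq2_dequant8` are assumptions made for illustration, not quoted from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: effective scale for one group of quants.
 *   d  - per-block scale, already converted to float
 *   s4 - 4-bit group scale in [0, 15]
 * Assumed formula: d * (0.5 + s4) * 0.25. Dropping the 0.25f factor
 * would make every dequantized value four times too large. */
static float iq2_group_scale(float d, uint8_t s4) {
    assert(s4 < 16);
    return d * (0.5f + (float)s4) * 0.25f;
}

/* Hypothetical sketch: dequantize 8 values from unsigned grid magnitudes
 * and a sign byte (bit j set means value j is negative). */
static void iq2_dequant8(float db, const uint8_t grid[8], uint8_t signs,
                         float out[8]) {
    for (int j = 0; j < 8; ++j) {
        const float v = db * (float)grid[j];
        out[j] = (signs & (1u << j)) ? -v : v;
    }
}
```

For d = 1 and a group scale of 0, the effective scale under this assumption is 0.125 rather than 0.5.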

Kawrakow, 2024-01-08 16:02:32 +01:00, committed by GitHub

parent 668b31fc7d
commit dd5ae06405

10 changed files with 902 additions and 1 deletions

									
llama.h
@@ -103,6 +103,7 @@ extern "C" {
         LLAMA_FTYPE_MOSTLY_Q5_K_S        = 16, // except 1d tensors
         LLAMA_FTYPE_MOSTLY_Q5_K_M        = 17, // except 1d tensors
         LLAMA_FTYPE_MOSTLY_Q6_K          = 18, // except 1d tensors
+        LLAMA_FTYPE_MOSTLY_IQ2_XXS       = 19, // except 1d tensors
 
         LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
     };
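
The hunk above adds a single enumerator with value 19. As a minimal sketch of how calling code might branch on it, the block below redeclares only the enumerators shown in the hunk (so it compiles without llama.h) and maps them to short labels. The enum tag `llama_ftype_subset`, the helper `ftype_label`, and the label strings are hypothetical and not part of the llama.cpp API.

```c
#include <assert.h>

/* Local copy of the enumerators visible in the hunk, so that this sketch
 * is self-contained. The real definitions live in llama.h. */
enum llama_ftype_subset {
    LLAMA_FTYPE_MOSTLY_Q5_K_S  = 16,
    LLAMA_FTYPE_MOSTLY_Q5_K_M  = 17,
    LLAMA_FTYPE_MOSTLY_Q6_K    = 18,
    LLAMA_FTYPE_MOSTLY_IQ2_XXS = 19, /* added by this commit */
    LLAMA_FTYPE_GUESSED        = 1024
};

/* Hypothetical helper: return a short illustrative label for an ftype
 * value, or "unknown" for values outside the subset above. */
static const char *ftype_label(int ftype) {
    assert(ftype >= 0); /* every enumerator shown above is non-negative */
    switch (ftype) {
        case LLAMA_FTYPE_MOSTLY_Q5_K_S:  return "Q5_K_S";
        case LLAMA_FTYPE_MOSTLY_Q5_K_M:  return "Q5_K_M";
        case LLAMA_FTYPE_MOSTLY_Q6_K:    return "Q6_K";
        case LLAMA_FTYPE_MOSTLY_IQ2_XXS: return "IQ2_XXS";
        case LLAMA_FTYPE_GUESSED:        return "guessed";
        default:                         return "unknown";
    }
}
```

Code that predates the commit would fall through to the default branch for the value 19, which is why a new file type needs an explicit case wherever ftype values are interpreted.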
