Commit Graph

3278 Commits

Author SHA1 Message Date
4483396751 llama : apply classifier-free guidance to logits directly (#4951) b1878 2024-01-15 15:06:52 +02:00
d9aa4ffa6e awq-py : fix typo in awq-py/README.md (#4947) 2024-01-15 14:41:46 +02:00
ddb008d845 cuda : fix dequantize kernel names (#4938) b1876 2024-01-15 13:27:00 +02:00
2faaef3979 llama : check for 256 divisibility for IQ2_XS, IQ2_XXS (#4950)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b1875
2024-01-15 10:09:38 +02:00
4a3156de2f CUDA: faster dequantize kernels for Q4_0 and Q4_1 (#4938)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b1874
2024-01-15 07:48:06 +02:00
a836c8f534 llama : fix missing quotes (#4937) b1873 2024-01-14 17:46:00 +02:00
467a882fd2 Add ability to use importance matrix for all k-quants (#4930)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b1872
2024-01-14 16:21:12 +02:00
bb0c139247 llama : check LLAMA_TRACE env for extra logging (#4929)
* llama : minor fix indent

* llama : check LLAMA_TRACE env for extra logging

ggml-ci
b1871
2024-01-14 13:26:53 +02:00
9408cfdad6 scripts : sync-ggml-am.sh option to skip commits 2024-01-14 11:08:41 +02:00
03c5267490 llama : use LLAMA_LOG_ macros for logging b1869 2024-01-14 11:03:19 +02:00
a128c38de8 Fix ffn_down quantization mix for MoE models (#4927)
* Fix ffn_down quantization mix for MoE models

In #4872 I did not consider the part where every third
tensor is quantized with more bits. For MoE this leads to tensors
of the same layer being quantized with a different number of bits,
which is not handled by the inference implementation
(it is assumed that all experts use the same quantization).

* Fix the fix

* Review suggestion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b1868
2024-01-14 10:53:39 +02:00
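
For illustration, a minimal C++ sketch of the idea behind the fix above (hypothetical names, not the actual llama.cpp quantization code): key the "more bits for every third ffn_down tensor" rule on the layer index, so that all experts of an MoE layer share one quantization type.

```cpp
// Illustrative sketch only -- hypothetical names, not the actual
// llama.cpp quantization code.
enum class qtype { Q2_K, Q3_K };

// The "every third ffn_down tensor gets more bits" rule, keyed on the
// layer index instead of a running tensor counter, so that every expert
// in an MoE layer ends up with the same quantization type (the inference
// code assumes a single type per layer).
static qtype pick_ffn_down_type(int layer_idx) {
    return (layer_idx % 3 == 0) ? qtype::Q3_K : qtype::Q2_K;
}
```
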
5f5fe1bd60 metal : correctly set SIMD support flags on iOS (#4923)
* Correctly set support_simdgroup_reduction and support_simdgroup_mm on iPhone/iPad

* log a little bit more info on iOS
b1867
2024-01-14 10:44:39 +02:00
ac32902a87 llama : support WinXP build with MinGW 8.1.0 (#3419) b1866 2024-01-14 10:41:44 +02:00
147b17ac94 2-bit quantizations (#4897)
* imatrix: load

* imatrix: WIP

* imatrix: Add Q2_K quantization

* imatrix: also guard against Q2_K_S quantization without importance matrix

* imatrix: guard even more against low-bit quantization misuse

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b1865
2024-01-14 09:45:56 +02:00
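
A hedged sketch of the kind of guard described in the bullets above (names are hypothetical, not the actual llama.cpp code): refuse a very low-bit quantization run when no importance matrix is supplied.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical guard, sketching the check described above: very low-bit
// quantization types are refused when no importance matrix is supplied.
static void check_imatrix_required(const std::string & ftype, bool has_imatrix) {
    const bool low_bit = ftype == "IQ2_XXS" || ftype == "IQ2_XS" || ftype == "Q2_K_S";
    if (low_bit && !has_imatrix) {
        throw std::runtime_error("quantization type " + ftype + " requires an importance matrix");
    }
}
```
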
807179ec58 Make Q3_K_S be the same as old Q3_K_L for Mixtral-8x7B (#4906)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b1864
2024-01-14 09:44:30 +02:00
76484fbfd3 sync : ggml b1863 2024-01-14 00:14:46 +02:00
c71d608ce7 ggml: cache sin/cos for RoPE (#4908) b1862 2024-01-13 21:41:37 +01:00
4be5ef556d metal : remove old API (#4919)
ggml-ci
b1861
2024-01-13 20:45:45 +02:00
0ea069b87b server : fix prompt caching with system prompt (#4914) b1860 2024-01-13 19:31:26 +02:00
f172de03f1 llama : fix detokenization of non-special added-tokens (#4916)
Co-authored-by: goerch <jhr.walter@t-online.de>
b1859
2024-01-13 18:47:38 +02:00
2d57de5255 metal : disable log for loaded kernels (#4794) b1858 2024-01-13 18:46:37 +02:00
df845cc982 llama : minimize size used for state save/load (#4820)
* examples : save-load-state: save only required state

* llama : only reserve n_vocab * n_batch at most for logits

llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.

* llama : always reserve n_vocab * n_batch for logits

llama_context de-serialization breaks if the contexts have differing
capacity for logits, and llama_decode will resize to at most
n_vocab * n_batch.

* llama : only save and restore used logits

For a batch size of 512 this reduces the saved state by around 62 MB
in the best case, which can be significant when saving state on each
message to allow regenerating messages.

* llama : use ostringstream and istringstream for save and load

* llama : serialize rng into minimum amount of space required

* llama : break session version due to serialization changes
b1857
2024-01-13 18:29:43 +02:00
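
A rough sketch of the sizing and save rules described in the commit above (illustrative names only, not the actual llama.cpp implementation): reserve room for at most n_vocab * n_batch logits, and write only the logits that were actually produced into the session data.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Rough sketch of the sizing and save rules described above; the names
// here are illustrative, not the actual llama.cpp symbols.
struct logits_state_sketch {
    std::vector<float> logits; // reserved once per context
    size_t n_used = 0;         // logits actually produced by the last decode

    // llama_decode never produces more than n_vocab * n_batch logits,
    // so there is no need to reserve n_vocab * n_ctx.
    void reserve(size_t n_vocab, size_t n_batch) {
        logits.resize(n_vocab * n_batch);
    }

    // Only the used portion needs to be written into the session data.
    size_t save(std::vector<unsigned char> & out) const {
        const size_t bytes = n_used * sizeof(float);
        out.resize(bytes);
        std::memcpy(out.data(), logits.data(), bytes);
        return bytes;
    }
};
```
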
6b48ed0893 workflows: unbreak nix-build-aarch64, and split it out (#4915)
The fix should be just the `sudo apt-get update`
b1856
2024-01-13 16:29:16 +00:00
722d33f34e main : add parameter --no-display-prompt (#4541)
* add the parameter --no-display-prompt; combined with --log-disable it will display only the generated tokens

* remove empty line

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1855
2024-01-13 18:09:08 +02:00
c30b1ef39a gguf : fix potential infinite for-loop (#4600)
Co-authored-by: Bernhard Gstrein <gstrein@informatik.uni-freiburg.de>
b1854
2024-01-13 18:06:20 +02:00
b38b5e93ae metal : refactor kernel loading code (#4794)
* metal : detect more GPU families

* metal : refactor kernel loading

* metal : set kernel family requirements

* metal : fix kernel init + fix compile options

* metal : take into account simdgroup reduction support

* metal : print only skipped kernels

* metal : fix check for simdgroup reduction support

* metal : check for Metal 3

* metal : free allocations

* metal : normalize encoder:setComputePipelineState calls

ggml-ci

* metal : fix Metal3 family check

ggml-ci

* metal : check for simdgroup matrix mul. feature

ggml-ci
b1853
2024-01-13 18:03:45 +02:00
7dc78764e2 compare-llama-bench: tweak output format (#4910) 2024-01-13 15:52:53 +01:00
356327feb3 server : fix deadlock that occurs in multi-prompt scenarios (#4905)
* fix deadlock

* don't ruin all whitespace
b1851
2024-01-13 16:20:46 +02:00
ee8243adaa server : fix crash with multimodal models without BOS token (#4904) b1850 2024-01-13 16:16:11 +02:00
15ebe59210 convert : update phi-2 to latest HF repo (#4903)
* convert : update phi-2 to latest HF repo

ggml-ci

* py : try to fix flake stuff
b1849
2024-01-13 13:44:37 +02:00
de473f5f8e sync : ggml b1848 2024-01-12 22:02:43 +02:00
f238461236 ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)
* ggml : fix 32-bit ARM compat

* ggml : fix fix

* ggml : fix fix fix
2024-01-12 22:02:11 +02:00
fa5c1fb44a backend_sched : fix assignments
ggml-ci
2024-01-12 22:02:11 +02:00
52ee4540c0 examples : add pydantic models to GBNF grammar generator (#4883)
* Create pydantic-models-to-grammar.py

* Added some comments for usage

* Refactored Grammar Generator

Added example and usage instruction.

* Update pydantic_models_to_grammar.py

* Update pydantic-models-to-grammar-examples.py

* Renamed module and imported it.

* Update pydantic-models-to-grammar.py

* Renamed file and fixed grammar generator issue.
2024-01-12 21:46:45 +02:00
3fe81781e3 CUDA: faster q8_0 -> f16 dequantization (#4895) b1844 2024-01-12 20:38:54 +01:00
e7e4df031b llama : ggml-backend integration (#4766)
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (#4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (#4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b1843
2024-01-12 20:07:38 +01:00
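
For reference, a hedged sketch of the split-mode setting introduced in this commit (enum and function names are illustrative and may differ from the actual llama.h API): the --split-mode / -sm argument selects whether weights stay on one device, are split by whole layers, or are split by tensor rows.

```cpp
#include <cstring>

// Illustrative sketch of the split-mode setting added here; the names
// below are for illustration and may not match the actual llama.h API.
enum class split_mode {
    none,  // keep all weights on a single device
    layer, // split whole layers across devices
    row,   // split individual tensors by rows across devices
};

// Parse the value passed to --split-mode / -sm.
static split_mode parse_split_mode(const char * arg) {
    if (std::strcmp(arg, "none") == 0) return split_mode::none;
    if (std::strcmp(arg, "row")  == 0) return split_mode::row;
    return split_mode::layer; // "layer" as the fallback
}
```
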
584d674be6 llama : remove redundant assert for StableLM (#4901) b1842 2024-01-12 20:54:12 +02:00
930f907d3e export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
b1841
2024-01-12 19:54:53 +02:00
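
The change above boils down to replacing a raw literal with the named constant; a small sketch, assuming llama.h is available (as the commit notes, it is pulled in indirectly via common.h):

```cpp
#include <cstdint>
#include "llama.h" // defines LLAMA_FILE_MAGIC_GGLA

// Sketch of the check after this change: the named constant from llama.h
// replaces the hard-coded magic number previously used in export-lora.cpp.
static bool is_lora_magic(uint32_t magic) {
    return magic == LLAMA_FILE_MAGIC_GGLA;
}
```
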
e790eef21c llama.swiftui : update models layout (#4826)
* Updated Models Layout

- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left

* trimmed trailing white space

* Updated Models Layout
b1840
2024-01-12 14:48:00 +02:00
5537d9d36b gitignore : imatrix 2024-01-12 14:33:21 +02:00
1b280c9fff CUDA: fix softmax compile for old CUDA versions (#4862) b1838 2024-01-12 12:30:41 +01:00
3cabe80630 llama : fix typo "imp_embd" -> "inp_embd" b1837 2024-01-12 13:11:15 +02:00
4315a94366 common : streamline the formatting of help (#4890)
* common : streamline the formatting of help

- Separate alternative parameters by a comma

- Do not indent `--version` differently

* Update common/common.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1836
2024-01-12 13:05:32 +02:00
2d00741e12 py : fix lint (#4889) 2024-01-12 13:03:38 +02:00
f445c0e68c llama : fix llm_build_k_shift to use correct n_rot (#4889)
* llama : fix llm_build_k_shift to use correct n_rot

ggml-ci

* llama : always use hparams.n_rot for ggml_rope_custom

ggml-ci

* convert : fix persimmon conversion to write correct n_rot
b1834
2024-01-12 13:01:56 +02:00
326b418b59 Importance Matrix calculation (#4861)
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1833
2024-01-12 06:59:57 +01:00
1d118386fe server : fix infill when prompt is empty (#4833) b1832 2024-01-11 23:23:49 +02:00
7edefbd79c main : better name for variable n_print (#4874) b1831 2024-01-11 22:46:26 +02:00
3ca63b4538 main : disable token count by default (#4874) b1830 2024-01-11 22:43:05 +02:00
b037787548 swift : track ggml release branch (#4867) b1829 2024-01-11 21:58:28 +02:00