Commit Graph

3324 Commits

Author SHA1 Message Date
213701b51a Detokenizer fixes (#8039)
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Treat an unexpected vocab type as a test failure instead of an error
    - Useful when automating tests:
    - If you don't know the vocab type in advance
    - Differentiate other loading errors
  - Skip Unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexes split undefined Unicode codepoints
* 'viking' detokenizer clean spaces
b3324
2024-07-05 19:01:35 +02:00
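As context for the detokenizer entry above: llama_detokenize() is described as the inverse of llama_tokenize() with symmetric parameters. Below is a minimal round-trip sketch; the parameter names and the negative-return-on-small-buffer convention are assumptions mirroring llama_tokenize(), not copied from llama.h.
```cpp
// Round-trip sketch (tokenize, then detokenize) against the C API described
// in this commit. Buffer sizes and flag names are illustrative assumptions.
#include <string>
#include <vector>
#include "llama.h"

static std::string round_trip(const llama_model * model, const std::string & text) {
    std::vector<llama_token> tokens(text.size() + 2); // generous upper bound
    const int32_t n_tokens = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                            tokens.data(), (int32_t) tokens.size(),
                                            /*add_special=*/true, /*parse_special=*/false);
    if (n_tokens < 0) {
        return {}; // buffer too small (should not happen with the bound above)
    }
    tokens.resize(n_tokens);

    std::vector<char> buf(text.size() * 2 + 16); // room for the reconstructed text
    const int32_t n_chars = llama_detokenize(model, tokens.data(), n_tokens,
                                             buf.data(), (int32_t) buf.size(),
                                             /*remove_special=*/false,
                                             /*unparse_special=*/false);
    if (n_chars < 0) {
        return {};
    }
    return std::string(buf.data(), n_chars);
}
```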
be20e7f49d Reorganize documentation pages (#8325)
* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
2024-07-05 18:08:32 +02:00
7ed03b8974 llama : fix compile warning (#8304) b3322 2024-07-05 17:32:09 +03:00
1d894a790e cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281) 2024-07-05 17:29:35 +03:00
1f3e1b66e2 Enabled more data types for oneMKL gemm_batch (#8236) 2024-07-05 13:23:25 +01:00
148ec970b6 convert : remove AWQ remnants (#8320) 2024-07-05 10:15:36 +03:00
2cccbaa008 llama : minor indentation during tensor loading (#8304)
* llama : minor indentation during tensor loading

ggml-ci

* llama : use int for layer iterators [no ci]
2024-07-05 10:15:24 +03:00
8e558309dc CUDA: MMQ support for iq4_nl, iq4_xs (#8278) b3317 2024-07-05 09:06:31 +02:00
0a423800ff CUDA: revert part of the RDNA1 optimizations (#8309)
The change to launch_bounds was causing a small performance drop of 25 t/s in perplexity runs
b3316
2024-07-05 09:06:09 +02:00
d12f781074 llama : streamline embeddings from "non-embedding" models (#8087) b3315 2024-07-05 10:05:56 +03:00
bcefa03bc0 CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311) b3314 2024-07-05 09:05:34 +02:00
5a7447c569 readme : fix minor typos [no ci] (#8314) 2024-07-05 09:58:41 +03:00
61ecafa390 passkey : add short intro to README.md [no-ci] (#8317)
* passkey : add short intro to README.md [no-ci]

This commit adds a short introduction to the README.md file in the
examples/passkey directory.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update examples/passkey/README.md

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-05 09:14:24 +03:00
aa5898dc53 llama : prefer n_ over num_ prefix (#8308) b3311 2024-07-05 09:10:03 +03:00
6c05752c50 contributing : update guidelines (#8316) 2024-07-05 09:09:47 +03:00
a9554e20b6 [SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)
* fix group_norm ut

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp
b3309
2024-07-05 13:06:13 +08:00
e235b267a2 py : switch to snake_case (#8305)
* py : switch to snake_case

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* cont : fix link

* gguf-py : use snake_case in scripts entrypoint export

* py : rename requirements for convert_legacy_llama.py

Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-05 07:53:33 +03:00
f09b7cb609 Replace get_work_group_size() with a local cache for performance (#8286)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
b3307
2024-07-05 10:32:29 +08:00
a38b884c6c cli: add EOT when user hits Ctrl+C (#8296)
* main: add need_insert_eot

* do not format system prompt if it is empty
b3306
2024-07-04 20:55:03 +02:00
d7fd29fff1 llama : add OpenELM support (#7359)
* Initial OpenELM support (270M only so far)

* Fill out missing entries in llama_model_type_name

* fixup! Initial OpenELM support (270M only so far)

Fix formatting

* llama : support all OpenELM models

* llama : add variable GQA and variable FFN sizes

Some metadata keys can now also be arrays to support setting
their value per-layer for models like OpenELM.

* llama : minor spacing changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : use std::array for per-layer hparams

* llama : fix save/load state

* llama : do not print hparams for vocab-only models

* llama : handle n_head == 0

* llama : use const ref for print_f and fix division by zero

* llama : fix t5 uses of n_head and n_ff

* llama : minor comment

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3305
2024-07-04 20:14:21 +03:00
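As a reading aid for the per-layer hparams change above: metadata keys such as head counts can now be arrays, with values held in fixed-size std::array fields. A rough sketch with illustrative names and sizes (not llama.cpp's actual definitions):
```cpp
// Illustrative per-layer hyperparameters for variable GQA / FFN sizes,
// as described for OpenELM. Names and the layer cap are assumptions.
#include <array>
#include <cstdint>

constexpr uint32_t MAX_LAYERS = 512; // illustrative cap

struct per_layer_hparams {
    uint32_t n_layer = 0;
    std::array<uint32_t, MAX_LAYERS> n_head_arr = {}; // attention heads per layer
    std::array<uint32_t, MAX_LAYERS> n_ff_arr   = {}; // feed-forward size per layer

    uint32_t n_head(uint32_t il) const { return n_head_arr[il]; }
    uint32_t n_ff  (uint32_t il) const { return n_ff_arr[il];   }
};
```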
6f63d646c1 tokenize : add --show-count (token) option (#8299)
This commit adds a new option to the tokenize example, --show-count.
When this is set, the total number of tokens is printed to stdout.

This was added as an option because there might be scripts that use the
output from this program, so it is better not to print this information
by default.

The motivation is that it can be useful to find out how many tokens a
file contains, for example when trying to determine prompt input file
sizes for testing.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
b3304
2024-07-04 19:38:58 +03:00
51d2ebadbb build: Export hf-to-gguf as snakecase b3303 2024-07-04 15:39:13 +00:00
1e920018d3 doc: Add context for why we add an explicit pytorch source 2024-07-04 15:39:13 +00:00
01a5f06550 chore: Remove rebase artifacts 2024-07-04 15:39:13 +00:00
07786a61a2 chore: Fixup requirements and build 2024-07-04 15:39:13 +00:00
de14e2ea2b chore: ignore all __pycache__ 2024-07-04 15:39:13 +00:00
821922916f fix: Update script paths in CI scripts 2024-07-04 15:39:13 +00:00
b1c3f26e5e fix: Actually include scripts in build
Not namespaced though :(
2024-07-04 15:39:13 +00:00
b0a46993df build(python): Package scripts with PEP 517 compliance 2024-07-04 15:39:13 +00:00
807b0c49ff Inference support for T5 and FLAN-T5 model families (#5763)
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3295
2024-07-04 15:46:11 +02:00
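As context for the encoder-decoder API above, here is a control-flow sketch of how the three new functions might fit together; batch construction and sampling are elided, and only the calls named in the commit are assumed.
```cpp
// Control-flow sketch for the encoder-decoder path (not the actual
// llama-cli implementation). Only llama_encode(), llama_model_has_encoder()
// and llama_model_decoder_start_token() are taken from the commit notes.
#include "llama.h"

static void run_encoder_decoder(llama_context * ctx, const llama_model * model,
                                llama_batch prompt_batch) {
    if (llama_model_has_encoder(model)) {
        // 1. Run the encoder once over the full prompt.
        if (llama_encode(ctx, prompt_batch) != 0) {
            return; // encoding failed
        }
        // 2. Seed the decoder with the model's decoder start token.
        const llama_token dec_start = llama_model_decoder_start_token(model);
        // 3. From here, decoding proceeds as usual: feed dec_start (and each
        //    sampled token after it) to llama_decode() in a loop.
        (void) dec_start;
    } else {
        // Decoder-only models keep the existing llama_decode()-only path.
    }
}
```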
f8c4c0738d tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)
This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS`
to the root cmake project.

The motivation for this is that currently the following warnings are
displayed when compiling the tests and common cmake subprojects:
```console
test-llama-grammar.cpp
C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror':
This function or variable may be unsafe. Consider using strerror_s
instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
[C:\llama.cpp\build\tests\test-llama-grammar.vcxproj]
...
```

This compile definition is currently set for the `src` subproject,
and this change moves it into the root cmake project so that it is
applied to all cmake subprojects.
b3294
2024-07-04 13:53:42 +03:00
402d6feffa llama : suppress unref var in Windows MSVC (#8150)
* llama : suppress unref var in Windows MSVC

This commit suppresses two warnings that are currently generated for
src/llama.cpp when building on Windows with MSVC:

```console
C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
```

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3293
2024-07-04 13:50:57 +03:00
20fc3804bf convert : fix gemma v1 tokenizer convert (#8248)
ggml-ci
b3292
2024-07-04 10:41:03 +03:00
f619024764 [SYCL] Remove unneeded semicolons (#8280) b3291 2024-07-04 09:07:19 +08:00
d23287f122 Define and optimize RDNA1 (#8085) b3290 2024-07-04 01:02:58 +02:00
5f2d4e60e2 ppl : fix n_seq_max for perplexity (#8277)
* ppl : fix n_seq_max for perplexity

* use 1 seq for kl_divergence
b3289
2024-07-03 20:33:31 +03:00
916248af1f fix phi 3 conversion (#8262) 2024-07-03 16:01:54 +02:00
f8d6a23804 fix typo (#8267)
Co-authored-by: Judd <foldl@boxvest.com>
b3287
2024-07-03 14:40:16 +02:00
fadde67135 Dequant improvements rebase (#8255)
* Single load for half2

* Store scales in local mem

* Vec load quantized values
b3286
2024-07-03 09:55:34 +08:00
a27152b602 fix: add missing short command line argument -mli for multiline-input (#8261) b3285 2024-07-02 22:56:46 +02:00
3e2618bc7b Add a step to the clean target that removes legacy binary names, reducing upgrade / migration confusion arising from #7809. (#8257) b3284 2024-07-02 13:19:56 -04:00
07a3fc0608 Remove multiple newlines at the end of files that were breaking the editorconfig step of CI. (#8258) b3283 2024-07-02 12:18:10 -04:00
968967376d Add JAIS model(s) (#8118)
* Add `JAIS` model(s)

* cleanup

* address review comments

* remove hack

* un-hardcode max-alibi-bias

* minor tweaks

---------

Co-authored-by: fmz <quic_fzaghlou@quic.com>
b3282
2024-07-02 16:36:00 +02:00
023b8807e1 convert-hf : print output file name when completed (#8181)
* convert-hf : print output file name when completed

This commit adds the output file name to the log message when the
conversion is completed.

The motivation for this change is that when the `--outfile` option is not
specified, it might not be obvious where the output file is written.

With this change the output of running the script will be something like
the following:
```console
INFO:hf-to-gguf:Model successfully exported to models/gemma-2-9b-it.gguf.
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Updates the output to support printing the directory if the output is
split into multiple files. Also, the output file name is now retrieved
from the model_instance object.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Use parent attribute of Path object and string interpolation.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Use os.sep instead of hardcoding the path separator.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-02 09:40:49 +03:00
0e0590adab cuda : update supports_op for matrix multiplication (#8245) b3280 2024-07-02 09:39:38 +03:00
a9f3b10215 [SYCL] Fix win build conflict of math library (#8230)
* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16
b3279
2024-07-02 12:50:07 +08:00
d08c20edde [SYCL] Fix the sub group size of Intel (#8106)
* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size
b3278
2024-07-02 10:16:00 +08:00
5fac350b9c Fix gemma2 tokenizer convert (#8244)
* fix gemma2 tokenizer convert

* remove scores

* improve code, fix new line issue
2024-07-02 01:07:23 +02:00
cb5fad4c6c CUDA: refactor and optimize IQ MMVQ (#8215)
* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix
b3276
2024-07-01 20:39:06 +02:00
dae57a1ebc readme: add Paddler to the list of projects (#8239) 2024-07-01 20:13:22 +03:00