Commit Graph

2957 Commits

Author SHA1 Message Date
c3f8d58356 tests : test-tokenizer-0.sh print more info (#7402) 2024-05-21 19:53:48 +03:00
11474e756d examples: cache hf model when --model not provided (#7353)
b2956
2024-05-21 17:13:12 +03:00
d8ee902227 CUDA: deduplicate mmq code (#7397) b2955 2024-05-21 16:02:12 +02:00
d7e852c1bc Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
917dc8cfa6 Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
b2953
2024-05-20 20:15:57 +02:00
fabf30b4c4 llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
b2952
2024-05-21 02:35:28 +10:00
20385cebcc perplexity: update README FP16 results [no ci] (#7413) 2024-05-20 18:15:38 +02:00
db10f01310 rpc : track allocated buffers (#7411)
* rpc : track allocated buffers

ref: #7407

* rpc : pack rpc_tensor tightly
b2950
2024-05-20 16:36:55 +03:00
3bc10cb485 server : fix temperature + disable some tests (#7409)
* server : fix temperature

* server : disable tests relying on parallel determinism

* ci : change server Debug -> RelWithDebInfo
b2949
2024-05-20 22:10:03 +10:00
6bf9b66fa3 [SYCL] Update SYCL upscale operation (#7321)
* Update SYCL upscale operation

* Formatting

* Remove messages
b2948
2024-05-20 16:38:23 +05:30
26cd4237bc Update README.md (#7410) 2024-05-20 11:55:34 +02:00
213e90ed73 ggml-opencl, llama: using reserve() if count already known (#7272) b2946 2024-05-20 10:33:21 +03:00
65c58207ec ggml : add loongarch lsx and lasx support (#6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
b2945
2024-05-20 10:19:21 +03:00
1cc0155d04 server : tuning tests (#7388)
* server : don't pass temperature as string

* server : increase timeout

* tests : fix the fix 0.8f -> 0.8

ggml-ci

* tests : set explicit temperature
2024-05-20 10:16:41 +03:00
e932094d58 server : return error on too large embedding input (#7389) b2943 2024-05-20 08:56:05 +03:00
2789baf480 tests : fix --keep_split -> --keep-split (#7374) 2024-05-20 08:55:09 +03:00
33c8d50acc Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (#7258) b2941 2024-05-20 12:18:39 +10:00
d359f30921 llama : remove MPI backend (#7395) b2940 2024-05-20 01:17:03 +02:00
1ea2a0036e quantize : fix --keep-split check (#7374) b2939 2024-05-19 19:37:04 +03:00
f030ec1f7a Vulkan Embedding Fix (#7360)
* Fix empty Vulkan host buffers

Add fp32 fp16 matmul shader

Fix matmul shader alignment

* Remove deprecated tensor->backend uses

* Fix Vulkan validation errors on embedding models with no offloaded layers

* Fix Vulkan llava segfault when not offloading layers
b2938
2024-05-19 17:19:53 +02:00
e4e6f67be6 ggml : fix another case of quants nans (#7387) b2937 2024-05-19 17:08:46 +02:00
5ca49cbecd ggml: implement quantized KV cache for FA (#7372) b2936 2024-05-19 16:46:13 +02:00
1b01f06db0 server: add test for token probs (#7347) 2024-05-19 16:26:02 +02:00
41858392e1 server: fix seed being reported back (#7382) b2934 2024-05-19 17:06:33 +03:00
6aade19ee7 Add StableLM2 pre-tokenizer (#7349)
* Add StableLM pre-tokenizer

* Fix space

* Fix trailing whitespace
b2933
2024-05-19 22:46:46 +10:00
ab33f7a338 cuda : clear error after buffer allocation failure (#7376) b2932 2024-05-19 14:19:37 +02:00
e23b974f4c labeler.yml: Use settings from ggerganov/llama.cpp [no ci] (#7363)
https://github.com/actions/labeler#using-configuration-path-input-together-with-the-actionscheckout-action
The labeler docs recommend using the checkout action so the correct
repo context is used when applying PR label settings

e.g.

    steps:
    - uses: actions/checkout@v4 # Uploads repository content to the runner
      with:
        repository: "owner/repositoryName" # One of the available inputs; visit https://github.com/actions/checkout#readme to find more
    - uses: actions/labeler@v5
      with:
        configuration-path: 'path/to/the/uploaded/configuration/file'
2024-05-19 20:51:03 +10:00
854d365aba cmake : update android comments (#7341) b2930 2024-05-19 11:01:01 +03:00
f5bf761747 Capture CUDA logging output (#7298)
* logging: output capture in cuda module

* fix compile error

* fix: vsnprintf terminates with 0, string use not correct

* post review

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
b2929
2024-05-19 00:44:42 +02:00
059031b8c4 ci : re-enable sanitizer runs (#7358)
* Revert "ci : temporary disable sanitizer builds (#6128)"

This reverts commit 4f6d1337ca.

* ci : trigger
b2928
2024-05-18 18:55:54 +03:00
511182eabb android : use "ci-android" branch for CI (#7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
b2927
2024-05-18 20:40:39 +10:00
133d99c599 CUDA: deduplicate FlashAttention code (#7352) b2926 2024-05-18 12:36:25 +02:00
cb42c29427 server: correct --threads documentation [no ci] (#7362) 2024-05-18 11:10:47 +02:00
d233b507cd cuda : add half2 __shfl_xor() for ROCm 5.5 (#7263) 2024-05-18 10:05:17 +02:00
0f98acfac6 llama : add support for larger Granite Code Models (20B, 34B) (#7324)
Tie the weights for ARCH_STARCODER to support the larger Granite code models.
Partially addresses ggerganov/issues/7116

There still remain a few things to fix.
Currently requires `--override-kv tokenizer.ggml.add_bos_token=bool:false`
b2923
2024-05-18 11:04:55 +03:00
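The workaround flag quoted in the commit above is applied at model load time; a minimal sketch (the binary name, model filename, and prompt are illustrative, not taken from the commit):

```shell
# Disable the extra BOS token when loading a Granite Code model;
# only the --override-kv argument is quoted from the commit itself.
./main -m granite-code-20b.Q4_K_M.gguf \
    --override-kv tokenizer.ggml.add_bos_token=bool:false \
    -p "def fibonacci(n):"
```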
ca57e0f35e perplexity : ndot progress and show stats with < 100 tasks (#7348)
Fix floating point error with ndot printing; allow end stats on lower task counts when running multiple-choice tasks.
b2922
2024-05-18 10:57:08 +03:00
c1b295eea5 Update and fix Vulkan soft_max and argsort implementations (#7237)
* Update and fix Vulkan softmax implementation

* Update and fix Vulkan argsort implementation
b2921
2024-05-18 08:10:58 +02:00
de73196344 github-actions-labeler: initial commit (#7330)
* github-actions-labeler: initial commit [no ci]

* github actions: remove priority auto labeling [no ci]
2024-05-18 16:04:23 +10:00
b49a13dd2f convert : fix set_vocab_sentencepiece (#6866)
* convert : fix set_vocab_sentencepiece

* Update convert-hf-to-gguf.py
2024-05-18 08:46:20 +03:00
05834841dc ggml : fix quants nans when all the group weights are very close to zero (#7313) b2918 2024-05-18 02:39:54 +02:00
ef277de2ad cmake : fix typo in AMDGPU_TARGETS (#7356) b2917 2024-05-18 02:39:25 +02:00
b43272afa2 Unicode codepoint flags for custom regexs (#7245)
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM
b2916
2024-05-18 01:09:13 +02:00
0fc1e820a9 CUDA: faster large batch FA without tensor cores (#7314) b2915 2024-05-17 18:54:52 +02:00
82ca83db3c ROCm: use native CMake HIP support (#5966)
Supersedes #4024 and #4813.

CMake's native HIP support has become the
recommended way to add HIP code into a project (see
[here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)).
This PR makes the following changes:

1. The environment variable `HIPCXX` or CMake option
`CMAKE_HIP_COMPILER` should be used to specify the HIP
compiler. Notably this shouldn't be `hipcc`, but ROCm's clang,
which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously
this was controlled by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`.
Note that since native CMake HIP support is not yet available on
Windows, on Windows we fall back to the old behavior.

2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the
GPU architectures to build for. Previously this was controlled by
`GPU_TARGETS`.

3. Updated the Nix recipe to account for these new changes.

4. The GPU targets to build against in the Nix recipe are now
consistent with the supported GPU targets in nixpkgs.

5. Added CI checks for HIP on both Linux and Windows. On Linux, we test
both the new and old behavior.

The most important part about this PR is the separation of the
HIP compiler and the C/C++ compiler. This allows users to choose
a different C/C++ compiler if desired, compared to the current
situation where when building for ROCm support, everything must be
compiled with ROCm's clang.

~~Makefile is unchanged. Please let me know if we want to be
consistent on variables' naming because Makefile still uses
`GPU_TARGETS` to control architectures to build for, but I feel
like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're
calling `make`.~~ Makefile used `GPU_TARGETS` but the README says
to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of
`GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`.

Thanks to the suggestion of @jin-eld, to maintain backwards
compatibility (and not break too many downstream users' builds), if
`CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using
the original behavior and emit a warning that recommends switching
to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but
`CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS`
to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new
HIP support.

Signed-off-by: Gavin Zhao <git@gzgz.dev>
b2914
2024-05-17 17:03:03 +02:00
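The new configure flow described in the commit above can be sketched as follows; the ROCm path, GPU target, and the `LLAMA_HIPBLAS` flag name are assumptions for illustration, not taken from the commit:

```shell
# New-style HIP build (Linux): point CMake's HIP language support at
# ROCm's clang instead of hipcc, and select architectures via
# CMAKE_HIP_ARCHITECTURES rather than the old GPU_TARGETS.
export HIPCXX="$ROCM_PATH/llvm/bin/clang"
cmake -B build -DLLAMA_HIPBLAS=ON -DCMAKE_HIP_ARCHITECTURES=gfx1030
cmake --build build --config Release
```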
f4bd8b3d26 rpc : set SO_REUSEADDR for the server socket (#7320)
ref: #7293
b2913
2024-05-17 17:25:44 +03:00
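The socket option named in the commit above is standard POSIX; a minimal sketch of the technique (illustrative only, not the project's actual rpc server code):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

// Create a TCP socket with SO_REUSEADDR set, so a restarted server can
// rebind its port while the previous socket lingers in TIME_WAIT.
// Returns the socket fd on success, -1 on failure.
int make_reuse_socket(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```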
51e9d02599 Added a single test function script and fix debug-test.sh to be more robust (#7279)
* run-single-test.sh: added a single test function script and fix debug-test.sh to be more robust

* debug-test.sh: combined execute and gdb test mode via -g flag

* debug-test.sh: refactor

* debug-test: refactor for clarity

* debug-test.sh: comment style changes

* debug-test.sh: fix gdb
2024-05-17 22:40:14 +10:00
d273c1402b py : convert-hf-to-gguf-update improvements (#7340)
* convert-hf-to-gguf-update: automate updating

* convert-hf-to-gguf-update: improve download

* share requests session for performance
* create directories only when needed, don't skip downloads when an empty directory is encountered
* be more graceful about errors
2024-05-17 15:11:45 +03:00
27b040691c llama : use n_embd_head_v when reshaping kqv (#7327)
* llama : use n_embd_head_v instead of n_embd_head_k when reshaping kqv

* llama : use n_embd_v_gqa and n_embd_head_v instead of n_embd_k_gqa and n_embd_head_k when making a view of cached value vectors.

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b2910
2024-05-17 14:24:38 +03:00
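The distinction in the commit above matters whenever key and value heads have different sizes; a toy numpy illustration (shapes are invented, this is not llama.cpp code):

```python
import numpy as np

# Key heads and value heads need not have the same dimension.
n_head, n_tokens = 4, 3
n_embd_head_k, n_embd_head_v = 8, 6

# The attention output (kqv) has the *value* head size per head.
kqv = np.zeros((n_head, n_tokens, n_embd_head_v))

# Merging heads back into one embedding must therefore use
# n_embd_head_v, not n_embd_head_k.
out = kqv.transpose(1, 0, 2).reshape(n_tokens, n_head * n_embd_head_v)
```

Using `n_embd_head_k` here would request a reshape to `(3, 32)` from 72 elements and fail, which is the class of bug the commit fixes.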
29c60d8cdd tokenization: add warning for double BOS (#7332) b2909 2024-05-17 09:59:57 +02:00
359cbe3f46 ggml-quants, llama : removed excess checks (#7274) b2908 2024-05-17 10:08:49 +03:00