Commit Graph

4143 Commits

Author SHA1 Message Date
cda0e4b648 llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
* refactor llama_batch_get_one

* adapt all examples

* fix simple.cpp

* fix llama_bench

* fix

* fix context shifting

* free batch before return

* use common_batch_add, reuse llama_batch in loop

* null terminated seq_id list

* fix save-load-state example

* fix perplexity

* correct token pos in llama_batch_allocr
b3943
2024-10-18 23:18:01 +02:00
afd9909a64 rpc : backend refactoring (#9912)
* rpc : refactor backend

Use structs for RPC request/response messages

* rpc : refactor server
b3942
2024-10-18 14:33:58 +03:00
87421a23e8 [SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
b3941
2024-10-18 06:46:16 +01:00
60ce97c9d8 add amx kernel for gemm (#8998)
add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warnings

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as a ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
b3940
2024-10-18 13:34:36 +08:00
8901755ba3 server : add n_indent parameter for line indentation requirement (#9929)
ggml-ci
b3939
2024-10-18 07:32:19 +03:00
6f55bccbb8 llama : rename batch_all to batch (#8881)
This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.
b3938
2024-10-18 01:41:51 +02:00
17bb928080 readme : remove --memory-f32 references (#9925) b3937 2024-10-17 23:43:05 +03:00
9f45fc1e99 llama : change warning to debug log b3936 2024-10-17 23:27:42 +03:00
99bd4ac28c llama : infill sampling handle very long tokens (#9924)
* llama : infill sampling handle very long tokens

ggml-ci

* cont : better indices

ggml-ci
b3935
2024-10-17 22:32:47 +03:00
3752217ed5 readme : update bindings list (#9918)
Co-authored-by: Tim Wang <tim.wang@ing.com>
2024-10-17 09:57:14 +03:00
f010b77a37 vulkan : add backend registry / device interfaces (#9721)
* vulkan : add backend registry / device interfaces

* llama : print devices used on model load
b3933
2024-10-17 02:46:58 +02:00
2194200278 fix: allocating CPU buffer with size 0 (#9917) b3932 2024-10-17 01:34:22 +02:00
73afe681aa fix: use vm_allocate to allocate CPU backend buffer on macOS (#9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages working

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
b3931
2024-10-17 00:36:51 +02:00
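Note on 73afe681aa: the `posix_memalign` bullet says the switch keeps existing `free()` calls working. A minimal sketch of that idea follows, for POSIX systems only. The names `sketch_aligned_malloc`, `sketch_aligned_free`, `is_aligned` and the value of `kSketchAlignment` are hypothetical and are not the helpers added by the commit. Returning a null pointer for a zero-sized request is likewise an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdlib.h>  // posix_memalign (POSIX, not part of ISO C++)

// Hypothetical alignment for this sketch; must be a power of two and a
// multiple of sizeof(void *), as posix_memalign requires.
constexpr std::size_t kSketchAlignment = 64;

// Allocate `size` bytes aligned to kSketchAlignment.
// Returns nullptr for a zero-sized request (assumption) or on failure.
inline void * sketch_aligned_malloc(std::size_t size) {
    if (size == 0) {
        return nullptr;
    }
    void * ptr = nullptr;
    if (posix_memalign(&ptr, kSketchAlignment, size) != 0) {
        return nullptr;
    }
    return ptr;
}

// Memory obtained from posix_memalign is released with plain free(),
// so callers that already use free() need no changes.
inline void sketch_aligned_free(void * ptr) {
    free(ptr);
}

// True if `ptr` is a multiple of `alignment`.
inline bool is_aligned(const void * ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

A platform-specific allocator such as `vm_allocate` needs a matching deallocation call, which is presumably why the commit ended up wrapping allocation and release in a pair of functions.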
9e04102448 llama : suppress conversion from 'size_t' to 'int' (#9046)
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
b3930
2024-10-16 20:34:28 +03:00
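Note on 9e04102448: the message describes casting the `size_t` returned by `symbols.size()` and moving that cast into the `for` loop. A minimal sketch of the pattern follows. `sketch_symbol` and `count_non_empty` are hypothetical stand-ins and are not the tokenizer code touched by the commit.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the tokenizer's symbol list.
struct sketch_symbol {
    std::string text;
};

// Counts symbols with non-empty text using an int loop index.
inline int count_non_empty(const std::vector<sketch_symbol> & symbols) {
    int n = 0;
    // symbols.size() returns size_t. Casting it explicitly in the loop
    // condition states the intended narrowing instead of leaving an
    // implicit size_t -> int conversion for the compiler to warn about.
    for (int i = 0; i < static_cast<int>(symbols.size()); ++i) {
        if (!symbols[i].text.empty()) {
            ++n;
        }
    }
    return n;
}
```

The cast is only safe while the container holds fewer than `INT_MAX` elements, which is the assumption the commit message states.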
dbf18e4de9 llava : fix typo in error message [no ci] (#9884) 2024-10-16 20:24:05 +03:00
66c2c93082 grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
b3928
2024-10-16 19:03:24 +03:00
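Note on 66c2c93082: the two grammar rules quoted in the message differ only in the parentheses around the alternation. The sketch below rebuilds both rule strings to make the difference concrete. `sketch_string_rule` is a hypothetical helper and is not the JSON Schema converter changed by the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds a rule for a quoted string whose body is one of `alts`.
// With `parenthesize` set, the alternation is grouped so the opening and
// closing quotes apply to every alternative, not only the first and last.
inline std::string sketch_string_rule(const std::string & name,
                                      const std::vector<std::string> & alts,
                                      bool parenthesize) {
    std::string body;
    for (std::size_t i = 0; i < alts.size(); ++i) {
        if (i > 0) {
            body += " | ";
        }
        body += "\"" + alts[i] + "\"";
    }
    if (parenthesize) {
        body = "(" + body + ")";
    }
    // The grammar literal that matches a single double-quote character.
    const std::string quote = R"gbnf("\"")gbnf";
    return name + " ::= " + quote + " " + body + " " + quote + " space";
}
```

Calling it with `parenthesize = false` reproduces the broken rule from the message, and `true` reproduces the corrected one.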
10433e8b45 llama : add tensor name for "result_norm" (#9907)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
b3927
2024-10-16 13:10:21 +03:00
1f66b699c4 server : fix the disappearance of the end of the text (#9867)
* server: fix the disappearance of the end of the text when streaming with stop strings

* simplify "send text" checks
b3926
2024-10-16 11:35:53 +03:00
0e41b300ed sync : ggml b3925 2024-10-16 11:28:14 +03:00
cd60b88bf7 ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
This commit removes the buffer_id field from the leaf_alloc struct.

The motivation is that this field is only written to and never
read/used as far as I can tell. Each tensor_alloc has a buffer_id field
and this is what caused me to look into this more closely, to
understand what the buffer_id in leaf_alloc was used for.
2024-10-16 11:28:01 +03:00
becfd387f6 [CANN] Fix cann compilation error (#9891)
Fix the CANN compilation error introduced after llama.cpp merged support for dynamically loadable backends.
b3923
2024-10-16 08:51:46 +08:00
755a9b2bf0 llama : add infill sampler (#9896)
ggml-ci
b3922
2024-10-15 16:35:33 +03:00
223c25a72f server : improve infill context reuse (#9894)
ggml-ci
b3921
2024-10-15 16:28:55 +03:00
fbc98b748e sampling : add XTC sampler (#9742)
* Initial XTC commit

Adds the XTC sampler. It is not activated by default, but its default settings are the recommended ones.

* Cleanup

* Simplified chances calculation

To be more in line with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproducible

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep`

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
b3920
2024-10-15 12:54:55 +02:00
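Note on fbc98b748e: the "Algorithm rework" steps describe scanning from the most probable token, keeping the least probable token that is still above the threshold, and moving the more probable ones out of the way. A simplified sketch of that core step follows. It is not the merged sampler: the random chance roll and the `min_keep` check are omitted, and the excluded tokens are dropped outright here, which is an assumption of this sketch. `sketch_token` and `sketch_xtc` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token candidate: an id and its probability.
struct sketch_token {
    int   id;
    float p;
};

// `sorted` must be ordered by descending probability.
// Removes every token with p >= threshold except the least probable one
// of that group, and returns how many tokens were removed.
inline std::size_t sketch_xtc(std::vector<sketch_token> & sorted, float threshold) {
    // Scan from the top until the first token below the threshold.
    std::size_t n_above = 0;
    while (n_above < sorted.size() && sorted[n_above].p >= threshold) {
        ++n_above;
    }
    // With fewer than two tokens above the threshold there is no
    // "top choice" to exclude.
    if (n_above < 2) {
        return 0;
    }
    // Keep the last captured token (the least probable above threshold).
    const std::size_t n_remove = n_above - 1;
    // erase() shifts the remaining tokens to the front.
    sorted.erase(sorted.begin(),
                 sorted.begin() + static_cast<std::ptrdiff_t>(n_remove));
    return n_remove;
}
```

Because probabilities sum to at most 1, at most one token can exceed 0.5, so the function never removes anything for such thresholds. This agrees with the bullet stating that a threshold higher than 0.5 switches XTC off.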
dcdd535302 server : update preact (#9895) 2024-10-15 12:48:44 +03:00
4c42f93b22 readme : update bindings list (#9889) 2024-10-15 11:20:34 +03:00
a89f75e1b7 server : handle "logprobs" field with false value (#9871)
Co-authored-by: Gimling <huangjl@ruyi.ai>
b3917
2024-10-14 10:04:36 +03:00
13dca2a54a Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b3916
2024-10-14 02:49:08 +02:00
d4c19c0f5c server : accept extra_context for the infill endpoint (#9874)
* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci
2024-10-13 21:31:35 +03:00
c7181bd294 server : reuse cached context chunks (#9866)
ggml-ci
b3914
2024-10-13 18:52:48 +03:00
92be9f1216 flake.lock: Update (#9870)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-12 20:11:26 -07:00
edc265661c server : add option to time limit the generation phase (#9865)
ggml-ci
b3912
2024-10-12 16:14:27 +03:00
1bde94dd02 server : remove self-extend features (#9860)
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
b3911
2024-10-12 16:06:31 +03:00
95c76e8e92 server : remove legacy system_prompt feature (#9857)
* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props
2024-10-12 14:51:54 +03:00
11ac9800af llama : improve infill support and special token detection (#9798)
* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (#9800)

* gguf : deprecate old FIM token KVs
b3909
2024-10-12 08:21:51 +03:00
943d20b411 musa : update doc (#9856)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-12 08:09:53 +03:00
96776405a1 ggml : move more prints to the ggml log system (#9839)
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
b3907
2024-10-11 15:34:45 +02:00
7eee341bee common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b3906
2024-10-10 22:57:42 +02:00
0e9f760eb1 rpc : add backend registry / device interfaces (#9812)
* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
b3905
2024-10-10 20:14:55 +02:00
cf8e0a3bb9 musa: add docker image support (#9685)
* mtgpu: add docker image support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable docker workflow

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b3904
2024-10-10 20:10:37 +02:00
c7499c557c examples : do not use common library in simple example (#9803)
* examples : do not use common library in simple example

* add command line parser, simplify code
b3903
2024-10-10 19:50:49 +02:00
c81f3bbb05 cmake : do not build common library by default when standalone (#9804) b3902 2024-10-09 18:49:52 +02:00
e7022064ab perplexity : fix integer overflow (#9783)
* perplexity : fix integer overflow

ggml-ci

* perplexity : keep n_vocab as int and make appropriate casts

ggml-ci
b3901
2024-10-09 17:00:18 +03:00
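Note on e7022064ab: the second bullet keeps `n_vocab` as an `int` and adds casts where needed. The sketch below shows the general pattern of widening before multiplying. `sketch_logits_offset` is a hypothetical helper and is not code from the perplexity example. It assumes a 64-bit `size_t`.

```cpp
#include <cassert>
#include <cstddef>

// Offset of row `i_token` in a flat [n_tokens x n_vocab] buffer.
// Both arguments stay int, as in the commit's second bullet.
inline std::size_t sketch_logits_offset(int i_token, int n_vocab) {
    // Widen each operand first. Writing `i_token * n_vocab` would do the
    // multiplication in int and overflow once the product exceeds INT_MAX.
    return static_cast<std::size_t>(i_token) * static_cast<std::size_t>(n_vocab);
}
```

For example, 100000 rows of a 128256-entry vocabulary give an offset of 12825600000, which does not fit in a 32-bit `int`.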
3dc48fe75a examples : remove llama.vim
An updated version will be added in #9787
2024-10-09 10:55:42 +03:00
dca1d4b58a ggml : fix BLAS with unsupported types (#9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
b3899
2024-10-08 14:21:43 +02:00
458367a906 server : better security control for public deployments (#9776)
* server : more explicit endpoint access settings

* protect /props endpoint

* fix tests

* update server docs

* fix typo

* fix tests
b3898
2024-10-08 13:27:04 +02:00
fa42aa6d89 scripts : fix spelling typo in messages and comments (#9782)
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-10-08 09:19:53 +03:00
6374743747 ggml : add backend registry / device interfaces to BLAS backend (#9752)
* ggml : add backend registry / device interfaces to BLAS backend

* fix mmap usage when using host buffers
b3896
2024-10-07 21:55:08 +02:00
f1af42fa8c Update building for Android (#9672)
* docs : clarify building Android on Termux

* docs : update building Android on Termux

* docs : add cross-compiling for Android

* cmake : link dl explicitly for Android
b3895
2024-10-07 09:37:31 -07:00
6279dac039 flake.lock: Update (#9753)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/bcef6817a8b2aa20a5a6dbb19b43e63c5bf8619a?narHash=sha256-HO4zgY0ekfwO5bX0QH/3kJ/h4KvUDFZg8YpkNwIbg1U%3D' (2024-09-12)
  → 'github:hercules-ci/flake-parts/3d04084d54bedc3d6b8b736c70ef449225c361b1?narHash=sha256-K5ZLCyfO/Zj9mPFldf3iwS6oZStJcU4tSpiXTMYaaL0%3D' (2024-10-01)
• Updated input 'flake-parts/nixpkgs-lib':
    '356624c120.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
  → 'fb192fec7c.tar.gz?narHash=sha256-0xHYkMkeLVQAMa7gvkddbPqpxph%2BhDzdu1XdGPJR%2BOs%3D' (2024-10-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)
  → 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-07 09:35:42 -07:00