Commit Graph

4256 Commits

Author SHA1 Message Date
8a43e940ab ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
5c9a8b22b1 scripts : update sync 2024-11-17 08:30:29 +02:00
0fff7fd798 docs : vulkan build instructions to use git bash mingw64 (#10303) 2024-11-17 00:29:18 +01:00
4e54be0ec6 llama/ex: remove --logdir argument (#10339) b4103 2024-11-16 23:00:41 +01:00
db4cfd5dbc llamafile : fix include path (#0)
ggml-ci
b4102
2024-11-16 20:36:26 +02:00
8ee0d09ae6 make : auto-determine dependencies (#0) 2024-11-16 20:36:26 +02:00
bcdb7a2386 server: (web UI) Add samplers sequence customization (#10255)
* Samplers sequence: simplified and input field.

* Removed unused function

* Modify and use `settings-modal-short-input`

* rename "name" --> "label"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4100
2024-11-16 14:26:54 +01:00
f245cc28d4 scripts : fix missing key in compare-llama-bench.py (#10332) 2024-11-16 10:32:50 +02:00
772703c8ff vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
b4098
2024-11-16 07:26:57 +01:00
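The per-workgroup change above can be sketched on the CPU side (illustrative only; the real change lives in the Vulkan compute shaders and operates on quantized blocks, and the function name here is made up). Loading each element of the B operand once and applying it to two output rows is what removes the redundant loads and address math:

```cpp
#include <cstddef>

// Illustrative only: compute two output elements y[r] and y[r+1] of y = A*x
// while loading each x element (the "B" operand in the shader) just once.
// A real kernel also needs the bounds-checked tail for odd row counts that
// the commit mentions.
static void matvec_two_rows(const float *A, const float *x,
                            float *y, size_t rows, size_t cols) {
    for (size_t r = 0; r + 1 < rows; r += 2) {
        float acc0 = 0.0f, acc1 = 0.0f;
        const float *a0 = A + (r + 0) * cols;
        const float *a1 = A + (r + 1) * cols;
        for (size_t c = 0; c < cols; ++c) {
            const float b = x[c];      // single load, reused for both rows
            acc0 += a0[c] * b;
            acc1 += a1[c] * b;
        }
        y[r + 0] = acc0;
        y[r + 1] = acc1;
    }
}
```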
dd3a6ce9f8 vulkan : add cmake preset debug/release (#10306) 2024-11-16 02:59:33 +01:00
1e58ee1318 ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324) b4096 2024-11-16 01:53:37 +01:00
89e4caaaf0 llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
b4095
2024-11-16 01:42:13 +01:00
74d73dc85c Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) b4094 2024-11-15 22:27:00 +01:00
4047be74da scripts: update compare-llama-bench.py (#10319) b4093 2024-11-15 21:19:03 +01:00
883d206fbd ggml : fix some build issues b4092 2024-11-15 21:45:32 +02:00
09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
b4091
2024-11-15 15:44:06 +02:00
3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
18429220bd AVX BF16 and single scale quant optimizations (#10212) (author: Eve)
* use 128-bit loads (I've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower F16C version, kept for reference

* 256-bit version, also slow. I tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
b4088
2024-11-15 12:47:58 +01:00
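For context, a bf16 dot product on AVX-class hardware relies on bf16 being the top half of an f32, so widening is just a 16-bit left shift. The sketch below is an assumed illustration of that trick with 128-bit-wide math (SSE4.1 + FMA), not the committed kernel, which also uses the wider accumulators and madd variants listed in the bullets above:

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative bf16 dot product (requires SSE4.1 + FMA). bf16 is the high
// half of an IEEE f32, so widening is a zero-extend plus a 16-bit left shift.
// Remainder lanes (n % 4) are omitted for brevity.
static float bf16_dot_sketch(const uint16_t *a, const uint16_t *b, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128i ai = _mm_cvtepu16_epi32(_mm_loadl_epi64((const __m128i *)(a + i)));
        __m128i bi = _mm_cvtepu16_epi32(_mm_loadl_epi64((const __m128i *)(b + i)));
        __m128  af = _mm_castsi128_ps(_mm_slli_epi32(ai, 16));
        __m128  bf = _mm_castsi128_ps(_mm_slli_epi32(bi, 16));
        acc = _mm_fmadd_ps(af, bf, acc);   // fused multiply-add per lane
    }
    // horizontal sum of the 4 lanes
    __m128 t = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 0x55));
    return _mm_cvtss_f32(t);
}
```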
f0204a0ec7 ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4087
2024-11-15 12:47:25 +01:00
57f8355b29 sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
9901068ac7 server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
b4085
2024-11-15 10:48:49 +01:00
231f9360d9 cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
4802ad350b scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
5a54af4d4f sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allows the compiler to optimize the
  operation with a native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692.
b4082
2024-11-15 11:09:12 +08:00
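dp4a is a dot product of four packed 8-bit values with a 32-bit accumulate; a scalar reference of the assumed (signed-operand) semantics:

```cpp
#include <cstdint>

// Scalar reference for dp4a: dot product of the four signed 8-bit lanes
// packed into each 32-bit operand, added to a 32-bit accumulator.
static int32_t dp4a_ref(int32_t a, int32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t va = (int8_t)(a >> (8 * i));
        const int8_t vb = (int8_t)(b >> (8 * i));
        acc += (int32_t)va * (int32_t)vb;
    }
    return acc;
}
```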
1607a5e5b0 backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b4081
2024-11-15 01:28:50 +01:00
ae8de6d50a ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
b4080
2024-11-14 18:04:35 +01:00
4a8ccb37ad CUDA: no -sm row for very small matrices (#10185) b4079 2024-11-14 13:00:15 +01:00
2a82891a85 speculative : fix out-of-bounds access (#10289) b4078 2024-11-14 11:44:15 +02:00
af148c9386 vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 have the same dimensions and the additional modulus
operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
b4077
2024-11-14 06:22:55 +01:00
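One common shape of such a "fast" modulus is simply branching around the expensive operation when the index cannot have wrapped. The sketch below shows that pattern as an assumption only; the shader may instead use precomputed reciprocals or another trick:

```cpp
// Sketch of a "fast" div/mod: skip the expensive operation when the index is
// already in range (e.g. when src1 has the same dimensions as src0 no modulus
// is needed at all, and otherwise a single compare often suffices).
static inline unsigned fastmod(unsigned i, unsigned n) {
    return (i < n) ? i : i % n;
}

static inline unsigned fastdiv(unsigned i, unsigned n) {
    return (i < n) ? 0u : i / n;
}
```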
66798e42fb vulkan: Use macros to make the mat mul pipeline creation more concise (#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a
pipeline. This isn't really used yet.
b4076
2024-11-13 21:59:47 +01:00
fb4a0ec083 llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
Also updates `llama_kv_cache_find_slot`, which will correctly count the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4075
2024-11-13 20:00:35 +02:00
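From the caller's perspective, the practical effect is that a failed compute is reported through the return value rather than aborting. A hedged usage sketch (status meanings follow the llama.h conventions: 0 for success, positive for no KV slot, negative for an error; the KV-cache restore on failure is what the commit above describes):

```cpp
#include "llama.h"
#include <cstdio>

// Sketch: handle the propagated status of a decode call instead of assuming
// it always succeeds. Exact error-code meanings are documented in llama.h.
static bool decode_checked(llama_context *ctx, llama_batch batch) {
    const int32_t status = llama_decode(ctx, batch);
    if (status == 0) {
        return true;   // graph computed, KV cache updated
    }
    if (status > 0) {
        fprintf(stderr, "decode: no KV cache slot available, try a smaller batch\n");
    } else {
        fprintf(stderr, "decode: graph compute failed (%d), KV cache state was restored\n", status);
    }
    return false;
}
```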
5ea926dad7 sync : ggml 2024-11-13 18:11:54 +02:00
1ee9eea094 docs : update bindings list (#10261)
Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
b4073
2024-11-13 13:17:10 +02:00
ff7fb670d0 server : add missing docs (#10269) 2024-11-13 13:16:30 +02:00
0e712a5acb server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template

* server : fix chat res
b4071
2024-11-13 13:15:23 +02:00
a0ec17b32e metadata: Detailed Dataset Authorship Metadata (#8875)
The converter script can now read these two fields as detailed base model and dataset sources.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
2024-11-13 21:10:38 +11:00
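The writer here is the Python converter, but the resulting key layout can be illustrated with the gguf C API (a sketch with made-up values; only a few of the keys listed above are shown):

```cpp
#include "ggml.h"   // declares the gguf_* key/value API

// Illustration of the per-dataset key layout added by this commit, written
// via the C API. The real writer is the Python converter; values are dummies.
static void write_dataset_metadata_example(struct gguf_context *ctx) {
    gguf_set_val_u32(ctx, "general.dataset.count", 1);
    gguf_set_val_str(ctx, "general.dataset.0.name",         "example-dataset");
    gguf_set_val_str(ctx, "general.dataset.0.author",       "Example Author");
    gguf_set_val_str(ctx, "general.dataset.0.organization", "Example Org");
    gguf_set_val_str(ctx, "general.dataset.0.repo_url",     "https://example.org/dataset");
}
```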
2e82ffa4af sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fixes the broken build for the SYCL CUDA backend caused by a non-explicit gemm call in outprod (merged in with RWKV6 in
  "Optimize RWKV6 Operator Naming and Implement Multi-core CPU/SYCL Acceleration", #10133)

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.
b4069
2024-11-13 09:40:57 +00:00
80dd7ff22f vulkan: Optimize contiguous copies (#10254)
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
b4068
2024-11-13 07:58:57 +01:00
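A CPU analogue of the contiguous fast path described above (illustrative only): when the tensors are contiguous, the multi-dimensional index math collapses to a flat loop, and handling four elements per step amortizes the remaining per-invocation overhead:

```cpp
#include <cstddef>

// Illustrative contiguous-copy fast path: no per-element multi-dimensional
// indexing, four elements handled per step, plus a small tail.
static void copy_contiguous(const float *src, float *dst, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; ++i) {
        dst[i] = src[i];   // tail elements
    }
}
```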
54ef9cfc72 vulkan: Throttle the number of shader compiles during the build step. (#10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
b4067
2024-11-11 18:13:51 +01:00
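The throttling pattern referred to above can be sketched with a counting semaphore that bounds the number of concurrent glslc child processes (an assumed illustration; the committed change reuses the project's existing pipeline-creation throttling rather than this exact code):

```cpp
#include <semaphore>   // C++20
#include <thread>
#include <vector>
#include <string>
#include <cstdlib>

// Sketch: cap concurrent glslc invocations so the build does not exhaust
// pipes/file descriptors on Linux.
static std::counting_semaphore<> g_compile_slots(16);   // at most 16 at once

static void compile_shader(const std::string &cmd) {
    g_compile_slots.acquire();
    std::system(cmd.c_str());   // spawn glslc (simplified; no error handling)
    g_compile_slots.release();
}

static void compile_all(const std::vector<std::string> &cmds) {
    std::vector<std::thread> workers;
    for (const auto &cmd : cmds) {
        workers.emplace_back(compile_shader, cmd);
    }
    for (auto &t : workers) {
        t.join();
    }
}
```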
b0cefea58a metal : more precise Q*K in FA vec kernel (#10247) b4066 2024-11-11 08:39:13 +02:00
b141e5f6ef server : enable KV cache defrag by default (#10233)
ggml-ci
b4065
2024-11-11 08:38:43 +02:00
4b3a9212b6 flake.lock: Update (#10243)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd?narHash=sha256-l253w0XMT8nWHGXuXqyiIC/bMvh1VRszGXgdpQlfhvU%3D' (2024-10-29)
  → 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-10 11:45:25 -08:00
505f33274d server : (web UI) Add back sampler settings (#10239)
* Add back samplers to server

* Added tooltips with basic information

* Fixed stretching of input fields.

* use component for settings input, move help msg to tooltips

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
160687b3ed vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10226) b4062 2024-11-10 12:37:56 +01:00
6423c65aa8 metal : reorder write loop in mul mat kernel + style (#10231)
* metal : reorder write loop

* metal : int -> short, style

ggml-ci
b4061
2024-11-09 11:53:13 +02:00
39a334a9aa metal : fix build and some more comments (#10229) b4060 2024-11-09 11:53:02 +02:00
bb38cdd8ba metal : fix F32 accumulation in FA vec kernel (#10232) b4059 2024-11-09 11:52:45 +02:00
f018acba22 llama : fix Qwen model type strings b4058 2024-11-09 11:26:34 +02:00
46323fa9ef metal : hide debug messages from normal log b4057 2024-11-09 11:21:49 +02:00