Commit Graph

4299 Commits

Author SHA1 Message Date
f245cc28d4 scripts : fix missing key in compare-llama-bench.py (#10332) 2024-11-16 10:32:50 +02:00
772703c8ff vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
b4098
2024-11-16 07:26:57 +01:00
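The optimization above is easiest to see in scalar form. Below is a minimal C++ sketch of the two ideas, one B load serving two output rows and a bounds-checked final row, assuming a row-major matrix; the names are illustrative, not the actual Vulkan shader code:

```cpp
#include <cstddef>

// Sketch of "two result elements per workgroup": each inner step loads
// b[k] once and applies it to two rows of A, halving the B loads and
// sharing the addressing math. The odd last row is handled separately,
// mirroring the bounds check the commit adds for the final iteration.
void matvec_two_rows(const float* a, const float* b,
                     float* out, size_t rows, size_t cols) {
    for (size_t r = 0; r + 1 < rows; r += 2) {
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t k = 0; k < cols; ++k) {
            const float bk = b[k];  // one B load, reused for both rows
            acc0 += a[(r + 0) * cols + k] * bk;
            acc1 += a[(r + 1) * cols + k] * bk;
        }
        out[r + 0] = acc0;
        out[r + 1] = acc1;
    }
    if (rows % 2 != 0) {            // bounds-checked leftover row
        const size_t r = rows - 1;
        float acc = 0.0f;
        for (size_t k = 0; k < cols; ++k)
            acc += a[r * cols + k] * b[k];
        out[r] = acc;
    }
}
```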
dd3a6ce9f8 vulkan : add cmake preset debug/release (#10306) 2024-11-16 02:59:33 +01:00
1e58ee1318 ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324) b4096 2024-11-16 01:53:37 +01:00
89e4caaaf0 llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
b4095
2024-11-16 01:42:13 +01:00
74d73dc85c Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) b4094 2024-11-15 22:27:00 +01:00
4047be74da scripts: update compare-llama-bench.py (#10319) b4093 2024-11-15 21:19:03 +01:00
883d206fbd ggml : fix some build issues b4092 2024-11-15 21:45:32 +02:00
09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
b4091
2024-11-15 15:44:06 +02:00
3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
Eve 18429220bd AVX BF16 and single scale quant optimizations (#10212)
* use 128 bit loads (I've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower F16C version, kept for reference

* 256b version, also slow. I tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
b4088
2024-11-15 12:47:58 +01:00
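The bullets above log a series of SIMD experiments; a scalar C++ model of the two ideas that survived, BF16 widening via a 16-bit shift and the "double accumulator" trick, is sketched below (hypothetical names, not the actual ggml kernels):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// BF16 is the top 16 bits of an IEEE-754 f32, so widening is just a shift.
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Two independent accumulators let consecutive multiply-adds overlap
// instead of serializing on one register (the "double accumulator" trick).
float dot_bf16(const uint16_t* x, const uint16_t* y, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += bf16_to_f32(x[i])     * bf16_to_f32(y[i]);
        acc1 += bf16_to_f32(x[i + 1]) * bf16_to_f32(y[i + 1]);
    }
    if (i < n)  // odd-length tail
        acc0 += bf16_to_f32(x[i]) * bf16_to_f32(y[i]);
    return acc0 + acc1;
}
```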
f0204a0ec7 ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4087
2024-11-15 12:47:25 +01:00
57f8355b29 sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
9901068ac7 server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
b4085
2024-11-15 10:48:49 +01:00
231f9360d9 cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
4802ad350b scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
5a54af4d4f sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allows the compiler to optimize the
  operation with native functions

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692.
b4082
2024-11-15 11:09:12 +08:00
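For context, dp4a is a four-way dot product over packed int8 lanes with an int32 accumulator; below is a scalar C++ reference for the semantics the commit above routes through syclcompat (the function name here is illustrative):

```cpp
#include <cstdint>

// dp4a_ref(a, b, c) == c + sum over i of a_byte[i] * b_byte[i],
// where a and b each pack four signed 8-bit lanes into an int32.
int32_t dp4a_ref(int32_t a_packed, int32_t b_packed, int32_t c) {
    for (int i = 0; i < 4; ++i) {
        const int8_t a = (int8_t)(((uint32_t)a_packed >> (8 * i)) & 0xff);
        const int8_t b = (int8_t)(((uint32_t)b_packed >> (8 * i)) & 0xff);
        c += (int32_t)a * (int32_t)b;
    }
    return c;
}
```

Quantized dot products (e.g. int8 blocks against int8 blocks) reduce to chains of such 4-way products, which is why routing through the native instruction matters.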
1607a5e5b0 backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b4081
2024-11-15 01:28:50 +01:00
ae8de6d50a ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
b4080
2024-11-14 18:04:35 +01:00
4a8ccb37ad CUDA: no -sm row for very small matrices (#10185) b4079 2024-11-14 13:00:15 +01:00
2a82891a85 speculative : fix out-of-bounds access (#10289) b4078 2024-11-14 11:44:15 +02:00
af148c9386 vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 are the same dimensions and the additional
modulo operations for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
b4077
2024-11-14 06:22:55 +01:00
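A C++ sketch of the "fast" div/mod idea described above (the real code is a GLSL shader; names are illustrative): broadcast index math frequently divides by 1, or by a bound the index never reaches, and both cases can skip the slow hardware divide:

```cpp
#include <cstdint>

static inline uint32_t fast_div(uint32_t x, uint32_t d) {
    if (d == 1) return x;   // dividing by 1: common when dims match
    if (x < d)  return 0;   // quotient is known without dividing
    return x / d;
}

static inline uint32_t fast_mod(uint32_t x, uint32_t d) {
    if (d == 1) return 0;   // everything is congruent to 0 mod 1
    if (x < d)  return x;   // remainder is x itself
    return x % d;
}
```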
66798e42fb vulkan: Use macros to make the mat mul pipeline creation more concise (#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a
pipeline. This isn't really used yet.
b4076
2024-11-13 21:59:47 +01:00
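A hypothetical sketch of the vk_matmul_pipeline2 shape described above: a small struct bundling the f16 and f32 accumulator variants so call sites can pick a precision without duplicating pipeline plumbing:

```cpp
struct vk_pipeline;  // opaque pipeline handle in the real code

struct vk_matmul_pipeline2 {
    vk_pipeline* f16acc = nullptr;  // f16-accumulator variant
    vk_pipeline* f32acc = nullptr;  // f32-accumulator variant
};
```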
fb4a0ec083 llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, which now correctly counts the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4075
2024-11-13 20:00:35 +02:00
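The rollback pattern described above, as a hedged C++ sketch (the struct and function names are illustrative, not the actual llama.cpp internals): snapshot the KV-cache bookkeeping before computing, and restore it if the compute fails so a failed batch leaves no stale cells behind:

```cpp
#include <cstdint>

struct kv_state { uint32_t head; uint32_t used; };

struct kv_cache {
    uint32_t head = 0, used = 0;
    kv_state snapshot() const { return { head, used }; }
    void restore(const kv_state& s) { head = s.head; used = s.used; }
};

enum status { STATUS_OK = 0, STATUS_FAILED = 1 };

// Propagate the compute result to the caller instead of swallowing it,
// reverting the entire batch on failure.
status decode_batch(kv_cache& kv, status (*graph_compute)()) {
    const kv_state saved = kv.snapshot();  // state before this batch
    const status res = graph_compute();
    if (res != STATUS_OK) {
        kv.restore(saved);                 // roll back the whole batch
    }
    return res;
}
```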
5ea926dad7 sync : ggml 2024-11-13 18:11:54 +02:00
1ee9eea094 docs : update bindings list (#10261)
Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
b4073
2024-11-13 13:17:10 +02:00
ff7fb670d0 server : add missing docs (#10269) 2024-11-13 13:16:30 +02:00
0e712a5acb server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template

* server : fix chat res
b4071
2024-11-13 13:15:23 +02:00
a0ec17b32e metadata: Detailed Dataset Authorship Metadata (#8875)
The converter script can now read these two fields as detailed base model and dataset sources.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
2024-11-13 21:10:38 +11:00
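For illustration, a conversion with a single dataset source might produce keys like the following (all values hypothetical):

```
general.dataset.count      = 1
general.dataset.0.name     = "Example Instruct Mix"
general.dataset.0.author   = "Example Lab"
general.dataset.0.version  = "1.0"
general.dataset.0.repo_url = "https://example.com/datasets/instruct-mix"
```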
2e82ffa4af sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fixes a broken build for the SYCL CUDA backend caused by a non-explicit gemm call in outprod (merged in with RWKV6 in
"Optimize RWKV6 Operator Naming and Implement Multi-core CPU/SYCL Acceleration", #10133)

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.
b4069
2024-11-13 09:40:57 +00:00
80dd7ff22f vulkan: Optimize contiguous copies (#10254)
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
b4068
2024-11-13 07:58:57 +01:00
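A scalar C++ sketch of the contiguous fast path described above (an illustrative stand-in for the Vulkan shader): when src and dst are contiguous, the per-element multi-dimensional index math disappears and each invocation can simply move four consecutive elements:

```cpp
#include <cstddef>

void copy_contiguous(const float* src, float* dst, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // four elements per step, no index math
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; ++i)            // bounds-checked tail
        dst[i] = src[i];
}
```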
54ef9cfc72 vulkan: Throttle the number of shader compiles during the build step. (#10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
b4067
2024-11-11 18:13:51 +01:00
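A hedged C++ sketch of the throttling idea (the actual change lives in the Vulkan shader-generation build step; the class below is illustrative): a counting semaphore caps concurrent glslc launches so the build stops exhausting OS pipe and process limits:

```cpp
#include <condition_variable>
#include <mutex>

class compile_throttle {
    std::mutex m;
    std::condition_variable cv;
    int slots;  // remaining concurrent-compile slots
public:
    explicit compile_throttle(int max_concurrent) : slots(max_concurrent) {}
    void acquire() {  // call before spawning a glslc process
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return slots > 0; });
        --slots;
    }
    void release() {  // call when the process exits
        { std::lock_guard<std::mutex> lk(m); ++slots; }
        cv.notify_one();
    }
};
```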
b0cefea58a metal : more precise Q*K in FA vec kernel (#10247) b4066 2024-11-11 08:39:13 +02:00
b141e5f6ef server : enable KV cache defrag by default (#10233)
ggml-ci
b4065
2024-11-11 08:38:43 +02:00
4b3a9212b6 flake.lock: Update (#10243)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd?narHash=sha256-l253w0XMT8nWHGXuXqyiIC/bMvh1VRszGXgdpQlfhvU=' (2024-10-29)
  → 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum+L+0zVAWvXAGbWAuXHax3KzuejaDyo=' (2024-11-05)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-10 11:45:25 -08:00
505f33274d server : (web UI) Add back sampler settings (#10239)
* Add back samplers to server

* Added tooltips with basic information

* Fixed stretching of input fields.

* use component for settings input, move help msg to tooltips

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
160687b3ed vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10226) b4062 2024-11-10 12:37:56 +01:00
6423c65aa8 metal : reorder write loop in mul mat kernel + style (#10231)
* metal : reorder write loop

* metal : int -> short, style

ggml-ci
b4061
2024-11-09 11:53:13 +02:00
39a334a9aa metal : fix build and some more comments (#10229) b4060 2024-11-09 11:53:02 +02:00
bb38cdd8ba metal : fix F32 accumulation in FA vec kernel (#10232) b4059 2024-11-09 11:52:45 +02:00
f018acba22 llama : fix Qwen model type strings b4058 2024-11-09 11:26:34 +02:00
46323fa9ef metal : hide debug messages from normal log b4057 2024-11-09 11:21:49 +02:00
SXX 5b359bb1e3 ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) b4056 2024-11-09 08:35:46 +01:00
e89213492d ggml : optimize llamafile cpu matrix multiplication for ppc64le (#10156)
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le using MMA
builtins for the FP32 datatype.

This change results in a consistent 90%
improvement in input processing time, and a 20%
to 80% improvement in output processing time,
across various batch sizes.

The patch is tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
b4055
2024-11-09 09:17:50 +02:00
8fc393f246 scripts : fix pattern and get n_tokens in one go (#10221) 2024-11-09 09:06:54 +02:00
ec450d3bbf metal : opt-in compile flag for BF16 (#10218)
* metal : opt-in compile flag for BF16

ggml-ci

* ci : use BF16

ggml-ci

* swift : switch back to v12

* metal : has_float -> use_float

ggml-ci

* metal : fix BF16 check in MSL

ggml-ci
b4053
2024-11-08 21:59:46 +02:00
695ad752b2 metal : improve clarity (minor) (#10171) b4052 2024-11-08 18:37:41 +02:00
841f27abdb metal : optimize FA kernels (#10171)
* ggml : add ggml_flash_attn_ext_get_prec

* metal : use F16 precision in FA kernels

ggml-ci

* metal : minor clean-up

* metal : compile-guard bf16 FA kernels

ggml-ci

* build : remove obsolete compile flag [no ci]

* metal : prevent int overflows [no ci]

* cuda : disable BF16 FA

ggml-ci

* metal : fix BF16 requirement for FA kernels

ggml-ci

* make : clean-up [no ci]
2024-11-08 13:47:22 +02:00
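A hedged usage sketch of the getter added in the first bullet, assuming the ggml API shape of the time (ggml_flash_attn_ext_get_prec returning a ggml_prec): backends query the requested precision of a flash-attention node and choose an F16 or F32 kernel accordingly:

```cpp
#include "ggml.h"

// Illustrative backend-side dispatch, not the actual Metal code.
static bool use_f32_fa_kernel(const struct ggml_tensor* fa_node) {
    // GGML_PREC_F32 requests the more precise (slower) accumulator path;
    // GGML_PREC_DEFAULT allows the F16 path this commit enables on Metal.
    return ggml_flash_attn_ext_get_prec(fa_node) == GGML_PREC_F32;
}
```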
d05b3127bd swift : exclude ggml-metal-embed.metal (#10211)
* llama.swift : exclude ggml-metal-embed.metal

* swift : exclude build/
b4050
2024-11-08 11:34:06 +02:00