Commit Graph

5076 Commits

Author SHA1 Message Date
d3481e6316 cuda : only use native when supported by cmake (#10389) b4126 2024-11-18 18:43:40 +01:00
531cb1c233 Skip searching root path for cross-compile builds (#10383) 2024-11-18 16:23:58 +01:00
f139d2ea61 vulkan: remove use of null initializer (#10372)
Seems like this isn't working for vulkan-over-metal when the array is sized
by a spec constant. Maybe a spirv-cross limitation?
2024-11-18 08:28:42 -06:00
2eb76b2a5e flake.lock: Update (#10346)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum+L+0zVAWvXAGbWAuXHax3KzuejaDyo=' (2024-11-05)
  → 'github:NixOS/nixpkgs/5e4fbfb6b3de1aa2872b76d49fafc942626e2add?narHash=sha256-OZiZ3m8SCMfh3B6bfGC/Bm4x3qc1m2SVEAlkV6iY7Yg=' (2024-11-15)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-18 06:08:20 -08:00
9b75f03cd2 Vulkan: Fix device info output format specifiers (#10366)
* Vulkan: Fix device info output format specifiers

* Vulkan: Use the %zu printf specifier for size_t instead of %ld
b4122
2024-11-18 11:02:43 +01:00
75207b3a88 docker: use GGML_NATIVE=OFF (#10368) 2024-11-18 00:21:53 +01:00
76e9e58b78 CUDA: fix MMV kernel being used for FP16 src1 (#10357) b4120 2024-11-17 23:20:42 +01:00
ce2e59ba10 CMake: fix typo in comment [no ci] (#10360) 2024-11-17 12:59:38 +01:00
be5caccef9 llama : only use default buffer types for the KV cache (#10358) b4118 2024-11-17 12:25:45 +01:00
20a780c7b6 gitignore : ignore local run scripts [no ci] 2024-11-17 13:12:22 +02:00
cf32a9b93a metal : refactor kernel args into structs (#10238)
* metal : add kernel arg structs (wip)

* metal : fattn args

ggml-ci

* metal : cont + avoid potential int overflow [no ci]

* metal : mul mat struct (wip)

* cont : mul mat vec

* cont : pass by reference

* cont : args is first argument

* cont : use char ptr

* cont : shmem style

* cont : thread counters style

* cont : mul mm id

ggml-ci

* cont : int safety + register optimizations

ggml-ci

* metal : GGML_OP_CONCAT

ggml-ci

* metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV

* metal : GGML_OP_REPEAT

* metal : GGML_OP_CPY

* metal : GGML_OP_RMS_NORM

* metal : GGML_OP_NORM

* metal : add TODOs for rest of ops

* ggml : add ggml-metal-impl.h

ggml-ci
2024-11-17 11:23:01 +02:00
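
To make the refactor above concrete: a minimal sketch of the "kernel args as structs" pattern, with hypothetical field names (not the actual ggml-metal-impl.h definitions). Instead of binding every scalar as its own kernel argument, the host packs them into one POD struct that each kernel receives by reference, as its first argument.

```cpp
#include <cstdint>

// Hypothetical arg struct in the spirit of the commit above; the real
// structs live in ggml-metal-impl.h and differ in names and fields.
struct kernel_args_rms_norm {
    int64_t  ne00;  // row length in elements
    uint64_t nb01;  // row stride in bytes
    float    eps;   // epsilon for the RMS denominator
};

// Host side (metal-cpp), sketched: copy the struct into the encoder so
// the kernel sees it as `constant kernel_args_rms_norm & args` at index 0.
//   kernel_args_rms_norm args = { ne00, nb01, 1e-6f };
//   encoder->setBytes(&args, sizeof(args), 0);
```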
a43178299c ggml : fix undefined reference to 'getcpu' (#10354)
https://github.com/ggerganov/llama.cpp/issues/10352
b4115
2024-11-17 10:39:22 +02:00
c3ea58aca4 CUDA: remove DMMV, consolidate F16 mult mat vec (#10318) b4114 2024-11-17 09:09:55 +01:00
467576b6cc CMake: default to -arch=native for CUDA build (#10320) b4113 2024-11-17 09:06:34 +01:00
eda7e1d4f5 ggml : fix possible buffer use after free in sched reserve (#9930) b4112 2024-11-17 08:31:17 +02:00
24203e9dd7 ggml : inttypes.h -> cinttypes (#0)
ggml-ci
b4111
2024-11-17 08:30:29 +02:00
5d9e59979c ggml : adapt AMX to tensor->grad removal (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
a4200cafad make : add ggml-opt (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
84274a10c3 tests : remove test-grad0 2024-11-17 08:30:29 +02:00
68fcb4759c ggml : fix compile warnings (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
8a43e940ab ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
5c9a8b22b1 scripts : update sync 2024-11-17 08:30:29 +02:00
0fff7fd798 docs : vulkan build instructions to use git bash mingw64 (#10303) 2024-11-17 00:29:18 +01:00
4e54be0ec6 llama/ex: remove --logdir argument (#10339) b4103 2024-11-16 23:00:41 +01:00
db4cfd5dbc llamafile : fix include path (#0)
ggml-ci
b4102
2024-11-16 20:36:26 +02:00
8ee0d09ae6 make : auto-determine dependencies (#0) 2024-11-16 20:36:26 +02:00
bcdb7a2386 server: (web UI) Add samplers sequence customization (#10255)
* Samplers sequence: simplified; now an input field.

* Removed unused function

* Modify and use `settings-modal-short-input`

* rename "name" --> "label"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4100
2024-11-16 14:26:54 +01:00
f245cc28d4 scripts : fix missing key in compare-llama-bench.py (#10332) 2024-11-16 10:32:50 +02:00
772703c8ff vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
b4098
2024-11-16 07:26:57 +01:00
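
The load-reuse idea in 772703c8ff, reduced to scalar C++ (the real change is in the Vulkan shaders; shapes and names here are assumptions): producing two output rows per loop shares each load of b between both dot products, and the leftover row is handled by a bounds-checked tail.

```cpp
#include <cstddef>

// Scalar sketch of the shader optimization: two result elements per
// workgroup, so each b[i] load is reused for two rows. Illustrative,
// not the actual shader code.
void matvec_two_rows(const float *a, const float *b, float *out,
                     size_t rows, size_t cols) {
    for (size_t r = 0; r + 1 < rows; r += 2) {
        const float *row0 = a + (r + 0) * cols;
        const float *row1 = a + (r + 1) * cols;
        float sum0 = 0.0f, sum1 = 0.0f;
        for (size_t i = 0; i < cols; ++i) {
            const float bi = b[i];   // loaded once, used for both rows
            sum0 += row0[i] * bi;
            sum1 += row1[i] * bi;
        }
        out[r + 0] = sum0;
        out[r + 1] = sum1;
    }
    if (rows & 1) {                  // bounds-checked odd trailing row
        const float *row = a + (rows - 1) * cols;
        float sum = 0.0f;
        for (size_t i = 0; i < cols; ++i) sum += row[i] * b[i];
        out[rows - 1] = sum;
    }
}
```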
dd3a6ce9f8 vulkan : add cmake preset debug/release (#10306) 2024-11-16 02:59:33 +01:00
1e58ee1318 ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324) b4096 2024-11-16 01:53:37 +01:00
89e4caaaf0 llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
b4095
2024-11-16 01:42:13 +01:00
74d73dc85c Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) b4094 2024-11-15 22:27:00 +01:00
4047be74da scripts: update compare-llama-bench.py (#10319) b4093 2024-11-15 21:19:03 +01:00
883d206fbd ggml : fix some build issues b4092 2024-11-15 21:45:32 +02:00
09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
b4091
2024-11-15 15:44:06 +02:00
3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
Eve 18429220bd AVX BF16 and single scale quant optimizations (#10212)
* use 128-bit loads (I've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kept for reference

* 256-bit version, also slow. I tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
b4088
2024-11-15 12:47:58 +01:00
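
A plain-C++ illustration of the bf16 vec-dot idea from the commit above (128-bit loads, two independent accumulators). This is a sketch of the technique, not the committed ggml kernel: a bf16 value is the high 16 bits of an fp32, so widening is zero-extend to 32 bits, shift left 16, reinterpret. Requires AVX2 and FMA.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Widen 8 bf16 values (one 128-bit load) to 8 fp32 lanes.
static inline __m256 bf16x8_to_f32(__m128i v) {
    return _mm256_castsi256_ps(
        _mm256_slli_epi32(_mm256_cvtepu16_epi32(v), 16));
}

static inline float bf16_to_f32_scalar(uint16_t v) {
    uint32_t u = (uint32_t)v << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

float vec_dot_bf16(const uint16_t *x, const uint16_t *y, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();  // second accumulator hides FMA latency
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {      // two 128-bit loads per operand
        acc0 = _mm256_fmadd_ps(
            bf16x8_to_f32(_mm_loadu_si128((const __m128i *)(x + i))),
            bf16x8_to_f32(_mm_loadu_si128((const __m128i *)(y + i))), acc0);
        acc1 = _mm256_fmadd_ps(
            bf16x8_to_f32(_mm_loadu_si128((const __m128i *)(x + i + 8))),
            bf16x8_to_f32(_mm_loadu_si128((const __m128i *)(y + i + 8))), acc1);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, _mm256_add_ps(acc0, acc1));
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3]
              + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    for (; i < n; ++i)                  // scalar tail
        sum += bf16_to_f32_scalar(x[i]) * bf16_to_f32_scalar(y[i]);
    return sum;
}
```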
f0204a0ec7 ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4087
2024-11-15 12:47:25 +01:00
57f8355b29 sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
9901068ac7 server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
b4085
2024-11-15 10:48:49 +01:00
231f9360d9 cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
4802ad350b scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
5a54af4d4f sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allows the compiler to optimize the
  operation with a native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692.
b4082
2024-11-15 11:09:12 +08:00
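
For reference, dp4a computes a dot product of four packed 8-bit integers plus a 32-bit accumulator; here is a scalar C++ sketch of that semantic (the commit itself routes dpct::dp4a through syclcompat::dp4a so the compiler can lower the pattern to the native instruction).

```cpp
#include <cstdint>

// Scalar reference for the dp4a semantic: treat each 32-bit operand as
// four packed signed 8-bit lanes, multiply lane-wise, and add the four
// products to a 32-bit accumulator.
int32_t dp4a_ref(int32_t a, int32_t b, int32_t c) {
    int32_t sum = c;
    for (int lane = 0; lane < 4; ++lane) {
        int8_t av = (int8_t)((uint32_t)a >> (8 * lane));
        int8_t bv = (int8_t)((uint32_t)b >> (8 * lane));
        sum += (int32_t)av * (int32_t)bv;
    }
    return sum;
}
```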
1607a5e5b0 backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b4081
2024-11-15 01:28:50 +01:00
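
"Online" in the commit above means the plain Q4_0 weights are rearranged at load time into the interleaved multi-row layout the aarch64 GEMV/GEMM kernels expect, instead of requiring a pre-converted model file. A schematic sketch under assumed types and an assumed interleaving scheme (not the actual ggml block_q4_0x4 definition or repack routine):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct block_q4_0 {           // plain Q4_0: 32 weights per block
    uint16_t d;               // fp16 scale
    uint8_t  qs[16];          // 32 x 4-bit quants, nibble-packed
};

struct block_q4_0x4 {         // interleaved: same block index from 4 rows
    uint16_t d[4];            // the 4 rows' scales
    uint8_t  qs[64];          // the 4 rows' quants, interleaved
};

// At weight-load time, gather the same block from 4 consecutive rows and
// interleave the data so the microkernel reads all 4 rows sequentially.
void repack_4rows(const block_q4_0 *src, size_t blocks_per_row,
                  block_q4_0x4 *dst) {
    for (size_t b = 0; b < blocks_per_row; ++b) {
        block_q4_0x4 &out = dst[b];
        for (int r = 0; r < 4; ++r) {
            const block_q4_0 &in = src[r * blocks_per_row + b];
            out.d[r] = in.d;
            // interleave 4-byte chunks per row (scheme assumed here)
            for (int c = 0; c < 4; ++c)
                std::memcpy(out.qs + 16 * c + 4 * r, in.qs + 4 * c, 4);
        }
    }
}
```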
ae8de6d50a ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
b4080
2024-11-14 18:04:35 +01:00
4a8ccb37ad CUDA: no -sm row for very small matrices (#10185) b4079 2024-11-14 13:00:15 +01:00
2a82891a85 speculative : fix out-of-bounds access (#10289) b4078 2024-11-14 11:44:15 +02:00
af148c9386 vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 have the same dimensions and the additional
modulus operations for src1 aren't needed. Div/mod are slow, so add "fast"
div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
b4077
2024-11-14 06:22:55 +01:00
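
The "fast div/mod" trick from af148c9386, sketched in C++ (the real change lives in the Vulkan shaders; the specifics below are assumptions): broadcast-style binary ops index src1 as idx % ne, but when the broadcast dimension is 1 the result is always 0, and a power-of-two size reduces the modulus to a mask, so the expensive % only runs on the slow path.

```cpp
#include <cstdint>

// Sketch of a "fast mod" helper: precompute which fast path applies so
// the hot loop avoids real div/mod whenever possible. Details assumed.
struct fast_mod {
    uint32_t n;         // divisor
    bool     is_one;    // broadcast dim of size 1: result is always 0
    bool     is_pow2;   // power of two: mod becomes a mask
    uint32_t mask;

    explicit fast_mod(uint32_t n_)
        : n(n_), is_one(n_ == 1),
          is_pow2((n_ & (n_ - 1)) == 0), mask(n_ - 1) {}

    uint32_t operator()(uint32_t idx) const {
        if (is_one)  return 0;          // fast path: no mod needed
        if (is_pow2) return idx & mask; // fast path: mask instead of %
        return idx % n;                 // slow path: real modulus
    }
};
```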