Commit Graph

4117 Commits

Author SHA1 Message Date
54ef9cfc72 vulkan: Throttle the number of shader compiles during the build step. (#10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
b4067
2024-11-11 18:13:51 +01:00
b0cefea58a metal : more precise Q*K in FA vec kernel (#10247) b4066 2024-11-11 08:39:13 +02:00
b141e5f6ef server : enable KV cache defrag by default (#10233)
ggml-ci
b4065
2024-11-11 08:38:43 +02:00
4b3a9212b6 flake.lock: Update (#10243)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd?narHash=sha256-l253w0XMT8nWHGXuXqyiIC/bMvh1VRszGXgdpQlfhvU%3D' (2024-10-29)
  → 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-10 11:45:25 -08:00
505f33274d server : (web UI) Add back sampler settings (#10239)
* Add back samplers to server

* Added tooltips with basic information

* Fixed stretching of input fields.

* use component for settings input, move help msg to tooltips

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
160687b3ed vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10226) b4062 2024-11-10 12:37:56 +01:00
6423c65aa8 metal : reorder write loop in mul mat kernel + style (#10231)
* metal : reorder write loop

* metal : int -> short, style

ggml-ci
b4061
2024-11-09 11:53:13 +02:00
39a334a9aa metal : fix build and some more comments (#10229) b4060 2024-11-09 11:53:02 +02:00
bb38cdd8ba metal : fix F32 accumulation in FA vec kernel (#10232) b4059 2024-11-09 11:52:45 +02:00
f018acba22 llama : fix Qwen model type strings b4058 2024-11-09 11:26:34 +02:00
46323fa9ef metal : hide debug messages from normal log b4057 2024-11-09 11:21:49 +02:00
SXX
5b359bb1e3 ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) b4056 2024-11-09 08:35:46 +01:00
e89213492d ggml : optimize llamafile cpu matrix multiplication for ppc64le (#10156)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for FP32 datatype.

This change results in a consistent 90%
improvement in input processing time, and 20%
to 80% improvement in output processing time,
across various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
b4055
2024-11-09 09:17:50 +02:00
8fc393f246 scripts : fix pattern and get n_tokens in one go (#10221) 2024-11-09 09:06:54 +02:00
ec450d3bbf metal : opt-in compile flag for BF16 (#10218)
* metal : opt-in compile flag for BF16

ggml-ci

* ci : use BF16

ggml-ci

* swift : switch back to v12

* metal : has_float -> use_float

ggml-ci

* metal : fix BF16 check in MSL

ggml-ci
b4053
2024-11-08 21:59:46 +02:00
695ad752b2 metal : improve clarity (minor) (#10171) b4052 2024-11-08 18:37:41 +02:00
841f27abdb metal : optimize FA kernels (#10171)
* ggml : add ggml_flash_attn_ext_get_prec

* metal : use F16 precision in FA kernels

ggml-ci

* metal : minor clean-up

* metal : compile-guard bf16 FA kernels

ggml-ci

* build : remove obsolete compile flag [no ci]

* metal : prevent int overflows [no ci]

* cuda : disable BF16 FA

ggml-ci

* metal : fix BF16 requirement for FA kernels

ggml-ci

* make : clean-up [no ci]
2024-11-08 13:47:22 +02:00
d05b3127bd swift : exclude ggml-metal-embed.metal (#10211)
* llama.swift : exclude ggml-metal-embed.metal

* swift : exclude build/
b4050
2024-11-08 11:34:06 +02:00
76c6e7f105 server : minor UI fix (#10207) 2024-11-07 18:44:38 -04:00
a71d81cf8c server : revamp chat UI with vuejs and daisyui (#10175)
* server : simple chat UI with vuejs and daisyui

* move old files to legacy folder

* embed deps into binary

* basic markdown support

* add conversation history, save to localStorage

* fix bg-base classes

* save theme preferences

* fix tests

* regenerate, edit, copy buttons

* small fixes

* docs: how to use legacy ui

* better error handling

* make CORS preflight more explicit

* add GET method for CORS

* fix tests

* clean up a bit

* better auto scroll

* small fixes

* use collapse-arrow

* fix closeAndSaveConfigDialog

* small fix

* remove console.log

* fix style for <pre> element

* lighter bubble color (less distract when reading)
b4048
2024-11-07 17:31:10 -04:00
eec4d71737 scripts : add amx to sync-ggml.sh [no ci] 2024-11-07 23:11:36 +02:00
3b08828674 sync : ggml 2024-11-07 23:08:24 +02:00
a2c6fd747c scripts : sync update 2024-11-07 23:07:55 +02:00
97404c4a03 ggml : add ggml-cpu.h to the public headers (#10204) b4044 2024-11-07 18:16:08 +01:00
60e17ce23c Remove identical wte/etw logic for jais (#10203) 2024-11-07 08:46:12 -08:00
5107e8cea3 DRY: Fixes clone functionality (#10192) b4042 2024-11-07 16:20:25 +01:00
2319126a70 fix q4_0_8_8 format for corrupted tokens issue (#10198)
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-62-167.us-west-2.compute.internal>
b4041
2024-11-07 09:02:08 +01:00
3bcd40b3c5 Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133)
* rwkv6: rename to wkv6

* rwkv6: support avx2 avx512 armv8 armv9

* rwkv6: update cuda file name

* rwkv6: rename params

* wkv on sycl

* sycl: add some ops

* sycl: Enhance OP support judgment

* wkv6: drop armv9 and tranfer to GGML style

ggml-ci

* sync : ggml

* update the function to use appropriate types

* fix define error

* Update ggml/src/ggml-cpu.c

* add appropriate asserts

* move element-wise functions outside

* put the declaration outside the loop

* rewrite to be more inline with the common pattern for distributing threads

* use recommended way GGML_TENSOR_LOCALS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
b4040
2024-11-07 15:19:10 +08:00
5c333e0140 metal : add BF16 support (#8439)
* ggml : add initial BF16 support

ggml-ci

* metal : add mul_mat_id BF16 support

ggml-ci

* metal : check for bfloat support on the Metal device

ggml-ci

* metal : better var names [no ci]

* metal : do not build bfloat kernels when not supported

ggml-ci

* metal : try to fix BF16 support check

ggml-ci

* metal : this should correctly check bfloat support
2024-11-06 19:53:51 +02:00
b11f9ba9b8 server : remove hack for extra parallel slot (#10187)
ggml-ci
b4038
2024-11-06 13:29:01 +02:00
94d8cb8be1 metal : fix from ptr buffer name (#10189) b4037 2024-11-06 12:10:07 +01:00
1dc04b2dee ggml : adjust is_first_call init value (#10193)
ggml-ci
b4036
2024-11-06 11:20:10 +02:00
a1eaf6a960 metal : add quantized FA support (#10149)
* metal : add quantized FA (vec) support

ggml-ci

* metal : add quantized FA (non-vec) support

* metal : fix support check

ggml-ci

* metal : clean-up

* metal : clean-up (cont)

* metal : fix shared memory calc + reduce smem + comments

* metal : float-correctness

* metal : minor [no ci]
2024-11-06 10:24:23 +02:00
b8deef0ec0 llama : add <|tool_call|> formatting to Granite template (#10177)
Branch: GraniteToolCallTemplate

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b4034
2024-11-05 14:23:04 +02:00
a9e8a9a030 ggml : fix arch check in bf16_to_fp32 (#10164) b4033 2024-11-04 23:17:01 +01:00
Eve
3407364776 Q6_K AVX improvements (#10118)
* q6_k instruction reordering attempt

* better subtract method

* should be theoretically faster

small improvement with shuffle lut, likely because all loads are already done at that stage

* optimize bit fiddling

* handle -32 offset separately. bsums exists for a reason!

* use shift

* Update ggml-quants.c

* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
b4032
2024-11-04 23:06:31 +01:00
d5a409e57f ggml : fix gelu tables initialization (#10172) 2024-11-04 20:06:58 +01:00
401558b7ba ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (#10167) 2024-11-04 17:34:08 +01:00
9e0ecfb697 server : clarify /slots endpoint, add is_processing (#10162)
* server : clarify /slots endpoint, add is_processing

* fix tests
2024-11-04 16:33:29 +01:00
6a066b9978 fix build break on arm64 linux (#10166)
This fixes the build break from the recent changes
to move the CPU backend to separate files
https://github.com/ggerganov/llama.cpp/pull/10144
2024-11-04 16:08:33 +01:00
ea02c753eb cuda : clear error after changing peer access (#10153) b4027 2024-11-04 13:10:23 +01:00
05697f670b metal : simplify f16 and f32 dequant kernels (#0) b4026 2024-11-04 13:49:34 +02:00
f8e58135cf metal : move dequantize templates to beginning of MSL source (#0) b4025 2024-11-04 13:44:06 +02:00
329ed914c9 CANN: adjust backend registry refactor. (#10158)
remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.
b4024
2024-11-04 19:08:22 +08:00
ce027adfb3 sync : ggml b4023 2024-11-04 10:33:37 +02:00
284e5b0275 cmake : make it possible linking ggml as external lib (ggml/1003) 2024-11-04 10:33:11 +02:00
e2292aaa17 metal : fix minor string leaks (ggml/1004) 2024-11-04 10:33:10 +02:00
9f40989351 ggml : move CPU backend to a separate file (#10144) b4020 2024-11-03 19:34:08 +01:00
08828a6d7d metal : minor fixup in FA kernel (#10143)
* metal : minor fixup in FA kernel

ggml-ci

* metal : use the unrolled loop variable

* metal : remove unused var
b4019
2024-11-03 15:18:40 +02:00
1839f69130 flake.lock: Update (#10146) 2024-11-03 05:14:15 -08:00