Commit Graph

4118 Commits

Author SHA1 Message Date
be5caccef9 llama : only use default buffer types for the KV cache (#10358) b4118 2024-11-17 12:25:45 +01:00
20a780c7b6 gitignore : ignore local run scripts [no ci] 2024-11-17 13:12:22 +02:00
cf32a9b93a metal : refactor kernel args into structs (#10238)
* metal : add kernel arg structs (wip)

* metal : fattn args

ggml-ci

* metal : cont + avoid potential int overflow [no ci]

* metal : mul mat struct (wip)

* cont : mul mat vec

* cont : pass by reference

* cont : args is first argument

* cont : use char ptr

* cont : shmem style

* cont : thread counters style

* cont : mul mm id

ggml-ci

* cont : int safety + register optimizations

ggml-ci

* metal : GGML_OP_CONCAT

ggml-ci

* metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV

* metal : GGML_OP_REPEAT

* metal : GGML_OP_CPY

* metal : GGML_OP_RMS_NORM

* metal : GGML_OP_NORM

* metal : add TODOs for rest of ops

* ggml : add ggml-metal-impl.h

ggml-ci
2024-11-17 11:23:01 +02:00
a43178299c ggml : fix undefined reference to 'getcpu' (#10354)
https://github.com/ggerganov/llama.cpp/issues/10352
b4115
2024-11-17 10:39:22 +02:00
c3ea58aca4 CUDA: remove DMMV, consolidate F16 mult mat vec (#10318) b4114 2024-11-17 09:09:55 +01:00
467576b6cc CMake: default to -arch=native for CUDA build (#10320) b4113 2024-11-17 09:06:34 +01:00
eda7e1d4f5 ggml : fix possible buffer use after free in sched reserve (#9930) b4112 2024-11-17 08:31:17 +02:00
24203e9dd7 ggml : inttypes.h -> cinttypes (#0)
ggml-ci
b4111
2024-11-17 08:30:29 +02:00
5d9e59979c ggml : adapt AMX to tensor->grad removal (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
a4200cafad make : add ggml-opt (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
84274a10c3 tests : remove test-grad0 2024-11-17 08:30:29 +02:00
68fcb4759c ggml : fix compile warnings (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
8a43e940ab ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
5c9a8b22b1 scripts : update sync 2024-11-17 08:30:29 +02:00
0fff7fd798 docs : vulkan build instructions to use git bash mingw64 (#10303) 2024-11-17 00:29:18 +01:00
4e54be0ec6 llama/ex: remove --logdir argument (#10339) b4103 2024-11-16 23:00:41 +01:00
db4cfd5dbc llamafile : fix include path (#0)
ggml-ci
b4102
2024-11-16 20:36:26 +02:00
8ee0d09ae6 make : auto-determine dependencies (#0) 2024-11-16 20:36:26 +02:00
bcdb7a2386 server: (web UI) Add samplers sequence customization (#10255)
* Samplers sequence: simplified and input field.

* Removed unused function

* Modify and use `settings-modal-short-input`

* rename "name" --> "label"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4100
2024-11-16 14:26:54 +01:00
f245cc28d4 scripts : fix missing key in compare-llama-bench.py (#10332) 2024-11-16 10:32:50 +02:00
772703c8ff vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
b4098
2024-11-16 07:26:57 +01:00
dd3a6ce9f8 vulkan : add cmake preset debug/release (#10306) 2024-11-16 02:59:33 +01:00
1e58ee1318 ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324) b4096 2024-11-16 01:53:37 +01:00
89e4caaaf0 llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
b4095
2024-11-16 01:42:13 +01:00
74d73dc85c Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) b4094 2024-11-15 22:27:00 +01:00
4047be74da scripts: update compare-llama-bench.py (#10319) b4093 2024-11-15 21:19:03 +01:00
883d206fbd ggml : fix some build issues b4092 2024-11-15 21:45:32 +02:00
09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
b4091
2024-11-15 15:44:06 +02:00
3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
Eve
18429220bd AVX BF16 and single scale quant optimizations (#10212)
* use 128 bit loads (i've tried 256->128 to death and its slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kep for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
b4088
2024-11-15 12:47:58 +01:00
f0204a0ec7 ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4087
2024-11-15 12:47:25 +01:00
57f8355b29 sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
9901068ac7 server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
b4085
2024-11-15 10:48:49 +01:00
231f9360d9 cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
4802ad350b scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
5a54af4d4f sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692.
b4082
2024-11-15 11:09:12 +08:00
1607a5e5b0 backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b4081
2024-11-15 01:28:50 +01:00
ae8de6d50a ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
b4080
2024-11-14 18:04:35 +01:00
4a8ccb37ad CUDA: no -sm row for very small matrices (#10185) b4079 2024-11-14 13:00:15 +01:00
2a82891a85 speculative : fix out-of-bounds access (#10289) b4078 2024-11-14 11:44:15 +02:00
af148c9386 vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 are the same dimensions and additional modulus
for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
b4077
2024-11-14 06:22:55 +01:00
66798e42fb vulkan: Use macros to make the mat mul pipeline creation more concise (#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a
pipeline. This isn't really used yet.
b4076
2024-11-13 21:59:47 +01:00
fb4a0ec083 llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4075
2024-11-13 20:00:35 +02:00
5ea926dad7 sync : ggml 2024-11-13 18:11:54 +02:00
1ee9eea094 docs : update bindings list (#10261)
Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
b4073
2024-11-13 13:17:10 +02:00
ff7fb670d0 server : add missing docs (#10269) 2024-11-13 13:16:30 +02:00
0e712a5acb server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template

* server : fix chat res
b4071
2024-11-13 13:15:23 +02:00
a0ec17b32e metadata: Detailed Dataset Authorship Metadata (#8875)
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
2024-11-13 21:10:38 +11:00
2e82ffa4af sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.
b4069
2024-11-13 09:40:57 +00:00