Commit Graph

4260 Commits

Author SHA1 Message Date
40c6d79fb5 SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (#10584)
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend

Move to compile-time backend selection to avoid latency at run time.
Add it to all MKL gemm calls, and only for the NVIDIA backend.

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

* Formatting

* Address PR comments to increase readability

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b4260
2024-12-04 09:29:20 +08:00
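As a rough illustration of the compile-time selection this entry describes (replacing a run-time backend lookup with a choice fixed at build time), consider the sketch below. The names `USE_NVIDIA_BACKEND`, `selected_backend`, and `gemm_dispatch` are hypothetical and are not the actual llama.cpp SYCL symbols.

```cpp
// Sketch only: a backend chosen at compile time resolves every gemm call site
// directly, with no run-time branch or table lookup on the hot path.
#include <iostream>

struct nvidia_backend  { static constexpr const char *name = "NVIDIA (oneMKL interface)"; };
struct generic_backend { static constexpr const char *name = "portable oneMKL"; };

#ifdef USE_NVIDIA_BACKEND
using selected_backend = nvidia_backend;   // fixed when the file is compiled
#else
using selected_backend = generic_backend;
#endif

template <typename Backend>
void gemm_dispatch() {
    // In real code this would forward to the gemm entry point of the selected
    // backend; here we only print which one was chosen.
    std::cout << "gemm dispatched to: " << Backend::name << '\n';
}

int main() {
    gemm_dispatch<selected_backend>();
}
```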
98036d5670 fix typo of README.md (#10605) 2024-12-04 02:22:50 +01:00
cd2f37b304 Avoid using __fp16 on ARM with old nvcc (#10616) b4258 2024-12-04 01:41:37 +01:00
da6aac91f1 Add docs for creating a static build (#10268) (#10630)
* Add notes for a static build

* Update docs/build.md

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-04 01:40:36 +01:00
01e6d9bb71 clip : add sycl support (#10574)
Co-authored-by: piDack <pcdack@hotmail.co>
b4256
2024-12-04 01:26:37 +01:00
cc98896db8 vulkan: optimize and reenable split_k (#10637)
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
b4255
2024-12-03 20:29:54 +01:00
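The split_k entry above enables the optimization only when the regular launch would not produce enough workgroups to occupy the GPU. A hedged host-side sketch of that kind of heuristic follows; the function name, cap, and thresholds are illustrative, not the values used in the Vulkan backend.

```cpp
// Illustrative heuristic: split the K dimension of a matrix multiply only when
// the normal tiling yields too few workgroups to fill the device.
#include <cstdint>
#include <iostream>

std::uint32_t choose_split_k(std::uint32_t m_tiles, std::uint32_t n_tiles,
                             std::uint32_t compute_units) {
    const std::uint32_t workgroups = m_tiles * n_tiles;
    if (workgroups >= compute_units) {
        return 1; // enough parallel work already; keep a single K pass
    }
    // Split K until the workgroup count roughly covers the device, capped at a
    // small factor to bound the extra reduction cost.
    std::uint32_t split_k = (compute_units + workgroups - 1) / workgroups;
    return split_k > 4 ? 4 : split_k;
}

int main() {
    std::cout << choose_split_k(2, 2, 32)   << '\n'; // small launch -> split K
    std::cout << choose_split_k(64, 64, 32) << '\n'; // large launch -> no split
}
```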
91c36c269b server : (web ui) Various improvements, now use vite as bundler (#10599)
* hide buttons in dropdown menu

* use npm as deps manager and vite as bundler

* fix build

* fix build (2)

* fix responsive on mobile

* fix more problems on mobile

* sync build

* (test) add CI step for verifying build

* fix ci

* force rebuild .hpp files

* cmake: clean up generated files pre-build
b4254
2024-12-03 19:38:44 +01:00
1cd3df46bd scripts : remove amx sync
ggml-ci
b4253
2024-12-03 20:04:49 +02:00
c505471857 sync : ggml 2024-12-03 20:04:49 +02:00
e9e661bd59 CUDA: remove unnecessary warp reduce in FA (ggml/1032)
* kqmax_new_j is the same in every thread within a warp after the operation at line 199, so this reduction can be omitted

* same problem in vec32

---------

Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
2024-12-03 20:04:49 +02:00
PAB
efb6ae9630 feat: add GGML_UNARY_OP_ARGMAX Metal kernel (ggml/1019)
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmetic

* added 3 tests cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-03 20:04:49 +02:00
PAB
667d70d170 metal : add GGML_OP_CONV_TRANSPOSE_1D kernels (ggml/1026)
* wip

* wip implementation f32

* kernel conv transpose 1d f32 working

* initial commit
2024-12-03 20:04:49 +02:00
3b4f2e33e2 llama : add missing LLAMA_API for llama_chat_builtin_templates (#10636) b4248 2024-12-03 12:54:30 +01:00
82bca2257b readme : add option, update default value, fix formatting (#10271)
* readme : document --no-display-prompt

* readme : update default prompt context size

* readme : remove unnecessary indentation

Indenting a line with four spaces makes Markdown treat that section as
plain text.

* readme : indent commands under bullets

* readme : indent commands in lettered list
2024-12-03 12:50:08 +02:00
0115df2f65 metal : small-batch mat-mul kernels (#10581)
* metal : small-batch mat-mul kernels

ggml-ci

* metal : add rest of types

ggml-ci

* metal : final adjustments

ggml-ci

* metal : add comments

ggml-ci
b4246
2024-12-03 11:52:33 +02:00
515d4e5372 github : minify link [no ci] (revert)
this doesn't work as expected
2024-12-03 11:21:43 +02:00
844e2e1fee github : minify link [no ci] 2024-12-03 11:20:35 +02:00
70b98fadbc server : fix default draft model parameters (#10586)
* server : force F16 KV cache for the draft model

ggml-ci

* server : fix draft params

ggml-ci

* server : various params fixes

ggml-ci
b4243
2024-12-03 11:20:00 +02:00
642330ac7c llama : add enum for built-in chat templates (#10623)
* llama : add enum for supported chat templates

* use "built-in" instead of "supported"

* arg: print list of built-in templates

* fix test

* update server README
b4242
2024-12-02 22:10:19 +01:00
8648c52101 make : deprecate (#10514)
* make : deprecate

ggml-ci

* ci : disable Makefile builds

ggml-ci

* docs : remove make references [no ci]

* ci : disable swift build

ggml-ci

* docs : remove obsolete make references, scripts, examples

ggml-ci

* basic fix for compare-commits.sh

* update build.md

* more build.md updates

* more build.md updates

* more build.md updates

* Update Makefile

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-12-02 21:22:53 +02:00
64ed2091b2 server: Add "tokens per second" information in the backend (#10548)
* add cmake rvv support

* add timings

* remove space

* update readme

* fix

* fix code

* remove empty line

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4240
2024-12-02 14:45:54 +01:00
991f8aabee SYCL: Fix and switch to GGML_LOG system instead of fprintf (#10579)
* Switched to GGML_LOG

* Fix missing semicolon
b4239
2024-12-02 15:04:11 +08:00
4cb003dd8d contrib : refresh (#10593)
* contrib : refresh

* contrib : expand [no ci]

* contrib : expand test-backend-ops instructions

* contrib : add CODEOWNERS

* prs : update template to not have checkbox [no ci]
2024-12-02 08:53:27 +02:00
917786f43d Add mistral-v1, mistral-v3, mistral-v3-tekken and mistral-v7 chat template types (#10572)
* Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken`

* Changed system message logic and added tests for all 4

* Invalid `system_message` instead of `content` fixed

* Removed tab-indented lines

* Added template code and test for `mistral-v7`

* Added all tests. Fixed bug with `tmpl == "llama2"` test.

* Replaced tabs with spaces.

* Removed `'mistral-v2'` option as no (open) models ever used it

* Removed all references to 'v2' template from comments

* Update llama.cpp

Fixed `trim_assistant_message` bug
2024-12-01 23:09:49 +01:00
5e1ed95583 grammars : add English-only grammar (#10612) 2024-12-01 21:37:54 +02:00
5c7a5aa0c3 ci: add error handling for Python venv creation in run.sh (#10608) 2024-12-01 20:11:42 +02:00
3420909dff ggml : automatic selection of best CPU backend (#10606)
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks
b4234
2024-12-01 16:12:41 +01:00
86dc11c5bc server : bind to any port when specified (#10590) b4233 2024-12-01 13:33:12 +02:00
6acce39710 readme : update the usage section with examples (#10596)
* readme : update the usage section with examples

* readme : more examples
2024-12-01 11:25:17 +02:00
43957ef203 build: update Makefile comments for C++ version change (#10598) b4231 2024-12-01 04:19:44 +01:00
0c39f44d70 ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (#10567)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b4230
2024-11-30 09:13:18 -08:00
3e0ba0e604 readme : remove old badge 2024-11-30 10:09:21 +02:00
abadba05be readme : refresh (#10587)
* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
2024-11-30 09:47:07 +02:00
Eve
0533e7fb38 vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536)
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)

* force 16 sequential threads per block

* make 16 subgroup size a constant
b4227
2024-11-30 08:00:02 +01:00
7cc2d2c889 ggml : move AMX to the CPU backend (#10570)
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4226
2024-11-29 21:54:58 +01:00
b782e5c7d4 server : add more test cases (#10569)
* server : add split model test

* add test speculative

* add invalid cases
2024-11-29 21:48:56 +01:00
3a8e9af402 imatrix : support combine-only (#10492)
* imatrix-combine-only idea

* ensured that the behavior is consistent with the log
b4224
2024-11-29 19:21:37 +02:00
a3a3048e7a cleanup UI link list (#10577)
* cleanup UI link list

* sort list alphabetically

* add missing licenses
2024-11-29 17:45:08 +01:00
f0678c5ff4 ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggml-ci
b4222
2024-11-29 16:25:39 +02:00
4b3242bbea ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580) b4221 2024-11-29 14:49:02 +01:00
0f77aae560 sycl : offload of get_rows set to 0 (#10432) b4220 2024-11-29 20:38:45 +08:00
266b8519ee sycl : Reroute permuted mul_mats through oneMKL (#10408)
This PR fixes the failing MUL_MAT tests for the sycl backend.
b4219
2024-11-29 09:49:43 +00:00
938f608742 CANN: RoPE operator optimization (#10563)
* [cann] RoPE operator optimization

* [CANN] Code Formatting

---------

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
b4218
2024-11-29 14:46:55 +08:00
f095a649ec vulkan: get the first command buffer submitted sooner (#10499)
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
b4217
2024-11-29 07:18:02 +01:00
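The commit body above describes submitting the first command buffer after only a few graph nodes and then ramping the batch size up to the usual 100 nodes per submit. The sketch below shows that scheduling shape only; the node counts and names are invented and do not come from the Vulkan backend.

```cpp
// Sketch of a submission ramp: submit a small first batch of graph nodes so
// the GPU can start working sooner, then grow the batch toward a steady state.
#include <algorithm>
#include <cstdio>

int main() {
    const int total_nodes   = 1000; // pretend graph size
    const int initial_batch = 20;   // small first submit to start the GPU sooner
    const int max_batch     = 100;  // steady-state nodes per submit

    int batch = initial_batch;
    for (int submitted = 0; submitted < total_nodes; ) {
        const int count = std::min(batch, total_nodes - submitted);
        std::printf("submit nodes [%d, %d)\n", submitted, submitted + count);
        submitted += count;
        batch = std::min(batch * 2, max_batch); // ramp up after each submit
    }
}
```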
678d7994f4 llava: return false instead of exit (#10546) b4216 2024-11-29 01:09:46 +01:00
dc22344088 ggml : remove redundant copyright notice + update authors b4215 2024-11-28 20:46:40 +02:00
4c0a95b107 llama : add missing model types b4214 2024-11-28 20:45:07 +02:00
6c59567689 server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
* server : (tests) don't use thread for capturing stdout/stderr

* test: bump openai to 1.55.2

* bump openai to 1.55.3
2024-11-28 19:17:49 +01:00
890719311b common: fix warning message when no GPU found (#10564) b4212 2024-11-28 18:15:25 +01:00
7281cf13ad docs: fix outdated usage of llama-simple (#10565) b4211 2024-11-28 16:03:11 +01:00