Commit Graph

4299 Commits

Author SHA1 Message Date
ae4b922614 imatrix : add --no-context-shift support (#10766)
This allows setting --no-context-shift in llama-imatrix, which is required for models like DeepSeek
b4299
2024-12-10 18:23:50 +01:00
750cb3e246 CUDA: rename macros to avoid conflicts with WinAPI (#10736)
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI (e.g. is CC_PASCAL a GPU architecture or a WinAPI pascal compiler flag?).

* Reverts erroneous rename in SYCL-code.

* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.

* Renames the rest of the compute capability macros for consistency.
b4298
2024-12-10 18:23:24 +01:00
a86ad841f1 server : add flag to disable the web-ui (#10762) (#10751)
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
b4297
2024-12-10 18:22:34 +01:00
a05e2afcc2 vulkan: disable spirv-opt for coopmat shaders (#10763)
There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly
necessary anyway.

Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it
changes.

Fix coopmat support reporting when glslc doesn't support NV_coopmat2.
b4296
2024-12-10 18:22:20 +01:00
26a8406ba9 CUDA: fix shared memory access condition for mmv (#10740) b4295 2024-12-09 20:07:12 +01:00
c37fb4cf62 Changes to CMakePresets.json to add ninja clang target on windows (#10668)
* Update CMakePresets.json to use clang with ninja by default

* Update CMakePresets.json to add clang and ninja based configs

* Updates to build.md file

* Make updates to rename preset targets

* Update with .cmake file

* Remove additional whitespaces

* Add .cmake file for x64-windows-llvm

* Update docs/build.md

* Update docs/build.md

---------

Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
2024-12-09 09:40:19 -08:00
3d98b4cb22 vulkan: fix compile warnings (#10731) b4293 2024-12-09 08:24:01 +01:00
1a05004743 cmake : simplify msvc charsets (#10672) b4292 2024-12-09 09:15:13 +02:00
ce8784bdb1 server : fix format_infill (#10724)
* server : fix format_infill

* fix

* rename

* update test

* use another model

* update test

* update test

* test_invalid_input_extra_req
b4291
2024-12-08 23:04:29 +01:00
e52522b869 server : bring back info of final chunk in stream mode (#10722)
* server : bring back info to final chunk in stream mode

* clarify a bit

* trailing space
b4290
2024-12-08 20:38:51 +01:00
06d70147e6 Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (#10723)
* Vulkan: fix NaN in tanh.comp

* Faster NaN-free tanh
2024-12-08 19:19:19 +01:00
43ed389a3f llama : use cmake for swift build (#10525)
* llama : use cmake for swift build

* swift : <> -> ""

* ci : remove make

* ci : disable ios build

* Revert "swift : <> -> """

This reverts commit d39ffd9556.

* ci : try fix ios build

* ci : cont

* ci : cont

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4288
2024-12-08 13:14:54 +02:00
ecc93d0558 vulkan: compile a test shader in cmake to check for coopmat2 support (#10713) b4287 2024-12-08 09:05:55 +01:00
62e84d9848 llama : add 128k yarn context for Qwen (#10698)
* add 128k yarn context for Qwen

* added property for model tensors

* removing useless line
2024-12-07 23:12:27 +02:00
3573fa8e7b server : (refactor) no more json in server_task input (#10691)
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
b4285
2024-12-07 20:21:09 +01:00
d9c3ba2b77 ggml : disable iq4_nl interleave size 8 (#10709)
ggml-ci
b4284
2024-12-07 18:38:15 +02:00
ce4a7b8493 server : various fixes (#10704)
* server : various fixes

ggml-ci

* server : show current seed in slot_params

ggml-ci

* fix /slots endpoint

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : reflect endpoint response changes in the readme

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
b4283
2024-12-07 18:02:05 +02:00
19d8762ab6 ggml : refactor online repacking (#10446)
* rename ggml-cpu-aarch64.c to .cpp

* reformat extra cpu backend.

- clean Q4_0_N_M and IQ4_0_N_M
  - remove from "file" tensor type
  - allow only with dynamic repack

- extract cpu extra buffer types (bufts) and convert to C++
  - hbm
  - "aarch64"

- more generic use of extra buffer
  - generalise extra_supports_op
  - new API for "cpu-accel":
     - amx
     - aarch64

* clang-format

* Clean Q4_0_N_M ref

Enable restrict on C++

* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack

* added/corrected checks on tensor size for Q4 repacking.

* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add debug logs on repacks.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4282
2024-12-07 14:37:50 +02:00
c2a16c0bdb server : fix free of spec context and batch (#10651)
ggml-ci
b4281
2024-12-07 11:52:44 +02:00
3df784b305 Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (#10597)
* Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader

* Improve performance with better q4_k and q5_k dequant and store unrolling

* Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection

* Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device

* Vulkan: Implement accumulator switch for specific mul mat mat shaders

* Vulkan: Unroll more loops for more mul mat mat performance

* Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic

* Disable coopmat support on AMD proprietary driver

* Remove redundant checks

* Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support

* Fix rebase typo

* Fix coopmat2 MUL_MAT_ID pipeline selection
b4280
2024-12-07 10:24:15 +01:00
86a1934978 metal : Extend how Llama.cpp locates metal resources (#10676)
* metal : Extend how Llama.cpp locates metal resources (#10675)

  * It also searches for the resource file in the directory where the
    current binary is located.
  * Resolves symbolic links.

Rationale:

When we plug this dependency into a Bazel build and run it in the
context of Bazel (e.g. testing):

  * the execution directory is often very different from where the files
    are located, with no direct control over this (Bazel sandboxing),
  * the Bazel sandbox often uses symbolic links to make files available.

With this patch, we can add the resource file to the target and build
and run tests in the context of Bazel.

* Update ggml/src/ggml-metal/ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-metal/ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4279
2024-12-07 09:55:01 +02:00
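
For context, a minimal C++ sketch of the lookup strategy this commit describes: search next to the running binary and resolve symlinks first. The helper name and structure are hypothetical, not the actual ggml-metal code:

```cpp
#include <climits>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <mach-o/dyld.h> // _NSGetExecutablePath (macOS, where the Metal backend runs)

// Hypothetical helper: locate a resource shipped next to the running binary,
// resolving symlinks so Bazel-style sandboxes still find the real file.
static std::string resource_next_to_binary(const std::string & name) {
    char exe[PATH_MAX];
    uint32_t size = sizeof(exe);
    if (_NSGetExecutablePath(exe, &size) != 0) {
        return ""; // buffer too small
    }
    char real[PATH_MAX];
    if (realpath(exe, real) == nullptr) { // follow symlinks (e.g. Bazel sandbox)
        return "";
    }
    std::string dir(real);
    dir.erase(dir.find_last_of('/')); // strip the executable's file name
    return dir + "/" + name;          // e.g. ".../bin/default.metallib"
}
```
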
784a14aa49 convert : add support for Roberta embeddings (#10695) 2024-12-07 09:02:14 +02:00
c5ede3849f convert : add custom attention mapping 2024-12-06 21:33:49 +02:00
f162d45a21 common : bring back --no-warmup to server (#10686) b4276 2024-12-06 13:29:05 +01:00
6c5bc0625f server : (refactoring) do not rely on JSON internally (#10643)
* server : (refactoring) reduce usage of json internally

* move all response types to struct

* wip [no ci]

* many fixes

* add virtual function

* fix index

* minor style fix

* add std::move

* refactor handle_completions_generic

* add virtual functions

* remove server.hpp

* clarify server_sent_event RFC specs

* apply review comments

* fix model_alias and completion_probabilities

* small clean up

* remove virtual for to_json_oai_compat()

* naming oai_compat --> oaicompat

* fix unwanted recursive call

* update docs
2024-12-06 11:14:32 +01:00
7736837d62 fix(server) : not show alert when DONE is received (#10674) 2024-12-05 22:36:41 +01:00
c9c6e01dae vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (#10206) b4273 2024-12-05 20:15:05 +01:00
6fe6247831 llama : add Minerva 7B model support (#10673)
* Support for Minerva 7B

* Update convert_hf_to_gguf_update.py
b4272
2024-12-05 20:30:59 +02:00
0cd182ebcc sync : ggml b4271 2024-12-05 13:27:42 +02:00
PAB a8cbab201d ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
* implemented cpu kernel

* add i32 test cases in test-backend-ops

* typedef `ggml_metal_kargs_set`

* implemented `kernel_set`

* memcpy
2024-12-05 13:27:33 +02:00
PAB c2082d93a8 ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
* ggml_pad_reflect_1d defined in header

* implemented on CPU

* called the forward pass

* impl Metal kernel

* added Metal kernel

* added OP_PAD_REFLECT_1D in test-backend-ops.cpp

* add test-pad-reflect-1d test case

* test case support multiple backend
2024-12-05 13:27:31 +02:00
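
As a reference for what the new op computes, here is a scalar C++ sketch of 1-D reflect padding, assuming PyTorch-style semantics (the border mirrors the neighbors of the edge sample, excluding the edge itself):

```cpp
#include <vector>

// Reflect-pad a 1-D signal by p0 samples on the left and p1 on the right.
// Example: [1 2 3 4] with (p0=2, p1=1) -> [3 2 1 2 3 4 3].
// Valid for p0, p1 <= n - 1, as with PyTorch's ReflectionPad1d.
std::vector<float> pad_reflect_1d(const std::vector<float> & x, int p0, int p1) {
    const int n = (int) x.size();
    std::vector<float> y((size_t) (p0 + n + p1));
    for (int i = 0; i < (int) y.size(); ++i) {
        int j = i - p0;
        if (j < 0)  j = -j;              // mirror across the left edge
        if (j >= n) j = 2 * (n - 1) - j; // mirror across the right edge
        y[(size_t) i] = x[(size_t) j];
    }
    return y;
}
```
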
d405804be8 py : update outdated copy-paste instructions [no ci] (#10667)
This commit updates the copy-paste instruction in
convert_hf_to_gguf_update.py to reflect that convert_hf_to_gguf.py
will have already been updated with the new get_vocab_base_pre()
function when this script completes.
2024-12-05 09:47:55 +02:00
f112d198cd Update deprecation-warning.cpp (#10619)
Fixed path-separator handling for cross-platform support (Windows file systems)
b4267
2024-12-04 23:19:20 +01:00
1da7b76569 server : fix speculative decoding with context shift (#10641)
* server : fix speculative decoding with context shift

ggml-ci

* server : take into account speculative limits

ggml-ci

* server : add tests
b4266
2024-12-04 22:38:20 +02:00
59f4db1088 ggml : add predefined list of CPU backend variants to build (#10626)
* ggml : add predefined list of CPU backend variants to build

* update CPU dockerfiles
b4265
2024-12-04 14:45:40 +01:00
2803540814 ggml-cpu : fix HWCAP2_I8MM value (#10646) 2024-12-04 14:40:44 +01:00
253b7fde91 Fix HF repo commit to clone lora test models (#10649) 2024-12-04 10:45:48 +01:00
8d0cfd554a llama: Support MiniCPM-1B (with & w/o longrope) (#10559) b4262 2024-12-04 11:42:50 +02:00
2759916d86 vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (#10642) b4261 2024-12-04 08:28:59 +01:00
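
The "fast divide" in the title is the classic Granlund-Montgomery strength reduction: an integer division by a fixed divisor (common in the index math of copy-like kernels) is replaced by a precomputed multiply and shift. A self-contained C++ sketch of the idea, not the actual shader code:

```cpp
#include <cassert>
#include <cstdint>

struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift, L = ceil(log2(d))
};

// Precompute (mp, L) for a divisor d >= 2 so that n / d becomes mulhi + shift.
static fastdiv_vals fastdiv_init(uint32_t d) {
    assert(d >= 2);
    uint32_t L = 0;
    while ((uint64_t{1} << L) < d) {
        ++L;
    }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return {mp, L};
}

static uint32_t fastdiv(uint32_t n, fastdiv_vals f) {
    const uint32_t hi = (uint32_t) (((uint64_t) n * f.mp) >> 32); // mulhi(n, mp)
    return (uint32_t) (((uint64_t) hi + n) >> f.L);               // 64-bit add avoids overflow
}

int main() {
    const fastdiv_vals f = fastdiv_init(7);
    for (uint32_t n = 0; n < 1000000; ++n) {
        assert(fastdiv(n, f) == n / 7);
    }
}
```
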
40c6d79fb5 SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (#10584)
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend

Move backend selection to compile time to avoid latency at run time.
Apply it to all oneMKL GEMM calls, and only for the NVIDIA backend.

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

* Formatting

* Address PR comments to increase readability

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b4260
2024-12-04 09:29:20 +08:00
98036d5670 fix typo in README.md (#10605) 2024-12-04 02:22:50 +01:00
cd2f37b304 Avoid using __fp16 on ARM with old nvcc (#10616) b4258 2024-12-04 01:41:37 +01:00
da6aac91f1 Add docs for creating a static build (#10268) (#10630)
* Add notes for a static build

* Update docs/build.md

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-04 01:40:36 +01:00
01e6d9bb71 clip : add sycl support (#10574)
Co-authored-by: piDack <pcdack@hotmail.co>
b4256
2024-12-04 01:26:37 +01:00
cc98896db8 vulkan: optimize and reenable split_k (#10637)
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
b4255
2024-12-03 20:29:54 +01:00
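
The heuristic described here -- fall back to split_k only when the output-tile grid alone cannot occupy the GPU -- can be sketched as follows (hypothetical C++; the names and the cap of 8 are illustrative, not the shipped logic):

```cpp
#include <cstdint>

// Split the K dimension across extra workgroups when there are fewer output
// tiles than compute units, then reduce the partial sums in a second pass
// (mul_mat_split_k_reduce in the commit message).
uint32_t pick_split_k(uint32_t m_tiles, uint32_t n_tiles, uint32_t compute_units) {
    const uint32_t groups = m_tiles * n_tiles;
    uint32_t split_k = 1;
    while (groups * split_k < compute_units && split_k < 8) {
        split_k *= 2; // more K-slices -> more workgroups in flight
    }
    return split_k;
}
```
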
91c36c269b server : (web ui) Various improvements, now use vite as bundler (#10599)
* hide buttons in dropdown menu

* use npm as deps manager and vite as bundler

* fix build

* fix build (2)

* fix responsive on mobile

* fix more problems on mobile

* sync build

* (test) add CI step for verifying build

* fix ci

* force rebuild .hpp files

* cmake: clean up generated files pre build
b4254
2024-12-03 19:38:44 +01:00
1cd3df46bd scripts : remove amx sync
ggml-ci
b4253
2024-12-03 20:04:49 +02:00
c505471857 sync : ggml 2024-12-03 20:04:49 +02:00
e9e661bd59 CUDA: remove unnecessary warp reduce in FA (ggml/1032)
* kqmax_new_j is the same in every thread within the warp after the operation at line 199, so this reduce can be omitted

* the same problem exists in vec32

---------

Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
2024-12-03 20:04:49 +02:00
PAB efb6ae9630 feat: add GGML_UNARY_OP_ARGMAX Metal kernel (ggml/1019)
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmetic

* added 3 tests cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-03 20:04:49 +02:00
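
For reference, the operation being ported here is row-wise argmax, which ggml returns as an I32 tensor. A scalar C++ sketch of the semantics the Metal kernel parallelizes within a simdgroup (the flat row-major layout is my assumption):

```cpp
#include <cstdint>
#include <vector>

// Row-wise argmax: for each row, the index of its first maximal element.
std::vector<int32_t> argmax_rows(const float * data, int64_t rows, int64_t cols) {
    std::vector<int32_t> out((size_t) rows);
    for (int64_t r = 0; r < rows; ++r) {
        const float * row = data + r * cols;
        int32_t best = 0;
        for (int64_t c = 1; c < cols; ++c) {
            if (row[c] > row[best]) {
                best = (int32_t) c;
            }
        }
        out[(size_t) r] = best;
    }
    return out;
}
```
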