Commit Graph

5179 Commits

2969019837 media : add SVG logo [no ci] (#12616) 2025-03-27 23:09:05 +02:00
5dec47dcd4 opencl: add multi and vision rope, gelu_quick and im2col (#12600)
* opencl: add `im2col`

* opencl: add `gelu_quick`

* opencl: add mrope

* opencl: add vision rope
b4978
2025-03-27 08:08:08 -07:00
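For reference on the ops added above: gelu_quick is the sigmoid-based GELU approximation, gelu_quick(x) = x * sigmoid(1.702 * x). A minimal scalar sketch of the math the new OpenCL kernel computes (the real kernel is vectorized over device buffers):

```cpp
#include <cmath>

// Scalar reference for the "quick" GELU approximation: x * sigmoid(1.702 * x),
// i.e. x / (1 + exp(-1.702 * x)). The OpenCL kernel applies this elementwise
// to a device buffer; this sketch only shows the per-element math.
static float gelu_quick_ref(float x) {
    return x / (1.0f + std::exp(-1.702f * x));
}
```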
f125b8dccf llama : add PLM GGUF Conversion & Inference Support (#12457)
* add edgellm model arch [conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refactor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig errors

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch
b4977
2025-03-27 12:49:15 +02:00
953c2a62cf model : restore support for T5Encoder (#12590) b4976 2025-03-27 11:43:33 +01:00
d5c6309d91 convert : Support Qwen2_5_VLForConditionalGeneration (#12595) 2025-03-27 11:11:23 +01:00
029c693fdc sync : ggml
ggml-ci
b4974
2025-03-27 10:09:29 +02:00
771d84371c scripts : update sync + fix cmake merge
ggml-ci
2025-03-27 10:09:29 +02:00
df0665a483 sync : ggml
ggml-ci
b4972
2025-03-27 09:04:38 +02:00
0306aad1ca cmake : sync/merge PowerPC build commands (#0) 2025-03-27 09:04:38 +02:00
c7b43ab608 llamafile : ppc64le MMA implementation for Q4_0. (#12489)
This change upstreams llamafile's CPU matrix
multiplication kernels for the ppc64le ISA using
MMA builtins. This patch handles matrix
multiplication between the quantised datatypes
block_q4_0 and block_q8_0.

This change results in a 5% - 50% improvement
in total speed (i.e. all tokens / total time), across
various batch sizes.

The patch is tested with Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
b4970
2025-03-27 08:51:47 +02:00
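For context on the commit above, a scalar reference of the q4_0 x q8_0 dot product that the MMA kernels compute in bulk. The block layouts are simplified: ggml stores each per-block scale d as fp16; plain float is used here to keep the sketch self-contained.

```cpp
#include <cstdint>

constexpr int QK = 32;                                // elements per block
struct block_q4_0 { float d; uint8_t qs[QK / 2]; };   // 32 4-bit weights, stored offset by 8
struct block_q8_0 { float d; int8_t  qs[QK];     };   // 32 signed 8-bit values

// Scalar reference: low nibbles hold elements 0..15, high nibbles 16..31,
// and each block contributes d_x * d_y * (integer dot of its 32 elements).
float vec_dot_q4_0_q8_0_ref(int nblocks, const block_q4_0 *x, const block_q8_0 *y) {
    float sumf = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int sumi = 0;
        for (int j = 0; j < QK / 2; ++j) {
            const int v0 = (x[i].qs[j] & 0x0F) - 8;   // element j
            const int v1 = (x[i].qs[j] >> 4)   - 8;   // element j + 16
            sumi += v0 * y[i].qs[j] + v1 * y[i].qs[j + QK / 2];
        }
        sumf += x[i].d * y[i].d * sumi;
    }
    return sumf;
}
```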
24feaec057 ggml : riscv: add 128-bit RVV support (#12530)
* ggml : add 128-bit RVV support

* ggml : revert to old RVV 256+ q2_K, q3_K, q4_K, q6_K impl

* remove trailing whitespaces

* restructure vector length selection code
b4969
2025-03-27 08:38:34 +02:00
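Much of "128-bit RVV support" comes down to writing kernels in vector-length-agnostic style, where vsetvl returns however many lanes the hardware grants per iteration, so the same loop runs on VLEN=128 and VLEN=256 parts alike. An illustrative strip-mined loop (not one of the quantized kernels this commit restructures):

```cpp
#include <riscv_vector.h>
#include <cstddef>

// Vector-length-agnostic f32 add: __riscv_vsetvl_e32m1(n) caps the active
// lane count at what the hardware supports, so no minimum VLEN is assumed.
void vec_add_f32(const float *a, const float *b, float *c, size_t n) {
    while (n > 0) {
        const size_t vl = __riscv_vsetvl_e32m1(n);
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);
        __riscv_vse32_v_f32m1(c, __riscv_vfadd_vv_f32m1(va, vb, vl), vl);
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```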
f28bc4c286 llama : make loras compatible with repacking (#12593)
* llama : make loras compatible with repacking

ggml-ci

* cont : simplify

ggml-ci

* cont : add TODO [no ci]
2025-03-27 08:24:10 +02:00
f17a3bb4e8 SYCL: implement memset ggml backend buffer interface (#12580)
* SYCL: implement memset ggml backend buffer interface

* use GGML_ABORT macro

* Do not wait for all queues to finish for memset operation
b4967
2025-03-27 09:46:00 +08:00
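A rough sketch of what a memset hook on a backend buffer can look like; the types and names below are simplified stand-ins, not ggml's actual ggml_backend_buffer_i interface. The third bullet's point is visible here: SYCL's queue::memset returns an event, so only that one operation needs to be waited on rather than synchronizing every queue.

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Simplified stand-in for a backend buffer bound to a SYCL queue.
struct device_buffer {
    sycl::queue *q;    // queue the buffer was allocated on
    void        *base; // device allocation
};

// memset a byte range of the buffer on the device, waiting only for this op.
void buffer_memset(device_buffer &buf, size_t offset, uint8_t value, size_t size) {
    buf.q->memset(static_cast<char *>(buf.base) + offset, value, size).wait();
}
```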
bd40678df7 HIP: Add support for RDNA4 targets (#12372) b4966 2025-03-26 23:46:30 +01:00
b3298fa47a metal : refactor mat-vec code (#12569)
* metal : refactor mat-vec code

ggml-ci

* metal : rename all_sum -> sum_all

ggml-ci

* metal : fix comments [no ci]

* metal : fix nr constant [no ci]

* metal : mv q6_K support nr0 > 1

ggml-ci

* metal : reduce register pressure

ggml-ci

* metal : fix typo [no ci]

* metal : reduce register pressure

ggml-ci
2025-03-26 21:38:38 +02:00
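On the "nr0 > 1" bullet: letting one pass of a mat-vec kernel produce several output rows means every loaded element of the input vector is reused once per row instead of once total. A scalar illustration of the idea (hypothetical names, not the Metal kernel):

```cpp
constexpr int NR0 = 4; // output rows produced per pass

// Each x[j] is loaded once and used NR0 times, amortizing memory traffic.
// The Metal kernel applies the same idea per simdgroup over quantized rows;
// the tail handling for nrows % NR0 != 0 is omitted here.
void mat_vec_nr0(const float *A, const float *x, float *y, int nrows, int ncols) {
    for (int r = 0; r + NR0 <= nrows; r += NR0) {
        float acc[NR0] = {};
        for (int j = 0; j < ncols; ++j) {
            const float xj = x[j];                      // one load ...
            for (int i = 0; i < NR0; ++i) {
                acc[i] += A[(r + i) * ncols + j] * xj;  // ... NR0 uses
            }
        }
        for (int i = 0; i < NR0; ++i) {
            y[r + i] = acc[i];
        }
    }
}
```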
2447ad8a98 upgrade to llguidance 0.7.10 (#12576) b4964 2025-03-26 11:06:09 -07:00
02082f1519 clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566)
* [Fix] Compiling clip-quantize-cli and running it under a CUDA backend causes ggml_fp16_to_fp32 to fail when it tries to read tensor data that lives in video memory; quantization needs to run on the CPU backend. After the fix, it automatically runs on the CPU backend and is no longer bound to CUDA.

* [Fix] Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.
b4963
2025-03-26 15:06:04 +01:00
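The underlying issue: ggml_fp16_to_fp32 is a host-side conversion, so the bytes it reads must live in host memory; under a CUDA backend, tensor->data points at video memory. A sketch of the general pattern for doing such a conversion safely, copying to the host first via the real ggml_backend_tensor_get API (loading the model on the CPU backend, as the fix does, avoids the problem at the source):

```cpp
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Convert an F16 tensor to host-side f32, regardless of which backend owns
// its data. Assumes t->type == GGML_TYPE_F16.
std::vector<float> tensor_to_f32_host(const ggml_tensor *t) {
    const size_t n = ggml_nelements(t);
    std::vector<ggml_fp16_t> tmp(n);
    // device -> host copy; safe even when t->data is device memory
    ggml_backend_tensor_get(t, tmp.data(), 0, n * sizeof(ggml_fp16_t));
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        out[i] = ggml_fp16_to_fp32(tmp[i]);  // host-side conversion
    }
    return out;
}
```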
df4d20cd53 convert : fix squeeze for ssm_conv tensors (#12573)
* convert : fix squeeze for ssm_conv tensors

* convert : match ssm_conv tensors by type

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-03-26 08:21:05 -04:00
5ed38b6852 ggml : fix MUL_MAT_ID repack with Q8_K (#12544)
* ggml : fix MUL_MAT_ID repack with Q8_K

ggml-ci

* ggml : improve repack templates

ggml-ci
b4961
2025-03-26 13:02:00 +02:00
fd7855f8f5 doc: [MUSA] minor changes (#12583)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-26 09:09:48 +02:00
53af4dba42 convert: fix Mistral3/Gemma3 model hparams init (#12571)
* Fix Mistral3/Gemma3 model hparams init

* set positional args correctly

* use existing hparams if passed
2025-03-25 23:03:10 +01:00
ef19c71769 run: de-duplicate fmt and format functions and optimize (#11596) b4958 2025-03-25 18:46:11 +01:00
053b3f9aae ggml-cpu : update KleidiAI to v1.5.0 (#12568)
ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
b4957
2025-03-25 13:10:18 +02:00
e2f560175a SYCL: disable Q4_0 reorder optimization (#12560)
ggml-ci
b4956
2025-03-25 18:40:18 +08:00
36ee06dd2d docs : add build instructions for KleidiAI (#12563)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-03-25 11:35:20 +02:00
3cd3a39532 ci: [MUSA] add CI and update doc (#12562)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-25 09:45:08 +02:00
2d77d88e70 context : fix worst-case reserve outputs (#12545)
ggml-ci
b4953
2025-03-25 09:19:23 +02:00
c95fa362b3 ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547) 2025-03-24 19:35:38 +02:00
2b65ae3029 opencl: simplify kernel embedding logic in cmakefile (#12503)
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
b4951
2025-03-24 09:20:47 -07:00
48d7021c61 CI: fix SYCL build (#12546) 2025-03-24 14:58:32 +02:00
3361e2deba docs: update: improve the Fedora CUDA guide (#12536)
* docs: update fedora-cuda guide

- Rename and place into Backend Folder.
- Update Host-Supplied Packages.
- Expand Recommended Users Section.

* docs: improve the flow of CUDA-FEDORA.md
2025-03-24 11:02:26 +00:00
00d53800e0 llama-vocab : add SuperBPE pre-tokenizer (#12532) b4948 2025-03-24 11:47:24 +01:00
7ea75035b6 CUDA: Fix clang warnings (#12540)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4947
2025-03-24 11:28:34 +01:00
c54f6b7988 mmap : skip resource limit checks on AIX (#12541) b4946 2025-03-24 12:17:10 +02:00
9b169a4d4e vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration fell inside one of
the unrolled loops. Adjust the unrolling counts to avoid this, and add a
couple of new backend tests that hit this failure on NVIDIA GPUs.
b4945
2025-03-24 07:56:17 +01:00
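The bug class, in generic scalar form rather than the shader itself: when a loop is unrolled, a bounds check performed once per unrolled iteration can let the later lanes run past the end. Processing only full groups and finishing with a tail loop keeps every access in range:

```cpp
// 4x-unrolled scale with a correct tail. Checking "i < n" only at the top of
// the unrolled body would let v[i+1]..v[i+3] index out of bounds on the last
// iteration; "i + 4 <= n" admits full groups only.
void scale_unrolled(float *v, float s, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        v[i + 0] *= s;
        v[i + 1] *= s;
        v[i + 2] *= s;
        v[i + 3] *= s;
    }
    for (; i < n; ++i) {  // scalar tail for the remaining n % 4 elements
        v[i] *= s;
    }
}
```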
77f9c6bbe5 server : Add verbose output to OAI-compatible chat endpoint. (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
b4944
2025-03-23 19:30:26 +01:00
18b663d8e4 install : add macports (#12518)
MacPorts section added
2025-03-23 10:21:48 +02:00
fbdfefe74e llama : gemma3 : use output tensor if it exists in model weight (#12506)
* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names
b4942
2025-03-22 23:28:19 +01:00
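The shape of the fallback, sketched with a hypothetical lookup helper rather than llama.cpp's actual loader API. "output.weight" and "token_embd.weight" are the real GGUF tensor names; the point is to prefer a dedicated output projection when the file ships one and only tie the embeddings otherwise:

```cpp
#include <map>
#include <string>

struct tensor;  // stand-in for ggml_tensor

// Hypothetical lookup over the tensors present in the GGUF file.
tensor *find_tensor(const std::map<std::string, tensor *> &tensors,
                    const std::string &name) {
    auto it = tensors.find(name);
    return it == tensors.end() ? nullptr : it->second;
}

// Use output.weight if the model ships one; otherwise fall back to the tied
// token-embedding tensor as the output projection.
tensor *resolve_output(const std::map<std::string, tensor *> &tensors) {
    if (tensor *out = find_tensor(tensors, "output.weight")) {
        return out;
    }
    return find_tensor(tensors, "token_embd.weight");
}
```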
ba932dfb50 ggml : fix quantized cpy op (#12310)
* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
2025-03-22 16:23:26 +02:00
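The coverage idea behind the new tests, paraphrased rather than copied from the test-backend-ops harness: enumerate every (src, dst) type pair for the cpy op, including same-type copies, instead of only the few pairs the quantized-cpy bug happened to slip through.

```cpp
#include <utility>
#include <vector>
#include "ggml.h"

// Build the (src, dst) type pairs to exercise; the real harness runs each
// pair through a graph on every backend and compares against the CPU result.
std::vector<std::pair<ggml_type, ggml_type>> cpy_cases() {
    const ggml_type types[] = {
        GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_BF16,
        GGML_TYPE_Q4_0, GGML_TYPE_Q8_0,
    };
    std::vector<std::pair<ggml_type, ggml_type>> cases;
    for (ggml_type src : types) {
        for (ggml_type dst : types) {
            cases.emplace_back(src, dst);  // includes same-type copies
        }
    }
    return cases;
}
```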
fac63a3d78 musa: refine compute capability (#12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b4940
2025-03-22 10:11:37 +01:00
eddfb43850 vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
b4939
2025-03-22 09:40:11 +01:00
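The grouped-query-attention reuse described above, reduced to scalar form with illustrative shapes: the query heads in one GQA group all attend against the same K/V head, so each element loaded from the shared matrix can feed every head in the group instead of being re-fetched per head.

```cpp
// Dot each of n_group query rows against one shared K/V row of length k.
// kv_row[j] is loaded once per j and reused n_group times, which is the
// load reuse the p021 shader exploits across a GQA group.
void group_dot(const float *kv_row, const float *q /* n_group x k */,
               float *out, int n_group, int k) {
    for (int i = 0; i < n_group; ++i) {
        out[i] = 0.0f;
    }
    for (int j = 0; j < k; ++j) {
        const float a = kv_row[j];          // one load from the shared matrix ...
        for (int i = 0; i < n_group; ++i) {
            out[i] += a * q[i * k + j];     // ... reused for every head in the group
        }
    }
}
```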
4375415b4a Vulkan: RTE rounding for cpy to quant (#12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copy-paste issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
b4938
2025-03-21 20:34:50 +01:00
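Why RTE (round-to-nearest-even) matters for an f32-to-quant copy: the CPU and GPU paths must round ties identically, or the same tensor quantizes to different integers on each backend. A small demonstration of the two common tie-breaking modes:

```cpp
#include <cmath>
#include <cstdio>

// roundf rounds halfway cases away from zero; rintf, under the default
// FE_TONEAREST rounding mode, rounds them to even (RTE). Note how 2.5
// differs: 3 vs 2.
int main() {
    for (float x : {0.5f, 1.5f, 2.5f, -0.5f, -1.5f}) {
        std::printf("x = %5.1f  round = %5.1f  rint(RTE) = %5.1f\n",
                    x, std::roundf(x), std::rintf(x));
    }
    return 0;
}
```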
30c42ef5cb vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472) b4937 2025-03-21 20:27:47 +01:00
af04481e6b model : do not repack if a GPU device is present (#12498)
ggml-ci
b4936
2025-03-21 16:14:29 +02:00
960e726077 chore : cleanup llama_model_loader::TENSOR_ usage (#12492) b4935 2025-03-21 10:21:36 +01:00
ea1518e839 llama-tts : avoid crashes related to bad model file paths (#12482) b4934 2025-03-21 11:12:45 +02:00
1aa87ee53d [SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
* [SYCL] Fix build on Windows when ccache enabled (#9954)

* take effect only on Windows and force it to icl

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
b4933
2025-03-21 14:58:47 +08:00
9ffcc9e374 sycl: cleanup oneDNN related code (#12097) b4932 2025-03-21 10:15:56 +08:00
e04643063b webui : Prevent rerendering on textarea input (#12299)
* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-20 15:57:43 +01:00
dbb3a4739e llama : make Qwen2MoE QKV bias optional (#12477) b4930 2025-03-20 12:49:59 +01:00