Commit Graph

5753 Commits

cb79c2e7fa ggml: don't include arm_neon.h when using CUDA 12 with ARM Neon (ggml/1187)
fix #1186
2025-04-11 00:17:47 +03:00
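
As a hedged illustration of the kind of guard this fix implies (the exact ggml condition may differ), one way to keep the NEON header out of CUDA translation units:

```cpp
// Illustrative sketch only, not the exact ggml change: skip the NEON intrinsics
// header when nvcc is compiling the file, since pulling arm_neon.h into a
// CUDA 12 build on ARM can break compilation.
#if defined(__ARM_NEON) && !defined(__CUDACC__)
#include <arm_neon.h>
#endif
```
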
fe92821ea9 ggml : add bilinear upscale support (ggml/1185) 2025-04-11 00:17:47 +03:00
459895c326 ggml : add more generic custom op, remove deprecated custom ops (ggml/1183)
* ggml : add more generic ggml_custom op

* ggml : remove deprecated custom ops
2025-04-11 00:17:47 +03:00
e4bf72d631 scripts : fix sync-ggml-am.sh 2025-04-11 00:17:47 +03:00
8b9cc7cdd8 llava : introduce libmtmd (#12849)
* wip llava2

* migrated gemma3 to llava2

* add timings

* correct pre/postfix

* fix missing include

* fix compilation unused var warn

* update llava2_tokenize

* change name llava2 --> mtmd

* improve api

* refine helpers

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5099
2025-04-10 22:57:16 +02:00
64eda5deb9 convert : ability to lazy-load safetensors remotely without downloading to disk (#12820)
* gguf util : add SafetensorRemote

* fix style

* convert: add --remote option

* convert : allow using lazy remote tensors

It's a bit slow for now since everything is blocking and single-threaded.

* correct metadata.name

* small style fix

* support HF_TOKEN

* convert : use writeable buffer for remote lazy tensors

* convert : fix flake8 lint regarding lambda assignment

* multithreaded download

* multithread: print debug

* fix style

* Revert "multithreaded download"

This reverts commit 42fc895ace.

* bring back _get_request_headers

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-04-10 17:24:44 +02:00
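
The change itself lives in the Python conversion tooling; as a hedged sketch of the underlying idea (request only the byte range of the tensor you need and pass HF_TOKEN as a bearer token), a libcurl version might look like the following. fetch_range, the error handling, and the URL handling are illustrative, not the project's API:

```cpp
// Hedged sketch: lazily pull one byte range of a remote safetensors file.
#include <curl/curl.h>
#include <cstdlib>
#include <string>
#include <vector>

static size_t write_cb(char * ptr, size_t size, size_t nmemb, void * user) {
    auto * out = static_cast<std::vector<char> *>(user);
    out->insert(out->end(), ptr, ptr + size * nmemb);
    return size * nmemb;
}

// fetch bytes [begin, end] (inclusive) of a remote file, e.g. one tensor's data
static std::vector<char> fetch_range(const std::string & url, size_t begin, size_t end) {
    std::vector<char> data;
    CURL * curl = curl_easy_init();
    if (!curl) {
        return data;
    }
    struct curl_slist * headers = nullptr;
    if (const char * token = std::getenv("HF_TOKEN")) {   // optional auth, as in the PR
        headers = curl_slist_append(headers, (std::string("Authorization: Bearer ") + token).c_str());
    }
    const std::string range = std::to_string(begin) + "-" + std::to_string(end);
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_RANGE, range.c_str());      // HTTP Range request
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &data);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_perform(curl);                                   // blocking, single-threaded
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return data;
}
```
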
fe5b78c896 CANN: Support more ops (#12841)
* [CANN]Support Opt LOG && MEAN && PAD_REFLECT_1D

* [CANN]Support COUNT_EQUAL && STEP && SGN

* [CANN]codestyle adjustment

* [CANN]codestyle adjustment

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
b5097
2025-04-10 08:51:52 +08:00
11d07e1e69 Fixes #12823 (#12830)
* Including limits file on AIX

* Fixes #12823
b5096
2025-04-10 01:18:01 +02:00
b0091ecc1e docker : added all CPU to GPU images (#12749) 2025-04-10 01:17:12 +02:00
31f7803bc4 ggml-cpu-impl.h: do not redefine bool on POWER9 (#12856)
error: unknown type name '_Bool'
b5094
2025-04-10 01:00:34 +02:00
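
A hedged sketch of the pattern behind this error, assuming the usual AltiVec keyword clash; the actual header change may differ in detail:

```cpp
// Hedged sketch, not the exact ggml-cpu-impl.h diff: <altivec.h> claims
// `vector`, `pixel` and `bool` as context-sensitive keywords, so headers often
// undef them afterwards. Re-defining `bool` as `_Bool` then breaks C++ builds,
// because `_Bool` is a C-only type name, which is exactly the reported error.
#if defined(__POWER9_VECTOR__)
#include <altivec.h>
#undef vector
#undef pixel
#undef bool   // keep the undef, but do not re-define bool as _Bool in C++
#endif
```
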
2391506ace ggml-impl.h: fix build on POWER9 (#12855)
error: ISO C++17 does not allow 'register' storage class specifier
b5093
2025-04-10 01:00:25 +02:00
d3bd7193ba llama : Support Qwen3 and Qwen3MoE (#12828)
* add qwen3 & qwen3moe support.

* fix

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
b5092
2025-04-09 11:47:36 +02:00
d9a63b2f2e musa: enable freediskspace for docker image build (#12839)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-09 11:22:30 +02:00
8ed71242f4 sycl: update documentation to use -no-cnv (#12845) 2025-04-09 11:22:04 +02:00
381603a775 ci: detach common from the library (#12827)
* fix: detach common from the library

* fix: building chat test template
b5089
2025-04-09 10:11:11 +02:00
65a69e6e1b clip : do not print ftype (#12832) 2025-04-09 10:09:53 +02:00
47277d6d1d readme : add rpc backend (#12842) 2025-04-09 10:54:42 +03:00
6e1c4cebdb CANN: Support Opt CONV_TRANSPOSE_1D and ELU (#12786)
* [CANN] Support ELU and CONV_TRANSPOSE_1D

* [CANN]Modification review comments

* [CANN]Modification review comments

* [CANN]name adjustment

* [CANN]remove lambda used in template

* [CANN]Use std::func instead of template

* [CANN]Modify the code according to the review comments

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
b5086
2025-04-09 14:04:14 +08:00
0090950f67 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
b5085
2025-04-09 07:25:08 +02:00
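
A hedged, CPU-side C++ rendering of the restructuring described above; the real change is a Vulkan coopmat2 shader that stages the decoded scales in workgroup shared memory, and ToyBlock below is an invented stand-in rather than the real q4_k layout:

```cpp
// Hedged CPU-side sketch; ToyBlock is an invented stand-in, not the real q4_k block.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int SUB_BLOCKS = 8;   // sub-blocks per block (toy value)
constexpr int SUB_SIZE   = 32;  // values per sub-block (toy value)

struct ToyBlock {
    uint8_t scales[SUB_BLOCKS];          // packed per-sub-block scales
    int8_t  qs[SUB_BLOCKS * SUB_SIZE];   // quantized values
    float   d;                           // super-block scale
};

// Outer loop walks whole blocks; the scale bytes are decoded ONCE per block
// (the shader copies them into shared memory at this point), and the inner
// loop reuses the decoded values instead of re-loading them every iteration.
float dot_restructured(const std::vector<ToyBlock> & blocks, const std::vector<float> & y) {
    float acc = 0.0f;
    for (size_t ib = 0; ib < blocks.size(); ++ib) {
        float sc[SUB_BLOCKS];
        for (int j = 0; j < SUB_BLOCKS; ++j) {
            sc[j] = blocks[ib].d * (float) blocks[ib].scales[j];   // decode once per block
        }
        for (int j = 0; j < SUB_BLOCKS; ++j) {                     // unrolled in the shader
            const int8_t * q  = blocks[ib].qs + j * SUB_SIZE;
            const float  * yv = y.data() + (ib * SUB_BLOCKS + j) * SUB_SIZE;
            float sum = 0.0f;
            for (int k = 0; k < SUB_SIZE; ++k) {
                sum += (float) q[k] * yv[k];
            }
            acc += sc[j] * sum;
        }
    }
    return acc;
}

int main() {
    std::vector<ToyBlock> blocks(2);
    std::vector<float> y(blocks.size() * SUB_BLOCKS * SUB_SIZE, 1.0f);
    for (auto & b : blocks) {
        b.d = 0.5f;
        for (int j = 0; j < SUB_BLOCKS; ++j) b.scales[j] = 1;
        for (int k = 0; k < SUB_BLOCKS * SUB_SIZE; ++k) b.qs[k] = 1;
    }
    std::printf("dot = %f\n", dot_restructured(blocks, y));   // 2 * 8 * 32 * 0.5 = 256
    return 0;
}
```
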
7ecd780b1a vulkan: Use fp16 for the flash attention P*V multiplication (#12783)
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
b5084
2025-04-09 07:12:57 +02:00
7538246e7c cuda : add f32 to bf16 copy op (#12806)
This allows BF16 KV-cache on CUDA.
b5083
2025-04-08 23:21:31 +02:00
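
A hedged scalar sketch of the conversion such a copy op performs (the real implementation is a CUDA kernel): truncate fp32 to bf16 with round-to-nearest-even, which is what lets an F32 source feed a BF16 KV cache:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate an fp32 value to bf16 with round-to-nearest-even (NaN kept as NaN).
static uint16_t f32_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    if ((bits & 0x7fffffff) > 0x7f800000) {       // NaN: make sure the result stays NaN
        return (uint16_t) ((bits >> 16) | 64);
    }
    bits += 0x7fff + ((bits >> 16) & 1);          // round to nearest even
    return (uint16_t) (bits >> 16);
}

int main() {
    std::printf("1.0f    -> 0x%04x\n", f32_to_bf16(1.0f));      // 0x3f80
    std::printf("3.1416f -> 0x%04x\n", f32_to_bf16(3.1416f));
    return 0;
}
```
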
b32efad2bc llava: improve clip_ctx destructor to not memleak load_image_size (#12834) b5082 2025-04-08 22:01:58 +02:00
a19b5cef16 llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
b5081
2025-04-08 19:54:51 +03:00
78a1ba0a4f server : fix thread.join() on exit (#12831) b5080 2025-04-08 18:37:06 +02:00
2dabf759e7 llava: add more helper functions to check projector types in clip context (#12824)
Signed-off-by: dm4 <sunrisedm4@gmail.com>
b5079
2025-04-08 15:49:13 +02:00
1d343b4069 arg : Including limits file on AIX (#12822) b5078 2025-04-08 14:30:59 +02:00
8ca6e1c3a4 server : webui : Improve Chat Input with Auto-Sizing Textarea (#12785)
* Update ChatScreen.tsx

* useAutosizeTextarea.ts

useAutosizeTextarea to encapsulate the logic.

* Implement responsive auto-sizing chat textarea

Replaces the manual textarea resizing with an automatic height adjustment based on content.

- `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
- Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
- Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
- Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.

* update compressed index.html.gz after npm run build

* refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook

* chore: normalize line endings to LF
refactor: AutosizeTextareaApi -> chatTextareaApi

* refactor: Rename interface to PascalCase

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-08 11:14:59 +02:00
656babd6c2 Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (#12812)
* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"

This reverts commit 518a01480e.

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

* rm tail space
b5076
2025-04-08 15:03:21 +08:00
a226bc7a9a gguf-py : support lazy tensor splitting (#12809)
* gguf-py : support lazy tensor splitting

Splitting usually involves returning tuples of tensors,
which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint
2025-04-08 09:03:07 +02:00
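
The change is in gguf-py (Python); as a hedged C++ rendering of the point about tuples and eager evaluation, where LazyTensor and split_half are invented for illustration, a split can return deferred pieces so nothing is materialized until a consumer asks for it:

```cpp
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// A deferred tensor: nothing is computed until eval() is called.
struct LazyTensor {
    std::function<std::vector<float>()> thunk;
    std::vector<float> eval() const { return thunk(); }
};

// Split one lazy tensor into two halves without evaluating it eagerly;
// in this naive sketch each half re-runs the producer when evaluated.
static std::pair<LazyTensor, LazyTensor> split_half(const LazyTensor & t) {
    LazyTensor first {[t] { auto v = t.eval(); return std::vector<float>(v.begin(), v.begin() + v.size() / 2); }};
    LazyTensor second{[t] { auto v = t.eval(); return std::vector<float>(v.begin() + v.size() / 2, v.end()); }};
    return {first, second};
}

int main() {
    LazyTensor src{[] { std::puts("materializing"); return std::vector<float>{1, 2, 3, 4}; }};
    auto [a, b] = split_half(src);                              // still lazy: nothing printed yet
    std::printf("first half size: %zu\n", a.eval().size());     // "materializing" happens here
    return 0;
}
```
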
1466621e73 llama : Support llama 4 text-only (#12791)
* llama4 conversion

* initial support, no chat template

* clean up a bit

* fix tokenizer conversion

* correct hparams

* try this

* fix shexp

* ffn_inp_normed

* chat template

* clean up model conversion

* add_bos

* add scale_before_ffn

* fix order

* weight_before_ffn

* llm_graph_input_attn_temp

* add chunk attn mask

* build_inp_attn_scale()

* add comment about ggml_repeat

* clarify comments

* fix build
b5074
2025-04-07 23:06:44 +02:00
82974011f3 opencl: better identify Adreno GPU (#12760) b5073 2025-04-07 13:22:54 -07:00
4ccea213bc hellaswag: display estimated score confidence interval (#12797) b5072 2025-04-07 18:47:08 +03:00
1a1ab7e7a4 cuda : fix HIP and MUSA BF16 (#0)
ggml-ci
b5071
2025-04-07 18:44:17 +03:00
a4e46e28f9 sync : ggml
ggml-ci
2025-04-07 18:44:17 +03:00
ff067dbcb9 ggml : simplify Arm fp16 CPU logic (ggml/1177)
* ggml : simplify Arm fp16 CPU logic

ggml-ci

* cont : bring back CUDA/MUSA checks

ggml-ci
2025-04-07 18:44:17 +03:00
36ca8b3628 CUDA: don't convert BF16 weights to FP32 (ggml/1174)
* add bf16 support

* use convert_from_bf16_cuda instead of convert_unary_cuda for f32

* revert 7ec5085

* move functionality into convert_unary with constexpr
2025-04-07 18:44:17 +03:00
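
A hedged sketch of the "constexpr inside the unary converter" idea; bf16_t and convert_elem are invented stand-ins, not the real ggml-cuda symbols:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <type_traits>

struct bf16_t { uint16_t bits; };   // hypothetical raw-bf16 wrapper

// One templated converter; the bf16 case is selected at compile time, so a bf16
// source is widened to f32 directly instead of detouring through another path.
template <typename dst_t, typename src_t>
static dst_t convert_elem(src_t x) {
    if constexpr (std::is_same_v<src_t, bf16_t>) {
        uint32_t u = (uint32_t) x.bits << 16;   // bf16 is the high half of an f32
        float f;
        std::memcpy(&f, &u, sizeof(f));
        return (dst_t) f;
    } else {
        return (dst_t) x;
    }
}

int main() {
    std::printf("%f\n", convert_elem<float>(bf16_t{0x3f80}));   // 1.000000
    std::printf("%d\n",  convert_elem<int>(2.5f));              // 2
    return 0;
}
```
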
995083e4ed cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
* cpu: refactor SIMD mappings and vectorized op functions into separate files

* Fix warning for ggml_float to float

* Fix warnings

* cpu: move all the operations (except mul_mat) to a separate c++ file

* fix whitespace

* Update ggml/src/ggml-cpu/vec.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp

* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-04-07 18:44:17 +03:00
518a01480e sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (#12734) b5066 2025-04-07 17:22:57 +02:00
e391d3ee8d ci : no curl on ggml-ci (#12796) 2025-04-07 15:37:28 +03:00
bd3f59f812 cmake : enable curl by default (#12761)
* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo
b5064
2025-04-07 13:35:19 +02:00
52b3d71f12 CANN: fix typo in ggml-cann (#12733) 2025-04-07 19:34:14 +08:00
d0d5b2232b CANN: Refactor to reduce duplicate code (#12731)
* CANN: Refactor to reduce duplicate code

* CANN: fix review comment
b5062
2025-04-07 17:10:36 +08:00
916c83bfe7 musa: fix compilation warnings in mp_22/31 (#12780)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5061
2025-04-06 15:23:54 +02:00
0c74b04376 vulkan: fix NaN issue in flash attention shader (#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
b5060
2025-04-06 11:03:47 +02:00
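
A hedged standalone illustration of why the seed value matters when an entire row is masked out (the real fix is in a Vulkan shader; this only reproduces the arithmetic):

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    const float masked = -INFINITY;     // score of a fully masked position
    float m_inf = -INFINITY;            // running max seeded with -inf
    float m_fin = -FLT_MAX / 2;         // running max seeded with -FLT_MAX/2
    m_inf = fmaxf(m_inf, masked);       // still -inf
    m_fin = fmaxf(m_fin, masked);       // still -FLT_MAX/2
    // softmax numerator exp(score - max):
    std::printf("seeded with -inf      : %f\n", expf(masked - m_inf));   // (-inf) - (-inf) is nan
    std::printf("seeded with -FLT_MAX/2: %f\n", expf(masked - m_fin));   // exp(-inf) is 0
    return 0;
}
```
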
80b717d493 vulkan: Use unclamped loads for flash attention mask (#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
b5059
2025-04-06 10:47:13 +02:00
6bf28f0111 Vulkan: Tune Vulkan mmq int dot shader for performance (#12767) b5058 2025-04-05 18:04:03 +02:00
f1e3eb4249 common : fix includes in arg.cpp and gemma3-cli.cpp (#12766)
* arg.cpp: add a missing include

* gemma3-cli.cpp: fix cinttypes include
b5057
2025-04-05 17:46:00 +02:00
0364178ca2 clip : refactor clip_init, add tests (#12757)
* refactor clip_init

* fix loading file

* fix style

* test ok

* better test with report

* add missing headers

* clarify

* add KEY_MM_PATCH_MERGE_TYPE

* remove bool has_* pattern

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use ggml_soft_max_ext

* refactor logging system

* add minicpm-v-o 2.6 for testing

* use nullptr everywhere

* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5056
2025-04-05 17:17:40 +02:00
c6ff5d2a8d common: custom hf endpoint support (#12769)
* common: custom hf endpoint support

Add support for custom huggingface endpoints via HF_ENDPOINT environment variable

You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration.

Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

* Update common/arg.cpp

readability Improvement

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b5055
2025-04-05 15:31:42 +02:00
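
A hedged sketch of the described behavior; hf_endpoint is an invented helper rather than the actual common/arg.cpp code, and the resolve/main URL shape is just the usual Hugging Face download layout:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Read HF_ENDPOINT, fall back to the default hub, tolerate a missing trailing slash.
static std::string hf_endpoint() {
    const char * env = std::getenv("HF_ENDPOINT");
    std::string base = (env && *env) ? env : "https://huggingface.co/";
    if (base.back() != '/') {
        base += '/';                    // trailing slash is optional for the user
    }
    return base;
}

int main() {
    // e.g. HF_ENDPOINT=https://hf-mirror.com ./a.out
    std::printf("%sQwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q2_k.gguf\n",
                hf_endpoint().c_str());
    return 0;
}
```
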
7a84777f42 sync: minja (#12739)
* sync: minja

https://github.com/google/minja/pull/57

* fix json include
b5054
2025-04-04 21:16:39 +01:00