Commit Graph

4849 Commits

Author SHA1 Message Date
2c6c8df56d vulkan: optimize coopmat2 iq2/iq3 callbacks (#11521)
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
b4649
2025-02-06 07:15:30 +01:00
8a7e3bf17a vulkan: initial support for IQ4_XS quantization (#11501) b4648 2025-02-06 07:09:59 +01:00
1b598b3058 vulkan: use smaller combined allocations to avoid fragmentation (#11551) b4647 2025-02-06 07:02:18 +01:00
902368a06b metal : avoid breaking build when metal API predates TARGET_OS_VISION (#11690)
Avoids breakage in nix flake build introduced by b0569130c5
b4646
2025-02-06 09:52:31 +08:00
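A common way to handle an SDK whose headers predate a feature macro like TARGET_OS_VISION is sketched below; this is illustrative only and not necessarily the exact change in #11690 (the macro-default pattern and the guarded block are assumptions).

```cpp
// Illustrative sketch only: guard against an Apple SDK whose
// TargetConditionals.h predates TARGET_OS_VISION, so later checks still
// compile. Not necessarily the actual #11690 patch.
#include <TargetConditionals.h>  // Apple SDK header

#ifndef TARGET_OS_VISION
#define TARGET_OS_VISION 0  // older SDKs do not define this macro
#endif

#if TARGET_OS_VISION
// visionOS-specific Metal setup would go here
#endif
```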
c3db0480bb readme : add link to Autopen under UIs (#11684)
Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize the buffer on the fly, score the buffer, visualise token logits and allow you to switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.
2025-02-06 01:55:25 +01:00
d774ab3acc metal : adjust support conditions for norm operators (#11671)
cont #11659

ggml-ci
b4644
2025-02-05 10:57:42 +02:00
fa62da9b2d CUDA: support for mat. mul. with ne03 != ne13 (#11656) b4643 2025-02-05 08:58:31 +01:00
1ec208083c llava: add quantization for the visual projector LLAVA, Qwen2VL (#11644)
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file

* Fixed the gcc warning regarding minor linting

* Removed trailing whitespace
b4642
2025-02-05 10:45:40 +03:00
9f4cc8f8d3 sync: minja (#11641)
* `sync`: minja

182de30cda

https://github.com/google/minja/pull/46

https://github.com/google/minja/pull/45
b4641
2025-02-05 01:00:12 +00:00
fd08255d0d CUDA: non-contiguous (RMS) norm support (#11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4640
2025-02-04 22:21:42 +01:00
3ec9fd4b77 HIP: force max threads per block to be 1024 (#11621)
Some old/vendor-forked versions of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.


Signed-off-by: fxzjshm <fxzjshm@163.com>
b4639
2025-02-04 19:18:38 +01:00
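The idea behind the commit above, as a minimal sketch: instead of trusting a toolchain-derived limit that some old or vendor-forked llvm builds report as 256, pin the per-block thread limit to 1024. The macro and helper names here are hypothetical, not the actual patch.

```cpp
// Illustrative sketch, not the actual #11621 change: hard-code the per-block
// thread limit to 1024 (matching upstream llvm) and clamp requested launch
// sizes to it.
#include <algorithm>
#include <cstdio>

#define GGML_HIP_MAX_THREADS_PER_BLOCK 1024  // hypothetical macro name

static int clamp_block_size(int requested) {
    // clamp the requested block size to the forced upper bound
    return std::min(requested, GGML_HIP_MAX_THREADS_PER_BLOCK);
}

int main() {
    std::printf("block size: %d\n", clamp_block_size(2048));  // prints 1024
}
```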
3962fc1a79 server : add try..catch to places not covered by set_exception_handler (#11620)
* server : add try..catch to places not covered by set_exception_handler

* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
1bef571f6a arg : list RPC devices first when using --list-devices (#11655)
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
b4637
2025-02-04 18:16:20 +02:00
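A sketch of the ordering idea in #11655 (not the actual llama.cpp code; the `Device` struct and `is_rpc` flag are assumptions): a stable partition keeps the relative order within each group while moving RPC devices to the front, so the printed list matches the order used when splitting tensors across devices.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Device {
    std::string name;
    bool is_rpc;  // hypothetical flag, for illustration only
};

int main() {
    std::vector<Device> devices = {
        {"CUDA0", false}, {"RPC[host:50052]", true}, {"CUDA1", false},
    };
    // stable_partition preserves relative order inside each group, so RPC
    // devices come first and the rest keep their original ordering
    std::stable_partition(devices.begin(), devices.end(),
                          [](const Device & d) { return d.is_rpc; });
    for (const auto & d : devices) {
        std::printf("%s\n", d.name.c_str());
    }
}
```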
db288b60cb tool-call: command r7b fix for normal responses (#11608)
* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat
b4636
2025-02-04 15:48:53 +00:00
106045e7bb readme : add llm_client Rust crate to readme bindings (#11628)
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite a while, so I figured now is a fair time to add it.

It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.

It also integrates model presets and picks the largest quant that fits the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from Hugging Face.

So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
2025-02-04 13:20:55 +02:00
f117d84b48 swift : fix llama-vocab api usage (#11645)
* swiftui : fix vocab api usage

* batched.swift : fix vocab api usage
b4634
2025-02-04 13:15:24 +02:00
534c46b53c metal : use residency set for other platforms (#11648) b4633 2025-02-04 13:07:18 +02:00
387a1598ca authors : update 2025-02-04 13:04:10 +02:00
7c9e0ca520 sync : ggml b4631 2025-02-04 12:59:21 +02:00
8f8290ada9 cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
This makes git an optional dependency, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.

This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem to be much value in factoring it out,
or even requiring it.
2025-02-04 12:59:15 +02:00
b34aedd558 ci : do not stale-close roadmap issues 2025-02-04 09:31:01 +02:00
cde3833239 tool-call: allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
* tool-call: allow `--jinja --chat-template chatml`

* fix double bos issue (drop bos/eos tokens from jinja template)

* add missing try catch around jinja parsing to default to chatml

* Simplify default chatml logic
b4628
2025-02-03 23:49:27 +00:00
b3451785ac server : (webui) revert hacky solution from #11626 (#11634) 2025-02-04 00:10:52 +01:00
1d1e6a90bc server : (webui) allow typing and submitting during llm response (#11626) 2025-02-03 23:16:27 +01:00
5598f475be server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.

The motivation for this is that when using a debug build the server
would crash when an exception was thrown and terminate the process, as the
exception was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.

Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
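For reference, a sketch of the exception handler the commit message refers to, assuming cpp-httplib's single header is available; the handler signature shown follows recent cpp-httplib versions (the `std::exception_ptr` form) and may differ in older ones. The route and messages are illustrative, not llama.cpp's server code.

```cpp
#include <exception>
#include <string>
#include "httplib.h"

int main() {
    httplib::Server svr;

    // A handler that throws; with CPPHTTPLIB_NO_EXCEPTIONS removed, the
    // exception propagates to the library instead of aborting the process.
    svr.Get("/boom", [](const httplib::Request &, httplib::Response &) {
        throw std::runtime_error("unhandled error");
    });

    // The exception handler turns an uncaught exception into a 500 response
    // rather than terminating the server process.
    svr.set_exception_handler([](const httplib::Request &, httplib::Response & res,
                                 std::exception_ptr ep) {
        std::string msg = "Internal Server Error";
        try {
            if (ep) std::rethrow_exception(ep);
        } catch (const std::exception & e) {
            msg = e.what();
        }
        res.status = 500;
        res.set_content(msg, "text/plain");
    });

    svr.listen("127.0.0.1", 8080);
}
```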
8ec05832fa sync : ggml 2025-02-03 14:57:08 +02:00
21c84b5d2d CUDA: fix Volta FlashAttention logic (#11615) b4623 2025-02-03 14:25:56 +02:00
d92cb67e37 server : (webui) Fix Shift+Enter handling (#11609)
* Fix Shift+Enter handling

`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
6eecde3cc8 HIP: fix flash_attn_stream_k_fixup warning (#11604) b4621 2025-02-02 23:48:29 +01:00
396856b400 CUDA/HIP: add support for selectable warp size to mmv (#11519)
CUDA/HIP: add support for selectable warp size to mmv
b4620
2025-02-02 22:40:09 +01:00
4d0598e144 HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing CC architectures for AMD GPUs are not supersets of each other (#11601)
This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly
b4619
2025-02-02 22:08:05 +01:00
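A sketch of the family-predicate idea above. The GGML_CUDA_CC_IS_* names come from the commit title, but the offsets and ranges below are hypothetical; the point is that AMD compute-capability values fall into families (RDNA1, RDNA2, ...) that are not supersets of each other, so code must test family membership instead of comparing against a single threshold.

```cpp
#include <cstdio>

#define GGML_CUDA_CC_OFFSET_AMD 1000000                            // hypothetical AMD offset
#define GGML_CUDA_CC_RDNA1      (GGML_CUDA_CC_OFFSET_AMD + 1010)   // hypothetical
#define GGML_CUDA_CC_RDNA2      (GGML_CUDA_CC_OFFSET_AMD + 1030)   // hypothetical
#define GGML_CUDA_CC_RDNA3      (GGML_CUDA_CC_OFFSET_AMD + 1100)   // hypothetical

// family membership is a range check, not a ">= threshold" check
#define GGML_CUDA_CC_IS_RDNA1(cc) ((cc) >= GGML_CUDA_CC_RDNA1 && (cc) < GGML_CUDA_CC_RDNA2)
#define GGML_CUDA_CC_IS_RDNA2(cc) ((cc) >= GGML_CUDA_CC_RDNA2 && (cc) < GGML_CUDA_CC_RDNA3)

int main() {
    int cc = GGML_CUDA_CC_OFFSET_AMD + 1012;  // e.g. a non-gfx1010 RDNA1 part
    std::printf("RDNA1: %d\n", GGML_CUDA_CC_IS_RDNA1(cc));  // prints 1
}
```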
90f9b88afb nit: more informative crash when grammar sampler fails (#11593) b4618 2025-02-02 19:58:34 +00:00
864a0b67a6 CUDA: use mma PTX instructions for FlashAttention (#11583)
* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b4617
2025-02-02 19:31:09 +01:00
84ec8a58f7 Name colors (#11573)
It's more descriptive, and using #define's lets us do compile-time
concatenation.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4616
2025-02-02 15:14:48 +00:00
bfcce4d693 tool-call: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers
b4615
2025-02-02 09:25:38 +00:00
69804487e0 Fix exotic ci env that lacks ostringstream::str (#11581) b4614 2025-02-02 09:10:15 +00:00
ff227703d6 sampling : support for llguidance grammars (#10224)
* initial porting of previous LLG patch

* update for new APIs

* build: integrate llguidance as an external project

* use '%llguidance' as marker to enable llg lark syntax

* add some docs

* clarify docs

* code style fixes

* remove llguidance.h from .gitignore

* fix tests when llg is enabled

* pass vocab not model to llama_sampler_init_llg()

* copy test-grammar-integration.cpp to test-llguidance.cpp

* clang fmt

* fix ref-count bug

* build and run test

* gbnf -> lark syntax

* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag

* rename llguidance test file to test-grammar-llguidance.cpp

* add gh action for llg test

* align tests with LLG grammar syntax and JSON Schema spec

* llama_tokenizer() in fact requires valid utf8

* update llg

* format file

* add $LLGUIDANCE_LOG_LEVEL support

* fix whitespace

* fix warning

* include <cmath> for INFINITY

* add final newline

* fail llama_sampler_init_llg() at runtime

* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes

* simplify #includes

* improve doc string for LLAMA_LLGUIDANCE

* typo in merge

* bump llguidance to 0.6.12
b4613
2025-02-02 09:55:32 +02:00
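To illustrate the '%llguidance' marker mentioned in the bullets above: a grammar string that starts with the marker opts into llguidance's lark-style syntax rather than GBNF. Everything in this sketch is hypothetical, including the detection helper and the grammar body; it is not the actual llama.cpp code path.

```cpp
#include <cstdio>
#include <string>

// Hypothetical grammar: the '%llguidance' prefix is the marker from the
// commit; the rest is illustrative lark-ish syntax, not a verified grammar.
static const std::string grammar =
    "%llguidance\n"
    "start: \"yes\" | \"no\"\n";

// Illustrative starts-with check for the marker.
static bool is_llguidance_grammar(const std::string & g) {
    return g.rfind("%llguidance", 0) == 0;
}

int main() {
    std::printf("use llguidance sampler: %s\n",
                is_llguidance_grammar(grammar) ? "yes" : "no");
}
```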
0cec062a63 llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp for templates that do not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add edge template

* fix chat template

* fix conflict

* fix conflict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-02 09:48:46 +02:00
53debe6f3c ci: use sccache on windows HIP jobs (#11553) b4611 2025-02-01 18:22:38 +00:00
cfd74c86db sync: minja (418a2364b5) (#11574) b4610 2025-02-01 12:24:51 +00:00
ecef206ccb Implement s3:// protocol (#11511)
For those that want to pull from s3

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4609
2025-02-01 10:30:54 +00:00
5bbc7362cb ci: simplify cmake build commands (#11548) b4608 2025-02-01 00:01:20 +00:00
aa6fb13213 ci: use sccache on windows instead of ccache (#11545)
* Use sccache on ci for windows

* Detect sccache in cmake
b4607
2025-01-31 17:12:40 +00:00
a83f528688 tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)
* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4606
2025-01-31 14:15:25 +00:00
b1bcd309fc fix stop regression (#11543) b4605 2025-01-31 13:48:31 +00:00
5783575c9d Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533) b4604 2025-01-31 08:24:29 +00:00
4a2b196d03 server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531) b4603 2025-01-31 10:12:40 +02:00
1bd3047a93 common: Add missing va_end (#11529)
The va_copy man page states that va_end must be called to revert
whatever the copy did. For some implementations, not calling va_end
has no consequences. For others it could leak memory.
2025-01-31 07:58:55 +02:00
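A minimal standalone example of the pattern the commit enforces (not the actual llama.cpp code): every va_copy, like every va_start, must be paired with a va_end before the function returns.

```cpp
#include <cstdarg>
#include <cstdio>

static int measure_then_print(const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);

    va_list args_copy;
    va_copy(args_copy, args);                         // copy so the list can be walked twice
    int len = vsnprintf(nullptr, 0, fmt, args_copy);  // first pass: measure
    va_end(args_copy);                                // required pairing per the va_copy man page

    vprintf(fmt, args);                               // second pass: print
    va_end(args);
    return len;
}

int main() {
    measure_then_print("%s %d\n", "answer:", 42);
}
```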
a2df2787b3 server : update help metrics processing/deferred (#11512)
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.

Currently the returned metrics look like this:
```console
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

With this commit, the metrics will look like this:
```console
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
b4601
2025-01-31 06:04:53 +01:00
553f1e46e9 ci: ccache for all github workflows (#11516) b4600 2025-01-30 22:01:06 +00:00