Commit Graph

4841 Commits

SHA1 Message Date
952feedfca context : disable encoder embd tensor for now
ggml-ci
2025-02-27 15:07:10 +02:00
4efe989886 context : pass embeddings tensor from encoder to decoder
ggml-ci
2025-02-25 16:11:17 +02:00
e2b3294f2c context : fix enc-dec state save/load
ggml-ci
2025-02-25 12:14:34 +02:00
e5bc5f8e02 context : enc-dec is now working
ggml-ci
2025-02-25 12:10:34 +02:00
be58e30017 enc-dec : compose wip
ggml-ci
2025-02-24 18:12:24 +02:00
9cd78f11a1 context : explicit llama_context_i abstract interface
ggml-ci
2025-02-24 13:38:11 +02:00
4a1054b552 context : reuse built_attn_mha
ggml-ci
2025-02-24 11:29:52 +02:00
a5a85a3bc0 context : fix recurrent reserve
ggml-ci
2025-02-24 08:59:12 +02:00
0699a44c83 context : remove redundant virtual, protected -> private
ggml-ci
2025-02-23 20:02:11 +02:00
6378112cb5 graph : remove the build_kv_... API from llama_graph_i
ggml-ci
2025-02-23 19:39:22 +02:00
372fa3a894 cont : enc should work now, next is dec
ggml-ci
2025-02-23 12:20:23 +02:00
f5e80208c5 wip enc-dec 2025-02-21 19:17:47 +02:00
c4c0a4d13c Merge branch 'master' into gg/llama-kv-cache 2025-02-21 19:14:07 +02:00
51f311e057 llama : skip loading unused tensors (#12004)
* llama : assign unknown/unused tensors to host buffer type

ggml-ci

* llama : skip unused tensors

ggml-ci
b4753
2025-02-21 18:33:18 +02:00
3753b30d65 context : fix n_outputs init
ggml-ci
2025-02-21 15:53:26 +02:00
f588a70da3 context : wrap input tensors in struct
ggml-ci
2025-02-21 15:09:28 +02:00
ebf1bdf97b context : add logs
ggml-ci
2025-02-21 14:35:23 +02:00
586d5fe6eb doc: update contributing guidelines [no ci] (#11969) 2025-02-21 12:51:25 +01:00
548c230dff graph : remove worst_case from the API
ggml-ci
2025-02-21 13:29:25 +02:00
ecc8e3aeff CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984)
* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b4751
2025-02-21 12:21:05 +01:00
2645a7d9a9 context : add save/load for recurrent context
ggml-ci
2025-02-21 10:28:42 +02:00
0b3863ff95 MUSA: support ARM64 and enable dp4a etc. (#11843)
* MUSA: support ARM64 and enable __dp4a etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-21 09:46:23 +02:00
ee02ad02c5 clip : fix visual encoders with no CLS (#11982)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
b4749
2025-02-21 08:11:03 +02:00
08011c2ca1 context : add llama_kv_cache_recurrent prototype
ggml-ci
2025-02-20 20:55:13 +02:00
c392e5094d server (webui): Fix Premature Submission During IME Conversion (#11971)
* fix skip ime composing

* fix npm rebuild

* fix warn

---------

Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-20 19:43:22 +01:00
ad870c49f4 context : fix causal input for cache-less case
ggml-ci
2025-02-20 20:01:02 +02:00
b1554be1d7 context : add cache-less llama_context
ggml-ci
2025-02-20 18:30:04 +02:00
c5d91a7400 ggml-cpu: Add CPU backend support for KleidiAI library (#11390)
* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environment variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Stop appending GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
b4747
2025-02-20 15:06:51 +02:00
072280ea6b Merge branch 'master' into gg/llama-kv-cache
ggml-ci
2025-02-20 14:26:43 +02:00
4806498bf1 ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)
* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved formatting of code in ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespace

* style : ptr spacing

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4746
2025-02-20 12:08:32 +02:00
0d559580a0 run : add --chat-template-file (#11961)
Relates to: https://github.com/ggml-org/llama.cpp/issues/11178

Added a --chat-template-file CLI option to llama-run. If specified, the file
is read and its content is passed to common_chat_templates_from_model to
override the model's chat template (see the usage sketch after this entry).

Signed-off-by: Michael Engel <mengel@redhat.com>
b4745
2025-02-20 10:35:11 +02:00
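A minimal usage sketch for the option described above (the template file name and model reference are illustrative placeholders, not taken from the commit):

```console
$ llama-run --chat-template-file ./my-template.jinja granite-code
```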
d04e7163c8 doc: add links to ggml examples [no ci] (#11958) 2025-02-19 20:45:17 +01:00
f95b04a21c model : fix order kvq -> qkv
ggml-ci
2025-02-19 18:52:20 +02:00
2eacb4c1bf graph : simplify attention api
ggml-ci
2025-02-19 18:43:49 +02:00
e17e4b72d1 context : add llama_context_recurrent
ggml-ci
2025-02-19 16:07:27 +02:00
5f11a5502a kv-cache : remove llama_kv_cache_i 2025-02-19 14:36:27 +02:00
d07c621393 common : add llama.vim preset for Qwen2.5 Coder (#11945)
This commit adds a preset for llama.vim to use the default Qwen 2.5
Coder models.

The motivation for this change is to make it easier to start a server
suitable to be used with the llama.vim plugin. For example, the server
can be started with a command like the following:
```console
$ llama-server --fim-qwen-1.5b-default
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10932
b4743
2025-02-19 12:29:52 +01:00
abd4d0bc4f speculative : update default params (#11954)
* speculative : update default params

* speculative : do not discard the last drafted token
b4742
2025-02-19 13:29:42 +02:00
9626d9351a llama : fix indentation in llama-grammar [no ci] (#11943)
This commit adjusts the indentation for the functions `parse_sequence`
and `parse_rule` in src/llama-grammar.cpp.

The motivation is consistency and improved readability.
2025-02-19 06:16:23 +01:00
b58934c183 server : (webui) Enable communication with parent html (if webui is in iframe) (#11940)
* Webui: Enable communication with parent html (if webui is in iframe):
- Listens for "setText" command from parent with "text" and "context" fields. "text" is set in inputMsg, "context" is used as hidden context on the following requests to the llama.cpp server
- On pressing the Escape button, sends the command "escapePressed" to the parent

Example handling from the parent html side:
- Send command "setText" from parent html to webui in iframe:
const iframe = document.getElementById('askAiIframe');
if (iframe) {
	iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, '*');
}

- Listen for Escape key from webui on parent html:
// Listen for escape key event in the iframe
window.addEventListener('keydown', (event) => {
	if (event.key === 'Escape') {
		// Process case when Escape is pressed inside webui
	}
});

* Move the extraContext from storage to app.context.

* Fix formatting.

* add Message.extra

* format + build

* MessageExtraContext

* build

* fix display

* rm console.log

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-18 23:01:44 +01:00
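For reference, a rough sketch of the webui side of the iframe protocol described in the entry above; the state variables and exact handling here are illustrative assumptions, not the actual implementation:

```ts
// Illustrative stand-ins for the webui state (the real code keeps the extra context in app.context).
let inputMsg = '';
let extraContext = '';

// Handle the "setText" command posted by the parent page.
window.addEventListener('message', (event: MessageEvent) => {
  const data = event.data ?? {};
  if (data.command === 'setText') {
    inputMsg = data.text ?? '';        // shown in the input box
    extraContext = data.context ?? ''; // sent as hidden context with later requests to the llama.cpp server
  }
});

// Notify the parent page when Escape is pressed inside the webui.
window.addEventListener('keydown', (event: KeyboardEvent) => {
  if (event.key === 'Escape') {
    window.parent.postMessage({ command: 'escapePressed' }, '*');
  }
});
```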
f5cedbcaaa kv-cache : prepare for abstraction
ggml-ci
2025-02-18 21:28:58 +02:00
63e489c025 tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900)
* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init returns a unique_ptr to an opaque type

* addressed clang-tidy lints in [test-]chat.*

* rm minja deps from util & common & move it to common/minja/

* add name & tool_call_id to common_chat_msg

* add common_chat_tool

* added json <-> tools, msgs conversions to chat.h

* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)

* fix deepseek r1 slow test (no longer <think> opening w/ new template)

* allow empty tools w/ auto + grammar

* fix & test server grammar & json_schema params w/ & w/o --jinja
b4739
2025-02-18 18:03:23 +00:00
63ac128563 server : add TEI API format for /rerank endpoint (#11942)
* server : add TEI API format for /rerank endpoint

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix

* also gitignore examples/server/*.gz.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4738
2025-02-18 14:21:41 +01:00
2bffc2d514 model : pass llama_graph_i as ptr
ggml-ci
2025-02-18 14:57:26 +02:00
9e50456e19 context : minor simplify
ggml-ci
2025-02-18 14:53:02 +02:00
befe14f06f llama : reorder encode/decode in sources 2025-02-18 14:47:53 +02:00
bc6f187e9c cont : use returned tensors from the graph build
ggml-ci
2025-02-18 14:24:17 +02:00
172f61690c cont : return important tensors
ggml-ci
2025-02-18 13:48:43 +02:00
c23590319a graph : add llama_graph_result
ggml-ci
2025-02-18 13:48:21 +02:00
5137da7b8c scripts: corrected encoding when getting chat template (#11866) (#11907)
Signed-off-by: MoonRide303 <moonride303@gmail.com>
2025-02-18 10:30:16 +01:00