Commit Graph

5076 Commits

Author SHA1 Message Date
7ae33a616f llama : add Falcon3 support (#10883)
* Add Falcon3 model support

* Add fix for adding bos to added special tokens

* Add comment explaining the logic behind the if statement

* Add a log message to better track the when the following line of code is triggered

* Update log to only print when input and output characters are different

* Fix handling pre-normalized tokens

* Refactoring
b4376
2024-12-23 00:09:58 +02:00
ebdee9478c vulkan: build fixes for 32b (#10927)
* vulkan: build fixes for 32b

Should fix #10923

* vulkan: initialize some buffer/offset variables
b4375
2024-12-22 10:44:01 +01:00
5cd85b5e00 convert : add BertForMaskedLM (#10919) 2024-12-21 10:10:18 +02:00
a91a41364b vulkan: optimize coopmat2 dequant functions (#10855)
Change the code to do 16b loads when possible and extract the appropriate
component late, so the code is effectively decoding a pair of elements and
then selecting one. This can allow more commoning to happen in the compiler
when neighboring elements are loaded.
2024-12-21 08:04:45 +01:00
e34c5af43f ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (#10874)
* ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml-cpu: format code

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b4372
2024-12-21 00:33:37 +01:00
eb5c3dc64b SYCL: Migrate away from deprecated ggml_tensor->backend (#10840)
* Migrate to tensor->buffer for checking backend buffer type: 1

* SYCL: common.cpp try to migrate away from tensor->backend

* SYCL: fix assertions and add proper comments

* SYCL: remove extra space

* SYCL: Add back static to ggml_backend_buffer_is_sycl_split function

* SYCL: Add pragma directive to suppress warning spam

* SYCL: Integrate debug logs with GGML_LOG and other fixes

* Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes"

This reverts commit 2607b7de0f.
Let's keep the current SYCL specific logging mechanism for now

* SYCL: Use GGML_SYCL_DEBUG after reverting

* SYCL: reg_get_proc_address func, update to the current func signature

* SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d
b4371
2024-12-20 23:31:28 +08:00
0ca416c91a server : (UI) fix copy to clipboard function (#10916) 2024-12-20 14:12:06 +01:00
21ae3b9be8 ggml : add test for SVE and disable when it fails (#10906) b4369 2024-12-20 13:31:28 +01:00
0a11f8b7b5 convert : fix RWKV v6 model conversion (#10913)
* Enable --no-context-shift for llama-perplexity example

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV 6: Fix error in ggml_cuda_op_bin_bcast

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
b4368
2024-12-20 11:44:58 +02:00
d408bb9268 clip : disable GPU support (#10896)
ggml-ci
b4367
2024-12-19 18:47:15 +02:00
5cab3e4aaa llama : minor grammar refactor (#10897)
ggml-ci
b4366
2024-12-19 17:42:13 +02:00
36319dec5d tts : small QoL for easy model fetch (#10903) b4365 2024-12-19 17:35:15 +02:00
57bb2c40cd server : fix logprobs, make it OAI-compatible (#10783)
* server : fix logprobs, make it openai-compatible

* update docs

* add std::log

* return pre-sampling p

* sort before apply softmax

* add comment

* fix test

* set p for sampled token

* update docs

* add --multi-token-probs

* update docs

* add `post_sampling_probs` option

* update docs [no ci]

* remove --multi-token-probs

* "top_probs" with "post_sampling_probs"

* resolve review comments

* rename struct token_prob to prob_info

* correct comment placement

* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
a3c33b1dce ggml: fix arm build with gcc (#10895)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b4363
2024-12-19 14:20:41 +01:00
2fffc52b50 llama : fix Roberta embeddings (#10856)
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fixes to position embeddings

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* map roberta-bpe to gpt-2

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix linting

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
b4362
2024-12-19 15:04:51 +02:00
7585edbdeb convert : Add support for Microsoft Phi-4 model (#10817)
* convert : use GPT2 vocab for Phi-4 model

* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models

* llama : do not use sliding window attention mask for Phi-4 model

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
b4361
2024-12-19 10:37:12 +01:00
cd920d0ac3 tests: disable GGUF test for bad value size (#10886) b4360 2024-12-19 08:53:58 +01:00
7909e8588d llama-run : improve progress bar (#10821)
Set default width to whatever the terminal is. Also fixed a small bug around
default n_gpu_layers value.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4359
2024-12-19 03:58:00 +01:00
9177484f58 ggml : fix arm build (#10890)
* ggml: GGML_NATIVE uses -mcpu=native on ARM

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml: Show detected features with GGML_NATIVE

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* remove msvc support, add GGML_CPU_ARM_ARCH option

* disable llamafile in android example

* march -> mcpu, skip adding feature macros

ggml-ci

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
b4358
2024-12-18 23:21:42 +01:00
0bf2d10c55 tts : add OuteTTS support (#10784)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : be explicit about the pooling type in the tests

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* llama : add OuteTTS support (wip)

* wip

* extract features

* first conv

* group norm

* resnet conv

* resnet

* attn

* pos net

* layer norm

* convnext

* head

* hann window

* fix n_embd + remove llama.cpp hacks

* compute hann window

* fft

* spectrum processing

* clean-up

* tts : receive input text and generate codes

* clip : fix new conv name

* tts : minor fix

* tts : add header + minor fixes

ggml-ci

* tts : add matchematical constant

ggml-ci

* tts : fix sampling + cut initial noise

* tts : fixes

* tts : update default samplers

ggml-ci

* tts : text pre-processing

* tts : outetts-voc -> wavtokenizer-dec

* tts : remove hardcoded constants

ggml-ci

* tts : fix tensor shapes

* llama : refactor wavtokenizer tensors

ggml-ci

* cont

ggml-ci

* cont [no ci]

* llama : update WavTokenizer to non-causal attn

* llama : handle no-vocab detokenization

* tts : add Python example for OuteTTS (wip)

* tts : extend python example to generate spectrogram

ggml-ci

* server : fix rebase artifacts

* tts : enable "return_tokens" in Python example

ggml-ci

* tts : minor fixes

* common : support HF download for vocoder
b4357
2024-12-18 19:27:21 +02:00
7bbb5acf12 server: avoid overwriting Authorization header (#10878)
* server: avoid overwriting Authorization header

If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.

Fixes https://github.com/ggerganov/llama.cpp/issues/10854

* rebuild

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-18 15:00:07 +01:00
152610eda9 server : output embeddings for all tokens when pooling = none (#10861)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : update readme [no ci]

* server : fix spacing [no ci]

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* server : be explicit about the pooling type in the tests

ggml-ci

* server : update /embeddings and /v1/embeddings endpoints

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* server : update readme

ggml-ci

* server : fixes

* tests : update server tests

ggml-ci

* server : update readme [no ci]

* server : remove rebase artifact

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
0e70ba686e server : add "tokens" output (#10853)
* server : add "tokens" output

ggml-ci

* server : update readme

ggml-ci

* server : return tokens ids only if requested

ggml-ci

* tests : improve "tokens" type check

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* server : remove "tokens" from the OAI endpoint

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
b4354
2024-12-18 11:05:29 +02:00
46828872c3 server : (embeddings) using same format for "input" and "content" (#10872)
* server : (embeddings) using same format for "input" and "content"

* fix test case

* handle empty input case

* fix test
b4353
2024-12-18 10:55:09 +02:00
6b064c92b4 docs: Fix HIP (née hipBLAS) in README (#10880)
Related to #10524 / be0e350c references to hipBLAS have been removed
across the repository.  This fixes the link from the repositories
`README.md`.

Signed-off-by: Brian 'redbeard' Harrington <redbeard@dead-city.org>
2024-12-18 10:35:00 +02:00
4da69d1abd Revert "llama : add Falcon3 support (#10864)" (#10876)
This reverts commit 382bc7f2e8.
b4351
2024-12-18 01:36:46 +01:00
d62b532c52 Use model->gguf_kv for loading the template instead of using the C API. (#10868)
* Bump model_template to 16384 bytes to support larger chat templates.

* Use `model->gguf_kv` for efficiency.
b4350
2024-12-17 23:24:22 +01:00
081b29bd2a tests: add tests for GGUF (#10830) b4349 2024-12-17 19:09:35 +01:00
5437d4aaf5 sync : ggml b4348 2024-12-17 18:36:02 +02:00
78f766768d cmake : fix "amd64" processor string (whisper/2638) 2024-12-17 18:35:49 +02:00
8dd19a4812 vulkan : fix soft_max.comp division by zero (whisper/2633)
This change prevents a division by zero error when p.KY is 0.
2024-12-17 18:35:49 +02:00
130d0c90bd ggml : remove return from ggml_gallocr_allocate_node (ggml/1048)
This commit removes the return statement from ggml_gallocr_allocate_node
function.

The motivation behind this change is to make the code more readable and
consistent.
2024-12-17 18:35:49 +02:00
3919da8e33 ggml : add check for grad_accs (ggml/1046)
* ggml : add check for grad_accs

This commit adds a check for grad_accs in ggml_graph_get_grad and
ggml_graph_get_grad_acc functions. This is necessary to avoid segfaults
when grad_accs is not initialized.

The motivation for this change is that I find it nice to be able to
print out a computation graph using ggml_graph_print but this function
segfaults when grad_accs is not initialized:
```console
(gdb) p g1
$2 = (ggml_cgraph *) 0x7ffff66004b0
(gdb) p *g1
$3 = {size = 2048, n_nodes = 1, n_leafs = 2, nodes = 0x7ffff6600500,
grads = 0x0, grad_accs = 0x0, leafs = 0x7ffff6604500,
visited_hash_set = {size = 4099, used = 0x7ffff6610518,
keys = 0x7ffff6608500}, order = GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT}
(gdb) p ggml_graph_print(g1)
=== GRAPH ===
n_nodes = 1

Program received signal SIGSEGV, Segmentation fault.
0x0000555555579775 in ggml_graph_get_grad
(cgraph=0x7ffff66004b0,node=0x7ffff6600340)
    at /ggml/ggml/src/ggml.c:5990
5990  return igrad != GGML_HASHSET_FULL &&
          ggml_bitset_get(cgraph->visited_hash_set.used, igrad) ?
          cgraph->grads[igrad] : NULL;
```

* squash! ggml : add check for grad_accs

Fix the check in ggml_graph_get_grad. The check was incorrectly using
cgraph->grad_accs instead of cgraph->grads.
2024-12-17 18:35:48 +02:00
0006f5a74a ggml : update ggml_backend_cpu_device_supports_op (#10867)
* ggml : fix cpy op for IQ-quants to use reference impl

ggml-ci

* ggml : disable tests involving i-matrix quantization

* ggml : update ggml_backend_cpu_device_supports_op

ggml-ci
b4343
2024-12-17 18:35:42 +02:00
05c3a444b8 server : fill usage info in embeddings and rerank responses (#10852)
* server : fill usage info in embeddings response

* server : fill usage info in reranking response
b4342
2024-12-17 18:00:24 +02:00
382bc7f2e8 llama : add Falcon3 support (#10864) b4341 2024-12-17 17:24:56 +02:00
4f51968aca readme : update typos (#10863) 2024-12-17 11:47:20 +02:00
227d7c5a7f server : (UI) fix missing async generator on safari (#10857)
* server : (UI) fix missing async generator on safari

* fix
2024-12-17 09:52:09 +01:00
Eve
7b1ec53f56 vulkan: bugfixes for small subgroup size systems + llvmpipe test (#10809)
* ensure mul mat shaders work on systems with subgroup size less than 32

more fixes

add test

* only s_warptile_mmq needs to be run with 32 threads or more
b4338
2024-12-17 06:52:55 +01:00
160bc039c8 rwkv6: add wkv6 support for Vulkan backend (#10829)
* rwkv_wkv6 vulkan shader

* RWKV_WKV6 Vulkan op tests passed

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* add [[unroll]] and remove unnecessary conditions

* add uma support

* fix erros in EditorConfig Checker

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Molly Sophia <mollysophia379@gmail.com>
b4337
2024-12-16 22:00:46 +01:00
08ea539df2 unicode : improve naming style (#10838)
* unicode : improve naming style

ggml-ci

* cont [no ci]
2024-12-16 12:31:45 +02:00
644fd71b44 sampling : refactor + optimize penalties sampler (#10803)
* sampling : refactor + optimize penalties sampler

ggml-ci

* common : apply ignore_eos as logit bias

ggml-ci

* batched : remove penalties sampler

* params : allow penalty_last_n == -1 to be equal to context size

ggml-ci

* common : by default, move the penalties at the end of the sampling chain

ggml-ci

* common : ignore all EOG tokens

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* common : move back the penalties at the front of the sampling chain

ggml-ci

* readme : restore hint about --ignore-eos flag [no ci]

* llama : minor

ggml-ci

* webui : update

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
4ddd199f6f llava : Allow locally downloaded models for QwenVL (#10833)
* Allow locally downloaded models for QwenVL

* Define model_path

* rm trailing space

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-15 21:43:25 +01:00
a0974156f3 llama : add Deepseek MoE v1 & GigaChat models (#10827)
* Add deepseek v1 arch & gigachat template

* improve template code

* add readme

* delete comments

* remove comment

* fix format

* lint llama.cpp

* fix order of deepseek and deepseek2, move gigachat temlate to the end of func

* fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need

* remove comments

* move deepseek above deepseek2

* change placement of gigachat chat template
b4333
2024-12-15 19:02:46 +02:00
87cf323cef scripts : change build path to "build-bench" for compare-commits.sh (#10836) 2024-12-15 18:44:47 +02:00
5478bbcd17 server: (UI) add syntax highlighting and latex math rendering (#10808)
* add code highlighting and math formatting

* code cleanup

* build public/index.html

* rebuild public/index.html

* fixed coding style

* fixed coding style

* style fixes

* highlight: smaller bundle size, fix light & dark theme

* remove katex

* add bundle size check

* add more languages

* add php

* reuse some langs

* use gzip

* Revert "remove katex"

This reverts commit c0e5046acc.

* use better maintained @vscode/markdown-it-katex

* fix gzip non deterministic

* ability to add a demo conversation for dev

* fix latex rendering

* add comment

* latex codeblock as code

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4331
2024-12-15 12:55:54 +01:00
b5ae1ddff9 gguf-py : bump to v0.13.0 gguf-v0.13.0 2024-12-15 13:16:42 +02:00
89d604f2c8 server: Fix has_next_line in JSON response (#10818)
* Update server JSON response.

* Add unit test to check `has_new_line` JSON response

* Remove `has_new_line` unit test changes.

* Address code review comment: type check for `has_new_line` in unit test
gguf-v0.12.0 b4329
2024-12-14 23:29:45 +01:00
e52aba537a nix: allow to override rocm gpu targets (#10794)
This allows to reduce compile time when you are building for a single GPU.
2024-12-14 10:17:36 -08:00
ba1cb19cdd llama : add Qwen2VL support + multimodal RoPE (#10361)
* Barebone Qwen2VL LLM convertor

* Add Qwen2VL cli entrypoint

* [WIP] add qwen2vl arch

* Verify m-rope output

* Add vl-rope/2d-rope support for qwen2vl ViT

* update qwen2vl cli tool

* update 5D tensor op workaround

* [WIP] qwen2vl vision model

* make batch and clip utils compatible with qwen2vl

* [WIP] create inference workflow, gguf convert script but fix

* correcting vision-rope behavior, add the missing last layer back to ViT

* add arg parser to qwen2vl_surgery

* replace variable size array with vector

* cuda-gdb cmake preset

* add fp32 mrope, vision rope kernel

* add fp16 support for qwen2vl and m-rope

* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`

* fix rope op mode switching, out dated func args

* update `llama_hparams`

* update to keep up stream changes

* resolve linter, test errors

* add makefile entry, update speical image padding token

* add mrope unit test, fix few compiler warnings

* rename `mrope` related function, params

* minor updates on debug util, bug fixs

* add `m-rope` testcase to `test-backend-ops`

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix traililng whitespce

* store `llama_hparams.rope_sections` with fixed size array

* update position id tensor size check in GGML_OP_ROPE

* minor updates

* update `ggml_backend_*_supports_op` of unsupported backends

* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4327
2024-12-14 14:43:46 +02:00