Commit Graph

1656 Commits

Author SHA1 Message Date
2994f0c5a2 decode : fix logits_valid for legacy API (#4516) b1656 2023-12-17 19:39:02 -05:00
b1306c4394 readme : update hot topics 2023-12-17 20:16:23 +02:00
800a489e4a llama.swiftui : add bench functionality (#4483)
* llama.swiftui : add bench button

* llama.swiftui : initial bench functionality

* force to use n_gpu_layers on simulator

* add download buttons & expose llamaState.loadModel

* update project.pbxproj

* comment #Preview & fix editorconfig check

* gitignore : xcode stuff

* llama.swiftui : UX improvements

* llama.swiftui : avoid data copy via "downloadTask"

* llama.swiftui : remove model from project

* llama : remove "mostly" from model infos

* llama.swiftui : improve bench

---------

Co-authored-by: jhen <developer@jhen.me>
b1654
2023-12-17 19:38:41 +02:00
f7f468a97d gguf-py : fail fast on nonsensical special token IDs (#4489) 2023-12-17 10:45:46 -05:00
919c40660f build : Check the ROCm installation location (#4485)
* build : Check the ROCm installation location

* more generic approach

* fixup! It was returning the path instead of the command output

* fixup! Trailing whitespace
b1652
2023-12-17 17:23:33 +02:00
45668633fd finetune : keep allocs alive until all allocations are done (#4486) b1651 2023-12-17 16:05:56 +01:00
0ffc92d2d2 server : disable llm logs if SERVER_VERBOSE is off (#3792) b1650 2023-12-17 17:02:16 +02:00
8edd2b40fd server : fix grammar being ignored (#4494)
Fix bug in identifying the grammar.
b1649
2023-12-17 16:57:56 +02:00
eb16dae7e7 server : fix possible ambiguity in content type charset (#4501) b1648 2023-12-17 16:56:09 +02:00
62bd52b7bf server : allow requests larger than 8K (#4500) b1647 2023-12-17 16:54:37 +02:00
5daa5f54fd Link to cublas dynamically on Windows even with LLAMA_STATIC (#4506) b1646 2023-12-17 11:57:33 +01:00
c6c4fc081c lora : add support for non-llama models (#3333)
* lora : add support for non-llama models

ggml-ci

* avoid leaking ggml_context on failure
cleanup

ggml-ci

* lora : allow 1d tensors

* lora : include embd and output layers in size calculation

* fix style
b1645
2023-12-16 18:58:46 +01:00
8a5be3bd58 llama : sanity checks for access to logits (#4274)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1644
2023-12-15 22:16:15 -05:00
88ae8952b6 server : add optional API Key Authentication example (#4441)
* Add API key authentication for enhanced server-client security

* server : to snake_case

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1643
2023-12-15 13:49:01 +02:00
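The commit above adds bearer-token authentication to the example server. As a rough illustration of the kind of check involved (a conceptual sketch with illustrative names, not the server's actual code), a request is accepted only when its Authorization header carries the configured key:

    #include <string>

    // Conceptual sketch: accept a request only if its "Authorization" header is
    // "Bearer <api_key>". An empty configured key means authentication is disabled.
    bool is_authorized(const std::string & authorization_header, const std::string & api_key) {
        if (api_key.empty()) {
            return true; // no key configured -> auth disabled
        }
        const std::string prefix = "Bearer ";
        return authorization_header.compare(0, prefix.size(), prefix) == 0 &&
               authorization_header.substr(prefix.size()) == api_key;
    }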
ee4725a686 ggml : group mul_mat_id rows by matrix (cpu only) (#4480)
* ggml : group mul_mat_id rows by matrix (cpu only)

* remove mmid parameters from mm forward

* store row groups in wdata and calculate only once in GGML_TASK_INIT

ggml-ci
b1642
2023-12-15 12:45:50 +01:00
6744dbe924 ggml : use ggml_row_size where possible (#4472)
* ggml : use ggml_row_size where possible

ggml-ci

* ggml : move ggml_nbytes_split to ggml-cuda.cu
b1641
2023-12-14 20:05:21 +01:00
cafcd4f895 ggml : remove n_dims from ggml_tensor (#4469)
ggml-ci
b1640
2023-12-14 16:52:08 +01:00
c50e400163 py : add protobuf dependency (#4466) 2023-12-14 14:44:49 +02:00
20a68a7030 ggml : add ggml_row_size() (fixes llama out of space) (#4461)
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values

* do not cast to size_t, instead just use doubles

* ggml : add ggml_row_size(), deprecate ggml_type_sizef()

* ggml : fix row size compute to avoid overflows

* tests : fix sizey -> sizez

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1638
2023-12-14 14:13:33 +02:00
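The precision issue described in the first bullet above can be reproduced in isolation. The following is a standalone sketch (assuming a Q4_0-like layout of 18 bytes per 32-element block; it is not the actual ggml code) contrasting a float bytes-per-element computation, as the deprecated ggml_type_sizef() encouraged, with the integer block arithmetic that ggml_row_size() performs:

    #include <cstdint>
    #include <cstdio>

    // Assumed Q4_0-like layout: 32 elements per block, 18 bytes per block.
    constexpr int64_t kBlockSize  = 32;
    constexpr int64_t kBlockBytes = 18;

    // Old style: a float bytes-per-element factor; large element counts lose
    // precision when converted to float, so the result can be off by a few bytes.
    size_t row_size_float(int64_t ne) {
        const float bytes_per_elem = (float) kBlockBytes / kBlockSize; // 0.5625f
        return (size_t) (bytes_per_elem * ne);
    }

    // ggml_row_size style: pure integer arithmetic, exact for any ne that is a
    // multiple of the block size.
    size_t row_size_int(int64_t ne) {
        return (size_t) (ne / kBlockSize) * kBlockBytes;
    }

    int main() {
        const int64_t ne = 7000000032; // large, block-aligned element count
        printf("float path: %zu bytes\n", row_size_float(ne)); // slightly off
        printf("int   path: %zu bytes\n", row_size_int(ne));   // 3937500018
        return 0;
    }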
55e87c3749 ggml : fix OpenCL broadcast requirement for ggml_mul (close #4453) b1637 2023-12-14 10:35:29 +02:00
873637afc7 convert : support loading vocab from fast tokenizer config (#3633)
* Add HFVocab into convert.py

* Update convert.py

* Update convert.py

* add bytes_to_unicode function

* change add_meta_vocab function

* remove debug code

* remove byte_encoder

* Add newline between classes

* Check tokenizer.json when tokenizer.model does not exist.

* Move transformers dependency to local code

* Add error context with 'raise from'

* Add fast tokenizer option to BpeVocab

* Update convert.py

* Add VocabLoader and remove *Vocab class

* Add transformers dependency

* remove added tokens and check newline token to decide spm or bpe

* Update convert.py

* Add special token type

* Update convert.py

* Update convert.py

* Update convert.py

* Fix typo in convert.py

* Fix when params.n_vocab < tokenizer vocab size

* update vocab class

* change function name

* Remove unused variables/functions, add types to class variables and methods, delete blank lines

* fix flake8 warnings

* code style cleanup

* make mypy happy

* change exception

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2023-12-14 10:09:34 +02:00
0353a18401 readme : update supported model list (#4457) 2023-12-14 09:38:49 +02:00
948ff137ec server : fix handling of characters that span multiple tokens when streaming (#4446) b1634 2023-12-13 21:57:15 +02:00
4d98d9a656 sync : ggml (SD ops, tests, kernels) (#4444)
* sync : ggml (SD ops, tests, kernels)

ggml-ci

* cuda : restore im2col

ggml-ci

* metal : fix accuracy of dequantization kernels

ggml-ci

* cuda : restore correct im2col

ggml-ci

* metal : try to fix moe test by reducing expert size

ggml-ci

* cuda : fix bin bcast when src1 and dst have different types

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
b1633
2023-12-13 21:54:54 +02:00
70f806b821 build : detect host compiler and cuda compiler separately (#4414) b1632 2023-12-13 12:10:10 -05:00
9fb13f9584 common : add --version option to show build info in CLI (#4433) b1631 2023-12-13 14:50:14 +02:00
113f9942fc readme : update hot topics 2023-12-13 14:05:38 +02:00
799a1cb13b llama : add Mixtral support (#4406)
* convert : support Mixtral as LLAMA arch

* convert : fix n_ff typo

* llama : model loading

* ggml : sync latest ggml_mul_mat_id

* llama : update graph to support MoE

* llama : fix cur -> cur_expert

* llama : first working version

* llama : fix expert weighting in the FFN

* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)

* ggml : add n_as argument to ggml_mul_mat_id

* ggml : fix ggml_get_rows to take into account ne02 / ne11

* metal : add more general support for ggml_get_rows + tests

* llama : add basic support for offloading moe with CUDA

* metal : add/mul/div use general kernel when src1 not cont

* metal : reduce the kernel launches for ggml_mul_mat_id

* ggml : get_rows : support non-contiguous tensors with gaps, generalize up to 3D

* ggml : update get_rows f16 and q

* cuda : support non-contiguous src1 in get_rows

* llama : offload missing ffn_moe_silu

* metal : fix ggml_get_rows to work with non-cont src1

* metal : add indirect mat-vec kernels for all quantization types

* llama : do not quantize expert gating tensors

* llama : add n_expert and n_expert_used to hparams + change quants

* test-backend-ops : add moe test

* cuda : fix get_rows when ncols is odd

* convert : determine n_ctx correctly

* metal : fix ggml_mul_mat_id for F32

* test-backend-ops : make experts more evenly probable (test_moe)

* test-backend-ops : cleanup, add moe test for batches

* test-backend-ops : add cpy from f32 -> all types test

* test-backend-ops : fix dequantize block offset

* llama : fix hard-coded number of experts

* test-backend-ops : simplify and disable slow tests to avoid CI timeout

* test-backend-ops : disable MOE test with thread sanitizer

* cuda : fix mul_mat_id with multi gpu

* convert : use 1e6 rope_freq_base for mixtral

* convert : fix style

* convert : support safetensors format

* gguf-py : bump version

* metal : add cpy f16 -> f32 kernel

* metal : fix binary ops for ne10 % 4 != 0

* test-backend-ops : add one more sum_rows test

* ggml : do not use BLAS with ggml_mul_mat_id

* convert-hf : support for mixtral-instruct (#4428)

* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct

* convert : use sentencepiece tokenizer for Mixtral-instruct

* convert : make flake8 happy

* metal : fix soft_max kernels

ref: 1914017863

* metal : limit kernels to not use more than the allowed threads

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
b1629
2023-12-13 14:04:25 +02:00
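For readers following the MoE-related entries above (expert weighting in the FFN, ggml_mul_mat_id, 2D ggml_get_rows), the routing they implement can be summarized in plain scalar code. This is a conceptual sketch only, not the actual llama.cpp graph construction:

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <numeric>
    #include <vector>

    using ffn_expert = std::function<std::vector<float>(const std::vector<float> &)>;

    // One token passes through only n_expert_used of the n_expert FFN experts
    // (2 of 8 for Mixtral), weighted by a softmax over the selected gate logits.
    std::vector<float> moe_ffn(const std::vector<float>      & gate_logits, // [n_expert]
                               const std::vector<ffn_expert> & experts,     // [n_expert]
                               const std::vector<float>      & x,           // token activation
                               int                             n_expert_used) {
        const int n_expert = (int) gate_logits.size();

        // select the top n_expert_used experts by gate logit
        std::vector<int> idx(n_expert);
        std::iota(idx.begin(), idx.end(), 0);
        std::partial_sort(idx.begin(), idx.begin() + n_expert_used, idx.end(),
                          [&](int a, int b) { return gate_logits[a] > gate_logits[b]; });
        idx.resize(n_expert_used);

        // softmax over the selected logits only (the expert weighting step)
        std::vector<float> w(n_expert_used);
        float sum = 0.0f;
        for (int i = 0; i < n_expert_used; ++i) {
            w[i] = std::exp(gate_logits[idx[i]] - gate_logits[idx[0]]);
            sum += w[i];
        }

        // weighted sum of the selected experts' outputs
        std::vector<float> out(x.size(), 0.0f);
        for (int i = 0; i < n_expert_used; ++i) {
            const std::vector<float> y = experts[idx[i]](x);
            for (size_t j = 0; j < out.size(); ++j) {
                out[j] += (w[i] / sum) * y[j];
            }
        }
        return out;
    }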
fecac45658 server : tweak default sampling parameters (#4367)
* Set a more typical Top P value as the default

* Update temp max
2023-12-12 12:12:35 +02:00
9494d7c477 english : use typos to fix comments and logs (#4354) b1627 2023-12-12 11:53:36 +02:00
6138963fb2 build : target Windows 8 for standard mingw-w64 (#4405)
* build : target Windows 8 for standard mingw-w64

* make : fix missing console.o deps

This was causing a link error with `make all` on Windows.
b1626
2023-12-12 11:27:26 +02:00
6391817cd1 llama : document logits_all deprecation (#4418)
llama_context_params.logits_all is a parameter for controlling
llama_eval. This documents that logits_all should not be used with
llama_decode and llama_batch.
b1625
2023-12-12 11:25:57 +02:00
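As a quick illustration of the point made in the commit above, with the llama_decode/llama_batch API the per-token logits are requested through the batch rather than through logits_all. The sketch below assumes an already loaded llama_context and omits error handling; field and function names follow llama.h from this period and may differ in other versions:

    #include "llama.h"
    #include <vector>

    // Decode a prompt and return the logits of its last token, requesting logits
    // only where needed via batch.logits instead of the legacy logits_all flag.
    std::vector<float> decode_last_logits(llama_context * ctx,
                                          const std::vector<llama_token> & prompt) {
        llama_batch batch = llama_batch_init((int) prompt.size(), /*embd*/ 0, /*n_seq_max*/ 1);

        for (size_t i = 0; i < prompt.size(); ++i) {
            batch.token   [batch.n_tokens]    = prompt[i];
            batch.pos     [batch.n_tokens]    = (llama_pos) i;
            batch.n_seq_id[batch.n_tokens]    = 1;
            batch.seq_id  [batch.n_tokens][0] = 0;
            batch.logits  [batch.n_tokens]    = i == prompt.size() - 1; // only the last token
            batch.n_tokens++;
        }

        llama_decode(ctx, batch); // 0 on success

        const float * logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
        std::vector<float> out(logits, logits + llama_n_vocab(llama_get_model(ctx)));

        llama_batch_free(batch);
        return out;
    }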
d9d4cfef64 server : fix local model name in server (#4420) b1624 2023-12-12 11:25:29 +02:00
41a11aaf99 ggml : increased GGML_MAX_PARAMS to allow finetuning of 70b models (#4424) b1623 2023-12-12 11:24:32 +02:00
8a7b2fa528 Update README.md (#4388)
Fix small typo.
2023-12-10 23:27:38 +01:00
e18f7345a3 grammar : revert the replacement of llama_token_to_piece with id_to_token (#4396) b1621 2023-12-09 23:29:27 +02:00
fe680e3d10 sync : ggml (new ops, tests, backend, etc.) (#4359)
* sync : ggml (part 1)

* sync : ggml (part 2, CUDA)

* sync : ggml (part 3, Metal)

* ggml : build fixes

ggml-ci

* cuda : restore lost changes

* cuda : restore lost changes (StableLM rope)

* cmake : enable separable compilation for CUDA

ggml-ci

* ggml-cuda : remove device side dequantize

* Revert "cmake : enable separable compilation for CUDA"

This reverts commit 09e35d04b1.

* cuda : remove assert for rope

* tests : add test-backend-ops

* ggml : fix bug in ggml_concat

* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`

* ci : try to fix macOS

* ggml-backend : remove backend self-registration

* ci : disable Metal for macOS cmake build

ggml-ci

* metal : fix "supports family" call

* metal : fix assert

* metal : print resource path

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
b1620
2023-12-07 22:26:54 +02:00
bcc0eb4591 llama : per-layer KV cache + quantum K cache (#4309)
* per-layer KV

* remove unnecessary copies

* less code duplication, offload k and v separately

* llama : offload KV cache per-layer

* llama : offload K shift tensors

* llama : offload for rest of the model arches

* llama : enable offload debug temporarily

* llama : keep the KV related layers on the device

* llama : remove mirrors, perform Device -> Host when partial offload

* common : add command-line arg to disable KV cache offloading

* llama : update session save/load

* llama : support quantum K cache (#4312)

* llama : support quantum K cache (wip)

* metal : add F32 -> Q8_0 copy kernel

* cuda : add F32 -> Q8_0 copy kernel

ggml-ci

* cuda : use mmv kernel for quantum cache ops

* llama : pass KV cache type through API

* llama : fix build

ggml-ci

* metal : add F32 -> Q4_0 copy kernel

* metal : add F32 -> Q4_1 copy kernel

* cuda : wip

* cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels

* llama-bench : support type_k/type_v

* metal : use mm kernel only for quantum KV cache

* cuda : add comment

* llama : remove memory_f16 and kv_f16 flags

---------

Co-authored-by: slaren <slarengh@gmail.com>

* readme : add API change notice

---------

Co-authored-by: slaren <slarengh@gmail.com>
b1619
2023-12-07 13:03:17 +02:00
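The cache-type parameters that the PR above threads through the API can be set when creating a context. A minimal sketch, assuming the type_k/type_v and offload_kqv fields introduced around this change (check your llama.h version):

    #include "llama.h"

    // Create a context that keeps a quantized (Q8_0) K cache and an f16 V cache,
    // with the KV cache kept on the device.
    llama_context * make_ctx_with_q8_k_cache(llama_model * model) {
        llama_context_params cparams = llama_context_default_params();

        cparams.type_k      = GGML_TYPE_Q8_0; // quantized K cache
        cparams.type_v      = GGML_TYPE_F16;  // V cache left at f16
        cparams.offload_kqv = true;           // keep the KV cache on the device
                                              // (the PR also adds a CLI toggle for this)

        return llama_new_context_with_model(model, cparams);
    }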
81bc9214a3 train : fix #4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (#4351)
On commit b1108 (44c117f4) xaedes added

    ggml_allocr * alloc = NULL;

    ... (many lines in between)

    if (alloc) {
        ggml_allocr_free(alloc);
    }

This is correct, but it's easy to lose context after many lines in between.

On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.

    alloc = ggml_allocr_new(...)
    ... (short lines of code)
    ggml_allocr_free(alloc)

This happens a few times, but alloc is never set to NULL, and many lines below,
we still have

    if (alloc) {
        ggml_allocr_free(alloc);
    }

which causes a double-free.
b1618
2023-12-07 12:25:22 +02:00
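A minimal sketch of the pattern described above and one way to defuse it, using plain malloc/free rather than the ggml allocator API:

    #include <cstdlib>

    void train_like_flow() {
        void * alloc = nullptr;

        // eager phase: allocate, use, free
        alloc = std::malloc(1024);
        // ... use alloc ...
        std::free(alloc);
        alloc = nullptr;      // without this reset, the guarded free below fires again

        // many lines later: the defensive cleanup kept from the older code
        if (alloc) {
            std::free(alloc); // safe only because alloc was reset above
        }
    }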
05cd6e5036 server : recognize cache_prompt parameter in OAI API (#4347) b1617 2023-12-06 20:21:59 +02:00
caa9249217 common : fix compile warning b1616 2023-12-06 10:41:03 +02:00
da5eaef1f3 speculative : support --color (#4343)
* speculative: add some colors

* minor : add braces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1615
2023-12-06 10:08:17 +02:00
5f6e0c0dff grammar : pre-computed pieces + reserve mem + less string copies (#4330)
* reserve space for codepoints

* improvement for the appended 0

* used precomputed token text for grammar sample

* reserve candidates_decoded

* reserve candidates_grammar

* remove candidates_decoded

* Revert "remove candidates_decoded"

This reverts commit 3773328080.

* changed decode_utf8 to take src by ref
b1614
2023-12-05 22:55:12 +02:00
5aa365d88f llama : allow overriding GGUF metadata when loading model (#4092)
* feat: Allow overriding GGUF metadata when loading model

* Fix the one time GCC is stricter than clang about something

* Step1

* Refactor... basically everything!

* Nuke obsolete GetArrayLen struct

* simplify std::string specialization

* Various cleanups

Add informational output when overrides are applied

Warn user when an override with the wrong type is specified

* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes

* llama : rearrange model params

* Update new GET_KEY call

Add note that metadata KV overrides aren't reflected in initial metadata KV info dump

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1613
2023-12-05 19:19:18 +02:00
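Conceptually, the feature above boils down to applying typed key/value overrides on top of the parsed GGUF metadata, warning when the expected type disagrees with the stored one. The sketch below uses illustrative names only (the real llama.cpp structures differ):

    #include <cstdio>
    #include <map>
    #include <string>
    #include <variant>
    #include <vector>

    using kv_value = std::variant<int64_t, double, bool>;

    struct kv_override {
        std::string key;
        kv_value    value; // the variant's alternative encodes the expected type
    };

    void apply_overrides(std::map<std::string, kv_value> & metadata,
                         const std::vector<kv_override>  & overrides) {
        for (const auto & ov : overrides) {
            auto it = metadata.find(ov.key);
            if (it != metadata.end() && it->second.index() != ov.value.index()) {
                fprintf(stderr, "warning: override for '%s' has a different type than the stored value\n",
                        ov.key.c_str());
            }
            metadata[ov.key] = ov.value; // applied even when the key is missing from the file
        }
    }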
52c8bc3cf3 sampling : custom samplers order (#4285)
* Samplers sequence order w parameter

* Cleaned commented code

* Fixed formatting

* Rewrote with unordered_map

* Revert and rewrite, too many problems and safeguards would be needed

* Fixed code style

* Code style fixes according to review

* More readable samplers input string, fixed help

* Style fix in sampler_queue

* Formatting fixes

* Fixing whitespaces
b1612
2023-12-05 12:05:51 +02:00
e4b76bbe31 swift : revert compiler checks for swift package (#4332) b1611 2023-12-05 09:29:46 +02:00
23b5e12eb5 simple : update error message for KV cache check (#4324)
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically, it removes the reference to n_parallel and
replaces it with n_len.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
b1610
2023-12-04 18:04:21 +02:00
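The check behind that message, roughly: n_len counts the prompt plus the tokens to be generated, and it must fit within the context's KV cache. A sketch with the same variable names (the exact wording in examples/simple may differ):

    #include <cstdio>

    // n_len = prompt tokens + tokens to generate; n_ctx = KV cache size in tokens
    bool kv_cache_fits(int n_ctx, int n_len) {
        if (n_len > n_ctx) {
            fprintf(stderr, "error: the required KV cache size (%d) exceeds n_ctx (%d); "
                            "reduce n_len or increase n_ctx\n", n_len, n_ctx);
            return false;
        }
        return true;
    }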
d208995c6d swift : fix concatenation method to avoid invalid UTF8 stringification (#4325) b1609 2023-12-04 18:03:49 +02:00
5c9f90cba1 swift : fix prompt tokenization logic (#4321) b1608 2023-12-04 15:43:45 +02:00
4fa44e84ad grammar-parser : fix typo (#4318)
preceeding -> preceding
b1607
2023-12-04 09:57:35 +02:00