Commit Graph

1767 Commits

SHA1 Message Date
012cf349ae server : send token probs for "stream == false" (#4714) b1767 2024-01-04 19:56:33 +02:00
a91928014f Print backend name on test-backend-ops failure (#4751) b1766 2024-01-04 09:43:23 +01:00
3c0b585561 llama.swiftui : support loading custom model from file picker (#4767)
* swiftui: support load model from file picker

* swiftui: remove trailing whitespace
b1765
2024-01-04 10:22:38 +02:00
e5804313a1 server : fix options in README.md (#4765)
* fix examples/server/README.md

* minor : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04 10:17:09 +02:00
dc891b7f7a ggml : include stdlib.h before intrin.h (#4736) b1763 2024-01-04 10:12:26 +02:00
46cea79e1f llama.swiftui : fix build of ggml.metallib (#4754)
* metal: fix metal backend init failure in swiftui

* metal: build ggml.metallib instead of copy src

* llama.swift : remove debug flags from metallib build

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04 09:58:16 +02:00
cb1e2818e0 train : fix typo in overlapping-samples help msg (#4758)
This commit fixes a typo in the help message for the
--overlapping-samples option.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
b1761
2024-01-03 19:53:40 +02:00
ece9a45e8f swift : update Package.swift to use ggml as dependency (#4691)
* updates Package.swift to use ggml as a dependency

* changes the ggml package URL source to ggerganov
b1760
2024-01-03 19:30:02 +02:00
7bed7eba35 cuda : simplify expression
Co-authored-by: slaren <slarengh@gmail.com>
b1759
2024-01-03 14:38:38 +02:00
d55356d3ba cuda : mark I16 and I32 ops as unsupported
ggml-ci
2024-01-03 14:38:38 +02:00
75e3fd8581 sync : ggml
ggml-ci
2024-01-03 14:38:38 +02:00
289313716f metal : add kernel_get_rows_i32
ggml-ci
2024-01-03 14:38:38 +02:00
ab62fc3e55 scripts : fix sync order + metal sed 2024-01-03 14:38:38 +02:00
5f66ebca9c ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)
* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03 14:38:38 +02:00
f2eb19bd8b server : throw an error when slot unavailable (#4741) 2024-01-03 10:43:19 +02:00
f3f62f0d83 metal : optimize ggml_mul_mat_id (faster Mixtral PP) (#4725)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

* metal : optimizing ggml_mul_mat_id (wip)

* metal : minor fix

* metal : opt mul_mm_id
b1752
2024-01-02 21:07:47 +02:00
0ef3ca2ac6 server : add token counts to html footer (#4738)
* server: add token counts to stats

* server: generate hpp

---------

Co-authored-by: phiharri <ph@got-root.co.uk>
b1751
2024-01-02 17:48:49 +02:00
540938f890 llama : llama_model_desc print number of experts b1750 2024-01-02 16:26:45 +02:00
0040d42eeb llama : replace all API facing int's with int32_t (#4577)
* replaced all API facing `int`'s with `int32_t`

* formatting and missed `int` in `llama_token_to_piece`
b1749
2024-01-02 16:15:16 +02:00
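The motivation, sketched below (signatures approximate, following the PR notes): fixed-width types keep the public ABI identical regardless of the platform's `int` width.

```cpp
#include <cstdint>

struct llama_model;          // opaque handle, as in llama.h
typedef int32_t llama_token;

// before: `int` is only guaranteed to be at least 16 bits wide
// int llama_token_to_piece(const llama_model * model, llama_token token,
//                          char * buf, int length);

// after: int32_t is exactly 32 bits on every platform
int32_t llama_token_to_piece(const llama_model * model, llama_token token,
                             char * buf, int32_t length);
```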
83e633c27e llama : differentiate the KV dims in the attention (#4657)
* Add n_key_dim and n_value_dim

Some models use values that are not derived from `n_embd`.
Also remove `n_embd_head` and `n_embd_gqa` because it is not clear
which "head" is referred to (key or value).

Fix issue #4648.

* Fix `llm_build_kqv` to use `n_value_gqa`

* Rebase

* Rename variables

* Fix llm_build_kqv to be more generic wrt n_embd_head_k

* Update default values for n_embd_head_k and n_embd_head_v

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix llm_load_tensors: the asserts were not backwards compatible

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1748
2024-01-02 13:51:28 +02:00
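A rough sketch of the separation (member names follow the PR discussion; the exact layout in llama.cpp may differ): keys and values get independent per-head dimensions instead of both being derived from `n_embd`.

```cpp
#include <cstdint>

struct hparams_sketch {
    uint32_t n_head;        // attention heads for Q
    uint32_t n_head_kv;     // attention heads for K/V (GQA/MQA)
    uint32_t n_embd_head_k; // per-head key dimension
    uint32_t n_embd_head_v; // per-head value dimension

    // total row widths of the K and V caches
    uint32_t n_embd_k_gqa() const { return n_embd_head_k * n_head_kv; }
    uint32_t n_embd_v_gqa() const { return n_embd_head_v * n_head_kv; }
};
```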
32866c5edd editorconfig : fix whitespace and indentation (#4710) b1747 2024-01-02 13:28:15 +02:00
5d7002d437 server : add --override-kv parameter (#4710)
* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
b1746
2024-01-02 12:38:15 +02:00
26f3071d71 py : re-enable mmap in convert hf (#4732)
* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: works for both mpt and awq mpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

* fix: update torch version

---------

Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-02 11:23:38 +02:00
775ac8712a finetune: fix typo in README.md (#4733)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-02 10:16:55 +01:00
58ba655af0 metal : enable shader debugging (cmake option) (#4705)
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci
b1743
2024-01-02 10:57:44 +02:00
edd1ab7bc3 flake.lock: update
to a commit recently cached by nixpkgs-cuda-ci
b1742
2023-12-31 13:14:58 -08:00
198ed7ebfc flake.nix: suggest the binary caches 2023-12-31 13:14:58 -08:00
d836174731 workflows: nix-ci: add a qemu job for jetsons 2023-12-31 13:14:58 -08:00
06f2a5d190 workflows: nix-flakestry: drop tag filters
...and add a job for flakehub.com
2023-12-31 13:14:58 -08:00
c5239944ba workflows: weekly nix flake update 2023-12-31 13:14:58 -08:00
1e9ae54cf2 workflows: nix-ci: add a job for eval 2023-12-31 13:14:58 -08:00
7adedecbe3 workflows: nix-ci: init; build flake outputs 2023-12-31 13:14:58 -08:00
356ea17e0f flake.nix: expose checks 2023-12-31 13:14:58 -08:00
a5c088d8c6 flake.nix: rocm not yet supported on aarch64, so hide the output 2023-12-31 13:14:58 -08:00
1e3900ebac flake.nix: expose full scope in legacyPackages 2023-12-31 13:14:58 -08:00
e39106c055 ggml : add ggml_vdotq_s32 alias (#4715)
ggml-ci
b1732
2023-12-31 11:43:31 +02:00
9fbda719de clip : refactor + bug fixes (#4696)
* clip : refactor + bug fixes

ggml-ci

* server : add log message
b1731
2023-12-30 23:24:42 +02:00
39d8bc71ed CUDA: fixed tensor cores not being used on RDNA3 (#4697) b1730 2023-12-30 13:52:01 +01:00
24a447e20a ggml : add ggml_cpu_has_avx_vnni() (#4589)
* feat: add avx_vnni based on intel documents

* ggml: add avx vnni based on intel document

* llama: add avx vnni information display

* docs: add more details about using oneMKL and oneAPI for intel processors

* Update ggml.c

Fix indentation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1729
2023-12-30 10:07:48 +02:00
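A minimal sketch of how the new probe is used; it joins the existing `ggml_cpu_has_*` family in `ggml.h`, each returning 0 or 1:

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    printf("AVX:      %d\n", ggml_cpu_has_avx());
    printf("AVX VNNI: %d\n", ggml_cpu_has_avx_vnni());
    printf("AVX2:     %d\n", ggml_cpu_has_avx2());
    return 0;
}
```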
a20f3c7465 CUDA: fix tensor core logic for Pascal and HIP (#4682) b1728 2023-12-29 23:12:53 +01:00
0235b9b571 clip : use ggml_backend_buffer_is_host (#4205) b1727 2023-12-29 18:53:34 +02:00
ce18d727a4 clip : enable gpu backend (#4205)
* clip: enable CUDA backend

* add missing kernels

* add enough padding for alignment

* remove ggml_repeat of clip.cpp

* add metal backend

* llava : fixes

- avoid ggml_repeat
- use GGML_USE_ instead of CLIP_USE_ macros
- remove unused vars

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b1726
2023-12-29 18:52:15 +02:00
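A sketch of the `ggml_repeat` cleanup mentioned above (illustrative helper, not code from the PR): `ggml_add` broadcasts its second operand, so materializing a bias at the activations' size first is unnecessary work.

```cpp
#include "ggml.h"

// before: the bias was expanded to a's shape with an extra op
//   cur = ggml_add(ctx, a, ggml_repeat(ctx, b, a));
// after: ggml_add broadcasts b across the rows of a directly
static ggml_tensor * add_bias(ggml_context * ctx, ggml_tensor * a, ggml_tensor * b) {
    return ggml_add(ctx, a, b);
}
```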
91bb39cec7 cuda: fix vmm oom issue on NVIDIA AGX Orin (#4687)
Signed-off-by: hydai <hydai@secondstate.io>
b1725
2023-12-29 17:31:19 +01:00
04ac0607e9 python : add check-requirements.sh and GitHub workflow (#4585)
* python: add check-requirements.sh and GitHub workflow

This script and workflow force package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.

* Move requirements into ./requirements

* Fail on "==" being used for package requirements (but can be suppressed)

* Enforce "compatible release" syntax instead of ==

* Update workflow

* Add upper version bound for transformers and protobuf

* improve check-requirements.sh

* small syntax change

* don't remove venvs if nocleanup is passed

* See if this fixes docker workflow

* Move check-requirements.sh into ./scripts/

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
b1724
2023-12-29 16:50:29 +02:00
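For context on the "compatible release" syntax enforced above: in PEP 440 terms, a pin like `numpy~=1.24.4` (version number illustrative) means `>=1.24.4,<1.25`, so bugfix releases are picked up automatically, whereas `==` freezes one exact version.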
68eccbdc5b flake.nix : rewrite (#4605)
* flake.lock: update to hotfix CUDA::cuda_driver

Required to support https://github.com/ggerganov/llama.cpp/pull/4606

* flake.nix: rewrite

1. Split into separate files per output.

2. Added overlays, so that this flake can be integrated into others.
   The names in the overlay are `llama-cpp`, `llama-cpp-opencl`,
   `llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the
   broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs).

3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/)
   rather than `with pkgs;` so that there's dependency injection rather
   than dependency lookup.

4. Add a description and meta information for each package.
   The description includes a bit about what each one is meant to accelerate.

5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge.

6. Format with `serokell/nixfmt` for a consistent style.

7. Update `flake.lock` with the latest goods.

* flake.nix: use finalPackage instead of passing it manually

* nix: unclutter darwin support

* nix: pass most darwin frameworks unconditionally

...for simplicity

* *.nix: nixfmt

nix shell github:piegamesde/nixfmt/rfc101-style --command \
    nixfmt flake.nix .devops/nix/*.nix

* flake.nix: add maintainers

* nix: move meta down to follow Nixpkgs style more closely

* nix: add missing meta attributes

nix: clarify the interpretation of meta.maintainers

nix: clarify the meaning of "broken" and "badPlatforms"

nix: passthru: expose the use* flags for inspection

E.g.:

```
❯ nix eval .#cuda.useCuda
true
```

* flake.nix: avoid re-evaluating nixpkgs too many times

* flake.nix: use flake-parts

* nix: migrate to pname+version

* flake.nix: overlay: expose both the namespace and the default attribute

* ci: add the (Nix) flakestry workflow

* nix: cmakeFlags: explicit OFF bools

* nix: cuda: reduce runtime closure

* nix: fewer rebuilds

* nix: respect config.cudaCapabilities

* nix: add the impure driver's location to the DT_RUNPATHs

* nix: clean sources more thoroughly

...this way outPaths change less frequently,
and so there are fewer rebuilds

* nix: explicit mpi support

* nix: explicit jetson support

* flake.nix: darwin: only expose the default

---------

Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
b1723
2023-12-29 16:42:26 +02:00
97bbca6e85 cmake : fix ld warning duplicate libraries libllama.a (#4671)
* fix "ld: warning: ignoring duplicate libraries: '../libllama.a'"

* fix warning in example.
b1722
2023-12-29 16:39:15 +02:00
4af4801566 llava-cli : refactor to use sampling library (#4669)
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup, deleting a
long-standing TODO comment.

This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.

See Mozilla-Ocho/llamafile@1cd334f
b1721
2023-12-29 16:38:38 +02:00
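As a usage note (invocation illustrative; `-m`, `--mmproj`, and `--image` are existing `llava-cli` flags, and the grammar itself is elided): a script can now run `llava-cli -m model.gguf --mmproj mmproj.gguf --image photo.jpg --grammar "$GRAMMAR" 2>llava.log` and read only the generation from stdout.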
db49ff8ed7 server : replace sleep with condition variables (#4673)
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency, since most sleep implementations round up to
the system scheduling quantum (usually 10ms). Other libc sleep
implementations spin for smaller time intervals, which results in the
server's busy loop consuming all available CPU. Having explicit
notify() / wait() code also aids the readability of the server code.

See Mozilla-Ocho/llamafile@711344b
b1720
2023-12-29 16:24:12 +02:00
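A minimal sketch of the notify()/wait() pattern this commit describes, using standard C++ primitives (the server's real task queue carries more state; names here are illustrative):

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct task_queue {
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;

    void post(std::function<void()> t) {
        {
            std::lock_guard<std::mutex> lk(mtx);
            tasks.push_back(std::move(t));
        }
        cv.notify_one(); // wake the worker immediately, no polling latency
    }

    void run_one() {
        std::unique_lock<std::mutex> lk(mtx);
        // blocks until work arrives instead of spinning on sleep(5ms)
        cv.wait(lk, [this] { return !tasks.empty(); });
        auto t = std::move(tasks.front());
        tasks.pop_front();
        lk.unlock();
        t();
    }
};
```

The worker blocks inside `cv.wait` until `post()` signals, so an idle server consumes no CPU and new tasks are picked up without the 5-10ms polling delay.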
60f55e888c server : fix OpenAI server sampling w.r.t. penalty. (#4675) b1719 2023-12-29 16:22:44 +02:00
b93edd22f5 server : allow to generate multimodal embeddings (#4681) b1718 2023-12-29 16:22:10 +02:00