Commit Graph

5517 Commits

Author SHA1 Message Date
3ec7e596b2 docker : add '--server' option (#2174) 2023-07-11 19:12:35 +03:00
917831c63a readme : fix zig build instructions (#2171) 2023-07-11 19:03:06 +03:00
2347463201 Support using mmap when applying LoRA (#2095)
* Support using mmap when applying LoRA

* Fix Linux

* Update comment to reflect LoRA support with mmap
master-2347463
2023-07-11 22:37:01 +08:00
bbef28218f Possible solution to allow K-quants on models with n_vocab!=32000 (#2148)
* This allows LLAMA models that were previously incompatible with K-quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g. 32001, which means it's not divisible by 256 or 64. Since the problematic dimensions only apply to `tok_embeddings.weight` and `output.weight` (dimensions 4096 x n_vocab), we can simply quantize these layers to Q8_0, whereas the majority of the hidden layers are still K-quanted since they have compatible dimensions.

* Fix indentation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* As an alternative, to avoid failing on Metal due to lack of Q8_0 support, instead quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16. This results in a net gain of about 55 MB for a 7B model compared to the previous approach, but should minimize adverse impact on model quality. (A sketch of this fallback logic follows this entry.)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-bbef282
2023-07-11 22:01:08 +08:00
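
To make the fallback described in this entry concrete, here is a minimal C++ sketch of the selection logic. It is written for illustration only: the names (`pick_quant_type`, `QuantType`) are hypothetical and do not appear in llama.cpp.

```cpp
// Illustrative sketch of the fallback described in the commit message above.
// The names here (pick_quant_type, QuantType) are hypothetical, not llama.cpp APIs.
#include <cstdint>
#include <string>

enum class QuantType { Q4_K, Q4_0, Q8_0, F16 };

// K-quants pack weights into super-blocks of 256 values, so a tensor row
// length that is not a multiple of 256 (e.g. n_vocab == 32001) cannot be K-quanted.
QuantType pick_quant_type(const std::string & name, int64_t ne0, bool metal) {
    if (ne0 % 256 == 0) {
        return QuantType::Q4_K;                      // normal K-quant path
    }
    // Incompatible shapes only occur for the embedding and output tensors:
    if (name == "tok_embeddings.weight") {
        return metal ? QuantType::Q4_0 : QuantType::Q8_0;
    }
    if (name == "output.weight") {
        return metal ? QuantType::F16 : QuantType::Q8_0;
    }
    return QuantType::Q4_0;                          // conservative default
}
```

In the real quantizer this decision lives inside llama.cpp's per-tensor quantization loop; the sketch only captures the shape check and the two per-tensor fallback choices described in the commit message.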
5656d10599 mpi : add support for distributed inference via MPI (#2099)
* MPI support, first cut

* fix warnings, update README

* fixes

* wrap includes

* PR comments

* Update CMakeLists.txt

* Add GH workflow, fix test

* Add info to README

* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)

* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()

* mpi : move all MPI logic into ggml-mpi

Not tested yet

* mpi : various fixes - communication now works but results are wrong

* mpi : fix output tensor after MPI compute (still not working)

* mpi : fix inference

* mpi : minor

* Add OpenMPI to GH action

* [mpi] continue-on-error: true

* mpi : fix after master merge

* [mpi] Link MPI C++ libraries to fix OpenMPI

* tests : fix new llama_backend API

* [mpi] use MPI_INT32_T

* mpi : factor out recv / send in functions and reuse

* mpi : extend API to allow usage with outer backends (e.g. Metal)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-5656d10
2023-07-10 18:49:56 +03:00
1d16309969 llama : remove "first token must be BOS" restriction (#2153) master-1d16309 2023-07-09 11:59:53 +03:00
db4047ad5c main : escape prompt prefix/suffix (#2151) master-db4047a 2023-07-09 11:56:18 +03:00
18780e0a5e readme : update Termux instructions (#2147)
The file path matters when running models inside Termux on Android devices. llama.cpp performance improves when a .bin model is loaded from the $HOME directory.
2023-07-09 11:20:43 +03:00
3bbc1a11f0 ggml : fix building with Intel MKL but ask for "cblas.h" issue (#2104) (#2115)
* Fix building with Intel MKL but ask for "cblas.h" issue

* Use angle brackets to indicate the system library
master-3bbc1a1
2023-07-09 11:12:20 +03:00
2492a53fd0 readme : add more docs indexes (#2127)
* Update README.md to add more docs indexes

* Update README.md to add more docs indexes
2023-07-09 10:38:42 +03:00
64639555ff Fixed OpenLLaMA 3b CUDA mul_mat_vec_q (#2144) master-6463955 2023-07-08 20:01:44 +02:00
061f5f8d21 CUDA: add __restrict__ to mul mat vec kernels (#2140) master-061f5f8 2023-07-08 00:25:15 +02:00
84525e7962 docker : add support for CUDA in docker (#1461)
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-84525e7
2023-07-07 21:25:25 +03:00
a7e20edf22 ci : switch threads to 1 (#2138) master-a7e20ed 2023-07-07 21:23:57 +03:00
1d656d6360 ggml : change ggml_graph_compute() API to not require context (#1999)
* ggml_graph_compute: deprecate using ggml_context, try to resolve issue #287

* rewrite: no longer consider backward compatibility; plan and make_plan (a sketch of the resulting call pattern follows this entry)

* minor: rename ctx as plan; const

* remove ggml_graph_compute from tests/test-grad0.c, but the current change breaks the backward pass

* add static ggml_graph_compute_sugar()

* minor: update comments

* reusable buffers

* ggml : more consistent naming + metal fixes

* ggml : fix docs

* tests : disable grad / opt + minor naming changes

* ggml : add ggml_graph_compute_with_ctx()

- backwards compatible API
- deduplicates a lot of copy-paste

* ci : enable test-grad0

* examples : factor out plan allocation into a helper function

* llama : factor out plan stuff into a helper function

* ci : fix env

* llama : fix duplicate symbols + refactor example benchmark

* ggml : remove obsolete assert + refactor n_tasks section

* ggml : fix indentation in switch

* llama : avoid unnecessary bool

* ggml : remove comments from source file and match order in header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07 19:24:01 +03:00
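
For readers following this API change, here is a hedged sketch of the resulting call pattern, based on the function names mentioned in the commit bullets (ggml_graph_plan, ggml_graph_compute, ggml_graph_compute_with_ctx). Field names such as `work_size` / `work_data` reflect a reading of ggml.h at this revision and should be checked against the actual header.

```cpp
// Sketch of the plan-based flow after this change (verify against ggml.h).
#include "ggml.h"
#include <cstdint>
#include <vector>

static void compute_graph(struct ggml_cgraph * graph, int n_threads) {
    // 1) ask ggml how much scratch memory this graph needs for n_threads
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);

    // 2) the caller now owns the work buffer instead of a ggml_context
    std::vector<uint8_t> work;
    if (plan.work_size > 0) {
        work.resize(plan.work_size);
        plan.work_data = work.data();
    }

    // 3) run the graph with the prepared plan
    ggml_graph_compute(graph, &plan);
}
```

The ggml_graph_compute_with_ctx() helper added in the same PR keeps the old one-call convenience by allocating the work buffer from a ggml_context internally.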
7242140283 ggml : remove sched_yield() call in ggml_graph_compute_thread() (#2134) master-7242140 2023-07-07 18:37:10 +03:00
3e08ae99ce convert.py: add mapping for safetensors bf16 (#1598)
Fixes #1473
2023-07-07 09:12:49 -04:00
481f793acc Fix OpenCL by wrapping #if-else-endif with \n (#2086) master-481f793 2023-07-07 05:34:18 +02:00
dfd9fce6d6 ggml : fix restrict usage master-dfd9fce 2023-07-06 19:41:31 +03:00
36680f6e40 convert : update for baichuan (#2081)
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.

Co-authored-by: Judd <foldl@boxvest.com>
master-36680f6
2023-07-06 19:23:49 +03:00
a17a2683d8 alpaca.sh : update model file name (#2074)
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` now requires GGML V3. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that it is more likely to run out of the box for more people, and fewer people waste time downloading the old Alpaca model.
2023-07-06 19:17:50 +03:00
31cfbb1013 Expose generation timings from server & update completions.js (#2116)
* use JavaScript generators as a much cleaner API

Also add ways to access completion as promise and EventSource

* export llama_timings as struct and expose them in server

* update readme, update baked includes

* llama : uniform variable names + struct init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-31cfbb1
2023-07-05 16:51:13 -04:00
983b555e9d Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
ec326d350c ggml : fix bug introduced in #1237 master-ec326d3 2023-07-05 20:44:11 +03:00
1b6efeab82 tests : fix test-grad0 master-1b6efea 2023-07-05 20:20:25 +03:00
1b107b8550 ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-1b107b8
2023-07-05 19:13:06 +03:00
8567c76b53 Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
924dd22fd3 Quantized dot products for CUDA mul mat vec (#2067) master-924dd22 2023-07-05 14:19:42 +02:00
051c70dcd5 llama: Don't double count the sampling time (#2107) master-051c70d 2023-07-05 18:31:23 +08:00
9e4475f5cf Fixed OpenCL offloading prints (#2082) master-9e4475f 2023-07-05 08:58:05 +02:00
7f0e9a775e embd-input: Fix input embedding example unsigned int seed (#2105) master-7f0e9a7 2023-07-05 07:33:33 +08:00
b472f3fca5 readme : add link web chat PR 2023-07-04 22:25:22 +03:00
ed9a54e512 ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c (a small usage sketch of the new ops follows this entry)
master-ed9a54e
2023-07-04 21:54:11 +03:00
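
As a quick orientation, the sketch below strings two of the newly synced ops into a tiny graph. It assumes the ggml API as it stood around these commits (ggml_init, ggml_build_forward, and the ggml_graph_compute_with_ctx() helper from the graph-compute change earlier in this log), so treat it as illustrative rather than authoritative.

```cpp
// Illustrative only: exercise ggml_tanh() and ggml_elu() on a small tensor.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    for (int i = 0; i < 8; ++i) {
        ((float *) x->data)[i] = (float) i - 4.0f;   // values -4 .. 3
    }

    struct ggml_tensor * y = ggml_elu(ctx, ggml_tanh(ctx, x));   // new ops

    struct ggml_cgraph gf = ggml_build_forward(y);
    ggml_graph_compute_with_ctx(ctx, &gf, /*n_threads=*/ 1);

    printf("elu(tanh(x[7])) = %f\n", ((float *) y->data)[7]);

    ggml_free(ctx);
    return 0;
}
```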
f257fd2550 Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
master-f257fd2
2023-07-04 21:06:12 +03:00
7ee76e45af Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

Because web browsers send a lot of garbage requests, we want the server to multithread when serving 404s for favicons etc. To avoid blowing up llama, we just take a mutex when it's invoked (a rough sketch of this pattern follows this entry).


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-7ee76e4
2023-07-04 16:05:27 +02:00
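
The mutex-around-llama design mentioned in this entry can be pictured with a rough sketch, not the actual examples/server code, using cpp-httplib (the HTTP library the server example is built on); the handler bodies here are placeholders.

```cpp
// Rough sketch of the threading model described above: cpp-httplib handles
// requests on multiple threads, while a mutex serializes access to the model
// so only one completion runs at a time. run_completion() is a placeholder.
#include <mutex>
#include <string>
#include "httplib.h"   // cpp-httplib

static std::mutex g_model_mutex;

static std::string run_completion(const std::string & prompt) {
    std::lock_guard<std::mutex> lock(g_model_mutex);   // one generation at a time
    return "(generated text for: " + prompt + ")";     // placeholder for the llama calls
}

int main() {
    httplib::Server svr;

    // Cheap requests (favicons, 404s, static files) never touch the model,
    // so they can be served concurrently without waiting on the mutex.
    svr.Get("/favicon.ico", [](const httplib::Request &, httplib::Response & res) {
        res.status = 404;
    });

    svr.Post("/completion", [](const httplib::Request & req, httplib::Response & res) {
        res.set_content(run_completion(req.body), "application/json");
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}
```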
acc111caf9 Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
master-acc111c
2023-07-04 15:38:04 +03:00
23c7c6fc91 Update Makefile: clean simple (#2097) master-23c7c6f 2023-07-04 14:15:16 +02:00
698efad5fb CI: make the brew update temporarily optional. (#2092)
Until they decide to fix the brew installation in the macOS runners. See the open issues, e.g. https://github.com/actions/runner-images/pull/7710
master-698efad
2023-07-04 01:50:12 +02:00
14a2cc71f6 [ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
1cf14ccef1 fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
cc45a7feb8 Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
55dbb915cc [llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
d7d2e6a0f0 server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
master-d7d2e6a
2023-07-03 00:38:44 +03:00
46088f7231 ggml : fix build with OpenBLAS (close #2066) master-46088f7 2023-07-02 09:46:46 +03:00
0bc2cdfc87 Better CUDA synchronization logic (#2057) master-0bc2cdf 2023-07-01 21:49:44 +02:00
befb3a3562 Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
b213227067 cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is not a specific CPU-tuning target for aarch64 then -mcpu
should be omitted completely. I think that makes sense, there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, then there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
2f8cd979ec metal : release buffers when freeing metal context (#2062) master-2f8cd97 2023-07-01 21:14:59 +03:00
471aab6e4c convert : add support for baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
463f2f4c4f llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00