Commit Graph

1195 Commits

Author SHA1 Message Date
983b555e9d Update Server Instructions (#2113)
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text

---------

Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 21:03:19 +03:00
ec326d350c ggml : fix bug introduced in #1237 master-ec326d3 2023-07-05 20:44:11 +03:00
1b6efeab82 tests : fix test-grad0 master-1b6efea 2023-07-05 20:20:25 +03:00
1b107b8550 ggml : generalize quantize_fns for simpler FP16 handling (#1237)
* Generalize quantize_fns for simpler FP16 handling

* Remove call to ggml_cuda_mul_mat_get_wsize

* ci : disable FMA for mac os actions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-1b107b8
2023-07-05 19:13:06 +03:00
8567c76b53 Update server instructions for web front end (#2103)
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 18:13:35 +03:00
924dd22fd3 Quantized dot products for CUDA mul mat vec (#2067) master-924dd22 2023-07-05 14:19:42 +02:00
051c70dcd5 llama: Don't double count the sampling time (#2107) master-051c70d 2023-07-05 18:31:23 +08:00
9e4475f5cf Fixed OpenCL offloading prints (#2082) master-9e4475f 2023-07-05 08:58:05 +02:00
7f0e9a775e embd-input: Fix input embedding example unsigned int seed (#2105) master-7f0e9a7 2023-07-05 07:33:33 +08:00
b472f3fca5 readme : add link to web chat PR 2023-07-04 22:25:22 +03:00
ed9a54e512 ggml : sync latest (new ops, macros, refactoring) (#2106)
- add ggml_argmax()
- add ggml_tanh()
- add ggml_elu()
- refactor ggml_conv_1d() and variants
- refactor ggml_conv_2d() and variants
- add helper macros to reduce code duplication in ggml.c
master-ed9a54e
2023-07-04 21:54:11 +03:00
f257fd2550 Add an API example using server.cpp similar to OAI. (#2009)
* add api_like_OAI.py
* add evaluated token count to server
* add /v1/ endpoints binding
master-f257fd2
2023-07-04 21:06:12 +03:00
7ee76e45af Simple webchat for server (#1998)
* expose simple web interface on root domain

* embed index and add --path for choosing static dir

* allow server to multithread

Because web browsers send a lot of garbage requests, we want the server
to multithread when serving 404s for favicons etc. To avoid blowing up
llama we just take a mutex when it's invoked (a sketch of this idea follows this entry).


* let's try this with the xxd tool instead and see if msvc is happier with that

* enable server in Makefiles

* add /completion.js file to make it easy to use the server from js

* slightly nicer css

* rework state management into session, expose historyTemplate to settings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-7ee76e4
2023-07-04 16:05:27 +02:00
acc111caf9 Allow old Make to build server. (#2098)
Also make server build by default.

Tested with Make 3.82
master-acc111c
2023-07-04 15:38:04 +03:00
23c7c6fc91 Update Makefile: clean simple (#2097) master-23c7c6f 2023-07-04 14:15:16 +02:00
698efad5fb CI: make the brew update temporarily optional. (#2092)
Until they decide to fix the brew installation in the macOS runners.
See the open issues, e.g. https://github.com/actions/runner-images/pull/7710
master-698efad
2023-07-04 01:50:12 +02:00
14a2cc71f6 [ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) 2023-07-04 07:50:00 +08:00
1cf14ccef1 fix server crashes (#2076) 2023-07-04 00:05:23 +03:00
cc45a7feb8 Fix crash of test-tokenizer-0 under Debug build (#2064)
* Fix crash of test-tokenizer-0 under Debug build

* Change per comment
2023-07-03 20:43:55 +02:00
55dbb915cc [llama] No need to check file version when loading vocab score (#2079) 2023-07-03 19:58:58 +08:00
d7d2e6a0f0 server: add option to output probabilities for completion (#1962)
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
master-d7d2e6a
2023-07-03 00:38:44 +03:00
46088f7231 ggml : fix build with OpenBLAS (close #2066) master-46088f7 2023-07-02 09:46:46 +03:00
0bc2cdfc87 Better CUDA synchronization logic (#2057) master-0bc2cdf 2023-07-01 21:49:44 +02:00
befb3a3562 Test-based VRAM scratch size + context adjustment (#2056) 2023-07-01 21:47:26 +02:00
b213227067 cmake : don't force -mcpu=native on aarch64 (#2063)
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.

-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.

If there is no specific CPU-tuning target for aarch64, then -mcpu
should be omitted completely. I think that makes sense; there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, then there is
already the LLAMA_NATIVE option available.

Fixes #495.
2023-07-01 21:31:44 +03:00
2f8cd979ec metal : release buffers when freeing metal context (#2062) master-2f8cd97 2023-07-01 21:14:59 +03:00
471aab6e4c convert : add support of baichuan-7b (#2055)
Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 20:00:25 +03:00
463f2f4c4f llama : fix return value of llama_load_session_file_internal (#2022) 2023-07-01 19:05:09 +03:00
cb44dbc7de llama : catch llama_load_session_file_internal exceptions (#2022)
* convert checks in llama_load_session_file to throw and handle them

* make llama_load_session_file_internal static

* address feedback to avoid using exceptions
2023-07-01 19:02:58 +03:00
79f634a19d embd-input : fix returning ptr to temporary master-79f634a 2023-07-01 18:46:00 +03:00
04606a1599 train : fix compile warning 2023-07-01 18:45:44 +03:00
b1ca8f36a9 ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)
Will not be scheduled unless explicitly enabled.
2023-07-01 18:42:43 +03:00
b8c8dda75f Use unsigned for random seed (#2006)
* Use unsigned for the random seed. Keep -1 as the value to use a time-based seed (see the sketch after this entry).

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-b8c8dda
2023-06-29 06:15:15 -07:00
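A small sketch of that seed convention, using an assumed helper name (resolve_seed) rather than llama.cpp's actual API: the seed is stored unsigned, and the value that -1 wraps to selects a time-based seed.

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>

// Assumed helper, not part of llama.cpp: map the -1 sentinel to a real seed.
static uint32_t resolve_seed(uint32_t seed) {
    if (seed == UINT32_MAX) {             // the user passed -1
        return (uint32_t) time(nullptr);  // time-based seed
    }
    return seed;                          // explicit, reproducible seed
}

int main() {
    printf("seed: %u\n", resolve_seed((uint32_t) -1));
    return 0;
}
```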
96a712ca1b Porting the improved K-Quant CUDA kernels to OpenCL (#1966)
* Added broken new q4k quant

* xx + ib0

* Fix q2_k fast kernel

* Use preprocessor for QK_K

* Add q6_k fast matmul kernel

* ported q3k speedup successfully

* ported q2k and q5k speedups

* remove old dot kernels and template

* fixed global const struct types

* fixing address spaces

* fixed string too long CI issue

---------

Co-authored-by: 0cc4m <picard12@live.de>
2023-06-29 05:56:43 +02:00
d3494bb86b llama : replacing auto &kv with const auto &kv (#2041)
* Replacing auto &kv with const auto &kv

* Create codacy.yml

* Delete codacy.yml
master-d3494bb
2023-06-28 21:39:08 +03:00
5b351e94d0 cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)
- Not used
master-5b351e9
2023-06-28 20:27:31 +03:00
6432aabb6d cuda : fix missing const qualifier in casts (#2027) master-6432aab 2023-06-28 20:26:26 +03:00
b922bc351b llama : remove shards weight file support (#2000)
* Remove multiple shards

* Remove multiple file loaders

* Remove llama_load_tensor_shard class

* Simplify load logic

* Remove dead code guess_n_parts function

* Remove vocab_only from constructor of llama_model_loader

* Remove alignment_prevents_mmap, which is no longer needed.

* Remove useless check
master-b922bc3
2023-06-28 20:13:02 +03:00
7f9753fa12 CUDA GPU acceleration for LoRAs + f16 models (#1970) master-7f9753f 2023-06-28 18:35:54 +02:00
cfa0750bc9 llama : support input embeddings directly (#1910)
* add interface for float input

* fixed inpL shape and type

* add examples of input floats

* add test example for embd input

* fixed sampling

* add free for context

* fixed add end condition for generating

* add examples for llava.py

* add README for llava.py

* add example of PandaGPT

* refactor the interface and fixed the styles

* add cmake build for embd-input

* add cmake build for embd-input

* Add MiniGPT-4 example

* change the order of the args of llama_eval_internal

* fix ci error
2023-06-28 18:53:37 +03:00
9d23589d63 fix pthreads setaffinity usage on android (#2020) master-9d23589 2023-06-27 19:06:33 +02:00
0be54f75a6 baby-llama : fix build after ggml_rope change (#2016) master-0be54f7 2023-06-27 08:07:13 +03:00
181e8d9755 llama : fix rope usage after ChatGLM change 2023-06-27 00:37:33 +03:00
d9779021bd ggml : add support for ChatGLM RoPE 2023-06-27 00:06:51 +03:00
d38e451578 readme : add Scala 3 bindings repo (#2010) 2023-06-26 22:47:59 +03:00
eaa6ca5a61 ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)
* Clean up compiler warnings in train-text

Add brackets to disambiguate order of operations

* Increase GGML_MAX_NAME

Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
master-eaa6ca5
2023-06-26 22:45:32 +03:00
aa777abbb7 readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007)
* docs - Alternative way to build on Android, with CLBlast.

* doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.

* doc - fix typo
2023-06-26 22:34:45 +03:00
c824d2e368 ggml : avoid conv 2d kernel round up master-c824d2e 2023-06-26 21:03:59 +03:00
zrm b853d45601 ggml : add NUMA support (#1556)
* detect NUMA systems and pin work threads to nodes (linux); see the sketch after this entry

* disable mmap prefetch/readahead for NUMA systems

* avoid sending finalize op to thread pool if it does nothing

* silence robot

* fix args

* make --numa a param

* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement

* lower synchronization overhead

* statically allocate

* move numa state to g_state

* add description for --numa

* ggml : minor style changes

* ggml : minor style + try fix sanitizer build

* llama : allow to initialize backend with NUMA support

* llama : avoid ggml include in llama-util.h

* ggml : style / formatting

* ggml : fix handling of ops with n_threads > n_tasks > 1

* server : utilize numa parameter

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
master-b853d45
2023-06-26 20:57:59 +03:00
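A rough sketch of the thread-pinning idea from the NUMA commit above, using the Linux affinity API with a hypothetical helper (pin_current_thread_to_node), not the actual ggml code; the CPU list for a node would come from the system topology (e.g. /sys/devices/system/node/node<N>/cpulist).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // CPU_ZERO/CPU_SET and pthread_setaffinity_np need this with glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <vector>

void pin_current_thread_to_node(const std::vector<int> & cpus_of_node) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus_of_node) {
        CPU_SET(cpu, &set);  // allow only this node's CPUs
    }
    // restrict the calling worker thread to the node's CPUs,
    // keeping its memory accesses local to that node
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```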
9225baef71 k-quants : fix indentation master-9225bae 2023-06-26 20:10:52 +03:00