Commit Graph

253 Commits

Author SHA1 Message Date
49006c67b4 llama : move random seed generation to the samplers (#9398)
* llama_sampler_penalties : clamp penalty_last_n to zero
2024-09-10 18:04:25 +02:00
bfe76d4a17 common : move arg parser code to arg.cpp (#9388)
* common : move arg parser to arg.cpp

* better categorize args

* add cmake

* missing climits

* missing cstdarg

* common : more explicit includes

* fix build

* refactor gpt_params_parse

* update server readme

* fix test

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-09 23:36:09 +02:00
5fb5e24811 llama : minor sampling refactor (2) (#9386) 2024-09-09 17:10:46 +02:00
1b9ae5189c common : refactor arg parser (#9308)
* (wip) argparser v3

* migrated

* add test

* handle env

* fix linux build

* add export-docs example

* fix build (2)

* skip build test-arg-parser on windows

* update server docs

* bring back missing --alias

* bring back --n-predict

* clarify test-arg-parser

* small correction

* add comments

* fix args with 2 values

* refine example-specific args

* no more lamba capture

Co-authored-by: slaren@users.noreply.github.com

* params.sparams

* optimize more

* export-docs --> gen-docs
2024-09-07 20:43:51 +02:00
df270ef745 llama : refactor sampling v2 (#9294)
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
2024-09-07 15:16:19 +03:00
9b2c24c099 server : simplify state machine for slot (#9283)
* server : simplify state machine for slot

* add SLOT_STATE_DONE_PROMPT

* pop_deferred_task

* add missing notify_one

* fix passkey test

* metrics : add n_busy_slots_per_decode

* fix test step

* add test

* maybe fix AddressSanitizer?

* fix deque ?

* missing lock

* pop_deferred_task: also notify

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-06 23:21:29 +02:00
4a1411b4f1 server : fix missing lock (#9334) 2024-09-06 14:06:04 +02:00
6e7d133a5f server : refactor multitask handling (#9274)
* server : remove multitask from server_task

* refactor completions handler

* fix embeddings

* use res_ok everywhere

* small change for handle_slots_action

* use unordered_set everywhere

* (try) fix test

* no more "mutable" lambda

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use deque

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-02 17:11:51 +02:00
42c76d1358 Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool

- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

* Minor fixes

* fixed use after release bug

* fixed a harmless race condition

* Fix Android bulid issue

* fix more race conditions

* fix deadlock for cases where cgraph.n_nodes == 1

and fix --poll case

* threadpool: use cpu_get_num_math to set the default number of threadpool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

* bench: create fresh threadpool for each test

For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).

* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

* threadpool: make polling the default to match openmp behavior

All command line args now allow for setting poll to 0 (false).

* threadpool: do not wakeup threads in already paused threadpool

* fix potential race condition in check_for_work

* threadpool: do not create two threadpools if their params are identical

* threadpool: reduce pause/resume/wakeup overhead in common cases

We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.

* threadpool: add support for hybrid polling

poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...

The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.

* threadpool: reduce the number of barrier required

New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.

* threadpool: remove special-casing for disposable threadpools

With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.

* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.

* threadpool: use relaxed order for chunk sync

Full memory barrier is an overkill for this since each thread works on different chunk

* threadpool: remove abort_callback from threadpool state

* threadpool: better naming for thread/cpumask releated functions

* threadpool: consistent use of int type for n_threads params

* threadpool: add support for ggml_threadpool_params_default/init

Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.

* threadpool: move typedef into ggml.h

* threadpool: fix apply_priority() function name

* threadpool: fix swift wrapper errors due to n_threads int type cleanup

* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled

* threadpool: replace checks for compute_thread ret code with proper status check

* threadpool: simplify threadpool init logic and fix main thread affinity application

Most of the init code is now exactly the same between threadpool and openmp.

* threadpool: update threadpool resume/pause function names

* threadpool: enable openmp by default for now

* threadpool: don't forget to free workers state when omp is enabled

* threadpool: avoid updating process priority on the platforms that do not require it

On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.

* threadpool: update calling thread prio and affinity only at start/resume

This avoids extra syscalls for each graph_compute()

* llama-bench: turn threadpool params into vectors, add output headers, etc

* llama-bench: add support for cool off between tests --delay

This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.

* threadpool: move process priority setting into the apps (bench and cli)

This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.

* threadpool: move all pause/resume logic into ggml

* threadpool: futher api cleanup and prep for future refactoring

All threadpool related functions and structs use ggml_threadpool prefix.

* threadpool: minor indent fixes

* threadpool: improve setprioty error message

* Update examples/llama-bench/llama-bench.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* threadpool: fix indent in set_threadpool call

* use int32_t for n_thread type in public llama.cpp API

* threadpool: use _new and _free instead of _create and _release

* fix two more public APIs to use int32_t for n_threads

* build: set _GNU_SOURCE for Adroid

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
9f7d4bcf5c server : fix crash when error handler dumps invalid utf-8 json (#9195) 2024-08-30 07:15:26 +08:00
fc54ef0d1c server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
2024-08-21 11:04:34 +02:00
8b3befc0e2 server : refactor middleware and /health endpoint (#9056)
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
37501d9c79 server : fix duplicated n_predict key in the generation_settings (#8994) 2024-08-15 10:28:05 +03:00
4af8420afb common : remove duplicate function llama_should_add_bos_token (#8778) 2024-08-15 10:23:23 +03:00
234b30676a server : init stop and error fields of the result struct (#9026)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
98a532d474 server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
2024-08-14 09:51:02 +03:00
5ef07e25ac server : handle models with missing EOS token (#8997)
ggml-ci
2024-08-12 10:21:50 +03:00
daef3ab233 server : add one level list nesting for embeddings (#8936) 2024-08-09 09:32:02 +03:00
1e6f6554aa server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint

* handle lora_no_apply

* fix build

* updae docs

* clean up struct def

* fix build

* add LoRA test

* fix style
2024-08-06 17:33:39 +02:00
0a4ce78681 common : Changed tuple to struct (TODO fix) (#8823)
* common : Changed tuple to struct (TODO fix)

Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>

* delete llama_init_default_params()

* delete the extra whitespace
2024-08-05 18:14:10 +02:00
978ba3d83d Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params

* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
3807c3de04 server : respect --special cli arg (#8553) 2024-07-18 11:06:22 +03:00
c3ebcfa148 server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
278d0e1846 Initialize default slot sampling parameters from the global context. (#8418) 2024-07-10 20:08:17 -04:00
a59f8fdc85 Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-09 18:26:40 -04:00
cb4d86c4d7 server: Retrieve prompt template in /props (#8337)
* server: Retrieve prompt template in /props

This PR adds the following:
- Expose the model's Jinja2 prompt template from the model in the /props endpoint.
- Change log-level from Error to Warning for warning about template mismatch.

The front-end stands a better chance of actually executing the Jinja template format correctly. Server is currently just guessing it.

Ideally this should have been inside a JSON block that expose the same key/value pairs as listed during startup in "llm_load_print_meta" function.

* Make string buffer dynamic

* Add doc and better string handling

* Using chat_template naming convention

* Use intermediate vector for string assignment
2024-07-07 11:10:38 +02:00
38373cfbab Add SPM infill support (#8016)
* add --spm-infill option

* support --spm-infill

* support --spm-infill
2024-06-28 12:53:43 +02:00
48e6b92cc3 Add chat template support for llama-cli (#8068)
* add chat template support for llama-cli

* add help message

* server: simplify format_chat

* more consistent naming

* improve

* add llama_chat_format_example

* fix server

* code style

* code style

* Update examples/main/main.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
ba58993152 server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
91c188d6c2 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
704a35b183 server : restore numeric prompts (#7883) 2024-06-12 14:42:29 +03:00
d9da0e4986 server : improve "prompt" handling (#7847) 2024-06-10 14:59:55 +03:00
7a16ce7db2 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
a5cabd7649 server : do not get prompt in infill mode (#7286)
* avoid to get prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
f83351f9a6 imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
1442677f92 common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remote --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
2e666832e6 server : new UI (#7633)
* ic

* migrate my eary work

* add the belonging stuff: css,favicon etc

* de prompts

* chore: Update HTML meta tags in index.html file

* add api-key css classes

* some necessary fixes

* Add API key CSS classes and update styling in style.css

* clean the code

* move API to the top, rearrange param sliders. update css

* add tooltips to the parameters with comprehensible explanations

* fix FloatField and BoolField tooltips

* fix grammar field width

* use template literales for promptFormats.js

* update const ModelGenerationInfo

* remove ms per token, since not relevant for most webui users and use cases

* add phi-3 prompt template

* add phi3 to dropdown

* add css class

* update forgotten css theme

* add user message suffix

* fix chatml & add llama3 format

* fix llama3 prompt template

* more prompt format fixes

* add more comon stop tokens

* add missing char

* do not separate with new line or comma

* move prompt style

* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js

* fix toggle state localstorage

* add cmd-r prompt et reduce redundancy

* set default prompt to empty

* move files, clean code

* fix css path

* add a button to the new ui

* move new ui to "/public" due to otherwise problematic CORS behaviour

* include new ui in cpp

* fix wrong link to old ui

* renaming to ensure consistency

* fix typos "prompt-format" -> "prompt-formats"

* use correct indent

* add new ui files to makefile

* fix typo
2024-06-01 22:31:48 +03:00
6ff13987ad common : normalize naming style (#7462)
* common : normalize naming style

ggml-ci

* common : match declaration / definition order

* zig : try to fix build
2024-05-22 20:04:20 +03:00
e932094d58 server : return error on too large embedding input (#7389) 2024-05-20 08:56:05 +03:00
41858392e1 server: fix seed being reported back (#7382) 2024-05-19 17:06:33 +03:00
ee94172d33 server : add support for the RPC backend (#7305)
ref: #7292
2024-05-17 10:00:17 +03:00
4f0263633b server: free sampling contexts on exit (#7264)
* server: free sampling contexts on exit

This cleans up last leak found by the address sanitizer.

* fix whitespace

* fix whitespace
2024-05-14 16:11:24 +02:00
72c177c1f6 fix system prompt handling (#7153) 2024-05-11 17:28:10 +02:00
988631335a server : free llama_batch on exit (#7212)
* [server] Cleanup a memory leak on exit

There are a couple memory leaks on exit of the server. This hides others.
After cleaning this up, you can see leaks on slots. But that is another
patch to be sent after this.

* make tab into spaces
2024-05-11 11:13:02 +03:00
5ae3426b0b server: fix reported top tokens for temperature 0 (#7203) 2024-05-11 10:11:28 +02:00
c12452c7ae JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143) 2024-05-08 21:53:08 +02:00
911b3900dd server : add_special option for tokenize endpoint (#7059) 2024-05-08 15:27:58 +03:00
af0a5b6163 server: fix incorrectly reported token probabilities (#7125)
* server: normalize token probabilities

* fix temperature == 0.0f
2024-05-07 23:07:58 +02:00
03fb8a002d If first token generated from the server is the stop word the server will crash (#7038)
This will reproduce the issue in llama13b
{
'prompt': 'Q: hello world \nA: ',
 'stop': ['\n'],
 'temperature': 0.0,
 'n_predict': 10,
 'cache_prompt': True,
 'n_probs': 10
}
2024-05-04 11:06:40 +02:00
9c67c2773d ggml : add Flash Attention (#5021)
* ggml : add ggml_flash_attn_ext API

* ggml : fix GQA support in ggml_flash_attn_ext

* ggml : online attention (CPU)

* metal : initial implementation

* metal : f16 precision

* metal : reduce branches

* metal : specialize for head size

* wip : 8 rows per simd group

* wip : 4 rows per simd group

* wip : template for rows per warp

* metal : parallelize across KV size

* metal : parallel reduce across heads

* metal : efficient flash_attn_f16 implementation

* metal : avoid redundant loads of the attention

* metal : scale and mask in matrix form

* metal : fix comment

* llama : avoid ggml_cast, use F32 query

* metal : add parallel reduce version (disabled)

* metal : move output into local memory + optimize

- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments

* metal : add tests, fix scaling, support C > 32

* metal : improve precision

* ggml : fix f16 mad

* metal : minor

* metal : support Q > 8

* tests : add ATTN tests

* metal : disable buffer allocation logs

* tests : more

* metal : faster inner loop for C == 32

* metal : fix array initialization

* tests : ifdef

* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext

* ggml : fix ggml_soft_max mask requirement

* cuda : fix soft_max to use correct mask size

* cuda : add flash_attn kernel (wip)

* metal : optimize softmax for C > 32

* metal : optimize softmax

* tests : minor fix

* cuda : avoid zeroing fragments

* tests : update dims

* cuda : fix __hisinf() result check

* cuda : avoid warp_reduce for smax

* cuda : use int instead of int64_t

Noticeably improves performance (thanks to Johannes)

* cuda : make loops use the same loop values

Thanks Johannes again for the tip

* cuda : unroll some of the loops

* cuda : avoid __hisinf branches

* cuda : use half2 in softmax

* cuda : switch to 1 warp for bs > 16

* cuda : speed-up reduce part of the kernel

* cuda : unroll Q*K^T loop

* cuda : fix -INF block check

* cuda : simplify softmax

* cuda : fix matrix names

* cuda : minor

* llama : adapt to F16 KQ_pos

* llama : adapt new models to F16 KQ_mask

* ggml : fix F16 store (ARM NEON)

* llama : fix type of KQ_mask and KQ_pos

* ggml : fix CPU soft_max

* tests : add hs=256

* cuda : fix build

* metal : improve perf via smaller int registers

* cuda : adapt soft_max to F16 mask and pos

* CUDA: faster FlashAttention, kernel for bs == 1

* 16 cols for Phi-2

* no vec for hs, no hs==256 ncols==32 for Volta

* adjust kernel selection logic

* 4 warps, 256 stride for all D

* no ncols == 64

* Multiple parallel blocks for batch size 1

* fix compile warnings

* fix excessive KQ_b loads

* fix cmake build

* fix KV cache padding, NaN from INFINITY (#6438)

* llama : flash_attn cparam + fix defrag

* server: support flash_attn param

* server: bench: enable flash_attn param

* CUDA: refactor host code, dyn. par. blocks

* fix flash_attn_vec_f16 race condition

* flush softmax exp below threshold to 0

* store temp KQ in registers

* Calculate KQ as FP32 if KQV has GGML_PREC_F32

* Add __hgt2_mask implementation for CUDA 11

* fix KQ FP32 precision fpr parallel_blocks > 1

* llama-bench : add -fa,--flash-attn arg

* metal : add BS=1 kernel for flash attention (#6508)

* metal : add BS=1 kernel for flash attention (wip)

* metal : support more than 1 warps

* metal : opts

* metal : opt

* metal : switch to parallel reduce

* metal : reduce registers

* metal : simplify

* metal : initial FA vec kernel

* metal : use F32 attention accumulators

* batched-bench : add fattn arg

* llama : simplify llama_build_kv_store

ggml-ci

* llama : adapt build_olmo to changes

* ggml : fix arm fp16 store on windows

* metal : clean-up

* metal : clean-up kernel code

* metal : minor

* tests : remove benchmarks

ggml-ci

* ggml : fix avx512 const correctness

ggml-ci

* ggml : fix soft_max with bias on CPU

ggml-ci

* common : print --flash-attn in help

* ggml : fix num dimensions in ggml_flash_attn_ext

* llama : force disable flash attention for incompatible models

* ggml : ggml_soft_max support F16/F32 mask/pos

ggml-ci

* cuda : uint -> uint32_t

* cuda : "constexpr dim3" -> "const dim3"

ggml-ci

* cuda : try to fix __hgt2_mask

ggml-ci

* ggml : add TODO's for F16/F32 mask/pos support in other backends

* llama : replace bool need_kq_pos with use_alibi

* llama : prep ALiBi support for BERT models

ggml-ci

* llama : fix n_batch requirements

ggml-ci

* cont

* server : add help for --flash-attn arg

* llama : disable FA for AMD

* tests : remove TMP_ATTN_BENCH

ggml-ci

* llama : support save/load state with FA enabled

ggml-ci

* ci : add CUDA save-load-state tests

ggml-ci

* llama : llama_kv_cache_clear zeroes data + fix save-load seq

ggml-ci

* llama : fix copy-paste errors, add TODO

* llama : disallow incompatible states

* llama : update llama_state_get_size after v_trans field

* metal : remove tmp log

* llama : add static reminder for llama_state_get_size

* metal : fix max nsg

ggml-ci

* ci : fix arg order

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
2024-04-30 12:16:08 +03:00