mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2025-06-27 03:55:20 +00:00
* Copy minja from58f0ca6dd7
* Add --jinja and --chat-template-file flags * Add missing <optional> include * Avoid print in get_hf_chat_template.py * No designated initializers yet * Try and work around msvc++ non-macro max resolution quirk * Update test_chat_completion.py * Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template * Refactor test-chat-template * Test templates w/ minja * Fix deprecation * Add --jinja to llama-run * Update common_chat_format_example to use minja template wrapper * Test chat_template in e2e test * Update utils.py * Update test_chat_completion.py * Update run.cpp * Update arg.cpp * Refactor common_chat_* functions to accept minja template + use_jinja option * Attempt to fix linkage of LLAMA_CHATML_TEMPLATE * Revert LLAMA_CHATML_TEMPLATE refactor * Normalize newlines in test-chat-templates for windows tests * Forward decl minja::chat_template to avoid eager json dep * Flush stdout in chat template before potential crash * Fix copy elision warning * Rm unused optional include * Add missing optional include to server.cpp * Disable jinja test that has a cryptic windows failure * minja: fix vigogne (https://github.com/google/minja/pull/22) * Apply suggestions from code review Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Finish suggested renamings * Move chat_templates inside server_context + remove mutex * Update --chat-template-file w/ recent change to --chat-template * Refactor chat template validation * Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr) * Warn against missing eos / bos tokens when jinja template references them * rename: common_chat_template[s] * reinstate assert on chat_templates.template_default * Update minja tob8437df626
* Update minja to https://github.com/google/minja/pull/25 * Update minja from https://github.com/google/minja/pull/27 * rm unused optional header --------- Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Server tests
Python based server tests scenario using pytest.
Tests target GitHub workflows job runners with 4 vCPU.
Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail.
To mitigate it, you can increase values in n_predict
, kv_size
.
Install dependencies
pip install -r requirements.txt
Run tests
- Build the server
cd ../../..
cmake -B build -DLLAMA_CURL=ON
cmake --build build --target llama-server
- Start the test:
./tests.sh
It's possible to override some scenario steps values with environment variables:
variable | description |
---|---|
PORT |
context.server_port to set the listening port of the server during scenario, default: 8080 |
LLAMA_SERVER_BIN_PATH |
to change the server binary path, default: ../../../build/bin/llama-server |
DEBUG |
to enable steps and server verbose mode --verbose |
N_GPU_LAYERS |
number of model layers to offload to VRAM -ngl --n-gpu-layers |
To run slow tests:
SLOW_TESTS=1 ./tests.sh
To run with stdout/stderr display in real time (verbose output, but useful for debugging):
DEBUG=1 ./tests.sh -s -v -x
To run single test unit:
./tests.sh unit/test_{name of test case here}.py -v -x
Hint: You can compile and run test in single command, useful for local developement:
cmake --build build -j --target llama-server && ./examples/server/tests/tests.sh
To see all available arguments, please refer to pytest documentation