b2034c2b55
contrib: support modelscope community ( #12664 )
...
* support download from modelscope
* support login
* remove comments
* add arguments
* fix code
* fix win32
* test passed
* fix readme
* revert readme
* change to MODEL_ENDPOINT
* revert tail line
* fix readme
* refactor model endpoint
* remove blank line
* fix header
* fix as comments
* update comment
* update readme
---------
Co-authored-by: tastelikefeet <yuze.zyz@alibaba-inc.com>
2025-04-11 14:01:56 +02:00
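The MODEL_ENDPOINT refactor above boils down to an environment lookup; a minimal sketch, assuming a helper along these lines (the function name and the fallback handling are illustrative, not the exact code in the tree):

    #include <cstdlib>
    #include <string>

    // Sketch: resolve the download endpoint, preferring the MODEL_ENDPOINT
    // environment variable (e.g. a ModelScope mirror) and falling back
    // to Hugging Face.
    static std::string get_model_endpoint() {
        const char * env = std::getenv("MODEL_ENDPOINT");
        std::string endpoint = env ? env : "https://huggingface.co/";
        if (!endpoint.empty() && endpoint.back() != '/') {
            endpoint += '/';
        }
        return endpoint;
    }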
bd3f59f812
cmake : enable curl by default ( #12761 )
...
* cmake : enable curl by default
* no curl if no examples
* fix build
* fix build-linux-cross
* add windows-setup-curl
* fix
* shell
* fix path
* fix windows-latest-cmake*
* run: include_directories
* LLAMA_RUN_EXTRA_LIBS
* sycl: no llama_curl
* no test-arg-parser on windows
* clarification
* try riscv64 / arm64
* windows: include libcurl inside release binary
* add msg
* fix mac / ios / android build
* will this fix xcode?
* try clearing the cache
* add bunch of licenses
* revert clear cache
* fix xcode
* fix xcode (2)
* fix typo
2025-04-07 13:35:19 +02:00
ef19c71769
run: de-duplicate fmt and format functions and optimize ( #11596 )
2025-03-25 18:46:11 +01:00
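A de-duplicated printf-style formatter of the kind the title describes might look like the following sketch (not the actual code from the PR): measure once with a null buffer, then format into a std::string.

    #include <cstdarg>
    #include <cstdio>
    #include <string>

    // Sketch: one shared variadic formatter instead of near-identical
    // fmt/format copies.
    static std::string fmt(const char * format, ...) {
        va_list args1;
        va_start(args1, format);
        va_list args2;
        va_copy(args2, args1);
        const int len = vsnprintf(nullptr, 0, format, args1);
        va_end(args1);
        std::string out(len > 0 ? len : 0, '\0');
        if (len > 0) {
            vsnprintf(&out[0], out.size() + 1, format, args2);
        }
        va_end(args2);
        return out;
    }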
9f2250ba72
Add CLI arg to llama-run to adjust the number of threads used ( #12370 )
...
We default to 4, but sometimes we want to adjust this manually.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-03-14 16:41:20 +00:00
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context ( #12181 )
...
* llama : refactor llama_context, llama_kv_cache, llm_build_context
ggml-ci
* graph : don't mutate the KV cache during defrag
ggml-ci
* context : reduce virtuals + remove test function
ggml-ci
* context : move interface implementation to source file + factory
ggml-ci
* graph : move KV cache build functions to llama_context impl
ggml-ci
* graph : remove model reference from build_pooling
ggml-ci
* graph : remove llama_model reference
ggml-ci
* kv_cache : provide rope factors
ggml-ci
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
ggml-ci
* context : remove llama_context_i abstraction
ggml-ci
* context : clean-up
ggml-ci
* graph : clean-up
ggml-ci
* llama : remove redundant keywords (struct, enum)
ggml-ci
* model : adapt gemma3
ggml-ci
* graph : restore same attention ops as on master
ggml-ci
* llama : remove TODO + fix indent
ggml-ci
2025-03-13 12:35:44 +02:00
c950a1f692
Adding UTF-8 support to llama.cpp ( #12111 )
...
For emojis, non-alpha characters, etc.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-03-03 12:44:56 +00:00
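Handling emojis means respecting multi-byte sequences; a minimal sketch of lead-byte classification (not the commit's exact code):

    // Sketch: number of bytes in a UTF-8 sequence, derived from its lead
    // byte. Emojis typically occupy 4 bytes.
    static int utf8_len(unsigned char c) {
        if (c < 0x80)           return 1; // 0xxxxxxx: ASCII
        if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
        return 1;                         // invalid lead byte: treat as one
    }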
7ad0779f5d
run: allow customizing the prompt via the env var LLAMA_PROMPT_PREFIX ( #12041 )
...
Signed-off-by: Florent Benoit <fbenoit@redhat.com >
2025-02-23 17:15:51 +00:00
f777a73e18
Some llama-run cleanups ( #11973 )
...
Use the consolidated open function call from the File class. Rename
read_all to to_string(). Remove exclusive locking; the intent of that
lock is to prevent multiple processes from writing to the same file,
which is not an issue for readers, although we may want to consider
adding a shared lock. Stop passing nullptr as a reference, since
references are never supposed to be null. clang-format the code
for consistent styling.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-02-23 13:14:32 +00:00
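The read_all -> to_string rename suggests a whole-file slurp; a minimal sketch, with error handling elided (a shared reader lock, as the commit hints, could be layered on top):

    #include <fstream>
    #include <sstream>
    #include <string>

    // Sketch: read an entire file into a string.
    static std::string to_string(const std::string & path) {
        std::ifstream file(path, std::ios::binary);
        std::ostringstream ss;
        ss << file.rdbuf();
        return ss.str();
    }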
0d559580a0
run : add --chat-template-file ( #11961 )
...
Relates to: https://github.com/ggml-org/llama.cpp/issues/11178
Added the --chat-template-file CLI option to llama-run. If specified, the file
is read and its content is passed to common_chat_templates_from_model,
overriding the model's chat template.
Signed-off-by: Michael Engel <mengel@redhat.com >
2025-02-20 10:35:11 +02:00
63e489c025
tool-call: refactor common chat / tool-call api (+ tests / fixes) ( #11900 )
...
* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type
* addressed clang-tidy lints in [test-]chat.*
* rm minja deps from util & common & move it to common/minja/
* add name & tool_call_id to common_chat_msg
* add common_chat_tool
* added json <-> tools, msgs conversions to chat.h
* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)
* fix deepseek r1 slow test (no longer <think> opening w/ new template)
* allow empty tools w/ auto + grammar
* fix & test server grammar & json_schema params w/ & w/o --jinja
2025-02-18 18:03:23 +00:00
19d3c8293b
There's a better way of clearing lines ( #11756 )
...
Use the ANSI escape code for clearing a line.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-02-09 10:34:49 +00:00
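The escape sequence in question erases the current line in place, e.g.:

    #include <cstdio>

    int main() {
        printf("downloading...");
        fflush(stdout);
        // \r returns to column 0; ESC[K erases from the cursor to the
        // end of the line, so the next write cleanly overwrites it.
        printf("\r\033[K");
        printf("done\n");
        return 0;
    }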
d2fe216fb2
Make logging more verbose ( #11714 )
...
Debugged an issue with a user who was on a read-only filesystem.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-02-07 14:42:46 +00:00
9f4cc8f8d3
sync : minja ( #11641 )
...
* `sync`: minja
182de30cda
https://github.com/google/minja/pull/46
https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
84ec8a58f7
Name colors ( #11573 )
...
It's more descriptive, and #define's let us use compile-time
concatenation.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-02-02 15:14:48 +00:00
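With #define'd string literals, adjacent literals are concatenated at compile time; a small sketch (macro names illustrative):

    #include <cstdio>

    #define ANSI_COLOR_RED   "\033[31m"
    #define ANSI_COLOR_GREEN "\033[32m"
    #define ANSI_RESET       "\033[0m"

    int main() {
        // The literals are glued together by the compiler, so there is
        // no runtime string concatenation at all.
        printf(ANSI_COLOR_RED "error:" ANSI_RESET " something went wrong\n");
        printf(ANSI_COLOR_GREEN "ok" ANSI_RESET "\n");
        return 0;
    }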
ecef206ccb
Implement s3:// protocol ( #11511 )
...
For those who want to pull from S3.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-02-01 10:30:54 +00:00
f0d4b29edf
Parse https://ollama.com/library/ syntax ( #11480 )
...
People search for ollama models using the web UI; this change
allows one to copy the URL from the browser and use it directly
with llama-run.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-29 11:23:10 +00:00
7fee2889e6
Add github protocol pulling and http:// ( #11465 )
...
These are added as pulling protocols to llama-run.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-28 14:45:41 +00:00
2b8525d5c8
Handle missing model in CLI parameters for llama-run ( #11399 )
...
The HTTP client in llama-run only prints an error if the download of
a resource fails. If the model name is missing from the CLI parameter list,
the application crashes.
In order to prevent this, a check for the required model parameter has been
added and errors for resource downloads get propagated to the caller.
Signed-off-by: Michael Engel <mengel@redhat.com >
2025-01-28 08:32:40 +00:00
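A minimal sketch of the early check the commit describes (the function shape is illustrative, not the exact llama-run code):

    #include <cstdio>
    #include <string>

    // Sketch: validate the required model parameter up front and signal
    // failure through the return code so callers can act on it.
    static int check_model_param(const std::string & model) {
        if (model.empty()) {
            fprintf(stderr, "error: missing required model parameter\n");
            return 1;
        }
        return 0;
    }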
a4417ddda9
Add new hf protocol for ollama ( #11449 )
...
https://huggingface.co/docs/hub/en/ollama
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-27 19:36:10 +01:00
01f37edf1a
Update llama-run README.md ( #11386 )
...
For consistency
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-24 09:39:24 +00:00
05f63cc9ee
Update documentation ( #11373 )
...
To show that -n, -ngl, and --ngl are all acceptable.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-23 20:04:31 +00:00
f7fb43cd0b
Add -ngl ( #11372 )
...
Most other llama.cpp cli tools accept -ngl with a single dash.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-23 16:16:18 +00:00
f211d1dc10
Treat hf.co/ prefix the same as hf:// ( #11350 )
...
ollama uses hf.co/ to specify the Hugging Face prefix, just as RamaLama
uses hf://.
Treat them the same.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-23 10:38:20 +00:00
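A minimal sketch of the prefix normalization this implies (not the exact llama-run code):

    #include <string>

    // Sketch: treat the hf.co/ web prefix like the hf:// scheme before
    // further resolution.
    static std::string normalize_hf(std::string model) {
        const std::string web = "hf.co/";
        if (model.rfind(web, 0) == 0) { // starts_with, pre-C++20
            model = "hf://" + model.substr(web.size());
        }
        return model;
    }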
6171c9d258
Add Jinja template support ( #11016 )
...
* Copy minja from 58f0ca6dd7
* Add --jinja and --chat-template-file flags
* Add missing <optional> include
* Avoid print in get_hf_chat_template.py
* No designated initializers yet
* Try and work around msvc++ non-macro max resolution quirk
* Update test_chat_completion.py
* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template
* Refactor test-chat-template
* Test templates w/ minja
* Fix deprecation
* Add --jinja to llama-run
* Update common_chat_format_example to use minja template wrapper
* Test chat_template in e2e test
* Update utils.py
* Update test_chat_completion.py
* Update run.cpp
* Update arg.cpp
* Refactor common_chat_* functions to accept minja template + use_jinja option
* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE
* Revert LLAMA_CHATML_TEMPLATE refactor
* Normalize newlines in test-chat-templates for windows tests
* Forward decl minja::chat_template to avoid eager json dep
* Flush stdout in chat template before potential crash
* Fix copy elision warning
* Rm unused optional include
* Add missing optional include to server.cpp
* Disable jinja test that has a cryptic windows failure
* minja: fix vigogne (https://github.com/google/minja/pull/22 )
* Apply suggestions from code review
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Finish suggested renamings
* Move chat_templates inside server_context + remove mutex
* Update --chat-template-file w/ recent change to --chat-template
* Refactor chat template validation
* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)
* Warn against missing eos / bos tokens when jinja template references them
* rename: common_chat_template[s]
* reinstate assert on chat_templates.template_default
* Update minja to b8437df626
* Update minja to https://github.com/google/minja/pull/25
* Update minja from https://github.com/google/minja/pull/27
* rm unused optional header
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-01-21 13:18:51 +00:00
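After this change, llama-run can take both new flags together; a typical invocation (model and file names illustrative):

    llama-run --jinja --chat-template-file my-template.jinja llama3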
2e2f8f093c
linenoise.cpp refactoring ( #11301 )
...
More RAII mainly
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-21 09:32:35 +00:00
9f7add1cde
examples : fix add_special conditions ( #11311 )
2025-01-20 16:36:08 +02:00
a1649cc13f
Adding linenoise.cpp to llama-run ( #11252 )
...
This is a fork of linenoise that is C++17 compatible. I intend to
add it to llama-run so we can do things like traversing prompt
history via the up and down arrows:
https://github.com/ericcurtin/linenoise.cpp
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-18 14:42:31 +00:00
924518e2e5
Reset color before we exit ( #11205 )
...
We don't want colors to leak after llama-run terminates.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-12 18:23:10 +00:00
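One way to guarantee the reset regardless of how the program exits is an atexit hook (a sketch; the commit may simply print the reset code before returning):

    #include <cstdio>
    #include <cstdlib>

    // Sketch: restore the terminal's default color even on early exits.
    static void reset_color() {
        printf("\033[0m");
        fflush(stdout);
    }

    int main() {
        std::atexit(reset_color);
        printf("\033[32mhello\033[0m\n");
        return 0;
    }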
afa8a9ec9b
llama : add llama_vocab, functions -> methods, naming ( #11110 )
...
* llama : functions -> methods (#11110 )
* llama : add struct llama_vocab to the API (#11156 )
ggml-ci
* hparams : move vocab params to llama_vocab (#11159 )
ggml-ci
* vocab : more pimpl (#11165 )
ggml-ci
* vocab : minor tokenization optimizations (#11160 )
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com >
* lora : update API names (#11167 )
ggml-ci
* llama : update API names to use correct prefix (#11174 )
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174 )
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174 )
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com >
2025-01-12 11:32:42 +02:00
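The renames listed above land on a vocab-scoped API; a short sketch of the resulting calls (assuming llama.h, with names taken from the rename list):

    #include <cstdio>
    #include "llama.h"

    // Sketch: query vocab properties through the new llama_vocab handle.
    static void print_vocab_info(const llama_model * model) {
        const struct llama_vocab * vocab = llama_model_get_vocab(model);
        printf("n_tokens: %d, add_bos: %d\n",
               llama_vocab_n_tokens(vocab),
               (int) llama_vocab_get_add_bos(vocab));
    }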
1bf839b1e8
Enhance user input handling for llama-run ( #11138 )
...
The main motivation for this change is that it was not handling
ctrl-c/ctrl-d correctly. Modify `read_user_input` to handle EOF,
the "/bye" command, and empty input cases. Introduce a `get_user_input`
function to manage the user input loop and handle the different return
cases.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-08 18:47:05 +00:00
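A minimal sketch of the loop behaviour described above (return values illustrative):

    #include <iostream>
    #include <string>

    // Sketch: distinguish quit (EOF or "/bye"), empty input (prompt
    // again), and a real prompt to process.
    static int read_user_input(std::string & user_input) {
        std::getline(std::cin, user_input);
        if (std::cin.eof() || user_input == "/bye") {
            return 1; // quit
        }
        if (user_input.empty()) {
            return 2; // prompt again
        }
        return 0; // process the input
    }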
dc7cef9f37
llama-run : fix context size ( #11094 )
...
Set `n_ctx` equal to `n_batch` in the `Opt` class. The context size is
now a more reasonable 2048.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-01-06 23:45:28 +01:00
47182dd03f
llama : update llama_model API names ( #11063 )
...
* llama : deprecate llama_free_model, add llama_model_free
ggml-ci
* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`
ggml-ci
2025-01-06 10:55:18 +02:00
6e1531aca5
common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON ( #11013 )
...
In common/common.cpp:
* Replace the stat() function call used to check if a file exists with the standard library function std::filesystem::exists (error: unable to match to the correct function signature)
* Add conditions to check if PATH_MAX is already defined in the WIN32 environment (warning: it is already defined in MSYS2)
In examples/run/run.cpp:
* Add io.h header inclusion (error cannot find function _get_osfhandle)
* Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members)
* Add initialiser for hFile (warning it may be uninitialised)
* Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int)
In ggml/src/ggml-opencl/ggml-opencl.cpp:
* Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned)
2024-12-31 01:46:06 +01:00
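The stat() replacement mentioned in the first bullet reduces to a one-liner (a sketch):

    #include <filesystem>
    #include <string>

    // Sketch: the portable existence check that replaced the stat() call,
    // sidestepping the MSYS2 signature-matching error.
    static bool file_exists(const std::string & path) {
        return std::filesystem::exists(path);
    }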
dab76c92cc
llama-run : include temperature option ( #10899 )
...
This commit updates the `examples/run/README.md` file to include a new
option for setting the temperature and updates the `run.cpp` file to
parse this option.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2024-12-23 01:21:40 +01:00
7909e8588d
llama-run : improve progress bar ( #10821 )
...
Set the default width to the width of the terminal. Also fixed a small bug
around the default n_gpu_layers value.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2024-12-19 03:58:00 +01:00
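Sizing to the terminal usually means querying the TTY; a POSIX sketch (the Windows path would use GetConsoleScreenBufferInfo instead):

    #include <sys/ioctl.h>
    #include <unistd.h>

    // Sketch: progress-bar width from the controlling terminal, with a
    // conservative fallback when stdout is not a TTY.
    static int get_terminal_width() {
        struct winsize ws {};
        if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0 && ws.ws_col > 0) {
            return ws.ws_col;
        }
        return 80;
    }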
c27ac678dd
Opt class for positional argument handling ( #10508 )
...
Added support for positional arguments `model` and `prompt`. Added
functionality to download via strings like:
llama-run llama3
llama-run ollama://granite-code
llama-run ollama://granite-code:8b
llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
llama-run https://example.com/some-file1.gguf
llama-run some-file2.gguf
llama-run file://some-file3.gguf
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2024-12-13 19:34:25 +01:00
7cc2d2c889
ggml : move AMX to the CPU backend ( #10570 )
...
* ggml : move AMX to the CPU backend
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2024-11-29 21:54:58 +01:00
0cc63754b8
Introduce llama-run ( #10291 )
...
It's like simple-chat but it uses smart pointers to avoid manual
memory cleanups. There are fewer memory leaks in the code now. Avoid
printing multiple dots. Split code into smaller functions. Uses no
exception handling.
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2024-11-25 22:56:24 +01:00
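The smart-pointer style the commit describes pairs the C API's free functions with unique_ptr deleters; a sketch (using today's API names rather than the ones current when this commit landed):

    #include <memory>
    #include "llama.h"

    // Sketch: let unique_ptr call the C API's free functions, so the
    // model and context are released automatically on every code path.
    using model_ptr = std::unique_ptr<llama_model, decltype(&llama_model_free)>;
    using ctx_ptr   = std::unique_ptr<llama_context, decltype(&llama_free)>;

    // Usage (parameters elided):
    //   model_ptr model(llama_model_load_from_file(path, mparams), llama_model_free);
    //   ctx_ptr   ctx(llama_init_from_model(model.get(), cparams), llama_free);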