* Add Arcee AFM support
* Add draft update code
* Fix linter and update URL, may still not be final
* Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Remote accidental blank line
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Adds:
* Dots1Model to convert_hf_to_gguf.py
* Computation graph code to llama-model.cpp
* Chat template to llama-chat.cpp to detect this model's template.
---
The model is called "dots.llm1" (I decided to shorten it to dots1 or
DOTS1 in the code generally) architecture.
The only models that exist as of writing of this commit that follow this
architecture are "dots.llm1.inst" and "dots.llm1.base" from here:
* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base
The model architecture is a combination of Qwen and Deepseek parts, as
seen here:
ffe12627b4/src/transformers/models/dots1/modular_dots1.py
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding
* first working version
* fix style
* more strict validate of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
* feat: Add GGUF conversion for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: hparam and arch plumbing for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Split MoE fused tensors for shared experts in conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First WIP cut at model arch in cpp
The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Cleaner (maybe more correct?) splitting for gate/up
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix the input to the shared experts
I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Avoid architecture-specific checks for Granite MoE Shared
This is a cleaner way that will allow more flexibility in architecture
strings going forward.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Split granite architectures out of llm_build_llama
This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).
NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix compiler warning about uninitialized inp_pos
This should not have been reachable, but it warns on some compliers
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consoladate GraniteMoEShared into GraniteMoE for conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* convert : internvl support
* InternVL3-1B working
* fix regression
* rm mobilevlm from test
* fix conversion
* add test for internvl
* add to list of pre-quant
* restore boi/eoi check
* add clarify comment for norm eps
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed
Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture
- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.
https://www.nomic.ai/blog/posts/nomic-embed-text-v2
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* fix tokenizer
* don't rename this tensor
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save and 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
* gguf-py : support lazy tensor splitting
Splitting usually involves returning tuples of tensors,
which need to be handled properly to avoid early eager evaluation.
* gguf-py : fix flake8 lint
* add edgellm model arch[conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of model arch in convert gguf
* [Model] Refarctor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig erros
* [Code] Remove Trailing whitespace
* [Code] Remove Trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 Lint errors
* Remove trailing white space
* [Code] Remove call in model arch