* Add Arcee AFM support
* Add draft update code
* Fix linter and update URL, may still not be final
* Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Remote accidental blank line
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Adds:
* Dots1Model to convert_hf_to_gguf.py
* Computation graph code to llama-model.cpp
* Chat template to llama-chat.cpp to detect this model's template.
---
The model is called "dots.llm1" (I decided to shorten it to dots1 or
DOTS1 in the code generally) architecture.
The only models that exist as of writing of this commit that follow this
architecture are "dots.llm1.inst" and "dots.llm1.base" from here:
* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base
The model architecture is a combination of Qwen and Deepseek parts, as
seen here:
ffe12627b4/src/transformers/models/dots1/modular_dots1.py
* add distilbert
* small fixes
* add note for LLM_ARCH_DISTIL_BERT
* Use MODEL_ARCH.BERT for DistilBert
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint
---------
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
* mtmd : allow multiple modalities at the same time
* refactor mtmd tokenizer
* fix compile
* ok, missing SinusoidsPositionEmbedding
* first working version
* fix style
* more strict validate of n_embd
* refactor if..else to switch
* fix regression
* add test for 3B
* update docs
* fix tokenizing with add_special
* add more tests
* fix test case "huge"
* rm redundant code
* set_position_mrope_1d rm n_tokens
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
* feat: Add GGUF conversion for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: hparam and arch plumbing for granitemoeshared
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Split MoE fused tensors for shared experts in conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First WIP cut at model arch in cpp
The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Cleaner (maybe more correct?) splitting for gate/up
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix the input to the shared experts
I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Avoid architecture-specific checks for Granite MoE Shared
This is a cleaner way that will allow more flexibility in architecture
strings going forward.
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Split granite architectures out of llm_build_llama
This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).
NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix compiler warning about uninitialized inp_pos
This should not have been reachable, but it warns on some compliers
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consoladate GraniteMoEShared into GraniteMoE for conversion
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side
Branch: GraniteMoEShared
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* convert : internvl support
* InternVL3-1B working
* fix regression
* rm mobilevlm from test
* fix conversion
* add test for internvl
* add to list of pre-quant
* restore boi/eoi check
* add clarify comment for norm eps
* mtmd : refactor graph builder
* fix qwen2vl
* clean up siglip cgraph
* pixtral migrated
* move minicpmv to a dedicated build function
* move max_feature_layer to build_llava
* use build_attn for minicpm resampler
* fix windows build
* add comment for batch_size
* also support tinygemma3 test model
* qwen2vl does not use RMS norm
* fix qwen2vl norm (2)
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture
- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.
https://www.nomic.ai/blog/posts/nomic-embed-text-v2
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* fix tokenizer
* don't rename this tensor
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* fix wrong template in GLM4-0414
* fix spaces
* no bos token since it is already in the template
* moved the chatgml4 check to higher priority
* restored template for old GLM models
* moved the GLM4 template check in the correct place with correct check
* implment vision model architecture, gguf convertor
* handle window attention inputs
* add debug utils
* fix few incorrect tensor memory layout
* move position id remap out of ggml to avoid int32 cuda operations
* cleaning up
* ignore transformers Qwen2_5_xxx type check
* remove not so often use `qwen2vl-cli` debug functions
* remove commented-out code blocks
* fix attn weight scaling after rebase
* add `PROJECTOR_TYPE_QWEN2_5_VL`
* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`
* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`
* remove `attn_window_size` from gguf
* fix model conversion
* clean up
* fix merging problem
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save and 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`