Backbone
Concrete Model subclasses and the config dataclasses that build HuggingFace transformer backbones. You rarely need to instantiate these directly — use load_model instead.
ModelLlama
mouse.models.backbone.llama.ModelLlama
ModelLlama(hidden_dim: int, backbone_kwargs: dict, embedding_kwargs: dict, sp_head_kwargs: dict, dqn_head_kwargs: dict, sv_head_kwargs: dict, vec_dqn_head_kwargs: dict, action_head: str | None = None)
Bases: Model, PyTorchModelHubMixin
MOUSE model with a Llama transformer backbone.
Attends over the full [B, S*T, D] token sequence with causal SDPA.
Supports KV-cache for incremental rollouts (use_cache=True).
Source code in mouse/models/base.py
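As a rough illustration of the `[B, S*T, D]` layout and causal attention mentioned above, the pure-Python sketch below flattens a `(step, token)` pair into one sequence index and builds a causal mask over the flattened sequence. Reading `S` as the number of steps and `T` as the number of tokens per step, and assuming row-major flattening, are assumptions made for this example. `flat_index` and `causal_mask` are hypothetical helpers, not part of `mouse`.

```python
def flat_index(step: int, token: int, tokens_per_step: int) -> int:
    """Map a (step, token) pair to its position in the flattened S*T sequence.

    Assumes row-major flattening: all tokens of step 0, then step 1, and so on.
    """
    return step * tokens_per_step + token


def causal_mask(length: int) -> list[list[bool]]:
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


S, T = 2, 3                      # hypothetical sizes: 2 steps, 3 tokens each
mask = causal_mask(S * T)

# The first token of step 1 sits at flattened position 3 and can see
# every token of step 0 plus itself, but nothing later.
pos = flat_index(step=1, token=0, tokens_per_step=T)
print(pos, mask[pos])
```

Under these assumptions, every token of a later step can attend to all tokens of earlier steps, which is what lets the backbone carry temporal context across the whole sequence.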
backbone_forward
backbone_forward(embeds: Tensor, token_type: Tensor, cache: dict[str, Any] | None = None, use_cache: bool = False, cache_position: Tensor | None = None, **kwargs: Any) -> tuple[torch.Tensor, dict[str, Any] | None]
Run the Llama backbone over the token sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embeds` | `Tensor` | | required |
| `token_type` | `Tensor` | | required |
| `cache` | `dict[str, Any] \| None` | KV-cache dict from a previous call, or | `None` |
| `use_cache` | `bool` | If | `False` |
| `cache_position` | `Tensor \| None` | Unused; present for interface compatibility. | `None` |
| `**kwargs` | `Any` | Forwarded to the underlying | `{}` |
Returns:
| Type | Description |
|---|---|
| `tuple[Tensor, dict[str, Any] \| None]` | Tuple of |
Source code in mouse/models/backbone/llama.py
ModelQwen3
mouse.models.backbone.qwen3.ModelQwen3
ModelQwen3(hidden_dim: int, backbone_kwargs: dict, embedding_kwargs: dict, sp_head_kwargs: dict, dqn_head_kwargs: dict, sv_head_kwargs: dict, vec_dqn_head_kwargs: dict, action_head: str | None = None)
Bases: Model, PyTorchModelHubMixin
MOUSE model with a Qwen3 transformer backbone.
Attends over the full [B, S*T, D] token sequence with causal SDPA.
Supports an explicit head_dim (set in backbone_kwargs) for grouped-query
attention with a head size independent of the model width. Supports KV-cache
for incremental rollouts (use_cache=True).
Source code in mouse/models/base.py
backbone_forward
backbone_forward(embeds: Tensor, token_type: Tensor, cache: dict[str, Any] | None = None, use_cache: bool = False, cache_position: Tensor | None = None, **kwargs: Any) -> tuple[torch.Tensor, dict[str, Any] | None]
Run the Qwen3 backbone over the token sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embeds` | `Tensor` | | required |
| `token_type` | `Tensor` | | required |
| `cache` | `dict[str, Any] \| None` | KV-cache dict from a previous call, or | `None` |
| `use_cache` | `bool` | If | `False` |
| `cache_position` | `Tensor \| None` | Unused; present for interface compatibility. | `None` |
| `**kwargs` | `Any` | Forwarded to the underlying | `{}` |
Returns:
| Type | Description |
|---|---|
| `tuple[Tensor, dict[str, Any] \| None]` | Tuple of |
Source code in mouse/models/backbone/qwen3.py
ModelNone
mouse.models.backbone.none.ModelNone
ModelNone(hidden_dim: int, backbone_kwargs: dict, embedding_kwargs: dict, sp_head_kwargs: dict, dqn_head_kwargs: dict, sv_head_kwargs: dict, vec_dqn_head_kwargs: dict, action_head: str | None = None)
Bases: Model, PyTorchModelHubMixin
MOUSE model with no backbone; embeddings pass directly to the output heads.
Useful for ablations or lightweight baselines where no temporal context is
required. backbone_kwargs must be empty (or absent) in config.json.
KV-cache is not supported — always returns None for the cache.
Source code in mouse/models/base.py
backbone_forward
backbone_forward(embeds: Tensor, token_type: Tensor, cache: dict[str, Any] | None = None, use_cache: bool = False, cache_position: Tensor | None = None, **kwargs: Any) -> tuple[torch.Tensor, dict[str, Any] | None]
Pass embeddings through unchanged; always returns None for cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `embeds` | `Tensor` | | required |
| `token_type` | `Tensor` | Ignored. | required |
| `cache` | `dict[str, Any] \| None` | Ignored. | `None` |
| `use_cache` | `bool` | Ignored. | `False` |
| `cache_position` | `Tensor \| None` | Ignored. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[Tensor, dict[str, Any] \| None]` | Tuple of |
Source code in mouse/models/backbone/none.py
LlamaBackboneConfig
mouse.models.backbone.llama.LlamaBackboneConfig
dataclass
LlamaBackboneConfig(num_layers: int, num_heads: int, num_key_value_heads: int | None = None, max_position_embeddings: int = 4096, expand: int = 4, intermediate_size: int | None = None, rope_parameters: dict | None = None, rms_norm_eps: float = 1e-05, attention_bias: bool = False)
Configuration for a Llama transformer backbone.
Builds a HuggingFace LlamaModel with SDPA attention and no token
embedding or final layer norm (norm is replaced with nn.Identity).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num_layers` | `int` | Number of transformer decoder layers. | required |
| `num_heads` | `int` | Number of query attention heads. | required |
| `num_key_value_heads` | `int \| None` | Key/value heads for GQA; defaults to | `None` |
| `max_position_embeddings` | `int` | Maximum sequence length for RoPE; should be at least | `4096` |
| `expand` | `int` | FFN intermediate size multiplier: | `4` |
| `intermediate_size` | `int \| None` | Exact FFN size; overrides | `None` |
| `rope_parameters` | `dict \| None` | Optional dict forwarded to | `None` |
| `rms_norm_eps` | `float` | Epsilon for RMSNorm layers. | `1e-05` |
| `attention_bias` | `bool` | Whether to add bias to QKV and output projections. | `False` |
build
Instantiate a LlamaModel with this config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_dim` | `int` | Model hidden dimension | required |
Returns:
| Type | Description |
|---|---|
| `LlamaModel` | |
Source code in mouse/models/backbone/llama.py
Qwen3BackboneConfig
mouse.models.backbone.qwen3.Qwen3BackboneConfig
dataclass
Qwen3BackboneConfig(num_layers: int, num_heads: int, num_key_value_heads: int | None = None, head_dim: int | None = None, max_position_embeddings: int = 32768, expand: int = 3, intermediate_size: int | None = None, rope_parameters: dict | None = None, rms_norm_eps: float = 1e-06, attention_bias: bool = False, use_sliding_window: bool = False)
Configuration for a Qwen3 transformer backbone.
Builds a HuggingFace Qwen3Model with SDPA attention and no token
embedding or final layer norm (norm is replaced with nn.Identity).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num_layers` | `int` | Number of transformer decoder layers. | required |
| `num_heads` | `int` | Number of query attention heads. | required |
| `num_key_value_heads` | `int \| None` | Key/value heads for GQA; defaults to | `None` |
| `head_dim` | `int \| None` | Per-head attention dimension. When | `None` |
| `max_position_embeddings` | `int` | Maximum sequence length for RoPE. | `32768` |
| `expand` | `int` | FFN intermediate size multiplier: | `3` |
| `intermediate_size` | `int \| None` | Exact FFN size; overrides | `None` |
| `rope_parameters` | `dict \| None` | Optional dict forwarded to | `None` |
| `rms_norm_eps` | `float` | Epsilon for RMSNorm layers. | `1e-06` |
| `attention_bias` | `bool` | Whether to add bias to QKV and output projections. | `False` |
| `use_sliding_window` | `bool` | Enable sliding-window attention (Qwen3 feature). | `False` |
build
Instantiate a Qwen3Model with this config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hidden_dim` | `int` | Model hidden dimension | required |
Returns:
| Type | Description |
|---|---|
| `Qwen3Model` | |