Heads
Output heads are constructed and owned by Model. They are documented here for reference when building custom training loops.
DQNHead
mouse.models.heads.dqn.DQNHead
DQNHead(in_features: int, out_features: int, hidden_dim: int, num_layers: int, scale: float = 1.0, use_norm: bool = True)
Bases: BaseHeadWithTarget
SwiGLUHead paired with an EMA target copy and Polyak averaging.
forward runs the online head. target_forward runs the target head
(no gradient tracking). Call polyak_update(tau) after each optimizer
step to soft-update the target: θ_target ← τ·θ_online + (1−τ)·θ_target.
To initialize the target, call polyak_update with tau=1.0: at τ = 1 the update
reduces to a hard copy of the online weights into the target.
Source code in mouse/models/heads/dqn.py
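The soft-update rule can be illustrated without any framework. The sketch below is not the library's implementation: it treats each head's parameters as a flat list of floats, and the free function `polyak_update(online, target, tau)` is a hypothetical stand-in for the method of the same name.

```python
def polyak_update(online, target, tau):
    """Soft-update `target` in place: target <- tau*online + (1 - tau)*target."""
    for i, (w_online, w_target) in enumerate(zip(online, target)):
        target[i] = tau * w_online + (1.0 - tau) * w_target


online = [1.0, -2.0, 0.5]   # stand-in for the online head's weights
target = [0.0, 0.0, 0.0]    # stand-in for the target head's weights

# tau=1.0 is a hard copy: the target now equals the online weights.
polyak_update(online, target, tau=1.0)
hard_copy = list(target)

# Pretend an optimizer step changed the first online weight.
online = [2.0, -2.0, 0.5]

# tau=0.1 moves the target 10% of the way toward the online weights.
polyak_update(online, target, tau=0.1)
```

A small tau keeps the target slowly trailing the online head, which is the usual reason for a target copy in DQN-style training.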
VecDQNHead
mouse.models.heads.vec_dqn.VecDQNHead
VecDQNHead(in_features: int, max_num_actions: int, vec_dim: int, hidden_dim: int, num_layers: int, scale: float = 1.0, bias_scale: float | None = None, use_norm: bool = True)
Bases: BaseHeadWithTarget
SwiGLUHead paired with an EMA target copy and Polyak averaging.
Like DQNHead but each action produces a vec_dim-dimensional vector
instead of a single scalar. Output shape is [..., max_num_actions, vec_dim].
forward runs the online head. target_forward runs the target head
(no gradient tracking). Call polyak_update(tau) after each optimizer
step to soft-update the target: θ_target ← τ·θ_online + (1−τ)·θ_target.
To initialize the target, call polyak_update with tau=1.0: at τ = 1 the update
reduces to a hard copy of the online weights into the target.
Source code in mouse/models/heads/vec_dqn.py
vec_dqn_scores
mouse.models.heads.vec_dqn.vec_dqn_scores
Compute pairwise angular action scores from vec-DQN vectors.
For each pair of actions (i, a), computes the full signed angle
φ_a − φ_i via atan2(sin, cos), then sums over all i to give
action a's total angular lead.
The sin component is dot(rot90(vᵢ), vₐ) and the cos component is
dot(vᵢ, vₐ). Using both avoids the aliasing of a sin-only score,
which saturates at ±90° and folds back toward zero as the separation
approaches 180°. The atan2 score is monotone across the full (−π, +π)
range; aliasing occurs only if two action vectors rotate more than 180°
apart, twice the separation at which a sin-only score starts to alias.
For D = 2 (a single rotation plane) this is geometrically exact.
For D > 2 (RoPE with multiple planes) it is a well-conditioned
approximation; the D = 2 case is recommended for exact geometry.
Self-terms contribute atan2(0, 1) = 0 and require no masking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vecs | Tensor | | required |
Returns:
| Name | Type | Description |
|---|---|---|
| scores | Tensor | |
Source code in mouse/models/heads/vec_dqn.py
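For the D = 2 case, the pairwise score described above can be sketched in plain Python. This is an illustration under assumptions, not the library code: `angular_scores` is a hypothetical name, vectors are plain 2-tuples rather than tensors, and `rot90` is taken to be a counter-clockwise quarter turn so that the sin component has the sign of φ_a − φ_i.

```python
import math


def rot90(v):
    """Rotate a 2-D vector by +90 degrees (counter-clockwise, assumed convention)."""
    x, y = v
    return (-y, x)


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def angular_scores(vecs):
    """For each action a, sum the signed angle from every v_i to v_a."""
    scores = []
    for v_a in vecs:
        total = 0.0
        for v_i in vecs:
            sin_part = dot(rot90(v_i), v_a)  # proportional to sin(phi_a - phi_i)
            cos_part = dot(v_i, v_a)         # proportional to cos(phi_a - phi_i)
            total += math.atan2(sin_part, cos_part)  # self-term is atan2(0, >0) = 0
        scores.append(total)
    return scores


# Three unit vectors at 0, 30 and 60 degrees.
angles = (0.0, math.pi / 6, math.pi / 3)
vecs = [(math.cos(a), math.sin(a)) for a in angles]
scores = angular_scores(vecs)
```

With these inputs the action at 60° leads the other two by 60° and 30°, so it receives the largest score, while the action at 0° trails both and receives the smallest.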
rope_rotate
mouse.models.heads.vec_dqn.rope_rotate
Rotate each consecutive pair of dimensions in x by theta.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | | required |
| theta | Tensor | | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Rotated tensor of the same shape as |
Source code in mouse/models/heads/vec_dqn.py
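Rotating consecutive pairs of dimensions amounts to applying a 2×2 rotation matrix to each pair. The sketch below shows this on plain lists. The name `rotate_pairs` and the choice of one angle per pair are assumptions for illustration; the actual shape and broadcasting of theta in rope_rotate are not stated here.

```python
import math


def rotate_pairs(x, theta):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by theta[k] radians each."""
    if len(x) != 2 * len(theta):
        raise ValueError("x must hold exactly one pair per angle")
    out = []
    for k, angle in enumerate(theta):
        a, b = x[2 * k], x[2 * k + 1]
        c, s = math.cos(angle), math.sin(angle)
        # Standard 2-D rotation of the pair (a, b) by `angle`.
        out.extend([a * c - b * s, a * s + b * c])
    return out


# First pair rotated by 90 degrees, second pair by 180 degrees.
rotated = rotate_pairs([1.0, 0.0, 0.0, 2.0], [math.pi / 2, math.pi])
```

Up to floating-point rounding, the first pair (1, 0) becomes (0, 1) and the second pair (0, 2) becomes (0, −2); the output has the same length as the input.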
SwiGLUHead
mouse.models.heads.swiglu.SwiGLUHead
SwiGLUHead(in_features: int, out_features: int, hidden_dim: int, num_layers: int, scale: float = 1.0, use_norm: bool = True)
Bases: BaseHead
MLP head built from stacked SwiGLU blocks with a scaled output projection.
Architecture:
[RMSNorm →] SwiGLU(D→hidden) × (num_layers−1) → ScaledLinear(hidden→out)
The optional RMSNorm (use_norm=True) is applied to the input before
the first SwiGLU block. scale controls the output weight initialisation
magnitude; set it small (e.g. 0.01) for a near-zero initial output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_features | int | Input dimension | required |
| out_features | int | Output dimension (number of actions | required |
| hidden_dim | int | Width of the SwiGLU hidden layers. | required |
| num_layers | int | Total depth including the final linear; must be | required |
| scale | float | | 1.0 |
| use_norm | bool | Whether to prepend an | True |
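The data flow through the head can be sketched in plain Python. This is a minimal illustration under several assumptions, not the library's code: each SwiGLU block is taken to be the common form silu(W_gate·x) ⊙ (W_value·x), the RMSNorm step is omitted, weights are hand-written lists instead of learned tensors, and the effect of scale is modelled by multiplying the output weights once at construction. The helper names (`swiglu`, `head_forward`, `matvec`) are hypothetical.

```python
import math


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu(x, w_gate, w_value):
    """Gated unit: silu(W_gate x) * (W_value x), elementwise."""
    gate = matvec(w_gate, x)
    value = matvec(w_value, x)
    return [silu(g) * v for g, v in zip(gate, value)]


def head_forward(x, blocks, w_out):
    """Apply (num_layers - 1) SwiGLU blocks, then the final linear projection."""
    for w_gate, w_value in blocks:
        x = swiglu(x, w_gate, w_value)
    return matvec(w_out, x)


identity = [[1.0, 0.0], [0.0, 1.0]]
scale = 0.01
# Stand-in for scaled initialisation: shrink the output weights by `scale`.
w_out = [[scale * w for w in row] for row in [[1.0, 1.0]]]

# num_layers = 2 here: one SwiGLU block followed by the output projection.
out = head_forward([1.0, 2.0], blocks=[(identity, identity)], w_out=w_out)
```

Because the output weights start small, the head's initial output stays close to zero regardless of the hidden activations, which matches the stated purpose of a small scale.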