1T MoE · 32B active · 15.5T tokens · MuonClip · synthetic agentic data · RLVR + self-critique · #1 open-source on LMSYS Arena
Architecture: ultra-sparse MoE with Multi-Head Latent Attention (MLA). 384 experts, 8 activated per token (sparsity 48), plus 1 shared expert. Similar to DeepSeek-V3 but with more experts and fewer attention heads, trading a denser expert pool for cheaper inference.
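A minimal sketch of the top-k routing behind that sparsity figure (names and shapes are illustrative assumptions, not the K2 router): each token scores all 384 experts, keeps the top 8, and softmaxes only over those so the gates sum to 1.

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int = 8):
    """Hypothetical top-k MoE router sketch.

    router_logits: per-token scores over n_experts (here 384).
    Returns the k selected expert indices and their gate weights,
    softmaxed over the selected logits only.
    """
    idx = np.argsort(router_logits)[-k:]          # top-k expert ids
    g = np.exp(router_logits[idx] - router_logits[idx].max())
    return idx, g / g.sum()                       # gates sum to 1

# With 384 experts and k=8, only 8/384 of the expert FFN
# parameters run per token: the "sparsity 48" ratio above.
```

Only the selected experts' FFNs execute, which is why 1T total parameters cost ~32B active per token.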
MuonClip: vanilla Muon causes attention logits to explode past 1000 at scale, crashing training. QK-Clip rescales the query/key projection weights whenever a head's max attention logit exceeds τ=100; the intervention fires early in training and deactivates naturally as logits stabilize. Zero loss spikes across the full 15.5T-token run.
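The QK-Clip step can be sketched as a post-update weight rescale (a simplified per-head sketch under stated assumptions; the function name and signature are mine, not the paper's API):

```python
import numpy as np

def qk_clip(wq: np.ndarray, wk: np.ndarray,
            max_logit: float, tau: float = 100.0):
    """Hypothetical QK-Clip sketch: if the largest pre-softmax
    attention logit observed this step exceeds tau, shrink the
    query and key projection weights so the q·k product is pulled
    back toward tau. Splitting the factor as sqrt keeps q and k
    at comparable scale."""
    if max_logit > tau:
        gamma = (tau / max_logit) ** 0.5
        wq = wq * gamma
        wk = wk * gamma
    return wq, wk
```

Because the scaling only triggers when logits actually exceed τ, the clip is a no-op once training stabilizes, matching the "decays naturally" behavior above.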
Agentic data: 20,000+ synthetic tool specs evolved across domains, with multi-turn trajectories generated in simulated and real code-execution environments; no human demonstrations.
RL: combines verifiable rewards (RLVR) with a self-critique rubric reward. The model scores its own open-ended outputs using learned rubrics, extending RL beyond tasks with ground truth.
Benchmarks (non-thinking, all %): 65.8 SWE-Bench Verified · 66.1 Tau2-Bench · 49.5 AIME 2025 · 75.1 GPQA-Diamond.
Tags: MuonClip · QK-Clip · 20K+ tool specs · RLVR + self-critique · 65.8% SWE-Bench · open weights (MIT)