From KV cache to residual attention

Every paper in Kimi's stack

2024
Mooncake Jun 2024 · arXiv 2407.00079
Infrastructure
KV-centric disaggregated serving — separates prefill and decode clusters
The problem: long-context requests during prefill disrupt decoding in standard serving systems.

Fix: disaggregate prefill and decoding into separate GPU pools. Schedule around the KVCache, not around requests. Leverage idle CPU/DRAM/SSD as a distributed KV store.

Results: 59–498% throughput gain in simulations. In production on Kimi: 115% more requests handled on A800 clusters.

Later: powered K2 deployment on 128 H200s at 224K tokens/sec prefill. Won Best Paper at FAST 2025.
disaggregated prefill KVCache scheduling FAST 2025 Best Paper 525% throughput gain
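The KVCache-centric scheduling idea can be sketched in a few lines. Everything here is illustrative, not Mooncake's actual policy: `PrefillNode`, the block size, and the load penalty are assumptions; the point is only that the scheduler scores nodes by reusable cached prefix rather than by request count.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per KV-cache block (illustrative; real systems use larger blocks)

@dataclass
class PrefillNode:
    """A prefill worker advertising which prefix blocks it already caches."""
    name: str
    cached: set = field(default_factory=set)  # hashes of cached prefix blocks
    queue_len: int = 0                        # pending prefill work on this node

def prefix_block_hashes(tokens):
    """Hash each prefix-aligned block, so requests sharing a prefix share cache entries."""
    n = len(tokens) - len(tokens) % BLOCK
    return [hash(tuple(tokens[: i + BLOCK])) for i in range(0, n, BLOCK)]

def pick_prefill_node(nodes, tokens, queue_weight=1.0):
    """KV-centric choice: favor the node that can reuse the longest cached prefix,
    penalized by its queued work (queue_weight is an assumed tuning knob)."""
    hashes = prefix_block_hashes(tokens)
    def score(node):
        reused = 0
        for h in hashes:
            if h not in node.cached:
                break                          # reuse must be contiguous from block 0
            reused += 1
        return reused - queue_weight * node.queue_len
    return max(nodes, key=score)
```

A node with a warm prefix wins until its queue grows long enough that recomputing the prefix elsewhere is cheaper, which is the trade-off the real scheduler balances.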
MoBA 2024
Infrastructure
Mixture of Block Attention — solves quadratic scaling at 128K–2M token contexts
Standard attention is O(n²) in sequence length; at 2M tokens that's unusable. MoBA partitions the keys and values into blocks: each query scores blocks via a gate and attends only to its top-k most relevant blocks plus its own local block. The result: near-linear scaling in context length without sacrificing quality on long-document tasks. A direct prerequisite for Kimi's 2M-character context window.
O(n) context scaling sparse block attention 2M token context
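A minimal single-head sketch of the block-gating idea, with simplifying assumptions: no causal mask, mean-pooled keys as the block gate, and a Python loop instead of a fused kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moba_attention(q, k, v, block=4, top_k=2):
    """Block-sparse attention sketch in the spirit of MoBA (single head, no mask).

    Each query scores key blocks by their mean-pooled key, keeps the top_k blocks,
    and runs dense attention only over the kept keys: O(n * top_k * block), not O(n^2).
    """
    n, d = q.shape
    n_blocks = n // block
    kb = k[: n_blocks * block].reshape(n_blocks, block, d)
    vb = v[: n_blocks * block].reshape(n_blocks, block, d)
    centroids = kb.mean(axis=1)                  # (n_blocks, d) gate keys
    gate = q @ centroids.T                       # (n, n_blocks) block relevance
    keep = np.argsort(-gate, axis=1)[:, :top_k]  # selected blocks per query
    out = np.empty_like(q)
    for i in range(n):
        ks = kb[keep[i]].reshape(-1, d)          # gather keys of chosen blocks
        vs = vb[keep[i]].reshape(-1, d)
        w = softmax(q[i] @ ks.T / np.sqrt(d))
        out[i] = w @ vs
    return out
```

When `top_k` equals the number of blocks, this reduces exactly to dense attention (softmax is permutation-invariant over keys), which is a useful sanity check for sparse-attention implementations.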
Early 2025
Moonlight / Muon scaling 2025 · arXiv 2502.16982
Optimizer
Proved the Muon optimizer scales to LLM training — better token efficiency than AdamW
Muon (momentum orthogonalized via a Newton-Schulz iteration) was known to work well on small models. Moonlight proved it scales: on the same token budget, Muon reaches lower loss than AdamW, meaning each training token does more work. This became the foundation for MuonClip in K2. Moonshot also released state-of-the-art small-model checkpoints trained with Muon.
token efficiency Muon optimizer better than AdamW
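The core of Muon is the orthogonalization step: replace the momentum matrix by an approximation of the nearest orthogonal matrix (the UVᵀ of its SVD), computed with a cheap fixed-point iteration instead of an SVD. The quintic coefficients below are the ones used in public Muon implementations; `muon_step` is a simplified sketch (Nesterov momentum and per-shape scaling are omitted).

```python
import numpy as np

def ns_orthogonalize(g, steps=5):
    """Quintic Newton-Schulz iteration: pushes all singular values of g toward 1
    (an approximate UV^T from its SVD) using only matmuls, never an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)     # Frobenius-normalize: spectral norm <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                            # iterate on the short-fat orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update: momentum buffer, then orthogonalized direction."""
    momentum = beta * momentum + grad
    return param - lr * ns_orthogonalize(momentum), momentum
```

Orthogonalizing the update equalizes the scale of all directions in the weight matrix, which is the intuition behind Muon's better per-token efficiency.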
Kimi K1.5 Jan 22 2025 · arXiv 2501.12599
RL / Reasoning
RL scaling without MCTS, value functions, or process reward models — claimed o1-parity
Key insight: long-context scaling + improved policy optimization is sufficient. No Monte Carlo tree search. No value functions. No process reward models.

The paper showed that scaling the context window during RL training (letting the model see longer reasoning chains as rollouts) plus better policy gradient techniques produces o1-level reasoning in math, code, and multimodal tasks.

Also contains (Appendix B) the full data processing pipeline for the 15.5T token pre-training corpus used later in K2.
no MCTS long-context RL o1-parity claimed multimodal reasoning
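The value-function-free recipe can be made concrete with a toy advantage computation. This is a sketch, not K1.5's actual objective: the reward is a 0/1 final-answer verifier, the length penalty and its min-max normalization are illustrative stand-ins for the paper's length shaping, and the group mean serves as the baseline in place of a learned critic.

```python
import numpy as np

def group_advantages(correct, lengths, len_weight=0.1):
    """Value-free advantages in K1.5's spirit: no value net, no PRM, no MCTS.

    correct: bool per sampled reasoning chain (from a final-answer verifier).
    lengths: token count per chain; longer-than-necessary chains are penalized.
    """
    r = np.where(np.asarray(correct), 1.0, 0.0)
    L = np.asarray(lengths, dtype=float)
    # mild, normalized length penalty (len_weight is an illustrative choice)
    r = r - len_weight * (L - L.min()) / max(L.max() - L.min(), 1.0)
    return r - r.mean()                    # group mean replaces a value function
```

Each chain's advantage then weights its log-probability in an ordinary policy-gradient update; everything the algorithm needs is a verifier and a batch of rollouts.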
Mid 2025
Kimi-VL Apr 2025
Multimodal
16B MoE vision-language model — 3B active params, agentic capabilities
A 16B MoE vision-language model with only 3B active parameters per forward pass. Handles multimodal reasoning, long-context understanding, and agentic tasks with vision. Sets the foundation for vision integration that later scales into K2.5's MoonViT encoder.
MoE VLM 3B active params agentic vision
Kimi K2 — Open Agentic Intelligence Jul 28 2025 · arXiv 2507.20534
Core Model
1T MoE · 32B active · 15.5T tokens · MuonClip · synthetic agentic data · RLVR + self-critique · #1 open-source on LMSYS Arena
Architecture: ultra-sparse MoE with Multi-Head Latent Attention (MLA). 384 experts, 8 activated per token (sparsity 48). Similar to DeepSeek-V3 but optimized for inference.
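The expert-routing arithmetic behind "384 experts, 8 activated" is simple top-k selection. A minimal sketch, omitting everything beyond top-k plus softmax renormalization (load-balancing losses, shared experts, capacity limits):

```python
import numpy as np

def moe_route(h, router_w, top_k=8):
    """Top-k routing at K2's shape: 8 of 384 experts per token.
    h: (d,) token state; router_w: (n_experts, d) router weights."""
    logits = router_w @ h
    top = np.argsort(-logits)[:top_k]          # the activated experts
    g = np.exp(logits[top] - logits[top].max())
    return top, g / g.sum()                    # gates renormalized over the chosen 8

def moe_forward(h, router_w, experts, top_k=8):
    """Sparse forward: only the routed experts execute, so roughly
    top_k/n_experts of the expert parameters are active per token."""
    top, gates = moe_route(h, router_w, top_k)
    return sum(w * experts[i](h) for i, w in zip(top, gates))
```

The sparsity ratio 384/8 = 48 is why a 1T-parameter model runs with only ~32B active parameters per token.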

MuonClip: at scale, vanilla Muon lets attention logits explode past 1000, destabilizing training. QK-Clip rescales a head's query and key projections whenever its max attention logit exceeds τ=100; the intervention naturally becomes a no-op as training stabilizes. Zero loss spikes across the full 15.5T-token run.
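The clipping rule itself is tiny. A per-head sketch (MLA-specific details and the bookkeeping that tracks each head's max logit are omitted): because the logit is bilinear in W_q and W_k, scaling each by sqrt(τ/max) caps it near τ.

```python
import numpy as np

def qk_clip(w_q, w_k, observed_max_logit, tau=100.0):
    """QK-Clip sketch: applied after an optimizer step, per attention head.
    If the head's max attention logit exceeded tau, shrink its query and key
    projections by sqrt(tau / max) each, pulling q.k products back to ~tau."""
    if observed_max_logit <= tau:
        return w_q, w_k                    # no-op: clipping fades out by itself
    gamma = tau / observed_max_logit
    return w_q * np.sqrt(gamma), w_k * np.sqrt(gamma)
```

Splitting the correction evenly between the two projections keeps the query/key magnitudes balanced rather than crushing one side.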

Agentic data: 20,000+ synthetic tool specs, evolved across domains, with multi-turn trajectories generated in simulated + real code execution environments — no human demos.

RL: combines verifiable rewards (RLVR) with a self-critique rubric reward. The model scores its own open-ended outputs using learned rubrics, extending RL beyond tasks with ground truth.
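The two reward sources can be sketched as a simple routing function. `math_verifier` and the rubric averaging are illustrative assumptions, not K2's actual checkers; in the real system the rubric scores come from the model critiquing its own output.

```python
def math_verifier(expected):
    """RLVR-style programmatic checker (illustrative): exact final-answer match."""
    return lambda response: response.strip() == expected

def reward(response, verifier=None, rubric_scores=None):
    """Reward routing sketch: tasks with ground truth get a hard 0/1 verifier
    reward (RLVR); open-ended tasks fall back to the model's own rubric
    self-scores (assumed already extracted as numbers in [0, 1])."""
    if verifier is not None:
        return 1.0 if verifier(response) else 0.0
    return sum(rubric_scores) / len(rubric_scores)
```

The self-critique branch is what lets RL extend past domains where a programmatic verifier exists.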

Benchmarks (non-thinking): 65.8% SWE-Bench Verified, 66.1 Tau2-Bench, 49.5 AIME 2025, 75.1 GPQA-Diamond.
MuonClip QK-Clip 20K+ tool specs RLVR + self-critique 65.8% SWE-Bench open weights MIT
Late 2025
Kimi K2 Thinking Nov 6 2025
RL / Reasoning
Extended reasoning variant — $4.6M training cost, 256K context, NIST evaluated
Reportedly trained for ~$4.6M. Same 1T MoE base as K2, tuned for extended chain-of-thought reasoning. NIST/CAISI evaluated it as the most capable PRC-origin model at the time of release, though still below leading US models on cyber and math benchmarks. Notably uncensored in English; heavily censored in Chinese (similar to DeepSeek R1).
$4.6M training 256K context NIST evaluated extended reasoning
2026
Kimi K2.5 — Visual Agentic Intelligence Jan 2026 · arXiv 2602.02276
Multimodal
Native multimodal · MoonViT · Agent Swarm (100 sub-agents) · PARL · 4.5× latency reduction
Vision: continues pretraining K2 on 15T mixed visual + text tokens. MoonViT (400M param encoder) processes images and video. Vision and text improve together at scale — no capability trade-off.

Agent Swarm: instead of sequential tool calls, K2.5 dynamically spawns up to 100 sub-agents executing in parallel. Each sub-agent is domain-specific and gets its own task slice. The orchestrating model decomposes the task, routes, and merges results — all learned, no hard-coded workflow.
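The decompose/spawn/merge loop has a simple skeleton. In K2.5 all three steps are learned model behaviors; here they are plain callables standing in for them, and the thread pool stands in for whatever parallel execution backend serves the sub-agents.

```python
from concurrent.futures import ThreadPoolExecutor

def run_swarm(task, decompose, make_agent, merge, max_agents=100):
    """Orchestration sketch: split the task, run up to max_agents domain-specific
    sub-agents in parallel (each on its own task slice), then merge results.
    decompose, make_agent, and merge are stand-ins for learned behaviors."""
    subtasks = decompose(task)[:max_agents]
    with ThreadPoolExecutor(max_workers=min(len(subtasks), max_agents)) as pool:
        results = list(pool.map(lambda st: make_agent(st)(st), subtasks))
    return merge(results)
```

Usage with toy callables: `run_swarm("review parser lexer", str.split, lambda st: str.upper, " | ".join)` fans three "sub-agents" out over the three words and joins their outputs. The latency win comes from the fan-out: wall-clock time is set by the slowest sub-agent, not the sum of sequential tool calls.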

PARL (Parallel-Agent RL): new RL paradigm that trains the model to do agent decomposition and delegation as a learnable skill. Rewards flow back through the swarm.

Results: 76.8% SWE-Bench Verified, 85.0% LiveCodeBench v6, 4.5× latency reduction vs single-agent.
MoonViT 400M 100 sub-agents PARL 76.8% SWE-Bench 4.5× faster open weights
Attention Residuals (AttnRes) Mar 16 2026 · arXiv 2603.15031
Architecture
Replaces fixed residual connections with depth-wise softmax attention — first change to residuals since ResNets (2015)
Problem: standard PreNorm residuals add every layer's output to the stream with fixed unit weights. The stream's norm therefore grows unboundedly with depth, and each early layer's relative contribution is progressively diluted.

Fix (AttnRes): replace the fixed sum with softmax attention over preceding layer outputs. Each layer learns input-dependent weights over its own history — like a transformer over depth instead of over tokens.

Block AttnRes: attending to all L previous layers is O(Ld) memory. Block version groups layers into N blocks and attends only to block representations — O(Nd). With 8 blocks, recovers ~95% of full AttnRes gains.
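Both variants reduce to a softmax over depth. A sketch under stated assumptions: `query` here is a single precomputed vector (in AttnRes the scores are input-dependent, derived from the current hidden state by learned projections), and mean pooling is an assumed choice for the block summaries.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attnres_combine(history, query):
    """AttnRes sketch: replace the fixed residual sum with softmax attention
    over all preceding layer outputs -- a transformer over depth, not tokens.
    history: list of layer outputs, each (d,); query: (d,) depth-query vector."""
    H = np.stack(history)                  # (L, d): one "token" per layer
    weights = softmax(H @ query)           # learned, nonuniform mix over depth
    return weights @ H                     # replaces the unit-weight sum

def block_attnres_combine(history, query, n_blocks=8):
    """Block AttnRes sketch: pool layers into n_blocks summaries first,
    cutting the O(L*d) history cost to O(N*d)."""
    H = np.stack(history)
    blocks = np.array_split(H, min(n_blocks, len(H)))
    B = np.stack([b.mean(axis=0) for b in blocks])   # (N, d) block summaries
    weights = softmax(B @ query)
    return weights @ B
```

Because the weights are a softmax, the combined state is a convex combination of layer outputs, so the stream's scale stays bounded regardless of depth, which is exactly the dilution problem the fixed sum suffers from.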

Results: matches baseline performance with 1.25× less compute. Validated on Kimi Linear (48B total / 3B active) trained on 1.4T tokens.
depth-wise attention 1.25× compute saving Block AttnRes replaces PreNorm residuals Kimi Linear validated