
Two “Chinese” Research Labs That Got Me Hooked Lately

StepFun and Moonshot AI have the two most coherent research arcs in Chinese AI right now. Here’s why they matter and what connects the dots.

Carol Zhu · March 2026 · 8 min read

StepFun: Every Paper Is About Cost

StepFun’s research story is fundamentally about cost — every major paper is either asking “how do we make this cheaper to decode” or “how do we predict training costs before we spend them.” The Farseer + Step Law papers (Parts I and II of “Predictable Scale”) are their version of Moonshot’s Muon work — infrastructure science that makes the core model training tractable.
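At its simplest, the "predict before you spend" idea means fitting a scaling curve to cheap pilot runs and extrapolating to the big one. Here is a toy sketch of that workflow; the power-law form and every number in it are invented for illustration, not Farseer's or Step Law's actual fits:

```python
import numpy as np

# Toy "predict before you spend" fit: assume loss follows a power law
# L(C) = a * C**(-b) in training compute C. The functional form and
# all numbers below are invented, not Farseer's or Step Law's.
compute = np.array([1e18, 1e19, 1e20, 1e21])   # pilot-run budgets (FLOPs)
loss = np.array([3.2, 2.8, 2.45, 2.15])        # made-up pilot losses

# A power law is linear in log-log space, so a degree-1 polyfit
# recovers (-b, log a).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate to a 10x larger run before paying for it.
predicted = a * (1e22) ** (-b)
print(f"predicted loss at 1e22 FLOPs: {predicted:.2f}")
```

The point is the shape of the workflow, not the numbers: small runs buy you a curve, and the curve tells you whether the expensive run is worth it.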

The MFA → Step-3 → Step 3.5 Flash architecture thread is tighter than it looks: MFA (Dec 2024) resolves the KV cache / expressiveness tradeoff → AFD in Step-3 (Jul 2025) solves the compute disaggregation problem → Step 3.5 Flash (Feb 2026) adds MTP-3 and the new MIS-PO RL algorithm to reach frontier performance with just 11B active parameters.
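To see why the KV cache is the lever this whole thread pulls on, a back-of-envelope memory estimate helps. The function and all configuration numbers below are illustrative, not Step-3's actual architecture:

```python
# Back-of-envelope KV cache cost per token -- the pressure that
# MFA-style attention variants target. Numbers are illustrative,
# not Step-3's actual configuration.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim values per layer
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

full_mha = kv_bytes_per_token(n_layers=60, n_kv_heads=48, head_dim=128)
reduced = kv_bytes_per_token(n_layers=60, n_kv_heads=4, head_dim=128)
print(f"{full_mha // reduced}x smaller cache per token")
```

Multiply the per-token figure by a 128K-token context and a batch of concurrent requests, and it becomes clear why decode cost is dominated by KV cache traffic, and why shrinking it without losing expressiveness is worth a paper.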

The new MIS-PO algorithm in Step 3.5 Flash is worth watching. It replaces continuous importance weighting with discrete distributional filtering at both the token and trajectory levels, substantially reducing gradient variance while preserving the effective learning signal. That is a real contribution to off-policy RL stability at scale.
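Paper details aside, the core contrast between continuous importance weighting and a discrete keep/drop filter shows up in a toy simulation. This is a sketch of the general idea only, not MIS-PO itself; the distributions and the trust-band threshold are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy off-policy setup: log-prob ratios between current and behavior
# policy, plus per-token advantages. Both distributions are invented.
log_ratios = rng.normal(0.0, 1.0, size=10_000)
advantages = rng.normal(0.0, 1.0, size=10_000)

# Continuous importance weighting: each token's gradient term is
# scaled by exp(log_ratio), whose heavy right tail inflates variance.
cont_terms = np.exp(log_ratios) * advantages

# Discrete filtering (the general idea, not StepFun's algorithm):
# keep a token at unit weight only if its ratio sits inside an
# invented trust band, and drop the rest entirely.
keep = np.abs(log_ratios) < 0.5
disc_terms = np.where(keep, advantages, 0.0)

print(cont_terms.var() > disc_terms.var())  # filtering cuts variance
```

The tradeoff is bias for variance: the filter throws away signal outside the band, betting that what survives is a cleaner gradient than the importance-weighted average.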

Moonshot AI (Kimi): From KV Cache to Residual Attention

Moonshot’s arc is different but equally coherent. Where StepFun obsesses over cost, Moonshot obsesses over architecture — finding structural improvements that compound across the stack.

The thread runs: Mooncake (Jun 2024) disaggregates serving to handle long-context workloads efficiently → Moonlight proves Muon scales to LLM training → K1.5 shows you can get o1-level reasoning without MCTS or value functions → K2 combines MuonClip + synthetic agentic data + self-critique RL to become the #1 open-source model on LMSYS Arena → K2.5 introduces Agent Swarm (100 parallel sub-agents) and PARL → and their latest, AttnRes, replaces the fixed residual connection that has gone unchanged since ResNet in 2015.
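For context on what "replacing the residual connection" even means: a standard transformer block computes y = x + f(x), with the identity path hard-wired. AttnRes's actual formulation isn't described here; a gated mix is one illustrative family of alternatives, sketched below with a hypothetical mixing weight:

```python
import numpy as np

def fixed_residual(x, f):
    # The standard residual connection, unchanged since ResNet (2015):
    # y = x + f(x), with the identity path hard-wired.
    return x + f(x)

def gated_residual(x, f, alpha):
    # One illustrative alternative: a learned mixing weight between the
    # identity path and the sublayer output. Hypothetical only; this is
    # not AttnRes's actual formulation.
    return alpha * x + (1.0 - alpha) * f(x)

x = np.ones(4)
f = lambda v: 2.0 * v              # stand-in for an attention/MLP sublayer
print(fixed_residual(x, f))        # x + f(x) = 3.0 everywhere
print(gated_residual(x, f, 0.25))  # 0.25*x + 0.75*f(x) = 1.75 everywhere
```

Because the residual path appears in every layer, even a small structural change there compounds across the whole depth of the network, which is presumably why it's worth revisiting a decade-old default.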

The K2 → K2.5 jump is particularly interesting: instead of scaling the single-agent paradigm, they introduced a learned decomposition-and-delegation system in which the model spawns domain-specific sub-agents that execute in parallel. PARL trains this as a learnable skill end to end, with rewards flowing back through the swarm. The result: 76.8% on SWE-Bench Verified at 4.5× lower latency than the single-agent baseline.
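Mechanically, decompose-and-delegate is a fan-out/fan-in pattern. A minimal sketch with a stubbed sub-agent (every name here is hypothetical, each sub-agent would be a model call in a real system, and the key part of K2.5 as described, learning the decomposition itself, is not captured here):

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    # Hypothetical stub: in a real swarm this would be a full
    # domain-specific model rollout on its delegated subtask.
    return f"result({task})"

def orchestrate(subtasks: list[str], max_agents: int = 100) -> list[str]:
    # Fan out subtasks to parallel sub-agents, fan results back in.
    # The cap mirrors the "100 parallel sub-agents" figure from K2.5.
    workers = min(max_agents, len(subtasks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sub_agent, subtasks))

print(orchestrate(["locate", "patch", "test"]))
```

The latency win falls out of the structure: subtasks that a single agent would work through serially run concurrently, so wall-clock time approaches that of the slowest subtask rather than the sum of all of them.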

Why These Two

What sets both labs apart isn’t just the quality of individual papers — it’s the coherence. Each paper clearly builds on the last. StepFun’s architecture decisions flow directly from their cost thesis; Moonshot’s flow from their conviction that structural innovation compounds faster than scale alone.

In a landscape where most labs publish a grab bag of benchmarks and call it a day, these two are telling research stories you can actually follow.

Carol Zhu

CEO and Co-Founder of DecodeNetwork.AI. Previously launched products at AWS AI, TikTok, and Credit Karma. Writes about AI research, infrastructure, and the systems that shape how we experience intelligence.
