Founded Apr 2023 · ex-Microsoft

From dense models to intelligence density

Shanghai "AI Tiger" — 11 base models, multimodal-first since day one

Tracks: Core LLM · Architecture · Multimodal · Scaling science · Audio · Image gen / edit · Agent
2024
Step-1 + Step-1V Mar–Jul 2024
Core LLM
Dense LLM + 100B multimodal VLM — first public StepFun models; Step-1V topped OpenCompass in China, GPT-4V comparable
Step-1: initial text LLM series (step-1-8k, step-1-32k variants), serving as the base for StepFun's API platform and the "Yuewen" consumer app.

Step-1V: 100B multimodal model integrating vision. Ranked #1 on OpenCompass multimodal leaderboard in China at launch. Performance described as comparable to GPT-4V on image understanding and visual reasoning tasks.

This was StepFun's opening declaration: build multimodal-first, not text-first-then-add-vision. The API platform launched alongside; multimodal API usage grew 45× in H2 2024 alone.
100B multimodal · #1 OpenCompass CN · GPT-4V comparable · multimodal-first strategy
Step-2 Jul 2024 (preview Mar 2024) · WAIC
Core LLM
First trillion-parameter MoE model from a Chinese startup — 5th globally on LiveBench, #1 in China
What it was: a MoE language model with over 1T parameters — the first trillion-parameter model from a Chinese startup. Previewed at a developer conference in March 2024 and officially released at the World AI Conference (WAIC) in July alongside Step-1.5V and Step-1X.

Benchmarks (Nov 2024, Step-2-16k): ranked 5th globally on LiveBench (behind OpenAI o1-mini, GPT-4o, Gemini-1.5-pro, and Claude). Scored 86.57 on Instruction Following — highest in the world at the time.

Significance: showed Chinese startups could compete at trillion-parameter scale. Step-2 also began the MoE architecture exploration that later fed directly into Step-3's design.
1T+ params MoE · 5th globally LiveBench · #1 China LLM · 86.57 IF score · first Chinese 1T startup
Step-1.5V + Step-1X Jul 2024
Multimodal
Step-1.5V: enhanced video understanding VLM · Step-1X: image generation model
Released together at WAIC 2024 alongside Step-2. Step-1.5V improved on Step-1V with significantly stronger video understanding and perception. Step-1X is StepFun's image generation model — the foundation of the image generation/editing track that later led to Step1X-Edit and NextStep-1.
video understanding · image generation · multimodal suite
MFA — Multi-Matrix Factorization Attention Dec 2024 · arXiv 2412.19255 · ACL 2025
Architecture
Novel attention that scales head count + dimensions under strict KV cache budgets — later becomes the core of Step-3
Problem: MQA/GQA/MLA all reduce KV cache by compressing keys and values, which hurts model capacity. There's a fundamental tension between memory savings and attention expressiveness.

MFA insight: the number and dimension of attention heads are the critical capacity factors — not just the compression ratio. MFA scales both head count and dimension efficiently within the same KV cache budget, approaching MHA's theoretical upper bound.

MFA-KR (Key-Reuse variant): extends this with key reuse across heads for even better parameter efficiency.

Published at ACL 2025. This paper is explicitly cited as the architectural foundation for Step-3's attention mechanism.
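The capacity-vs-cache trade-off can be made concrete with a toy KV-cache accounting. All head counts and dimensions below are illustrative choices, not MFA's actual configuration:

```python
# Illustrative KV-cache cost (bytes per token per layer) for different
# attention variants. The point: the same cache budget can be spent on
# many small K/V heads (GQA) or on fewer, larger dimensions (MFA-style).

def kv_bytes_per_token(num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """K and V each store num_kv_heads * head_dim values per token."""
    return 2 * num_kv_heads * head_dim * dtype_bytes

# MHA: every query head has its own K/V head.
mha = kv_bytes_per_token(num_kv_heads=32, head_dim=128)   # 16384 bytes

# GQA: 32 query heads share 8 K/V heads -> 4x smaller cache.
gqa = kv_bytes_per_token(num_kv_heads=8, head_dim=128)    # 4096 bytes

# MFA-style allocation: identical cache budget, but spent on a single
# shared K/V head with a larger dimension, freeing the query side to
# scale head count and per-head dimension.
mfa = kv_bytes_per_token(num_kv_heads=1, head_dim=1024)   # 4096 bytes

assert gqa == mfa  # same memory budget, different capacity allocation
```

The design question MFA answers is how to allocate a fixed cache budget so that attention expressiveness approaches MHA's upper bound.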
head count + dim scaling · KV cache efficient · better than MLA at budget · ACL 2025
Early 2025
Predictable Scale Part I — Step Law Mar 2025 · arXiv 2503.04715
Scaling science
Optimal hyperparameter scaling law for LLM pretraining — predicts learning rate, batch size across scales
Standard practice is to tune hyperparameters at small scale and hope they transfer. StepFun showed this is unreliable — the optimal learning rate and batch size both depend on model size in a predictable way. "Step Law" provides a formula to compute optimal hyperparameters directly from scale, skipping expensive grid searches. Part I of a two-paper series.
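As a sketch of what such a law looks like, the power-law form below computes hyperparameters directly from model size N and token count D. The coefficients and exponents are placeholders for illustration, not the paper's fitted values:

```python
# Illustrative hyperparameter scaling law: optimal learning rate and
# batch size as power laws in model size and data size, replacing a
# grid search with a closed-form prediction. Coefficients are
# placeholders, NOT Step Law's published fit.

def optimal_lr(n_params: float, n_tokens: float,
               c: float = 1.79, a: float = -0.713, b: float = 0.307) -> float:
    """Larger models want smaller LRs (a < 0); more data raises it (b > 0)."""
    return c * n_params**a * n_tokens**b

def optimal_batch_tokens(n_tokens: float,
                         c: float = 0.58, b: float = 0.571) -> float:
    """Optimal batch size (in tokens) grows sublinearly with data size."""
    return c * n_tokens**b

# A 1B-param model trained on 100B tokens (illustrative scales):
lr = optimal_lr(1e9, 100e9)
bs = optimal_batch_tokens(100e9)
```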
optimal LR scaling · batch size prediction · no grid search needed
Step-Audio Feb 2025
Audio
First production-ready open-source end-to-end speech model — multilingual, emotional tones, dialects, rap
End-to-end audio understanding and generation. Supports Chinese, English, Japanese; emotional tones (joy, sadness); regional dialects (Cantonese, Sichuanese); adjustable speech rates; prosodic styles including rap. Unified model for both comprehension and generation — not a separate ASR + TTS pipeline. Also released StepEval-Audio-360, a multi-turn audio benchmark.
end-to-end · multilingual · emotional TTS · open-source
Step1X-Edit Apr 2025 · arXiv 2504.17761
Image gen / edit
Unified image editing model — object add/remove, background swap, style transfer via natural language
A unified diffusion-based model for general image editing from natural language. Handles object addition/removal, background modification, action changes, and style transfer. Released GEdit-Bench alongside for evaluation. A later update (v1p1) added text-to-image generation; v1p2 (Nov 2025) added native reasoning — a thinking-editing-reflection loop that uses MLLM world knowledge to handle complex instructions.
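The v1p2 loop structure can be sketched as follows, with every component stubbed out; only the think → edit → reflect control flow comes from the description above:

```python
# Sketch of a thinking-editing-reflection loop for instruction-guided
# image editing. All components are hypothetical stubs; in the real
# system the "think" and "reflect" steps would call an MLLM.

def think(instruction: str) -> str:
    """Stub reasoning step: turn the instruction into an edit plan."""
    return f"plan: {instruction}"

def edit(image: list, plan: str) -> list:
    """Stub editor: record the applied plan on the image state."""
    return image + [plan]

def reflect(image: list, instruction: str) -> bool:
    """Stub reflection: check whether the instruction was satisfied."""
    return any(instruction in step for step in image)

def edit_with_reasoning(image: list, instruction: str, max_rounds: int = 3) -> list:
    for _ in range(max_rounds):           # retry until reflection passes
        plan = think(instruction)
        image = edit(image, plan)
        if reflect(image, instruction):
            break
    return image

out = edit_with_reasoning([], "remove the lamp")
```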
unified editing · natural language control · GEdit-Bench · open-source
Mid 2025
Farseer — Predictable Scale Part II Jun 2025 · arXiv 2506.10972
Scaling science
Refined loss surface model L(N,D) — significantly better fit than Chinchilla's law across all scales
Chinchilla's scaling law models loss as a simple power law in N and D separately. Farseer constructs a full 2D loss surface L(N,D), showing the interaction terms matter significantly. The result: dramatically better prediction of what a large-scale run will achieve before spending the compute. Critical for StepFun to make confident training decisions for Step-3 without expensive ablations.
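The difference can be sketched by comparing an additive Chinchilla-style fit against a surface with an N-D interaction term. The Chinchilla coefficients below are the published fit from that paper; the interaction term and its coefficient are purely illustrative, not Farseer's actual parameterization:

```python
# Additive power-law fit (Chinchilla) vs. a toy 2D surface where the
# effective data exponent depends on model size, so N and D interact.
import math

def chinchilla_loss(n, d, e=1.69, a=406.4, alpha=0.34, b=410.7, beta=0.28):
    """Chinchilla's published form: loss separates into N and D terms."""
    return e + a / n**alpha + b / d**beta

def interacting_loss(n, d, e=1.69, a=406.4, alpha=0.34, b=410.7, beta=0.28,
                     gamma=0.02):
    """Toy interaction: data exponent drifts with model scale (illustrative)."""
    beta_eff = beta + gamma * math.log10(n)
    return e + a / n**alpha + b / d**beta_eff

# At a Chinchilla-like operating point (7B params, 1.4T tokens),
# the two fits predict different losses for the same (N, D):
l_add = chinchilla_loss(7e9, 1.4e12)
l_int = interacting_loss(7e9, 1.4e12)
```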
beats Chinchilla law · 2D loss surface · scale-predictive
Step-Audio 2 Jul–Aug 2025
Audio
End-to-end multimodal LLM for audio — latent space encoder + audio RL, tool calling, paralinguistic reasoning
Upgraded architecture: integrates a latent space audio encoder with audio reinforcement learning. Adds tool calling capability directly from audio input. Understands paralinguistic information (emotion, tone, speaking style) not just speech content. Step-Audio 2 mini released Aug 2025; Step-Audio 2 mini Think added test-time compute scaling for audio reasoning in Sept 2025. Released StepEval-Audio-Paralinguistic and StepEval-Audio-Toolcall benchmarks.
latent audio encoder · audio RL · tool calling from audio · paralinguistic understanding
Step-3 — Cost-Effective Multimodal Intelligence Jul 2025 · arXiv 2507.19427
Core LLM
321B MoE (38B active) · MFA attention · AFD disaggregated inference · designed for lowest decoding cost per token
Core thesis: hardware-aware model-system co-design. Most labs optimize model architecture and serving separately — Step-3 designs them together from scratch.

MFA (attention): from the Dec 2024 paper. Reduces KV cache while maintaining higher attention expressiveness than MLA. Key insight: don't just compress — scale the right dimensions.

AFD (Attention-FFN Disaggregation): physically separates attention and FFN into different GPU subsystems. Attention runs on one pool, FFN on another, streaming in parallel via RDMA. This lets each subsystem be optimized for its own compute pattern — attention is memory-bandwidth bound, FFN is compute bound.

Result: lower theoretical decoding cost than DeepSeek-V3 and Qwen3 MoE 235B despite activating 38B params (more than both). The gains widen at longer context. 5B vision encoder for multimodal, 4T token multimodal dataset.

Open-sourced under Apache 2.0. Also released StepMesh (distributed training library).
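A back-of-envelope roofline model shows why disaggregation pays off. All hardware and model numbers below are illustrative, and the real AFD pipeline overlaps microbatches via RDMA rather than taking a single max():

```python
# Toy decode-step timing: attention is bandwidth-bound (streams the
# KV cache), FFN is compute-bound (matmul FLOPs). Hardware figures
# are hypothetical round numbers, not any specific accelerator.

HBM_BW = 3.0e12       # bytes/s of memory bandwidth (illustrative)
PEAK_FLOPS = 1.0e15   # FLOP/s of matmul throughput (illustrative)

def attn_time_s(context_len: int, kv_bytes_per_token: int) -> float:
    """Bandwidth-bound: one decode step reads the whole KV cache once."""
    return context_len * kv_bytes_per_token / HBM_BW

def ffn_time_s(active_params: float) -> float:
    """Compute-bound: roughly 2 FLOPs per active parameter per token."""
    return 2 * active_params / PEAK_FLOPS

a = attn_time_s(context_len=128_000, kv_bytes_per_token=4096)
f = ffn_time_s(active_params=38e9)

# Colocated: both phases serialize on the same GPUs -> a + f per step.
colocated = a + f
# AFD: separate pools stream in parallel -> bounded by the slower phase,
# and each pool can be provisioned for its own bottleneck.
afd = max(a, f)
assert afd < colocated
```

Because the attention term grows with context length while the FFN term does not, the simple model also reproduces the claim that AFD's advantage widens at longer context.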
MFA attention · AFD disaggregation · cheaper than DSv3 · 38B active / 321B total · 4T multimodal tokens · Apache 2.0
NextStep-1 Aug 2025 · arXiv 2508.10711
Image gen / edit
Autoregressive image generation with continuous tokens — rivals diffusion models, patch-by-patch generation
Most AR image models use discrete VQ tokens or rely on a heavy diffusion backbone. NextStep-1 generates continuous image patches autoregressively using a lightweight flow matching model per patch — pure next-token-prediction, no diffusion. Achieves state-of-the-art results on T2I generation and supports instruction-guided editing. Trained on 550M high-quality image-text pairs with Step-1o-turbo used for re-captioning.
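A toy sketch of this generation loop, with a tiny stand-in velocity field in place of the real per-patch flow model:

```python
# Toy autoregressive generation of continuous image patches: each patch
# starts from Gaussian noise and is produced by Euler-integrating a
# flow, conditioned on previously generated patches. The velocity
# function is a hypothetical stand-in, not NextStep-1's network.
import random

PATCH_DIM, NUM_PATCHES, FLOW_STEPS = 4, 3, 8

def velocity(x, t, history):
    """Stand-in velocity field: pull x toward a history-dependent target."""
    target = [0.1 * len(history)] * PATCH_DIM
    return [ti - xi for xi, ti in zip(x, target)]

def sample_patch(history):
    x = [random.gauss(0, 1) for _ in range(PATCH_DIM)]  # start from noise
    dt = 1.0 / FLOW_STEPS
    for step in range(FLOW_STEPS):                      # Euler integration
        v = velocity(x, step * dt, history)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

patches = []
for _ in range(NUM_PATCHES):
    # Pure next-token prediction over patches: no diffusion process
    # over the whole image, only a lightweight flow per patch.
    patches.append(sample_patch(patches))
```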
AR + flow matching · no heavy diffusion backbone · continuous tokens · 550M training pairs
Late 2025
Step-DeepResearch Dec 2025 · arXiv 2512.20491
Agent
End-to-end deep research agent — ReAct loop, atomic capability internalization, 600+ curated authority sources
Unlike OpenAI/Gemini Deep Research which hardcode workflow patterns externally, Step-DeepResearch internalizes research capabilities into the model via end-to-end training. Atomic capabilities — planning, search, cross-validation, report generation — are trained as model-level behaviors, not orchestration wrappers.

Infrastructure: 600+ curated authoritative sites indexed separately from low-quality SEO content. 20M+ high-quality items in a dedicated knowledge library. Paragraph-level indexing for dense retrieval.

Training pipeline: Agentic Mid-Training → SFT → RL, shifting objective from "predict next token" to "decide next atomic action."
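The model-internalized loop can be sketched as a plain ReAct cycle over atomic actions; the policy and action names below are hypothetical stubs, not Step-DeepResearch's actual interface:

```python
# Minimal ReAct-style loop over atomic research actions. In the real
# system the policy is the trained model choosing the next atomic
# action; here it is a hardcoded stub that walks a fixed order.

def policy(state):
    """Stub for the trained model: decide the next atomic action."""
    order = ["plan", "search", "cross_validate", "write_report"]
    done = [action for action, _ in state]
    for action in order:
        if action not in done:
            return action
    return "finish"

def execute(action, query):
    """Stub tool execution: return a placeholder observation."""
    return f"<result of {action} for {query!r}>"

def deep_research(query, max_steps=10):
    state = []                      # (action, observation) trajectory
    for _ in range(max_steps):
        action = policy(state)      # "decide next atomic action"
        if action == "finish":
            break
        state.append((action, execute(action, query)))
    return state

trace = deep_research("history of MoE models")
```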
end-to-end trained · atomic capability RL · 600+ authority sources · not workflow-orchestrated
2026
Step 3.5 Flash — Frontier Intelligence, 11B Active Feb 2026 · arXiv 2602.10604
Core LLM
196B MoE (11B active) · MTP-3 multi-token prediction · MIS-PO RL · 100–350 tok/s · rivals GPT-5.2 and Gemini 3 Pro
Architecture: 196B total, 11B active per token. 288 routed experts + 1 always-active shared expert. 3:1 sliding-window/full attention ratio. 256K context.

MTP-3: Multi-Token Prediction head predicts 4 tokens per forward pass via a sliding-window + dense FFN module. Enables 100–300 tok/s in production (350 tok/s peak on coding).

MIS-PO (new RL algorithm): Metropolis Independence Sampling-Filtered Policy Optimization. Replaces continuous importance weighting with discrete distributional filtering at both token and trajectory level. Reduces gradient variance under large-scale off-policy training while preserving signal. Enables stable RL at scale.
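A generic Metropolis independence filter illustrates the idea of discrete accept/reject in place of continuous weights; this is the underlying sampling primitive, not MIS-PO's actual algorithm:

```python
# Filtering off-policy samples with a Metropolis independence chain:
# importance ratios are used only to accept or reject (a 0/1 decision),
# never as multiplicative gradient weights, so no single sample can
# blow up the gradient variance.
import math
import random

def mis_filter(samples, log_ratio, seed=0):
    """log_ratio(x) = log p_target(x) - log p_behavior(x), unnormalized."""
    rng = random.Random(seed)
    cur, chain = samples[0], []
    for cand in samples[1:]:
        # Independence-sampler step: accept with min(1, w(cand)/w(cur)).
        if rng.random() < min(1.0, math.exp(log_ratio(cand) - log_ratio(cur))):
            cur = cand
        chain.append(cur)   # each kept sample enters training with weight 1
    return chain

# Toy check: behavior policy is uniform over 0..9, target policy
# (up to a constant) favors larger values.
rng = random.Random(1)
behavior = [rng.randrange(10) for _ in range(2000)]
chain = mis_filter(behavior, log_ratio=lambda x: 0.3 * x)

mean_raw = sum(behavior) / len(behavior)
mean_filtered = sum(chain) / len(chain)
# The filtered chain skews toward the target distribution.
```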

Benchmarks: 85.4% IMO-AnswerBench, 86.4% LiveCodeBench-v6, 88.2% Tau2-Bench, 69.0% BrowseComp, 74.4% SWE-Bench Verified. Comparable to GPT-5.2 xHigh and Gemini 3.0 Pro on most tasks.

Also introduced Step-GUI (on-device GUI agent) for edge-cloud collaboration: Step 3.5 Flash as cloud brain, Step-GUI as on-device hand.
11B active / 196B total · MTP-3 · MIS-PO RL · 350 tok/s peak · 74.4% SWE-Bench · Step-GUI edge agent