
The Missing Layer in Voice AI

Why the layer between hearing and reasoning remains unbuilt.

Carol Zhu · CEO & Founder, DecodeNetwork · 12 min read · Voice AI & Cognition · Published February 2026

The voice AI industry has made extraordinary progress on everything it knows how to measure. The problem is that the thing that matters most doesn't have a metric yet.

The Architecture That Won (And What It Can't Do)

The dominant voice AI architecture is the cascade: ASR → LLM → TTS. Speech comes in, gets transcribed, gets reasoned about, gets spoken back. Modular. Debuggable. Each component improvable independently. This architecture won for good reasons. But it has a structural flaw that no amount of component improvement can fix: it destroys paralinguistic information at the ASR→text boundary.

When a customer says “that sounds great,” the cascade passes along a short string of text. What it loses: whether those words carried genuine enthusiasm or resigned acceptance. Whether the speaker's voice tightened slightly, suggesting doubt they haven't articulated. Whether there was a micro-pause before “great” — the kind of hesitation a skilled negotiator would immediately probe. The LLM never sees any of this. It reasons over an impoverished representation of what was actually communicated.
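The bottleneck is easy to see in code. The sketch below is illustrative, not a real API: the component functions are hypothetical stand-ins, and the point is only that nothing but a string crosses the ASR boundary, so two very different deliveries of the same words become indistinguishable downstream.

```python
# Minimal sketch of the cascade's information bottleneck.
# asr/llm are hypothetical stand-ins, not real model calls.

def asr(audio_frame: dict) -> str:
    # A real ASR system returns only text; pitch variance, pauses,
    # and breath in the audio never cross this boundary.
    return audio_frame["words"]

def llm(text: str) -> str:
    # The LLM reasons over text alone.
    return f"Noted: {text}"

def cascade(audio_frame: dict) -> str:
    return llm(asr(audio_frame))

# Two very different utterances collapse to the same downstream input.
enthusiastic = {"words": "that sounds great", "pitch_var": 0.9, "pause_ms": 0}
resigned     = {"words": "that sounds great", "pitch_var": 0.1, "pause_ms": 400}

assert cascade(enthusiastic) == cascade(resigned)  # paralinguistics destroyed
```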

The End-to-End Promise

The alternative — end-to-end models that process audio tokens directly — promises to preserve this information. OpenAI's voice mode and Kyutai's Moshi represent this direction. But end-to-end models introduce their own problems. Reasoning quality degrades in audio-token-space versus text-token-space. Debuggability disappears. Safety and alignment become harder without an intermediate text layer to inspect.

A February 2025 paper, “The Cascade Equivalence Hypothesis,” found that most current open-source speech LLMs are behaviorally equivalent to simple ASR→LLM pipelines — a cascade in disguise, with the same information bottleneck and none of the modularity. What's happening inside proprietary systems at OpenAI, Google DeepMind, or other major labs is unknown.

The Token Problem: We Don't Have the Right Categories

The field divides speech tokens into “acoustic tokens” (optimized for waveform reconstruction) and “semantic tokens” (optimized for linguistic meaning). A comprehensive late-2025 review revealed that “semantic tokens” are actually more phonetic than semantic — they capture sound patterns, not meaning. We don't have the right vocabulary to describe what “understanding” audio means at the token level.

Key finding

When text and audio emotion cues conflict — neutral words in an angry tone — model accuracy collapses to 37–43%, near random. For purely paralinguistic scenarios with no lexical cues: 15–22%, barely above chance.

Recent benchmarks quantify how badly current models fail at paralinguistic understanding. LISTEN (October 2025) and EMIS confirmed: when speech models are explicitly prompted to ignore word meaning and focus on acoustic emotion, they still predict based on word-level labels over 80% of the time. They literally cannot separate what was said from how it was said.

The most practically advanced work is narrow: AssemblyAI's Universal-Streaming model trains on a special end-of-turn token, replacing crude silence-threshold approaches. It's real progress. But knowing when someone stopped talking is orders of magnitude easier than knowing what they meant while talking.
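The difference between the two approaches can be sketched in a few lines. This is a toy model, not AssemblyAI's implementation: the frame format, threshold value, and the `<eot>` token name are all invented for illustration. The contrast is that a silence threshold fires on any long pause, while a trained model can withhold its end-of-turn token until the turn is semantically complete.

```python
SILENCE_MS = 700  # a typical crude threshold; the exact value is illustrative

def silence_endpoint(frames):
    """Crude endpointing: declare end-of-turn after a fixed silence gap."""
    trailing = 0
    for f in frames:
        trailing = trailing + f["ms"] if f["silent"] else 0
        if trailing >= SILENCE_MS:
            return True
    return False

def token_endpoint(tokens):
    """Semantic endpointing: the model itself emits an end-of-turn token.
    The <eot> name is a hypothetical stand-in."""
    return bool(tokens) and tokens[-1] == "<eot>"

# A thoughtful mid-sentence pause fools the silence threshold...
pause = [{"silent": False, "ms": 300}, {"silent": True, "ms": 800}]
assert silence_endpoint(pause) is True
# ...while a trained model can keep the turn open until it is actually done.
assert token_endpoint(["i", "was", "thinking"]) is False
assert token_endpoint(["i", "was", "thinking", "<eot>"]) is True
```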

The TTS Herd and the Problem Nobody Is Solving

Look at where the investment went in 2024 and 2025. ElevenLabs v3, Inworld TTS-1, Fish Speech S2, StepFun EditX, Cartesia Sonic-3 — a cluster of major TTS releases, not spread across the year but compressed into months of each other. That is not coincidence. That is a herd.

TTS produces a demo. You play a clip, it sounds human, the investor writes a check. “We improved paralinguistic understanding of incoming speech” is not a pitch that plays audio.

VC incentives compound the effect. TTS revenue is demonstrable in a pitch deck. The commercial story for the input side requires explaining why it matters — a harder sell than just playing audio.

But this is not only herd behavior. The input side — robust streaming ASR, noise handling, and full semantic understanding — is genuinely harder in almost every dimension, and the field is only beginning to reckon with how hard.

Start with the engineering problem. Whisper, OpenAI's foundational ASR model and still the accuracy benchmark for much of the field, was designed for batch transcription of pre-recorded audio. Streaming audio — a live conversation — is structurally incompatible with this. WhisperFlow (MobiSys 2025) showed that retrofitting Whisper for real-time use on commodity ARM hardware requires a GPU-centric pipeline with CPU offload, separate thread pools for encoding and decoding, and a “hush word” mechanism to stop the model from hallucinating when audio cuts off.

And that is still only the transcription problem — getting words right in real time under real acoustic conditions. The semantic understanding layer above transcription is harder still. TTS has a natural training signal: does it sound right? ASR understanding has no clean equivalent. The labeling bottleneck is not an engineering obstacle. It is a philosophical one. The field cannot brute-force its way through it with more compute.

What the Voice AI Companies Have Learned

To understand where the real knowledge lives, look at how the leading voice AI companies have approached speech — and be honest about what their architectures actually are.

AssemblyAI

Universal-3 Pro (Feb 2026) — the first ASR model with LLM-style prompt control over transcription output. Universal-Streaming introduced semantic endpointing via a special end-of-turn token — the most practically advanced work on replacing silence-threshold turn detection. The paralinguistic gap remains: all of this is still fundamentally transcription infrastructure — knowing when and what, not what it means.

Cartesia

Sonic-3 (late 2025). Sub-90ms time-to-first-audio. Architecture is State Space Models (Mamba family) — not an incremental improvement but a paradigm-level choice. SSMs process audio in constant time per step with fixed memory. The SSM advantage is most significant for always-on, on-device processing. Currently TTS-only; no published work on paralinguistic input understanding.

ElevenLabs

v3 / v3-Alpha (2025–2026), their most expressive model yet — emotional tags with Transformer + diffusion architecture. Moat is product breadth and distribution: Dubbing Studio, Voice Isolator, Sound Effects, Reader app, and a Voice Library marketplace. Their generation-side understanding of paralinguistic signals is arguably the most battle-tested in the industry — but that knowledge has not been applied to input understanding.

OpenAI

Voice mode (GPT-4o-Audio) is the most widely deployed end-to-end voice AI system in the world. Architecture is not published. Observable behavior still shows misread sarcasm, missed discomfort, and poor pragmatic inference — consistent with Cascade Equivalence findings that suggest implicit transcription may still be happening internally.

Kyutai

Moshi — marketed as end-to-end and full-duplex, but the architecture uses an “Inner Monologue” mechanism that predicts time-aligned text tokens before audio tokens at every timestep. Text is still an internal intermediate, just not a separate module. The paralinguistic bottleneck still applies — just internalized.

Hume AI

Built the foundational emotion classification datasets and trained production-grade emotion recognition models. Their EVI (Empathic Voice Interface) is the most direct attempt by any commercial company to close the paralinguistic gap. The limitation: their framework still reduces to categorical emotion labels rather than a communicative state model.

StepFun

Step-Audio 2 (March 2026) is genuinely end-to-end: continuous audio input via latent space encoder, discrete audio output via audio RL. Step-Audio-EditX specializes in iterative paralinguistic editing — explicitly tagging exhale, snort, inhale, chuckle, giggle, throat-clearing — making StepFun the company doing the most explicit work on paralinguistic signal manipulation, even if still on the generation side.

Google DeepMind

Gemini multimodal stack processes audio natively as part of the model rather than a cascade. Has the largest proprietary training corpus, most compute, and deepest research bench. If brute-force scaling solves the paralinguistic gap, DeepMind is the most likely source. But observable voice products still exhibit the same failure modes: misread tone, poor pragmatic inference.

What all of these companies share, regardless of architecture, is hard-won knowledge about the paralinguistic layer. To generate convincing speech, you must model emotion, pacing, emphasis, breathing, and vocal texture. The TTS companies have solved this on the output side. Nobody has solved it on the input side.

The Missing Layer: Audio-Grounded Intent

What doesn't exist and should: an intermediate representation between raw audio and text that preserves the signals humans actually use to understand each other. This layer would capture confidence trajectories, emotional dynamics, turn-taking signals, cultural communication patterns, and cognitive load indicators.

Confidence trajectories

How certainty shifts within a single utterance

Emotional dynamics

State changes during conversation, not just point-in-time labels

Turn-taking signals

Pause as invitation vs. pause as thought

Cognitive load indicators

Speech rate, filler frequency, self-corrections, breathing patterns

The Industry Thinks the Gap Is Emotion. It's Not.

Emotion classification is the paralinguistic equivalent of keyword spotting. Labeling an utterance “happy” or “angry” tells you almost nothing about what happens next. Best specialized SER models achieve 53–70% accuracy; even 100% wouldn't solve the problem, because knowing someone sounds angry doesn't tell you why, what they'll do, or what they need right now.

The real gap is pragmatic inference: what the speaker believes, what they are trying to accomplish, what they almost said but didn't. No emotion classifier gets you there.

What's needed is a communicative state model: epistemic state (what the speaker believes right now, how certainty shifts), pragmatic intent (what they're trying to accomplish by saying this in this way), the unsaid (the pause before an answer, the breath suggesting they almost said something else), and model of the listener (are they adjusting because they think I don't understand?).

The Hard Epistemological Problem

Humans often don't understand their own intent. A customer in a sales conversation doesn't always know whether they're going to buy. This isn't a data problem — it's philosophical. Intent is often constructed after the fact, rationalized in retrospect. This is the deep reason SER plateaus at 50–70% accuracy. The labels are inherently noisy because the thing being labeled isn't objectively accessible.

Audio alone is fundamentally insufficient for intent labeling. A 2024 Nature Communications study found that situational knowledge alone was often sufficient for emotion inference, while isolated facial cues were not. If faces are insufficient without context, audio-only clips are even more impoverished.

The insight

What actually works: interpretive models useful without being “true.” AI should generate multiple plausible models of a speaker's communicative state, scored by how well each predicts what actually happens next — not a single label, but ranked working hypotheses. This is how skilled human communicators operate.

The Technical Path Forward

Outcome-grounded learning

Train on observable outcomes (customer renews, deal closes, student passes) rather than subjective emotion labels

Multi-hypothesis modeling

Mixture-of-experts where output is a weighted distribution across interpretive frameworks

Integration architecture

A parallel processing path alongside ASR, producing communicative-state-enriched context passed to the LLM

Intervention-based evaluation

Did the system's model lead to a better response? Not classification accuracy
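The first of these points has a simple quantitative shape. The sketch below uses log loss as a stand-in for any proper scoring rule, with invented numbers: two readings of the same call are judged not against a subjective emotion label but against the observable outcome (did the customer renew?).

```python
import math

# Sketch of outcome-grounded scoring: grade an interpretation by the
# observable outcome it predicted, not by agreement with an emotion label.

def outcome_log_loss(predicted_p, outcome):
    """Negative log-likelihood of the observed binary outcome."""
    p = predicted_p if outcome else 1.0 - predicted_p
    return -math.log(max(p, 1e-9))

# Two candidate communicative-state readings of the same call:
model_a_p_renew = 0.9   # read the customer as satisfied
model_b_p_renew = 0.2   # read the customer as quietly frustrated

# The customer did not renew: model B's reading was the better one.
renewed = False
assert outcome_log_loss(model_b_p_renew, renewed) < \
       outcome_log_loss(model_a_p_renew, renewed)
```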

Why This Matters

Sales

Surfaces “the buyer's confidence dropped when you mentioned timeline”

Healthcare

Contextualizes vocal biomarkers within conversational dynamics

Customer experience

Identifies when frustration shifted to resignation — the moment you've lost them

Education

Detects confusion from prosodic signals before a student articulates it

We've built AI that can hear. We've built AI that can think. We've built AI that can speak. What we haven't built is AI that can listen.

About the author

Carol Zhu

CEO and Founder of DecodeNetwork.AI, building at the intersection of AI, psychology, and human connection. Previously launched products at AWS AI (SageMaker), TikTok Shop (founding PM), and Credit Karma — building systems that predict what users want, and thinking hard about what that means.

Sources & References

  1. "The Cascade Equivalence Hypothesis" (February 2025). Behavioral comparison of open-source speech LLMs and ASR→LLM pipelines.
  2. LISTEN Benchmark (October 2025). Paralinguistic understanding evaluation for speech models.
  3. EMIS Benchmark. Acoustic emotion prediction vs. word-level label reliance in speech models.
  4. WhisperFlow (MobiSys 2025). Real-time Whisper inference on commodity ARM hardware.
  5. Bloomberg/Interspeech 2025. Two-pass Whisper streaming framework with CTC draft + attention rescoring.
  6. Nature Communications (2024). Situational knowledge vs. isolated cues in emotion inference.
  7. AssemblyAI Universal-3 Pro (February 2026). LLM-style prompt control for ASR.
  8. Fish Audio Fish Speech S2 (March 2026). Open-source TTS with emotion control.
  9. StepFun Step-Audio 2 (March 2026). End-to-end voice AI with latent space encoder.
  10. Kyutai Moshi (September 2024). Full-duplex voice with Inner Monologue architecture.
