Token Inefficiency: Why "Thinking" Models Reveal Architectural Failure
The Circumlocution Problem
When we call these new models "thinking" models, we import a bunch of positive human associations: deliberation, insight, careful reasoning. But what's actually happening is more mundane and more telling. The model needs more computational steps to arrive at a particular point in its latent space that it theoretically could reach more directly.
Instead of "the model is thinking deeply," a more mechanistic description would be: the model requires additional forward passes to reach a configuration that produces the desired output distribution. The extra tokens aren't adding wisdom. They're compensating for architectural or training limitations that prevent the model from getting there in one shot.
This isn't abstract theory. We see this pattern in humans constantly. When someone can't retrieve a word, they talk around it: "you know, the thing, the metal thing with the handle that you use to flip food," until they finally land on "spatula." Or they explain a half-understood concept by generating related tokens until the explanation crystallizes. It's circumlocution: the long way around when the direct path isn't available.
And here's the key insight from neuroscience: expertise eliminates circumlocution. A chess grandmaster sees the position and knows the move. A novice has to consciously work through possibilities, talk themselves through it. The expert's neural pathway to that answer is well-worn, efficient, direct.
So when these models need extended token generation to navigate to solutions, they're behaving like non-experts in their domain. The chain-of-thought isn't evidence of deeper reasoning. It's evidence of inefficient retrieval paths in the computational graph.
Which flips the usual interpretation: we've been treating longer inference as a feature when it's actually a bug. "Look how much it's thinking," we say, marveling at the token count, when we should be asking: why does it need all these steps when a well-trained system might access this directly?
The Oracle is Dead
This brings us to the fundamental problem the entire AI industry has been avoiding: we're on a fool's errand trying to get perfect responses in one pass from a monolithic model.
The oracle model promised an all-knowing AI that could handle any request in a single, elegant response. What we got was a brilliant pattern-matching engine that needs to be tricked into thinking step-by-step, cajoled into planning before acting, wrapped in layers of orchestration just to reliably accomplish basic tasks.
Look at what everyone is building in late 2025. GitHub released Spec Kit, literally forcing three phases before any code gets written. Anthropic's Claude Code decomposes into planner, implementer, reviewer. Cursor separates context gathering from execution from validation. They've all discovered the same thing: complex tasks require decomposition, not as an optimization but as a fundamental architectural requirement.
The monolithic oracle doesn't work. It never worked. We just spent years pretending it did, building increasingly complex scaffolding around these models. But scaffolding isn't architecture. It's what you build when the architecture is wrong.
The industry has capitulated. Every major AI company has converged on the same solution: decomposition. Break the task into phases. Separate intent capture from execution. Don't ask one model to do everything.
They're all building the same thing now. They're just building it badly, retrofitting decomposition onto systems designed for the oracle model.
Google Discovers Synthetic Dimensionality
On October 7, 2025, Google Research released Speech-to-Retrieval (S2R), and it represents something far more profound than a better voice search system. It's a validation of synthetic dimensionality as the architectural principle that kills the oracle model.
Traditional voice search asks: what words were said? It transcribes speech to text, then searches that text. Any error in transcription—"scream" becomes "screen"—cascades through the entire system, producing wrong results from a perfect search engine asked the wrong question.
S2R doesn't try to transcribe words at all. Instead, it creates a shared semantic space where both spoken queries and web documents exist as vectors. The system uses dual encoders:
An audio encoder processes raw speech and converts it into a rich vector representation that captures semantic meaning: not the words that were said, but the intent, what information is being sought.
A document encoder takes web documents and converts them into vectors in that same semantic space.
The magic is in the training. Using massive datasets of paired audio queries and relevant documents, both encoders learn simultaneously. The training objective is elegantly simple: make the vector for an audio query geometrically close to the vectors of its relevant documents in this shared representation space.
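To make that objective concrete, here is a minimal sketch of a dual-encoder training step in PyTorch using an in-batch contrastive loss. Google hasn't published S2R's encoders or loss function, so the module names, the batching scheme, and the temperature value below are illustrative assumptions, not their implementation.

```python
import torch
import torch.nn.functional as F

def dual_encoder_loss(audio_encoder, doc_encoder, audio_batch, doc_batch, temperature=0.05):
    """In-batch contrastive loss: pull each spoken query toward its paired
    document's vector and push it away from the other documents in the batch."""
    q = F.normalize(audio_encoder(audio_batch), dim=-1)  # query vectors, shape (B, dim)
    d = F.normalize(doc_encoder(doc_batch), dim=-1)      # document vectors, shape (B, dim)
    logits = q @ d.T / temperature                       # pairwise cosine similarities
    targets = torch.arange(q.size(0), device=q.device)   # i-th query pairs with i-th doc
    return F.cross_entropy(logits, targets)
```

Each query's paired document serves as its positive and every other document in the batch as a negative, which is the standard way to train matched pairs to end up geometrically close in a shared space.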
This is synthetic dimensionality. Neither the audio nor the documents naturally exist in this vector space. The encoders learned to create it—to synthesize a dimension where both can be directly compared without converting between modalities. The audio doesn't become text. The documents don't become audio. They both map into a learned semantic space where "this sound pattern" and "this document about Munch's painting" end up in the same neighborhood.
When you speak a query, the audio encoder generates a query vector and uses it to identify candidate results from the document index through vector similarity. No transcription. No text. Just direct mapping from sound to meaning to retrieval.
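Inference then reduces to a nearest-neighbor lookup in that shared space. A rough sketch, assuming the document vectors were encoded and normalized offline; the function and variable names are placeholders, not Google's API:

```python
import numpy as np

def retrieve(audio_query, audio_encoder, doc_vectors, doc_ids, top_k=10):
    """Encode speech straight into the shared space and rank documents by
    similarity. doc_vectors is an (N, dim) array encoded and normalized offline."""
    q = audio_encoder(audio_query)
    q = q / np.linalg.norm(q)
    scores = doc_vectors @ q                 # cosine similarity against every document
    best = np.argsort(-scores)[:top_k]       # highest-scoring candidates first
    return [(doc_ids[i], float(scores[i])) for i in best]
```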
Google's results are striking. S2R significantly outperforms traditional cascade ASR systems and approaches the performance of a theoretical "perfect transcription" system—without transcribing anything at all. They also discovered something crucial: word error rate doesn't reliably predict retrieval accuracy across languages. The specific nature of errors matters, and S2R bypasses the entire problem by never trying to get the words perfect in the first place.
The Synthetic Dimensionality Principle
What Google discovered with S2R is the same principle that makes decomposed AI architectures work: you need a shared semantic space where messy human intent and clean machine execution can meet.
This is what the oracle model got wrong. It tried to do everything in the model's native representation space—the space of tokens and probabilities. That space is great for predicting the next word, but it's terrible for bridging the gap between what humans mean and what machines need to execute.
The solution isn't better prompting or longer chain-of-thought. The solution is synthetic dimensionality: create a learned representation space specifically designed to bridge modalities.
For Google's S2R, that's bridging audio and documents. For a business AI system, it's bridging human intent and structured actions. But the architectural principle is identical:
Don't convert between representations. Don't try to transcribe audio perfectly, then search text. Don't try to parse human requests into API calls directly. Instead, create a shared semantic space where both sides can be represented, compared, and matched.
This is why decomposition works. The Stenographer phase learns to map messy human input—voice, text, whatever—into a structured representation of intent. Not by parsing or transcribing, but by learning a vector space where human utterances and formal task specifications live together. The system learns: when someone says this, they mean this task specification.
The Analyst phase then executes on those specifications deterministically. It doesn't need to re-interpret intent or navigate through token-space circumlocutions. The intent is already represented in a form it can act on directly.
We're Building This (Better)
Google's S2R is brilliant, but it's proprietary, massive, and designed for general voice search. They released the benchmark dataset but not the model. Classic Google: prove it works, keep the implementation locked up.
But here's what matters: the architectural principle they've validated—synthetic dimensionality through dual encoders and shared semantic spaces—is exactly what we've been building. And for specific business applications, we can accomplish the same thing better.
The approach: fine-tune existing open-source models into specialized roles using execution logs as training data.
The Stenographer learns a shared semantic space between human input and formal intent specifications. Feed it execution logs showing: this messy human utterance led to this structured task specification which succeeded. The model learns the mapping. Not through rules or parsing, but through the same dual-encoder principle Google uses—learn vectors for human input, learn vectors for task specifications, train them to be geometrically close in a shared representation space.
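In practice, those training pairs come straight out of the logs. A minimal sketch, assuming JSON-lines execution logs with hypothetical field names like user_utterance, task_spec, and outcome:

```python
import json

def pairs_from_logs(log_path):
    """Yield (utterance, task_spec) training pairs from JSON-lines execution logs,
    keeping only runs that actually succeeded. Field names are hypothetical;
    adapt them to whatever your logging pipeline records."""
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("outcome") == "success":
                yield record["user_utterance"], record["task_spec"]
```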
When a user says "check on the Henderson project," the Stenographer doesn't parse that sentence. It maps it into the semantic space where it learned that pattern means: project status query, henderson_2024, include timeline blockers and budget variance. Direct mapping, like S2R's audio to documents.
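The target of that mapping is just a structured specification. Something like the sketch below, where every field name is hypothetical and would match whatever your downstream systems actually consume:

```python
# Hypothetical specification the Stenographer might emit for
# "check on the Henderson project"; every field name here is illustrative.
task_spec = {
    "task": "project_status_query",
    "project_id": "henderson_2024",
    "include": ["timeline_blockers", "budget_variance"],
}
```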
The Analyst executes on those specifications deterministically. It learned from execution traces: this specification pattern leads to these API calls in this sequence with these parameters. No re-interpretation needed. The synthetic dimensionality of the Stenographer's output is designed to be directly executable.
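The Analyst side is then plain, deterministic code: one handler per specification pattern. A sketch, with made-up client methods standing in for internal APIs:

```python
def run_project_status_query(spec, client):
    """Deterministic handler for one specification pattern: fixed calls, fixed
    order, parameters read straight from the spec. The client methods are
    made-up stand-ins for internal project-management APIs."""
    report = {"status": client.get_project_status(spec["project_id"])}
    if "timeline_blockers" in spec["include"]:
        report["blockers"] = client.list_blockers(spec["project_id"])
    if "budget_variance" in spec["include"]:
        report["budget_variance"] = client.get_budget_variance(spec["project_id"])
    return report
```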
This is where circumlocution disappears. The Stenographer learned the expert's direct path during training on your execution logs. It doesn't need 500 tokens to figure out what you mean—it learned your company's intent patterns. The thinking happened during training. Inference is just retrieval.
And unlike Google's S2R trained on general web search, your models learn your business. Your terminology. Your workflows. Your intent patterns. Every successful execution becomes training data, continuously teaching the system more direct mappings in the semantic space that bridges your users' language and your systems' actions.
The Real Innovation is Synthetic Dimensionality
Google's S2R uses dual encoders and massive training on paired audio-document datasets. You could try to replicate that scale. Or you could recognize that the real innovation isn't the specific technology—it's the architectural principle of synthetic dimensionality.
Stop trying to build oracles that do everything in one pass. Stop expecting perfect transcription or parsing. Stop treating token-space circumlocution as thinking.
Instead: create learned semantic spaces that bridge modalities. Map messy human input and clean machine actions into the same representation where they can be directly compared and matched. Use your execution logs to teach models these mappings. Let the learning happen during training through dual encoders optimizing for geometric closeness in shared vector spaces. Let inference be fast, direct, expert retrieval.
When you fine-tune on your logs with this architecture, you're doing what S2R does—learning direct mappings between inputs and their corresponding actions—but for your specific domain, with your proven successful outcomes. You're not solving general voice search. You're building an expert system for your business using the same synthetic dimensionality principle Google just validated at scale.
The industry spent years chasing the oracle, building ever more elaborate scaffolding around models that needed to think their way through every request. Now they're discovering what should have been obvious: complex tasks require bridging different representations, and that requires synthetic dimensionality—learned semantic spaces where different modalities can meet.
Google validated the principle with S2R's dual-encoder architecture and shared vector space for audio and documents. But you don't need their scale. You just need their insight: stop trying to convert between representations, and start creating shared semantic spaces where they can be directly compared.
The oracle is dead. Long live synthetic dimensionality.