Calling Chain-of-Thought tokens "reasoning traces" or "thoughts" isn't a harmless metaphor — it actively distorts how we build, evaluate, and trust AI systems.
Calling intermediate token generation (ITG) "thinking" reflects our desire for LLMs to reason like humans, not evidence that they do. The anthropomorphic framing is aspirational, not descriptive, and it conflates statistical token prediction with structured cognition.
A growing body of work shows that models improve even with semantically incorrect or structurally meaningless traces — directly undermining the claim that intermediate tokens encode meaningful reasoning steps.
Users and researchers treat visible intermediate tokens as a window into model "thinking." This creates unwarranted trust — models can produce plausible-sounding traces causally unrelated to their final answers.
The anthropomorphic framing has spawned research aimed at improving the human-interpretability of traces as a proxy for model capability, a goal that may be both misguided and empirically counterproductive.
The paper collates an extensive body of emerging empirical work that collectively challenges the interpretation of intermediate tokens as human-like reasoning traces.
Multiple studies show that LLMs fine-tuned on incorrect or meaningless intermediate tokens paired with correct final answers still improve on downstream tasks. If traces were "doing the reasoning," this result should be impossible.
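As a concrete illustration, here is a minimal sketch of how such a corrupted-trace fine-tuning setup can be constructed, assuming a generic prompt/completion SFT format. All names and the answer delimiter are illustrative, not the surveyed papers' actual pipelines:

```python
import random

def corrupt_trace(trace: str, seed: int = 0) -> str:
    """Destroy the trace's semantic content while preserving its length."""
    tokens = trace.split()
    random.Random(seed).shuffle(tokens)  # structurally meaningless order
    return " ".join(tokens)

def build_sft_example(question: str, trace: str, answer: str) -> dict:
    # Pair a *corrupted* trace with the *correct* final answer. If the
    # trace tokens were "doing the reasoning", fine-tuning on such pairs
    # should not help; yet the surveyed studies report that it does.
    return {
        "prompt": question,
        "completion": corrupt_trace(trace) + "\nFinal answer: " + answer,
    }
```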
Traces are filtered by whether they lead to correct final answers — not by their own correctness. No direct optimization pressure is applied to intermediate tokens themselves during RL post-training, so their semantic content is largely uncontrolled.
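A minimal sketch of what this outcome-only filter looks like, with hypothetical names (real RLVR/GRPO reward code differs in detail). The essential property is that the trace argument never influences the score:

```python
def outcome_reward(trace: str, final_answer: str, gold_answer: str) -> float:
    """Reward depends only on the final answer; the trace is unused."""
    # No term here inspects `trace`: intermediate tokens survive training
    # only insofar as they happen to precede correct answers, so their
    # semantic content is never directly optimized.
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
```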
The "adaptive computation" narrative — that longer traces signal harder problems — is not reliably supported empirically. Trace length is not a trustworthy signal of the model's internal effort or the difficulty of the task.
The most performance-effective traces (e.g., the verbose outputs of DeepSeek R1) are rated least interpretable by human users, while verifiably correct, human-readable traces perform worst as SFT training data. Interpretability and utility as a training signal are empirically decoupled.
OpenAI deliberately hides o1's intermediate tokens from users, providing only a sanitized post-hoc summary. This is an implicit acknowledgment by a leading lab that raw intermediate tokens are not interpretable — directly contradicting the anthropomorphic framing.
For humans, post-hoc verbal explanations are known to be rationalizations, not faithful accounts of mental processes. The same skepticism applies to LLM "explanations" of their own intermediate token production, which are generated by the same model after the fact.
The paper proposes "derivational trace" as a neutral, non-anthropomorphic term for intermediate tokens — whether generated by humans, formal solvers, or LLMs. Key terminology shifts:
| Avoid ✗ | Prefer ✓ | Why it matters |
|---|---|---|
| Reasoning trace | Derivational trace / Intermediate tokens | "Reasoning" implies structured, interpretable problem-solving steps the evidence does not support. |
| Thinking trace / Thoughts | Intermediate token sequence | Anthropomorphizes a statistical sampling process as cognition; conflates token generation with human experience. |
| Chain-of-Thought (as reasoning) | Intermediate token generation (ITG) | The mechanism is token prediction — not chained logical inference. Neutral framing preserves the empirical facts. |
| Model is "thinking" | Model is generating intermediate tokens | Precision here prevents false confidence about model reliability and interpretability among users and policymakers. |
Use "intermediate tokens" or "derivational traces" in place of "reasoning traces" or "thoughts." Framing shapes research intuitions — neutral terms prevent anthropomorphic assumptions from silently driving methodology.
Optimize intermediate tokens as a training signal for task performance. Separately develop post-hoc explanation modules for user-facing interpretability. These are distinct objectives that should not be conflated.
When papers claim traces provide insight into model behavior, demand empirical evidence. Plausibility of trace content to human readers is not the same as causal fidelity to the model's internal computation.
Design benchmarks that independently measure trace quality and answer correctness. Current evaluations conflate these, masking the empirical independence — and frequent disconnect — between trace validity and final answer accuracy.
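A sketch of how such a benchmark could report the two scores separately, assuming a hypothetical external `step_checker` verifier (e.g. a formal solver); none of these names are an existing API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalResult:
    answer_correct: bool  # final answer matches the gold label
    trace_valid: bool     # every derivation step passes an external check

def evaluate(gold_answer: str,
             final_answer: str,
             trace_steps: List[str],
             step_checker: Callable[[str], bool]) -> EvalResult:
    """Score answer correctness and trace validity independently, so a
    correct answer reached via an invalid trace (or vice versa) is
    visible in the results rather than masked by a single score."""
    return EvalResult(
        answer_correct=(final_answer.strip() == gold_answer.strip()),
        trace_valid=all(step_checker(step) for step in trace_steps),
    )
```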
Anthropomorphic milestones like "emergent reasoning" and "Aha moments" are compelling stories, but extraordinary claims require extraordinary evidence. Treat such narratives with appropriate critical skepticism.
Resources spent improving human-interpretability of raw intermediate tokens may be better invested in robust external verifiers, sound formal solvers, and genuine post-hoc explanation systems with formal guarantees.
@inproceedings{kambhampati2026anthropomorphizing,
title = {Position: Stop Anthropomorphizing Intermediate
Tokens as Reasoning/Thinking Traces!},
author = {Kambhampati, Subbarao and Valmeekam, Karthik
and Bhambri, Siddhant and Palod, Vardhan
and Saldyt, Lucas and Stechly, Kaya
and Samineni, Soumya Rani and Kalwar, Durgesh
and Biswas, Upasana},
booktitle = {Proceedings of the 43rd International Conference
on Machine Learning (ICML) -- Position Papers},
year = {2026},
url = {https://arxiv.org/abs/2504.09762}
}