ICML 2026  ·  Position Paper Track

Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

Calling Chain-of-Thought tokens "reasoning traces" or "thoughts" isn't a harmless metaphor — it actively distorts how we build, evaluate, and trust AI systems.

Subbarao Kambhampati
Karthik Valmeekam
Siddhant Bhambri
Vardhan Palod
Lucas Saldyt
Kaya Stechly§
Soumya Rani Samineni
Durgesh Kalwar
Upasana Biswas

Corresponding author: rao@asu.edu
Work done at ASU; currently at Amazon AGI
Work done at ASU; currently at Samsung Research America
§ Work done at ASU; currently at Yale University

Intermediate token generation (ITG) — where a model produces a sequence of tokens before emitting its final solution — has become standard in LLM reasoning. These tokens are commonly called "reasoning traces" or "thoughts," implicitly anthropomorphizing the model and implying human-like, interpretable reasoning steps. We present evidence that this anthropomorphization is not a harmless metaphor: it confuses the nature of these models, leads to questionable research, and creates false confidence about model capabilities. We call on the community to stop anthropomorphizing intermediate tokens and to adopt precise, neutral terminology.

The Position

Four Reasons Anthropomorphization is Dangerous

"Anthropomorphizing intermediate tokens as reasoning/thinking traces is (1) wishful, (2) has little concrete supporting evidence, (3) engenders false confidence, and (4) may be pushing the community into fruitless research directions."
— Kambhampati et al., ICML 2026
01

Wishful

Calling ITG "thinking" reflects our desire for LLMs to reason like humans — not evidence they do. The anthropomorphic framing is aspirational, not descriptive, and conflates statistical token prediction with structured cognition.

02

Little Supporting Evidence

A growing body of work shows that models improve even with semantically incorrect or structurally meaningless traces — directly undermining the claim that intermediate tokens encode meaningful reasoning steps.

03

Engenders False Confidence

Users and researchers treat visible intermediate tokens as a window into model "thinking." This creates unwarranted trust — models can produce plausible-sounding traces causally unrelated to their final answers.

04

Fruitless Research Directions

The anthropomorphic framing has spawned research aimed at improving trace human-interpretability as a proxy for model capability — a goal that may be both misguided and empirically counterproductive.


Evidence

What the Literature Actually Shows

The paper collates an extensive body of emerging empirical work that collectively challenges the interpretation of intermediate tokens as human-like reasoning traces.

🔬
Empirical

Models improve with semantically incorrect traces

Multiple studies show that LLMs fine-tuned on incorrect or meaningless intermediate tokens paired with correct final answers still improve on downstream tasks. If traces were "doing the reasoning," this result should be impossible.

🧩
Semantic

Intermediate tokens have no formal semantic grounding

Traces are filtered by whether they lead to correct final answers — not by their own correctness. No direct optimization pressure is applied to intermediate tokens themselves during RL post-training, so their semantic content is largely uncontrolled.
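The selection dynamic described above can be sketched in a few lines. This is an illustrative sketch, not code from the paper: `outcome_filter` and the toy samples are hypothetical names, and the point is only that the filter never inspects trace content.

```python
# Minimal sketch of outcome-based trace filtering: sampled
# (trace, answer) pairs are kept based solely on final-answer
# correctness; the intermediate tokens themselves are never checked.
# All names here are illustrative, not from the paper.

def outcome_filter(samples, gold_answer):
    """Keep every sampled generation whose final answer matches gold.

    `samples` is a list of (intermediate_tokens, final_answer) pairs.
    Because intermediate_tokens is never inspected, a semantically
    wrong or meaningless trace survives as long as the answer is right.
    """
    return [(trace, ans) for trace, ans in samples if ans == gold_answer]

samples = [
    ("2+2=5, so twice that is 8", "8"),  # wrong trace, right answer -> kept
    ("4+4=8", "8"),                      # right trace, right answer -> kept
    ("4+4=8", "9"),                      # right trace, wrong answer -> dropped
]
kept = outcome_filter(samples, gold_answer="8")
```

Under this selection pressure, nothing distinguishes the first sample from the second, which is why the semantic content of surviving traces is largely uncontrolled.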

📏
Structural

Trace length does not reliably track problem complexity

The "adaptive computation" narrative — that longer traces signal harder problems — is not reliably supported empirically. Trace length is not a trustworthy signal of the model's internal effort or the difficulty of the task.

🪟
Semantic

Interpretable traces ≠ better model performance

The most performance-effective traces (verbose DeepSeek R1 outputs) are rated least interpretable by human users. Verifiably correct, human-readable traces perform worst in SFT. Interpretability and utility as a training signal are empirically decoupled.

🎭
Framing

OpenAI hides o1's own "thinking"

OpenAI deliberately hides o1's intermediate tokens from users, providing only a sanitized post-hoc summary. This is an implicit acknowledgment by a leading lab that raw intermediate tokens are not interpretable — directly contradicting the anthropomorphic framing.

🔁
Framing

Post-hoc rationalization ≠ faithful reasoning trace

For humans, post-facto verbal explanations are known to be rationalizations — not faithful accounts of mental processes. The same skepticism applies to LLM "explanations" of their own intermediate token production, generated by the same model after the fact.


Terminology

A Vocabulary for Precision

The paper proposes "derivational trace" as a neutral, non-anthropomorphic term for intermediate tokens — whether generated by humans, formal solvers, or LLMs. Key terminology shifts:

| Avoid ✗ | Prefer ✓ | Why it matters |
| --- | --- | --- |
| Reasoning trace | Derivational trace / intermediate tokens | "Reasoning" implies structured, interpretable problem-solving steps that the evidence does not support. |
| Thinking trace / thoughts | Intermediate token sequence | Anthropomorphizes a statistical sampling process as cognition; conflates token generation with human experience. |
| Chain-of-Thought (as reasoning) | Intermediate token generation (ITG) | The mechanism is token prediction, not chained logical inference; neutral framing preserves the empirical facts. |
| The model is "thinking" | The model is generating intermediate tokens | Precision here prevents false confidence about model reliability and interpretability among users and policymakers. |

Call to Action

What the Community Should Do

1

Adopt neutral terminology

Use "intermediate tokens" or "derivational traces" in place of "reasoning traces" or "thoughts." Framing shapes research intuitions — neutral terms prevent anthropomorphic assumptions from silently driving methodology.

2

Decouple training signal from interpretability

Optimize intermediate tokens as a training signal for task performance. Separately develop post-hoc explanation modules for user-facing interpretability. These are distinct objectives that should not be conflated.

3

Scrutinize interpretability claims

When papers claim traces provide insight into model behavior, demand empirical evidence. Plausibility of trace content to human readers is not the same as causal fidelity to the model's internal computation.

4

Evaluate traces and answers independently

Design benchmarks that independently measure trace quality and answer correctness. Current evaluations conflate these, masking the empirical independence — and frequent disconnect — between trace validity and final answer accuracy.
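One way to sketch the decoupled evaluation called for above: score trace validity and answer correctness as separate axes and report their joint distribution rather than a single accuracy number. The function and the `trace_is_valid` checker below are hypothetical stand-ins (e.g. for a formal verifier), not an API from the paper.

```python
# Hedged sketch: evaluate traces and answers independently and
# tabulate the 2x2 grid (trace valid?, answer correct?). A single
# accuracy score would collapse this grid and hide the disconnect.
from collections import Counter

def decoupled_eval(records, trace_is_valid, answer_is_correct):
    """Return counts over the 2x2 grid (trace valid?, answer correct?)."""
    grid = Counter()
    for trace, answer, gold in records:
        grid[(trace_is_valid(trace), answer_is_correct(answer, gold))] += 1
    return grid

records = [
    ("step1 ok; step2 ok", "42", "42"),
    ("garbled nonsense",   "42", "42"),  # invalid trace, correct answer
    ("step1 ok; step2 ok", "41", "42"),  # valid trace, wrong answer
]
grid = decoupled_eval(
    records,
    trace_is_valid=lambda t: "nonsense" not in t,   # stand-in verifier
    answer_is_correct=lambda a, g: a == g,
)
# grid[(False, True)] counts cases where a bad trace still yields a
# correct answer: exactly the disconnect a single score would mask.
```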

5

Challenge "Aha moment" narratives

Anthropomorphic milestones like "emergent reasoning" and "Aha moments" are compelling stories, but extraordinary claims require extraordinary evidence. Treat such narratives with appropriate critical skepticism.

6

Redirect research investment

Resources spent improving human-interpretability of raw intermediate tokens may be better invested in robust external verifiers, sound formal solvers, and genuine post-hoc explanation systems with formal guarantees.


Reference

Cite This Work

BibTeX
@inproceedings{kambhampati2026anthropomorphizing,
  title     = {Position: Stop Anthropomorphizing Intermediate
               Tokens as Reasoning/Thinking Traces!},
  author    = {Kambhampati, Subbarao and Valmeekam, Karthik
               and Bhambri, Siddhant and Palod, Vardhan
               and Saldyt, Lucas and Stechly, Kaya
               and Samineni, Soumya Rani and Kalwar, Durgesh
               and Biswas, Upasana},
  booktitle = {Proceedings of the 43rd International Conference
               on Machine Learning (ICML) -- Position Papers},
  year      = {2026},
  url       = {https://arxiv.org/abs/2504.09762}
}