ACL 2026  ·  Main Track

Interpretable Traces, Unexpected Outcomes:
Investigating the Disconnect in
Trace-Based Knowledge Distillation

When correct reasoning traces don't guarantee correct answers — and vice versa.

Siddhant Bhambri Samsung Research America
Upasana Biswas Arizona State University
Subbarao Kambhampati Arizona State University

We challenge a core assumption in LLM training: that CoT reasoning traces must be semantically correct and interpretable to improve task performance. Using a rule-based problem decomposition framework on Open-Book QA, we show that correct traces led to correct answers only 28% of the time, and that the most performance-effective traces (DeepSeek R1) were rated as the least interpretable by 100 human participants — calling for a fundamental rethink of how we design and evaluate reasoning traces.

Key Findings

Two Surprising Disconnects

01
28%
Trace correctness ≠ answer correctness.
Correct CoT traces led to correct final solutions only 28% of the time, and incorrect traces did not consistently degrade performance, confirming near-independence (χ² below the 3.841 critical value at α = 0.05).
02
R1: 3.39 vs 4.86
Best performance = worst interpretability.
R1 traces achieved top accuracy but scored 3.39 / 5 on interpretability and 4.59 / 5 on cognitive load. Verifiably-correct traces were most interpretable (4.86) but performed worst in SFT.
Methodology

Experimental Design

We design a controlled distillation framework where both intermediate traces and final answers can be independently evaluated, removing the confound that plagues standard CoT evaluation.

1. 📖 Open-Book QA (CoTemp · MARCO · bAbI)
2. ✂️ Rule-based decomposition (classification + information-retrieval steps)
3. 📊 Build SFT datasets (✓ correct / ✗ incorrect traces)
4. 🤖 Fine-tune SLMs (Llama · Qwen)
5. 👥 Human study (100 participants on Prolific)
DATASETS
  • 📎 CoTemp QA  4,748 samples
  • 📎 MARCO QA  6,000 samples
  • 📎 bAbI QA  4,149 samples
MODELS
  • 🦙 Llama-3.2-1B-Instruct
  • ⚡ Qwen3-1.7B
  • 🦙 Llama-3.1-8B / Qwen3-8B
TRAINING
  • A100 GPU (80 GB)
  • QLoRA (rank 16, α 32)
  • 3 epochs · LR 2e-4
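The contrast between the ✓ correct and ✗ incorrect SFT datasets can be sketched as follows. This is a minimal illustration, not the authors' actual schema: the function, field names, and example question are all hypothetical. The point is that the two training variants share the same structural format and the same correct final answer, and differ only in trace semantics.

```python
def make_sft_example(question, context, trace_steps, final_answer):
    """Pack an open-book QA item into a single supervised target:
    the model is trained to emit the trace, then the final answer."""
    prompt = f"Context: {context}\nQuestion: {question}\n"
    target = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(trace_steps))
    target += f"\nFinal answer: {final_answer}"
    return {"prompt": prompt, "completion": target}

# ✓ variant: semantically correct trace, correct answer
correct = make_sft_example(
    "Where was the book published?",
    "The book was published in Boston in 1998.",
    ["Locate the publication sentence.", "Extract the city: Boston."],
    "Boston",
)

# ✗ variant: semantically wrong trace, same structure, same correct answer
incorrect = make_sft_example(
    "Where was the book published?",
    "The book was published in Boston in 1998.",
    ["Locate the publication sentence.", "Extract the city: Paris."],
    "Boston",
)
```

Holding the prompt, step format, and answer constant while corrupting only the trace content is what lets trace correctness be varied independently of answer correctness.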
Results

Final Solution Accuracy (%)

CoTemp QA
Model                   SFT Vanilla   w/ Correct Traces   w/ Incorrect Traces
Qwen3-1.7B                 60.3%            52.9%               63.9%
Llama-3.2-1B-Instruct      44.7%            39.6%               45.6%

↑ Incorrect traces outperform correct traces, even achieving the highest accuracy for Qwen.

MARCO QA
Model                   SFT Vanilla   w/ Correct Traces   w/ Incorrect Traces
Qwen3-1.7B                  3.4%            26.3%               20.3%
Llama-3.2-1B-Instruct      33.4%            33.7%               28.9%

↑ On MARCO, correct traces lead, but only marginally over incorrect traces for Llama.

bAbI QA
Model                   SFT Vanilla   w/ Correct Traces   w/ Incorrect Traces
Qwen3-1.7B                 97.9%            94.4%               95.2%
Llama-3.2-1B-Instruct      96.5%            94.4%               86.2%

↑ On bAbI, SFT Vanilla wins: traces add little, and incorrect traces degrade Llama performance.

Human Interpretability Study

100 participants (Prolific, US) rated four trace types on a 5-point Likert scale across interpretability and cognitive load dimensions.

"Fine-tuning on verbose R1 traces produced the best model performance but these traces were rated as least interpretable by users — scoring 3.39 on interpretability and 4.59 on cognitive load. Correct traces were rated most interpretable (4.86) but performed worst in SFT."
📊 Interpretability Score (↑ better)
Trace type         Mean / 5
R1 Traces             3.31
R1 Summaries          4.53
R1 Explanations       4.29
Correct Traces        4.86
All p-values significant (Mann-Whitney U, Bonferroni-corrected)
🧠 Cognitive Load: Mental Demand (↓ better)
Trace type         Mean / 5
R1 Traces             4.65
R1 Summaries          2.87
R1 Explanations       2.92
Correct Traces        2.31
NASA-TLX scale: mental demand, effort, frustration
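The Bonferroni correction named above has simple mechanics: each raw p-value is scaled by the number of comparisons made. With four trace types there are six pairwise comparisons; the raw Mann-Whitney U p-values below are hypothetical, chosen only to illustrate the adjustment.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: scale each raw p-value by the number of
    comparisons, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for the 6 pairwise comparisons
# among the four trace types (4 choose 2 = 6).
raw = [0.001, 0.004, 0.0002, 0.006, 0.0001, 0.0005]
adjusted = bonferroni(raw)
all_significant = all(p < 0.05 for p in adjusted)
print(adjusted, all_significant)
```

A comparison survives the correction only if its raw p-value is below 0.05 / 6 ≈ 0.0083, which makes "all p-values significant" a fairly demanding claim.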
Discussion

Key Takeaways

1
Incorrect traces can outperform correct ones in SFT
Models learn structural trace patterns while largely ignoring their semantics: cross-entropy training on <incorrect trace, correct answer> pairs teaches the model to reproduce the trace format while still emitting the correct answer, yielding surprisingly high final-solution accuracy at inference.
2
Trace correctness and answer accuracy are statistically independent
χ² tests fail to reject the null hypothesis: trace accuracy and final solution accuracy are statistically independent (χ² = 0.34 and 2.93, both below the 3.841 critical value at α = 0.05). Correct traces led to correct answers only 28% of the time, and false positives reached 71.54% in bAbI.
3
Training and user-facing objectives require decoupling
The best-performing training traces (verbose R1 traces) are the worst for users. Generating interpretable, user-facing explanations should be treated as a separate objective from the SFT training signal — requiring distinct modules or post-hoc processing pipelines.
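The independence test behind Takeaway 2 can be reproduced in miniature. The 2×2 contingency table below is hypothetical (raw counts are not published on this page); the sketch only shows the mechanics of a Pearson χ² test of independence at df = 1, where 3.841 is the α = 0.05 critical value.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table
    (no Yates correction); returns (statistic, p_value)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # At df = 1 the chi-square survival function reduces to erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Hypothetical counts: rows = trace correct / incorrect,
# columns = final answer correct / incorrect.
table = [[28, 72],
         [25, 75]]
stat, p = chi2_independence_2x2(table)
# stat falls well below the 3.841 critical value, so we fail to
# reject independence of trace and answer correctness.
print(f"chi2 = {stat:.3f}, p = {p:.3f}")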
Reference

Cite This Work

BIBTEX
@inproceedings{bhambri2026interpretable,
  title     = {Interpretable Traces, Unexpected Outcomes: Investigating the
               Disconnect in Trace-Based Knowledge Distillation},
  author    = {Bhambri, Siddhant and Biswas, Upasana and Kambhampati, Subbarao},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association
               for Computational Linguistics (ACL)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2505.13792}
}