ACL 2026  ·  Main Track

Interpretable Traces, Unexpected Outcomes:
Investigating the Disconnect in
Trace-Based Knowledge Distillation

When correct reasoning traces don't guarantee correct answers — and vice versa.

Siddhant Bhambri Samsung Research America
Upasana Biswas Arizona State University
Subbarao Kambhampati Arizona State University

We challenge a core assumption in LLM training: that CoT reasoning traces must be semantically correct and interpretable to improve task performance. Using a rule-based problem decomposition framework on Open-Book QA, we show that correct traces led to correct answers only 28% of the time, and that the most performance-effective traces (DeepSeek R1) were rated as the least interpretable by 100 human participants — calling for a fundamental rethink of how we design and evaluate reasoning traces.

Key Findings

Two Surprising Disconnects

01
28%
Trace correctness ≠ answer correctness.
Correct CoT traces led to correct final solutions only 28% of the time, and incorrect traces did not consistently degrade performance, confirming near-independence (χ² below the 3.841 critical value at α = 0.05).
02
R1: 3.39 vs 4.86
Best performance = worst interpretability.
R1 traces achieved top accuracy but scored 3.39 / 5 on interpretability and 4.59 / 5 on cognitive load. Verifiably-correct traces were most interpretable (4.86) but performed worst in SFT.
Methodology

Experimental Design

We design a controlled distillation framework where both intermediate traces and final answers can be independently evaluated, removing the confound that plagues standard CoT evaluation.

1. 📖 Open-Book QA (CoTemp · MARCO · bAbI)
2. ✂️ Rule-based decomposition (classification + information-retrieval steps)
3. 📊 Build SFT datasets (✓ correct / ✗ incorrect traces)
4. 🤖 Fine-tune SLMs (Llama · Qwen)
5. 👥 Human study (100 participants on Prolific)
DATASETS
  • 📎 CoTemp QA  4,748 samples
  • 📎 MARCO QA  6,000 samples
  • 📎 bAbI QA  4,149 samples
MODELS
  • 🦙 Llama-3.2-1B-Instruct
  • ⚡ Qwen3-1.7B
  • 🦙 Llama-3.1-8B / Qwen3-8B
TRAINING
  • A100 GPU (80 GB)
  • QLoRA (rank 16, α 32)
  • 3 epochs · LR 2e-4
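The contrast between the ✓ correct and ✗ incorrect SFT datasets can be sketched as follows. This is a minimal illustration, not the authors' actual schema: the function, field names, and example question are all hypothetical. The point is that the two training variants share the same structural format and the same correct final answer, and differ only in trace semantics.

```python
def make_sft_example(question, context, trace_steps, final_answer):
    """Pack an open-book QA item into a single supervised target:
    the model is trained to emit the trace, then the final answer."""
    prompt = f"Context: {context}\nQuestion: {question}\n"
    target = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(trace_steps))
    target += f"\nFinal answer: {final_answer}"
    return {"prompt": prompt, "completion": target}

# ✓ variant: semantically correct trace, correct answer
correct = make_sft_example(
    "Where was the book published?",
    "The book was published in Boston in 1998.",
    ["Locate the publication sentence.", "Extract the city: Boston."],
    "Boston",
)

# ✗ variant: semantically wrong trace, same structure, same correct answer
incorrect = make_sft_example(
    "Where was the book published?",
    "The book was published in Boston in 1998.",
    ["Locate the publication sentence.", "Extract the city: Paris."],
    "Boston",
)
```

Holding the prompt, step format, and answer constant while corrupting only the trace content is what lets trace correctness be varied independently of answer correctness.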
Results

Final Solution Accuracy (%)

CoTemp QA
Model                   SFT Vanilla   w/ Correct Traces   w/ Incorrect Traces
Qwen3-1.7B                 60.3%            52.9%               63.9%
Llama-3.2-1B-Instruct      44.7%            39.6%               45.6%

↑ Incorrect traces outperform correct traces, even achieving the highest accuracy for Qwen.

MARCO QA
Model                   SFT Vanilla   w/ Correct Traces   w/ Incorrect Traces
Qwen3-1.7B                  3.4%            26.3%               20.3%
Llama-3.2-1B-Instruct      33.4%            33.7%               28.9%

↑ On MARCO, correct traces lead, but only marginally over incorrect traces for Llama.

bAbI QA
Model                   SFT Vanilla   w/ Correct Traces   w/ Incorrect Traces
Qwen3-1.7B                 97.9%            94.4%               95.2%
Llama-3.2-1B-Instruct      96.5%            94.4%               86.2%

↑ On bAbI, SFT Vanilla wins: traces add little, and incorrect traces degrade Llama performance.

Human Interpretability Study

100 participants (Prolific, US) rated four trace types on a 5-point Likert scale across interpretability and cognitive load dimensions.

"Fine-tuning on verbose R1 traces produced the best model performance but these traces were rated as least interpretable by users — scoring 3.39 on interpretability and 4.59 on cognitive load. Correct traces were rated most interpretable (4.86) but performed worst in SFT."
📊 Interpretability Score (↑ better)
Trace type         Mean / 5
R1 Traces             3.31
R1 Summaries          4.53
R1 Explanations       4.29
Correct Traces        4.86
All p-values significant (Mann-Whitney U, Bonferroni-corrected)
🧠 Cognitive Load: Mental Demand (↓ better)
Trace type         Mean / 5
R1 Traces             4.65
R1 Summaries          2.87
R1 Explanations       2.92
Correct Traces        2.31
NASA-TLX scale: mental demand, effort, frustration
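The Bonferroni correction named above has simple mechanics: each raw p-value is scaled by the number of comparisons made. With four trace types there are six pairwise comparisons; the raw Mann-Whitney U p-values below are hypothetical, chosen only to illustrate the adjustment.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: scale each raw p-value by the number of
    comparisons, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for the 6 pairwise comparisons
# among the four trace types (4 choose 2 = 6).
raw = [0.001, 0.004, 0.0002, 0.006, 0.0001, 0.0005]
adjusted = bonferroni(raw)
all_significant = all(p < 0.05 for p in adjusted)
print(adjusted, all_significant)
```

A comparison survives the correction only if its raw p-value is below 0.05 / 6 ≈ 0.0083, which makes "all p-values significant" a fairly demanding claim.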
Discussion

Key Takeaways

1
Incorrect traces can outperform correct ones in SFT
Models learn structural trace patterns while largely ignoring their semantics: cross-entropy training on <incorrect trace, correct answer> pairs teaches the model to reproduce the trace format while still emitting the correct answer, yielding surprisingly high final-solution accuracy at inference.
2
Trace correctness and answer accuracy are statistically independent
χ² tests fail to reject the null hypothesis: trace accuracy and final solution accuracy are statistically independent (χ² = 0.34 and 2.93, both below the 3.841 critical value at α = 0.05). Correct traces led to correct answers only 28% of the time, and false positives reached 71.54% in bAbI.
3
Training and user-facing objectives require decoupling
The best-performing training traces (verbose R1 traces) are the worst for users. Generating interpretable, user-facing explanations should be treated as a separate objective from the SFT training signal — requiring distinct modules or post-hoc processing pipelines.
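The independence test behind Takeaway 2 can be reproduced in miniature. The 2×2 contingency table below is hypothetical (raw counts are not published on this page); the sketch only shows the mechanics of a Pearson χ² test of independence at df = 1, where 3.841 is the α = 0.05 critical value.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table
    (no Yates correction); returns (statistic, p_value)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # At df = 1 the chi-square survival function reduces to erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Hypothetical counts: rows = trace correct / incorrect,
# columns = final answer correct / incorrect.
table = [[28, 72],
         [25, 75]]
stat, p = chi2_independence_2x2(table)
# stat falls well below the 3.841 critical value, so we fail to
# reject independence of trace and answer correctness.
print(f"chi2 = {stat:.3f}, p = {p:.3f}")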
Reference

Cite This Work

BIBTEX
@inproceedings{bhambri2026interpretable,
  title     = {Interpretable Traces, Unexpected Outcomes: Investigating the
               Disconnect in Trace-Based Knowledge Distillation},
  author    = {Bhambri, Siddhant and Biswas, Upasana and Kambhampati, Subbarao},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association
               for Computational Linguistics (ACL)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2505.13792}
}