When correct reasoning traces don't guarantee correct answers — and vice versa.
We challenge a core assumption in LLM training: that chain-of-thought (CoT) reasoning traces must be semantically correct and interpretable to improve task performance. Using a rule-based problem decomposition framework on Open-Book QA, we show that correct traces led to correct answers only 28% of the time, and that the most performance-effective traces (DeepSeek R1) were rated the least interpretable by 100 human participants, calling for a fundamental rethink of how we design and evaluate reasoning traces.
We design a controlled distillation framework where both intermediate traces and final answers can be independently evaluated, removing the confound that plagues standard CoT evaluation.
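The key property of this setup is that trace correctness and answer correctness are scored separately, so their agreement can be measured directly. A minimal sketch of that agreement computation, assuming a hypothetical record format with boolean `trace_correct` and `answer_correct` fields (the field names and sample data are illustrative, not from the paper):

```python
def trace_answer_agreement(records):
    """Among records whose trace was judged correct, return the
    fraction whose final answer was also correct (illustrative:
    the paper reports this figure at ~28% for Open-Book QA)."""
    correct_traces = [r for r in records if r["trace_correct"]]
    if not correct_traces:
        return 0.0
    hits = sum(r["answer_correct"] for r in correct_traces)
    return hits / len(correct_traces)

# Toy data: 3 correct traces, only 1 of which yields a correct answer.
sample = [
    {"trace_correct": True,  "answer_correct": True},
    {"trace_correct": True,  "answer_correct": False},
    {"trace_correct": True,  "answer_correct": False},
    {"trace_correct": False, "answer_correct": True},
]
print(round(trace_answer_agreement(sample), 2))  # prints 0.33
```

Because each record carries two independent labels, the same structure also supports the converse question the paper raises: how often incorrect traces nonetheless produce correct answers.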
↑ Incorrect traces outperform correct traces — even achieving the highest accuracy for Qwen. Max bar = 100%.
↑ On MARCO, correct traces lead — but only marginally over incorrect traces for Llama.
↑ On bAbI, SFT Vanilla wins — traces add little and incorrect traces degrade Llama performance.
100 participants (Prolific, US) rated four trace types on a 5-point Likert scale across interpretability and cognitive-load dimensions.
@inproceedings{bhambri2026interpretable,
  title     = {Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation},
  author    = {Bhambri, Siddhant and Biswas, Upasana and Kambhampati, Subbarao},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2505.13792}
}