A Chain-of-Thought Is as Strong as Its Weakest Link:
A Benchmark for Verifiers of Reasoning Chains

Bar Ilan University   ·   Google Research   ·   Google DeepMind   ·   Tel Aviv University

We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings.

REVEAL includes comprehensive labels and free-text justifications for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle to verify reasoning chains, in particular to verify logical correctness and to detect contradictions.

Background

Prompting models to generate reasoning chains is popular and useful, but when we actually check whether this reasoning is correct, it often isn't. Better reasoning is correlated with better performance on the end task, and it can also be useful in itself.

Recent works introduced verification methods for reasoning chains, but how good are these verifiers? To answer this question, we introduce REVEAL, a human-labeled benchmark for verifiers of reasoning chains.

What is REVEAL?

Each reasoning chain in REVEAL was annotated step by step, marking for every step its type (either attribution to an evidence passage or logical inference from previous steps), its relevance, and its correctness.
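To make this concrete, below is a minimal Python sketch of how a REVEAL-style step annotation could be represented. This is not the official schema; the field names are illustrative assumptions, so consult the dataset card for the actual fields.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class StepType(Enum):
    ATTRIBUTION = "attribution"  # a factual claim that should be supported by an evidence passage
    LOGIC = "logic"              # an inference that should follow from the previous steps

@dataclass
class StepAnnotation:
    step_text: str
    step_type: StepType
    is_relevant: bool                       # does the step contribute to answering the question?
    is_correct: Optional[bool]              # attribution: supported by evidence; logic: entailed by prior steps
    evidence_passage: Optional[str] = None  # retrieved Wikipedia paragraph, for attribution steps
    justification: Optional[str] = None     # free-text rationale behind the label

@dataclass
class ChainAnnotation:
    question: str
    final_answer: str
    steps: list[StepAnnotation]

    def fully_correct(self) -> bool:
        # "As strong as its weakest link": the chain counts as correct
        # only if every relevant step is individually correct.
        return all(s.is_correct for s in self.steps if s.is_relevant)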


Verifying Reasoning Chains

We collected the dataset by separating the reasoning chains into steps, labeling each step's type (attribution or logic) and its relevance, and then, depending on the step type, labeling its correctness. For attribution steps, we retrieve Wikipedia paragraphs to attribute the step; for logic steps, we check whether the step is logically entailed by the previous steps.
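A rough sketch of this per-step recipe in Python follows. The checkers (is_relevant, is_supported, is_entailed) and the retriever are passed in as callables because in the paper these roles are filled by human annotators and a retrieval step; everything here is an illustrative assumption, not the actual annotation tooling.

def split_into_steps(chain: str) -> list[str]:
    # Placeholder segmentation: in REVEAL the chains are already split into steps.
    return [s.strip() for s in chain.split("\n") if s.strip()]

def verify_chain(question: str, chain: str, retrieve, is_relevant, is_supported, is_entailed) -> list[dict]:
    """Label each step's relevance, type, and type-specific correctness."""
    steps = split_into_steps(chain)
    records = []
    for i, step in enumerate(steps):
        record = {"step": step, "relevant": is_relevant(question, step)}
        # Placeholder heuristic for the step type; real annotation distinguishes
        # factual claims (attribution) from inferences over previous steps (logic).
        record["type"] = "logic" if step.lower().startswith(("so ", "therefore", "thus")) else "attribution"
        if record["type"] == "attribution":
            evidence = retrieve(step)                         # e.g., Wikipedia paragraphs
            record["correct"] = is_supported(step, evidence)  # is the claim attributable to the evidence?
        else:
            record["correct"] = is_entailed(step, steps[:i])  # does it follow from the previous steps?
        records.append(record)
    return records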


Takeaway (1): Verifying reasoning chains is hard.

Interestingly, while LMs are decent at making simple logical inferences, they are bad at making attributable claims (77% attribution errors vs. 18% logic errors). *Verifiers* of reasoning, however, are better at checking attribution than logical correctness.


Takeaway (2): The current best ways to verify reasoning chains and chain-of-thought.

Verifiers improve when verification is structured as predicting specific errors in each step separately (“Pipeline” in the table), but there is still a lot of headroom.
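For intuition, here is a rough sketch of what such a per-step, per-error pipeline could look like. ask_llm is an assumed callable that sends a single prompt to any instruction-following LM and returns a short answer; the prompts are illustrative, not the ones used in the paper.

def pipeline_verify(ask_llm, question: str, steps: list[str], evidence: dict[int, str]) -> bool:
    for i, step in enumerate(steps):
        previous = "\n".join(steps[:i])

        # 1. Relevance: irrelevant steps are noted but do not fail the chain.
        if ask_llm(f"Question: {question}\nStep: {step}\n"
                   "Is this step relevant to answering the question? (yes/no)") == "no":
            continue

        # 2. Step type: decides which correctness check to run.
        step_type = ask_llm(f"Step: {step}\n"
                            "Is this step a factual claim (attribution) "
                            "or an inference from previous steps (logic)?")

        # 3. Type-specific correctness check.
        if step_type == "attribution":
            verdict = ask_llm(f"Evidence: {evidence.get(i, '')}\nClaim: {step}\n"
                              "Is the claim fully supported by the evidence? (yes/no)")
        else:
            verdict = ask_llm(f"Previous steps: {previous}\nStep: {step}\n"
                              "Does this step logically follow from the previous steps? (yes/no)")

        if verdict == "no":
            return False  # weakest link: one bad step fails the whole chain
    return True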


Takeaway (3): Most reasoning chains by modern language models are incorrect.

The overall percentage of correct reasoning in REVEAL is around 20% (i.e., the LMs we tested gave fully correct CoTs about 20% of the time). The entailment of knowledge claims is generally low (43.8%). Even when a model's final answer is correct, the chance that its stated reasoning is correct is not high.
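This is the weakest-link effect: a chain is only correct if every step is, so chain-level correctness drops quickly with length. The back-of-the-envelope calculation below uses hypothetical numbers (not figures from the paper) purely to illustrate the point.

# Illustrative arithmetic only, under a simplifying independence assumption.
per_step_correctness = 0.85  # hypothetical probability that a single step is correct
chain_length = 8             # hypothetical number of steps in the chain

full_chain_correct = per_step_correctness ** chain_length
print(f"{full_chain_correct:.1%}")  # -> 27.2%: few chains have every step correct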


Additional Analyses

Check the paper for many more details and analyses, for example the challenges of collecting this kind of data, including a deep-dive analysis of disagreements and borderline cases. We release all of the hard borderline cases as a separate data split.


Conclusions

REVEAL is a high-quality, meticulously crafted dataset for benchmarking and analysing manual and automatic verification of reasoning chains in chain-of-thought formats. The data includes a variety of useful labels for each reasoning step, along with free-text justifications behind them.

The paper includes a wide variety of analyses and evaluations using the data, and we also describe our annotation methodology in detail to support future annotation efforts. Please consider using the dataset in your research :)

BibTeX

@misc{jacovi2024chainofthought,
      title={A Chain-of-Thought Is as Strong as Its Weakest Link: {A} Benchmark for Verifiers of Reasoning Chains}, 
      author={Alon Jacovi and Yonatan Bitton and Bernd Bohnet and Jonathan Herzig and Or Honovich and Michael Tseng and Michael Collins and Roee Aharoni and Mor Geva},
      year={2024},
      eprint={2402.00559},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}