MedPAIR:
Measuring Physicians and AI Relevance Alignment in Medical Question Answering

1 MIT, 2 Cornell University, 3 Oxford University, 4 University of Potsdam, 5 Hasso Plattner Institute, 6 Independent Research
Corresponding author: mghassem@mit.edu
Study Design Pipeline

We began with four QA data sources. All questions were consolidated and structured into two components: a patient profile and a query. In the first step, physician trainee labelers and LLMs independently selected the best answer. In the second step, physician trainees annotated the relevance of each sentence in the patient profile. We excluded annotations from physician trainees who selected an incorrect answer and applied majority voting to derive binary sentence-level relevance labels. In parallel, we asked LLMs to label each sentence in the patient profile through prompting and ContextCite.
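
As a rough illustration of the aggregation step, the sketch below derives binary sentence-level labels by majority vote over the retained physician annotations. The data layout and field names (annotations, selected_correct, labels) are hypothetical placeholders, not the released schema, and the tie-breaking rule is an assumption.

```python
from collections import defaultdict

def majority_vote_relevance(annotations):
    """Aggregate per-sentence relevance votes into binary labels.

    `annotations` is a hypothetical list of dicts, one per physician trainee:
        {"selected_correct": bool,           # did they pick the right answer?
         "labels": {sentence_id: 0 or 1}}    # their per-sentence relevance votes
    Annotators who answered incorrectly are excluded before voting.
    """
    votes = defaultdict(list)
    for ann in annotations:
        if not ann["selected_correct"]:
            continue  # drop annotations tied to an incorrect answer
        for sent_id, label in ann["labels"].items():
            votes[sent_id].append(label)

    # A sentence is "relevant" if at least half of the retained voters marked it so
    # (ties counting as relevant is an assumption, not the paper's stated rule).
    return {sent_id: int(sum(v) * 2 >= len(v)) for sent_id, v in votes.items()}


# Example: two of three trainees answered correctly; only their votes count.
example = [
    {"selected_correct": True,  "labels": {0: 1, 1: 0, 2: 1}},
    {"selected_correct": True,  "labels": {0: 1, 1: 1, 2: 0}},
    {"selected_correct": False, "labels": {0: 0, 1: 0, 2: 0}},
]
print(majority_vote_relevance(example))  # {0: 1, 1: 1, 2: 1}
```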


Abstract

Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the comparative reasoning benchmark MedPAIR (Medical Physicians and AI Relevance Alignment in Medical Question Answering) to evaluate how physicians and LLMs prioritize relevant information when answering these questions. We obtain annotations on 2,000 QA pairs from 36 physicians, labeling each sentence within the question components for relevancy. We compare these relevancy estimates to those of LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both humans and LLMs. We further implement a structured intervention in which physicians guide LLMs to focus on relevant information. We find that LLMs are frequently not aligned with physicians' estimates of content relevancy. Further, we highlight the limitations of current LLM behavior and the importance of aligning AI reasoning with clinical standards.





Dataset Overview

The dataset overview depicts the annotation process for 2,000 clinical question-answer pairs. We consolidated four QA data sources into two main components: the patient profile and the query. In the first step, 36 physician trainees and 4 LLMs independently selected the most appropriate answer. In the second step, physician trainees annotated the relevance of each sentence within the patient profile; annotations linked to incorrect answers were excluded. Majority voting was then used to produce binary relevance labels from the physician trainees' annotations. Concurrently, we employed ContextCite with open-source LLMs (Qwen-14B, Qwen-72B, Llama-70B) to generate relevance scores, while GPT-4o was prompted to replicate the physician annotation process for each sentence following the same instructions.
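
The snippet below is a minimal sketch of how sentence-level relevance scores can be obtained with the open-source context_cite package; the exact checkpoint names, prompt wording, and thresholding used in MedPAIR are assumptions here, not the released pipeline.

```python
# A minimal sketch, assuming the ContextCiter interface documented by the
# open-source `context_cite` package; the model checkpoint and example text
# are illustrative stand-ins, not the paper's exact setup.
from context_cite import ContextCiter

patient_profile = (
    "A 58-year-old man presents with crushing chest pain radiating to the left arm. "
    "He has a 30-pack-year smoking history. "
    "He mentions that his cat was recently ill."
)
query = "What is the most likely diagnosis? Answer with a single diagnosis."

# ContextCite partitions the context (by default, into sentences) and estimates
# how much each part contributed to the model's generated answer.
cc = ContextCiter.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",  # hypothetical stand-in for the "Qwen-14B" model
    patient_profile,
    query,
)

print(cc.response)                                   # the model's answer
attributions = cc.get_attributions(as_dataframe=True)
print(attributions)                                  # per-sentence attribution scores

# One possible way to binarize: treat sentences with positive attribution as
# "relevant" (the threshold is an assumption, not the paper's rule).
```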





Sentence Position Analysis

Sentence Position Analysis. (a) Distribution of physician trainees' majority-vote relevance labels by sentence position. (b) Distribution of GPT-4o self-reported relevance labels by sentence position. (c) ContextCite scores across the context for three open-source models (Qwen-14B, Llama-70B, Qwen-72B).
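
For reference, a distribution like panel (a) can be computed by grouping binary relevance labels by sentence position; the sketch below assumes a simple list-of-lists layout for per-question labels and is not the paper's analysis code.

```python
# Minimal sketch (assumed data layout): fraction of sentences labeled
# "relevant" at each position in the patient profile.
from collections import defaultdict

import matplotlib.pyplot as plt

# Hypothetical input: one list of binary labels per question, ordered by
# sentence position within the patient profile.
labels_per_question = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
]

counts = defaultdict(lambda: [0, 0])   # position -> [num_relevant, num_total]
for labels in labels_per_question:
    for pos, lab in enumerate(labels):
        counts[pos][0] += lab
        counts[pos][1] += 1

positions = sorted(counts)
rates = [counts[p][0] / counts[p][1] for p in positions]

plt.bar(positions, rates)
plt.xlabel("Sentence position")
plt.ylabel("Fraction labeled relevant")
plt.title("Relevance by sentence position (illustrative)")
plt.show()
```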





Sentence Position Analysis (Continued)

Sentence Position Analysis. (a) Distribution of LLM self-reported relevance labels by sentence position; (b) Distribution of physician trainee majority-voted relevance labels by sentence position; (c) Qwen-14B ContextCite scores across the clinical context.





Evaluation Pipeline Overview

Effect of Filtering Context on Final Performance. GPT-4o outperforms all tested open-source language models. After removing irrelevant and low-relevance sentences, Llama-70B and Qwen-14B show the most substantial accuracy improvements. In contrast, Qwen-72B occasionally experiences performance drops after filtering.
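
To make the filtering step concrete, the sketch below re-queries a model on a patient profile reduced to the sentences labeled relevant and compares accuracy against the full-context setting. The record fields and the answer_question callable are hypothetical placeholders for whichever QA model is being evaluated, not the paper's released code.

```python
# Minimal sketch of the context-filtering evaluation (field names and the
# `answer_question` callable are assumptions, not the paper's implementation).
from typing import Callable

def accuracy_with_filtering(
    records: list[dict],
    answer_question: Callable[[str, str, list[str]], str],
    keep_only_relevant: bool,
) -> float:
    """Answer each question from either the full or the filtered patient profile.

    Each record is assumed to contain:
      "sentences": patient-profile sentences, in order
      "relevance": binary physician majority-vote label per sentence
      "query", "options", "answer": the question, choices, and gold answer
    """
    correct = 0
    for rec in records:
        if keep_only_relevant:
            kept = [s for s, r in zip(rec["sentences"], rec["relevance"]) if r == 1]
        else:
            kept = rec["sentences"]
        profile = " ".join(kept)
        prediction = answer_question(profile, rec["query"], rec["options"])
        correct += int(prediction == rec["answer"])
    return correct / len(records)

# Usage: compare accuracy on the full vs. physician-filtered profiles.
# full_acc     = accuracy_with_filtering(records, answer_question, keep_only_relevant=False)
# filtered_acc = accuracy_with_filtering(records, answer_question, keep_only_relevant=True)
```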




Video Presentation

BibTeX

@misc{hao2025medpairmeasuringphysiciansai,
      title={MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering}, 
      author={Yuexing Hao and Kumail Alhamoud and Hyewon Jeong and Haoran Zhang and Isha Puri and Philip Torr and Mike Schaekermann and Ariel D. Stern and Marzyeh Ghassemi},
      year={2025},
      eprint={2505.24040},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.24040}, 
}