MedPAIR:
Measuring Physicians and AI Relevance Alignment in Medical Question Answering

1 MIT, 2 Cornell University, 3 Oxford University, 4 University of Potsdam, 5 Hasso Plattner Institute, 6 Independent Research
Corresponding author: mghassem@mit.edu
Study Design Pipeline

We began with four QA data sources. All questions were consolidated and structured into two components: a patient profile and a query. In the first step, physician trainee labelers and LLMs independently selected the best answer. In the second step, physician trainees annotated the relevance of each sentence in the patient profile. We excluded annotations from physician trainees who selected an incorrect answer and applied majority voting to derive binary sentence-level relevance labels. In parallel, we asked LLMs to label each sentence in the patient profile through prompting and ContextCite.
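
As a rough illustration of the aggregation step, the sketch below derives binary sentence-level labels by majority vote over the retained physician annotations. The data layout and field names (annotations, selected_correct, labels) are hypothetical placeholders, not the released schema, and the tie-breaking rule is an assumption.

```python
from collections import defaultdict

def majority_vote_relevance(annotations):
    """Aggregate per-sentence relevance votes into binary labels.

    `annotations` is a hypothetical list of dicts, one per physician trainee:
        {"selected_correct": bool,           # did they pick the right answer?
         "labels": {sentence_id: 0 or 1}}    # their per-sentence relevance votes
    Annotators who answered incorrectly are excluded before voting.
    """
    votes = defaultdict(list)
    for ann in annotations:
        if not ann["selected_correct"]:
            continue  # drop annotations tied to an incorrect answer
        for sent_id, label in ann["labels"].items():
            votes[sent_id].append(label)

    # A sentence is "relevant" if at least half of the retained voters marked it so
    # (ties counting as relevant is an assumption, not the paper's stated rule).
    return {sent_id: int(sum(v) * 2 >= len(v)) for sent_id, v in votes.items()}


# Example: two of three trainees answered correctly; only their votes count.
example = [
    {"selected_correct": True,  "labels": {0: 1, 1: 0, 2: 1}},
    {"selected_correct": True,  "labels": {0: 1, 1: 1, 2: 0}},
    {"selected_correct": False, "labels": {0: 0, 1: 0, 2: 0}},
]
print(majority_vote_relevance(example))  # {0: 1, 1: 1, 2: 1}
```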


Abstract

Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the comparative reasoning benchmark MedPAIR (Medical Physicians and AI Relevance Alignment in Medical Question Answering) to evaluate how physicians and LLMs prioritize relevant information when answering these questions. We obtain annotations on 2,000 QA pairs from 36 physicians, labeling each sentence within the question components for relevancy. We compare these relevancy estimates to those of LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both humans and LLMs. We further implement a structured intervention in which physicians guide LLMs to focus on relevant information. We find that LLMs are frequently not aligned with physicians' estimates of content relevancy. Further, we highlight the limitations of current LLM behavior and the importance of aligning AI reasoning with clinical standards.





Dataset Overview

The dataset overview depicts the annotation process for 2,000 clinical question-answer pairs. We consolidated four QA data sources into two main components: the patient profile and the query. In the first step, 36 physician trainees and 4 LLMs independently selected the most appropriate answer. In the second step, physician trainees annotated the relevance of each sentence within the patient profile; annotations linked to incorrect answers were excluded. Majority voting was then used to produce binary relevance labels from the physician trainees' annotations. Concurrently, we employed ContextCite with open-source LLMs (Qwen-14B, Qwen-72B, Llama-70B) to generate relevance scores, while GPT-4o was prompted to replicate the physician annotation process for each sentence following the same instructions.
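
The snippet below is a minimal sketch of how sentence-level relevance scores can be obtained with the open-source context_cite package; the exact checkpoint names, prompt wording, and thresholding used in MedPAIR are assumptions here, not the released pipeline.

```python
# A minimal sketch, assuming the ContextCiter interface documented by the
# open-source `context_cite` package; the model checkpoint and example text
# are illustrative stand-ins, not the paper's exact setup.
from context_cite import ContextCiter

patient_profile = (
    "A 58-year-old man presents with crushing chest pain radiating to the left arm. "
    "He has a 30-pack-year smoking history. "
    "He mentions that his cat was recently ill."
)
query = "What is the most likely diagnosis? Answer with a single diagnosis."

# ContextCite partitions the context (by default, into sentences) and estimates
# how much each part contributed to the model's generated answer.
cc = ContextCiter.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",  # hypothetical stand-in for the "Qwen-14B" model
    patient_profile,
    query,
)

print(cc.response)                                   # the model's answer
attributions = cc.get_attributions(as_dataframe=True)
print(attributions)                                  # per-sentence attribution scores

# One possible way to binarize: treat sentences with positive attribution as
# "relevant" (the threshold is an assumption, not the paper's rule).
```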





Sentence Position Analysis

Sentence Position Analysis. (a) Distribution of physician trainees' majority-vote relevance labels by sentence position. (b) Distribution of GPT-4o self-reported relevance labels by sentence position. (c) ContextCite scores across the context for three open-source models (Qwen-14B, Llama-70B, Qwen-72B).
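
For reference, a distribution like panel (a) can be computed by grouping binary relevance labels by sentence position; the sketch below assumes a simple list-of-lists layout for per-question labels and is not the paper's analysis code.

```python
# Minimal sketch (assumed data layout): fraction of sentences labeled
# "relevant" at each position in the patient profile.
from collections import defaultdict

import matplotlib.pyplot as plt

# Hypothetical input: one list of binary labels per question, ordered by
# sentence position within the patient profile.
labels_per_question = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
]

counts = defaultdict(lambda: [0, 0])   # position -> [num_relevant, num_total]
for labels in labels_per_question:
    for pos, lab in enumerate(labels):
        counts[pos][0] += lab
        counts[pos][1] += 1

positions = sorted(counts)
rates = [counts[p][0] / counts[p][1] for p in positions]

plt.bar(positions, rates)
plt.xlabel("Sentence position")
plt.ylabel("Fraction labeled relevant")
plt.title("Relevance by sentence position (illustrative)")
plt.show()
```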





Sentence Position Analysis (Continued)

Sentence Position Analysis. (a) Distribution of LLM self-reported relevance labels by sentence position; (b) Distribution of physician trainee majority-voted relevance labels by sentence position; (c) Qwen-14B ContextCite scores across the clinical context.





Evaluation Pipeline Overview

Effect of Filtering Context on Final Performance. GPT-4o outperforms all tested open-source language models. After removing irrelevant and low-relevance sentences, Llama-70B and Qwen-14B show the most substantial accuracy improvements. In contrast, Qwen-72B occasionally experiences performance drops after filtering.
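
To make the filtering step concrete, the sketch below re-queries a model on a patient profile reduced to the sentences labeled relevant and compares accuracy against the full-context setting. The record fields and the answer_question callable are hypothetical placeholders for whichever QA model is being evaluated, not the paper's released code.

```python
# Minimal sketch of the context-filtering evaluation (field names and the
# `answer_question` callable are assumptions, not the paper's implementation).
from typing import Callable

def accuracy_with_filtering(
    records: list[dict],
    answer_question: Callable[[str, str, list[str]], str],
    keep_only_relevant: bool,
) -> float:
    """Answer each question from either the full or the filtered patient profile.

    Each record is assumed to contain:
      "sentences": patient-profile sentences, in order
      "relevance": binary physician majority-vote label per sentence
      "query", "options", "answer": the question, choices, and gold answer
    """
    correct = 0
    for rec in records:
        if keep_only_relevant:
            kept = [s for s, r in zip(rec["sentences"], rec["relevance"]) if r == 1]
        else:
            kept = rec["sentences"]
        profile = " ".join(kept)
        prediction = answer_question(profile, rec["query"], rec["options"])
        correct += int(prediction == rec["answer"])
    return correct / len(records)

# Usage: compare accuracy on the full vs. physician-filtered profiles.
# full_acc     = accuracy_with_filtering(records, answer_question, keep_only_relevant=False)
# filtered_acc = accuracy_with_filtering(records, answer_question, keep_only_relevant=True)
```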




Video Presentation

BibTeX

@misc{hao2025medpairmeasuringphysiciansai,
      title={MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering}, 
      author={Yuexing Hao and Kumail Alhamoud and Hyewon Jeong and Haoran Zhang and Isha Puri and Philip Torr and Mike Schaekermann and Ariel D. Stern and Marzyeh Ghassemi},
      year={2025},
      eprint={2505.24040},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.24040}, 
}