Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering
Abstract
Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject area (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on the specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves accuracy by up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.67% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
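To make the idea concrete, below is a minimal sketch of how an Expert-CoT prompt might be assembled, optionally grounded in retrieved, subject area-aligned documents in the spirit of ExpertRAG. The function name, prompt wording, and example question are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only: the prompt template, helper name, and example
# question are assumptions and do not come from the EMSQA release.
from typing import List, Optional


def build_expert_cot_prompt(
    question: str,
    options: List[str],
    subject_area: str,         # e.g., "trauma", "airway"
    certification_level: str,  # e.g., "EMT", "Paramedic"
    retrieved_docs: Optional[List[str]] = None,
) -> str:
    """Condition chain-of-thought reasoning on the clinical subject area and
    certification level, optionally grounding it in retrieved documents."""
    context = ""
    if retrieved_docs:
        # ExpertRAG-style grounding: prepend subject area-aligned passages.
        context = "Reference material:\n" + "\n---\n".join(retrieved_docs) + "\n\n"

    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))

    return (
        f"You are an EMS expert answering a {certification_level}-level question "
        f"in the subject area of {subject_area}.\n\n"
        f"{context}"
        f"Question: {question}\n{option_block}\n\n"
        "Think step by step as a certified provider at this level would, "
        "then state the single best answer as a letter."
    )


# Usage with a hypothetical question (not drawn from EMSQA); the resulting
# prompt can be passed to any LLM text-generation interface.
prompt = build_expert_cot_prompt(
    question="Which airway adjunct is contraindicated in a patient with a suspected basilar skull fracture?",
    options=["Oropharyngeal airway", "Nasopharyngeal airway", "Bag-valve mask", "Supraglottic airway"],
    subject_area="airway",
    certification_level="EMT",
)
print(prompt)
```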
EMSQA Dataset Statistics
A high-level snapshot of EMSQA by subject area, certification level, and question attributes.
Results
Where do SOTA LLMs shine or stumble on EMSQA across subject areas and certification levels?
How much does the explicit expertise injected by Expert-CoT lift baseline accuracy?
How much does the combined expertise injected by Expert-CoT and ExpertRAG lift baseline accuracy?
Can expertise-aware LLMs meet the NREMT passing score at different certification levels?
Video Presentation
Poster
BibTeX