Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

University of Virginia
AAAI 2026
Overview of Expert-CoT and ExpertRAG for EMS QA
Figure 1. Recent medical MCQA systems leverage chain-of-thought (CoT) prompting and retrieval-augmented generation (RAG) to strengthen reasoning and draw on external knowledge. However, they typically treat both reasoning and retrieval as task-agnostic: the model sees a question, then reasons or retrieves without regard to the type of clinical expertise required. In practice, clinicians first identify the subject area (e.g., trauma, airway, pharmacology) and the relevant certification level (EMR/EMT/AEMT/Paramedic), and then apply protocols specific to that expertise. Existing benchmarks (e.g., MedQA, MMLU-Med, MedMCQA) lack this structured notion of expertise and the aligned knowledge resources, making it difficult to ground retrieval and reasoning in human clinical workflow.

Abstract

Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on—such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on the specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves accuracy by up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.67% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
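
As a concrete illustration of how Expert-CoT conditions reasoning on expertise, the sketch below assembles a prompt that states the subject area and certification level before the question. The function name, prompt wording, and the example question are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of an Expert-CoT prompt builder. Function name, prompt
# wording, and the example question are illustrative assumptions, not the
# paper's exact templates.

def build_expert_cot_prompt(question: str, options: list[str],
                            subject_area: str, cert_level: str) -> str:
    """Condition chain-of-thought reasoning on the clinical subject area
    and EMS certification level before presenting the question."""
    option_lines = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return (
        f"You are an EMS expert answering a {cert_level}-level question "
        f"in the subject area of {subject_area}.\n"
        "Apply the protocols appropriate to this certification level.\n\n"
        f"Question: {question}\n{option_lines}\n\n"
        "Let's think step by step, then state the final answer letter."
    )

# Hypothetical usage:
print(build_expert_cot_prompt(
    question="Which airway adjunct is most appropriate for an unresponsive "
             "patient with an intact gag reflex?",
    options=["Oropharyngeal airway", "Nasopharyngeal airway",
             "Endotracheal tube", "Supraglottic airway"],
    subject_area="Airway",
    cert_level="EMT",
))
```

In contrast to vanilla CoT, the subject area and certification level anchor the model's first reasoning steps before it sees the question, mirroring how a clinician first identifies the required expertise.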

EMS-MCQA Dataset Statistics

A high-level snapshot of EMS-MCQA by subject area, certification level, and question attributes.

EMS-MCQA split statistics table
Figure 2a. EMS-MCQA statistics by split (train/val/test, public vs. private).
Distribution by certification level and subject area
Figure 2b. Question distribution by certification level and subject area.

Approach

High-level overview of Expert-CoT and ExpertRAG for EMS QA
Figure 3. How do we inject domain expertise into LLMs for EMS QA? (1) Expertise-guided prompting (Expert-CoT): Standard CoT encourages step-by-step reasoning but does not specify where to begin. Expert-CoT anchors the model’s first steps by explicitly providing the subject area and certification level. (2) Expertise-guided RAG (ExpertRAG): A lightweight filter predicts the question’s subject area and steers the retriever toward matching knowledge-base (KB) entries and patient records (PR); the LLM then conditions on the predicted expertise plus the retrieved evidence to produce the answer. We study three retriever strategies (sketched below): Global – retrieve the top M KB and N PR documents from the full corpora (baseline, no subject filtering); Filter → Retrieve (FTR) – first filter the KB/PR to the predicted subject area ŝi, then retrieve the top M and N; Retrieve → Filter (RTF) – first retrieve a larger pool from the full KB/PR (e.g., 10×M and 10×N), then drop items whose subject area ≠ ŝi, keeping the best M and N.
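
As a rough illustration of the three retriever strategies in Figure 3, the sketch below contrasts Global, Filter → Retrieve (FTR), and Retrieve → Filter (RTF). The Doc/Retriever classes, the embedding callable, and the cosine scorer are assumptions made for illustration; only the filtering order and the 10× over-retrieval in RTF follow the caption.

```python
# Sketch of the three ExpertRAG retriever strategies (Global, FTR, RTF).
# The Doc/Retriever interfaces and the embedding function are assumptions;
# only the Global / Filter->Retrieve / Retrieve->Filter logic follows Figure 3.
from dataclasses import dataclass

import numpy as np


def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


@dataclass
class Doc:
    text: str
    subject_area: str      # e.g., "Trauma", "Airway", "Pharmacology"
    score: float = 0.0


class Retriever:
    def __init__(self, corpus: list[Doc], embed):
        self.corpus = corpus
        self.embed = embed  # callable: str -> embedding vector

    def search(self, query, k, subject_area=None) -> list[Doc]:
        """Top-k documents by cosine similarity, optionally restricted to one
        subject area before scoring (used by the FTR strategy)."""
        pool = (self.corpus if subject_area is None
                else [d for d in self.corpus if d.subject_area == subject_area])
        q = self.embed(query)
        for d in pool:
            d.score = cosine(q, self.embed(d.text))
        return sorted(pool, key=lambda d: d.score, reverse=True)[:k]


def retrieve_global(kb, pr, question, M, N):
    # Global baseline: top M / N from the full KB and PR corpora, no filtering.
    return kb.search(question, M), pr.search(question, N)


def retrieve_ftr(kb, pr, question, M, N, pred_subject):
    # Filter -> Retrieve: restrict both corpora to the predicted subject area
    # first, then take the top M / N.
    return (kb.search(question, M, subject_area=pred_subject),
            pr.search(question, N, subject_area=pred_subject))


def retrieve_rtf(kb, pr, question, M, N, pred_subject):
    # Retrieve -> Filter: over-retrieve (10x M / 10x N) from the full corpora,
    # drop documents whose subject area does not match, keep the best M / N.
    kb_pool = kb.search(question, 10 * M)
    pr_pool = pr.search(question, 10 * N)
    keep = lambda pool, k: [d for d in pool if d.subject_area == pred_subject][:k]
    return keep(kb_pool, M), keep(pr_pool, N)
```

The trade-off between the two filtered strategies follows directly from the ordering: FTR guarantees every retrieved document matches the predicted subject area, while RTF keeps only globally strong matches at the cost of retrieving a larger candidate pool first.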

Results

Video Presentation

Poster

BibTeX