Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering
Abstract
Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject area (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on the specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves accuracy by up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.67% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
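To make the idea concrete, below is a minimal sketch of how an Expert-CoT prompt might be assembled, optionally grounded in retrieved, subject area-aligned documents in the spirit of ExpertRAG. The function name, prompt wording, and example question are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only: the prompt template, helper name, and example
# question are assumptions and do not come from the EMSQA release.
from typing import List, Optional


def build_expert_cot_prompt(
    question: str,
    options: List[str],
    subject_area: str,         # e.g., "trauma", "airway"
    certification_level: str,  # e.g., "EMT", "Paramedic"
    retrieved_docs: Optional[List[str]] = None,
) -> str:
    """Condition chain-of-thought reasoning on the clinical subject area and
    certification level, optionally grounding it in retrieved documents."""
    context = ""
    if retrieved_docs:
        # ExpertRAG-style grounding: prepend subject area-aligned passages.
        context = "Reference material:\n" + "\n---\n".join(retrieved_docs) + "\n\n"

    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))

    return (
        f"You are an EMS expert answering a {certification_level}-level question "
        f"in the subject area of {subject_area}.\n\n"
        f"{context}"
        f"Question: {question}\n{option_block}\n\n"
        "Think step by step as a certified provider at this level would, "
        "then state the single best answer as a letter."
    )


# Usage with a hypothetical question (not drawn from EMSQA); the resulting
# prompt can be passed to any LLM text-generation interface.
prompt = build_expert_cot_prompt(
    question="Which airway adjunct is contraindicated in a patient with a suspected basilar skull fracture?",
    options=["Oropharyngeal airway", "Nasopharyngeal airway", "Bag-valve mask", "Supraglottic airway"],
    subject_area="airway",
    certification_level="EMT",
)
print(prompt)
```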
EMSQA Dataset Statistics
A high-level snapshot of EMSQA by subject area, certification level, and question attributes.
Results
Where do SOTA LLMs shine or stumble on EMSQA across subject areas and certification levels?
How much does the explicit expertise injected by Expert-CoT lift baseline accuracy?
How much does the combined expertise injected by Expert-CoT and ExpertRAG lift baseline accuracy?
Can expertise-aware LLMs meet the NREMT passing score at different certification levels?
Video Presentation
Poster
BibTeX