Can Artificial Intelligence Write Exam Questions?
Our recent study demonstrates the promise and pitfalls of letting large language models draft SBAs
Large language models (LLMs) have arrived at a moment when assessment teams are stretched and question banks feel threadbare. The allure is obvious: type a learning objective into a chatbot and receive a perfectly formed single-best-answer (SBA) item a few seconds later. But how close are we to that reality, and what risks come with outsourcing question writing to machines?
Generating high-quality SBAs demands time, cognitive effort and content expertise. Perhaps you know a fellow educator who spends an hour or more polishing each item. Perhaps you know some who don't contribute any SBAs at all! If AI could shoulder even part of that burden, faculties could redirect energy towards feedback, blueprinting and wider curriculum tasks.
We recently published a study piloting AI-generated SBAs with our cohort of graduate-entry medical students1. The study used GPT-4 to draft 220 SBAs, of which 69% survived expert review with no or only minor edits. When deployed in a controlled online exam for ScotGEM students, the surviving AI items performed equivalently to faculty-written questions on both facility and discrimination indices.
Several other empirical studies have also compared AI-generated items against those written by humans.
In a German cohort test2, ChatGPT-generated questions matched human items for difficulty but were significantly less discriminating (mean discrimination 0.24 vs 0.36). Interestingly, students could not reliably spot the source of each question, identifying origin only 57% of the time.
More recently, a study from Hong Kong3 compared ChatGPT-4o items with bank questions in a high-stakes setting, although only a small number of participants were involved. No psychometric inferiority emerged, although reviewers still highlighted factual slips and style breaches that had escaped the pre-test screen.
Across studies summarised in a 2024 systematic review4, common findings emerged: LLMs accelerate drafting, but every dataset contained errors serious enough to invalidate at least some items.
It should be noted that the above studies were all conducted using previous-generation models, some of which are more than two years old. Advances in the field of AI combined with huge delays in peer review make it likely that the picture outlined above is outdated. Current models are likely much more capable.
It is also worth noting that none of the papers above compare the quality of AI-generated questions against the “raw” quality of human-authored questions. I’m sure many of us would be pleased if 70% of questions returned to us from our colleagues were suitable as-is for inclusion in exams.
Practical Guide
Indran and colleagues distilled twelve practical tips for educators using genAI to produce questions5. These include:
begin with a tightly scoped learning outcome,
embed local style-guide rules in the prompt,
insist on alphabetical distractors,
always run a human “cover test”, checking that the stem can be answered with the options covered.
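The first three tips lend themselves to a templated prompt. Below is a minimal sketch of that idea; the function name, wording and style rules are illustrative assumptions, not the exact prompt used by Indran and colleagues:

```python
# Illustrative style-guide rules a faculty might embed in every prompt.
STYLE_RULES = [
    "Use UK English spelling throughout.",
    "Write a single-best-answer (SBA) item with one key and four distractors.",
    "List the five options in alphabetical order.",
    "Avoid negative stems (NOT/EXCEPT) and absolute terms such as 'always'.",
]

def build_sba_prompt(learning_outcome: str) -> str:
    """Assemble a generation prompt that embeds local style-guide rules
    and a tightly scoped learning outcome."""
    rules = "\n".join(f"- {rule}" for rule in STYLE_RULES)
    return (
        "You are writing one exam question for a medical school.\n"
        f"Learning outcome: {learning_outcome}\n"
        "Follow these style rules exactly:\n"
        f"{rules}\n"
        "Return the stem, options A-E, the key, and a one-line rationale."
    )
```

Templating the rules once, rather than retyping them, keeps every generated item aligned with the same local style guide; the human cover-test still happens after generation.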
Even with such safeguards, AI still hallucinates. Common faults include outdated guidelines, US spellings, unresolved negatives (NOT/EXCEPT), and distractors that duplicate the key concept. Psychometric review after pilot testing is therefore essential.
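The two indices used throughout these studies are straightforward to compute after a pilot: facility is the proportion of candidates answering an item correctly, and discrimination is commonly estimated as the point-biserial correlation between item score (0/1) and total test score. A minimal sketch (function names are my own, not from any of the cited papers):

```python
from statistics import mean, pstdev

def facility(item_scores):
    """Facility (difficulty) index: proportion of candidates
    who answered the item correctly (scores are 0 or 1)."""
    return mean(item_scores)

def point_biserial(item_scores, total_scores):
    """Discrimination index: point-biserial correlation between
    the item score and each candidate's total test score."""
    n = len(item_scores)
    item_mean = mean(item_scores)
    total_mean = mean(total_scores)
    # Population covariance between item and total scores.
    cov = sum((i - item_mean) * (t - total_mean)
              for i, t in zip(item_scores, total_scores)) / n
    return cov / (pstdev(item_scores) * pstdev(total_scores))
```

On this view, the German study's 0.24 vs 0.36 gap means AI items separated strong from weak candidates noticeably less well than human items, even at matched difficulty.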