Weekly Roundup: Can AI grade like a professor?
Two new studies find LLMs can grade short-answer assessments in medical education with high agreement with human examiners, offering potential efficiency gains but highlighting the need for oversight.
Many of us are currently on annual leave and will soon return to work to prepare for our students' arrival, which means assessment marking is probably not yet on our minds. Now that I've reminded you of it, perhaps you're already experiencing palpitations?
We are all familiar with the growing pressures: rising student numbers, constrained faculty time, and the pedagogical need for richer evaluation of reasoning. These drivers are making many of us consider the potential of AI-assisted grading. This week, I noticed two separate studies that tested large language models (LLMs) for scoring short-answer questions (SAQs), each reporting high concordance with human markers while identifying limits and opportunities for future use.
The first study1, from Sri Lanka, developed an Automated SAQ Scoring Tool (ASST) powered by GPT-4 to grade pharmacology SAQs against detailed instructor rubrics. Thirty student responses were marked by both the AI and two expert human examiners. The LLM's marks showed very strong correlation with human scores (r = 0.93–0.96) and excellent inter-rater reliability (ICC = 0.94). By explicitly embedding the marking rubric in the prompt, the system provided transparent, criteria-aligned feedback without any fine-tuning. While the sample size was small, the authors argue this rubric-driven approach could adapt to other domains, provided high-quality rubrics exist. They caution, however, that topics poorly represented in the model's training data may require retrieval-augmented approaches.
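If you are curious what "embedding the rubric in the prompt" can look like in practice, here is a minimal, hypothetical sketch, not the authors' ASST code. It assumes the OpenAI Python SDK, an API key in your environment, and a made-up pharmacology question and rubric; in a real workflow the rubric, mark scheme, and output format would need far more care.

```python
# Illustrative sketch of rubric-in-prompt SAQ grading (not the ASST implementation).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the question, rubric, and answer below are placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = """\
Question: Explain why non-selective beta-blockers are avoided in acute asthma. (3 marks)
- 1 mark: identifies blockade of beta-2 receptors in bronchial smooth muscle
- 1 mark: links beta-2 blockade to bronchoconstriction
- 1 mark: notes the risk of precipitating severe bronchospasm in asthmatic patients
"""

def grade_answer(student_answer: str) -> str:
    """Score one short answer against the embedded rubric and return the model's feedback."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep scoring as repeatable as possible
        messages=[
            {"role": "system",
             "content": "You are an examiner. Score the student's answer strictly "
                        "against the rubric. Report the mark awarded for each "
                        "criterion, a total, and one line of feedback."},
            {"role": "user",
             "content": f"RUBRIC:\n{RUBRIC}\nSTUDENT ANSWER:\n{student_answer}"},
        ],
    )
    return response.choices[0].message.content

print(grade_answer(
    "They block beta-2 receptors in the airways, causing bronchoconstriction "
    "and risking severe bronchospasm in people with asthma."
))
```

The appeal of this pattern is that nothing is fine-tuned: the rubric itself is the "model", so changing subject area is (in principle) a matter of swapping in a new rubric.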
The second study2, from the US, evaluated GPT-4o as a zero-shot grader for 1,450 SAQs from case-based learning exams. Using the same rubrics as 23 faculty markers, the AI’s overall mean scores were statistically equivalent to faculty grades within a ±5% margin, with exceptional precision across repeated runs (overall ICC = 0.993). Equivalence held for “Remembering,” “Applying,” and “Analyzing” items in Bloom’s taxonomy, but not for “Understanding” (AI slightly under-scored) or “Evaluating” (AI slightly over-scored) questions. Greater divergences appeared on more difficult items.
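To make those agreement statistics a little more concrete, here is a small illustrative sketch with invented marks (not the study data) showing the kind of comparison both papers report: a correlation between AI and faculty scores, and a check of whether the mean difference sits inside a ±5% margin.

```python
# Illustrative agreement check with made-up scores (not the published analyses).
# Assumes numpy and scipy are installed.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical marks out of 10 for the same set of answers.
faculty = np.array([7.0, 5.5, 9.0, 4.0, 8.0, 6.5, 3.0, 7.5])
ai      = np.array([7.5, 5.0, 9.0, 4.5, 8.0, 6.0, 3.5, 7.0])

# Correlation between AI and human marks (study 1 reports r = 0.93-0.96).
r, _ = pearsonr(faculty, ai)

# Mean difference as a percentage of the maximum mark, compared against
# the +/-5% equivalence margin used in study 2.
max_mark = 10.0
mean_diff_pct = 100 * np.mean(ai - faculty) / max_mark

print(f"Pearson r = {r:.2f}")
print(f"Mean AI-faculty difference = {mean_diff_pct:+.1f}% "
      f"({'within' if abs(mean_diff_pct) <= 5 else 'outside'} the ±5% margin)")
```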
The discussion sections of both papers note the potential efficiency gains of AI-powered marking but emphasise that AI should serve as a preliminary scorer, with faculty oversight for complex, ambiguous, or borderline responses.
Both studies demonstrate that LLMs can achieve human-level scoring accuracy for many SAQ types, offering a viable route to scale richer assessments beyond multiple-choice formats. Yet each also underscores that careful rubric design, transparency, and ongoing human moderation remain essential if AI is to be trusted in high-stakes educational contexts.
The potential ethical implications of AI marking, and of the use of AI in assessment more generally, were discussed in a recent entry in the “AMEE Guides” series3, which you may find useful.
If you enjoyed this roundup, you might be interested in reading my recent article about using AI to write exam questions:
Can Artificial Intelligence Write Exam Questions?
Large language models (LLMs) have arrived at a moment when assessment teams are stretched and question banks feel threadbare. The allure is obvious: type a learning objective into a chatbot and receive a perfectly formed single-best-answer (SBA) item a few seconds later. But how close are we to that reality, and what risks come with outsourcing question writing to machines?
The full list of references is available to paying subscribers.