Artificial intelligence and the future of test scoring

Jun 25, 2024

Alex Scharaschkin, Executive Director of Assessment Research and Innovation, AQA

When the topic of Artificial Intelligence (AI) comes up in relation to educational assessment, it tends to be met with extreme excitement or extreme suspicion, or sometimes a combination of both. The opportunities it presents raise an equal volume of questions and concerns about how and when it could or should be used. An obvious application of AI to educational assessment is to use it for marking (scoring) learners’ responses to test items against a mark scheme (rubric). Assessments such as GCSE or A level examinations in England comprise many such items, for instance essays and open-text responses. And while we’re not currently considering using AI as a prime marker, we are looking at ways to train large language models to support our existing quality assurance arrangements for examination marking.

Clearly one outstanding question is whether these models can be trained to classify text-based responses accurately and consistently, so that marks awarded align with those that an experienced expert examiner would give. A more fundamental question is whether such models can produce valid explanations for the marks they award. Can they convincingly say, referring to the rubric and to construct-relevant features of a student’s response, why it has received the suggested mark?

Explaining scores – the black box dilemma

The requirement that standardised marking should consistently produce explainable scores is at the heart of the public examination system in the UK. If an examination candidate challenges their result, the awarding organisation must be able to explain the relationship between the marking criteria in the rubric and the qualitative features of the candidate’s response. Experienced examiners can do so because they’ve been trained to pay attention to the specific qualitative features or attributes of responses that enable discrimination between learners with respect to the construct. The mark scheme describes these construct-relevant criteria and how they should be valued.

This brings us to the question of how large language models (LLMs) ‘reason’. Generative LLMs are basically very sophisticated predictive text engines. They produce text outputs that are combinations of words that are statistically likely, conditional on the body of text the model has been trained on, rather than on the basis of an explicit process of reasoning. So how much confidence can we have in using them to produce reasons for scores?
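To make that ‘predictive text’ point concrete, here is a toy sketch of next-token prediction. The vocabulary and the scores are invented for the example; a real LLM computes them from billions of learned parameters conditioned on its training data and the prompt.

```python
# Toy illustration of next-token prediction: the model assigns probabilities
# to possible continuations of the text so far, then samples one.
import numpy as np

rng = np.random.default_rng(42)

context = "The candidate's response demonstrates"
vocab  = ["understanding", "confusion", "banana", "accuracy"]   # invented vocabulary
logits = np.array([3.1,      1.2,        -4.0,     2.4])        # invented model scores

probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
next_token = rng.choice(vocab, p=probs)         # a statistically likely continuation

print(context, next_token)
```

The output is a plausible continuation, not the conclusion of an explicit chain of reasoning, which is exactly why the question of trusting model-generated rationales matters.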

I’m looking forward to discussing this question at this year’s AEA-Europe conference. At AQA we’ve been considering two approaches to it. The first involves treating the LLM as a ‘black box’, and figuring out how to make it produce valid explanations. By the term ‘black box’ we mean a system which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings. The second is to try to get inside the black box far enough to check whether it is paying attention to the linguistic and substantive features of responses that it should be in order to score responses validly.

Applying black boxes effectively

Training (‘standardising’) new examiners to apply a mark scheme involves showing them examples of responses that should be classified in particular mark categories (1-mark responses, 2-mark responses, etc.). They practise classifying responses, modifying their initial attempts where necessary based on feedback from experienced examiners. Once able to classify responses accurately, they can start live marking.

We could take a similar approach with a generative AI (genAI) model, by giving it the mark scheme and examples of correctly marked responses, and prompting it to provide marks, and rationales for them, for unseen responses. However, success depends on how the prompts are written (as well as on other facets, such as the degree of fine-tuning of the model). Mark schemes for subjects such as mathematics are quite tightly rules-based, so it may be easier to convert them into prompts that enable the model to mimic a human marker (although language models can also have trouble with maths). For subjects such as English, however, mark schemes indicate the general kinds of features or attributes that tend to be associated with responses at different levels of creditworthiness. A description of these attributes that is adequate for human markers to categorise responses with acceptable consistency may still be problematic for a language model. An overly specified, complex, or detailed mark scheme can be as difficult for human markers to apply as an insufficiently precise one. But whereas for humans less is sometimes more, for prompting an LLM, more is more.
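As an illustration of what this standardisation-by-prompting could look like, here is a minimal sketch that assembles a mark scheme and a few marked exemplars into a prompt and asks a model for a mark plus a rationale. The `call_llm` helper, the mark scheme text, the exemplars, and the JSON output format are all assumptions made for the example, not a description of AQA’s actual pipeline.

```python
import json

# Hypothetical helper: send a prompt to whichever LLM is being evaluated and
# return its text completion. Plug in the API client of your choice.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client here")

# Invented mark scheme and exemplars, purely for illustration.
MARK_SCHEME = """\
Award 0-3 marks for the explanation of photosynthesis.
1 mark: names light as the energy source.
1 mark: identifies carbon dioxide and water as inputs.
1 mark: identifies glucose (and oxygen) as outputs."""

# A few correctly marked exemplars play the same role as standardisation
# scripts do for human examiners.
EXEMPLARS = [
    {"response": "Plants use sunlight to make food.", "mark": 1},
    {"response": "Plants use light energy to turn carbon dioxide and water into glucose.", "mark": 3},
]

def build_prompt(candidate_response: str) -> str:
    shots = "\n".join(
        f"Response: {ex['response']}\nMark: {ex['mark']}" for ex in EXEMPLARS
    )
    return (
        f"You are marking against this mark scheme:\n{MARK_SCHEME}\n\n"
        f"Worked examples:\n{shots}\n\n"
        "Now mark this response. Reply as JSON with keys 'mark' and 'rationale', "
        "where the rationale cites the mark scheme criteria.\n"
        f"Response: {candidate_response}"
    )

def mark_response(candidate_response: str) -> dict:
    # The rationale is the part we would audit: does it refer to
    # construct-relevant features of the response, or is it merely a
    # plausible-sounding justification?
    return json.loads(call_llm(build_prompt(candidate_response)))
```

The design question is less the plumbing than the prompt itself: how much of the mark scheme to spell out, and how many exemplars to include, before ‘more is more’ stops paying off.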

In addition to exploring these issues we’re considering how to generate effective feedback for students based on their responses to formative tasks, and look forward to hearing what others are doing in this space.

Opening the black box

Language models don’t ‘know’ that they are processing language. They operate purely on numbers generated from text using so-called ‘vector-space embeddings’. The model doesn’t see text; it sees sequences of vectors of numbers. The analogue of ‘reading’ a text is what is called the ‘attention’ mechanism in an LLM. Attention effectively means the model doesn’t just treat words in isolation, but processes them in context. Processing the text amounts to generating numbers by applying the attention algorithm to the inputs, and these numbers determine the relevant output, such as a classification decision or the next word to produce.
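For readers who want to see the mechanics, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation behind that ‘attention’ mechanism. The embeddings below are random placeholders, and the learned projection matrices of a real transformer are omitted; the point is only that everything the model ‘reads’ is arithmetic on vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each output vector is a context-weighted mixture of the value vectors:
    # every token 'attends' most to the tokens most relevant to it.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                      # 5 tokens, 8-dimensional embeddings
X = rng.normal(size=(n_tokens, d))      # stand-in embeddings for one response
out, w = attention(X, X, X)             # self-attention over the sequence
print(w.round(2))                       # each row sums to 1: how much each token attends to the others
```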

So language models can be thought of as operating on vectors in a high-dimensional space. We can investigate these ‘linguistic’ vector spaces and try to approximate them. For example, if we find that the 1-mark, 2-mark, and 3-mark responses, as scored by the model, tend to occupy relatively distinct positions in a lower-dimensional (say, a two-dimensional) space that approximates the full space, we could investigate whether each of these two dimensions could be interpreted as a relevant feature of the responses (e.g. a linguistic feature and a content feature). This might help explain what the model is ‘really’ paying attention to, and help us check whether it’s doing what we want it to. This is another approach to explainability we’ve been exploring.
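A hedged sketch of what that investigation could look like, using PCA as one possible dimensionality-reduction technique (the approach described above doesn’t commit to a particular method). The `embed_responses` helper and the clustered random vectors are stand-ins for real model embeddings of scored responses.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

def embed_responses(mark, n=50, dim=384):
    # Placeholder: real embeddings would come from the language model itself.
    centre = rng.normal(size=dim) * (mark + 1)
    return centre + rng.normal(scale=2.0, size=(n, dim))

X = np.vstack([embed_responses(mark) for mark in (1, 2, 3)])
labels = np.repeat([1, 2, 3], 50)

coords = PCA(n_components=2).fit_transform(X)   # two-dimensional approximation of the full space

# If the 1-, 2-, and 3-mark responses separate in this reduced space, each
# component can then be inspected (e.g. correlated with linguistic or content
# features) to see what the model is 'really' paying attention to.
for mark in (1, 2, 3):
    centroid = coords[labels == mark].mean(axis=0)
    print(f"{mark}-mark centroid in 2D: {centroid.round(2)}")
```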

What is the future of marking?

All this exploratory work aims to make AI do what human markers currently do, i.e. assign scores to candidates’ responses to tasks in a consistent and explainable way. These scores are used as proxies for substantive statements about candidates’ attainment (‘you have demonstrated A, B, and C’). If such statements could be generated in a trustworthy way by the application of LLMs to assessment responses, we might question whether we need a separate scoring process in the traditional sense. Rather than considering a student’s qualitative response, turning it into a sequence of item scores that are manipulated to obtain a final score or grade, and then turning that result back into natural language when discussing what inferences to draw from it, could we just work with the qualitative inputs directly to produce a meaningful, linguistically expressed output?

A multidisciplinary approach to exploring these challenges and opportunities seems to offer the most effective way forwards. It will certainly be very interesting to see how the dialogue between psychometrics, natural language processing, and machine learning evolves in this respect over the coming years. At AQA we are keen to both lead and participate in those conversations!