Hannah Gillott

Marking is inaccurate — does it matter when our feedback makes the impact?

My very first September Inset day as a teacher was led by Dylan Wiliam — how’s that for a high bar? This was in 2011, and all staff finished the day with a treasure box full of teaching resources and the challenge to begin trying Assessment for Learning strategies in our classrooms.


It took me until the third year of my career to realise that effective teaching didn’t require lolly sticks or True/False cards, but one piece of information that stuck from day one was the idea that pupils will remember the feedback or the mark, but not both.


I heard it repeated often in CPD sessions on feedback in the years that followed. ‘What did you get?’ remains the most damaging sentence to hear echoing around the classroom when trying to get a class to focus on their green pen improvement work.


As a teacher, however, I lost hundreds of hours of my life sweating over marks at the expense of feedback. Whole days as an English department were given over to moderation, and I recall (in the Controlled Assessment days) genuinely feeling like I hated a colleague as they bumped one of my pupils from a Band 4 to a Band 3. Luckily that feeling passed, but my hatred of awarding marks hasn’t.


I have good reason to hate it. 



Moving targets


Research into consistency of exam marking in the 2017 series found that in my subject, English, there was just a 40-50% chance of two examiners agreeing within one grade on a single component. Not within one mark — within one grade.


English is famously subjective, so perhaps we can forgive this inconsistency — except that other subjects don’t fare much better. 


English Language, English Literature, Business Studies, Geography, History, Religious Studies and Sociology all carry less than a 70% chance of students being awarded their ‘definitive grade’* at qualification level. In other words, as many as three in ten students in these subjects might be awarded the wrong grade in a given year.


Maths and Chemistry are, in fact, the only two subjects with a higher than 90% probability of being awarded the ‘definitive’ grade at qualification level. At component level, no subject scores above 90%.


As we’ve explored how to improve marking student work using artificial intelligence at stylus, this high degree of subjectivity in some subjects has moved more firmly into the spotlight.


Teachers want to know whether our marking is accurate — but what can we compare it to? Human marking isn’t accurate, and it certainly isn’t consistent: human markers often disagree with one another, so we are essentially comparing two moving targets.



Do we need marks?


The stakes are high when it comes to assigning grades. I have been the Head of Department poring over a disappointing set of mock exam results and planning what I can do next to drive them up. I know the feeling when a change of only a couple of marks here and there moves a pupil up or down a grade, on or off the ‘Vital few’ list, from red to amber to green on my RAGged department data sheet.


It is only now, with the benefit of hindsight, that I can look back at the hours spent quibbling over whether to award a 13 or 14 and wish I’d used that time better.


That said, school leaders often rely on predicted grades or pass rates to make well-informed decisions about where to direct resources (intervention, teacher CPD, external tuition, and so on). They also need to keep governing bodies informed about progress and outcomes.


Where do we go from here?



Life beyond marks


Moderating every piece of work, using a team of teachers, would hypothetically deliver consistency in human marking — but the time and money this would require makes it unfeasible for schools. One option is to use AI to replicate this process at a fraction of the cost.


We’ve found that marking the same essay or exam response five times can sometimes give five different outputs — but averaging these tends to deliver accuracy in line with, or better than, human markers on the same questions.
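As a rough illustration of what that repeated-marking-and-averaging step can look like, here is a minimal Python sketch. The request_ai_mark function is a hypothetical stand-in for a call to an AI marker (here it simply simulates marks for demonstration), and the aggregation choices are illustrative rather than a description of our actual pipeline.

from statistics import mean, median
import random

def request_ai_mark(response_text: str) -> int:
    # Hypothetical stand-in for a call to an AI marking model.
    # For illustration it simulates marks scattered around a 'true' mark of 14.
    return 14 + random.choice([-1, 0, 0, 0, 1])

def aggregate_marks(response_text: str, n_passes: int = 5) -> dict:
    # Mark the same response several times, then aggregate the results.
    marks = [request_ai_mark(response_text) for _ in range(n_passes)]
    return {
        "individual_marks": marks,        # e.g. [13, 14, 14, 15, 14]
        "mean_mark": round(mean(marks)),  # averaging smooths run-to-run variation
        "median_mark": median(marks),     # robust to a single outlying pass
    }

if __name__ == "__main__":
    print(aggregate_marks("Pupil's exam response goes here..."))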


I’d expect to see a similar result if we repeated this process with human markers, and this is of course what seeded papers and moderation sessions are intended to achieve. AI makes it easier and more affordable (in both time and money) to do this at scale.


The other option is to look at marks for what they are — a best guess at how one might quantify a pupil’s performance — and then focus on the really important part, which is what the feedback is telling you.


On results day, when students (and teachers!) receive their final grades, no one is discussing knowledge gaps or coverage of each Assessment Objective — summative assessment is king. Until then, the overall mark is the least relevant information we have about our pupils — if what we want to do is help them improve!


What’s more important is identifying the areas of the curriculum or specification pupils have securely demonstrated, the ‘near misses’, and the larger misconceptions — and, in turn, where we need to spend our teaching time next.



Feedback that matters


In an ideal world, every mark a school records would have been moderated and its validity vouched for, and every piece of feedback (be it summative or formative) would be actioned by teachers and leaders and used to drive forward pupil progress.


In reality, schools need to determine how teachers’ limited time will be allocated: marking, moderating, or acting upon feedback?


The last of those three has to be the most important. Time spent determining and confirming marks has its place, of course, particularly during key assessment periods in the run-up to national assessments.


But we must beware of letting the mark itself dominate the feedback process. Particularly for classes approaching end-of-phase exams, it is more important than ever to know where to spend rapidly diminishing teaching time.


AI’s reliability in determining a precise mark is only ever as good as the human against which it’s compared. Nevertheless, it has huge potential to help teachers identify misconceptions and knowledge gaps within and across a class. This in turn frees teachers to moderate, and to plan accordingly, in a fraction of the time a cohort assessment would usually demand. 


AI is only useful when it improves processes or saves time — in the case of marking and feedback, it has potential to do both.


We’ve received funding from a British Business Bank-backed AI fund to explore effective use of artificial intelligence models to alleviate teacher workload and combat the recruitment and retention crisis. We’re recruiting discovery schools to help us develop and test our AI-led, human-moderated marking service. To register interest for your school, complete this form and someone will be in touch.



*‘Definitive grade’ is a measure referenced in the 2017 report ‘Marking consistency metrics: An update’, which states: “The term ‘definitive’ is based on terminology ordinarily used in exam boards for the mark given by the senior examiners at item level for each seeding response”. Read the full report here.


