Introduction
The claim made in this essay is that the academic literature clearly indicates that the capacity of teachers to predict their pupils’ grades falls far below acceptable levels. Furthermore, the evidence that teachers can rank-order their pupils within-grade is scant to non-existent. Indeed, the Awarding Bodies couldn’t stand up the claim that any of their examinations rank-order pupils on the construct they purport to measure. It follows that the only defensible solution is to provide two measures per examination: (i) a teacher-predicted grade (without an associated rank-order); and (ii) a test that could be used – if the pupil so decides – to overwrite the teacher prediction, in relevant cases. Where a pupil cannot take the test, he or she must accept the teacher-predicted grade. There is no credible evidence in the literature that one can mobilize some “standardization” algorithm (which has yet to be detailed by Ofqual or CCEA) to somehow correct for any excesses in teacher judgement.
The perils of expert prediction of all types
As far back as 1954 Paul Meehl (Clinical versus Statistical Prediction: A theoretical analysis and a review of the evidence) analysed the ability of a range of teachers to predict measures of academic success, and found scant evidence that this could meet acceptable standards. Meehl’s book ranged far beyond teachers’ predictions of grades to consider, for example, expert predictions of an individual’s probability of violating parole, predictions of success in pilot training, predictions of criminal recidivism, and so on. In his book Thinking, Fast and Slow the Nobel Laureate Daniel Kahneman (2011, p. 225) endorsed Meehl’s findings and stressed that the range of studies demonstrating the limitations of experts’ abilities to predict the future had expanded greatly since Meehl’s book was published:
“Another reason for the inferiority of expert judgement is that humans are incorrigibly inconsistent in making summary judgements of complex information. When asked to evaluate the same information twice, they frequently give different answers. The extent of the inconsistency is often a matter of real concern. Experienced radiologists who evaluate chest X-rays as “normal” or “abnormal” contradict themselves 20% of the time when they see the same picture on separate occasions. A study of 101 independent auditors who were asked to evaluate the reliability of internal corporate audits revealed a similar degree of inconsistency. A review of 41 separate studies of the reliability of judgements made by auditors, psychologists, pathologists, organizational managers, and other professionals suggests that this level of inconsistency is typical, even when a case is re-evaluated within a few minutes. Unreliable judgements cannot be valid predictors of anything.”
The perils of predicting within-grade rank order
Ofqual and CCEA are requiring teachers to rank order their pupils according to their achievement in mathematics, English, biology, and so on. However, no GCSE or A level product designed by the Awarding Bodies can itself perform this feat. The rank-ordering of candidates on the construct “achievement in geography,” for example, is a validity issue and the Awarding Bodies have a very, very poor record in this area.
In 1991 an expert on the work of the examination boards, Robert Wood, summarized his conclusions in the book Assessment and Testing: A survey of research commissioned by the University of Cambridge Local Examination Syndicate (UCLES). On pages 147- 151 he wrote:
“If an examining board were to be asked point blank about the validities of its offerings or, more to the point, what steps it takes to validate the grades it awards, what might it say? … The examining boards have been lucky not to have been engaged in validity argument. … Nevertheless, the extent of the boards’ neglect of validity is plain to see once attention is focused. Whenever boards make claims that they are measuring the ability to make clear reasoned judgements, or the ability to form conclusions (both examples from IGCSE and Economics), they have a responsibility to at least attempt a validation of the measures. … The boards know so little about what they are assessing that if, for instance, it were to be said that teachers assess ability … rather than achievement, the boards would be in no position to defend themselves. … As long as examination boards make claims that they are assessing this or that ability or skill, they are vulnerable to challenge from disgruntled individuals.”
The claim that a GCSE or A level examination could rank-order candidates on some appropriate construct would require the Awarding Bodies to use Structural Equation Modelling to compute three indices: root mean-square residual, adjusted goodness-of-fit, and chi-squared divided by degrees of freedom. To claim a rank order, these three statistics would have to be demonstrated to satisfy relevant inequalities. Can it be reasonable to ask teachers to predict something that is beyond the capabilities of the GCSE and A level examinations themselves?
The resolution
Needless to say, staff at Ofqual and CCEA are mandated to provide young people with grades that are as error-free as possible. They should take heed of Paul Meehl’s counsel in respect of teachers’ capacities to anticipate the future: “When one is dealing with human lives and life opportunities, it is immoral to adopt a mode of decision-making which has been demonstrated repeatedly to be … inferior.” If teacher judgement (omitting the requirement to rank-order) is to be used to forecast grades, pupils must also be offered speedy access to a public examination which protects them from the well-documented vagaries of teacher prediction.
Stephen Elliott