The GCSE Grade 5 controversy: why it’s time for Ofqual to “take back control”
Dr Hugh Morrison, The Queen’s University of Belfast (retired) firstname.lastname@example.org
The GCSE Grade 5 Controversy
A highly unusual feature of the new numbered GCSE grade scale is the claim that the new grade 5 will somehow reflect the standards of educational jurisdictions ranked near the top of international league tables. Given the controversy surrounding such tables, it will be possible, for the first time, to raise profound technical concerns about a particular grade on the GCSE grade scale. Moreover, it will be difficult to make the case that grade 5 has any technical merit if Pisa ranks have any role in its determination. Pisa is the acronym for the OECD’s Paris-based “Programme for International Student Assessment”, and a glance at the Times Educational Supplement of 26.07.2013 will reveal that Pisa league tables are fraught with technical difficulties.
Concerns about Item Response Theory
The methodology which underpins Pisa ranks is called “Item Response Theory” (IRT). IRT software claims to estimate the ability of individuals based on their responses to test items. However, while the claim that ability is some inner state or “trait” of the individual from which his or her test responses flow – the so-called reservoir model – is central to IRT, the claim is rarely supported by evidence. There is very good reason to reject the inner state approach. Consider, for example, the child solving simple problems in arithmetic. To explain this everyday behaviour it transpires that one must invoke inner states with talismanic properties, in that the state must be timeless, infinite and future-anticipating!
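To make the “reservoir model” concrete for readers unfamiliar with IRT: the simplest IRT model (the Rasch model) posits a single hidden number θ – the examinee’s “ability” – and estimates it from right/wrong answers alone. The sketch below is illustrative only, not the OECD’s actual software; the item difficulties and response pattern are invented for the example.

```python
import math

def rasch_prob(theta, b):
    """Rasch model: probability of a correct answer given
    latent 'ability' theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, n_iter=50):
    """Newton-Raphson maximum-likelihood estimate of theta.

    This is exactly the move the article criticises: the score
    pattern is converted into a single quantified inner state.
    (The estimate diverges for all-correct or all-wrong patterns.)
    """
    theta = 0.0
    for _ in range(n_iter):
        p = [rasch_prob(theta, b) for b in difficulties]
        grad = sum(x - pi for x, pi in zip(responses, p))   # dL/dtheta
        hess = -sum(pi * (1.0 - pi) for pi in p)            # d2L/dtheta2
        theta -= grad / hess
    return theta

# Invented example: four items of increasing difficulty.
difficulties = [-1.5, -0.5, 0.5, 1.5]
theta_hat = estimate_ability([1, 1, 0, 0], difficulties)
```

Note that the model assigns every examinee a definite point on the θ scale regardless of context – precisely the assumption the sections below call into question.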
Great caution is needed when using the word “ability” – while test evidence can justify us in saying that an individual has ability, that same evidence can never be used to justify the claim that the word “ability” refers to an inner (quantifiable) state of that individual. Ability is not an intrinsic property of an individual; rather, it is a property of the interaction between individual and test items. The individual’s responses to the test items are an inseparable part of that ability. Indeed, divorced from a measurement context, ability is indefinite. Individuals have definite ability only relative to a measurement context; even here it is incorrect to suggest that individuals have a quantifiable entity called “ability.” Abandoning IRT’s appealingly simple picture of ability as an inner (quantifiable) state that individuals carry about with them renders IRT untenable. Ability is a two-faceted entity governed by first-person/third-person asymmetry: while we ascribe ability to ourselves without criteria, criteria are an essential prerequisite when ascribing ability to others.
The picture central to all IRT modelling – that ability is something intrinsic to the individual which is definite (and quantifiable) at all times – is rejected by the Nobel laureate Herbert Simon and by two giants of 20th century thought, physicist Niels Bohr and philosopher Ludwig Wittgenstein. Indeed, Wittgenstein described the rationale which underpins IRT modelling – that test responses can be explained by appealing to inner processes – as a “general disease of thinking.” Psychologists have a name for this error; Gerd Gigerenzer of the Max Planck Institute writes: “The tendency to explain behaviour internally without analysing the environment is known as the ‘Fundamental Attribution Error’.”
The criticisms levelled by Bohr and Wittgenstein are particularly damaging because IRT modellers construe ability as something inner which can be measured. Few philosophers can match Wittgenstein’s contribution to our understanding of what can be said about the “inner”; and few scientists can match Bohr’s contribution to our understanding of measurement, particularly when the object of that measurement lies beyond direct experience. (Bohr is listed among the top ten physicists of all time in recognition of his research on the quantum measurement problem.) Both Bohr and Wittgenstein are concerned with the same fundamental question: how can one communicate unambiguously about aspects of reality which are beyond the direct experience of the measurer? Just as Bohr rejected entirely the existence of definite states within the atom, Wittgenstein also rejected any claim to inner mental states; potentiality replaces actuality for both men.
For the duration of his professional life, Bohr maintained that quantum attributes have a “deep going” relation to psychological attributes in that neither can be represented as quantifiable states hidden in some inner realm. We will always be limited to talking about ability; we will never be able to answer the question “what is ability?” let alone quantify someone’s ability. Bohr believed that “Our task is not to penetrate into the essence of things, the meaning of which we don’t know anyway, but rather to develop concepts which allow us to talk in a productive way about phenomena in nature. …The task of physics is not to find out how nature is, but to find out what we can say about nature. … For if we want to say anything at all about nature – and what else does science try to do? – we must somehow pass from mathematical to everyday language” [italics added].
Given that IRT software is designed to measure ability, it may surprise readers that the claim that ability can be construed as a quantifiable inner state is rarely defended in IRT textbooks and journal articles. In their article “Five decades of item response modelling,” Goldstein and Wood trace the beginnings of IRT to a paper written in 1943 by Derrick Lawley. They note: “Lawley, a statistician, was not concerned with unpacking what ‘ability’ might mean.” Little has changed in the interim.
Why Ofqual must protect GCSE pupils from the OECD’s “sophisticated processes”
These profound conceptual difficulties with the model which underpins Pisa rankings must surely undermine the OECD’s claim that one can rank order countries for the quality of their education systems. In a detailed analysis of the 2006 Pisa rankings, the eminent statistician Svend Kreiner revealed that “Most people don’t know that half of the students taking part in Pisa do not respond to any reading item at all. Despite that, Pisa assigns reading scores to these children.” Given such revelations, why are governments, the media and the general public not more sceptical about Pisa rankings? Kreiner offers the following explanation: “One of the problems that everybody has with Pisa is that they don’t want to discuss things with people criticising or asking questions concerning the results. They didn’t want to talk with me at all. I am sure it is because they can’t defend themselves.”
Given the depth of the conceptual problems which afflict IRT and, as a consequence, Pisa rankings, it seems to me foolhardy in the extreme to predicate the new GCSE grade 5 on Pisa rankings. Ofqual have announced that grade 5 will be “broadly in line with what the best available evidence tells us is the average PISA performance in countries such as Finland, Canada, the Netherlands and Switzerland.” In addition to Ofqual, the Department for Education and Tim Oates, director of research at Cambridge Assessment, appear to endorse a role for Pisa in UK public examinations. The Department for Education have produced a report – PISA 2009 Study: How big is the gap? – which creates the impression that “gaps” between England and high performing Pisa countries can be represented on a GCSE grade scale designed for reporting achievement rather than ability.
Finally, the director of research at Cambridge Assessment asserts: “I am more optimistic … than most other analysts, I don’t see too many problems in these kinds of international comparisons.” Indeed, Mr Oates believes that UK assessment has much to learn from involving Pisa staff directly in solving the grade 5 problem: “If we want to do it formally then we ought to have discussions with OECD. … OECD have some pretty sophisticated processes of equating tests which contain different items in different national settings.” There is an immediate problem with this claim. Since the psychometric definition of equity begins with the words: “for every group of examinees of identical ability …,” equating itself is founded on the erroneous assumption that ability can be quantified.
For the first time in the history of public examinations in the UK the technical fidelity of a GCSE grade will be linked to Pisa methodology. Given the concerns surrounding IRT, is it not time for Ofqual to distance itself from the claim that the grade 5 standard is somehow invested with properties which allow it to track international standards in the upper reaches of the Pisa league tables? The recent introduction of the rather vague term “strong pass” smacks of desperation; couldn’t a grade 6 also be deemed a strong pass? Why not stop digging, sever the link with Pisa, and simply interpret grade 5 as nothing more than the grade representing a standard somewhere between grade 4 and grade 6?