One has only to google the words “research excellence framework” (REF) to find a torrent of self-congratulatory messages from university departments across the land. While the Higher Education Funding Council for England eschews any assigning of ranks to institutions, the institutions themselves are rushing to report whichever REF league-table ranking most flatters the quality of their research. In the plethora of post-REF tweets I can find none questioning a central tenet of the entire process, namely, that one can capture the complex construct “research quality” in a number.
Much has been written about the relentless rise of managerialism in British universities, and there is at least anecdotal evidence that REF-generated numbers are being used to “manage” university staff, setting colleague against colleague. The “tacit” nature of many of the abilities academics exercise in respect of both teaching and research – abilities which defy articulation, let alone quantification – advantages the academic in his or her dealings with the human resources manager. Quantification shifts the balance of power by subordinating the tacit and personal to the mechanical and impersonal. Theodore Porter’s 1995 book “Trust in Numbers” anticipates this governance-by-number with remarkable acuity.
Before examining the numbers used to represent research quality in the various league tables, it is instructive to consider concerns about the data from which all REF inferences are derived. In the early stages of the REF process, groups of manager-academics assess the research output of their colleagues, assigning each item of submitted research a number on a four-point scale. “World-leading” research is graded 4*, “internationally excellent” research is graded 3*, research that is “recognised internationally” is graded 2*, and research that is “recognised nationally” is graded 1*. Can human beings, no matter how expert, make such judgements with any degree of consistency? In 1961 the world’s elite testing agency, America’s Educational Testing Service, investigated this issue under carefully controlled conditions. They concluded:
“When 300 student papers were graded by 53 graders (a total of 15,900 readings), more than one third of the papers received every possible grade. … 94% (of the papers) received either seven, eight or nine different grades; and no essay received less than five different grades from 53 readers.”
The Nobel Laureate Daniel Kahneman came to the same conclusion:
“Another reason for the inferiority of expert judgement is that humans are incorrigibly inconsistent in making summary judgements of complex information. When asked to evaluate the same information twice, they frequently give different answers. The extent of the inconsistency is often a matter of real concern. Experienced radiologists who evaluate chest X-rays as “normal” or “abnormal” contradict themselves 20% of the time when they see the same picture on separate occasions. … A review of 41 separate studies of the reliability of judgements made by auditors, pathologists, psychologists, organizational managers, and other professionals suggests that this level of inconsistency is typical, even when a case is re-evaluated within a few minutes.”
Turning now to the ranks assigned to universities in REF-related league tables, let’s use the most widely discussed measure of research quality, the Grade Point Average (GPA), to illustrate what happens when one attempts to quantify quality. The GPA is computed using the following arithmetic rule: first multiply the institution’s percentage of 4* research by 4, its percentage of 3* research by 3, its percentage of 2* research by 2, and its percentage of 1* research by 1; then compute the total of these four numbers and divide by 100.
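The arithmetic rule just described can be sketched in a few lines of code. This is a minimal illustration of the published GPA formula; the institutional profile used in the example is invented for demonstration, not drawn from any actual REF submission.

```python
def ref_gpa(pct_4star, pct_3star, pct_2star, pct_1star):
    """Compute the REF Grade Point Average: weight each quality-profile
    percentage by its star rating, total the results, divide by 100."""
    return (4 * pct_4star + 3 * pct_3star + 2 * pct_2star + 1 * pct_1star) / 100

# An invented quality profile: 30% rated 4*, 40% rated 3*, 20% rated 2*, 10% rated 1*
print(ref_gpa(30, 40, 20, 10))  # 2.9
```

Note that the percentages of unclassified output are simply weighted by zero, so an institution’s GPA always falls between 0 and 4.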
Now there is a profound problem with this simple arithmetic exercise. The REF involves the assignment of the published output of every REF-submitted academic to a four-point ordinal scale. The scale is ordinal in the sense that a 3* journal article, for example, is deemed superior to journal articles rated 1* or 2*, and also deemed inferior to journal articles assigned to the 4* category. However, here’s the problem: arithmetical operations (such as the GPA algorithm above) are not meaningful when applied to an ordinal scale. Conclusion: GPA ranks are not meaningful.
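The point can be made concrete with a toy calculation. On an ordinal scale the numerals 4, 3, 2, 1 carry only order, so any other order-preserving labels are equally legitimate; yet, as the invented two-institution example below shows, relabelling the grades can reverse a GPA league-table ranking. Both institutions and both sets of weights are hypothetical.

```python
def gpa(profile, weights):
    """profile: percentages of (4*, 3*, 2*, 1*) output; weights: the
    numerical labels attached to those four ordinal grades."""
    return sum(w * p for w, p in zip(weights, profile)) / 100

inst_a = (50, 0, 0, 50)   # invented: half 4*, half 1*
inst_b = (0, 100, 0, 0)   # invented: all 3*

standard  = (4, 3, 2, 1)   # the usual star labels
stretched = (10, 3, 2, 1)  # same ordering: 10 > 3 > 2 > 1

print(gpa(inst_a, standard),  gpa(inst_b, standard))   # 2.5 3.0 -> B above A
print(gpa(inst_a, stretched), gpa(inst_b, stretched))  # 5.5 3.0 -> A above B
```

Since nothing in an ordinal scale privileges one labelling over the other, a ranking that depends on the choice of labels cannot be a property of the underlying quality judgements.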
The only hope of salvaging anything for the GPA as a measure of research quality is to go out on a very long limb and posit the existence of an underlying continuous quantifiable construct “research quality,” thereby transforming the ordinal scale into an interval scale in which the “distance” between consecutive levels is constant. But the construct “research quality” isn’t an intrinsic property of an academic manuscript; rather, it is a joint property of the manuscript and the relevant academic practice from which the manuscript derives its authority. The relation between article and academic tradition is internal: “research quality” is a property of an interaction, not of the manuscript alone. It is important to stress that the reasoning set out here has implications for all of the REF-related “measures” of research quality I have encountered.
Finally, it does not seem to disturb the Higher Education community that each new league table tends to produce different rank orders. This shouldn’t surprise us, given that the same community doesn’t seem to be troubled by the notion that one can summarise the research quality of a university in a single number. The Nobel Laureate Sir Peter Medawar labelled this practice “unnatural science.” Medawar asked how it could be acceptable to capture something as complex as the research quality of a university in a single number when
“[t]he physical properties and field behaviour of soil depend on particle size and shape, porosity, hydrogen ion concentration, material flora and water content and hygroscopy. No single figure can embody itself in a constellation of values of all these variables in any single real instance.”
Dr Hugh Morrison