A fundamental conundrum in psychology’s standard model of measurement and its consequences for PISA global rankings.

Dr. Hugh Morrison
Formerly The Queen’s University of Belfast
(drhmorrison@gmail.com)

Introduction

This paper is concerned with current approaches to measurement in psychology and their use by organisations like the Organisation for Economic Co-operation and Development (OECD) to hold the education systems of nation states to “global” standards. The OECD’s league table – the Programme for International Student Assessment (PISA) – has the potential to throw a country’s education system into crisis. For example, Ertl (2006) documents the effects of so-called “PISA-shock” in Germany, and Takayama (2008) describes a similar reaction in Japan. Given that a country’s PISA ranking can play a role in decisions concerning foreign direct investment, it is important to confirm that the measurement model which produces the ranks is sound. Moreover, the OECD has already spread its remit beyond the PISA league table to include teacher evaluation through its Teaching and Learning International Survey (TALIS). The OECD is currently developing PISA-like tests to facilitate global comparisons of the education on offer in universities through its Assessment of Higher Education Learning Outcomes (AHELO) programme: “Governments and individuals have never invested more in higher education. No reliable international data exists on the outcomes of learning: the few studies that exist are nationally focused” (Rinne & Ozga, 2013, p. 99). Given the sheer global reach of the OECD project, it is important to investigate the coherence of the measurement model which underpins its data.

At the heart of 21st century approaches to measurement in psychology is the Generalised Linear Item Response Theory (GLIRT) approach (Borsboom, Mellenbergh and Van Heerden, 2003, p. 204) and the OECD uses Item Response Theory (IRT) to generate its PISA ranks. A particular attraction of IRT for the OECD is its claim that estimates of examinee ability are item-independent. This is vital to PISA’s notion of “plausible values” because each examinee only takes a subset of items from the whole “item battery.” Without the Rasch model’s claim to item-independent ability measures, PISA’s assertion that student performance can be reported on common scales, even when these students have taken different subsets of items, would be invalid.

This paper will focus on the particular IRT model used by OECD, the so-called Rasch model, but the arguments generalise to all IRT models. Proponents of the model portray Rasch as closing the gap between psychological measurement and measurement in the physical sciences. Elliot, Murray and Pearson (1978, pp. 25-26) claim that “Rasch ability scores have many similar characteristics to physical measurement” and Wright (1997, p. 44) argues that the arrival of the Rasch model means that “there is no methodical reason why social science cannot become as stable, as reproducible, and hence as useful as physics.” This paper highlights the incoherence of the model.

The Rasch model and its paradox

The Rasch model is defined as follows:

$$P(X_{is}=1 \mid \theta_s, \beta_i) = \frac{e^{(\theta_s-\beta_i)}}{1+e^{(\theta_s-\beta_i)}}$$

X_is is the response (X) made by subject s to item i;

θ_s is the trait level of subject s;

β_i is the difficulty of item i; and

X_is=1 indicates a correct response to the item.

On the face of it, the model uses a mathematical function to allow the psychometrician to compute the probability that a randomly selected individual of ability θ will provide the correct response to an item of difficulty β. A particular ability and difficulty value will be chosen for illustration, but the analysis which follows has universal application. When the values θ = 1 and β = 2, for example, are substituted in the Rasch model, a scientific calculator will quickly confirm that the probability that an individual of ability θ = 1 will respond correctly to an item of difficulty β = 2 is approximately 0.27, since e^(1-2)/(1 + e^(1-2)) = 1/(1 + e) ≈ 0.269. It follows that if a large sample of individuals, all with this same ability, respond to this item, approximately 27% will give the correct response.
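The calculator check described above can be reproduced in a few lines. This is a minimal sketch; the function name rasch_probability is chosen here for illustration and does not come from any psychometric package.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta is the trait level of the subject, beta the difficulty of the item.
    """
    odds = math.exp(theta - beta)
    return odds / (1.0 + odds)


# The illustrative values used in the text: ability 1, difficulty 2.
p = rasch_probability(theta=1.0, beta=2.0)
print(round(p, 2))  # 0.27
```

When ability and difficulty coincide, the same function returns 0.5, which is the sense in which an item is "neither too easy nor too difficult" for a person under the model.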

In the Rasch model “the abilities specified in the model are the only factors influencing examinees’ responses to test items” (Hambleton, Swaminathan & Rogers, 1991, p. 10). This results in a paradox. If a large sample of individuals of exactly the same ability respond to the same item, designed to measure that ability, why would 27% get it right and 73% get it wrong? If the item measures ability and the individuals are all of equal ability, then surely the model must indicate that they all get it right, or they all get it wrong?

Does the Rasch model really represent an advance on classical test theory?

The Rasch model is portrayed as a radical advance on what went before – classical test theory (CTT). In classical test theory, “[p]erhaps the most important shortcoming is that examinee characteristics and test characteristics cannot be separated: each can be interpreted only in the context of the other. The examinee characteristic we are interested in is the ‘ability’ measured by the test” (Hambleton, Swaminathan & Rogers, 1991, p. 2).

An examinee’s ability is defined only in terms of a particular test. When the test is “hard,” the examinee will appear to have low ability; when the test is “easy,” the examinee will appear to have higher ability. What do we mean by “hard” and “easy” tests? The difficulty of a test item is defined as ‘the proportion of examinees in a group of interest who answer the item correctly.’ Whether an item is hard or easy depends on the ability of the examinees being measured, and the ability of the examinees depends on whether the items are hard or easy! (Hambleton, Swaminathan & Rogers, 1991, pp. 2-3)

Measures of ability in the Rasch model, on the other hand, are claimed to be completely independent of the items used to measure such abilities. This is vital to the computation of plausible values because no student answers more than a fraction of the totality of PISA items.

A puzzle emerges immediately: if the Rasch model treats as separable what classical test theory treats as profoundly entangled – with Rasch regarded as a significant advance on classical test theory – why does the empirical data not reflect two radically different measurement frameworks? Based on large scale comparisons of item and person statistics, Fan (1998) notes: “These very high correlations indicate that CTT- and IRT-based person ability estimates are very comparable with each other. In other words, regardless of which measurement framework we rely on, the same or very similar conclusions will be drawn regarding the ability levels of individual examinees” (p. 8), and concludes: “the results here would suggest that the Rasch model might not offer any empirical advantage over the much simpler CTT framework” (p. 9). Fan (1998) confirms Thorndike’s (1962, p. 12) pessimism concerning the likely impact of IRT: “For the large bulk of testing, both with locally developed and standardized tests, I doubt that there will be a great deal of change. The items that we select for a test will not be much different, and the resulting tests will have much the same properties.”
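One way to see why the two frameworks might agree so closely is a small simulation. What follows is a sketch under stated assumptions, not a reproduction of Fan's (1998) analysis: the 20 item difficulties and 500 abilities are hypothetical, the difficulties are treated as known, and ability is estimated by maximum likelihood with clamped Newton-Raphson steps. For a fixed set of items the Rasch maximum-likelihood ability estimate depends on the responses only through the number-correct score, so a high correlation between the two is to be expected.

```python
import math
import random


def rasch_probability(theta, beta):
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def ml_ability(raw_score, betas, iterations=50):
    """Maximum-likelihood ability for a raw score (difficulties assumed known)."""
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in betas]
        gradient = raw_score - sum(probs)            # slope of the log-likelihood
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information
        theta += max(-1.0, min(1.0, step))           # clamp to keep steps stable
    return theta


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)
betas = [-2.0 + 4.0 * i / 19 for i in range(20)]     # hypothetical difficulties
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]   # hypothetical abilities

raw_scores = []
for theta in thetas:
    score = sum(rng.random() < rasch_probability(theta, b) for b in betas)
    if 0 < score < len(betas):                       # extreme scores have no finite estimate
        raw_scores.append(score)

rasch_estimates = [ml_ability(score, betas) for score in raw_scores]
correlation = pearson(raw_scores, rasch_estimates)
print(correlation)
```

The number-correct score stands in here for a CTT-style person statistic; the choice of a normal ability distribution and evenly spaced difficulties is purely illustrative.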

In what follows, the case is made that in the Rasch model, just as in Classical Test Theory, ability cannot be separated from the item used to measure it. Rasch’s model is shown to be incoherent and this has clear consequences for the entire OECD project. Moreover, the arguments presented here undermine psychology’s “standard measurement model” (Borsboom, Mellenbergh & van Heerden, 2003) with implications for all IRT models and Structural Equation Modelling.

The Rasch model: early indications of incoherence

The first hints of Rasch’s confusion appear in the early pages of his 1960 treatise which sets out the Rasch model, Probabilistic Models for Some Intelligence and Attainment Tests. Rasch’s lifelong obsession – captured in his closely associated notions of “models of measurement” and “specific objectivity” – with measurement models capable of application to the social and natural sciences can be recognized in his portrayal of the Rasch model. In constructing his model Rasch (1960, p. 10) rejects deterministic Newtonian measurement for the indeterminism of quantum mechanics:

For the construction of the models referred to I shall take recourse to some points of view … of a more general character. Into the system of classical physics enter a number of fundamental laws, e.g. the Newtonian laws. … A characteristic property of these laws is that they are deterministic. … None the less it should not be overlooked that the laws do not give an accurate picture of nature. … In modern physics … the deterministic view has been abandoned. No deterministic description for e.g. radioactive emission seems within reach, but for the description of such irregularities the theory of probability has proved an extremely valuable tool.

Rasch (1960, p. 11) likens the unmeasured individual to a radioactive nuclide about to decay. Quantum mechanics teaches that, unlike Newtonian mechanics, if one had complete information about the nuclide, one still couldn’t predict the moment of decay with accuracy. Indeterminism is a constitutive feature of quantum mechanics: one cannot know, even if one had complete knowledge of the universe, what will happen next to a quantum system. Irreducible uncertainty applies. For Rasch (1960, p. 11): “Where it is a question of human beings and their actions, it appears quite hopeless to construct models which will be useful for purposes of prediction in separate cases. On the contrary, what a human being actually does seems quite haphazard, none less than radioactive emission.” Rasch (1960, p. 11) makes clear his rejection of deterministic Newtonian models: “This way of speaking points to the possibility of mapping upon models of a kind different from those used in classical physics, more like the models in modern physics – models that are indeterministic.”

Quantum indeterminism has implications for Rasch’s “models of measurement.” In quantum mechanics, measurement doesn’t simply produce information about some pre-existing state. Rather, measurement transforms the indeterminate to the determinate. Measurement causes what is indeterminate to take on a determinate value. In the classical model which Rasch rejects, measurement is simply a process of checking up on what pre-existed the act of measurement, while quantum measurement causes the previously indeterminate to take on a definite value. However, latent variable theorists in general, and Rasch in particular, treat “ability” as an intrinsic attribute of the person, and they view measurement as an act of checking up on that attribute.

The early pages of Rasch’s (1960) text raise doubts about his understanding of the central mathematical conceit of his model: probability. One gets the clear impression that Rasch associates probability with indeterminism. But completely determinate situations can involve probability. The outcome of the toss of a coin is completely determined from the moment the coin leaves the thrower’s hand. If one had knowledge of the initial speed of projection, the angle of inclination of the initial motion to the horizontal, the initial angular momentum, the local acceleration of gravity, and so on, one could use Newtonian mechanics to predict the outcome. Probability is invoked because of the coin-thrower’s ignorance of these parameters. Such probabilities are referred to as subjective probabilities.

In modern physics, uncertainty is constitutive and not a consequence of the limitations of human beings or their measuring instruments. Quantum physicists deal in objective probability. Finally, the notion of separability, or “specific objectivity” as Rasch labelled it, is absolutely central to his thinking: “Rasch’s demand for specific objective measurement means that the measure of a person’s ability must be independent of which items were used” (Rost, 2001, p. 28). However, quantum mechanics is founded on non-separability; one cannot break the conceptual link between what is measured and the measuring instrument. The mathematics of the early pages of Rasch (1960) does not augur well for the mathematical coherence of his model, but it is important to set out the case against the model with greater rigour.

Bohr and Wittgenstein: indeterminism in psychological measurement

A possible source of Rasch’s efforts to find “models of measurement” which would apply equally to both psychometric measurement and measurement in physics was the writings of Rasch’s famous countryman, Niels Bohr. (Indeed, Rasch attended lecture courses in mathematics given by the great physicist’s brother.) Bohr argued for all of his professional life that there existed a structural similarity between psychological predicates and the attributes of interest to quantum physicists. Although he never published the details, he believed he had identified an “epistemological argument common to both fields” (Bohr, 1958, p. 27). For Bohr, no psychologist has direct access to mind just as no physicist has direct access to the atom. Both disciplines use descriptive language which was developed to make sense of the world of direct experience, to describe what cannot be available to direct experience. Bohr summarized this common challenge in the question, “How does one use concepts acquired through direct experience of the world to describe features of reality beyond direct experience?”

Given the central preoccupation of this paper, Bohr’s words are particularly striking: “I want to emphasize that what we have learned in physics arose from a situation where we could not neglect the interaction between the measuring instrument and the object. In psychology, we meet the quite similar situation” (Favrholdt, 1999, p. 203). Also, prominent psychologists echo Bohr’s thinking: “The study of the human mind is so difficult, so caught in the dilemma of being both the object and the agent of its own study, that it cannot limit its inquiries to ways of thinking that grew out of yesterday’s physics” (Bruner, 1990, p. xiii). Given that Bohr never developed his ideas for the epistemological argument common to both fields, what follows also addresses en passant a lacuna in Bohr scholarship.

If all this sounds fanciful (after all, what possible parallels can be drawn between Rasch’s radionuclide on the point of decaying and an individual on the point of answering a question?) it is instructive to return to Rasch’s (1960, p. 11) claim that “what a human being does seems quite haphazard, none less than radioactive emission.” In fact there are striking parallels between the experimenter’s futile attempts to predict the moment of decay and the psychometrician’s attempts to predict the child’s response to a (hitherto unseen) addition problem such as “68 + 57 = ?”

If one restricts oneself to all of the facts about the nuclide, the outcome is completely indeterminate. Similarly, Wittgenstein’s celebrated rule-following argument (central to his philosophies of mind, mathematics and language), set out in his Philosophical Investigations, makes clear that if one restricts oneself to the totality of facts (inner and outer) about the child, these facts are in accord with the right answer (68 + 57 = 125) and an infinity of wrong answers. Mathematics will be used for illustration but the reasoning applies to all rule-following. The reader interested in an accessible exposition of this claim is directed to the second chapter of Kripke’s (1982) Wittgenstein on Rules and Private Language. (The reader should come to appreciate the power of the rule-following reasoning without being troubled by Kripke’s questionable take on the so-called skeptical argument.) The author will now attempt the barest outlines of Wittgenstein’s writing on rule-following.

By their nature, human beings are destined to complete only a finite number of arithmetical problems over a lifetime. The child who is about to answer the question “68 + 57 = ?” for the first time has, of necessity, a finite computational history in respect of addition. Through mathematical reasoning which dates back to Leibniz, this finite number of completed addition problems can be brought under an infinite number of different rules, only one of which is the rule for addition. In short, any answer the child gives to the problem can be demonstrated to be in accord with a rule which generates that answer and all of the answers the child gave to all of the problems he or she has tackled to date. If one had access to the totality of facts about the child’s achievements in arithmetic, one couldn’t use these facts to predict the answer the child will give to the novel problem “68 + 57 = ?” because one can always derive a rule which generates the child’s entire past problem-solving history and any particular answer to “68 + 57 = ?”
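The claim that a finite history fits many rules can be made concrete with a short sketch. The list of past problems is hypothetical, and make_deviant_rule is a name introduced here for illustration. The function quus follows the deviant rule Kripke (1982) uses in his exposition: it agrees with addition whenever both numbers are below 57 and returns 5 otherwise.

```python
def plus(x, y):
    return x + y


def quus(x, y):
    """Kripke's deviant rule: addition below 57, otherwise 5."""
    if x < 57 and y < 57:
        return x + y
    return 5


def make_deviant_rule(novel_problem, answer):
    """Build a rule that matches addition except on one chosen problem."""
    def rule(x, y):
        if (x, y) == novel_problem:
            return answer
        return x + y
    return rule


# Hypothetical finite history of problems the child has already completed.
history = [(2, 3), (10, 15), (21, 34), (40, 56)]
novel = (68, 57)

# Both rules reproduce every answer in the child's history...
assert all(plus(x, y) == quus(x, y) for x, y in history)
# ...yet they part company on the novel problem.
print(plus(*novel), quus(*novel))  # 125 5

# Any candidate answer at all is generated by some rule that fits the history.
for candidate in (125, 5, 7, 1000):
    rule = make_deviant_rule(novel, candidate)
    assert all(rule(x, y) == plus(x, y) for x, y in history)
    assert rule(*novel) == candidate
```

Nothing in the history distinguishes these rules from one another; they differ only on problems the child has not yet met.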

Now what of facts concerned with the contents of the child’s mind? Surely an all-seeing God could peer into the child’s mind and determine which rule was guiding the child’s problem-solving? By substituting the numbers 68 and 57 into the rule, God could predict with certainty the child’s response. Alas, having access to inner facts (about the mind or brain) won’t help because having a rule in mind is neither sufficient nor necessary for responding correctly to mathematical problems. Is having a rule in mind sufficient? Clearly not since all pupils taking GCSE mathematics, for example, have access to the quadratic formula and yet only a fraction of these pupils will provide the correct answer to the examination question requiring the application of that formula. Is having the rule in mind necessary? Once again, clearly not because one can be entirely ignorant of the quadratic formula and yet produce the correct answers to algebraic problems involving quadratics using alternative procedures like “completing the square,” graphical methods, the Newton-Raphson procedure, and so on.

It is important to be clear what is being said here. If one could identify an addition problem beyond the set of problems Einstein had completed during his lifetime, is the claim that one couldn’t predict with certainty Einstein’s response to that problem? Obviously not. But the correct answer and an infinity of incorrect answers are in keeping with all the facts (inner and outer) about Einstein. When one is restricted to these facts, Einstein’s ability to respond correctly is indeterminate. In summary, before the child answers the question “68 + 57 = ?” his or her ability with respect to this question is indeterminate. The moment he or she answers, the child’s ability is determinate with respect to the question (125 is pronounced correct, and all other answers are deemed incorrect). One might portray this as follows: before responding the child is right and wrong and, at the moment of response, he or she is right or wrong.

The problem with the Rasch model

Ability only becomes determinate in context of a measurement; it’s indeterminate before the act of measurement. The conclusion is inescapable – ability is a relational property rather than something intrinsic to the individual, as psychology’s standard measurement model would have it. A definite ability cannot be ascribed to an individual prior to measurement. Ability is a joint property of the individual and the measurement instrument; take away the instrument and ability becomes indeterminate. It is difficult to escape the conclusion that ability (and intelligence, and self-concept, and so on) is a property of the interaction between individual and measuring instrument rather than an intrinsic property of the individual. If psychological constructs were viewed as joint properties of individuals and measuring instruments, then intractable questions such as “what is intelligence?”, “what is memory?” need no longer trouble the discipline.

What can be concluded in respect of Rasch? It is clear that the Rasch model is no more capable of separating ability from the item used to measure it than was its predecessor, classical test theory. Pick up any textbook on IRT and one finds the same assumption stated again and again in model development: individuals carry a determinate ability with them from moment to moment and measurement involves checking up on that ability. The ideas of Bohr and Wittgenstein can be used to reject this; for them, measurement effects a “jump” from the indeterminate to the determinate, transforming a potentiality to an actuality.

In simple terms it can be argued that ability has two facets; it is indeterminate before measurement and determinate immediately afterwards. The single description of the standard measurement model is replaced by two mutually exclusive descriptions. Ability is indeterminate before measurement and only determinate with respect to a measurement context. Neither of these descriptions can be dispensed with. The indeterminate and the determinate are mutually exclusive facets of one and the same ability.

Returning to the child who has been taught to add but hasn’t yet encountered the question “68 + 57 = ?” what can be said of his or her ability with respect to this question? When one ponders ability as a thing-in-itself, it’s tempting to think of it as something inner, something that resides in the child prior to being expressed when the child answers. If ability is to be found anywhere, surely it’s to the unmeasured mind one should look? Isn’t it tempting to think of it as something the child “carries” in his or her mind? When the focus is on ability as a thing-in-itself, it seems the child’s eventual answer to the question is somehow inferior; it’s the mere application of the child’s ability rather than the ability itself.

The concept of causality in classical physics is replaced by the notion of “complementarity” in quantum mechanics. Complementarity treats pre-measurement indeterminism and the determinate outcome of measurement as non-separable. Whitaker (1996, p. 184) portrays complementarity as “mutual exclusion but joint completion.” One cannot meaningfully separate the pre-measurement facet of ability from its measurement-determined counterpart. The analogue of Bohr’s complementarity is what Wittgensteinians refer to as first-person/third-person asymmetry. The first-person facet of ability (characterised by indeterminism) and the third-person measurement perspective cannot be meaningfully separated. Suter (1989, pp. 152-153) distinguished the first-person/third-person symmetry of Newtonian attributes from the first-person/third-person asymmetry of psychological predicates: “This asymmetry in the use of psychological and mental predicates – between the first-person present-tense and second- and third-person present-tense – we may take as one of the special features of the mental.” Nagel (1986, p. 22) notes: “the conditions of first-person and third-person ascription of an experience are inextricably bound together in a single public concept.”

This non-separability of first-person and third-person perspectives obviates the need to conclude, with Rasch, that the individual’s response need be “haphazard.” The first-person indeterminism detailed earlier seems to indicate that individuals offer responses entirely at random. After all, the totality of facts is in keeping with an infinity of answers, only one of which is correct. But one need only infer “random variation located within the person” (Borsboom, 2005, p. 55) if one mistakenly treats the first-person facet as separable from the third-person. (The author’s earlier practice of stressing the restriction to the totality of facts about the individual was intended to highlight this taken-for-granted separability.) Lord’s (1980) admonition that item response theorists eschew the “stochastic subject” interpretation for the “repeated sampling” interpretation led IRT practitioners astray by purging entirely the first-person facet from an indivisible whole. One only arrives at conclusions that are “absurd in practice” (p. 227) if one follows Lord (1980) and divorces ability from the item which measures it. Like Rasch, Lord failed to grasp that the within-subject and the between-subject aspects of psychological measurement are profoundly entangled.

Holland, Lord and the ensemble interpretation as the route out of paradox

Holland (1990) repeats Lord’s error by eschewing the stochastic subject interpretation for the random sampling interpretation, despite acknowledging “that most users think intuitively about IRT models in terms of stochastic subjects” (p. 584). The stochastic subject rationale traces the probabilities of the Rasch model to randomness in the individual subject:

Even if we know a person to be very capable, we cannot be sure that he will solve a certain difficult problem, not even a much easier one. There is always a possibility that he fails – he may be tired or his attention is led astray, or some other excuse may be given. And a person of slight ability may hit upon the correct solution to a difficult problem. Furthermore, if the problem is neither “too easy” nor “too difficult” for a certain person, the outcome is quite unpredictable. (Rasch, 1960, p. 73)

Rasch is proposing what quantum physicists call a “local hidden variables” measurement model. While Wittgenstein argues that ability is indefinite before the act of measurement (an act which effects a “jump” from indefinite to definite), psychometricians in general, and Rasch in particular, treat ability as definite before measurement. The local hidden variables of the Rasch model are variables such as examinee fatigue, degree of distraction, and any other influence militating against his or her capacity to provide a correct answer. Rasch is suggesting that if one had complete information concerning the examinee’s ability, his or her level of fatigue, propensity for distraction, and so on, one could predict, in principle, the examinee’s response with a high degree of confidence. It is the absence of variables capable of capturing fatigue, attention, and so on, from the Rasch algorithm, that makes its probabilistic nature inevitable. In this local hidden variable model, probability is being invoked because of the measurer’s ignorance of the effects of fatigue, attention loss, and so on.

But Bell (1964) proved beyond doubt that local hidden variables models are impossible in quantum measurement. One can avoid the difficulties thrown up by Bell’s celebrated inequalities by treating unmeasured predicates as indefinite (Fuchs, 2011). This would have profound implications for how one conceives of latent variables in the Rasch model. If local hidden variables are ruled out, latent variables could not be assigned investigation-independent values. Ability only takes on a definite value in a measurement context. IRT can no more separate these two entities (ability and the item used to measure it) than could classical test theory. The “random sampling” approach that Holland (1990) recommends is a so-called “ensemble” interpretation. The definitive text on ensembles – Home and Whitaker (1992) – finds ensembles illegitimate because they mistakenly replace “superpositions” by “mixtures” (Whitaker, 2012, p. 279).

One gets the distinct impression from the IRT literature that the random sampling method is being urged on the field because of embarrassments that lurk in the stochastic subject model. For example, Lord (1980, p. 228) refers to the latter as “unsuitable”:

The trouble comes from an unsuitable interpretation of the practical meaning of the item response function … If we try to interpret Pi(A) as the probability that a particular examinee A will answer a particular item i correctly, we are likely to reach absurd conclusions. (Lord, 1980, p. 228)

Lord (1980) and Holland (1990) both attempt to avoid embarrassment by taking the simple step of ignoring the stochastic subject for the comfort of an ensemble interpretation. Home and Whitaker (1992) close their text with the words: “[W]e see the ensemble interpretation as the “comfortable” option, creating the illusion that all difficulties may be removed by taking one simple step” (p. 311).

What of the paradox identified earlier?

It is now possible to address the paradox presented earlier. Here is a restatement: If a large sample of individuals of exactly the same ability respond to the same item, designed to measure that ability, why would 27% get it right and 73% get it wrong? Suppose a large number of individuals answer a question (labelled Q1), and, of those who give the correct answer, 100 individuals, say, are posed a second question (labelled Q2). When these 100 individuals respond to Q2, 27% give the correct answer and 73% respond with the wrong answer. What can be said about the ability of each individual immediately after answering Q1 but before answering Q2? Given the natural tendency to think of ability as an attribute of mind, it seems reasonable to focus on the individual’s ability “between questions” as it were.

Poised between questions, each individual’s ability with respect to Q1 is determinate; they have answered Q1 correctly moments before. What of their ability with respect to Q2, the question they have yet to encounter? According to the reasoning presented above, all the facts are in keeping with both a correct and an incorrect answer. The individual’s ability relative to Q2 is indeterminate. Quantum mechanics portrays such states as “superpositions” – the individuals all have the same indefinite ability characterised as: “correct with probability 27% and incorrect with probability 73%.” It is easy to see why 100 individuals each with an ability characterised in this way could be portrayed as subsequently producing 27 correct responses and 73 incorrect responses to Q2.
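The arithmetic behind that last sentence can be sketched numerically. The code only illustrates the expected tally: the pseudo-random draws are a classical stand-in for the moment of response and take no position on how the pre-measurement state should be interpreted. The seed and the number of repetitions are arbitrary choices made here.

```python
import random

P_CORRECT = 0.27     # probability attached to each individual's response to Q2
N_INDIVIDUALS = 100  # individuals who answered Q1 correctly


def tally_correct(n, p, rng):
    """Count correct responses when each of n individuals responds once."""
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(2024)

expected = N_INDIVIDUALS * P_CORRECT
one_run = tally_correct(N_INDIVIDUALS, P_CORRECT, rng)

# Repeating the exercise many times shows the tallies centring on 27.
repeats = 2000
average = sum(tally_correct(N_INDIVIDUALS, P_CORRECT, rng) for _ in range(repeats)) / repeats

print(expected, one_run, round(average, 1))
```

Any single group of 100 will typically land a few responses either side of 27; it is the long-run average that settles there.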

In this approach the paradox dissolves. All 100 individuals have definite abilities (as measured by Q1), but only 27% go on to answer Q2 correctly. But note the crucial step in the logic required to dissolve the paradox: each individual’s ability is simultaneously determinate with respect to Q1 and indeterminate with respect to Q2. A change in question (from Q1 to Q2) effects a radical change from determinate to indeterminate. It is therefore only meaningful to talk about a definite ability in relation to a measurement context. Ability is a joint property of the individual and the item; pace Rasch, they cannot be construed as separable! It follows therefore that the examiner (the person who selects the item) participates in the ability manifest in a response to that item. Pace Rasch, measurement in education and psychology is a more dynamic affair than measurement in classical physics, which is merely a matter of checking up on what’s already there. Because that which is measured is inseparable from the question posed, the measurer participates in what he or she “sees.” Newtonian detachment is as unattainable in psychology and education as it is in quantum theory.

Conclusion

Returning to the real life consequences of this refutation of latent variable modelling in general and Rasch modelling in particular, one cannot escape the conclusion that the OECD’s claims in respect of its PISA project have scant validity given the central dependence of these claims on the clear separability of ability from the items designed to measure that ability.

References

Bell, J.S. (1964). On the Einstein-Podolsky-Rosen paradox. Physics, 1, 195-200.
Bohr, N. (1929/1987). The philosophical writings of Niels Bohr: Volume 1 – Atomic theory and the description of nature. Woodbridge: Ox Bow Press.
Bohr, N. (1958/1987). The philosophical writings of Niels Bohr: Volume 2 – Essays 1933 – 1957 on atomic physics and human knowledge. Woodbridge: Ox Bow Press.
Borsboom, D. (2005). Measuring the mind: conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press.
Borsboom, D., Mellenbergh, G.J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203-219.
Bruner, J.S. (1990). Acts of meaning. Cambridge, MA: Harvard University Press.
Davies, E.B. (2003). Science in the looking glass. Oxford: Oxford University Press.
Davies, E.B. (2010). Why beliefs matter. Oxford: Oxford University Press.
Elliott, C.D., Murray, D., & Pearson, L.S. (1978). The British ability scales. Windsor: National Foundation for Educational Research.
Ertl, H. (2006). Educational standards and the changing discourse on education: the reception and consequences of the PISA study in Germany. Oxford Review of Education, 32(5), 619-634.
Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381.
Favrholdt, D. (Ed.). (1999). Niels Bohr collected works (Volume 10). Amsterdam: Elsevier Science B.V.
Fuchs, C.A. (2011). Coming of age with quantum information: Notes on a Paulian idea. Cambridge: Cambridge University Press.
Hacker, P.M.S. (1993). Wittgenstein, mind and meaning – Part 1 Essays. Oxford: Blackwell.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.
Hark, M.R.M. ter (1990). Beyond the inner and the outer. Dordrecht: Kluwer Academic Publishers.
Holland, P.W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55(4), 577-601.
Home, D., & Whitaker, M.A.B. (1992). Ensemble interpretation of quantum mechanics: A modern perspective. Physics Reports (Review section of Physics Letters), 210(4), 223-317.
Jöreskog, K.G., & Sörbom, D. (1993). LISREL 8 user’s reference guide. Chicago: Scientific Software International.
Kalckar, J. (Ed.). (1985). Niels Bohr collected works (Volume 6). Amsterdam: Elsevier Science B.V.
Kripke, S.A. (1982). Wittgenstein on rules and private language. Oxford: Blackwell.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355-383.
Nagel, T. (1986). The view from nowhere. New York: Oxford University Press.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Paedagogiske Institut.
Rinne, R., & Ozga, J. (2013). The OECD and the global re-regulation of teacher’s work: Knowledge-based regulation tools and teachers in Finland. In T. Seddon & J.S. Levin (Eds.), World yearbook of education (pp. 97-116). London: Routledge.
Rost, J. (2001). The growing family of Rasch models. In A. Boomsma, M.A.J. van Duijn, & T.A.B. Snijders (Eds.), Essays on item response theory (pp. 25-42). New York: Springer.
Sobel, M.E. (1994). Causal inference in latent variable models. In A. von Eye & C.C. Clogg (Eds.), Latent variable analysis (pp. 3-35). Thousand Oaks: Sage.
Suter, R. (1989). Interpreting Wittgenstein: A cloud of philosophy, a drop of grammar. Philadelphia: Temple University Press.
Takayama, K. (2008). The politics of international league tables: PISA in Japan’s achievement crisis debate. Comparative Education, 44(4), 387-407.
Thorndike, R.L. (1982). Educational measurement: Theory and practice. In D. Spearritt (Ed.), The improvement of measurement in education and psychology: Contributions of latent trait theory (pp. 3-13). Melbourne: Australian Council for Educational Research.
Whitaker, A. (1996). Einstein, Bohr and the quantum dilemma. Cambridge: Cambridge University Press.
Whitaker, A. (2012). The new quantum age. Oxford: Oxford University Press.
Wittgenstein, L. (1953). Philosophical Investigations. G.E.M. Anscombe, & R. Rhees (Eds.), G.E.M. Anscombe (Tr.). Oxford: Blackwell.
Wittgenstein, L. (1980a). Remarks on the philosophy of psychology Volume 1 (Edited by G.E.M. Anscombe & G.H. von Wright; translated by G.E.M. Anscombe). Oxford: Basil Blackwell.
Wittgenstein, L. (1980b). Remarks on the philosophy of psychology Volume 2 (Edited by G.H. von Wright & H. Nyman; translated by C.G. Luckhardt & M.A.E. Aue). Oxford: Basil Blackwell.
Wright, B.D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-52.
Wright, C. (2001). Rails to infinity. Cambridge, MA: Harvard University Press.