David Jones has posted about the general lament over high-stakes testing and asks, “What’s the alternative?” You could rephrase this to ask: does measurement help us or hurt us? Not only has he piqued my interest to think more along these lines, but I think the question is also relevant to data analytics, LAK11, and anywhere else measurement is used. So... dive in I will!
David cites the association of the testing movement with globalization and managerialization, but I also believe that analytics, appropriately applied, can benefit education in pragmatic, everyday ways. He also quotes Goodhart’s Law, named for the British economist Charles Goodhart, who spoke on the corruptibility of policy measurement. I prefer Donald Campbell’s similar law for this situation because he was a psychometrician and speaks in testing language. He states:
The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.
I believe that the problem being discussed is, at its heart, a validity problem. High-stakes testing is not appropriately bearing the heavy burden assigned to it. It is not meeting the criteria of consequential validity; it does not produce better long-term educational outcomes. Cronbach and Meehl (1955) explained why validity is important for situations like this one.
We do believe that it is imperative that psychologists make a place for (construct validity) in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an “operational” methodology which, if adopted, would force research into a mold it does not fit.
This is what I believe they are saying (also taking into account Cronbach and Messick’s later developments in the concept of validity). We measure constructs, not things in themselves. A construct is defined by the network of associations or propositions within which it occurs (see Cronbach and Meehl, Recapitulation Section #1). Validity is the rational and empirical investigation of the association between our operational definitions of our constructs (that is, our tests) and our network of associations. Without this investigation, what we are measuring is operationally defined by our tests, but what that is remains undefined in any meaningful way. We can’t teach constructs or measure them unless we thoroughly understand them, at least at a theoretical level. Operationalism has been rejected, whether it was founded in positivist philosophy or in common-sense ignorance.
Most people I’ve heard advocating for standardized and standards-based high-stakes testing do so based on a principle of school accountability, not because such testing has been unequivocally demonstrated as a way to improve schools. It’s a logical argument, and it seems to be lacking empirical support. Teaching to the test is regarded as inappropriate, but if the test is the sole standard of accountability, then it is an operational definition of what we are to be teaching. In that case, anything other than teaching to the test seems illogical.
So let’s dig deeper. I think there are nested psychometric problems within this testing movement. Campbell’s Law may overtake any attempt to use measurement in policy in a large sense, but I am going to start with how a measurement regime might be designed better.
1. What measurement problems exist with current tests?
Teaching to the test as it is commonly practiced is not good because it is doubtful that tests are really measuring the correct information. There are many unintended things being measured in these high-stakes standardized tests (technically referred to in validity theory as irrelevant variance). In many ways, our measures are based (operationalized) more on tradition and common sense than on empirically sound psychometry. This is what Cronbach warned of: tests that don’t match the constructs. To improve tests we need to go beyond common sense and clarify the constructs we desire our students to exhibit. Why don’t we do this now? Most likely it is too difficult for policymakers to get their heads around, but there is a possible second reason. It would reduce the validity of tests as validity is measured by positivist methodology. Validity is an overall judgement, but positivists don’t like fuzzy things like judgements. Tests may need to be reduced in validity in some areas in order to gain validity overall. Many people guiding testing procedures likely have a narrow view of validity, as opposed to the broader view of validity espoused by Messick or Cronbach. This leads to other issues.
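To make the idea of irrelevant variance a little more concrete, here is a minimal simulation sketch. It is a toy illustration, not a model of any real test: the function names, the sample size, and the weights are all my own assumptions. Each simulated student has a true level on the construct we care about; one test score reflects the construct plus random error, and the other also picks up an unrelated trait (say, test-taking savvy). Under these made-up weights, the first score should correlate with the construct at roughly 0.89 in expectation and the second at roughly 0.67.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_students=5000, irrelevant_weight=1.0, seed=0):
    """Return (clean_r, contaminated_r): how well each score tracks the construct.

    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # The thing we actually want to measure (e.g. reading comprehension).
    construct = [rng.gauss(0, 1) for _ in range(n_students)]
    # Something the test picks up by accident (e.g. test-taking savvy).
    irrelevant = [rng.gauss(0, 1) for _ in range(n_students)]
    # Ordinary random measurement error.
    noise = [rng.gauss(0, 0.5) for _ in range(n_students)]

    clean = [c + e for c, e in zip(construct, noise)]
    contaminated = [
        c + irrelevant_weight * i + e
        for c, i, e in zip(construct, irrelevant, noise)
    ]
    return pearson(construct, clean), pearson(construct, contaminated)


clean_r, contaminated_r = simulate()
print(f"score vs. construct, no irrelevant variance:   {clean_r:.2f}")
print(f"score vs. construct, with irrelevant variance: {contaminated_r:.2f}")
```

The point of the sketch is only that a score can look perfectly orderly while a good share of what it ranks students on has nothing to do with the construct. In real testing we never get to observe the construct directly, which is exactly why validity has to be argued for with theory and evidence.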
2. Standards Do Not Address Many Important Educational Outcomes.
The curriculum, as it is reflected in standards, is not always focused on the most important knowledge and skills. I think it reflects three things: a kitchen sink approach (including the requests of every constituency), a focus on standards that are easily measured by multiple choice or similar types of questions, and expert opinion. The ability to creatively argue points of view, to write with persuasion and conviction, to read, interpret, discuss and develop subtle points of meaning among peers, and to show progression and maturation in these types of skills over time are important things that are not well measured by current high-stakes tests. A kitchen sink approach does not allow teachers to focus on depth. Assessments like portfolios contain more information and a broader validity base, but are seen as less reliable (i.e. it’s possible to cheat or include personal bias). Expert opinion is a type of content validity evidence and is considered the weakest form of validity evidence. With the development of high-stakes testing, we are in more of a position to measure the validity of curriculum standards and to adjust standards accordingly, but I see no one doing this. Maybe there is some research on the ability of high school students to function as college freshmen, but this outcome is inconsequential in a long view of one’s life. Tests should be held accountable for consequential validity and should empirically show that they result in improved lives, not just parroting facts or helping teachers of college freshmen. It is not just teachers who should be held accountable; it is also test and standards developers.
3. Post-positivist Psychometrics
To be sure, there are trade-offs in any form of measurement. Sometimes improving validity in one area weakens validity in other areas. Validity never reaches 100% in any situation. However, because tests are mandated by law, I believe current validity questions favor views of what will be held valid in a court of law. Law tends to be conservative, and conservative psychometrics are based in philosophical positivism. I bet that many people making policy decisions have a poor understanding of what I consider to be sound psychometrics: psychometrics that are consistent with post-positivist philosophy. Let me be clear, positivist psychometrics are not wrong, just incomplete and limited. This was the insight of Wittgenstein. Positivism looks at a small slice of life while ignoring the rest of the pie. Wittgenstein said that if we want to understand language, we should look at how people are using language. Similarly, Samuel Messick said that if you want to understand a test, follow the outcomes. How are people using the test, and what are the results of what they are doing? This is the most important test of validity.
To sum up
There are many possible things that could be done in answer to David’s question. I have focused on how you might improve testing processes. Do not focus on tradition and traditional technique, but on standards and testing practices that create authentic value (what Umair Haque would call thick value) for students who will live out their lives in the 21st century, a century that is shaping up to be quite different from the last. Testing could be part of the equation, but let’s hold teachers and schools accountable for the value they create as it is measured in improved lives, not in some questionably valid test score.