#LAK11 – Validity is the Only Guardian Angel of Measurement (Geekish)

David Jones has posted about the general lament of high-stakes testing and asks, “what’s the alternative?” You could rephrase this to ask: does measurement help us or hurt us? Not only has he piqued my interest to think more along these lines, but I think the question is also relevant to data analytics, LAK11, and anyplace where measurement is used. So. . .dive in I will!

David cites the association of the testing movement with globalization and managerialization, but I also believe that analytics, appropriately applied, can benefit education in pragmatic everyday ways. He also quotes Goodhart’s Law, named for the British economist Charles Goodhart, who spoke on the corruptibility of policy measurement. I like Donald Campbell’s similar law even better for this situation because he was a psychometrician and speaks in testing language. He states:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

I believe that the problem being discussed, at its heart, is a validity problem.  High stakes testing is not appropriately bearing the high burden assigned to it.  It is not meeting the criteria of consequential validity; it does not produce better long-term educational outcomes.  Cronbach and Meehl (1955) explained why validity is important for situations like this one.

We do believe that it is imperative that psychologists make a place for (construct validity) in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an “operational” methodology which, if adopted, would force research into a mold it does not fit.

This is what I believe they are saying (also taking into account Cronbach and Messick’s later developments in the concept of validity). We measure constructs, not things in themselves. A construct is defined by the network of associations or propositions within which it occurs (see Cronbach and Meehl, Recapitulation Section #1). Validity is the rational and empirical investigation of the association between our operational definitions of our constructs (that is, our tests) and our network of associations. Without this investigation, what we are measuring is operationally defined by our tests, but what that is remains undefined in any meaningful way. We can’t teach constructs or measure them unless we thoroughly understand them, at least at a theoretical level. Operationalism has been rejected, whether it was founded in positivist philosophy or in common sense ignorance.

Most people I’ve heard advocating for standardized and standards-based high-stakes testing do so based on a principle of school accountability, not because such testing has been unequivocally demonstrated as a way to improve schools. It’s a logical argument, and it seems to be lacking empirical support. Teaching to the test is regarded as inappropriate, but if the test is the sole standard of accountability, then it is an operational definition of what we are to be teaching. In that case, anything other than teaching to the test seems illogical.

So let’s dig deeper.  I think there are nested psychometric problems within this testing movement.  Campbell’s Law may overtake any attempt to use measurement in policy in a large sense, but I am going to start with how a measurement regime might be designed better.

1. What measurement problems exist with current tests?

Teaching to the test as it is commonly practiced is not good because it is doubtful that tests are really measuring the correct information. There are many unintended things being measured in these high-stakes standardized tests (technically referred to in validity theory as irrelevant variance). In many ways, our measures are based (operationalized) more on tradition and common sense than on empirically sound psychometry. This is what Cronbach warned of when our tests don’t match the constructs. To improve tests we need to go beyond common sense and clarify the constructs we desire our students to exhibit. Why don’t we do this now? Most likely it is too difficult for policymakers to get their heads around, but there is a possible second reason. Doing so would reduce the validity of tests, as their validity is measured by positivist methodology. Validity is an overall judgement, but positivists don’t like fuzzy things like judgements. Tests may need to be reduced in validity in some areas in order to gain validity overall. Many people guiding testing procedures likely have a narrow view of validity, as opposed to the broader view of validity espoused by Messick or Cronbach. This leads to other issues.
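To make irrelevant variance a little more concrete, here is a minimal sketch of my own; it uses invented data, not any real test. It assumes a test score is the construct we care about plus some unrelated factor (I have labeled it test-taking savvy, which is purely hypothetical) plus random noise. The weights, sample size, and function names are all assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(irrelevant_weight, n=5000, seed=11):
    """Correlate a hypothetical test score with the construct it targets."""
    rng = random.Random(seed)
    construct = [rng.gauss(0, 1) for _ in range(n)]   # what we want to measure
    irrelevant = [rng.gauss(0, 1) for _ in range(n)]  # assumed: test-taking savvy
    # Score = construct + weighted irrelevant factor + random measurement noise
    scores = [c + irrelevant_weight * i + 0.5 * rng.gauss(0, 1)
              for c, i in zip(construct, irrelevant)]
    return pearson(scores, construct)

clean = simulate(irrelevant_weight=0.0)
contaminated = simulate(irrelevant_weight=1.5)
print(f"no irrelevant variance:   r = {clean:.2f}")
print(f"with irrelevant variance: r = {contaminated:.2f}")
```

In this toy setup the score tracks the construct less closely as the irrelevant factor gains weight, even though every student still receives a precise-looking number. It says nothing about how large the effect is in real tests; that is an empirical question.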

2. Standards do not Address Many Important Educational Outcomes.

The curriculum, as it is reflected in standards, is not always focused on the most important knowledge and skills. I think it reflects three things: a kitchen sink approach (include the request of every constituency), a focus on standards that are easily measured by multiple choice or similar types of questions, and expert opinion. The ability to creatively argue points of view, to write with persuasion and conviction, to read, interpret, discuss and develop subtle points of meaning among peers, and to track the progression and maturation of these types of skills over time are important things that are not well measured by current high-stakes tests. A kitchen sink approach does not allow teachers to focus on depth. Assessments like portfolios contain more information and a broader validity base, but are seen as less reliable (i.e. it’s possible to cheat or include personal bias). Expert opinion is a type of content validity and is considered the weakest form of validity evidence. With the development of high-stakes testing, we are in more of a position to measure the validity of curriculum standards and to adjust standards accordingly, but I see no one doing this. Maybe there is some research on the ability of high school students to function as college freshmen, but this outcome is inconsequential in a long view of one’s life. Tests should be held accountable for consequential validity and should empirically show that they result in improved lives, not just parroting facts or helping teachers of college freshmen. It is not just teachers that should be held accountable; it is also test and standard developers.

3. Post-positivist Psychometrics

To be sure, there are trade-offs in any form of measurement. Sometimes improving validity in one area weakens validity in other areas. Validity never reaches 100% in any situation. However, because tests are mandated by law, I believe current validity questions favor views of what will be held valid in a court of law. Law tends to be conservative, and conservative psychometrics are based in philosophical positivism. I bet that many people making policy decisions have a poor understanding of what I consider to be sound psychometrics, psychometrics that are consistent with post-positivist philosophy. Let me be clear: positivist psychometrics are not wrong, just incomplete and limited. This was the insight of Wittgenstein. Positivism looks at a small slice of life, while ignoring the rest of the pie. Wittgenstein said if we want to understand language, look at how people are using language. Similarly, Samuel Messick said, if you want to understand a test, follow the outcomes. How are people using the test, and what are the results of what they are doing? This is the most important test of validity.

To sum up

There are many possible things that could be done in answer to David’s question. I have focused on how you might improve testing processes. Do not focus on tradition and traditional technique, but on standards and testing practices that create authentic value (what Umair Haque would call thick value) for students who will live out their lives in the 21st century, a century that is shaping up to be quite different from the last. Testing could be part of the equation, but let’s hold teachers and schools accountable for the value they create as it is measured in improved lives, not in some questionably valid test score.

#LAK11 Data Science and Analytics: the Good, the Bad and the Ugly

Hans de Zwart posted a great summation of the critiques of big data and its usages. I will comment on his post in three sections: the good, the bad, and the ugly.

The Good

I like this Dataist’s Venn Diagram on Data Science combining Hacking Skills (innovating with technology), Math and Stat knowledge, with core expertise.

Data Science Venn Diagram


Data science is then a combination of expertise in machine learning (to deal with the massive amounts of data being generated), traditional research expertise (to process and analyze that information) and a willingness to engage in creative disciplinary innovation to bring these insights to practice (danger zone). I think of this as a list of skill and knowledge needs.

The Bad

Most of the naughty list is from Drew Conway’s original definition of the danger zone and from George Siemens’ 10 concerns. Drew’s reason for calling it a danger zone was to warn of people who hack (innovate) with poor core knowledge, and George’s concern list is mostly about data procedures getting away from our intentions. These are valid concerns, but I think they relate to statistical and measurement concerns. My take on the problem is this: due to common pedagogy, most people have a rather formulaic understanding of measurement and statistics. They know how to plug in the numbers, but they aren’t so good at understanding what they are doing conceptually and what limitations are being violated. Not only is this a problem because they are operating blindly, but also because they are missing the inherent limitations that exist in their calculations. So people are blind to the validity problems they are creating, and they also lack a good conceptual understanding of what their procedures are capable of doing.
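A small illustration of what plugging in the numbers can hide (my own toy example, with made-up data): the Pearson correlation only captures linear association. Below, y is completely determined by x, yet the formula reports no relationship, because the relationship is a curve rather than a line. The `pearson` helper is written out by hand for the example.

```python
def pearson(xs, ys):
    """Plain Pearson correlation; it only detects linear association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

xs = list(range(-5, 6))      # -5, -4, ..., 5
ys = [x * x for x in xs]     # y depends entirely on x, but not linearly

r = pearson(xs, ys)
print(f"Pearson r = {r:.2f}")
```

Someone who only knows the formula would conclude that x tells us nothing about y. Someone who understands the limitation would plot the data first.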

The Ugly

Hacking skills are the most likely skill to be ignored in this diagram. This is a new area and it can’t progress without innovation. Even though innovation is widely celebrated, managers do not really like it because the very idea of management is wrapped up in the idea of control (with or without command). In a standardized economy people were interchangeable and had to conform to existing processes. Today the world, even the data world, changes too quickly for standardized processes in most circumstances. To respond, management must be reformed to its core purposes, and I don’t think the discipline is ready to tread these waters.


This view is against Chris Anderson’s view of The End of Theory in favor of dimensionally agnostic statistics. Google is just a tool. Popularity does not equal quality or relevance, as was pointed out with recent concerns about organizations spamming Google results. As the Sloan article Hans quotes states:

Information Must Become Easier to Understand and Act Upon

#LAK11 – Utopian and Dystopian Visions of Analytics: It’s a Question of Validity

Catching up on the beginning of LAK11, which began last week.

George Siemens’ 1-16 post has initiated a discussion on critiques, much of which seems to focus on dystopian critique.

David Jones’ earlier critique is a good example.  His interesting critique is based on his fear of teleological implementation:

This remains my major reservation about all these types of innovations. In the end, they will be applied to institutional contexts through teleological processes. i.e. the change will be done to the institution and its members to achieve some set plan. Implementation will have little contextual sensitivity and thus will have limited quality adoption. . ..

This is what I consider to be a basic modernist approach with only quantitative teleology, that is, final causes can be judged solely through numbers resulting from simple quantitative analyses.

I studied Samuel Messick for my dissertation, and my reading of him was that he was a psychometrician who took seriously the postmodern critique of the 20th-century philosophers of science. His response was that the question of validity could never be answered without both quantitative and qualitative analysis. Messick’s approach has always been seen negatively by those who need the teleological certainty of positivist, quantitative-only answers. This is exactly the simplistic way David fears analysis will be used, and his fear is valid. Not because these tools cannot achieve good things; they could improve our lives tremendously. However, understanding in depth their use and the consequences of their use is a difficult undertaking requiring quantitative and qualitative analysis in its own right. Many people will not be willing to put in that kind of effort. A utopian-leaning vision can only be achieved with hard work and much effort, but a dystopian vision can be achieved with only minimal effort.