Action Analytics: Formative Assessment with an Evidentiary Base

I found Linda Baer’s interview with George Siemens on Action Analytics (AA) interesting because of what I see as a natural synthesis with evidence-based practice.  I see the relationship in the way Linda seems to define AA in three steps:

  1. Identify from the research base what is needed to improve student performance
  2. Develop appropriate measures to generate needed data
  3. Use the data and data technology to guide teacher action while students can still be helped.
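To make the loop concrete, here is a minimal sketch of those three steps in code. The threshold, the engagement measure, and the intervention are hypothetical placeholders of my own, not anything Linda specifies; the point is only that research-based evidence (step 1) sets the trigger, a measure (step 2) generates the data, and the loop acts while there is still time to help (step 3).

```python
# A rough sketch of the three-step Action Analytics loop described above.
# The metric, threshold, and alert are illustrative placeholders only.

RESEARCH_BASED_THRESHOLD = 0.6  # Step 1: a cut point suggested by the research base


def weekly_engagement(logins: int, submissions: int) -> float:
    """Step 2: a simple measure generated from routine activity data."""
    return min(1.0, 0.1 * logins + 0.3 * submissions)


def review_students(roster: dict) -> None:
    """Step 3: use the data to prompt action while students can still be helped."""
    for name, (logins, submissions) in roster.items():
        score = weekly_engagement(logins, submissions)
        if score < RESEARCH_BASED_THRESHOLD:
            print(f"Flag for early outreach: {name} (engagement {score:.2f})")


review_students({"pat": (2, 1), "sam": (8, 3)})
```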

I find it similar to the idea of formative assessment, but with the addition of an evidentiary component.  Formative assessment is about developing feedback during the learning process, in contrast to summative assessment, which occurs after learning.  Summative assessment has only a post-hoc pedagogical purpose, while formative assessment is an integral part of everyday pedagogy.  The major difference between formative assessment and AA is that Linda specifies a place for evidence in the process.

I believe that AA can be relevant beyond the field of education, both as a general methodology for practice and as a way to combine evidence-based practice with the growing field of analytics.  Analytics is most productive when it is integrated into feedback loops in a formative way, and it will be better still when research-based evidence informs the design of those loops and the development of the measures that generate the feedback data.  I expect this integration to be tricky; it will likely require a robust systems approach.

Avoiding Naive Operationalism: More on Lee Cronbach and Improving Analytics

Introduction

Consider again Cronbach and Meehl’s (1955) quote from my last post.
We do believe that it is imperative that psychologists make a place for (construct validity) in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an “operational” methodology which, if adopted, would force research into a mold it does not fit.  (Emphasis added)
What was widespread in 1955 has not substantially changed today.  Construct measures are routinely developed without regard to their construct or consequential validity, to the detriment of our practices.  I will call this state naive operationalism: measuring constructs with what amounts to an operational methodology.  I will also show why it is a problem.

Operational Methodology: Its Origins as a Philosophical Concept

What do Cronbach & Meehl mean by an operational methodology?  Early in my psychological studies I heard intelligence defined as “that which is measured by an intelligence test”.  That is an example of operationalism (or operationism).  Originally conceived by the physicist Percy Bridgman, operationalism holds that the meaning of a term is wholly defined by its method of measurement.  It became popular as a way to replace metaphysical terms (e.g., desire or anger) with a radically empirical definition.  It was briefly adopted by the logical positivist school of philosophy because of its similarity to the verification theory of meaning, and it remained popular for considerably longer in psychology and the social sciences.  Neither use stood up to scrutiny, as Mark Bickhard’s paper notes.
Positivism failed, and it lies behind many of the reasons that operationalism is so pernicious: the radical empiricism of operationalism makes it difficult to understand how science does, in fact, involve theoretical and metaphysical assumptions, and must involve them, and thereby makes it difficult to think about and to critique those assumptions.
Not only does the creation of any measurement contain many underlying assumptions, the meaning of any measurement is also a by-product of the uses to which the measurement is put.  The heart of validity theory in the work of Cronbach (and also of Samuel Messick) lies in analyzing those measurement assumptions and measurement uses through the concepts of construct and consequential validity.  Modern validity theory stands opposed to operationalism.

Operational Definition as a Pragmatic Psychometric Concept

Specifying an operational definition of a measure is operationalism in reverse.  Our measurements operationalize how we are defining a term, not in the abstract, but in actual practice.  When we implement a measurement in practice, that measurement effectively becomes the construct definition within any process that involves that measure.  If the process contains multiple measures, each is only a partial definition; if it is the sole measure, it becomes the sole construct definition.  Any measure serves as an operational definition of the measured construct in practice, but we do not believe (as in operationalism) that the measure subsumes the full meaning of the construct.  Our operational definition is no more than a partial definition, and that is why consequential and construct validity are needed in our methodological thinking.  Validity research tells us when our operational definitions are problematic and may indicate how to improve our measures.  In short, validity research studies the difference between our operational definitions and the construct being measured.
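A toy illustration of this point, with a construct and indicators that are entirely my own invention rather than anything from Cronbach: whichever measure a process actually uses becomes, for that process, the working definition of the construct.

```python
# Hypothetical example: "engagement" operationalized in two different ways.
# Neither function exhausts the construct; each is a partial operational
# definition, and the one a process actually uses defines the construct
# for that process.

def engagement_single(logins_per_week: int) -> float:
    """Sole measure: logins alone become the de facto definition of engagement."""
    return min(1.0, logins_per_week / 10)


def engagement_multi(logins: int, forum_posts: int, hours_on_task: float) -> float:
    """Multiple indicators: each contributes only a partial definition."""
    return min(1.0, 0.3 * logins / 10 + 0.3 * forum_posts / 5 + 0.4 * hours_on_task / 8)


# Validity research asks how far either operational definition falls short
# of the construct we actually care about, and with what consequences.
print(engagement_single(4), engagement_multi(4, 2, 3.0))
```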

Naive Operationalism

For most of us, operationalization outside the larger issue of a research question and conceptual framework is just not very interesting.
I could not disagree more! Leaving validity out of our methodological thinking means that our operationalized processes will result in what I will call naive operationalism.  If we devise and implement measures in practice without regard for their validity, we will fail to understand their underlying assumptions and will be unable to address any validity problems.  In effect, this is just like philosophical operationalism and sets us up for the same problems.  Let’s consider a concrete example to see how it becomes a problem.

An Example of Naive Operationalism

Richard Nantel and Andy Porter both suggest that we do away with performance measurement, which they consider “a Complete Waste of Time”.  These are the reasons given for scrapping performance measurement:
  1. Short-term or semiannual performance reviews prevent big-picture thinking, long-term risk taking, and innovation. We want employees to fail early and often.
  2. Performance systems encourage less frequent feedback and interfere with real-time learning.
  3. Compensation and reward systems are based on faulty incentive premises and undermine intrinsic motivation.
  4. There’s no evidence that performance rating systems improve performance.
Consider each reason in turn:
  1. This critique is really advocating for a different set of constructs.  True, the constructs it implies may not be common in most performance measurement systems, but there is no reason to stay with the standard constructs if they are not a good fit.
  2. There is no reason why formative assessments like action analytics, and other more appropriate feedback structures, could not be part of a performance improvement system.
  3. This is another instance where the wrong constructs, based on out-of-date motivational theories, are being measured.  They are the wrong constructs and therefore the wrong measures.
  4. The consequences of any measurement system are the most important question to ask.  Anyone who does not ask this question should not be managing measurement processes.

Conclusion

What is the bottom line?  Nothing Richard or Andy points out makes the concept of performance measurement wrong.  The measurement systems they describe are guilty of naive operationalism: they treat a specific measure of performance as the only operational definition needed, even if their designers are unaware that this is what they are doing.  No!  We should assess the validity of any measurement system and adjust it according to an integrated view of validity within an appropriate theoretical and propositional network, as Cronbach and Meehl advocated.  Measurement systems of any kind should be based on construct and consequential validity, not on an operational methodology, whether philosophical or naive.

#LAK11 – Validity is the Only Guardian Angel of Measurement (Geekish)

David Jones has posted about the general lament over high-stakes testing and asks, “what’s the alternative?”  You could rephrase this to ask: does measurement help us or hurt us?  Not only has he piqued my interest to think more along these lines, but I think the question is also relevant to data analytics, LAK11, and anyplace else where measurement is used.  So . . . dive in I will!

David cites the association of the testing movement with globalization and managerialization, but I also believe that analytics, appropriately applied, can benefit education in pragmatic, everyday ways.  He also quotes Goodhart’s Law, named for the British economist Charles Goodhart, on the corruptibility of policy measures.  I prefer Donald Campbell’s similar law for this situation, because Campbell was a psychometrician and speaks in testing language.  He states:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

I believe that the problem being discussed is, at its heart, a validity problem.  High-stakes testing is not appropriately bearing the heavy burden assigned to it.  It is not meeting the criterion of consequential validity; it does not produce better long-term educational outcomes.  Cronbach and Meehl (1955) explained why validity is important for situations like this one.

We do believe that it is imperative that psychologists make a place for (construct validity) in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an “operational” methodology which, if adopted, would force research into a mold it does not fit.

This is what I believe they are saying (also taking into account Cronbach’s and Messick’s later developments of the concept of validity).  We measure constructs, not things in themselves.  A construct is defined by the network of associations or propositions within which it occurs (see Cronbach and Meehl, Recapitulation section, #1).  Validity is the rational and empirical investigation of the association between the operational definitions of our constructs (that is, our tests) and that network of associations.  Without this investigation, what we are measuring is operationally defined by our tests, but what that is remains undefined in any meaningful way.  We cannot teach constructs or measure them unless we understand them thoroughly, at least at a theoretical level.  Operationalism has been rejected, whether it was founded in positivist philosophy or in commonsense ignorance.

Most people I have heard advocating for standardized, standards-based, high-stakes testing do so on the principle of school accountability, not because such testing has been unequivocally demonstrated to improve schools.  It is a logical argument, but it seems to lack empirical support.  Teaching to the test is regarded as inappropriate, yet if the test is the sole standard of accountability, then it is the operational definition of what we are to be teaching.  In that case, anything other than teaching to the test seems illogical.

So let’s dig deeper.  I think there are nested psychometric problems within this testing movement.  Campbell’s Law may overtake any attempt to use measurement in policy at a large scale, but I am going to start with how a measurement regime might be better designed.

1. What measurement problems exist with current tests?

Teaching to the test as it is commonly practiced is not good because it is doubtful that the tests are really measuring the right things.  Many unintended things are being measured in these high-stakes standardized tests (technically referred to in validity theory as construct-irrelevant variance).  In many ways our measures are based (operationalized) more on tradition and common sense than on empirically sound psychometrics.  This is what Cronbach warned of: our tests do not match the constructs.  To improve tests we need to go beyond common sense and clarify the constructs we want our students to exhibit.  Why don’t we do this now?  Most likely it is too difficult for policymakers to get their heads around, but there is a possible second reason: it would reduce the validity of the tests as their validity is measured by positivist methodology.  Validity is an overall judgement, but positivists don’t like fuzzy things like judgements.  Tests may need to lose some validity in certain areas in order to gain validity overall.  Many people guiding testing procedures likely hold a narrow view of validity, as opposed to the broader view espoused by Messick or Cronbach.  This leads to other issues.
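One conventional way to picture construct-irrelevant variance (a standard textbook decomposition, not something taken from the posts discussed here) is to think of the variance in observed test scores as splitting into a part driven by the intended construct, a part driven by unintended influences, and error:

$$
\sigma^2_{\text{observed}} \;=\; \sigma^2_{\text{construct-relevant}} \;+\; \sigma^2_{\text{construct-irrelevant}} \;+\; \sigma^2_{\text{error}}
$$

Test anxiety, testwiseness, and familiarity with item formats are classic contributors to the middle term, and that middle term is exactly the part of a score that tells us nothing about what students actually know.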

2. Standards do not Address Many Important Educational Outcomes.

The curriculum, as reflected in standards, is not always focused on the most important knowledge and skills.  I think it reflects three things: a kitchen-sink approach (include the requests of every constituency), a focus on standards that are easily measured by multiple-choice or similar question types, and expert opinion.  The abilities to argue points of view creatively, to write with persuasion and conviction, to read, interpret, discuss, and develop subtle points of meaning among peers, and to track the progression and maturation of these skills over time are all important, and none of them are well measured by current high-stakes tests.  A kitchen-sink approach does not allow teachers to focus on depth.  Assessments like portfolios contain more information and a broader validity base, but are seen as less reliable (i.e., it is possible to cheat or introduce personal bias).  Expert opinion is a type of content validity and is considered the weakest form of validity evidence.  With the development of high-stakes testing we are in a better position to measure the validity of curriculum standards and to adjust those standards accordingly, but I see no one doing this.  Maybe there is some research on the ability of high school students to function as college freshmen, but that outcome is inconsequential in the long view of a life.  Tests should be held accountable for consequential validity and should empirically show that they result in improved lives, not just parroting facts or helping teachers of college freshmen.  It is not just teachers who should be held accountable; it is also test and standards developers.

3. Post-positivist Psychometrics

To be sure, there are trade-offs in any form of measurement.  Sometimes improving validity in one area weakens it in another, and validity never reaches 100% in any situation.  However, because tests are mandated by law, I believe current validity questions favor views of what will be held valid in a court of law.  Law tends to be conservative, and conservative psychometrics are based in philosophical positivism.  I suspect that many people making policy decisions have a poor understanding of what I consider sound psychometrics: psychometrics consistent with post-positivist philosophy.  Let me be clear: positivist psychometrics are not wrong, just incomplete and limited.  This was the insight of Wittgenstein.  Positivism looks at a small slice of life while ignoring the rest of the pie.  Wittgenstein said that if we want to understand language, we should look at how people use language.  Similarly, Samuel Messick said that if you want to understand a test, follow the outcomes: how are people using the test, and what are the results of what they are doing?  This is the most important test of validity.

To sum up

There are many possible answers to David’s question; I have focused on how you might improve testing processes.  Do not focus on tradition and traditional technique, but on standards and testing practices that create authentic value (what Umair Haque would call thick value) for students who will live out their lives in the 21st century, a century shaping up to be quite different from the last.  Testing could be part of the equation, but let’s hold teachers and schools accountable for the value they create as measured in improved lives, not in some questionably valid test score.

#LAK11 Data Science and Analytics: the Good, the Bad and the Ugly

Hans de Zwart posted a great summation of the critiques of big data and its uses.  I will comment on his post in three sections: the good, the bad, and the ugly.

The Good

I like this Dataists Venn diagram of data science, which combines hacking skills (innovating with technology) and math and statistics knowledge with core expertise.

[Figure: Data Science Venn Diagram]

Data science, then, is the combination of machine learning expertise (to deal with the massive amounts of data being generated), traditional research expertise (to process and analyze that information), and a willingness to engage in creative disciplinary innovation to bring these insights into practice (the danger zone).  I think of this as a list of skill and knowledge needs.

The Bad

Most of the naughty list comes from Drew Conway’s original definition of the danger zone and from George Siemens’ 10 concerns.  Drew’s reason for calling it a danger zone was to warn of people who hack (innovate) with poor core knowledge, and George’s list is mostly about data procedures getting away from our intentions.  These are valid concerns, but I think they are fundamentally statistical and measurement concerns.  My take on the problem is this: because of common pedagogy, most people have a rather formulaic understanding of measurement and statistics.  They know how to plug in the numbers, but they are not so good at understanding what they are doing conceptually or which limitations are being violated.  This is a problem not only because they are operating blindly, but also because they miss the inherent limitations of their calculations.  People are blind both to the validity problems they are creating and to what their procedures are actually capable of doing.
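A toy example of what I mean by plugging in numbers without understanding the limitations (the data here are simulated and purely illustrative): Pearson’s correlation assumes a roughly linear relationship, and a formulaic user can compute it on data that violate that assumption and walk away with a badly misleading conclusion.

```python
import numpy as np

# Simulated, purely illustrative data: y depends strongly on x, but not linearly.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = x**2 + rng.normal(0, 0.5, 200)

# Plugging the numbers into the formula still returns a value...
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")  # near zero, suggesting "no relationship"

# ...but the statistic's linearity assumption is violated, so the near-zero r
# hides the fact that x almost completely determines y.
```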

The Ugly

Hacking skills are the part of this diagram most likely to be ignored.  This is a new area, and it cannot progress without innovation.  Even though innovation is widely celebrated, managers do not really like it, because the very idea of management is wrapped up in the idea of control (with or without command).  In a standardized economy people were interchangeable and had to conform to existing processes.  Today the world, even the data world, changes too quickly for standardized processes in most circumstances.  To respond, management must be reformed around its core purposes, and I do not think the discipline is ready to tread those waters.

Conclusion

This view runs counter to Chris Anderson’s argument in The End of Theory in favor of dimensionally agnostic statistics.  Google is just a tool.  Popularity does not equal quality or relevance, as recent concerns about organizations spamming Google results have made clear.  As the Sloan article Hans quotes states:

Information Must Become Easier to Understand and Act Upon

#LAK11 – Utopian and Dystopian Visions of Analytics: It’s a Question of Validity

Catching up on LAK11, which began last week.

George Siemens’ 1-16 post has initiated a discussion of critiques, much of which seems to focus on dystopian concerns.

David Jones’ earlier critique is a good example.  It is based on his fear of teleological implementation:

This remains my major reservation about all these types of innovations. In the end, they will be applied to institutional contexts through teleological processes. i.e. the change will be done to the institution and its members to achieve some set plan. Implementation will have little contextual sensitivity and thus will have limited quality adoption. . ..

This is what I consider a basically modernist approach with a purely quantitative teleology; that is, final causes are judged solely through the numbers resulting from simple quantitative analyses.

I studied Samuel Messick for my dissertation, and my reading of him is that he was a psychometrician who took seriously the postmodern critique of the 20th-century philosophers of science.  His response was that the question of validity can never be answered without both quantitative and qualitative analysis.  Messick’s approach has always been viewed negatively by those who need the teleological certainty of positivist, quantitative-only answers.  This is exactly the simplistic way David fears analytics will be used, and his fear is valid.  Not because these tools cannot achieve good things; they could improve our lives tremendously.  But understanding in depth their use and the consequences of their use is a difficult undertaking, requiring quantitative and qualitative analysis in its own right, and many people will not be willing to put in that kind of effort.  A utopian-leaning vision can only be achieved with hard work and much effort, while a dystopian vision can be achieved with minimal effort.