What to Do about Testing: A Response to Audrey Watters

Audrey Watters posted about John Oliver’s takedown of testing and Pearson Ed. She asks:

How do we seize the opportunity of all this media attention to the problems with standardized testing to do more than talk about testing?  . . . Can we articulate (a better alternative) now so that Pearson and other testing companies don’t replace the old model with simply a re-branded, repackaged one?

Samuel Messick was a Vice President and Distinguished Research Scientist at the Educational Testing Service (ETS). His was an authoritative voice on test validity, advocating for restraint in the use of test scores, for better and more in-depth interpretations of test scores, for the collection of multiple sources of information when making important decisions, and for consideration of the consequences of test use. I believe that much of his legacy has been ignored, co-opted, or argued away (even at ETS, I suspect). I’ll speculate on what he would advocate:

  • using more than one or two sources of information when making complex, important decisions,
  • understanding the information in the context of a decision and considering the consequences of your testing practices, and
  • considering the validity of testing practices in terms of how they fit within an overall set of district practices (e.g., if a student fails, how do you respond?). I suspect I could have argued him into this one.

Technically, Pearson may not be at fault, for it is the district use of tests that is most problematic, but Pearson is at least complicit in not providing better guidance and in not developing ways for districts to collect other sources of information. E.g., the value-added model of teacher assessment needs many more sources of information and in fact does not really provide an assessable model of pedagogy, only largely discredited positivist assertions. The first step is to expose those who advocate positivist models of empiricism that even analytic philosophers would no longer defend.

Finally, it is necessary to look at the overall model of education, which is still primarily built on a mechanistic metaphor with the student as a vessel to be filled. The better metaphor is a biological organism adapting in an environment that is primarily social, networked, and interactive. When Pearson speaks of their “potential game-changer: performance tasks”, they are talking in this direction, but they’re really co-opting performance tasks within the old metaphor. They have a long way to go. We should expunge the mechanistic metaphor from educational leadership and assessment models.

The bottom line for Pearson

You may not be technically wrong in your assessments, but when you’re the butt of a comedic takedown, you should really look at the consequences of your products’ use and attempt to deal with them.

#LAK11 – Validity is the Only Guardian Angel of Measurement (Geekish)

David Jones has posted about the general lament of high-stakes testing and asks: “what’s the alternative?” You could rephrase this to ask: does measurement help us or hurt us? Not only has he piqued my interest to think more along these lines, but I think the question is also relevant to data analytics, LAK11, and any place where measurement is used. So . . . dive in I will!

David cites the association of the testing movement with globalization and managerialization, but I also believe that analytics, appropriately applied, can benefit education in pragmatic, everyday ways. He also quotes Goodhart’s Law, named for the British economist who spoke on the corruptibility of policy measurement. I prefer Donald Campbell’s similar law for this situation because he was a psychometrician and spoke in testing language. He states:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

I believe that the problem being discussed is, at its heart, a validity problem. High-stakes testing is not appropriately bearing the high burden assigned to it. It is not meeting the criteria of consequential validity; it does not produce better long-term educational outcomes. Cronbach and Meehl (1955) explained why validity is important in situations like this one.

We do believe that it is imperative that psychologists make a place for (construct validity) in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an “operational” methodology which, if adopted, would force research into a mold it does not fit.

This is what I believe they are saying (also taking into account Cronbach’s and Messick’s later developments in the concept of validity). We measure constructs, not things in themselves. A construct is defined by the network of associations or propositions within which it occurs (see Cronbach and Meehl, Recapitulation Section #1). Validity is the rational and empirical investigation of the association between our operational definitions of our constructs (that is, our tests) and our network of associations. Without this investigation, what we are measuring is operationally defined by our tests, but what that is remains undefined in any meaningful way. We can’t teach constructs or measure them unless we thoroughly understand them, at least at a theoretical level. Operationalism has been rejected, whether it was founded in positivist philosophy or in common-sense ignorance.

Most people who I’ve heard advocating for standardized, standards-based high-stakes testing do so based on a principle of school accountability, not because such testing has been unequivocally demonstrated to improve schools. It’s a logical argument, but it seems to lack empirical support. Teaching to the test is regarded as inappropriate, but if the test is the sole standard of accountability, then it is an operational definition of what we are to be teaching. In that case, anything other than teaching to the test seems illogical.

So let’s dig deeper.  I think there are nested psychometric problems within this testing movement.  Campbell’s Law may overtake any attempt to use measurement in policy in a large sense, but I am going to start with how a measurement regime might be designed better.

1. What measurement problems exist with current tests?

Teaching to the test as it is commonly practiced is not good, because it is doubtful that tests are really measuring the correct information. There are many unintended things being measured in these high-stakes standardized tests (technically referred to in validity theory as irrelevant variance). In many ways, our measures are based (operationalized) more on tradition and common sense than on empirically sound psychometrics. This is what Cronbach warned of: our tests don’t match our constructs. To improve tests, we need to go beyond common sense and clarify the constructs we desire our students to exhibit. Why don’t we do this now? Most likely it is too difficult for policymakers to get their heads around, but there is a possible second reason: it would reduce the validity of tests as their validity is measured by positivist methodology. Validity is an overall judgement, and positivists don’t like fuzzy things like judgements. Tests may need to be reduced in validity in some areas in order to gain validity overall. Many people guiding testing procedures likely hold a narrow view of validity, as opposed to the broader view espoused by Messick or Cronbach. This leads to other issues.
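To make irrelevant variance concrete, here is a toy simulation in the spirit of classical test theory. All names and numbers here are invented for illustration, not drawn from any real test: an observed score is modeled as the construct of interest plus construct-irrelevant noise, and the more irrelevant variance a score carries, the weaker its correlation with a criterion that depends only on the construct.

```python
import random
import statistics

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n = 5000

# The construct we intend to measure, and a criterion that depends only on it.
construct = [random.gauss(0, 1) for _ in range(n)]
criterion = [c + random.gauss(0, 0.5) for c in construct]

# Two hypothetical tests: one clean, one loaded with construct-irrelevant
# variance (e.g., test-taking savvy, reading speed on a math test).
clean_test = [c + random.gauss(0, 0.3) for c in construct]
noisy_test = [c + random.gauss(0, 1.5) for c in construct]

# The clean test correlates far better with the criterion: irrelevant
# variance dilutes whatever the test says about the construct.
print(round(corr(clean_test, criterion), 2))
print(round(corr(noisy_test, criterion), 2))
```

The point of the sketch is only that a score can be precise, reliable, and still partly about the wrong thing, which no amount of teaching to that score will fix.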

2. Standards do not Address Many Important Educational Outcomes.

The curriculum, as it is reflected in standards, is not always focused on the most important knowledge and skills. I think it reflects three things: a kitchen-sink approach (include the request of every constituency), a focus on standards that are easily measured by multiple-choice or similar types of questions, and expert opinion. The ability to creatively argue points of view, to write with persuasion and conviction, to read, interpret, discuss, and develop subtle points of meaning among peers, and to track the progression and maturation of these types of skills over time: these are important things that are not well measured by current high-stakes tests. A kitchen-sink approach does not allow teachers to focus on depth. Assessments like portfolios contain more information and a broader validity base, but are seen as less reliable (i.e., it’s possible to cheat or include personal bias). Expert opinion is a type of content validity and is considered the weakest form of validity evidence. With the development of high-stakes testing, we are in more of a position to measure the validity of curriculum standards and to adjust standards accordingly, but I see no one doing this. Maybe there is some research on the ability of high school students to function as college freshmen, but this outcome is inconsequential in a long view of one’s life. Tests should be held accountable for consequential validity and should empirically show that they result in improved lives, not just parroting facts or helping teachers of college freshmen. It is not just teachers that should be held accountable; it is also test and standard developers.

3. Post-positivist Psychometrics

To be sure, there are trade-offs in any form of measurement. Sometimes improving validity in one area weakens validity in other areas. Validity never reaches 100% in any situation. However, because tests are mandated by law, I believe current validity questions favor views of what will be held valid in a court of law. Law tends to be conservative, and conservative psychometrics are based in philosophical positivism. I bet that many people making policy decisions have a poor understanding of what I consider to be sound psychometrics: psychometrics that are consistent with post-positivist philosophy. Let me be clear: positivist psychometrics are not wrong, just incomplete and limited. This was the insight of Wittgenstein. Positivism looks at a small slice of life while ignoring the rest of the pie. Wittgenstein said that if we want to understand language, we should look at how people are using language. Similarly, Samuel Messick said that if you want to understand a test, follow the outcomes: how are people using the test, and what are the results of what they are doing? This is the most important test of validity.

To sum up

There are many possible things that could be done in answer to David’s question. I have focused on how you might improve testing processes. Do not focus on tradition and traditional technique, but on standards and testing practices that create authentic value (what Umair Haque would call thick value) for students who will live out their lives in the 21st century, a century that is shaping up to be quite different from the last. Testing could be part of the equation, but let’s hold teachers and schools accountable for the value they create as it is measured in improved lives, not in some questionably valid test score.

Scanning Horizons: Standardization and Practice

Summary: Where standards exist (including their close cousin, evidence-based practice), evaluation instruments leading to change projects are a good way to reduce complexity and improve practice.

My last two jobs involved standardization. One was designing consultative services to assist health care organizations in complying with HIPAA security standards. The other was providing teacher professional development where the primary concern centered on improving student proficiency scores on tests of graduation standards. The purpose of standardization is to reduce variability, improve quality, facilitate measurement, and change practices.

Compliance is generally achieved by standardizing practice. The HIPAA measures I constructed were designed to use published HIPAA standards to assess an existing level of practice and to provide a logical path in designing projects to bring practices into compliance. The problem with educational standards is that state standards are not directed toward practice (which is easily measured and easily brought into compliance); they are directed toward performance (which is easily measured, but whose variability is subject to a wide variety of social, environmental, cognitive, and developmental variables). It would make better sense to have two sets of standards: one for student performance and one for teaching practice. Practice standards would specify the materials and instructional sequences necessary to achieve compliance on performance assessments.

Why hasn’t this been done? There are four things that make this difficult:

  • Education is not under the control of the federal government, and the undertaking (if it reflects all 12 years of education) may be too large for individual states and municipalities.
  • Materials would have to be standardized and there are many textbook providers, all in competition with each other.  Choosing one set of materials would put others out of business.
  • Standardization of practice would reduce teachers’ control, and teachers are not ready to accept this and to think of themselves as standards implementers.
  • Standardized education would work for an industrialized economy, but not for a creative economy. Just as the Army is always being reorganized to fight the last war, education is often organized to serve the last age. Still, this would be a much more efficient way to organize the 40 to 50% of educational needs that could be standardized.

Well, I found teacher professional development to be a somewhat frustrating endeavor. Standardization is not the only way to go, and you could even say this was teaching to the test. I would fully support moving to something other than a high-stakes testing regime, but if you’re going to use these tests, and if schools and teachers are strongly focused on them, they imply a standardization process, and anything else is a mismatch. The organization I worked for was a master of standardization in educational practice, but the program they tried to implement was not in this ballpark.

How to Increase Knowledge Worker Productivity: The Drucker Question

Tony Karrer has re-posed a question originally asked by Peter Drucker: how do you increase the productivity of knowledge (concept) workers? Since more jobs depend on knowledge today, we might just as well ask: how do we support productivity in the 21st century? I suggest that significant organizational learning is the best measure of productivity in a knowledge-intensive environment. The following four things could be measured as evidence of significant learning:

  1. An increase in organizational capabilities.
  2. An increase in innovation.
  3. The development of a maturity framework with evidence-(standards)-based practices for improving the quality of repeatable or routine processes.
  4. An increase in soft skills relating to non-routine and networking processes, which could be measured by the extent and the strength of an organization’s internal and external networks.
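The fourth point is the most readily quantifiable. As a rough sketch (the collaboration log and the "ext:" naming convention are invented for illustration), if you record who works with whom, network extent and strength can be summarized with simple graph measures such as density and average tie strength:

```python
# Hypothetical collaboration log: (person_a, person_b, interactions_this_quarter).
# External partners are marked with an invented "ext:" prefix.
ties = [
    ("ana", "ben", 14), ("ana", "chris", 3), ("ben", "dee", 7),
    ("chris", "dee", 1), ("dee", "ext:vendor_x", 5), ("ana", "ext:uni_lab", 2),
]

people = {p for a, b, _ in ties for p in (a, b) if not p.startswith("ext:")}
internal = [t for t in ties if not (t[0].startswith("ext:") or t[1].startswith("ext:"))]
external = [t for t in ties if t[0].startswith("ext:") or t[1].startswith("ext:")]

# Extent: how many of the possible internal pairs actually collaborate.
possible_pairs = len(people) * (len(people) - 1) / 2
density = len(internal) / possible_pairs

# Strength: average interactions per tie, internal vs. external.
avg_internal = sum(w for _, _, w in internal) / len(internal)
avg_external = sum(w for _, _, w in external) / len(external)

print(f"internal density: {density:.2f}")
print(f"avg internal tie strength: {avg_internal:.1f}")
print(f"avg external tie strength: {avg_external:.1f}")
```

Tracked over time, rising density and tie strength would be one observable proxy for the soft-skill growth described above, though certainly not the whole story.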

I believe that organizational learning is facilitated by individuals, but I would not consider it synonymous with individual learning.  I’m not sure exactly how to measure individual learning, which might vary with different contexts.  I do believe that employers can focus mostly on the measurement of organizational learning and maybe on individual contributions to organizational learning.

There is certainly enough here to keep me thinking for some time.  Thanks for the question Tony!

A Measure of Process Standards can become a Key to Unleashing Creativity

When you’re a hammer, everything looks like a nail. I have to be careful, or everything looks like a measurement opportunity to me. Nonetheless, I can’t deny that there seem to be opportunities to implement better measures supporting evidence-based practice (as I suggested in my last post). I think the process would go something like this:

  • Identify and scope out the domains of interest that are important to you.
  • Conduct systematic reviews to establish a description of the processes that represent best practices within each domain.
  • Develop a descriptive questionnaire to allow an organization to compare its current practice with best practices.
  • Initiate a change project based on a capability maturity model of process change.

The best-practice questionnaire becomes the focal point. It is the measure of your organization’s current performance, and it provides a prescription for where you’re headed. It’s easy to understand. Also necessary are outcome measures that provide feedback on the validity of the standards for your organization.
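A minimal sketch of how such a questionnaire could drive a change project (the practice items, the 1-5 maturity scale, and the scores are all hypothetical): each best practice is rated at a current and a target maturity level, and the largest gaps become the prioritized project backlog.

```python
# Hypothetical questionnaire results: (practice, current_level, target_level),
# rated on a simple 1-5 capability maturity scale.
assessment = [
    ("Documented review process", 2, 4),
    ("Outcome measures collected", 1, 4),
    ("Peer feedback loop", 3, 4),
    ("Standard onboarding materials", 4, 4),
]

# Gap analysis: sort so the largest gaps surface first as change priorities.
gaps = sorted(
    ((target - current, practice) for practice, current, target in assessment),
    reverse=True,
)

for gap, practice in gaps:
    status = "on target" if gap == 0 else f"gap of {gap}"
    print(f"{practice}: {status}")
```

The same instrument, re-administered later, doubles as the progress measure, which is what makes the questionnaire the focal point of the whole cycle.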

Two caveats:

1. Complete consensus may not be possible, but at least consensus within a prescribed paradigm should be expected. What the instrument would have the potential to do is focus research within a paradigm and provide a research platform for many organizations to conduct their own improvement projects in the management discipline, similar to what Six Sigma has done for manufacturing.

2. This leads to one final caveat. This approach is not the be-all and end-all of management decision-making. What it does is provide a framework to organize and scaffold your thinking around evidence-based practice. Science can only provide you with standards, with a description of what has been proven to work in the abstract. Not everything can be proven by science; not everything can be summarized in a standard process. What standards do is tell you: these things work, stop re-inventing the wheel. Put these things into place, and then place your development focus on the contextual, the relational, the imaginative, and other areas where empirical science is less helpful. Knowing where to put your creativity: that’s the real benefit of standards.