#LAK11 – Validity is the Only Guardian Angel of Measurement (Geekish)

David Jones has posted about the general lament over high-stakes testing and asks: “what’s the alternative?”  You could rephrase this to ask: does measurement help us or hurt us?  Not only has he piqued my interest to think more along these lines, but I think the question is also relevant to data analytics, LAK11, and any place where measurement is used.  So . . . dive in I will!

David cites the association of the testing movement with globalization and managerialization, but I also believe that analytics, appropriately applied, can benefit education in pragmatic everyday ways.  He also quotes Goodhart’s Law, named for the British economist Charles Goodhart, which speaks to the corruptibility of policy measurement.  I prefer Donald Campbell’s similar law for this situation because Campbell was a psychometrician and speaks in the language of testing.  He states:

The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

I believe that the problem being discussed is, at its heart, a validity problem.  High-stakes testing does not appropriately bear the heavy burden assigned to it.  It is not meeting the criterion of consequential validity; it does not produce better long-term educational outcomes.  Cronbach and Meehl (1955) explained why validity is important for situations like this one.

We do believe that it is imperative that psychologists make a place for (construct validity) in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an “operational” methodology which, if adopted, would force research into a mold it does not fit.

This is what I believe they are saying (also taking into account Cronbach and Messick’s later developments in the concept of validity).  We measure constructs, not things in themselves.  A construct is defined by the network of associations or propositions within which it occurs (see Cronbach and Meehl, Recapitulation section #1).  Validity is the rational and empirical investigation of the association between our operational definitions of our constructs (that is, our tests) and that network of associations.  Without this investigation, what we are measuring is operationally defined by our tests, but what that is remains undefined in any meaningful way.  We can’t teach constructs or measure them unless we thoroughly understand them, at least at a theoretical level.  Operationalism has been rejected, whether it was founded in positivist philosophy or in common-sense ignorance.

Most people whom I’ve heard advocating for standardized and standards-based high-stakes testing do so based on a principle of school accountability, not because such testing has been unequivocally demonstrated to improve schools.  It’s a logical argument, but it seems to be lacking empirical support.  Teaching to the test is regarded as inappropriate, but if the test is the sole standard of accountability, then it is an operational definition of what we are to be teaching.  In that case, anything other than teaching to the test seems illogical.

So let’s dig deeper.  I think there are nested psychometric problems within this testing movement.  Campbell’s Law may overtake any large-scale attempt to use measurement in policy, but I am going to start with how a measurement regime might be designed better.

1. What measurement problems exist with current tests?

Teaching to the test as it is commonly practiced is not good because it is doubtful that the tests are really measuring the right things.  There are many unintended things being measured in these high-stakes standardized tests (technically referred to in validity theory as construct-irrelevant variance).  In many ways, our measures are based (operationalized) more on tradition and common sense than on empirically sound psychometrics.  This is what Cronbach warned of: tests that don’t match their constructs.  To improve tests we need to go beyond common sense and clarify the constructs we desire our students to exhibit.  Why don’t we do this now?  Most likely it is too difficult for policymakers to get their heads around, but there is a possible second reason.  It would reduce the validity of tests as their validity is measured by positivist methodology.  Validity is an overall judgement, but positivists don’t like fuzzy things like judgements.  Tests may need to be reduced in validity in some areas in order to gain validity overall.  Many people guiding testing procedures likely hold a narrow view of validity, as opposed to the broader view espoused by Messick or Cronbach.  This leads to other issues.

2. Standards do not Address Many Important Educational Outcomes.

The curriculum, as it is reflected in standards, is not always focused on the most important knowledge and skills.  I think it reflects three things: a kitchen-sink approach (including the requests of every constituency), a focus on standards that are easily measured by multiple-choice or similar types of questions, and expert opinion.  The ability to creatively argue points of view, to write with persuasion and conviction, to read, interpret, discuss and develop subtle points of meaning among peers, and to track the progression and maturation of these types of skills over time are important things that are not well measured by current high-stakes tests.  A kitchen-sink approach does not allow teachers to focus on depth.  Assessments like portfolios contain more information and a broader validity base, but are seen as less reliable (i.e., it’s possible to cheat or to include personal bias).  Expert opinion is a type of content validity and is considered the weakest form of validity evidence.  With the development of high-stakes testing, we are in more of a position to measure the validity of curriculum standards and to adjust standards accordingly, but I see no one doing this.  Maybe there is some research on the ability of high school students to function as college freshmen, but this outcome is inconsequential in a long view of one’s life.  Tests should be held accountable for consequential validity and should empirically show that they result in improved lives, not just in parroting facts or helping teachers of college freshmen.  It is not just teachers who should be held accountable; so should test and standards developers.

3. Post-positivist Psychometrics

To be sure, there are trade-offs in any form of measurement.  Sometimes improving validity in one area weakens validity in other areas.  Validity never reaches 100% in any situation.  However, because tests are mandated by law, I believe current validity questions favor views of what will be held valid in a court of law.  Law tends to be conservative, and conservative psychometrics are based in philosophical positivism.  I bet that many people making policy decisions have a poor understanding of what I consider to be sound psychometrics, psychometrics that are consistent with post-positivist philosophy.  Let me be clear: positivist psychometrics are not wrong, just incomplete and limited.  This was the insight of Wittgenstein.  Positivism looks at a small slice of life while ignoring the rest of the pie.  Wittgenstein said that if we want to understand language, we should look at how people are using language.  Similarly, Samuel Messick said that if you want to understand a test, follow the outcomes: how are people using the test, and what are the results of what they are doing?  This is the most important test of validity.

To sum up

There are many possible things that could be done in answer to David’s question.  I have focused on how you might improve testing processes.  Do not focus on tradition and traditional technique, but on standards and testing practices that create authentic value (what Umair Haque would call thick value) for students who will live out their lives in the 21st century, a century that is shaping up to be quite different from the last.  Testing could be part of the equation, but let’s hold teachers and schools accountable for the value they create as it is measured in improved lives, not in some questionably valid test score.

#LAK11 – Utopian and Dystopian Visions of Analytics: It’s a Question of Validity

Catching up on the beginning of LAK11, which began last week.

George Siemens’ 1-16 post has initiated a discussion on critiques, much of which seems to focus on dystopian critique.

David Jones’ earlier post is a good example.  His interesting critique is based on his fear of teleological implementation:

This remains my major reservation about all these types of innovations. In the end, they will be applied to institutional contexts through teleological processes. i.e. the change will be done to the institution and its members to achieve some set plan. Implementation will have little contextual sensitivity and thus will have limited quality adoption. . ..

This is what I consider to be a basically modernist approach with a purely quantitative teleology; that is, final causes are judged solely through numbers resulting from simple quantitative analyses.

I studied Samuel Messick for my dissertation, and my reading of him was that he was a psychometrician who took seriously the postmodern critique of the 20th-century philosophers of science.  His response was that the question of validity could never be answered without both quantitative and qualitative analysis.  Messick’s approach has always been seen negatively by those who need the teleological certainty of positivist, quantitative-only answers.  This is exactly the simplistic way David fears analysis will be used, and his fear is valid.  Not because these tools cannot achieve good things; they could improve our lives tremendously.  However, understanding in depth their use and the consequences of their use is a difficult undertaking requiring quantitative and qualitative analysis in its own right.  Many people will not be willing to put in that kind of effort.  A utopian-leaning vision can only be achieved with hard work and much effort, but a dystopian vision can be achieved with only minimal effort.

Testing Can Be for Learning: The Retrieval Effect

Tests get high marks as a learning tool, by Anne McIlroy, science reporter for the Globe and Mail

This article reports on a testing effect – testing students improves their learning.  It’s also called a retrieval effect, and it can be achieved through other activities that demand recall.  I would say it is also active learning, in that students are using information in a different type of activity (answering test questions as opposed to the original task, such as listening to a lecture or reading a book).  (The article claims the same effect for a good pedagogical activity where students pair off after reading a book or passage, one summarizing the reading and the other criticizing the summary, exchanging roles for the next reading.)

Note – This supports the validity of testing as a learning activity, but it does not justify the validity of using testing for other purposes.  Most testing is for rating, ranking, segmenting students into groups, or planning instruction.  The fact that testing is a good learning activity doesn’t justify its use for these other purposes, which should be judged on their own terms.  Testing should be used as a way of achieving success for students, not as a way to rate their final success.  I saw this effect in my daughter’s schooling.  She learned through testing activities and often had decent command of the content after a test was completed, but the grade, and often her identity, was already assigned.  This identity is not about empty self-esteem.  Identity is deeper and more important than self-esteem.

Evidence-based Testing (Assessment)

I confess: I love the 35,000-foot view – an article by an old pro who gives us his overview of the future of his field.  This is Howard Wainer’s 14 Conversations About Three Things.  His intended audience is researchers of the 21st century.  His three things are: what skills they will need (see #1 below), what problems are worth investigating (see #2), and what topics are not (see #3).

What jumped out at me was the topic of Evidence-Based Test Design (EBTD) and the premise behind his recommendation.  (I have more study to do, but EBTD seems to be testing designed with validity in mind.)  His premise is that statistical analysis has been very well researched and that we can get more bang for the buck by focusing on improvements in test design.  We have done a better job of improving data analysis than we have of improving data collection.  I think this premise holds true across society (education, business, science, etc.).  We are generally better at analysis than we are at data collection.  In many cases it is garbage in – garbage out.  It’s not that analysis is unimportant; it’s just that the easiest way to improve analysis is to improve the data / information that forms the basis of the analysis.  How do we do this?  By designing measures with greater validity.
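
To make the garbage in – garbage out point concrete, here is a minimal simulation sketch (my own illustration, not from Wainer’s article; the variable names and the 0.6 “true” correlation are assumptions chosen for the example).  No amount of additional analysis of the noisy measure recovers the relationship that better data collection preserves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A "true" construct and an outcome it genuinely predicts (true r is about 0.6).
construct = rng.normal(size=n)
outcome = 0.6 * construct + 0.8 * rng.normal(size=n)

def observed_r(measurement_noise_sd):
    # What we actually analyze is a noisy measurement of the construct;
    # the noise is the "garbage" that no downstream analysis can remove.
    measured = construct + rng.normal(scale=measurement_noise_sd, size=n)
    return np.corrcoef(measured, outcome)[0, 1]

print(f"well-designed measure: r = {observed_r(0.0):.2f}")  # close to 0.6
print(f"noisy measure:         r = {observed_r(1.0):.2f}")  # attenuated to roughly 0.4
```

The same statistical machinery runs in both cases; only the quality of the data collection differs.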

  1. Six skills needed by 21st Century Researchers: Bayesian Methods, (Modeling) Causal Inference, (Dealing with) Missing Data, (Graphic representation for) Picturing Data, Writing Clear Prose, A Deep Understanding of Type I and Type II Errors.
  2. Important topics for 21st Century investigation: Evidence-Based Test Design, Value-Added (statistical) Models, New Kinds of Data (mostly made possible by computer networks), The Integration of Computerized Adaptive Testing, Diagnostic Testing and Individualized Instruction.
  3. Topics that can be given a rest: Differential Item Functioning, The Rasch Model, Factor Analysis / Path Models, New Measures of Reliability.

References

Wainer, H. (2010). 14 Conversations About Three Things. Journal of Educational and Behavioral Statistics, 35(1), 5-25.

One Description of Science and the Basis for an Argumentative Approach to Validity Issues

I came across an interesting metaphor for science (and structural ways of understanding in general) in the Partially Examined Podcast, Episode #8.  Here is my take on the metaphor.

Imagine the world as a white canvas with black spots on it.  Over that, lay a mesh made of squares and describe what shows through the mesh.  We are describing the world, but only as it shows through the mesh.  Change the mesh in size or in shape and we have a new description of the world.

Now, these descriptions are useful and allow us to do things, but they are not truth; they are descriptions.  They may be highly accurate descriptions of an actual world, but they are still descriptions.  This is how science functions and how science progresses and changes.  It is also why I advocate an argumentative approach to validity in the use of scientific structures like assessment or the use of evidence.  Old forms of validity (dependent on criterion validity) and much of the current discussion of evidence-based approaches are about accuracy in certain forms of description.  But we must also allow for discussions of the mesh (to return to the metaphor).  As in construct validity, any discussion of how the world is must also include a discussion of how the mesh interacts with the world to create the description.
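
Here is a toy sketch of the mesh metaphor (my own illustration; the spot coordinates and mesh sizes are arbitrary assumptions).  The same canvas yields different, equally accurate descriptions depending on the mesh laid over it.

```python
# The same "spotted canvas" described through two different meshes.
canvas = {(2, 3), (5, 1), (5, 2), (8, 8), (9, 8)}  # coordinates of the black spots

def describe(mesh_size):
    """Report which mesh cells contain at least one spot."""
    return sorted({(x // mesh_size, y // mesh_size) for x, y in canvas})

print("fine mesh  :", describe(1))  # five occupied cells, spots kept distinct
print("coarse mesh:", describe(5))  # three occupied cells, nearby spots merge
```

Neither description is wrong; each is a description of the world as it shows through a particular mesh, which is why the mesh itself has to be part of the validity discussion.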

In addition to methods like randomized controlled trials (RCTs), there is also a need for research into how we understand and rethink the assumptions and the things that sometimes go unexamined in research.  RCTs are very good at helping us do things with very accurate descriptions (like describing linear causal processes).  We also need research that uses other meshes, meshes that will allow us to understand in new ways and facilitate our ability to do new and different things; to make progress.

Mathematics in the Real World: Are Your Uses of Numbers Valid?

It is my premise that most people do not really understand how to use mathematics strategically in a concrete world.  They don’t think much about what the numbers mean, and meaning is everything if you want to know what the numbers are doing.  At its heart, math is an abstraction: an idea that is not connected to real-world circumstances.  (See Steven Strogatz’s NY Times article for a detailed look at math and its misuse in education pedagogy.)

The trick to understanding and using math in the real world can often be traced to how we devise the measurements that define the meaning of the numbers that are then treated mathematically.  Let’s look at some problems relating to the use of numbers and how their meaning is misunderstood.

Problem #1 Educational Testing – Measurement should always be designed to serve a goal; goals should never be designed to fit a measurement protocol.  This is why proficiency testing will never help education, and it is the core idea behind a recent New York Times editorial by Susan Engel.  Current public school measures do not reflect the capabilities we need to develop in students.  It’s not bad that people teach to the test; what’s bad is that the test itself is not worth teaching to.

      Our current educational approach — and the testing that is driving it — is completely at odds with what scientists understand about how children develop . . . and has led to a curriculum that is strangling children and teachers alike.

      (Curriculum should reflect) a basic precept of modern developmental science: developmental precursors don’t always resemble the skill to which they are leading. For example, saying the alphabet does not particularly help children learn to read. But having extended and complex conversations during toddlerhood does. (What is needed is) to develop ways of thinking and behaving that will lead to valuable knowledge and skills later on.

The problem we see in current testing regimes is that we’re choosing to test for things like alphabet recall for two reasons.

1. We base measures on common-sense linear thinking, like the idea that you must recognize letters before recognizing words, before using words to build statements.  But in fact (as Ms. Engel’s article points out) the psychological process of building complex conversations is the developmental need for students, and that is rather unrelated to how thought is considered in schools and how curriculum is developed.  Developmental needs should be studied for scientific validity and not left to common sense.
2. The current measurement protocols behind proficiency testing are not very good at measuring things like the ability to participate in complex conversations; such abilities simply don’t translate well into multiple-choice questions.  We could develop rubrics to do that, but it would be hard to prove that the rubrics were being interpreted consistently (see the sketch after this list).  So instead we test abilities that fit the testing protocol, even if they are rather irrelevant (read invalid) to the capabilities that we really desire to foster.
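
Checking whether raters interpret a rubric consistently is itself a tractable measurement problem.  Below is a minimal sketch of Cohen’s kappa, a standard chance-corrected index of inter-rater agreement (the rubric scores are hypothetical, made up for illustration).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same items."""
    n = len(rater_a)
    # Observed agreement: proportion of items given the same score by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal score distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical raters scoring ten essays on a 1-4 rubric.
rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2]
print(round(cohens_kappa(rater_a, rater_b), 2))  # about 0.58, moderate agreement
```

A low value would be a signal to tighten the rubric or train the raters, not proof that richer assessment is impossible.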

Problem #2 Business Analytics – Things like analytics and scientific evidence are used in ways that relate mostly to processes and activities that can be standardized.  These are ways of doing things where there is clearly a best way to do it that can be scientifically validated and is repeatable.  The problem occurs when we try to achieve this level of certainty in everything, even if there is little that science can say about the matter.  Math is not about certainty, it’s about numbers.

      The problem, says (Roger) Martin, author of a new book, The Design of Business: Why Design Thinking is the Next Competitive Advantage, is that corporations have pushed analytical thinking so far that it’s unproductive. “No idea in the world has been proved in advance with inductive or deductive reasoning,” he says.

      The answer? Bring in the folks whose job it is to imagine the future, and who are experts in intuitive thinking. That’s where design thinking comes in, he says.

The problem with things like Six Sigma and business analytics is that you need to understand what the method is doing mathematically and not just follow a process.  If you’re just applying it and you don’t understand what it’s doing, you’ll try to do things that make no sense.  It’s not usually a problem with the mathematical procedures; it’s a problem with what the numbers mean: how the numbers are derived and what’s being done as a result of the calculations.  There is nothing worse than following a procedure without understanding what that procedure is doing or accomplishing.  Martin’s basic thought that innovation and proof are incompatible is false.  The real problem is a lack of understanding of how mathematics and proof can be used in concrete situations.

Problem #3 – The use of the bell curve in annual reviews and performance management.

A recent McKinsey article (Why You’re Doing Performance Reviews All Wrong, by Kirsten Korosec) generated a lot of negative comments from people forced to make their reviews correspond to a bell curve.  In statistics we know that if you take a large enough random sample of many naturally varying qualities, the resulting distribution will resemble a bell curve: large in the middle and tapering off at either end.  But performance management is about fighting the bell curve; it’s about improving performance and moving the bell curve.  If you have to fit your reviews to a bell curve, you’re making performance look random.  That’s exactly what you do not want to do.  Once again we see a management practice that uses mathematics without understanding what the mathematics is doing.
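
A quick simulation sketch of why a forced curve misrepresents improvement (all numbers are hypothetical; the team size, score scale, and 10% cutoff are assumptions chosen for illustration).

```python
import random

random.seed(0)

# A hypothetical team of 100 people: everyone improves year over year,
# just by different amounts.
baseline = [random.gauss(70, 5) for _ in range(100)]
improved = [score + random.uniform(5, 15) for score in baseline]

# A forced-curve review still labels the bottom 10% by rank as "underperformers",
# no matter how much absolute improvement they made.
cutoff = sorted(improved)[len(improved) // 10]
labeled_underperformers = sum(score < cutoff for score in improved)

avg_gain = sum(i - b for i, b in zip(improved, baseline)) / len(baseline)
print(f"average improvement: {avg_gain:.1f} points")
print(f"still labeled as underperformers: {labeled_underperformers} of {len(improved)}")
```

The shape of the distribution says nothing about the improvement the review is supposed to be managing.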

What’s needed?  The valid use of mathematics, not the random use

The basic problem is that mathematics is abstract, but human activity is concrete.  If we want to bridge these two worlds (and, as Strogatz explains it, they really seem like parallel universes) we must build a bridge of understanding that is called validity.  Validity is really the scientific study of how the concrete is made abstract and how the abstract is made concrete.  It’s an explicit theory of how the scope of activities can be represented by numbers, laid out so that it can be argued and understood.  You can do amazing things with mathematics in the real world, but only if you understand what you are doing, if you understand how the abstract and the concrete are related.  You must understand how numbers can represent and are related to the world of human activity.

Analyzing the Research Practice Gap Via Inductive and Deductive Logics

The following is an attempt to further refine my views on evidence-based practices.

If evidence-based practice is going to gain traction as a movement, a better way is needed, not necessarily to translate research into practice, but rather to devise a way to gather evidence in support of practice and practice development.  I will suggest that the evidence-based movement is interested in making inferences about the validity of practices given an overview of the relevant evidence.  In general, determining the validity of practices means gathering evidence through inductive processes.  This contrasts with much research, which is conducted according to the requirements of deductive analyses.  Constructing evidence-based practices may depend on research acquired through deductive processes, but culling evidence for the validity of a practice requires inductive processes.

It has been recognized that there are differences between practice and research in how they categorize information and approach problems.  Hodgkinson and Rousseau (2009) note that: “the research–practice gap is due to more than language and style differences between science and practice. The categories scientists and practitioners use to describe the things they focus on are very different”.  Cohen (2007) also considers differences in the way problems are approached.  “There is a fundamental difference between how academics approach the analysis of a problem and how practitioners focus on a problem” (p. 1017).

Consider that researchers often construct information in terms of independent and dependent variables and that their relationships are investigated through rigorous forms of deductive logic.  There are very good reasons for researchers to do this.  Deduction does not equal truth from a philosophical perspective, but as Goodman (1978) puts it: “Among the most explicit and clearcut standards of rightness we have anywhere are those for validity of a deductive argument”.  The problem is that applying deductive logic to practices is difficult at best and often impossible due to the complexity and scope of most practices.  Even when research is conducted through inductive methodologies, the rigorous treatment of variables and categorization schemes does not often approach the scope and complexity of processes we see in practice.  Consider the example of practice provided by Cohen in regard to the evidence supporting the use of testing in hiring and talent management practices.

      Human resource managers must “select individuals who will fit the organization as well as have the technical capability to do the job. A person who is smart and who gets “results” may be a person who discriminates, bullies, or causes turnover in the organization.   . . .  Intelligence and personality are only partially predictive of success in candidates for management positions. Factors such as accurate job descriptions, effective organizational structure, sound compensation philosophies and reward structures must all be considered in both attracting and selecting employees.

Human resource managers cannot restrict their practices to the rigorous analysis of a few simple variables.  Often they must gather a wide scope of information and analyze that information in terms of a goodness of fit between the individual and a position.

Practitioners, then, must use methods that are appropriate for practices that often involve a constellation of variables and numerous related theories.  The methodological requirements and standards of deductive experimental research often cannot be adapted to such complex circumstances.  There are simply too many variables that are often too loosely defined.  When looking at a specific practice, we are most interested not in determining whether the practice is right or wrong, but in whether it is valid.  Validity is an inference, a conclusion based on evidence and reasoning.  I am speaking in generalities.  It may be possible to devise an experiment that provides a deductive inference about a practice, but in general, most conclusions about validity are made by inductively compiling information that supports an inference of goodness of fit between the practice and the evidence.

Subsuming and combining multiple arguments, research findings and theoretical frameworks is a job for inductive processes, even though inductive logic may be seen as a philosophical step removed from the truth.  In addition, language imposes yet another layer of complexity.  Scientists have rather standardized ways of talking about research methodology, theories and variables, but practitioners have to deal with the much more pluralistic ways in which practice is framed and the language that is used to frame it.  Again, the nature of the inductive process necessary to guide practice is one whose truth value is not necessarily determined by its correctness, but by its overall goodness of fit (Goodman, 1978).

In my next post I’ll look at what type of information is useful in inferences about the validity of practice.

References

Cohen, D. J. (2007). The Very Separate Worlds of Academic and Practitioner Publications in Human Resource Management: Reasons for the Divide and Concrete Solutions for Bridging the Gap. Academy of Management Journal, 50(5), 1013-1019.

Goodman, N. (1978). Ways of Worldmaking. Indianapolis, IN: Hackett Publishing Company.

Hodgkinson, G. P. & Rousseau, D. M. (2009). Bridging the Rigour-Relevance Gap in Management Research: It’s Already Happening! Journal of Management Studies, 46(3), 534-546.

Naturalistic Decision-Making or Algorithmic Practice: Which is Appropriate and When?

An interesting article in APA’s American Psychologist:

Kahneman, D. & Klein, G. (2009). Conditions for Intuitive Expertise: A Failure to Disagree. American Psychologist, 64(6), 515-524.

The question: what works best, the intuition of expert decision-makers (Naturalistic Decision Making) or a statistical prediction algorithm (the Heuristics and Biases approach)?

The answer, of course: it depends on the context.

Intuition (which is presented as a form of pattern recognition) works well when the context includes clear and consistent patterns and the expert has ample opportunities to practice recognition.

      Where simple and valid clues exist, humans will find them given sufficient experience and enough rapid feedback. (p. 523)

This expert pattern-recognition type of decision-making is especially relevant when time is a factor, as in nursing or firefighting.  In situations where there are contra-indications, an algorithmic approach would be warranted, but the authors note there may be a potential for pushback from practitioners.

An important point here is that an evidence-based approach is portrayed not as a simplistic application of science, but rather as the development of a specific practice-oriented algorithm – a scientific extension of the practice.

Contra-indications for a naturalistic decision-making process would include:

• weak or difficult-to-detect patterns (e.g., high ceiling effects),
• a lack of feedback,
• feedback that arrives only over long time periods, or situations involving wicked problems where the feedback is misleading.

Contra-indications for a heuristics-and-biases (algorithmic) approach include the lack of:

• adequate knowledge about the relevant variables,
• a reliable criterion,
• a body of similar cases,
• a cost-benefit ratio that allows for algorithm development,
• a low likelihood of changing conditions (conditions that change would render the algorithm obsolete).

The authors also note that algorithmic approaches should be closely monitored for changing conditions.

My take: Kahneman and Klein set up their discussion as a debate between themselves and discuss the different approaches primarily as an either-or choice.  I value their clarifications, but I would like to think about the many other situations where algorithms would be appropriate for supplementing, not replacing, naturalistic decision-making.  For instance, they use nursing diagnosis as an example of a reliable intuition space.  In some situations it is appropriate to rely on intuition; however, diagnosis is a complex task that can include a large amount of data that can be combined in different ways.  I’ll have to look at the literature to see if there is a counter-example for naturalistic decision-making.  I’m not saying that naturalistic decision-making is inappropriate in many situations, only that they seem to be shortchanging algorithmic approaches.  There are also indications that these two authors are not sharing a philosophical or heuristic framework.  My bet is that the positivist side is overstating naturalistic bias (which means failing to see their own) and the naturalistic side is ignoring sources of bias when it suits them (throwing out the scientific baby with the bath water).  Again, this points to a need for a framework that can bring people with different perspectives into true communication and exchange.

Howe’s Critique of a Positivist Evidence-based Movement with a Potentially Valid Way Forward

A summary of Kenneth Howe’s article criticizing positivism and the new orthodoxy in educational science (evidence-based education):

Howe, K. R. (2009). Epistemology, Methodology, and Education Sciences: Positivist Dogma, Rhetoric, and the Education Science Question. Educational Researcher, 38(6), 428-440.

Keywords: Philosophy; politics; research methodology

      “Although explicitly articulated versions (of positivism) were cast off quite some time ago in philosophy, positivism continues to thrive in tacit form on the broader scene . . . now resurgent in the new scientific orthodoxy.” (p.428)

      (A positivist stance on science) has sought to “construct a priestly ethos – by suggesting that it is the singular mediator of knowledge, or at least of whatever knowledge has real value . . . and should therefore enjoy a commensurate authority” (Howe quoting Lessl, from Science and Rhetoric).

Howe traces the outline of this tacit form of positivism through the National Research Council’s 2002 report titled Scientific Research in Education and relates this report to three dogmas of positivism:

1. The quantitative–qualitative dichotomy – A reductionist dogma that had the consequence of limiting the acceptable range of what could be considered valid in research studies.
2. The fact–value distinction – An attempt to portray science as a value-free process, with the effect of obscuring the underlying values in operation.
3. The division between the sciences and the humanities – Another positivist distinction designed to limit any discussion to a narrow view of science.

Howe’s article does a good job of summarizing the general critiques of positivist methodology, which include: (1) its overall claims could not stand up to philosophical scrutiny, (2) it tended not to recognize many of its own limitations, including failing to apply adequate standards to itself, and (3) it harbored a political agenda that sought to stifle and block many important directions that inquiry might otherwise have taken.

The crux of the political matter: While the goal of positivism may have been to positively establish an objective, verifiable method of conducting social science modeled on the physical sciences, the primary result was an attempt to politically limit the scope of what could be considered meaningful scientific statements to include only statements that were verifiable in a narrow positivist sense.  Howe is among the cohort who believe that the evidence-based movement is being used by some as a context to advance a tacit return to a form of positivism.

The crux of the scientific matter: Howe’s primary interest appears to be political, the politics of how research is received and funded, but there is also an effectiveness issue.  Positivism’s primary scientific problems are its tendency to ignore or downplay many of the limitations of positivist methods (overstating the meaning of positivist research) and the way it oversimplifies and fails to problematize the rather complex relationship between research and practice.

Messick’s Six-Part Validity Framework as a Response

There are four responses to Howe in this journal issue.  To me, none of the responses address the primary issue at play: bringing some sense of unity to varying ideas and supporting communication among people using different scientific methodological frameworks.  There are suggestions to allow for multiple methods, but they are more of a juxtaposition of methods than a framework that serves to guide and support communication and understanding among scientists using differing methods.  This is why I support Messick’s validity framework as a response to just this type of concern.  Although Messick spoke specifically of test validity, there is nothing that would preclude this framework from being applicable to practice validity and to the development of post-positivist evidence to support the validity of practices.  What is the evidence-based movement really concerned with, if it is not the validity of the practices being pursued by practitioners?  This is not primarily about the validity of individual research studies; it is about the validity of practices and developing evidence to support the validity of specific practices.  It is also a mature framework that considers the full range of inquiry when developing evidence.

Messick’s six areas for developing validity are six different types of validity evidence.  Here I develop an initial set of ideas about how they might relate to evidence-based practice:

• Content – Content defines the scope of the practice domain, with evidence (including rationales and philosophical debates) for the relevance of a particular practice, how the practice represents ideas within the general domain, and its technical quality as compared to other examples of the practice.
• Substantive – Evidence that the design of the actual processes involved is consistent with the knowledge of design from relevant domains (e.g., psychology, sociology, engineering, etc.).
• Structural – The consistency between the processes involved and the theories that underlie and support rationales for the structure of the actual process.
• External – Empirical data to support criterion evidence (randomized controlled trials (RCTs) would be one example).  For many practices this may include both convergent and discriminant evidence.  (My thinking is still in development here, but I think that empirical evidence from the research base would function more like criterion evidence.  Direct empirical evidence from the actual practice being validated would, in most situations, be considered under consequential evidence.  See below.)
• Generalization – Evidence for the practice being relevant and effective across different populations, contexts and time periods.
• Consequential – Evidence that the practice is actually achieving the purpose it was originally intended to achieve.

I consider this list to be an early formation with more development needed.  Critiques are most welcome.

Messick’s original formulation for test validity is available here.

Evidence-Based Management as a Research/Practice Gap Problem

This is a response I made to a post on the Evidence Soup blog about the potential demise of EBMgmt.
I’ve been thinking about the health of the movement in response to (Tracy’s) post and I’m still surprised by the lack of EBMgmt discussions and how the movement does not seem to be gaining much traction.  I re-looked at the Rousseau–Learmonth and the Van de Ven & Johnson–McKelvey discussions for potential reasons why (both are in Academy of Management, vol. 31, #4, 2006).  Here’s my take after reading them:
(1) Cognitive, Translation and Synthesis Problems: One, just like the example Rousseau gave in her Presidential Address, there are too many different concerns and issues floating about.  We need the field to be more organized so people can get a better cognitive handle on what’s important.  Also, I’m not sure peer review is the best strategy.  When I did my dissertation, doing something exciting took a back seat to doing something bounded and do-able.  I can’t imagine someone who’s publishing for tenure doing anything more than incremental work, and that does not translate well for cognitive and translation reasons.  We need a synthesis strategy.
Possible response – An EBMgmt wiki.  See my 7-31 post on scientific publishing at howardjohnson.edublogs.org.
(2) Belief Problems – Henry Mintzberg believes that managers are trained by experience and that MBA programs should be shut down (3-26-09 Harvard Business Ideacast).  He says that universities are good for that scientific management stuff, but implies that science is only a small part (management’s mostly tacit stuff).  All of the previously mentioned discussions noted that managers and consultants do not read the scientific literature.  Part of the problem is communication (see #1), but part is current management paradigms that include little science.
Possible response – Far be it from me to suggest how to deal with paradigm change.
(3) Philosophical Problems – If EBMgmt is to succeed, it must be presented as a post-positivist formulation.  Taken at face value, it seems positivist, and positivism has been so thoroughly critiqued that I could see where many people would dismiss it out of hand.  Part of my thing is trying to be post-positivist without throwing out the baby with the bath water.  Rousseau tries to mollify Learmonth’s concerns that touch on this area; she sees some of the issues, but I don’t see understanding.  A positivist outlook will only lead you in circles.
Possible response – It’s much like your previous post: you need “both-and” thinking, not “either-or” thinking.  EBMgmt must be an art and a science.  This is how I understand the validity issue that I’ve mentioned to you before.  I use Messick’s validity as a model for post-positivist science.  It’s also important because measurement is the heart of science.
I would love your thoughts.