#LAK11 Data Science and Analytics: the Good, the Bad and the Ugly

Hans de Zwart posted a great summation of the critiques of big data and its uses.  I will comment on his post in three sections: the good, the bad, and the ugly.

The Good

I like this Dataist’s Venn diagram of data science, which combines hacking skills (innovating with technology), math and statistics knowledge, and core expertise.

Data Science Venn Diagram

Data science is then a combination of expertise in machine learning (to deal with the massive amounts of data being generated), traditional research expertise (to process and analyze that information), and a willingness to engage in creative disciplinary innovation to bring these insights to practice (the danger zone).  I read this as a list of the skills and knowledge the field needs.

The Bad

Most of the naughty list comes from Drew Conway’s original definition of the danger zone and from George Siemens’ 10 concerns.  Drew’s reason for calling it a danger zone was to warn of people who hack (innovate) with poor core knowledge, and George’s list of concerns is mostly about data procedures getting away from our intentions.  These are valid concerns, but I think they come down to statistical and measurement concerns.  My take on the problem is this: due to common pedagogy, most people have a rather formulaic understanding of measurement and statistics.  They know how to plug in the numbers, but they are not so good at understanding what they are doing conceptually or which limitations are being violated.  This is a problem not only because they are operating blindly, but also because they are missing the inherent limitations of their calculations.  So people are blind to the validity problems they are creating, and they lack a good conceptual understanding of what their procedures are capable of doing.
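A small illustration of the plug-in-the-numbers problem, offered as a sketch rather than anything drawn from Drew's or George's posts. The data and the helper name `pearson_r` are made up for this example. The Pearson correlation only measures linear association, so someone who computes it formulaically on a perfectly curved relationship will conclude that nothing is there.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed straight from the textbook formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y depends on x completely, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x * x for x in xs]

# For comparison, a relationship that does satisfy the linearity assumption.
ys_linear = [2 * x + 1 for x in xs]

r_curved = pearson_r(xs, ys_curved)
r_linear = pearson_r(xs, ys_linear)

print(f"curved relationship: r = {r_curved:.3f}")
print(f"linear relationship: r = {r_linear:.3f}")
```

The formula runs without complaint in both cases. Only a conceptual understanding of what the statistic assumes tells you that the first number says nothing about whether x and y are related.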

The Ugly

Hacking skills are the most likely skill in this diagram to be ignored.  This is a new area and it can’t progress without innovation.  Even though innovation is widely celebrated, managers do not really like it, because the very idea of management is wrapped up in the idea of control (with or without command).  In a standardized economy, people were interchangeable and had to conform to existing processes.  Today the world, even the data world, changes too quickly for standardized processes in most circumstances.  To respond, management must be reformed back to its core purposes, and I don’t think the discipline is ready to tread these waters.

Conclusion

This view runs counter to Chris Anderson’s argument in The End of Theory, which favors dimensionally agnostic statistics.  Google is just a tool.  Popularity does not equal quality or relevance, as was pointed out with recent concerns that organizations are spamming Google results.  As the Sloan article Hans quotes states:

Information Must Become Easier to Understand and Act Upon