by Elizabeth Matsui
A piece in the NY Times on 12/14/2014, “In Big Data, Shepherding Comes First,” highlights a major but underappreciated obstacle to generating meaningful results from any type of data, including “big” data: extracting meaning from data requires far more “craftsmanship” than automation. This comes as no surprise to those of us who regularly contend with the messiness of collecting “real-world” data and the time-consuming process of wrangling it. Data scientists, statisticians, and others in the business of data analysis also devote a significant amount of time to careful, scientific thinking as they refine the question to be answered and develop and execute an analysis plan to answer it.
Of course, producing a simple summary statistic, such as the frequency of a characteristic in a data set, can be automated easily. But conducting analyses to understand relationships between characteristics, or to determine whether and which characteristics predict a particular outcome, is much more complex and, at least for the foreseeable future, will be very difficult to automate. And where it is automated, there is a real risk that the results will be misleading; the biased estimates produced by the unattended automation of Google Flu Trends are one such example. Not surprisingly, data science companies that initially developed software products are finding that automation plays a smaller role in the data science process than expected, and that their clients need data strategy and consulting services as much as, if not more than, highly specialized proprietary software.