Garbage in, garbage out: what if your Big Dataset is lousy data? (UPDATED)
Submitted by drupaladmin on 6 June 2012.

I'm all for making the most of the data we already have--but no more than that. A hazard of trying to wring as much as possible from any dataset is that you'll overstep and try to use the data to address questions, or draw conclusions, that can't actually be addressed or drawn. In ecology, this was the motivation behind the excellent NutNet project: existing data weren't really adequate, so they had to go collect new data.
Over at Cop in the Hood there's a fun rant by Peter Moskos on just this point, in a social science context. A huge, information-rich Big Dataset was recently used to argue that people in poor neighborhoods have just as easy access to nutritious food as people in rich neighborhoods, so lack of easy access to nutritious food can't explain the higher incidence of obesity in poor neighborhoods. Which is total bunk, because the data on what constitutes a "grocery store" are, if not total garbage, at least totally inadequate for the purpose for which this study tried to use them. A fact the study itself recognized, only to dismiss with the excuse that better data would have been difficult and expensive to obtain. Which amounts to saying "Doing it right would've been hard, so we decided to do it badly." Click through and read the whole thing; it's a great, short read and not at all technical.
I'm curious to hear from readers who work more with pre-existing data than I do: Have you ever looked into doing some sort of analysis of pre-existing data, only to drop it because you decided that the data weren't good enough? Or have you ever reviewed a synthesis paper and told the authors, "Sorry, but your whole project is worthless because the data just aren't good enough"?
And are there any general strategies that can be used to guard against making more of the data than is reasonable? One possibility is to involve the people who collected the data in any synthetic effort using those data. That's certainly something my CIEE working group on plankton dynamics did, and I think it was a good thing, even if it does have its own risks (e.g., causing the synthesizer to worry about truly minor flaws in the data that don't actually affect the results).
Note that one strategy that doesn't guard against poor-quality data is "make sure you have a really big dataset." Having more fundamentally-flawed numbers, or more non-flawed numbers to go with the fundamentally-flawed ones, doesn't make the fundamentally-flawed numbers any less flawed. Put another way, flaws in your data don't just create "noise" from which a "signal" can be extracted if only you have enough data. Flaws in your data can eliminate the signal entirely, or worse, generate false signals (as in the social science study linked above).
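To make the bias-vs-noise point concrete, here's a minimal simulation sketch (mine, not from the post or the linked study; all numbers are invented purely for illustration). It shows that when the data are systematically flawed, collecting more of them shrinks the random noise but leaves the flaw-induced error untouched:

```python
# A toy illustration: bias from flawed data doesn't shrink as sample size grows;
# only the random noise does. All values below are made up for this example.
import numpy as np

rng = np.random.default_rng(42)

true_mean = 10.0   # the quantity we'd like to estimate
bias = 2.0         # systematic flaw in how the data are recorded
noise_sd = 5.0     # ordinary measurement noise

for n in [100, 10_000, 1_000_000]:
    sample = true_mean + bias + rng.normal(0.0, noise_sd, size=n)
    est = sample.mean()
    # The standard error shrinks with n, but the estimate stays ~2 units off the truth:
    print(f"n = {n:>9,d}: estimate = {est:6.3f} (truth = {true_mean})")
```

However big you make n, the estimate converges to 12, not 10: the "signal" you recover is the false one the flawed data built in.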
It's only natural that someone like me would worry about this sort of thing, as I don't work with pre-existing data that much. I'd be interested to hear from people who do data synthesis for a living and are really invested in it (the 'synthesis ecologists'). How often do you run into serious problems with data quality, bad enough to prevent you from answering the question you want to answer? Does the possibility keep you up at night? What do you do about it?
HT Andrew Gelman, who also comments.
p.s. Before anyone points this out in the comments: I freely grant that everyone always tries to push every method or approach as far as it will go, so everyone always runs the risk of overstepping what their chosen method or approach can teach them. But 'synthesis ecology' is what's hot right now, so that's the context in which I think it's most important to raise this issue.
UPDATE: Here's this post in cartoon form.