Junk Data

February 5, 2013

(This article was originally published at Numbers Rule Your World, and syndicated at StatsBlogs.)

There are junk charts, and there are junk data.

That was the thought that ran through my mind when I saw a post about a new FourSquare app. For those who are not familiar with it, FourSquare is a service that lets you broadcast your current location to your friends and followers. This new app, which won a competition hosted by FourSquare, lets users fake their check-ins, in other words, pretend to be somewhere they are not. It's being portrayed as a way of marketing yourself to your social circle.

This is one of many problems with the so-called Big Data era. Yes, we collect lots of data. But a lot of the data are junk. It's worse than that: the junk is mixed together with the good stuff, and it is often difficult, if not impossible, to tell them apart.

In this case, I wonder: if I were given a dataset containing all these check-ins, some of them faked, would I be able to filter out the fakes? One way is to identify the source of each check-in and blacklist apps like CouchCachet. That only works if (a) the only way to post fake check-ins is through trackable apps, and (b) there are no legitimate check-ins from those blacklisted apps.
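To make the blacklist idea concrete, here is a minimal sketch in Python. It assumes each check-in record carries a "source" field naming the app that posted it. That field, the record layout, and the sample records are all made up for illustration and are not taken from FourSquare's actual data.

```python
# Hypothetical record layout: each check-in is a dict with a "source" field
# naming the app that posted it. This is NOT the real FourSquare schema.
BLACKLISTED_SOURCES = {"CouchCachet"}


def split_by_source(checkins, blacklist=BLACKLISTED_SOURCES):
    """Return (kept, dropped): check-ins from blacklisted apps go to dropped."""
    kept, dropped = [], []
    for checkin in checkins:
        if checkin.get("source") in blacklist:
            dropped.append(checkin)
        else:
            kept.append(checkin)
    return kept, dropped


# Made-up sample records, for illustration only.
checkins = [
    {"user": "a", "venue": "Museum", "source": "official_app"},
    {"user": "b", "venue": "Opera", "source": "CouchCachet"},
    {"user": "c", "venue": "Gym", "source": None},  # source unknown
]

kept, dropped = split_by_source(checkins)
```

Notice what the filter cannot do. A fake check-in typed by hand into the official app stays in kept, which is condition (a) failing, and a genuine check-in that happens to come through a blacklisted app ends up in dropped, which is condition (b) failing. The record with no known source is kept by default, which is itself a judgment call.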

Alternatively, I would have to create a labeled dataset by verifying, for a sample of past check-ins, which ones were real and which were faked. This would be very hard to put into practice.

The next question to ask is: if Big Data contain a lot of such faked or just bad data, how much can we trust the analyses?

Would love to hear about your experiences with junk data.

