Has data sleaze reached a turning point?

March 20, 2018

(This article was originally published at Big Data, Plainly Spoken (aka Numbers Rule Your World), and syndicated at StatsBlogs.)

Has the data sleaze industry reached a turning point?

At the end of last week, Facebook scrambled to get in front of some unsavory press coverage by “proactively” suspending Cambridge Analytica – the data analytics outfit credited with the unlikely successes of the Brexit campaign in the U.K. and the Trump campaign in the U.S. – from its social media platform. It knew that The Guardian and The New York Times were poised to publish critical articles about how Cambridge Analytica exploited the Facebook platform to build its invasive database on 50 million Americans, data that form the foundation for the psychological scoring algorithms used to target and sway voters during the 2016 Presidential election.

As explained in my previous posts (here and here), data sleaze is the practice of taking and trading consumer data surreptitiously. The third parties that receive such data frequently use them in ways that work against the consumer’s own interests.

Facebook’s response has led to an avalanche of negative publicity, and there is perhaps some hope that it, as well as other tech firms including Google, Twitter, etc., may finally take action to stop the data sleaze.

I’ve decided to split this post into two parts. Part 1 is about the inner workings of the data-sleaze operation. Part 2, to be published later this week, is a call for industry leaders to take proactive action to curb the excesses of data sleaze. Part 1 explains the underlying technologies because one can never fully trust information coming mostly from conflicted actors.

In a Nutshell, How Our Personal Data Got Taken

Cambridge Analytica has boasted frequently to the media that it has amassed an extensive database on millions of Americans, which it claims to use to predict the psychological states of voters on behalf of election candidates who want to target and sway likely voters. A key part of this database consists of data gleaned from Facebook accounts. Cambridge Analytica purchased Facebook data from an outfit called Global Science Research (GSR), which is run by Aleksandr Kogan, an assistant professor of psychology at the University of Cambridge. GSR is a for-profit entity that he manages, separate from his academic appointment.

Dr. Kogan amassed the data by means of an online survey – a psychometric test similar to the Big 5 Personality Test – advertised on a service called Mechanical Turk, run by Amazon. The workers on this service, known as Turks, are people willing to do “small tasks” for cents per task. In this case, the small task was completing the psychometric test. With a twist: Dr. Kogan also required that the Turks download an app which they had to connect to Facebook, and in so doing, they had to grant him access to their Facebook data, as well as to their network of Facebook friends.
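To see why access to the friend network matters so much, here is a minimal sketch using a toy friendship graph. The names, the graph, and the function `exposed_profiles` are all made up for illustration; this is not Facebook's API or Dr. Kogan's code, only the arithmetic of how a few respondents can expose many non-respondents.

```python
# Toy friendship graph; every name here is invented for illustration.
friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "eve"},
    "cat": {"ann", "fay", "gus"},
    "dan": {"ann"},
    "eve": {"bob"},
    "fay": {"cat"},
    "gus": {"cat"},
}


def exposed_profiles(respondents, graph):
    """Return the respondents plus all of their direct friends.

    Assumption: granting the app access exposes the respondent's own
    profile and the profiles of everyone on their friend list.
    """
    exposed = set(respondents)
    for person in respondents:
        exposed |= graph.get(person, set())
    return exposed


respondents = {"ann", "cat"}  # the only two people who took the survey
exposed = exposed_profiles(respondents, friends)
print(len(respondents), "respondents expose", len(exposed), "profiles")
```

In this toy graph, two survey takers expose six profiles. Four of those people never saw the survey, let alone agreed to anything.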

Facebook data are valuable because they include real names and email addresses (only a minority of users block such snooping). These can then be used as match keys to connect with other sources of data, such as electoral rolls. See here for the kinds of data that Facebook allows partners to obtain.
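A match key is simply a field that two datasets share. The sketch below is a toy example under stated assumptions: the records, the field names, and the idea that the second file carries an email column are all hypothetical, chosen only to show how a shared identifier stitches two sources together.

```python
def normalize_email(email):
    """Lower-case and trim an email so trivial differences don't block a match."""
    return email.strip().lower()


# Hypothetical records pulled from a social-media profile.
social_profiles = [
    {"name": "Ann Lee", "email": "Ann.Lee@example.com ", "likes": ["hiking", "jazz"]},
    {"name": "Bob Ray", "email": "bob.ray@example.com", "likes": ["trucks"]},
]

# Hypothetical voter-file records; assumes the file includes an email field.
voter_roll = [
    {"email": "ann.lee@example.com", "precinct": "12A"},
    {"email": "zed@example.com", "precinct": "07C"},
]


def link_records(profiles, roll):
    """Join the two sources on the normalized email (the match key)."""
    roll_by_email = {normalize_email(r["email"]): r for r in roll}
    linked = []
    for profile in profiles:
        match = roll_by_email.get(normalize_email(profile["email"]))
        if match is not None:
            linked.append({**profile, "precinct": match["precinct"]})
    return linked


linked = link_records(social_profiles, voter_roll)
print(linked)
```

Only Ann appears in both sources, so she is the only one linked. Her record now combines what she likes with where she votes, which neither source held on its own.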

Back in 2015 and 2016, the media were already on to this story. Several articles were published by The Intercept and The Guardian. Facebook at the time was slow to react. It wasn’t until 2016 that Facebook lawyers informed Kogan and his associates that they had violated Facebook’s platform policies, and requested that these entities delete the data. Reporters have since tracked down some of the Facebook data, and a whistle-blower has come forward with documents, so it appears that Facebook, GSR, and Dr. Kogan have all been less than honest about the data sleaze. Indeed, at a hearing in the U.K., Facebook and Cambridge Analytica representatives denied that the political consulting firm had ever obtained or used Facebook data in its work.

The Enablers

The process of data sleaze outlined above is not unique to GSR or Cambridge Analytica. Many other companies in the social media ecosystem rely on variants. Let’s run down the list of enablers in this process.

  1. Facebook – The popular social-media platform is at the center of this controversy, precisely because it has built such a powerful database. This database is hugely valuable to marketers who want to know what we like, and whom we know. The social-media company has mastered the art of getting people to share their personal data by providing free, useful services or convenience via the platform. The same machinery that powers Facebook’s billions in revenue is driving data sleaze.
  1. Terms and conditions and privacy policies – In every case including this one, tech firms expressly use terms and conditions as cover for invasion of privacy. They hide behind the façade of “if you don’t agree with our terms, then don’t use our service.” Later, many of these firms resort to even slyer tactics, such as “if you continue to use our service, we assume you agree with our terms.” It’s a form of blackmail. Very few users read these terms and policies, but the businesses claim with a straight face that they have obtained permission from users to collect their data. Facebook and Cambridge Analytica argued that user permission was properly obtained. Dr. Kogan apparently disclosed to survey respondents that their data could be used for any reason. Because of his affiliation with the University of Cambridge, some of the Turks were misled into thinking they were taking part in an academic study.
  1. Bait and switch – the psychological test is a front for collecting each respondent’s Facebook Graph. The Facebook data contain information about who knows whom. Similarly, every weather app is a front for a detailed database of user locations at all times. In my view, the most important dataset Dr. Kogan wanted was not the results of the psychological testing, as widely reported, but the names and emails of the network of friends and acquaintances of all those who signed up. A few hundred thousand self-selected responses to the survey are not sufficient to create an accurate model of every American’s psychological state.
  1. Data sharing technologies – a typical app delivers a service to users by pulling in various sources of data and integrating them. In order to support real-time sharing of data between app developers and data collectors like Facebook, data collectors set up automated processes by which apps can pull down the data. There are usually costs associated with these interfaces, especially when a sizable amount of data is delivered, which is a source of revenues for the data collectors. It’s hard to control access given that these systems allow automated bots to interface with them. Dr. Kogan, for example, could create both a good bot collecting data for academic research and a bad bot siphoning data to Cambridge Analytica.
  1. Data governance black hole – once the data show up in one database, they are bound to show up in many databases, internally as well as at third parties. Once the data reach a third party, Facebook cannot know how many copies are made, and where those copies are. Even within Facebook, with so many employees having access to the data, it is almost impossible to monitor who has copied the data where. Facebook and other social-media outlets have community rules. For example, Facebook has “platform policies” that restrict friends’ data to noncommercial use such as improving user experience. Talk about unenforceable! Facebook can only know what data have been sent to a third party; it has no way of knowing how the third party is using the data.
  1. Data deletion myth – just as it takes special skills to eradicate a file on one’s PC, so it is basically impossible to remove all traces of data from existence. We have trouble even counting and locating all copies of a given dataset. Thus, Facebook didn’t even bother to check whether Cambridge Analytica and selected third parties truly destroyed the data they were supposed to. Facebook didn’t suspend the controversial company until the media unearthed evidence that the data had not been destroyed.
  1. Mechanical Turks – these bit players were exploited as pawns to sell out their “friends” for mere cents.
  1. Weak regulation and enforcement – Europe might be finally ready to enact laws to regulate the data collection industry but the U.S. government sees no evil.
  1. Anonymity – many businesses trot out this buzzword to justify their data collection operations. To put it bluntly, we are being lied to. Anonymity is declared every time an analyst replaces identifiers (such as emails) with encrypted or scrambled versions of those identifiers. However, in most cases, lookup tables are available to unmask these users.
  1. Slack ethics – enough said.
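The anonymity point in the list above is easy to demonstrate. The sketch below assumes one common form of "scrambling", an unsalted SHA-256 hash of the email address. The records, the scores, and the helper `scramble` are hypothetical, and real pipelines vary; the point is only that anyone holding a list of candidate emails can rebuild the lookup table.

```python
import hashlib


def scramble(email):
    """Replace an email with its SHA-256 hex digest (assumed scrambling scheme)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# A supposedly anonymous dataset: emails replaced by hashes. Scores are made up.
anonymized = [
    {"user": scramble("ann.lee@example.com"), "score": 0.82},
    {"user": scramble("bob.ray@example.com"), "score": 0.35},
]

# Any party with a list of candidate emails can hash them the same way
# and build a lookup table from hash back to email.
known_emails = ["zed@example.com", "ann.lee@example.com"]
lookup = {scramble(email): email for email in known_emails}

# Unmask every row whose hash appears in the lookup table.
unmasked = [
    {"email": lookup[row["user"]], "score": row["score"]}
    for row in anonymized
    if row["user"] in lookup
]
print(unmasked)
```

Ann is unmasked because her email was on the candidate list; Bob stays hidden only until someone adds his email to that list.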

In Part 2, I suggest some actions.

