Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality

May 13, 2018
By

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Ariel Rokem pointed me to this Python program by Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, and Matt Gee that will take your data matrix and produce a new data matrix that has the same size, shape, and general statistical properties but with none of the same actual numbers.

The use case is when you want to give your data to someone to play around with some analyses, but the data themselves are proprietary or confidential.

This is different from the scenario such as with the Census where they create a synthetic dataset that adds a little bit of noise or takes out some observations to preserve confidentiality but otherwise is intended to give the right answers. Here, there’s no aim to use the fake data to perform real applied analyses; it’s all about creating something similar that people can work with, for example to prototype their data analysis plans.

Howe et al. call their program DataSynthesizer but if it were up to me it would just be called Garbler.

Here’s the paper describing what the program actually does. The statistical procedures used in the garbling are nothing fancy, but that’s fine: just as well to keep it simple, given the simplicity of the goals. The title of their paper is “Synthetic Data for Social Good,” but I don’t really understand that last part: it seems to me that Garbler, like other statistical tools, could be used for good or bad.

The post Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality appeared first on Statistical Modeling, Causal Inference, and Social Science.



Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

Tags:


Subscribe

Email:

  Subscribe