Last time, we talked broadly about privacy and used the term “differential privacy.” I first heard about this idea several years ago and I’ve been fascinated by how the math works and how it’s implemented. This post isn’t going to dive deep into the math – Damien Desfontaines’s blog posts are fantastic for that – but let’s look at why it’s important.
For public health researchers or others to make real use of large amounts of data, they may need sensitive information. Answering questions like “do you use illegal drugs?”, “are you sexually active?”, or “do you have an extremely transmissible virus?” truthfully means that whoever holds that data can put you at risk. You’re vulnerable not just to them, but to anyone who has access even to aggregated data. As we’ve listened to people talk about ways to track the spread of COVID-19, we’ve been extremely skeptical of any proposals that could create suspicion or division between neighbors by “tagging” people as risky. Our methods of managing this disease and its risks are almost entirely social right now, pre-vaccine, so putting people – often already vulnerable populations – at risk of isolation or violence is absolutely something we have to take care to avoid.
Differential privacy helps us do that by ensuring that individuals cannot be identified from aggregated data. Its techniques let publishers of data set a privacy budget that balances how much noise is added to the outputs against the accuracy of the aggregated data. Striking that balance carefully makes it possible to release data that helps, in this case, keep us safe from COVID-19 while also protecting people’s privacy.
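To make the noise-versus-accuracy tradeoff concrete, here is a minimal sketch of one classic technique, the Laplace mechanism, applied to a counting query. The data, function names, and the choice of epsilon are all hypothetical illustrations, not anything from a specific deployment; the key idea is that a count changes by at most 1 when any one person is added or removed, so Laplace noise with scale 1/epsilon suffices for epsilon-differential privacy.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Release a count protected by the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person's
    record changes the true count by at most 1), so adding noise with
    scale 1/epsilon yields epsilon-differential privacy for the output.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical survey data: did each respondent test positive?
random.seed(0)  # seeded only so this sketch is reproducible
responses = [True] * 40 + [False] * 60
noisy_total = private_count(responses, lambda r: r, epsilon=0.5)
```

Here epsilon is the privacy budget: a smaller epsilon means more noise and stronger privacy, a larger one means a more accurate published count. Each noisy release spends part of the budget, which is why publishers have to decide up front how much total accuracy they are willing to trade away.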