By Navin Vembar
Raw data can be toxic. I’m not the first person to say this – that collecting, storing, and processing data should be treated as a dangerous process of refinement. This is even more true of mobility data: deanonymizing movement data can be trivial.
We’ve talked about differential privacy before and how aggregating statistical data can be done safely. But there’s another way to keep things even safer: prevent centralized data collection from happening at all. That approach has its own downside, though: without collection, scientists would lose a treasure trove of important data that’s applicable not only to COVID-19, but to natural disasters, city planning, and other population-level analyses.
So, is there a way to prevent the collection of data while also supporting these use cases? The approach that Apple and Google have taken with contact tracing suggests that, in fact, there may be a happy path.
Apple and Google do it with a combination of carefully designed local computations (about who you’ve interacted with) and a detailed cryptographic protocol that keeps your personal information safe. That means there is no central data collection; the data stays with you, carefully secured on your phone, keeping your information private while also keeping the people around you safe. Similarly, Apple and Google already apply differential privacy to telemetry such as crash statistics – so the approach has not only been studied but proven out at scale. The question becomes: can we collect other metrics safely, in aggregate, with clear consent, and in a way that benefits scientists and the public?
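To make the idea of privacy applied on the device concrete, here’s a minimal sketch of randomized response, one of the simplest local differential privacy mechanisms. This is illustrative only – it is not the mechanism Apple or Google actually ship, and the 30% rate and phone count below are made-up numbers – but it shows the core trick: each phone adds noise before reporting, so no individual report is trustworthy, yet the aggregate is still accurate.

```python
import random

def randomized_response(true_bit: int) -> int:
    """Report a single sensitive bit with plausible deniability.

    With probability 1/2, answer honestly; otherwise answer with a
    fresh coin flip. No single report reveals the true bit."""
    if random.random() < 0.5:
        return true_bit
    return random.randint(0, 1)

def estimate_rate(reports: list[int]) -> float:
    """Recover the population rate from the noisy reports.

    E[observed] = 0.5 * true_rate + 0.25, so invert that relation."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

random.seed(0)
# 100,000 hypothetical phones; 30% truly have the sensitive attribute.
true_bits = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]
reports = [randomized_response(b) for b in true_bits]
print(round(estimate_rate(reports), 3))  # close to 0.30
```

The analyst never sees the true bits – only the noisy reports – yet the estimated rate lands very close to the real one, with accuracy improving as more phones participate.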
A model that combines the two approaches suggests itself: expose phone operating system APIs that can compute a range of statistics, and apply differential privacy and thresholding on the phone itself. Apple’s recent spate of privacy prompts and granular controls shows how users can be given the information they need to clearly consent to their data being used – and, in this case, to ensure that data is appropriately protected. There are also techniques such as secure multiparty computation, which allow data to be shared across multiple parties for aggregation without revealing any individual’s data.
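As a sketch of how that kind of secure aggregation can work, here’s additive secret sharing in a few lines. This is a toy illustration under assumed conditions (three non-colluding aggregators, no dropouts), not a production protocol – real systems layer on authentication, dropout recovery, and often differential privacy noise as well:

```python
import secrets

MODULUS = 2**32  # all arithmetic is done modulo a fixed public value

def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n random shares that sum to it mod MODULUS.

    Any subset of fewer than n shares looks uniformly random, so it
    reveals nothing about `value`."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Three hypothetical phones, each holding a private count it never reveals.
private_counts = [3, 7, 2]

# Each phone sends one share to each of three non-colluding aggregators.
phone_shares = [share(c, 3) for c in private_counts]

# Each aggregator sums the shares it received; individually, these
# partial sums reveal nothing about any single phone's count.
partial_sums = [sum(s[i] for s in phone_shares) % MODULUS
                for i in range(3)]

# Combining the partial sums recovers only the aggregate.
total = sum(partial_sums) % MODULUS
print(total)  # 12
```

No aggregator ever sees a phone’s real count – each holds only uniformly random shares – yet the sum of their partial results is exactly the population total.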
There are a lot of unknowns here, but I do think there’s a path forward: applying best practices for protecting privacy while still tapping into the kind of statistical information that could be invaluable for public safety, health, and decision-making.