In a world where the risks and costs associated with privacy are on the rise, differential privacy offers a solution. Simply put, differential privacy is a mathematical definition of the privacy loss to individual data records when private information is used to create a data product. Specifically, differential privacy measures how effective a particular privacy technique, such as inserting random noise into a dataset, is at protecting the privacy of individual data records within that dataset.
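For readers who want the formal statement, the standard definition is as follows: a randomized algorithm M satisfies epsilon-differential privacy if, for any two datasets D and D' that differ in a single record, and for every set of possible outputs S,

$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

The smaller the privacy parameter epsilon, the less any single record can influence what an observer sees, and the stronger the privacy guarantee.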
Why Differential Privacy Matters
At least half of the companies we talk to say that, even though they have varying degrees of freedom to use the data they collect, they hesitate to aggregate data across customers. Either there is some doubt about whether aggregation is contractually allowed, or they worry that their customers won't be happy if they find out about it.
To get around these issues, some companies try to anonymize customer data by removing personally identifiable information. However, de-anonymizing data is easier than you might think. High-profile examples of de-anonymization have included the Netflix Prize fiasco and the Massachusetts Group Insurance Commission disclosures in the mid-1990s. The latter resulted in the then-Governor of Massachusetts having his medical records identified. More recently, the NYC taxi cab disclosures resulted in the release of detailed location information for 173 million taxi trips. Differential privacy is a direct answer to this issue of de-anonymization.
How Differential Privacy Works
Differentially private solutions inject noise into a dataset, or into the output of a machine learning model, without introducing significant negative effects on data analysis or model performance. Differential privacy achieves this by calibrating the noise level to the sensitivity of the algorithm, that is, to how much any single record can change the output. The end result is a differentially private dataset or model that an attacker cannot reliably reverse engineer. Differential privacy makes it impossible to identify individual records, such as customers or patients, within a dataset with certainty. For example, imagine being able to analyze driving information for every driver in California without being able to identify any individual driver. Macro-level questions about driver behavior and road safety can be answered without compromising the privacy of the individuals contributing the data.
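To make the mechanics concrete, here is a minimal sketch of the Laplace mechanism, one of the standard ways to calibrate noise to sensitivity. The function name, the speeding-flag dataset and the parameter values are purely illustrative and are not drawn from any Georgian or portfolio company implementation.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return a differentially private version of true_value.

    The noise scale is sensitivity / epsilon: the more a single record can
    change the query result (higher sensitivity), or the stronger the desired
    privacy guarantee (smaller epsilon), the more noise gets added.
    """
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Toy example: count how many drivers in a dataset were flagged for speeding.
# Adding or removing one driver changes the count by at most 1, so the
# sensitivity of this counting query is 1.
speeding_flags = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
true_count = int(speeding_flags.sum())
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(f"true count: {true_count}, privately released count: {private_count:.1f}")
```

Anyone seeing only the released count cannot tell with confidence whether any particular driver's record is in the dataset, yet the count remains accurate enough to answer macro-level questions.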
Differential Privacy at Georgian
As part of our applied research practice within the Georgian Impact team, we have worked closely with multiple portfolio companies to research and implement differential privacy in their solutions. One of these companies, Bluecore, helps its customers analyze consumer behavior and turn those insights into personalized marketing recommendations. You can read more about the Bluecore project in this case study, or listen to this podcast with Bluecore’s CTO and Co-Founder, Mahmoud Arram.
The Bluecore system relies on machine learning. One challenge faced by the company was that some of its most sought-after offerings worked best once enough data was collected from a new customer to build an accurate predictive model. The onboarding process could take anywhere from several weeks to six months. To address this cold start problem, Bluecore built a predictive model based on its customers’ aggregated data and used differential privacy to provide privacy guarantees to all customers.
Solving the cold start problem has huge potential for any SaaS company looking to provide insights to its customers via aggregated data. While the noise introduced by differentially private techniques reduces accuracy relative to training on the same data without noise, that loss is more than compensated for by the accuracy gained from training on the much larger aggregate dataset. The net result is a significant increase in model performance.
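As a rough illustration of that tradeoff (a toy sketch, not Bluecore's actual pipeline, with the conversion rate, sample sizes and epsilon value all invented for the example): an estimate built from a single new customer's small sample is noisy, while an estimate released with differential privacy from a dataset pooled across many customers is both private and typically closer to the truth.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.12  # hypothetical underlying conversion rate

# Cold start: a single new customer with only 50 observed interactions.
small_sample = rng.binomial(1, true_rate, size=50)
cold_start_estimate = small_sample.mean()

# Aggregate: 100 customers pooled, 50 interactions each, released with
# differential privacy. The sensitivity of a counting query is 1, so Laplace
# noise with scale 1 / epsilon is added to the pooled count before dividing.
epsilon = 0.5
pooled = rng.binomial(1, true_rate, size=100 * 50)
noisy_count = pooled.sum() + rng.laplace(scale=1.0 / epsilon)
aggregate_estimate = noisy_count / pooled.size

print(f"true rate:             {true_rate:.3f}")
print(f"cold-start estimate:   {cold_start_estimate:.3f}")
print(f"DP aggregate estimate: {aggregate_estimate:.3f}")
```

The noise added to the pooled count is small relative to the size of the aggregate dataset, which is why the privacy cost is easily outweighed by the gain from training on more data.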
Where to from Here?
We’re excited about the potential for differential privacy to solve the privacy issues and cold start problems that many SaaS companies face when providing analytical services to their customers. It’s still early days, but things are moving quickly. We believe that differential privacy is no longer solely the domain of large technology companies such as Apple and Google. This is an area that we will continue to work on with our portfolio companies.
To learn more about this topic, check out our CEO’s Guide to Differential Privacy.
Or you can listen to this podcast with our own Chang Liu.