Advancing in AI without Compromising Privacy or Breaking the Law

Developing Artificial Intelligence (AI) tools that can address medical and societal problems presents a data privacy quandary. To become good at advanced pattern detection, AI tools need access to real data, and lots of it. A medical diagnostic AI program, for example, will only work properly if it can train itself on actual patient records. However, in the US and EU, accessing this kind of data presents a range of problems.

In the US, healthcare privacy laws like HIPAA restrict third-party access to patient data without the patient’s permission. Getting this permission may be a cumbersome chore or just flatly impossible. Broader privacy laws, the EU’s GDPR and California’s CCPA, add further regulatory impediments to patient data sharing. Even if the permissions are in place, the healthcare organizations that hold the data may not want to share it with an AI diagnostic tool maker for branding reasons. Patients may not like the idea of having their data shared. Plus, there’s always the chance that patient data, once shared, will be breached.

At the same time, developers of AI tools in China face few such obstacles. They can freely access patient data from their own country. And, to broaden their data set, it appears they are simply stealing health records from other countries for the purpose of AI research. This is one possible answer to the question posed in a recent Wall Street Journal article, “What Does Beijing Want With Your Medical Records?”

Geopolitical espionage is one theory behind the breaches of companies like Anthem Blue Cross, but it may not be the real motive. More likely, Chinese companies want the data so they can compete and win in the AI-driven medical technology field. American companies are lagging behind, encumbered by privacy rules. It’s a difficult matter to resolve. It’s good that American and European governments protect private citizens’ data. At the same time, losing out to China on key industries will lead to economic problems that will negatively affect our societies.

What can be done about this? One solution is to use synthetic data. Synthetic data is derived from real data that has been masked to strip out private identifying information while preserving its statistical properties. Using synthetic data, an AI tool can train without anyone having to worry that people’s privacy is being compromised.

Generating synthetic data takes specialized technology like ARM Insight Mimic™. With Mimic, a user can convert a database into secure, anonymized, statistically relevant data. The challenge is to make a statistically usable replica of the existing data without revealing confidential information. Some fields, like a person’s social security number, must be blocked entirely. Others, like birth date, can be randomized so they stay close to the real facts but are no longer identifiable. Mimic can also remove trade secrets from data.
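Mimic’s actual pipeline is proprietary, but the general technique described above can be sketched in a few lines. The function below is a hypothetical illustration, not Mimic’s API: it blocks direct identifiers outright and jitters a quasi-identifier (birth date) within a small window so the record stays statistically useful without being traceable to a person. The field names and the 30-day window are assumptions for the example.

```python
import random
from datetime import date, timedelta

def mask_record(record, jitter_days=30, seed=None):
    """Return a privacy-preserving copy of a patient record (illustrative sketch).

    Direct identifiers (SSN, name) are removed entirely; quasi-identifiers
    like birth date are randomized within +/- jitter_days so the result is
    close to the real facts but not identifiable. Clinical fields are kept
    so the output remains statistically relevant for training.
    """
    rng = random.Random(seed)
    out = dict(record)
    out.pop("ssn", None)   # block direct identifiers completely
    out.pop("name", None)
    if "birth_date" in out:
        offset = rng.randint(-jitter_days, jitter_days)
        out["birth_date"] = out["birth_date"] + timedelta(days=offset)
    return out

patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "birth_date": date(1980, 5, 17),
    "diagnosis": "E11.9",  # statistically relevant field, preserved as-is
}
masked = mask_record(patient, seed=42)
```

In a real system, the jitter would also have to be consistent across records and resistant to re-identification by linkage, which is where purpose-built tooling earns its keep.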

“This is a path to monetization of your data,” explained Randy Koch, CEO of ARM Insight. “It’s synthetic data of high quality, perfect to train AI tools and stimulate machine learning algorithms. But it also completely protects privacy and is 100% compliant with all regulations or security policies that limit the use of personal data.”

Synthetic data creation offers a way forward for American and European companies that want to stay competitive in AI. Creating usable synthetic data at scale is not always easy, but solutions like Mimic are showing that it is possible.