Synthetic Data Promises Fair AI And Privacy Compliance, But How Exactly Does It Work?

While AI has delivered incredible breakthroughs in pattern recognition and organizing immense data sets to help humans make better decisions, certain issues related to bias and privacy have created challenges to both innovation in AI and consumer trust.

An algorithm recently gave a husband a higher credit limit than his wife even though she had the higher credit score. Amazon’s hiring AI ranked female candidates lower than male ones, and, at the extreme end, a soap dispenser failed to recognize darker skin. Bias is a critical problem in AI and it is only getting worse.

But bias isn’t just offensive to consumers on the receiving end; it also makes AI ineffective. When algorithms are trained on historical data, the AI can encode the bias in that data, even if today’s demographics point to different outcomes. According to Gartner, it is estimated that “by 2022, 85% of algorithms will be erroneous due to bias.” Bias finds its way into AI for many reasons, a primary one being that the training data contains historical biases, such as a low percentage of women in traditional technical roles.

In another recent mishap, Google ran a job ads campaign that showed high-paying job ads to men but not to women. Clearly, a bias problem of this nature hurts not only the candidates but also the company’s own bottom line, as a more diverse workforce has been linked to better business decision-making and profits.

Bias or imbalances can also affect algorithms when new data emerges and is drowned out by oceans of historical data. Behavior can change dramatically over time. For example, our shopping behavior changed radically once we entered a worldwide pandemic, and algorithms had to be retrained with large amounts of new data on behaviors that hadn’t been seen before. There is concern that recent shifts in language, reflecting social change, can be drowned out by vast quantities of historical records, leaving AI unable to learn those new patterns without deliberate intervention.

Bias is only one of the hidden dangers embedded in data. With increasing volumes of collected and stored data, the risks of data leaks and re-identification are on the rise. A data breach is not only costly for the enterprise but can have very real consequences for individuals as well. A breach could leak private health information, exposing sensitive health records to insurance companies, which could then penalize patients with higher premiums or, worse, coverage denial. What happens when your health insurance company knows you are predisposed to heart disease or other health conditions?

On the one hand, a large amount of granular data is needed to train AI; on the other hand, that data requires privacy protection, and traditionally anonymized data often isn’t useful enough. Many organizations train their algorithms on badly anonymized data, putting their customers’ privacy at risk. From a privacy perspective, sensitive personal information must be protected, especially in light of increasing regulations and data privacy policies such as Europe’s GDPR and the California Consumer Privacy Act (CCPA). Under both regulations, data is only considered anonymous if none of the subjects are re-identifiable, either directly or indirectly, by the data controller or by any third party.

The true financial cost of failing to protect privacy, however, goes well beyond hefty regulatory fines. Data breaches are on the increase: COVID-19 has been linked to a 238% rise in worldwide cyberattacks against the financial sector between February and April 2020. And it’s not only cyberattacks businesses should worry about. Inside jobs and inadvertent mishandling of data also regularly cause data leaks. The average cost of a data breach in the United States in 2020 was $8.64 million.

While it’s extremely concerning that data leaks are on the rise, traditional data anonymization tools are also failing. What was once good enough no longer suffices. New types of threats, like linkage attacks, are exposing vulnerabilities in traditionally anonymized data sets. These old tools also destroy the statistical patterns and rich insights that AI and data-driven tech need to learn from: they must strip out so much data to adhere to regulations and protect privacy that there is little left to leverage effectively.

This creates a problem for both large enterprises and smaller innovation-led startups who are both locked out of data access. Enterprises are unable to use their own data to improve their products or customer experiences because of adherence to privacy restrictions and startups are blocked from accessing good data for use in their new ideas.

Further, even when data is stripped down to what passes for privacy protection, it has been shown that in a traditionally anonymized dataset where only three out of hundreds of data points are left intact, hackers can still discover the true identity of the data subjects.
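To make the re-identification risk concrete, here is a minimal sketch of how one might measure it. It counts how many records are unique on a handful of attributes that survived anonymization; a unique record is exactly what a linkage attack needs in order to match it against an outside data source. The records, the attribute names and the `unique_fraction` helper are all hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical "anonymized" records: names are gone, three attributes remain.
records = [
    {"zip": "1010", "birth_year": 1980, "sex": "F"},
    {"zip": "1010", "birth_year": 1980, "sex": "F"},
    {"zip": "1020", "birth_year": 1975, "sex": "M"},
    {"zip": "1030", "birth_year": 1990, "sex": "F"},
    {"zip": "1030", "birth_year": 1962, "sex": "M"},
]

def unique_fraction(rows, keys):
    """Share of rows whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)

# Records that are unique on these attributes can be singled out
# by anyone who knows the same three facts about a person.
risk = unique_fraction(records, ["zip", "birth_year", "sex"])
print(f"{risk:.0%} of records are unique on three attributes")
```

The more attributes are left intact, the larger this share tends to become, which is why removing names alone rarely amounts to anonymization.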

This doesn’t mean all AI is evil. There are countless positive examples where AI has been leveraged for good, such as in healthcare and the medical diagnosis of cancer, where AI can be better than humans. Similarly, enterprises need AI for efficiency and cost reduction and, ultimately, for carving out a competitive edge. No AI, no way forward: that is simply the state of the world today. AI is a tool like any other, and it needs to be used well. AI is fast becoming a mission-critical part of any business, so issues of bias and privacy need to be addressed head-on to protect consumers and to elevate businesses.

Synthetic data, itself a product of sophisticated generative AI, offers a way out of privacy risks and bias issues. These algorithms can learn data structures and correlations to generate unlimited amounts of artificial data with the same statistical qualities, allowing insights to be retained with brand new, synthetic data points. The process can unlock privacy-sensitive information, correct bias, and fix imbalances, and it is private and compliant by design.
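The idea of learning statistical structure and then sampling new records from it can be sketched in a few lines. The example below is a deliberately simple stand-in, not the deep generative models the article refers to: it assumes two numeric columns, learns their means, spreads and correlation, and draws fresh rows from a bivariate normal distribution with those statistics. The "real" data, the column meanings and the `fit` and `sample` helpers are hypothetical.

```python
import math
import random
import statistics

def fit(rows):
    """Learn means, standard deviations and correlation of two columns."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return mx, my, sx, sy, cov / (sx * sy)

def sample(model, n, rng):
    """Draw n artificial rows from a bivariate normal with the learned statistics."""
    mx, my, sx, sy, rho = model
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the learned correlation rho.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        out.append((x, y))
    return out

rng = random.Random(0)
# Hypothetical "real" data: customer age and a correlated monthly spend.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 20 * age + rng.gauss(0, 100)))

synthetic = sample(fit(real), 2000, rng)
real_stats = fit(real)
synth_stats = fit(synthetic)
print("correlation real vs. synthetic:", round(real_stats[4], 2), round(synth_stats[4], 2))
```

None of the synthetic rows is copied from the original data, yet summary questions such as "how strongly does spend depend on age?" get close to the same answer from both datasets. Production-grade generators have to capture far richer structure than a single correlation, and privacy guarantees require more than the absence of copied rows.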

“Synthetic data enables companies to build software in healthcare leveraging patient data but staying within HIPAA compliance,” says Sree Batchu, Business Development, Biocom, a public policy leader and advocate for California’s life science sector.

“It’s the most effective way to build algorithms that create new products and solutions while keeping patient data safe and allowing for innovation that can save lives and make incredible positive impact to humanity.”

“We saw the same breakthroughs possible for structural data as in image synthesis. We knew that synthetic data was the way forward for big data privacy,” says Klaudius Kalcher, Co-founder and Chief Data Scientist at MOSTLY AI.

MOSTLY AI created an algorithm capable of learning patterns from data and generating statistically nearly identical data sets geared towards maximum accuracy and privacy. The inspiration came from seeing Nvidia’s progress in its AI’s ability to create synthetic faces which were indistinguishable from real photos.

“Our technology enables organizations across the world and across industries to safely share big data assets, internally as well as externally, while keeping the privacy of their customers fully protected.”

Synthetic data algorithms are especially good at synthesizing behavioral records, such as credit card transactions or purchase histories, including time-dependencies of customers’ actions and behaviors.

AI-generated synthetic data is private by design. This data is entirely artificial, generated to match statistical patterns, not individual data points. The individuals in the original and synthetic datasets are completely different, but if you ask the same questions of both datasets, you will get the same answers.

Due to the sensitive nature of their client records and highly restrictive privacy requirements, the finance and health care sectors frequently use synthetic data solutions. A top 5 US health insurance company is leveraging AI and synthetic data to reduce healthcare costs, improve patient outcomes, and support healthier lives for its members.

Until recently, data privacy issues hindered efforts to use AI and machine learning because anonymizing data took months and destroyed scientific validity in the process. Now, using synthetic data, the company is able to share data freely with outside innovators and researchers in a new, disruptive way.

One of the largest global telco firms uses synthetic data for startup collaborations and analytics. Before, they were only able to analyze and understand a fraction of their customer data since only 10-25% of enterprise customers tend to consent to the use of their data for analytics on average. With synthetic capabilities, they can access data insights locked by policies and regulations to improve products and deepen customer understanding while keeping their customers’ privacy safe and secure.

Inspired by the book Ethical Algorithms, Dr. Paul Tiwald, Head of Data Science at MOSTLY AI, realized that fixing AI’s privacy dilemma isn’t the only thing their synthetic data algorithm could be used for. Correcting biased datasets became a priority topic for the team in 2020. Little did they know how relevant their research, recently featured by deep learning expert Andrew Ng, would become. The recipe for ethical algorithms is simple: feed them with fair data. Synthesize more high-earning women, teach HR algorithms about the benefits of employing older people, and so on.

Realizing societal changes is still, of course, necessary, but if we provide synthetic data to create a more equal digital world now, we can prevent a cyclical pattern of physical/digital bias.

Banking and insurance companies use the same technology to fix imbalances that complicate the data picture. In fraud detection, for example, AI has a hard time detecting fraudulent events because they are so rare.

Other rare events, like financial crises or pandemics, are also difficult for AI to learn from unless the training data is augmented and rebalanced with synthetic copies of rare data. While AI is truly revolutionary in solving problems in new ways, algorithms are only as good as the data that feeds them. The more quality, balanced data they have, the better they perform. Synthetic data can make datasets fairer and larger by extracting statistical insights from sensitive data locked up by policies and regulations.
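Rebalancing with artificial examples of a rare class can be illustrated as follows. This sketch uses a simplified, SMOTE-style interpolation (new points are placed on the line between two randomly chosen rare records) as a stand-in for a learned generator; it is not the method any company mentioned here is known to use. The transaction data and the `oversample` helper are hypothetical.

```python
import random

def oversample(minority, n_new, rng):
    """Create n_new artificial rows by interpolating between random pairs of rare rows."""
    new = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position between a (t=0) and b (t=1)
        new.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return new

rng = random.Random(1)
# Hypothetical transactions as (amount, hour_of_day); fraud is rare.
legit = [(rng.uniform(5, 200), rng.uniform(8, 22)) for _ in range(990)]
fraud = [(rng.uniform(800, 1500), rng.uniform(0, 5)) for _ in range(10)]

# Add artificial fraud rows until both classes are the same size.
extra = oversample(fraud, len(legit) - len(fraud), rng)
fraud_balanced = fraud + extra
print(len(legit), len(fraud_balanced))
```

A classifier trained on the balanced set sees as many fraudulent as legitimate examples, so the rare pattern is no longer drowned out. The interpolated rows only fill in the region between known fraud cases; they cannot invent fraud patterns that the original data never contained.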

Regarding the role of technology in fairness, Alexandra Ebert, Chief Trust Officer of MOSTLY AI, said, “Contrary to what some might think, AI isn’t the root cause for discrimination. It is merely picking up on patterns we’ve been guilty of long before the first algorithm was born.

“Technology is making this bias visible and risks enshrining them in an intransparent, self-reinforcing system. Synthetic data is a remedy that addresses the problem directly at its origin. By removing bias in the data used for AI training, synthetic copies allow us to perform these tasks fairly, efficiently, and safely.”

Synthetic data can rebalance datasets to reflect reality not as it is, but as we would like to see it. This rebalancing capability is critical not only for the correction of embedded human biases but also for historical anomalies.

Synthetic data can also be shaped to mitigate any potential violation of fairness. The result is fair synthetic data that is fully anonymous and de-biased (in accordance with a specific fairness definition). In other words, you can program the AI to deliver the desired fairness outcomes, thus correcting for bias at the source.

An important example of how bias can hurt populations is the case of the infamous COMPAS recidivism dataset used by judges to assess a convict’s likelihood of committing another crime after being released from prison. The original dataset skewed toward Black recidivism by 24 percent, overestimating the likelihood that a Black convict would commit another crime.

As a result, people of color who got a disproportionately worse score from the system had to stay in prison for longer. Using synthetic data, the company adjusted the generator using a parity correction technique and reduced the difference to only 1 percent. While the concept of fairness by design is still in its infancy, we can expect to see regulators picking up this issue very soon, just like they did with data privacy.
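The kind of gap described above can be measured and narrowed along the following lines. The sketch assumes demographic parity as the fairness definition (every group should receive the high-risk label at the same rate) and uses plain resampling to reach it. This is an illustration only: the article does not describe how the parity correction was actually built into the generator, and the toy data, the group names and the helper functions are hypothetical.

```python
import random

def positive_rate(rows, group):
    """Share of rows in `group` that carry the high-risk label (label == 1)."""
    labels = [label for g, label in rows if g == group]
    return sum(labels) / len(labels)

def parity_gap(rows):
    """Absolute difference in high-risk rates between groups A and B."""
    return abs(positive_rate(rows, "A") - positive_rate(rows, "B"))

def resample_to_parity(rows, target_rate, n_per_group, rng):
    """Draw a new dataset in which every group has the same share of positive labels."""
    out = []
    for group in ("A", "B"):
        pos = [r for r in rows if r[0] == group and r[1] == 1]
        neg = [r for r in rows if r[0] == group and r[1] == 0]
        n_pos = round(target_rate * n_per_group)
        out += rng.choices(pos, k=n_pos) + rng.choices(neg, k=n_per_group - n_pos)
    return out

# Hypothetical toy data as (group, label); group A is labeled high risk more often.
rows = [("A", 1)] * 55 + [("A", 0)] * 45 + [("B", 1)] * 35 + [("B", 0)] * 65
rng = random.Random(2)
fair = resample_to_parity(rows, target_rate=0.45, n_per_group=100, rng=rng)
print("gap before:", round(parity_gap(rows), 2), "gap after:", round(parity_gap(fair), 2))
```

Demographic parity is only one of several competing fairness definitions, and which one applies is a policy decision rather than a technical one. A generator-level correction would also have to keep the other attributes of each record realistic, which simple resampling of labels does not address.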

“We believe that synthetic data will democratize the insights locked up in data silos to further advance innovation, collaboration, and science,” says Ebert.

“Access to good quality data is what sets successful companies apart.”

