How AI Is Helping Anonymize Clinical Trial Submissions

Data breaches hit record numbers during the pandemic, and a recent IBM report found that the cost of a breach has shot up as well.

Healthcare tops the list of most affected industries, with the average breach costing $9.2 million. Sensitive customer data was the most common type of information exposed in these breaches.

Pharmaceutical and healthcare firms operate under stringent guidelines that mandate the protection of patient data; hence any breach can prove costly. For example, throughout the drug discovery phase, firms collect, process, and store personally identifiable information (PII). When firms make clinical submissions at the end of the trials, they must protect patient privacy in the published results.

Regulations such as the European Medicines Agency’s (EMA) Policy 0070 and Health Canada’s Public Release of Clinical Information (PRCI) lay out specific recommendations for anonymizing data to minimize the risk of re-identifying patients from published results.

In addition to advocating data privacy, these regulations mandate sharing of trial data to enable the community to build upon the research. This poses a dilemma for companies.

How can pharma firms balance data privacy with transparency while publishing results in a timely, cost-efficient manner? Here’s where artificial intelligence (AI) can help, potentially cutting submission effort by over 97%.

Why anonymizing clinical study reports (CSRs) is tough

There are three key challenges firms run into while anonymizing clinical submissions:

1. It’s tricky to handle unstructured data: A significant part of clinical trial data is unstructured. Study results have large volumes of textual data, scanned pictures, and tables that are not easy to process. Identifying sensitive information from a 1500-page report is like finding needles in a haystack. Further, no standardized technology solutions exist to automate this process.

2. The manual process is tedious and error-prone: Today, pharma firms employ hundreds of people to anonymize clinical study submissions. Teams go through over 25 complex steps, which could take up to 45 days for a typical summary document. When people manually scrutinize thousands of pages, it is tedious and often error-prone.

3. Regulatory guidelines are open to interpretation: While the regulations lay out detailed recommendations, the specifics aren’t crystal clear. For example, Health Canada’s PRCI requires the risk of re-identification to be kept below 9%. However, the guidelines say little about how this risk score should be computed.

Let’s look at the ingredients of an anonymization solution that can tackle these roadblocks.

Leveraging augmented analytics to anonymize clinical submissions

Here are the three aspects that can help arrive at a technology-driven anonymization solution:

a) AI language models for natural language processing (NLP)

Today, AI can create like an artist and diagnose like a physician. Deep learning techniques power many of these advances. AI language models – the family of algorithms that process human language – are remarkably good at detecting named entities such as patient names, social security numbers, and zip codes.

Many of these powerful AI models are freely available in the public domain. They are often trained on public corpora such as Wikipedia or MIMIC-III v1.4, a database of de-identified records from about 40,000 patients. To improve performance, these models must be retrained on internal clinical trial reports under the supervision of domain experts.
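As a toy illustration of the detection-and-redaction interface such models plug into, here is a minimal pattern-based stand-in. The patterns and labels are assumptions for illustration only; a real pipeline would swap `detect_pii` for a trained NER model fine-tuned on internal reports.

```python
import re

# Hypothetical patterns standing in for a trained NER model's entity types.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def detect_pii(text):
    """Return (label, start, end, match) tuples for each PII hit, by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])

def redact(text, hits):
    """Replace each detected span with its label, working right to left
    so earlier character offsets stay valid."""
    for label, start, end, _ in sorted(hits, key=lambda h: -h[1]):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

sample = "Patient SSN 123-45-6789, zip 02139, enrolled 2021-03-15."
print(redact(sample, detect_pii(sample)))
# → Patient SSN [SSN], zip [ZIP], enrolled [DATE].
```

The same two-step shape – detect spans, then redact or replace them – carries over unchanged when the detector is a deep learning model.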

b) Human-in-the-loop design to improve accuracy

The 9% risk threshold mandated by Health Canada’s PRCI could translate into a model accuracy of approximately 95% (often measured as recall, or sensitivity). AI algorithms improve by seeing more data and running more training cycles. However, technical improvements alone cannot make them reliable enough for clinical use – they need human support.
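Recall is the metric to watch here, because the costly error in anonymization is missing a real identifier (a false negative). A minimal sketch, with illustrative numbers that are not from the article:

```python
def recall(true_entities, predicted_entities):
    """Fraction of true PII entities the model actually found."""
    true_set, pred_set = set(true_entities), set(predicted_entities)
    if not true_set:
        return 1.0
    return len(true_set & pred_set) / len(true_set)

# Illustrative: 19 of 20 annotated entities detected -> 95% recall,
# the rough level discussed above.
truth = {f"ent{i}" for i in range(20)}
preds = truth - {"ent7"}
print(recall(truth, preds))  # → 0.95
```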

To tackle the subjectivity of clinical trial data and improve outcomes, design analytics solutions to work alongside humans – an approach called augmented intelligence. Keep humans in the loop not just to label data and train the models but also to share periodic feedback while the solution is live. This steadily improves model accuracy and outcomes.

c) Partnerships for collaborative problem-solving

Let’s say a study has 1000 patients, 98% from the United States and the rest from South America. Should data about these 20 patients be redacted (blacked out) or anonymized? Is it better to aggregate patients at the country or the continent level? And what’s the risk of an attacker combining these anonymized details with other information, such as zip code or age, to re-identify patients?

Unfortunately, there are no standard answers to such questions. For better clarity about the interpretation of clinical submission guidelines, collaborate with industry stakeholders – for example, researchers at pharmaceutical manufacturers, clinical research organizations (CROs), technology solution providers, and academia.
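One common way to reason about the combination risk raised above is k-anonymity: group records by their quasi-identifiers (country, age band, and so on) and check how many patients share each combination. A minimal sketch with invented toy values:

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Group records by their quasi-identifier values; a record's group
    size is how many patients it is indistinguishable from."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return Counter(keys)

# Toy cohort (values are illustrative, not trial data).
records = [
    {"country": "US", "age_band": "40-49"},
    {"country": "US", "age_band": "40-49"},
    {"country": "US", "age_band": "40-49"},
    {"country": "BR", "age_band": "40-49"},  # a unique combination
]
classes = equivalence_classes(records, ["country", "age_band"])
k = min(classes.values())  # k-anonymity of the dataset
print(k)  # → 1: one patient is singled out by these two fields alone
```

A k of 1 means at least one patient is uniquely identifiable from the chosen quasi-identifiers, which is exactly the situation generalization (for example, aggregating to the continent level) is meant to avoid.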

An AI-driven approach for anonymization

Let’s now piece together the above building blocks into a solution workflow. This is based on the approach we adopted in our work to build a technology-driven anonymization solution.

Clinical study reports contain structured data (numbers and identified entities such as demographics or addresses) and the unstructured data elements we discussed earlier. Both must be processed to identify sensitive named entities. While detection is straightforward for structured data, unstructured data needs AI algorithms.

The unstructured data, typically in formats such as scanned images or PDFs, are first converted into a readable form using techniques like optical character recognition (OCR) or computer vision. Then, AI algorithms are applied to the documents to detect personally identifiable information. To improve algorithm performance, users share feedback on sample results. The samples are picked from cases where the model confidence is low.
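The low-confidence sampling step can be sketched as follows. The threshold and batch size are assumptions that would be tuned per project:

```python
def sample_for_review(predictions, threshold=0.6, max_items=5):
    """Pick the model's least confident detections for expert review.
    Each prediction is a (text_span, label, confidence) tuple."""
    uncertain = [p for p in predictions if p[2] < threshold]
    return sorted(uncertain, key=lambda p: p[2])[:max_items]

# Illustrative model output, least certain cases first in the result.
preds = [
    ("John Smith", "NAME", 0.98),
    ("02139", "ZIP", 0.55),
    ("site 4A", "LOCATION", 0.31),
    ("2021-03-15", "DATE", 0.90),
]
for span, label, conf in sample_for_review(preds):
    print(f"review: {span!r} as {label} (confidence {conf:.2f})")
```

Routing only the uncertain cases to reviewers is what keeps the human effort small while still feeding corrections back into retraining.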

Once the anonymizations are done, the risk of re-identification is assessed, usually within the context of a reference population built by pooling data from similar trials. The assessment considers three types of risk – prosecutor, journalist, and marketer – each representing a different scenario in which sensitive patient information could be re-identified.
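A simplified prosecutor-style calculation assigns each record a risk of 1/f, where f is the size of its equivalence class, and then looks at the maximum and average. This sketch covers only the sample itself; a full PRCI-style assessment would also model the journalist and marketer scenarios against the reference population:

```python
from collections import Counter

def reidentification_risks(records, quasi_identifiers):
    """Simplified prosecutor-style risk: each record's risk is 1/f for
    its equivalence-class size f. Returns (max_risk, average_risk)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    risks = [1 / sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)

# Toy records, not trial data.
records = [
    {"country": "US", "sex": "F"},
    {"country": "US", "sex": "F"},
    {"country": "US", "sex": "M"},
    {"country": "US", "sex": "M"},
    {"country": "US", "sex": "M"},
]
max_risk, avg_risk = reidentification_risks(records, ["country", "sex"])
print(max_risk, avg_risk)  # → 0.5 0.4
```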

Until the risk level falls below the suggested threshold of 9%, the anonymization process is repeated in cycles by bringing in more business rules and algorithm improvements. The entire anonymization process is built into a repeatable workflow by integrating with other technology applications and setting up a machine learning operations (MLOps) process.
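The iterative loop can be sketched as below. The generalization rules and the average-risk formula (1/equivalence-class size) are illustrative assumptions standing in for real business rules and a full risk assessment:

```python
from collections import Counter

def avg_risk(records, qids):
    """Average 1/f risk over equivalence classes (a simplification)."""
    keys = [tuple(r[q] for q in qids) for r in records]
    sizes = Counter(keys)
    return sum(1 / sizes[k] for k in keys) / len(keys)

def generalize_age(record):
    """Illustrative rule: replace exact age with a decade band."""
    r = dict(record)
    r["age"] = f"{(int(r['age']) // 10) * 10}s"
    return r

def suppress_age(record):
    """Illustrative rule: drop age entirely."""
    r = dict(record)
    r["age"] = "*"
    return r

def anonymize_until_safe(records, rules, qids, threshold=0.09):
    """Apply rules one cycle at a time until risk clears the threshold."""
    for rule in rules:
        if avg_risk(records, qids) <= threshold:
            break
        records = [rule(r) for r in records]
    return records, avg_risk(records, qids)

# Twelve toy patients with distinct ages: fully identifiable at first.
records = [{"country": "US", "age": str(40 + i)} for i in range(12)]
safe, risk = anonymize_until_safe(
    records, [generalize_age, suppress_age], ["country", "age"])
print(round(risk, 3))  # → 0.083, below the 9% threshold
```

Each cycle applies a stronger rule only if the previous one left the risk above the threshold, mirroring the repeat-until-safe workflow described above.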

Data quality, a bigger challenge than algorithm complexity

When implemented for a pharma company, the above anonymization solution approach led to potential savings of 97% in submission turnaround time. This semi-automated workflow delivered efficiency improvements while keeping humans in the loop. But what was the biggest challenge in building the AI-driven anonymization solution?

As with most data science implementations, the real hurdle faced in this effort was not with AI algorithms for identifying named entities. The challenge was getting the study reports into a readable format – good quality data for the AI to process. With documents across different formats, styles, and structures, the pipeline built for ingesting these documents often failed.

The solution had to be constantly fine-tuned for reading yet another document encoding format. Or to detect where columns began and ended in tables scanned as pictures. Clearly, this is an area where considerable time and effort must be budgeted while building such solutions.

Emerging challenges for clinical study anonymization

With rapid technological advances, would anonymization of clinical study submissions become easier and more efficient? While the improving sophistication of AI-driven solutions offers a lot of promise, there are some emerging challenges to watch out for.

The explosion of consumer data online through social media, device usage, and online tracking increases the risk of re-identification. Attackers could combine these public details with clinical study data to single out patients. It’s worrying that advances in AI research are often tapped by hackers and attackers even before pharma practitioners leverage them for data privacy.

Finally, regulations are evolving and branching out into country-specific variations. We might soon have countries announcing their versions of the clinical submission anonymization regulations. This could increase the complexity and cost of adherence for companies.

