Shining a Light on Dark Data | How Pfizer collects analytical knowledge for access and reuse from anywhere, at any time

February 23, 2021

by Sanji Bhal, Director, Marketing & Communications, ACD/Labs

Analytical data management is no longer a “nice-to-have.” I recently had the pleasure of hosting a webinar with colleagues from Pfizer: Vijay Bulusu, Pankaj Aggarwal, and David Foley spoke about how Pfizer’s Scientific Data Cloud (SDC) is helping scientists access information quickly and easily to generate insights from data, respond to regulatory inquiries in a timely manner, and accelerate the product development pipeline. ACD/Spectrus technology is central to the handling of analytical data within the SDC. The topics discussed in the webinar included:

Missed opportunities, regulatory drivers, and challenges that led to Pfizer reimagining scientific data management
Goals for the Pfizer Scientific Data Cloud (SDC)
Why Pfizer chose Spectrus for analytical chemistry data handling
Benefits to Pfizer of Spectrus-enabled scientific data management

Watch the full webinar here or read the highlights below.

Why Pfizer Wanted to Reimagine their Scientific Data Management

Dark data

Large enterprises face many data-related challenges around how employees find data relevant to their work, especially data generated by others. Employees may not (a) know whether the data exists in the first place, (b) know where it is kept, (c) have access to the data repositories, or (d) have training on how to use the data.

“In these situations,” described Vijay, “we resort to the sneakernet. Colleagues looking for information either pick up the phone or send an email or message to other colleagues who they think may have better access to that data or information.”

Results from polling the webinar audience showed that analytical data is managed in a variety of systems by those in R&D (Figure 1).

Figure 1. A poll of the webinar audience showed that analytical data is stored in lots of different systems

“We had the same issues and challenges when it comes to finding data…it’s often stored on individual computers or hard drives. We had the dark data challenge,” said Vijay.

Gartner coined the term “dark data” to describe the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.[1]

Analytical data management—an enterprise/organizational challenge

When the SDC project began, Pfizer scientists were still exchanging documents and files manually. Data stayed in department, group or site silos. Some information flowed forwards—from research to development to manufacturing—but little information flowed backwards.

Figure 2. A poll of the audience revealed their opinions on the pain of analytical data management in their organizations.

Asked to rank the pain of analytical data management at Pfizer at the outset of the SDC project, Vijay countered that the question is very context-specific. “At an individual level, you could argue that analytical data management is not painful because you know where you store your data, and you know how to get access to that same data. The question from an enterprise perspective, at a larger scale, is where it gets more challenging because it’s harder to find and reuse data generated by others.”

Data scattered geographically & throughout IT systems

Additionally, analytical groups at Pfizer, as at every other large R&D organization, are geographically scattered. The Structure Elucidation group, which serves global development and clinical manufacturing, consists of 11 NMR and mass spectroscopists located across 2 sites: Groton (CT) in the US and Sandwich in the UK. “Between us we probably elucidate structure for over 600 compounds each year. We have the challenge of these two groups being in very separate locations, and also separate time zones,” said Dave. With chromatographic data the problem is even greater, with many more scientists collecting thousands of chromatograms in the US and UK. Not only is the data scattered, but projects also change hands.

“Oftentimes projects might be in early stage development in the US, and then move to our UK site for late stage development, and vice versa. Efficient transfer of data between both sites is becoming more and more important,” added Pankaj.

Vijay also observed that Pfizer has analytical instruments from a variety of vendors, each with their own formats and data types, making comparison and interoperability difficult and time-consuming.

How much of a problem was scattered data, and what are some of the consequences?

“Oftentimes, we would find that it was easier to re-record the sample and repeat the whole analysis, rather than trying to find that original piece of raw data across multiple systems,” said Dave. “At the outset of the project we knew we were missing opportunities because the value-added information was very disparate—scattered on people’s laptops, on analytical instruments, and in people’s brains.”

Regulatory expectations, data integrity & quality

Regulatory agency expectations, data integrity, and data quality were all drivers to enhance data management.

“Over the years,” shared Vijay, “we have seen an uptick in requests for original raw data files from regulatory agencies. These raw data files were spread across multiple systems and repositories which made speedy access challenging. As an additional stressor, there was growing concern about product quality if data could not be located quickly during regulatory inspections. Data integrity was also becoming a more prevalent issue.”

Pfizer also wanted to focus on data quality. “It’s not enough to store data and make it accessible. We also wanted to make sure that if I generate some data and my colleague finds it six months later, or six years later, the context around how and why the data was collected and captured [is] also clear,” said Vijay.

Speed of innovation

The pharmaceutical industry wants to speed up innovation and fill pipelines with new therapies. Pfizer expected a cloud-based data management and analysis system to help keep fast-moving, simultaneous projects on track by providing efficient access to data.

Those organizational goals impact every individual. “Everyone is super busy; a lot of programs are being accelerated,” agreed Dave, “trying to make time and finding efficiencies is very important. One of the main goals was to eliminate duplication of structure assignment work—that is literally a waste of time.”

Scientific insights, not just data management

“People were trying to get away from the burden of managing data—tagging data, metadata management, etc.,” said Vijay. “Users were saying, ‘give me access to the data so I can analyze it to gain new insights,’ and our collection of scientific data systems and data repositories were not adequately capable of supporting this transition.” A new system built with cutting-edge technology would support on-demand analysis of existing data.

To help appreciate this, Vijay compared scientific data management with taking a photo on your cell phone. “When was the last time you took a picture or video on your smartphone and then had to name the file, organize the folders in which you were to put that picture or video; and by the way, remember that information forever? That’s pretty much how data management was happening. Individual scientists had to remember where they were saving their information. And in this day and age, we argued, there has to be a better way of managing scientific data.”

SDC Project Goals

Pfizer’s vision was to replace manual, siloed information exchange with an automated, centralized cloud-based system (the SDC). Data generated from lab instruments and manufacturing equipment was to be automatically swept into the SDC, stored, tagged, processed, and formatted for use. This data could then be made available to other systems for reporting, prediction and modelling, and analysis, without requiring tedious transcription to Excel and other analytics tools. The SDC needed to be GMP-compliant and to store and analyze multiple data types: spectroscopy data (NMR, MS, etc.), chromatography data (LC, UV, etc.), characterization data (PXRD, etc.), and manufacturing data (historians, PAT, etc.). It also needed to handle different formats: structured, unstructured, reaction schemes, chemical structures, etc.

The SDC was also designed according to FAIR principles[2]: the data was to be Findable, Accessible, Interoperable, and Reusable.
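As a rough illustration of what tagging data at ingestion might look like under FAIR principles, the Python sketch below attaches descriptive metadata to a raw instrument file. Every field name, identifier, and file name here is hypothetical; the webinar did not describe the SDC's actual schema or implementation.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass
class DatasetRecord:
    """Hypothetical metadata captured when a raw file is swept from an instrument."""
    file_name: str
    checksum: str        # SHA-256 of the raw bytes, useful for integrity checks
    technique: str       # e.g. "NMR", "LC", "PXRD"
    instrument_id: str
    site: str
    compound_id: str     # placeholder for an internal compound number
    eln_record: str      # placeholder for an ELN record number
    acquired_at: str     # ISO 8601 timestamp supplied by the caller
    tags: list = field(default_factory=list)


def tag_dataset(raw_bytes, file_name, **context):
    """Build a record for one file; `context` carries the descriptive fields."""
    checksum = hashlib.sha256(raw_bytes).hexdigest()
    return DatasetRecord(file_name=file_name, checksum=checksum, **context)


def to_json(record):
    """Serialize the record so other systems can index and search it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Example with made-up values
record = tag_dataset(
    b"raw instrument bytes",
    "sample_001.fid",
    technique="NMR",
    instrument_id="NMR-01",
    site="Groton",
    compound_id="COMPOUND-0001",
    eln_record="ELN-0001",
    acquired_at="2021-02-23T10:00:00+00:00",
    tags=["structure-elucidation"],
)
print(to_json(record))
```

Capturing context (who, where, which compound, which instrument) alongside the file is what lets a colleague interpret the data months or years later without relying on the original scientist's memory.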

Why Pfizer Chose Spectrus for analytical and chemical data handling

“We are analytical chemists in Pharma [working] on small molecules… everything is based on the structure.”
Pankaj A.

The Spectrus Platform integrates data from NMR, MS, LC, GC, UV, and more
Pfizer scientists had existing site licenses for ACD/Labs Spectrus software so they were familiar with the ACD/Labs platform
Scientists can search Spectrus databases using chemically intelligent parameters: chemical structure, substructure, and spectral elements (retention time, peak, peak area, etc.)
Parameters central to scientists’ workflows and searches could be integrated into the solution; at Pfizer, it was important to include ELN record numbers and PF numbers
Applications on the Spectrus Platform can use routine spectral data to speed up workflows and provide insights to support faster decisions; for example, training the NMR prediction database can support faster (or automated) structure verification.

“Assigned NMR data automatically feeds into ACD/Labs’ NMR training database for chemical shift prediction and can be tweaked by subject matter experts (SMEs). By introducing Pfizer specific compounds with associated chemical shifts into the database we’re honing the predictions.”
David F.

Real benefits of Spectrus-enabled scientific data management at Pfizer

Centralized, reusable analytical data

Data is automatically swept to the cloud from analytical instruments—“the scientist doesn’t have to deal with all the data conversions or integrations…and the complexity of the scientist needing to know the source of the data is removed. We have been able to get all the data into a central location making it available to the general user who can use it for numerous different applications,” said Vijay.

“I don’t need to go into my ELN, to call my friend and ask for those method conditions or search for the method conditions from LIMS. I have it readily available and I can go and try to repeat the method in the lab,” added Pankaj.

Facile search across data collected over multiple sites, groups, instruments, and data types

“We have a fully searchable, gold standard database of NMR and mass spectra. It can be searched by PF number (Pfizer compound number), structure, and substructure. Scientists can do spectral matches. They can input NMR or mass spec and quickly answer the question ‘is there a spectrum or compound in the database that matches closely for this?’ Time spent searching for information is reduced way down,” explained David.
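To make the idea of a spectral match concrete, here is a minimal Python sketch that scores a query mass spectrum against a small library using cosine similarity on binned peaks. This is a generic textbook approach, not a description of how Spectrus performs matching; the compound names, peak lists, and bin width are all made up for illustration.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of fixed width."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a, b):
    """Cosine of the angle between two binned spectra (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_peaks, library, bin_width=1.0):
    """Return (score, name) for the closest library entry; library must be non-empty."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks, bin_width)), name)
        for name, peaks in library.items()
    ]
    return max(scored)


# Hypothetical library: name -> list of (m/z, intensity) pairs
library = {
    "compound-A": [(91.05, 100.0), (119.08, 40.0)],
    "compound-B": [(77.04, 100.0), (105.03, 60.0)],
}
score, name = best_match([(91.06, 90.0), (119.09, 35.0)], library)
print(name, round(score, 3))
```

A score near 1 means the query and library spectra share peaks with similar relative intensities; a score of 0 means no binned peaks overlap.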

Increased efficiency through elimination of duplicated effort and repeat experiments

Pankaj described how access to a centralized library of methods data is resulting in significant time savings. “If I’m trying to develop a chiral method I can search the database to help answer the question ‘does a chiral method already exist?’ If I’m in development, a method might have already been developed in the research environment. [Now] I can just draw that structure and search the database to see if there is a pre-existing method that I can use or start with instead of developing a new chiral method from scratch.”

Insights are being drawn from analytical data

Pankaj continued, “The type of trend analysis I’m able to do with the Spectrus library would take a significant amount of time before. I would have to go across multiple systems like ELN, LIMS, Empower, assemble all that data into an Excel file and plot it simply to learn that chromatographic peak tailing was only seen from a particular LC instrument.”
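Once results from many instruments sit in one place, this kind of trend check becomes a simple grouping exercise. The Python sketch below averages a peak tailing factor per instrument and flags the outliers. The record layout, instrument names, values, and the 1.5 threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_tailing_by_instrument(records):
    """Average the tailing factor of all peaks recorded on each instrument."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["instrument"]].append(record["tailing_factor"])
    return {instrument: mean(values) for instrument, values in grouped.items()}


def flag_instruments(records, threshold=1.5):
    """Return instruments whose mean tailing factor exceeds the threshold."""
    means = mean_tailing_by_instrument(records)
    return sorted(inst for inst, value in means.items() if value > threshold)


# Made-up results pooled from several instruments
records = [
    {"instrument": "LC-01", "tailing_factor": 1.1},
    {"instrument": "LC-01", "tailing_factor": 1.2},
    {"instrument": "LC-02", "tailing_factor": 1.9},
    {"instrument": "LC-02", "tailing_factor": 2.1},
    {"instrument": "LC-03", "tailing_factor": 1.0},
]
print(flag_instruments(records))
```

With the data scattered across ELN, LIMS, and chromatography systems, the grouping itself is trivial; the time goes into assembling the records, which is the step a centralized library removes.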

Watch the Webinar

Learn more about the Spectrus Platform to see how this expert system can help you with more effective analytical data management.

[1] https://www.gartner.com/en/information-technology/glossary/dark-data

[2] https://www.nature.com/articles/sdata201618

About the Author

Sanji Bhal is the Director of Marketing & Communications at ACD/Labs. Prior to joining ACD/Labs she was a medicinal chemist at Signalgene Inc., where she pursued her ongoing interest in cancer research, followed by a stint with the CRO NAEJA Pharmaceuticals. Sanji began her career in the U.K., completing her Ph.D. in synthetic organic chemistry at the University of Reading, and a post-doctoral fellowship at Cancer Research UK.