Development of natural language processing (NLP) models for extracting key features from unstructured notes to create real-world data (RWD) assets for clinical research at scale.

Authors

Smita Agrawal

ConcertAI, Bengaluru, Karnataka, India

Smita Agrawal, Rohini George, Vivek Prabhakar Vaidya, Sangavai Chakkrapani, Rambaksh Prajapati, Srikanth Tankala, Dhaval Parmar, Vinay Phani Santosh Lakkimsetty, Tapasya Bhardwaj, Ashwani Ashwani, Emma Mendonca, Babu Narayanan, Krishna Kumar Swaminathan, Pranay Mukherjee

Organizations

ConcertAI, Bengaluru, Karnataka, India; ConcertAI, Bengaluru, India; Syneos Health, Bengaluru, India; Merck & Co., Inc., Pune, India; Self, Bengaluru, India

Research Funding

No funding received

Background: RWD derived from electronic health records (EHR) contains detailed clinical information about patient journeys that can assist in clinical research, trial design, and safety assessments. However, much of the vital information is locked away in unstructured clinical text and must be converted to a structured format to be useful for downstream applications. We demonstrate how this can be achieved at scale and with a high degree of accuracy through NLP.

Methods: NLP models were developed to extract data for 11 clinical variables from the unstructured notes of ~98k lung cancer patients, and the extracted data were merged with the structured data into a common data model (Table). The models combined domain knowledge with rule-based, machine learning, and deep learning approaches. The improvement from NLP was quantified as the increase in fill rate per variable over structured data alone. Model accuracy was assessed against a manually curated dataset comprising 752 patients.

Results: The NLP models significantly improved the fill rate of key clinical variables and extracted the information from clinical notes with high accuracy (Table). For some variables, such as NSCLC/SCLC status, surgery, tumor grade, and histology, all or most of the data were extracted via NLP. Metastatic status via NLP covered distant metastasis, locally advanced disease, and no metastasis, whereas the structured data contained only distant metastasis. For Performance Status (PS), even though a significant number of patients had at least 1 PS value recorded in the structured data, NLP significantly increased longitudinal capture, thus increasing the density of this variable per patient.

Conclusions: NLP models can be developed and used to enrich structured RWD by extracting information from unstructured documents, significantly improving the utility of these data for downstream applications. Given the high accuracy of these models and the scale at which they can be run, they can be a good alternative to human curation, or can augment it, enabling the creation of very large-scale datasets for clinical research.
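
The accuracy assessment described in Methods (model output compared with a manually curated dataset) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the abstract does not define precision and recall, so the definitions below (precision over patients for whom the model extracted a value, recall over patients for whom the curators recorded one) are assumptions, and the function name, patient IDs, and labels are hypothetical.

```python
from typing import Dict, Optional, Tuple

Label = Optional[str]  # None means "no value extracted / no value curated"


def precision_recall(predicted: Dict[str, Label],
                     gold: Dict[str, Label]) -> Tuple[float, float]:
    """Patient-level precision and recall for one extracted variable.

    Assumed definitions (not stated in the abstract):
      precision = correct extractions / patients with any extracted value
      recall    = correct extractions / patients with a curated value
    Only patients present in the curated (gold) set are scored.
    """
    n_pred = sum(1 for pid in gold if predicted.get(pid) is not None)
    n_gold = sum(1 for value in gold.values() if value is not None)
    n_correct = sum(
        1 for pid, value in gold.items()
        if value is not None and predicted.get(pid) == value
    )
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical toy data for the NSCLC / SCLC variable.
gold = {"p1": "NSCLC", "p2": "SCLC", "p3": "NSCLC", "p4": None, "p5": "NSCLC"}
predicted = {"p1": "NSCLC", "p2": "SCLC", "p3": None, "p4": "SCLC", "p5": None}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the toy data the model extracts a value for three patients and two of them match the curators, while four patients have a curated value, so precision and recall differ.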

Performance of NLP models and their contribution to enriching structured RWD.
NLP Field (# of patients = 98676)	Stage at Dx	T Stage at Dx	N Stage at Dx	M Stage at Dx	NSCLC / SCLC	Tumor Histology	Tumor Grade	Metastatic Status	Metastatic Site	Lung Cancer Surgery	PS
# of unique patients in RWD	57065	50139	51897	55035	0	10534	2771	34067	31510	0	70773
# of unique patients in RWD-NLP	83864	66138	66724	66593	88795	94677	56662	92627	47004	22844	82679
% contribution from NLP	32	20.9	22.2	17.4	100	88.9	95	63.2	33	100	58*
Precision/Recall	0.92/0.87	0.92/0.83	0.89/0.85	0.9/0.81	0.98/0.91	0.87/0.88	0.91/0.90	0.88/0.87	0.94/0.97	0.87/0.67	0.97**

* Calculated based on patients where at least 1 PS value was added by NLP. ** Accuracy.
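
The "% contribution from NLP" row can be approximated from the two patient-count rows. The abstract does not state the formula; the sketch below assumes it is the share of patients in the combined RWD-NLP asset who have no value in structured RWD alone. Under that assumption the counts reproduce the reported figures for the columns shown here, but not for every column (PS is calculated differently per the footnote, and the T Stage figure does not reproduce). The function and variable names are hypothetical.

```python
# Patient counts copied from the table for a subset of fields.
rwd_only = {
    "Stage at Dx": 57065,
    "N Stage at Dx": 51897,
    "M Stage at Dx": 55035,
    "Tumor Histology": 10534,
    "Metastatic Status": 34067,
}
rwd_plus_nlp = {
    "Stage at Dx": 83864,
    "N Stage at Dx": 66724,
    "M Stage at Dx": 66593,
    "Tumor Histology": 94677,
    "Metastatic Status": 92627,
}


def nlp_contribution_pct(structured: int, combined: int) -> float:
    """Percent of patients in the combined asset supplied only by NLP.

    Assumed formula: 100 * (combined - structured) / combined.
    """
    if combined <= 0:
        raise ValueError("combined patient count must be positive")
    return 100.0 * (combined - structured) / combined


for field, structured in rwd_only.items():
    pct = nlp_contribution_pct(structured, rwd_plus_nlp[field])
    print(f"{field}: {pct:.1f}%")
```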

Abstract Details

Meeting

2023 ASCO Annual Meeting

Session Type

Poster Session

Session Title

Health Services Research and Quality Improvement

Track

Quality Care/Health Services Research

Sub Track

Real-World Data/Outcomes

Citation

J Clin Oncol 41, 2023 (suppl 16; abstr 6607)

DOI

10.1200/JCO.2023.41.16_suppl.6607

Abstract #

6607
