Development of natural language processing (NLP) models for extracting key features from unstructured notes to create real-world data (RWD) assets for clinical research at scale.

Authors

Smita Agrawal

ConcertAI, Bengaluru, Karnataka, India

Smita Agrawal, Rohini George, Vivek Prabhakar Vaidya, Sangavai Chakkrapani, Rambaksh Prajapati, Srikanth Tankala, Dhaval Parmar, Vinay Phani Santosh Lakkimsetty, Tapasya Bhardwaj, Ashwani Ashwani, Emma Mendonca, Babu Narayanan, Krishna Kumar Swaminathan, Pranay Mukherjee

Organizations

ConcertAI, Bengaluru, Karnataka, India; ConcertAI, Bengaluru, India; Syneos Health, Bengaluru, India; Merck & Co., Inc., Pune, India; Self, Bengaluru, India

Research Funding

No funding received

Background: RWD derived from Electronic Health Records (EHR) contain detailed clinical information about patient journeys that can assist in clinical research, trial design, and safety assessments. However, much of the vital information is locked away in unstructured clinical text and needs to be converted to a structured format to be useful for downstream applications. We demonstrate how this can be achieved at scale and with a high degree of accuracy through NLP.

Methods: NLP models were developed to extract data for 11 clinical variables from the unstructured notes of approximately 98,000 lung cancer patients, and the extracted data were merged with the structured data into a common data model (Table). The models combined domain knowledge, rule-based models, machine learning models, and deep learning models. The increase in fill rate per variable over structured data alone was used to quantify the improvement from NLP. The accuracy of the models was assessed against a manually curated dataset comprising 752 patients.

Results: The NLP models significantly improved the fill rate of key clinical variables and extracted the information from clinical notes with high accuracy (Table). For some variables, such as NSCLC/SCLC status, surgery, tumor grade, and histology, all or most of the data was extracted via NLP. Metastatic status via NLP covered distant metastasis, locally advanced disease, and no metastasis, whereas the structured data contained only distant metastasis. For Performance Status (PS), even though a significant number of patients had at least 1 PS recorded in the structured data, NLP significantly increased longitudinal capture, thus increasing the density of this variable per patient.

Conclusions: NLP models can be developed and used to enrich structured RWD by extracting information from unstructured documents, significantly improving the utility of these data for downstream applications. Given the high accuracy of these models and the scale at which they can be run, they can be a good alternative to human curation, or can augment it, enabling the creation of very large-scale datasets for clinical research.
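The abstract does not describe how precision and recall were computed against the curated dataset. The following is a minimal sketch of one plausible patient-level comparison for a single variable, assuming one value per patient and exact matching; the function name, patient IDs, and example values are hypothetical and are not taken from the study.

```python
from typing import Optional


def precision_recall(
    predicted: dict[str, Optional[str]],
    curated: dict[str, Optional[str]],
) -> tuple[float, float]:
    """Patient-level precision and recall for one extracted variable.

    Assumption: a prediction is a true positive only if it exactly matches
    the curated value for the same patient. None means "no value".
    """
    tp = fp = fn = 0
    for patient_id, gold in curated.items():
        pred = predicted.get(patient_id)
        if pred is not None and pred == gold:
            tp += 1
        else:
            if pred is not None:
                fp += 1  # NLP produced a value that is wrong or unsupported
            if gold is not None:
                fn += 1  # curated value was missed or extracted incorrectly
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: four curated patients for tumor histology.
curated = {"p1": "adenocarcinoma", "p2": "squamous", "p3": None, "p4": "adenocarcinoma"}
predicted = {"p1": "adenocarcinoma", "p2": "adenocarcinoma", "p3": "squamous", "p4": None}

precision, recall = precision_recall(predicted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this convention a wrong value counts against both precision and recall, which is stricter than scoring only whether a value was found; other conventions are equally reasonable.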

Performance of NLP models and their contribution to enriching structured RWD.

| NLP Field (# of patients = 98676) | # of unique patients in RWD | # of unique patients in RWD-NLP | % contribution from NLP | Precision/Recall |
|---|---|---|---|---|
| Stage at Dx | 57065 | 83864 | 32 | 0.92/0.87 |
| T Stage at Dx | 50139 | 66138 | 20.9 | 0.92/0.83 |
| N Stage at Dx | 51897 | 66724 | 22.2 | 0.89/0.85 |
| M Stage at Dx | 55035 | 66593 | 17.4 | 0.90/0.81 |
| NSCLC / SCLC | 0 | 88795 | 100 | 0.98/0.91 |
| Tumor Histology | 10534 | 94677 | 88.9 | 0.87/0.88 |
| Tumor Grade | 2771 | 56662 | 95 | 0.91/0.90 |
| Metastatic Status | 34067 | 92627 | 63.2 | 0.88/0.87 |
| Metastatic Site | 31510 | 47004 | 33 | 0.94/0.97 |
| Lung Cancer Surgery | 0 | 22844 | 100 | 0.87/0.67 |
| PS | 70773 | 82679 | 58* | 0.97** |

* Calculated based on patients where at least 1 PS value was added by NLP. ** Accuracy.
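The abstract does not give the formula behind "% contribution from NLP". As a hedged illustration, the sketch below assumes it is the share of patients in the combined RWD-NLP asset whose value came only from NLP, and computes fill rate as patients with a value divided by the 98,676-patient cohort. The patient counts are copied from the Table for four of the variables; the function names are hypothetical.

```python
TOTAL_PATIENTS = 98676  # cohort size reported in the Table

# variable -> (# unique patients in RWD, # unique patients in RWD-NLP), from the Table
counts = {
    "Stage at Dx": (57065, 83864),
    "Tumor Histology": (10534, 94677),
    "Metastatic Status": (34067, 92627),
    "Lung Cancer Surgery": (0, 22844),
}


def fill_rate(n_patients: int, total: int = TOTAL_PATIENTS) -> float:
    """Percentage of the cohort with a value for the variable."""
    return 100.0 * n_patients / total


def nlp_contribution(structured: int, combined: int) -> float:
    """Assumed definition: percentage of patients in the combined asset
    whose value was added by NLP rather than by structured data."""
    return 100.0 * (combined - structured) / combined


for variable, (structured, combined) in counts.items():
    print(
        f"{variable}: fill rate {fill_rate(structured):.1f}% -> "
        f"{fill_rate(combined):.1f}%, "
        f"NLP contribution {nlp_contribution(structured, combined):.1f}%"
    )
```

For the four variables shown, this assumed definition reproduces the reported contributions (32, 88.9, 63.2, and 100). The PS row follows a different rule, as its footnote states, so it is left out of the sketch.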

Abstract Details

Meeting

2023 ASCO Annual Meeting

Session Type

Poster Session

Session Title

Health Services Research and Quality Improvement

Track

Quality Care/Health Services Research

Sub Track

Real-World Data/Outcomes

Citation

J Clin Oncol 41, 2023 (suppl 16; abstr 6607)

DOI

10.1200/JCO.2023.41.16_suppl.6607

Abstract #

6607

Poster Bd #

99