ConcertAI, Bengaluru, Karnataka, India
Smita Agrawal, Rohini George, Vivek Prabhakar Vaidya, Sangavai Chakkrapani, Rambaksh Prajapati, Srikanth Tankala, Dhaval Parmar, Vinay Phani Santosh Lakkimsetty, Tapasya Bhardwaj, Ashwani Ashwani, Emma Mendonca, Babu Narayanan, Krishna Kumar Swaminathan, Pranay Mukherjee
Background: Real-world data (RWD) derived from electronic health records (EHRs) contains detailed clinical information about patient journeys that can assist in clinical research, trial design, and safety assessments. However, much of this vital information is locked away in unstructured clinical text and must be converted to a structured format to be useful for downstream applications. We demonstrate how this can be achieved at scale and with a high degree of accuracy through natural language processing (NLP).

Methods: NLP models were developed to extract data for 11 clinical variables from the unstructured notes of ~98k lung cancer patients, and the extracted data was merged with the structured data into a common data model (Table). The models combined domain knowledge, rule-based models, machine learning models, and deep learning models. The increase in fill rate per variable over structured data alone was used to quantify the improvement from NLP. Model accuracy was assessed against a manually curated dataset comprising 752 patients.

Results: The NLP models significantly improved the fill rate of key clinical variables and extracted information from clinical notes with high accuracy (Table). For some variables, such as NSCLC/SCLC status, surgery, tumor grade, and histology, all or most of the data was extracted via NLP. Metastatic status via NLP included distant metastasis, locally advanced disease, and no metastasis, whereas the structured data contained only distant metastasis. For performance status (PS), although a significant number of patients had at least one PS value recorded in the structured data, NLP significantly increased longitudinal capture, thus increasing the density of this variable per patient.

Conclusions: NLP models can be developed and used to enrich structured RWD by extracting information from unstructured documents, thus significantly improving the utility of this data for downstream applications. Given the high accuracy of these models and the scale at which they can be run, this approach can be a good alternative to human curation, or can augment human curation, enabling the creation of very large-scale datasets for clinical research.
NLP Field (# of patients = 98676) | Stage at Dx | T Stage at Dx | N Stage at Dx | M Stage at Dx | NSCLC / SCLC | Tumor Histology | Tumor Grade | Metastatic Status | Metastatic Site | Lung Cancer Surgery | PS |
---|---|---|---|---|---|---|---|---|---|---|---|
# of unique patients in RWD | 57065 | 50139 | 51897 | 55035 | 0 | 10534 | 2771 | 34067 | 31510 | 0 | 70773 |
# of unique patients in RWD-NLP | 83864 | 66138 | 66724 | 66593 | 88795 | 94677 | 56662 | 92627 | 47004 | 22844 | 82679 |
% contribution from NLP | 32 | 20.9 | 22.2 | 17.4 | 100 | 88.9 | 95 | 63.2 | 33 | 100 | 58* |
Precision/Recall | 0.92/0.87 | 0.92/0.83 | 0.89/0.85 | 0.9/0.81 | 0.98/0.91 | 0.87/0.88 | 0.91/0.90 | 0.88/0.87 | 0.94/0.97 | 0.87/0.67 | 0.97** |
* Calculated based on patients where at least 1 PS value was added by NLP. ** Accuracy.
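The abstract does not state how the "% contribution from NLP" row is calculated. As a minimal sketch, assuming it is the share of patients in the RWD-NLP dataset who have a value only because of NLP, the figure can be recomputed from the two patient-count rows. The function name `nlp_contribution_pct` and the dictionary layout are hypothetical, and only a subset of the variables is included. The PS column is excluded because its footnote indicates a different basis.

```python
# Patient counts copied from the table above (a subset of the variables).
# Each entry maps a variable to (patients in RWD, patients in RWD-NLP).
counts = {
    "Stage at Dx": (57065, 83864),
    "N Stage at Dx": (51897, 66724),
    "NSCLC / SCLC": (0, 88795),
    "Tumor Histology": (10534, 94677),
    "Lung Cancer Surgery": (0, 22844),
}


def nlp_contribution_pct(structured_only: int, with_nlp: int) -> float:
    """Assumed formula (not stated in the abstract): percentage of patients
    in the combined RWD-NLP dataset who were added by NLP."""
    if with_nlp == 0:
        return 0.0
    return round(100.0 * (with_nlp - structured_only) / with_nlp, 1)


contributions = {
    name: nlp_contribution_pct(rwd, rwd_nlp) for name, (rwd, rwd_nlp) in counts.items()
}

for name, pct in contributions.items():
    print(f"{name}: {pct}%")
```

Under this assumption, the variables included here agree with the reported row after rounding. The formula remains an inference from the table rather than a documented method.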
2023 ASCO Annual Meeting
First Author: Smita Agrawal