Using EHR data and machine learning approach to facilitate the identification of patients with lung cancer from a pan-cancer cohort.

Authors

Yue Yu

Mayo Clinic, Rochester, MN

Yue Yu, Kathryn Jean Ruddy, Konstantinos Leventakos, Bolun Liu, Nan Huo, Deirdre R. Pachman, Nansu Zong, Guohui Xiao, Christopher Chute, Emily Pfaff, Andrea L. Cheville, Guoqian Jiang

Organizations

Mayo Clinic, Rochester, MN; University of Bergen, Bergen, Norway; Johns Hopkins University, Baltimore, MD; University of North Carolina at Chapel Hill, Chapel Hill, NC

Research Funding

U.S. National Institutes of Health

Background: Real-world data from electronic health records (EHRs) are widely used to identify patients and build study cohorts for clinical research. Traditionally, diagnosis codes in the EHR, such as International Classification of Diseases (ICD) codes, are used to identify the target patients. However, the accuracy of this approach depends on the accuracy of ICD coding, and errors are especially likely for tumor types arising at sites that are frequent locations for metastases (which may contribute to mis-coding). In this study, we attempted to develop a machine learning (ML)-based approach using EHR data to improve the accuracy of identifying patients with lung cancer.

Methods: We used survey respondents in the Enhanced, EHR-facilitated Cancer Symptom Control (E2C2, NCT03892967) cluster-randomized trial at Mayo Clinic as our initial pan-cancer cohort. E2C2 includes adults receiving Medical Hematology/Oncology care for a solid or liquid tumor at Mayo Clinic. We collected cancer diagnoses from the individually abstracted Mayo Clinic Cancer Registry to annotate each patient's cancer type. Lung cancer-related ICD-9 (162.X) and ICD-10 (C34.X) codes were used to build a search query against the Mayo Clinic EHR to find target patients in the E2C2 cohort and to evaluate the performance of ICD-based lung cancer patient identification. Diagnosis, radiation oncology treatment (CPT 77261-77799), and antineoplastic drug administration data were collected from the EHR as model variables. Logistic regression (LR), support vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGB) were used to build models for lung cancer patient identification. The models were assessed with 10-fold cross-validation, and performance was measured with precision, recall, F1 score, and area under the curve (AUC).

Results: We collected 13,893 patients with a specific cancer diagnosis, of whom 1,394 were identified as having lung cancer. The identification performance of each method is shown in the table below. The ICD-based method had a precision of only 0.65, meaning that about 35% of the patients it flagged were false positives (patients with another cancer but not lung cancer), consistent with the mis-coding concern raised in the Background. SVM achieved the highest precision (0.94), but its recall (0.49) and F1 score (0.64) were low. XGB achieved the highest F1 score (0.91) and AUC (0.99), indicating the best and most balanced performance.

Conclusions: In this study, XGB-based methods achieved the best identification performance for lung cancer. In future work, we will investigate whether this also holds for identifying other cancer types.
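The ICD-based baseline described in the Methods can be illustrated with a minimal sketch. It is not the authors' query: the patient records, the `has_lung_code` and `precision_recall_f1` helpers, and the regular expression are all hypothetical, and only the code families (ICD-9 162.X and ICD-10 C34.X) come from the abstract. The toy data show how a patient with another primary cancer who also carries a lung cancer code becomes a false positive and lowers precision.

```python
import re

# Code families named in the abstract: ICD-9 162.X and ICD-10 C34.X.
# The exact pattern is an assumption made for this illustration.
LUNG_CODE = re.compile(r"(?:162|C34)(?:\.[0-9A-Z]+)?", re.IGNORECASE)


def has_lung_code(codes):
    """Return True if any diagnosis code falls in the lung cancer families."""
    return any(LUNG_CODE.fullmatch(code.strip()) for code in codes)


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 from two parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical patients: (EHR diagnosis codes, registry-confirmed cancer type).
patients = [
    (["C34.11"], "lung"),               # true positive
    (["162.9", "C50.911"], "breast"),   # false positive: lung code, breast primary
    (["C18.9"], "colon"),               # true negative
    (["C34.90"], "lung"),               # true positive
    (["C80.1"], "lung"),                # false negative: no lung code recorded
    (["C34.31", "C61"], "prostate"),    # false positive: lung code, prostate primary
]

predicted = [has_lung_code(codes) for codes, _ in patients]
actual = [cancer_type == "lung" for _, cancer_type in patients]

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the registry label plays the role of ground truth, mirroring the abstract's use of the Mayo Clinic Cancer Registry to annotate cancer type.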

Method  Precision  Recall  F1 Score  AUC
ICD     0.65       0.96*   0.77      NA
LR      0.51       0.76    0.60      0.83
SVM     0.94*      0.49    0.64      0.97
RF      0.93       0.67    0.78      0.98
XGB     0.90       0.92    0.91*     0.99*

* Highest value in the column.
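The model comparison under 10-fold cross-validation described in the Methods could be set up roughly as follows. This is a sketch under stated assumptions, not the study's pipeline: the feature matrix is synthetic (scikit-learn's `make_classification` with about 10% positives), the folds are stratified and shuffled (the abstract only says "10-fold"), all hyperparameters are library defaults, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs only one library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the EHR-derived variables (assumption): 600 patients,
# 20 features, roughly 10% belonging to the positive (lung cancer) class.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for XGBoost, which the abstract actually used.
    "GB": GradientBoostingClassifier(random_state=0),
}

# Stratification and shuffling are choices made for this sketch.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["precision", "recall", "f1", "roc_auc"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the 10 held-out folds.
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 2) for m, v in metrics.items()})
```

Scores on synthetic data say nothing about the reported results; the sketch only shows how the four metrics in the table can be averaged over held-out folds.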

Disclaimer

This material on this page is ©2024 American Society of Clinical Oncology, all rights reserved. Licensing available upon request. For more information, please contact licensing@asco.org

Abstract Details

Meeting

2023 ASCO Annual Meeting

Session Type

Publication Only

Session Title

Publication Only: Care Delivery and Regulatory Policy

Track

Care Delivery and Quality Care

Sub Track

Clinical Informatics/Advanced Algorithms/Machine Learning

Citation

J Clin Oncol 41, 2023 (suppl 16; abstr e13552)

DOI

10.1200/JCO.2023.41.16_suppl.e13552

Abstract #

e13552
