Using machine learning on real-world data to predict metastatic status.

Authors

Foad H. Green, Hu T. Huang, Michelle Lerman, Mary Tran, Vinod Subramanian, Joshua Loving, Matthew J. Rioth

Organizations

Syapse, San Francisco, CA

Research Funding

No funding received

Background: Real-world data (RWD) is increasingly used to inform research, patient care, and population health in oncology; however, using RWD at scale requires accurate methods to identify clinically relevant attributes. Metastatic status is a highly relevant clinical attribute in cancer patients, but it is not routinely captured in structured formats, and its determination conventionally requires review and interpretation by certified tumor registrars (CTRs). Clinical diagnoses, treatments, imaging procedures, and other clinical variables documented in electronic health records (EHRs) can be used to differentiate metastatic from non-metastatic patients. This study describes a machine learning approach that uses prevalent, standardized data elements from EHRs across multiple health systems.
Methods: We analyzed 28,043 lung cancer and breast cancer patients from two large health systems within the Syapse Learning Health Network, with data sourced from CTR abstraction and EHRs. Patients were labeled for reference metastatic status by CTRs and split into training (n = 22,434) and testing (n = 5,609) cohorts, with proportionate distribution of cancer type and metastatic status between cohorts. XGBoost, a regularized gradient boosting algorithm, was trained using over 750 variables from patient records collected at or after the initial cancer diagnosis.
Results: Integrating ICD-10-CM codes with antineoplastic treatment history and radiologic imaging procedure orders increased the precision and recall of metastatic status prediction in lung cancer (by 21 and 32 percentage points, respectively) and in breast cancer (by 39 and 9 percentage points, respectively), compared with using only ICD-10-CM diagnosis codes for secondary malignant neoplasms (Table). Adding treatment and procedure data from different cancer types improved model classification within individual cancer types.
Conclusions: One of the biggest challenges in using RWD for precision oncology is identifying clinically relevant phenotypes at scale. Here we demonstrate a scalable, evidence-based method that uses structured data from two separate health systems to impute metastatic status with high predictive power. With further validation, this approach may be generalized to other cancer types, applied to temporal slices of data to identify changes in metastatic status, and used to provide a high-confidence designation of metastatic status for other use cases such as staging.
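
The cohort split described in Methods (proportionate distribution of cancer type and metastatic status between training and testing cohorts) can be illustrated with a stratified split. The sketch below is a minimal standard-library illustration and not the authors' pipeline: the record fields, the `stratified_split` helper, the synthetic cohort, and the 20% test fraction are all assumptions made for the example (the reported cohort sizes, 22,434 and 5,609, do correspond to roughly an 80/20 split).

```python
import random
from collections import defaultdict


def stratified_split(records, strata_key, test_fraction=0.2, seed=0):
    """Split records so every stratum keeps about the same share in both cohorts.

    Hypothetical helper for illustration; not taken from the abstract.
    """
    rng = random.Random(seed)

    # Group records by stratum, e.g. (cancer type, metastatic status).
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[strata_key(rec)].append(rec)

    train, test = [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        # Take the same fraction of each stratum for the testing cohort.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Synthetic cohort (assumed fields, no real patient data).
cancer_types = ["lung", "breast"]
records = [
    {"patient_id": i, "cancer_type": cancer_types[i % 2], "metastatic": i % 5 == 0}
    for i in range(1000)
]

train, test = stratified_split(
    records, strata_key=lambda r: (r["cancer_type"], r["metastatic"])
)


def metastatic_share(cohort):
    """Fraction of patients in a cohort labeled metastatic."""
    return sum(r["metastatic"] for r in cohort) / len(cohort)


print(len(train), len(test))
print(metastatic_share(train), metastatic_share(test))
```

Because sampling happens within each stratum, the share of metastatic patients and the mix of cancer types stay the same in both cohorts, which keeps precision and recall on the testing cohort comparable to what the model saw during training.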

**Model performance metrics.**
	Precision	Recall
Lung and bronchus (ICD-10-CM only)	0.67	0.50
Lung and bronchus (Predictive model)	0.88	0.82
Breast (ICD-10-CM only)	0.56	0.82
Breast (Predictive model)	0.95	0.91
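
As a reading aid for the table, the sketch below shows how precision and recall are computed from labels and predictions, and recomputes the gains of the predictive model over the ICD-10-CM-only baseline as differences between the reported values. The `precision_recall` and `point_gains` helpers and the toy labels are assumptions made for illustration; only the four rows of reported values come from the table.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (hypothetical): 1 = metastatic, 0 = non-metastatic.
toy_precision, toy_recall = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])

# (precision, recall) pairs as reported in the table.
reported = {
    "Lung and bronchus": {"icd_only": (0.67, 0.50), "model": (0.88, 0.82)},
    "Breast": {"icd_only": (0.56, 0.82), "model": (0.95, 0.91)},
}


def point_gains(baseline, model):
    """Absolute gain in percentage points for each metric."""
    return tuple(round((m - b) * 100) for b, m in zip(baseline, model))


gains = {
    site: point_gains(values["icd_only"], values["model"])
    for site, values in reported.items()
}
print(gains)  # {'Lung and bronchus': (21, 32), 'Breast': (39, 9)}
```

The recomputed gains match the figures quoted in Results, which indicates that those figures are absolute differences in precision and recall rather than relative improvements.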

Abstract Details

Meeting

2022 ASCO Annual Meeting

Session Type

Poster Session

Session Title

Care Delivery and Regulatory Policy

Track

Care Delivery and Quality Care

Sub Track

Clinical Informatics/Advanced Algorithms/Machine Learning

Citation

J Clin Oncol 40, 2022 (suppl 16; abstr 1550)

DOI

10.1200/JCO.2022.40.16_suppl.1550

Abstract #

1550

Poster Bd #

143
