Association of smoking history extracted from electronic health records (EHR) using machine-learning methods and tumor characteristics in patients with lung cancer.

Authors

Semanti Mukherjee

Memorial Sloan Kettering Cancer Center, New York, NY

Semanti Mukherjee, Andrew Schroeder, Subrata Chatterjee, John Cadley, Christina J. Falcon, Miika Mehine, Chaitanya Bandlamudi, Avijit Chatterjee, Marc Ladanyi, David B. Solit, Michael F. Berger, Zsofia Kinga Stadler, Mark G. Kris, David Randolph Jones, Adam Jacob Schoenfeld, Fernanda C. G. Polubriaginof, Nikolaus Schultz, Jonine L. Bernstein, Charles M. Rudin, Kenneth Offit

Organizations

Memorial Sloan Kettering Cancer Center, New York, NY

Research Funding

U.S. National Institutes of Health

Background: Though smoking is a major risk factor for lung cancer, it has been a challenge to collect patients’ smoking history accurately from the EHR because of data inconsistency and incompleteness. To address these challenges, we used a weak supervision methodology to automatically annotate the smoking status of patients with lung cancer and correlated it with tumor characteristics. Methods: We analyzed 6,355 patients with lung cancer who underwent tumor profiling with MSK-IMPACT. In total, 14,555 unstructured clinical notes were extracted from the EHR at Memorial Sloan Kettering Cancer Center. The weak supervision methodology used a generative model to produce intermediate labels, which were subsequently tuned by a machine-learning classifier to generate the final labels. Clinical notes from a randomly sampled set of 564 patients were manually curated and used for performance assessment. The remaining patients were split into training and validation datasets used for model training and hyperparameter tuning. Pack-years were also extracted from clinical notes using natural language processing. We next conducted multivariate analyses, separately for primary and metastatic tumor samples, to correlate smoking metrics with tumor characteristics, including tumor mutation burden (TMB) and chromosomal instability as inferred by the fraction of genome altered (FGA), after controlling for age at sequencing, gender, histological subtype, ancestry, coverage, and tumor purity. Results: The weak supervision classifier had almost perfect performance for the 2-label classification model (ever smokers and never smokers), with macro F1-score: 97.7%; balanced accuracy: 97.1%, 97.1%; precision: 98.4%, 98.4%; and recall: 99.5%, 94.6%, respectively. For the 3-label classification model (never smoker, former smoker, and current smoker), the macro F1-score was 79.8%; balanced accuracy: 97.1%, 86.7%, 71.2%; precision: 93.9%, 90.1%, 61.7%; and recall: 96.1%, 93.3%, 46.0%, respectively.
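As a rough illustration of the weak supervision and evaluation steps described above, the following Python sketch labels notes with a few rule-based labeling functions and scores predictions against manually curated labels. It is not the authors' pipeline: the labeling functions and their regular expressions are hypothetical, a simple majority vote stands in for the generative label model, the downstream classifier is omitted, and balanced accuracy is computed with the standard definition (mean per-class recall), which may differ from how the per-label values above were derived.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions; the patterns are illustrative assumptions.
def lf_never(note):
    pattern = r"\b(never smoke[dr]|non-?smoker|denies smoking)\b"
    return "never" if re.search(pattern, note, re.I) else ABSTAIN

def lf_ever(note):
    pattern = r"\b(former smoker|current smoker|quit smoking|ex-smoker)\b"
    return "ever" if re.search(pattern, note, re.I) else ABSTAIN

def lf_pack_years(note):
    pattern = r"\b\d+(\.\d+)?\s*-?\s*pack[\s-]*years?\b"
    return "ever" if re.search(pattern, note, re.I) else ABSTAIN

LABELING_FUNCTIONS = [lf_never, lf_ever, lf_pack_years]

def weak_label(note, lfs=LABELING_FUNCTIONS):
    """Majority vote over non-abstaining labeling functions.

    This is a simple stand-in for a generative label model; ties and
    notes with no votes yield ABSTAIN.
    """
    votes = [v for v in (lf(note) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

def per_class_metrics(y_true, y_pred, labels):
    """Return precision, recall, and F1 for each label."""
    out = {}
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = {"precision": precision, "recall": recall, "f1": f1}
    return out

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    m = per_class_metrics(y_true, y_pred, labels)
    return sum(v["f1"] for v in m.values()) / len(labels)

def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall."""
    m = per_class_metrics(y_true, y_pred, labels)
    return sum(v["recall"] for v in m.values()) / len(labels)
```

In a setup like this, the macro average weights each class equally, so a weak minority class (such as current smokers in the 3-label model) pulls the macro F1-score down even when the majority classes are predicted well.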
Analyzing genomic data, we observed that smoking status (smoker vs. never smoker) and pack-years were associated with TMB in both primary and metastatic tumor samples (p<2e-16). FGA was marginally associated with smoking status (smokers vs. never smokers) in primary tumor samples (p=0.06). Among smokers diagnosed with lung adenocarcinoma, significantly higher FGA in primary tumor samples was observed in males compared to females after adjusting for pack-years and other variables (p=3.3e-3). Conclusions: We demonstrated high performance of our approach for automated curation of smoking history from the EHR. The genomic results confirmed distinct mutational patterns associated with smoking behavior in patients with lung cancer. We are currently exploring multimodal approaches that include chest CT images and “time of quitting” to improve the performance of the 3-class model.
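Pack-years, one of the smoking metrics used in the analysis above, are conventionally defined as packs smoked per day multiplied by years smoked. The abstract does not describe how they were extracted from notes, so the following Python sketch is only an assumption about what a simple rule-based extractor could look like; the regular expressions, the function name `extract_pack_years`, and the sample phrasings are all hypothetical.

```python
import re

# Direct mentions such as "30 pack-years", "0.5 pack years", "20 pk yrs".
PACK_YEAR_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?:-\s*)?(?:pack|pk)[\s-]*(?:years?|yrs?)\b",
    re.IGNORECASE,
)

# Indirect mentions such as "1.5 ppd for 20 years" or
# "2 packs per day for 10 years"; pack-years = packs per day * years.
PPD_YEARS_RE = re.compile(
    r"(?P<ppd>\d+(?:\.\d+)?)\s*(?:ppd|packs?\s*(?:per|a|/)\s*day)\s*"
    r"(?:for|x)\s*(?P<years>\d+(?:\.\d+)?)\s*(?:years?|yrs?)\b",
    re.IGNORECASE,
)

def extract_pack_years(note):
    """Return pack-years found in a note as a float, or None if absent.

    A direct pack-year mention takes precedence; otherwise the value is
    derived from a packs-per-day and duration mention.
    """
    match = PACK_YEAR_RE.search(note)
    if match:
        return float(match.group("value"))
    match = PPD_YEARS_RE.search(note)
    if match:
        return float(match.group("ppd")) * float(match.group("years"))
    return None
```

A rule-based extractor like this would miss ranges ("20 to 30 pack-years"), negated mentions, and conflicting values across notes, which is one reason a real pipeline would likely need validation against manually curated records.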

Disclaimer

This material on this page is ©2024 American Society of Clinical Oncology, all rights reserved. Licensing available upon request. For more information, please contact licensing@asco.org

Abstract Details

Meeting

2023 ASCO Annual Meeting

Session Type

Poster Session

Session Title

Care Delivery and Regulatory Policy

Track

Care Delivery and Quality Care

Sub Track

Clinical Informatics/Advanced Algorithms/Machine Learning

Citation

J Clin Oncol 41, 2023 (suppl 16; abstr 1559)

DOI

10.1200/JCO.2023.41.16_suppl.1559

Abstract #

1559

Poster Bd #

153
