Guidance on modeling circulating miRNA to distinguish multiple cancer types by an observation of large-scale open data.

Authors

Jason Chia-Hsun Hsieh

Chang Gung Memorial Hospital, Linkuo, Taoyuan City, Taoyuan County, Taiwan

Jason Chia-Hsun Hsieh, Tsung-Ting Hsieh, Ko-Han Lee, Yu-Chuan Chang

Organizations

Chang Gung Memorial Hospital, Linkuo, Taoyuan City, Taoyuan County, Taiwan; Pharus Diagnostics (Pharus, Inc.), Zhubei, Taiwan

Research Funding

Pharmaceutical/Biotech Company

Pharus Diagnostics (Pharus, Inc.)

Background: Cell-free miRNAs (cf-miRNAs), which circulate in body fluids such as plasma and serum, have shown the ability to detect, diagnose, and monitor cancers. Combining machine learning (ML) with these biomarkers facilitates early cancer detection, which increases the accuracy of clinical decisions and empowers people to take control of their health. However, cf-miRNA data have characteristics that affect ML results. This study therefore describes these characteristics from several aspects and uses them to build a reasonable model.

Methods: We downloaded large-scale datasets generated on the GPL21263 platform from the Gene Expression Omnibus for modeling experiments. From different cf-miRNA-based cancer studies, we curated 8,174 subjects with 2,565 miRNA targets across 7 cancer types. We used principal component analysis (PCA) to explore the datasets, recursive feature elimination (RFE) for feature selection, and tree-based algorithms to build the prediction model.

Results: The characteristics of cf-miRNA data that we would like to share are as follows.
(1) Cancer subjects express more cf-miRNAs than control subjects. In the control group, there were 294 and 327 miRNAs with missing rates under 50% and 25%, respectively. In contrast, there were 395 and 485 miRNAs at the same thresholds in the cancer group.
(2) Dividing subjects into cancer and control groups is simpler than distinguishing specific cancer types. In the PCA, the average Euclidean distance between the control group and each cancer type is 98.48, while it is 20.23 within each cancer type.
(3) To obtain cancer-specific biomarkers, we suggest treating subjects with other, non-target cancers as negative controls. We modeled the 7 cancer types and compared the proportion of cancer-specific biomarkers, defined as those not selected by any other model. The proportion increased from 30.0% to 57.8% after we added non-target cancer subjects to the control group.
Next, we focused on multi-cancer modeling.
(4) At least 400 samples are needed to distinguish seven cancer types. In our experiment, we increased the size of the training data by one hundred samples at a time. Accuracy increased as data were added but plateaued after 400 samples.
(5) Based on RFE, 120 miRNAs is a reasonable number for distinguishing seven cancers. Moreover, we found that some of these miRNAs are expressed only in cancer subjects. Such biomarkers might be lost if miRNAs were filtered out by missing rate.
(6) The 10-fold cross-validation accuracy of the multi-cancer model reached 93.0% using the gradient-boosted trees algorithm.

Conclusions: In this study, we provide guidance for modeling miRNAs from different aspects, including labeling strategy, sample and feature sizes, and the high-accuracy multi-cancer model that can be achieved. We hope this guidance will inspire researchers working on cf-miRNA-related machine learning applications.
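Findings (1), (3), and (5) rest on two simple operations: computing a per-miRNA missing rate within a group, and building labels in which non-target cancer subjects serve as negatives. The following is a minimal sketch of both, not the authors' code. The toy profiles, the miRNA names (`miR-a`, `miR-b`, `miR-c`), the diagnosis labels (`cancer_A`, `cancer_B`, `cancer_C`, `control`), and all function names are hypothetical, and a missing measurement is assumed to be encoded as `None`.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: one dict per subject, miRNA name -> value,
# with None marking a missing (undetected) measurement.
Profile = Dict[str, Optional[float]]


def missing_rate(profiles: List[Profile], mirna: str) -> float:
    """Fraction of subjects in which `mirna` has no measured value."""
    missing = sum(1 for p in profiles if p.get(mirna) is None)
    return missing / len(profiles)


def mirnas_under_threshold(
    profiles: List[Profile], mirnas: List[str], threshold: float
) -> List[str]:
    """miRNAs whose missing rate is strictly below `threshold`."""
    return [m for m in mirnas if missing_rate(profiles, m) < threshold]


def one_vs_rest_labels(
    diagnoses: List[str],
    target: str,
    include_other_cancers: bool = True,
    control_label: str = "control",
) -> List[Tuple[int, int]]:
    """Return (subject index, label) pairs for a single-cancer model.

    Label 1 marks the target cancer and label 0 marks a negative.
    With include_other_cancers=True, subjects with non-target cancers join
    the non-cancer controls as negatives; otherwise they are left out.
    """
    pairs = []
    for i, diagnosis in enumerate(diagnoses):
        if diagnosis == target:
            pairs.append((i, 1))
        elif diagnosis == control_label or include_other_cancers:
            pairs.append((i, 0))
    return pairs


# Toy data (made up for illustration only).
mirnas = ["miR-a", "miR-b", "miR-c"]
cancer = [
    {"miR-a": 5.1, "miR-b": 2.0, "miR-c": 7.3},
    {"miR-a": 4.8, "miR-b": None, "miR-c": 6.9},
    {"miR-a": 5.5, "miR-b": 1.7, "miR-c": None},
    {"miR-a": 5.0, "miR-b": 2.2, "miR-c": 7.0},
]
control = [
    {"miR-a": 3.0, "miR-b": None, "miR-c": None},
    {"miR-a": 3.2, "miR-b": None, "miR-c": 4.1},
    {"miR-a": None, "miR-b": None, "miR-c": 4.0},
    {"miR-a": 2.9, "miR-b": 0.9, "miR-c": None},
]

# Per-group counts, in the spirit of finding (1).
kept_cancer = mirnas_under_threshold(cancer, mirnas, 0.5)
kept_control = mirnas_under_threshold(control, mirnas, 0.5)

# Filtering on the pooled cohort can drop a miRNA that is mostly measured
# in cancer subjects only, which is the risk raised in finding (5).
kept_pooled = mirnas_under_threshold(cancer + control, mirnas, 0.5)

# Labeling strategy of finding (3) for a hypothetical target "cancer_A".
diagnoses = ["cancer_A", "cancer_B", "control", "cancer_A", "cancer_C"]
labels_with_others = one_vs_rest_labels(diagnoses, "cancer_A")
labels_controls_only = one_vs_rest_labels(
    diagnoses, "cancer_A", include_other_cancers=False
)
```

In this toy cohort, `miR-b` passes the 50% threshold within the cancer group but fails it in the pooled cohort, so a pooled missing-rate filter would discard it before feature selection. Whether a per-group or pooled filter is appropriate for a real dataset is a design choice that the abstract does not specify.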

Abstract Details

Meeting

2023 ASCO Annual Meeting

Session Type

Publication Only

Session Title

Publication Only: Care Delivery and Regulatory Policy

Track

Care Delivery and Quality Care

Sub Track

Clinical Informatics/Advanced Algorithms/Machine Learning

Citation

J Clin Oncol 41, 2023 (suppl 16; abstr e13537)

DOI

10.1200/JCO.2023.41.16_suppl.e13537

Abstract #

e13537
