Can synthetic data accurately mimic oncology clinical trials?

Authors

Samer El Kababji

CHEO Research Institute, Ottawa, ON, Canada

Samer El Kababji, Nicholas Mitsakakis, Xi Fang, Ana-Alicia Beltran-Bless, Gregory Russell Pond, Lisa Vandermeer, Dhenuka Radhakrishnan, Lucy Mosquera, Mark J. Clemons, Khaled El Emam

Organizations

CHEO Research Institute, Ottawa, ON, Canada; Replica Analytics, Ottawa, ON, Canada; Division of Medical Oncology, Department of Medicine, University of Ottawa, Ottawa, ON, Canada; McMaster University, Hamilton, ON, Canada; Ottawa Hospital Research Institute, Ottawa, ON, Canada; Department of Pediatrics, University of Ottawa, Ottawa, ON, Canada; Division of Medical Oncology, Department of Medicine, The Ottawa Hospital and University of Ottawa, Ottawa, ON, Canada; School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada

Research Funding

Institutional Funding
CHEO Research Institute

Background: There is strong interest among researchers, the pharmaceutical industry, medical journal editors, research funders, and regulators in sharing clinical trial data. Reusing data extracts the most utility possible from patient contributions, and most patients want their data shared for secondary research purposes. However, access to data for secondary analysis remains a challenge. A key reason why authors and data custodians do not make individual-level data directly available to data users is concern over breaches of patient privacy. Synthetic data generation (SDG) is an effective way to address privacy concerns and can enable broader sharing of clinical trial datasets. A key question, however, is whether analyses of the generated data reproduce the original results well enough to support reliable conclusions.

Methods: We synthesized datasets from five pragmatic breast cancer clinical trials performed by the REaCT group (https://react.ohri.ca/), using sequential synthesis, a type of machine learning method. The published analysis of each trial was repeated on each synthetic dataset to evaluate reproducibility. We evaluated reproducibility on three criteria: (a) decision agreement: the direction and statistical significance of the primary endpoint effect estimates are the same as in the real data; (b) estimate agreement: the parameter estimates from the synthetic data fall within the 95% confidence interval of the real data; and (c) confidence interval overlap: the overlap between the real and synthetic confidence intervals is above 50%. In addition, we evaluated privacy using a membership disclosure metric. This metric evaluates the ability of an adversary to use the synthetic data to determine that a target individual was in the original dataset, and it is reported as an F1 classification score.

Results: Decision agreement and estimate agreement held across all five trials, and the confidence interval overlap was high. The membership disclosure risks were all below the established threshold of 0.2.

Conclusions: In this study, we successfully generated synthetic datasets that accurately replicated the original data from five oncology trials and yielded the same results as the original published studies, with a very low risk of membership disclosure. With proper modeling techniques, synthetic datasets can play a key role in data democratization and the reuse of oncology clinical trials.
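The three reproducibility criteria in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the `Estimate` class, the function names, and the `null_value` parameter are hypothetical, and the overlap formula (the shared length averaged over the widths of the two intervals) is one common definition assumed here because the abstract does not state which formula was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A point estimate with its 95% confidence interval (hypothetical container)."""
    value: float
    lower: float
    upper: float


def is_significant(e: Estimate, null_value: float = 0.0) -> bool:
    # Significant when the interval excludes the null value
    # (0 for differences, 1 for ratios such as hazard ratios).
    return not (e.lower <= null_value <= e.upper)


def direction(e: Estimate, null_value: float = 0.0) -> int:
    # -1, 0, or +1 depending on which side of the null the estimate falls.
    return (e.value > null_value) - (e.value < null_value)


def decision_agreement(real: Estimate, synth: Estimate, null_value: float = 0.0) -> bool:
    # (a) Same direction and same significance call as the real data.
    same_direction = direction(real, null_value) == direction(synth, null_value)
    same_significance = is_significant(real, null_value) == is_significant(synth, null_value)
    return same_direction and same_significance


def estimate_agreement(real: Estimate, synth: Estimate) -> bool:
    # (b) Synthetic point estimate lies inside the real 95% interval.
    return real.lower <= synth.value <= real.upper


def ci_overlap(real: Estimate, synth: Estimate) -> float:
    # (c) Assumed definition: shared length divided by each interval's width,
    # averaged over the two intervals; disjoint intervals score 0.
    shared = min(real.upper, synth.upper) - max(real.lower, synth.lower)
    if shared <= 0:
        return 0.0
    real_width = real.upper - real.lower
    synth_width = synth.upper - synth.lower
    return 0.5 * (shared / real_width + shared / synth_width)


# Made-up numbers for illustration only; they are not taken from any REaCT trial.
real = Estimate(value=0.50, lower=0.10, upper=0.90)
synth = Estimate(value=0.45, lower=0.05, upper=0.85)
print(decision_agreement(real, synth), estimate_agreement(real, synth), ci_overlap(real, synth))
```

Under these assumptions, a synthetic analysis would meet criterion (c) when `ci_overlap` returns a value above 0.5.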

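| Trial Name | NCT Number | Decision Agreement | Estimate Agreement | CI Overlap | Membership Disclosure* |
|---|---|---|---|---|---|
| REaCT-G_G2 | NCT02428114 & NCT02816164 | Y | Y | 94% | 0.06 |
| REaCT-HER2+ | NCT02632435 | Y | Y | 70% | -0.11 |
| REaCT-ILIAD | NCT02861859 | Y | Y | 97% | 0.02 |
| REaCT-ZOL | NCT03664687 | N/A** | 81%*** | 73%*** | 0.02 |
| REaCT-BTA | NCT02721433 | Y | Y | 79% | 0.18 |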

* A commonly used threshold for acceptable membership disclosure is an F1 score below 0.2; the reported value can be negative.
** Not applicable because the published analysis was descriptive only.
*** This is the average across all descriptive values.
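As a rough illustration of how a membership disclosure score could be compared with the 0.2 threshold, the following Python sketch computes an F1 score for an adversary's membership guesses. The abstract does not give the formula. Because the reported values can be negative, the sketch assumes the score is expressed relative to a naive adversary who flags every target as a member; that normalization, along with every function and parameter name below, is an assumption rather than the authors' stated method.

```python
from typing import Sequence


def f1_score(is_member: Sequence[bool], flagged: Sequence[bool]) -> float:
    """F1 of the adversary's guesses against true membership in the original data."""
    if len(is_member) != len(flagged):
        raise ValueError("inputs must have the same length")
    tp = sum(1 for m, f in zip(is_member, flagged) if m and f)
    fp = sum(1 for m, f in zip(is_member, flagged) if not m and f)
    fn = sum(1 for m, f in zip(is_member, flagged) if m and not f)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_f1(f1: float, member_rate: float) -> float:
    """Assumed normalization: F1 relative to guessing 'member' for everyone.

    The naive adversary has recall 1 and precision equal to member_rate,
    so its F1 is 2 * member_rate / (1 + member_rate).
    """
    if not 0.0 <= member_rate < 1.0:
        raise ValueError("member_rate must be in [0, 1)")
    naive_f1 = 2 * member_rate / (1 + member_rate)
    return (f1 - naive_f1) / (1 - naive_f1)


def is_acceptable(score: float, threshold: float = 0.2) -> bool:
    # The footnote gives 0.2 as a commonly used upper limit.
    return score < threshold


# Toy attack on four targets, two of whom were truly in the original dataset.
truth = [True, True, False, False]
guesses = [True, False, True, False]
score = relative_f1(f1_score(truth, guesses), member_rate=0.5)
print(score, is_acceptable(score))
```

In this toy case the adversary does worse than the naive baseline, so the relative score is negative, which is consistent with a negative value appearing in the table.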

Abstract Details

Meeting

2023 ASCO Annual Meeting

Session Type

Poster Session

Session Title

Care Delivery and Regulatory Policy

Track

Care Delivery and Quality Care

Sub Track

Clinical Informatics/Advanced Algorithms/Machine Learning

Citation

J Clin Oncol 41, 2023 (suppl 16; abstr 1554)

DOI

10.1200/JCO.2023.41.16_suppl.1554

Abstract #

1554

Poster Bd #

148
