Performance of a trained large language model to provide clinical trial recommendation in a head and neck cancer population.

Authors


Tony Hung

Memorial Sloan Kettering Cancer Center, New York, NY

Tony Hung, Gilad Kuperman, Eric Jeffrey Sherman, Alan Loh Ho, Winston Wong, Anuja Kriplani, Lara Dunn, James Vincent Fetten, Loren S. Michel, Shrujal S. Baxi, Chunhua Weng, David G. Pfister, Jun J. Mao

Organizations

Memorial Sloan Kettering Cancer Center, New York, NY, Columbia University, New York, NY, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY

Research Funding

Memorial Sloan Kettering Cancer Center (MSK) Support Grant (P30-CA008748)

Background: Chatbots based on large language models (LLMs) have demonstrated the ability to answer oncology exam questions; however, LLMs applied to medical decision support have not yet demonstrated suitable performance in oncology practice. We evaluated the performance of a trained LLM, GPT-4, in recommending appropriate clinical trials for a head and neck (HN) cancer population. Methods: In 2022, we developed an artificial intelligence-powered clinical trial management mobile app, LookUpTrials, and demonstrated promising user engagement among oncologists. Using the LookUpTrials database, we applied direct preference optimization to train GPT-4 as an in-app assistant for LookUpTrials. From Nov 7 to Dec 19, 2023, we collected consecutive new patient cases and their respective clinical trial recommendations from oncologists in the HN medical oncology service at Memorial Sloan Kettering Cancer Center. Cases were categorized by diagnosis, cancer stage, treatment setting, and physician recommendation on clinical trials. The trained GPT-4 was prompted using a semi-structured template: “Given patient with a <diagnosis>, <cancer stage>, <treatment setting>, what are possible clinical trials?” Physician recommendations were compared with the trained GPT-4 responses. We analyzed the performance of GPT-4 based on its response precision (positive predictive value), recall (sensitivity), and F1 score (harmonic mean of precision and recall). Results: We analyzed 178 patient cases, mean age 65.6 (SD 13.9), primarily male (75%) with local/locally advanced (68%) HN (61%), thyroid (16%), skin (9%), or salivary (8%) cancers. The majority were treated in the definitive setting with combined modality therapy (42%), and a modest proportion were treated under clinical trials (10%).
Overall, the trained GPT-4 achieved moderate performance in matching physician clinical trial recommendations, with 63% precision and 100% recall (F1 score 0.77), narrowing a total list of 56 HN clinical trials to a range of 0-4 relevant trials per patient case (mean 1, SD 1.2). Comparatively, the performance of our trained GPT-4 exceeded the historic performance of untrained LLMs in providing oncology treatment recommendations by 4- to 20-fold (F1 score 0.04-0.19). Conclusions: This proof-of-concept study demonstrated that a trained LLM can achieve moderate performance in matching physician clinical trial recommendations in HN oncology. Our results suggest the potential of embedding trained LLMs into oncology workflows to aid clinical trial search and accelerate clinical trial accrual. Future research is needed to optimize the precision of trained LLMs and to assess whether they may be a scalable solution to enhance the diversity and equity of clinical trial participation.
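The evaluation metrics used above (precision, recall, and their harmonic mean, the F1 score) can be sketched as follows. This is a minimal illustration of the standard definitions, not the study's actual evaluation code; the function name and the example trial sets are hypothetical.

```python
def precision_recall_f1(recommended, relevant):
    """Compute precision, recall, and F1 for one patient case.

    recommended: set of trial IDs suggested by the model
    relevant:    set of trial IDs recommended by the physician
    """
    true_positives = len(recommended & relevant)
    precision = true_positives / len(recommended) if recommended else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical case: model suggests two trials, physician recommends one of them.
p, r, f1 = precision_recall_f1({"TRIAL-A", "TRIAL-B"}, {"TRIAL-A"})
```

Note that the reported aggregate figures are internally consistent: with 63% precision and 100% recall, the harmonic mean is 2 x 0.63 x 1.0 / (0.63 + 1.0), which is approximately 0.77, matching the stated F1 score.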

Disclaimer

The material on this page is ©2024 American Society of Clinical Oncology, all rights reserved. Licensing available upon request. For more information, please contact licensing@asco.org

Abstract Details

Meeting

2024 ASCO Annual Meeting

Session Type

Poster Session

Session Title

Quality Care/Health Services Research

Track

Care Delivery and Quality Care

Sub Track

Health Services Research

Citation

J Clin Oncol 42, 2024 (suppl 16; abstr 11081)

DOI

10.1200/JCO.2024.42.16_suppl.11081

Abstract #

11081

Poster Bd #

276
