Open Access

Large Language Models as Decision-support Tools for Adjuvant Therapy Planning in Early-stage Hormone Receptor–positive Breast Cancer

BERKAN KARABUĞA 1
MUSTAFA BÜYÜKKÖR 1
EKIN KONCA KARABUĞA 2
ERGIN AYDEMIR 1
OSMAN BILGE KAYA 1
MEHMET EMIN YILMAZ 1
ENES KAPTAN 3
ÖZTÜRK ATEŞ 1
  &  
FATIH YILDIZ 1

1Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, Ankara, Türkiye

2Department of Medical Oncology, Ankara Etlik City Hospital, Ankara, Türkiye

3Ankara Gölbaşı State Hospital, Department of Internal Medicine, Ankara, Türkiye

Cancer Diagnosis & Prognosis May-Jun; 6(3): 548-558 DOI: 10.21873/cdp.10555
Received 12 January 2026 | Revised 06 February 2026 | Accepted 11 February 2026
Corresponding author
Berkan Karabuğa, Department of Medical Oncology, Dr. Abdurrahman Yurtaslan Ankara Oncology Research and Training Hospital, Ankara, Türkiye. Tel: +90 5378563905, e-mail: berkan.karabuga@saglik.gov.tr

Abstract

Background/Aim
Adjuvant treatment decisions in hormone receptor–positive (HR+), HER2-negative early-stage breast cancer are frequently guided by multigene assays; however, limited access to genomic testing remains a significant challenge, particularly in resource-limited settings. This study aimed to evaluate the concordance between adjuvant treatment recommendations generated by large language models (ChatGPT-4o and ChatGPT-o3) and those of an experienced medical oncologist in HR+/HER2− early-stage breast cancer patients when genomic assay results were unavailable.
Patients and Methods
Clinical and pathological data from 411 patients with HR+/HER2− early-stage breast cancer were provided to ChatGPT-4o and ChatGPT-o3. Both models generated adjuvant treatment recommendations, either chemotherapy plus endocrine therapy (CT+ET) or endocrine therapy alone (ET), based on ESMO and NCCN guidelines. These recommendations were compared with those of a medical oncologist. Agreement was assessed using Fleiss’s and Cohen’s kappa statistics, and differences among evaluators were analyzed using Cochran’s Q test.
Results
Overall agreement among the clinician and the two models was substantial (κ=0.67). Moderate agreement was observed between the clinician and ChatGPT-4o (κ=0.60) and between the clinician and ChatGPT-o3 (κ=0.55). Agreement between the two language models was almost perfect (κ=0.88). ChatGPT-4o demonstrated closer alignment with clinical judgment.
Conclusion
Large language models showed substantial concordance with clinician decision-making in adjuvant therapy planning for HR+/HER2− early-stage breast cancer in the absence of genomic testing. These findings suggest that such models may serve as supportive decision-making tools rather than independent decision-makers, particularly in settings with limited access to multigene assays.
Keywords: Artificial intelligence, breast cancer, chemotherapy, ChatGPT, endocrine therapy

Introduction

Breast cancer is the most common type of cancer among women, and numerous scientific studies and clinical guidelines have been developed to guide its management. Thanks to global screening programs and awareness campaigns, early detection and treatment of breast cancer have become possible. Hormone receptor (HR)-positive, HER2-negative luminal-type breast cancer constitutes the majority of these cases. Despite advancements in disease management, there is still no clear consensus regarding adjuvant treatment in early-stage disease (2). When planning adjuvant therapy for these patients, several factors play a critical role, including tumor (T) stage, patient age, menopausal status, Ki-67 index, presence of lymphovascular invasion (LVI) and perineural invasion (PNI), tumor grade, and recurrence scores obtained from multigene assays (3). Multigene assays such as Oncotype DX influence a broad range of decisions in breast cancer care, from adjuvant treatment selection in HR-positive, HER2-negative patients to the identification of HER2-low disease (4). Unfortunately, access to reliable tools such as multigene assays remains limited in many parts of the world, leaving many clinicians to make adjuvant treatment decisions for these patients without genomic guidance.

In this context, artificial intelligence (AI)-based language models are emerging as potentially valuable and accessible tools in clinical practice (5). ChatGPT, developed by OpenAI, is an AI model capable of providing human-like responses to user queries by accessing online sources. One of the most commonly used versions is ChatGPT-4o (6). Large language models (LLMs) based on the GPT architecture stand out from traditional AI models due to their versatility, human-like language comprehension, and broad knowledge base (7). Although some studies in the literature have compared ChatGPT-4o with the previously high-performing model ChatGPT-3.5, no study to date has compared ChatGPT-4o and ChatGPT-o3 across a large patient population (8). While the role of AI in cancer management has been increasingly explored in the era of personalized oncology, there are limited data on its utility in managing early-stage luminal-type breast cancer. In our study, we aimed to investigate whether ChatGPT-4o and ChatGPT-o3 can be used as decision-support tools for adjuvant treatment decisions in one of the most clinically challenging subgroups: HR-positive, HER2-negative, early-stage breast cancer patients without the guidance of multigene assay results. We also sought to evaluate the concordance between the decisions of these two models and those of clinicians.

Patients and Methods

Study design and patient selection. This study included 411 patients who were diagnosed with early-stage breast carcinoma (HR-positive, HER2-negative, pT1b-T1c-T2 N0M0) and were treated and followed at Dr. Abdurrahman Yurtaslan Ankara Oncology Training and Research Hospital between 01/01/2020 and 01/01/2024. Only patients with complete clinical data and no history of next-generation genomic testing were included. Inclusion criteria were: availability of complete data on diagnosis, staging, treatment history, and follow-up; age 18 years or older; and absence of any concurrent active malignancy. Patients with incomplete data were excluded from the study. Informed consent was obtained from all participants prior to their inclusion in the study.

Patient assessment form and AI evaluation. A standardized patient assessment form was developed, including parameters such as histopathological subtype, stage, tumor grade, CerbB2 score, estrogen receptor (ER), progesterone receptor (PR), Ki-67 index, lymphovascular invasion (LVI) and perineural invasion (PNI) status, ECOG performance status, menopausal status, and comorbidities. Patients included in this study had previously been evaluated by medical oncology specialists, each with a minimum of 10 years of clinical experience, and had either completed or were still receiving their adjuvant treatment. Information regarding the administered adjuvant therapies was retrospectively obtained from patient records. The histopathological and demographic characteristics of the patients were then provided to the ChatGPT-o3 and ChatGPT-4o models through the patient assessment form, and the models were asked to recommend an adjuvant treatment option as if making the clinical decision based on the NCCN and ESMO guidelines. Adjuvant treatment recommendations were categorized by investigators as either chemotherapy plus endocrine therapy (CT+ET) or endocrine therapy (ET) alone.

The AI-based language models used in this study were the pro versions of ChatGPT-4o and ChatGPT-o3. To avoid attenuation bias, no prior training was provided to the models, and no prompt engineering techniques were employed. For each patient, ChatGPT was prompted with the following standardized instruction: “Using the current ESMO and NCCN guidelines, and considering the clinical features provided in the patient assessment form, choose the most appropriate adjuvant treatment option for the patient: either chemotherapy plus endocrine therapy or endocrine therapy alone.” Although it was recognized that using Turkish commands might affect the models’ performance in accessing English-language sources, the prompts were intentionally given in Turkish to evaluate the natural language processing capabilities of the models. The Turkish form of the prompt was the following: “Güncel ESMO ve NCCN kılavuzlarını kullanarak ve hasta değerlendirme formunda sunulan klinik özellikleri dikkate alarak, hasta için en uygun adjuvan tedavi seçeneğini belirle: kemoterapi + endokrin tedavi veya yalnız endokrin tedavi.” Clinician decisions were defined as the adjuvant treatment choices that were actually administered to patients in routine clinical practice, and these decisions were used as the reference standard for comparison with treatment recommendations generated by ChatGPT-4o and ChatGPT-o3.

GPT-4o and o3 models were accessed via the OpenAI interface with the following fixed parameters: temperature=0, max_tokens=512, and the default system role (general assistant, with no customized instructions). A temperature value of 0 was used to ensure deterministic behavior, meaning identical inputs generated identical outputs across repeated evaluations. This configuration was chosen to maintain the consistency and reproducibility of AI-based treatment recommendations.
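As an illustration only (not the authors' code), the fixed-parameter setup described above could be assembled as follows; the `build_request` helper and the model identifier are hypothetical names introduced for this sketch.

```python
# Hypothetical sketch of the fixed-parameter request described in the text.
# build_request and the "gpt-4o" identifier are illustrative assumptions.

PROMPT = (
    "Using the current ESMO and NCCN guidelines, and considering the "
    "clinical features provided in the patient assessment form, choose "
    "the most appropriate adjuvant treatment option for the patient: "
    "either chemotherapy plus endocrine therapy or endocrine therapy alone."
)

def build_request(patient_form: str, model: str = "gpt-4o") -> dict:
    """Assemble one deterministic request per patient."""
    return {
        "model": model,
        "messages": [
            # Default system role (general assistant): no system message added.
            {"role": "user", "content": f"{PROMPT}\n\n{patient_form}"},
        ],
        "temperature": 0,   # identical inputs yield identical outputs
        "max_tokens": 512,
    }

# The request would then be sent through the OpenAI client, e.g.:
# client = OpenAI()
# resp = client.chat.completions.create(**build_request(form_text))
# recommendation = resp.choices[0].message.content
```

Keeping the prompt and parameters in one place guarantees that every patient form is evaluated under the identical configuration reported in the study.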

Statistical analysis. Statistical analyses were performed using SPSS Version 27.0. Descriptive statistics were used to summarize patients’ demographic characteristics as well as the pathological and immunohistochemical features of the tumors. The normality of continuous variables was assessed using the Kolmogorov-Smirnov and Shapiro-Wilk tests, along with evaluations of skewness and kurtosis. The concordance between the decisions of the clinician, ChatGPT-4o, and ChatGPT-o3 was assessed using Fleiss’s Kappa and Cohen’s Kappa (κ) tests, interpreted as follows: κ<0.00=poor agreement, κ=0.00-0.20=slight agreement, κ=0.21-0.40=fair agreement, κ=0.41-0.60=moderate agreement, κ=0.61-0.80=substantial agreement, and κ=0.81-1.00=almost perfect agreement. Statistical differences among the decisions of the clinician, ChatGPT-4o, and ChatGPT-o3 were evaluated using Cochran’s Q test.
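For two raters and two treatment categories, Cohen's κ and the interpretation bands listed above can be reproduced in a few lines of generic code; this is an illustration under those definitions, not the SPSS procedure used in the study.

```python
# Generic Cohen's kappa for two raters, with the agreement bands from the text.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa over the same cases rated by two evaluators."""
    n = len(ratings_a)
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

def interpret(kappa):
    """Map kappa onto the agreement bands used in the text."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
```

With clinician and model decisions encoded as parallel lists of "CT+ET"/"ET" labels, `interpret(cohens_kappa(clinician, model))` returns the qualitative band directly, e.g. "substantial" for κ=0.67.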

Logistic regression analysis was performed to compare patients with concordant chemotherapy plus endocrine therapy (CT+ET) recommendations by both the clinician and ChatGPT-4o with those who had concordant ET-only recommendations by both evaluators. The dependent variable was concordant CT+ET recommendation (yes/no), and odds ratios reflect clinicopathologic factors associated with concordant CT+ET versus concordant ET-only decisions. A p-value of <0.05 was considered statistically significant.
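As a hedged illustration of the univariate odds ratios underlying such an analysis, an OR and its Wald 95% CI can be computed from a 2×2 cross-tabulation of a risk factor against the concordant CT+ET outcome; the counts below are invented for the example, and the study's estimates came from SPSS regression models.

```python
# Illustrative univariate odds ratio with Wald 95% CI from a 2x2 table.
# Cell counts here are hypothetical, not the study's data.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2x2 table:
         a = factor present, concordant CT+ET    b = factor present, concordant ET
         c = factor absent,  concordant CT+ET    d = factor absent,  concordant ET
    Returns (OR, (lower, upper)) using the log-OR standard error."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, (lower, upper)
```

A multivariable model would adjust each factor for the others (e.g. via maximum-likelihood logistic regression), which is why the univariate OR for age <50 can lose significance after adjustment, as reported in the Results.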

Statement of ethics. For this study, single-center ethical approval was obtained from the Non-Interventional Clinical Research Ethics Committee of the University of Health Sciences, Dr. Abdurrahman Yurtaslan Ankara Oncology Training and Research Hospital, under the approval number 2024-09/123.

Results

A total of 411 patients were included in the study, with a mean age of 55.3±11.1 years. The clinical, histopathological, and demographic characteristics of the overall cohort, as well as those of patients for whom both the clinician and ChatGPT-4o recommended CT+ET or ET alone, are summarized in Table I.

Patients for whom CT+ET was recommended concordantly by both the clinician and ChatGPT-4o demonstrated significantly different clinicopathological features compared with those concordantly recommended ET alone. Specifically, patients in the CT+ET group were more likely to be younger than 50 years, premenopausal, have fewer than two comorbidities, T2-stage disease, grade 3 tumors, a CerbB2 score of 2, positive lymphovascular invasion (LVI), and a Ki-67 index greater than 25% (all p<0.001).

In univariate regression analyses, LVI positivity, Ki-67 status, age, menopausal status, tumor stage, comorbidity burden, tumor grade, and CerbB2 status were all significantly associated with the recommendation of CT+ET. In multivariate analysis, age under 50 years lost statistical significance, while other variables remained independently associated with treatment decisions (Figure 1, Table II).

Among all evaluated factors, a Ki-67 index greater than 25% exerted the strongest influence on the recommendation of CT (OR=73.2, p<0.001, 95% CI=24.0-223.6), followed by high tumor grade (OR=24.3, p<0.001, 95% CI=6.8-87.1) and premenopausal status (OR=11.1, p=0.001, 95% CI=2.6-47.6). The clinical and pathological characteristics of patients for whom CT+ET was recommended by the clinician but ET alone by ChatGPT-4o are presented in Table III.

Most of these patients were over 50 years of age (73.1%) and postmenopausal (71.6%), with predominantly grade 2 tumors (74.6%), a CerbB2 score of 0 (64.2%), low rates of LVI positivity (11.9%), and a Ki-67 index ≤25% in the majority of cases (88.1%). Overall agreement among the clinician, ChatGPT-4o, and ChatGPT-o3 was substantial, with a Fleiss’s κ value of 0.67 (p<0.001; 95% CI=0.62-0.73). Concordant treatment recommendations among all three evaluators were observed in 85.2% of CT+ET decisions and 81.7% of ET decisions (p<0.001). Cochran’s Q test demonstrated a statistically significant difference among the three decision sources (p<0.001); however, no significant difference was observed between the treatment recommendations generated by ChatGPT-4o and ChatGPT-o3 (p=0.058) (Table IV).

Pairwise agreement analyses revealed a moderate and statistically significant concordance between clinician decisions and those generated by ChatGPT-4o (κ=0.60, p<0.001) and ChatGPT-o3 (κ=0.55, p<0.001). The distribution of treatment recommendations across evaluators is illustrated in Figure 2, demonstrating that the clinician more frequently favored CT+ET compared with both AI models, while ChatGPT-o3 showed a greater tendency toward ET alone. Among the two LLMs, ChatGPT-4o demonstrated closer alignment with clinician decision-making patterns (Figure 3).

Confusion matrices including absolute numbers and corresponding percentages were constructed to compare adjuvant treatment recommendations between the clinician and each language model (Table V). For ChatGPT-4o, concordant CT+ET and ET decisions accounted for 50.1% and 30.4% of the cohort, respectively, while discordant recommendations were observed in the remaining cases. A similar distribution was observed for ChatGPT-o3. In addition, a three-way concordance table summarizing agreement patterns among the clinician, ChatGPT-4o, and ChatGPT-o3 demonstrated complete agreement in 75.9% of patients (Table VI). Detailed distributions of concordant and discordant decision patterns are presented in the respective tables.

Discussion

Although multigene assay methods are considered the most reliable tools for guiding adjuvant treatment decisions in patients with HR-positive luminal-type breast cancer, access to these tests remains limited in many countries. As a result, clinicians are often required to make adjuvant treatment decisions for early-stage HR-positive, HER2-negative breast cancer patients by integrating tumor- and patient-specific factors without the support of genomic tools. In daily clinical practice, there is therefore a clear need for fast, affordable, and reliable decision-support tools. In this context, the present study aimed to investigate whether ChatGPT, an easily accessible AI-based language model, could serve as a supportive tool in adjuvant treatment planning for this patient population.

In our analysis, although a substantial concordance was observed between the adjuvant treatment decisions of a medical oncology specialist and those generated by ChatGPT-4o and ChatGPT-o3, statistically significant differences persisted between clinician and AI-based recommendations. In contrast, no significant difference was observed between the decisions of ChatGPT-4o and ChatGPT-o3, and their agreement was almost perfect. Notably, ChatGPT-4o demonstrated closer alignment with clinician decisions, suggesting that more advanced model architectures may better approximate clinical reasoning patterns.

The AI models employed in the study. In this study, we evaluated two of the most commonly used versions of ChatGPT, ChatGPT-o3 and ChatGPT-4o, in order to compare the most widely used model with one previously reported to perform well in scientific reasoning. In a study by Rao et al. focusing on breast cancer, ChatGPT-4 was shown to outperform ChatGPT-3.5 in radiological breast cancer diagnosis (9). Similar findings have been reported in other tumor types. Studies involving sarcoma and renal cancer patients demonstrated that ChatGPT-4 provided more consistent and reliable treatment recommendations than ChatGPT-3.5 (8, 10). Alsaudi et al. demonstrated that ChatGPT-o3 outperformed GPT-3.5 in clinical accuracy and decision support performance, while Naliyatthaliyazchayil et al. reported superior clinical reasoning and risk stratification performance of ChatGPT-o3 compared with GPT-4o (11, 12). However, direct head-to-head comparisons between GPT-4o and GPT-o3 under identical clinical scenarios remain scarce. Consequently, conclusions regarding the most robust model rely mainly on indirect inference. Large-scale, standardized, and population-based comparative studies are needed to reliably identify the optimal model for clinical decision support.

Assessments of AI in different tumor types. Several studies have explored the applicability of AI-based models in oncology across different tumor types. Kuş et al. evaluated the consistency of ChatGPT-4o with clinician decisions and NCCN/ESMO guidelines in stage II colon cancer and reported moderate agreement between ChatGPT-4o and clinicians (κ=0.47, p<0.001) (13). In that study, the AI model was pre-trained before evaluation. In contrast, no pre-training was applied in our study in order to avoid attenuation bias and to reflect real-world, non-optimized clinical use. Despite this methodological difference, statistically significant discrepancies between clinician and AI decisions were still observed, although the level of agreement in our study was numerically higher. This difference may be attributable to tumor type–specific decision complexity and the larger patient cohort included in our analysis. In another study by Lechien et al. on head and neck cancers, ChatGPT-4o was reported to have limited ability in making critical decisions compared to the tumor board (14). Similarly, in a study by Zabaleta et al. on non-small cell lung cancer, AI was considered a useful supportive tool for tumor boards but not suitable as a stand-alone decision-making system (15). These findings across different malignancies are consistent with our results and reinforce the supportive rather than autonomous role of AI in oncology.

Assessments of AI in breast cancer. The importance of AI in the diagnosis and treatment of breast cancer is increasing steadily, with applications ranging from initial diagnosis to surgical decision-making and adjuvant treatment management (16, 17). Several studies have specifically evaluated AI-based tools in breast cancer management. Nabieva et al. compared ChatGPT-4o recommendations with the 18th St. Gallen International Consensus Conference and reported high consistency in some domains, but lower agreement in adjuvant endocrine therapy decisions where consensus is limited (18). To the best of our knowledge, there are no published studies evaluating GPT-o3 as a clinical decision-support tool for breast cancer treatment, with most prior work focusing on GPT-3.5. Lukac et al. demonstrated that ChatGPT-3.5 could serve as a supportive tool when compared with multidisciplinary tumor board decisions in early-stage breast cancer (19). Similarly, Stalp et al. reported overall consistency between ChatGPT-3.5 and tumor board decisions despite minor discrepancies in CT and ET recommendations (20). Sorin et al. observed concordant decisions in 7 of 10 early-stage breast cancer cases when comparing ChatGPT-3.5 with tumor board recommendations (21). Considering the available literature and our findings, our study appears to be the first to directly assess the performance of GPT-4o and OpenAI o3 as clinical support mechanisms in breast cancer. Taken together, these results support the inference that GPT-based models may serve as reliable clinical decision-support tools in this setting.

Evaluation of subgroup analyses. Subgroup analyses demonstrated that both clinicians and ChatGPT models prioritized well-established high-risk clinicopathological features when recommending chemotherapy. CT+ET recommendations were more frequent in patients with higher tumor grade, elevated Ki-67 index, lymphovascular invasion positivity, premenopausal status, higher tumor stage, and lower comorbidity burden. Among these factors, Ki-67 >25%, high tumor grade, and premenopausal status emerged as the most influential determinants of chemotherapy recommendation.

These findings are consistent with previous literature demonstrating the prognostic and predictive value of these parameters in early-stage breast cancer. Lymphovascular invasion has been associated with poorer overall and disease-free survival (22). A study reported that LVI positivity influences adjuvant treatment decisions alongside stage, age, and tumor histopathology (23). In a study involving triple-negative and HER2-positive early-stage breast cancer, patients with a lower comorbidity burden were significantly more likely to receive adjuvant CT (24). Additionally, patients with higher CerbB2 scores and premenopausal status have been shown to derive greater benefit from chemotherapy in selected settings (25, 26). Although the prognostic role of Ki-67 remains debated, multiple studies support its value as a marker of recurrence risk in HR-positive/HER2-negative disease (27-29). Tumor grade and T stage are also well-established risk indicators guiding adjuvant treatment decisions (30).

Discordance between clinician and ChatGPT-4o recommendations was mainly observed in patients with intermediate-risk profiles, including postmenopausal status, intermediate tumor grade, and lower Ki-67 values. This suggests that ChatGPT relied more strictly on objective risk factors, whereas clinicians may have incorporated additional contextual considerations such as clinical experience, patient preference, and precautionary reasoning. Improving the interpretability and transparency of AI-based decision-support systems through explainable outputs may help bridge this gap and enhance clinician trust.

Study limitations. First, the retrospective, single-center design may limit the generalizability of the findings. Second, the prompts entered into ChatGPT were written in Turkish to evaluate the model’s natural language understanding in a non-English context. However, since ChatGPT primarily relies on English-language sources, this language difference may have slightly affected its performance. Third, due to the immaturity of the current dataset and the insufficient follow-up duration, survival analyses could not be performed. Nevertheless, once the data reach maturity, our results will be updated and shared again. Another limitation is that ChatGPT’s recommendations were based on the most recent versions of the ESMO and NCCN guidelines, whereas clinician decisions reflected the guideline updates and treatment standards valid at the time of patient management (2020-2024). Finally, ethical, privacy, and liability considerations related to the use of large language models in oncology decision-making should be acknowledged, emphasizing the need for expert oversight and strict data protection measures when integrating AI tools into clinical workflows.

Conclusion

In our study, a substantial and statistically significant level of concordance was observed between the treatment decisions of the clinician and those of ChatGPT-4o and ChatGPT-o3 in HR-positive/HER2-negative, early-stage breast cancer patients, for whom there is no clear consensus regarding adjuvant therapy in the absence of genomic profiling. Despite this overall consistency, statistically significant differences were also identified between the clinician’s decisions and those of both ChatGPT models. Pairwise analyses revealed no significant difference between the decisions of ChatGPT-4o and ChatGPT-o3, and ChatGPT-4o was found to be more numerically aligned with the clinician’s decisions.

Our findings suggest that while AI may not yet be suitable as an independent decision-maker in oncology, it shows potential as a decision-support tool under expert supervision. Further multicenter, prospective studies with longer follow-up periods are needed to evaluate the safe and effective integration of AI into oncology practice.

Conflicts of Interest

The Authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Authors’ Contributions

B.K: Writing, Statistical analysis; M.B: Conceptualization; E.K.K: Data curation; E.A: Figures, Data curation; O.B.K: Tables; M.E.Y: Review; E.K: Visualization; İ.D.O: Review; Ö.A: Editing; F.Y: Supervision.

Funding

This study did not receive any financial support or funding.

Artificial Intelligence (AI) Disclosure

Large language models (ChatGPT-4o and ChatGPT-o3) were used as decision-support tools to generate adjuvant treatment recommendations based on predefined clinical and pathological data, in accordance with current ESMO and NCCN guidelines. The AI models did not have access to patient-identifiable information and did not participate in data collection, data interpretation, or final clinical decision-making. All treatment decisions used for comparison were independently made by an experienced medical oncologist, who retained full responsibility for clinical judgment and patient care.

References

1 Stabellini N Cao L Towe CW Amin AL & Montero AJ Estimating the overall survival benefit of adjuvant chemo-endocrine therapy in women over age 50 with pT1-2N0 early stage breast cancer and 21-gene recurrence score ≥26: A National Cancer Database analysis. Cancer Med. 12(19) 19607 - 19616 2023. DOI: 10.1002/cam4.6584
2 Liu D Chang L Hao Q Ren X Liu P Liu X Wei Y Wang M Wu H Kang H & Lin S Is neoadjuvant chemotherapy necessary for T2N0-1M0 hormone receptor-positive/HER2-negative breast cancer patients undergoing breast-conserving surgery. J Cancer Res Clin Oncol. 150(5) 285 2024. DOI: 10.1007/s00432-024-05810-6
3 Sparano JA Gray RJ Makower DF Albain KS Saphner TJ Badve SS Wagner LI Kaklamani VG Keane MM Gomez HL Reddy PS Goggins TF Mayer IA Toppmeyer DL Brufsky AM Goetz MP Berenberg JL Mahalcioiu C Desbiens C Hayes DF Dees EC Geyer CE Jr Olson JA Jr Wood WC Lively T Paik S Ellis MJ Abrams J & Sledge GW Jr Clinical outcomes in early breast cancer with a high 21-gene recurrence score of 26 to 100 assigned to adjuvant chemotherapy plus endocrine therapy: a secondary analysis of the TAILORx randomized clinical trial. JAMA Oncol. 6(3) 367 - 374 2020. DOI: 10.1001/jamaoncol.2019.4794
4 Douganiotis G Kontovinis L Zarampoukas T Natsiopoulos I & Papazisis K Association of oncotype-DX HER2 single gene score with HER2 expression assessed by immunohistochemistry in HER2-low breast cancer. Cancer Diagn Progn. 4(5) 605 - 610 2024. DOI: 10.21873/cdp.10370
5 Lee J Yoon W Kim S Kim D Kim S So CH & Kang J BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 36(4) 1234 - 1240 2020. DOI: 10.1093/bioinformatics/btz682
6 Haemmerli J Sveikata L Nouri A May A Egervari K Freyschlag C Lobrinus JA Migliorini D Momjian S Sanda N Schaller K Tran S Yeung J & Bijlenga P ChatGPT in glioma adjuvant therapy decision making: ready to assume the role of a doctor in the tumour board. BMJ Health Care Inform. 30(1) e100775 2023. DOI: 10.1136/bmjhci-2023-100775
7 Nori H King N McKinney SM Carignan D & Horvitz E Capabilities of gpt-4 on medical challenge problems. arXiv. 230313375 2023. DOI: 10.48550/arXiv.2303.13375
8 Liang R Zhao A Peng L Xu X Zhong J Wu F Yi F Zhang S Wu S & Hou J Enhanced artificial intelligence strategies in renal oncology: iterative optimization and comparative analysis of GPT 3.5 versus 4.0. Ann Surg Oncol. 31(6) 3887 - 3893 2024. DOI: 10.1245/s10434-024-15107-0
9 Rao A Kim J Kamineni M Pang M Lie W Dreyer KJ & Succi MD Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J Am Coll Radiol. 20(10) 990 - 997 2023. DOI: 10.1016/j.jacr.2023.05.003
10 Li CP Jakob J Menge F Reißfelder C Hohenberger P & Yang C Comparing ChatGPT-3.5 and ChatGPT-4’s alignments with the German evidence-based S3 guideline for adult soft tissue sarcoma. iScience. 27(12) 111493 2024. DOI: 10.1016/j.isci.2024.111493
11 Alsaudi EM Shilbayeh SA & Abu-Farha RK Benchmarking ChatGPT-3.5 and OpenAI o3 against clinical pharmacists: preliminary insights into clinical accuracy, sensitivity, and specificity in pharmacy MCQs. Healthcare (Basel). 13(14) 1751 2025. DOI: 10.3390/healthcare13141751
12 Naliyatthaliyazchayil P Muthyala R Gichoya JW & Purkayastha S Evaluating the reasoning capabilities of large language models for medical coding and hospital readmission risk stratification: zero-shot prompting approach. J Med Internet Res. 27 e74142 2025. DOI: 10.2196/74142
13 Kus F Chalabiyev E Yildirim HC Koc Kus I Sirvan F Dizdar O & Yalcin S Artificial intelligence (ChatGPT-4o) in adjuvant treatment decision-making for stage II colon cancer: a comparative analysis with clinician recommendations and NCCN/ESMO guidelines. Int J Hematol Oncol. 35(1) 68 - 74 2025. DOI: 10.4999/uhod.258149
14 Lechien JR Chiesa-Estomba C Baudouin R & Hans S Accuracy of ChatGPT in head and neck oncological board decisions: preliminary findings. Eur Arch Otorhinolaryngol. 281(4) 2105 - 2114 2024. DOI: 10.1007/s00405-023-08326-w
15 Zabaleta J Aguinagalde B Lopez I Fernandez-Monge A Lizarbe JA Mainer M Ferrer-Bonsoms JA & de Assas M Utility of artificial intelligence for decision making in thoracic multidisciplinary tumor boards. J Clin Med. 14(2) 399 2025. DOI: 10.3390/jcm14020399
16 Cossu M Cuniolo L Diaz R Oliva M Cornacchia C Murelli F Depaoli F Gipponi M Margarino C Boccardo C Franchelli S Pesce M Allievi R D’Agraives AC Abdallah S & Fregatti P BREAST AI-PLAN: Prompt-driven AI assistance for breast surgery planning - a retrospective single-center study. In Vivo. 39(6) 3271 - 3277 2025. DOI: 10.21873/invivo.14126
17 Park JH Lim JH Kim S & Heo J A multi-label artificial intelligence approach for improving breast cancer detection with mammographic image analysis. In Vivo. 38(6) 2864 - 2872 2024. DOI: 10.21873/invivo.13767
18 Nabieva N Brucker SY & Gmeiner B ChatGPT’s agreement with the recommendations from the 18th St. Gallen International consensus conference on the treatment of early breast cancer. Cancers (Basel). 16(24) 4163 2024. DOI: 10.3390/cancers16244163
19 Lukac S Dayan D Fink V Leinert E Hartkopf A Veselinovic K Janni W Rack B Pfister K Heitmeir B & Ebner F Evaluating ChatGPT as an adjunct for the multidisciplinary tumor board decision-making in primary breast cancer cases. Arch Gynecol Obstet. 308(6) 1831 - 1844 2023. DOI: 10.1007/s00404-023-07130-5
20 Stalp JL Denecke A Jentschke M Hillemanns P & Klapdor R Quality of ChatGPT-generated therapy recommendations for breast cancer treatment in gynecology. Curr Oncol. 31(7) 3845 - 3854 2024. DOI: 10.3390/curroncol31070284
21 Sorin V Klang E Sklair-Levy M Cohen I Zippel DB Balint Lahat N Konen E & Barash Y Large language model (ChatGPT) as a support tool for breast tumor board. NPJ Breast Cancer. 9(1) 44 2023. DOI: 10.1038/s41523-023-00557-8
22 Houvenaeghel G Cohen M Classe JM Reyal F Mazouni C Chopin N Martinez A Daraï E Coutant C Colombo PE Gimbergues P Chauvet MP Azuar AS Rouzier R Tunon de Lara C Muracciole X Agostini A Bannier M Charaffe Jauffret E De Nonneville A & Goncalves A Lymphovascular invasion has a significant prognostic impact in patients with early breast cancer, results from a large, national, multicenter, retrospective cohort study. ESMO Open. 6(6) 100316 2021. DOI: 10.1016/j.esmoop.2021.100316
23 Kuhn E Gambini D Despini L Asnaghi D Runza L & Ferrero S Updates on lymphovascular invasion in breast cancer. Biomedicines. 11(3) 968 2023. DOI: 10.3390/biomedicines11030968
24 Tamirisa N Dong W Shen Y Lin H Shaitelman SF Babiera G & Bedrosian I Sequence of therapy impact on older women with comorbidities and triple-negative or HER2-positive breast cancer. NPJ Breast Cancer. 11(1) 21 2025. DOI: 10.1038/s41523-025-00732-z
25 Roy AM Jiang C Perimbeti S Deng L Shapiro CL & Gandhi S Oncotype Dx score, HER2 low expression, and clinical outcomes in early-stage breast cancer: a National Cancer Database analysis. Cancers (Basel). 15(17) 4264 2023. DOI: 10.3390/cancers15174264
26 Jacobson A Benefits of adjuvant chemotherapy differ by menopausal status in women with HR+/HER2- early breast cancer, 1-3 positive nodes, and a low recurrence score. Oncologist. 27(Suppl 1) S15 - S16 2022. DOI: 10.1093/oncolo/oyac012
27 Criscitiello C Disalvatore D De Laurentiis M Gelao L Fumagalli L Locatelli M Bagnardi V Rotmensz N Esposito A Minchella I De Placido S Santangelo M Viale G Goldhirsch A & Curigliano G High Ki-67 score is indicative of a greater benefit from adjuvant chemotherapy when added to endocrine therapy in Luminal B HER2 negative and node-positive breast cancer. Breast. 23(1) 69 - 75 2014. DOI: 10.1016/j.breast.2013.11.007
28 Patel R Hovstadius M Kier MW Moshier EL Zimmerman BS Cascetta K Jaffer S Sparano JA & Tiersten A Correlation of the Ki67 Working Group prognostic risk categories with the Oncotype DX Recurrence Score in early breast cancer. Cancer. 128(20) 3602 - 3609 2022. DOI: 10.1002/cncr.34426
29 Louis DM Nair LM Vallonthaiel AG Narmadha MP & Vijaykumar DK Ki 67: a promising prognostic marker in early breast cancer-a review article. Indian J Surg Oncol. 14(1) 122 - 127 2023. DOI: 10.1007/s13193-022-01631-6
30 Walsh EM Smith KL & Stearns V Management of hormone receptor-positive, HER2-negative early breast cancer. Semin Oncol. 47(4) 187 - 200 2020. DOI: 10.1053/j.seminoncol.2020.05.010