ChatGPT-5 Matches Surgeon-Level Assessment of Facelift Candidacy: A Pilot Proof-of-Concept Study.

Saeed, AT.; Breakey, RWF.; Saleh, DB.; Tiryaki, KT.; Mayou, BJ.; Saeed, TM.

All Authors

Saeed, AT.

Breakey, RWF.

Saleh, DB.

Tiryaki, KT.

Mayou, BJ.

Saeed, TM.

LTHT Author

Breakey, William

LTHT Department

Plastic & Reconstructive Surgery
Cleft Department

Publication Date

2026

Item Type

Journal Article

Subject

ARTIFICIAL INTELLIGENCE , COSMETIC TECHNIQUES , AGEING

Abstract

BACKGROUND: The use of multimodal artificial intelligence (AI) in plastic surgery is steadily increasing. Whether a general-purpose multimodal AI tool can, from photographs alone, assess facial aging and facelift candidacy at a level comparable to board-certified specialist plastic surgeons remains unknown.

OBJECTIVES: To determine if ChatGPT-5 (OpenAI, San Francisco, CA, USA) can identify facial aging features, stratify severity, and judge facelift candidacy from photographs alone, compared with board-certified plastic surgeons.

METHODS: Two-center observational pilot. Twenty-two volunteers (mean age 42.0 ± 16.8 years; median 34 years; range 24-80) provided standardized four-view facial composite photographs. Five board-certified plastic surgeons independently completed an eight-item questionnaire per case. ChatGPT-5 assessed the same images with identical wording. Assessments were image-only and blinded (no demographics/history). Surgeon consensus was defined by plurality. Primary outcomes were agreement and Cohen's kappa; for ordinal items, weighted kappa, Spearman's rho, and mean absolute error (MAE) were reported. McNemar's test assessed discordance for binary items.

RESULTS: For facelift candidacy, agreement was 95.5% (21/22; Cohen's kappa = 0.91; McNemar P = 1.00). For binary aging features, agreement ranged from 81.8 to 90.9% (kappa = 0.61 to 0.81). For ordinal severity (lower face and midface), exact agreement was 77.3%, disagreements were adjacent only, weighted kappa = 0.74 to 0.86, Spearman's rho = 0.84 (P < .001). Inter-surgeon agreement on ordinal items was moderate to fair. For the adjunct-procedure recommendation, Top-1 accuracy was 70.6% (12/17; kappa = 0.58) and Top-2 agreement was 77.3% (17/22).

CONCLUSIONS: In a blinded, standardized-photograph setting, ChatGPT-5 matched surgeons on binary facelift candidacy assessment and closely tracked severity grading with small, one-level differences at most. These findings may support use as a decision-support tool (triage, patient education) while surgeons retain hands-on examination and personalized planning. Larger, multicenter studies with more diverse image datasets are warranted to confirm generalizability and define deployment standards.

LEVEL OF EVIDENCE IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.
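Illustrative note on the agreement statistics: the abstract reports 21/22 agreement, Cohen's kappa = 0.91, and McNemar P = 1.00 for facelift candidacy, but it does not give the underlying 2x2 table. The sketch below assumes a hypothetical table (10 cases rated "candidate" by both, 11 rated "not a candidate" by both, and 1 discordant case) chosen only because it is consistent with those reported figures; other tables would be too. The function names are illustrative, and the exact binomial form of McNemar's test is an assumption, since the abstract does not say which variant was used.

```python
from math import comb


def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa for two raters on a binary item.

    a = both yes, b = rater 1 yes / rater 2 no,
    c = rater 1 no / rater 2 yes, d = both no.
    """
    n = a + b + c + d
    p_obs = (a + d) / n  # observed agreement
    # Chance agreement from the product of the raters' marginal rates
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)


def mcnemar_exact_p(b, c):
    """Two-sided exact (binomial) McNemar P value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical counts (NOT from the study): surgeon consensus vs. ChatGPT-5
a, b, c, d = 10, 1, 0, 11

agreement = (a + d) / (a + b + c + d)
kappa = cohen_kappa_2x2(a, b, c, d)
p_value = mcnemar_exact_p(b, c)

print(f"agreement = {agreement:.3f}, kappa = {kappa:.2f}, McNemar P = {p_value:.2f}")
```

With a single discordant case, the exact McNemar P value is 1.00 whichever direction the disagreement runs. The test therefore shows no evidence of systematic bias in one direction, but it does not by itself measure agreement.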