Podium IA - AI
A PSYCHOMETRIC VALIDATION FRAMEWORK FOR AI FEEDBACK SYSTEMS IN SURGICAL SIMULATION
Mohamed S Baloul, MD1, Oviya A Giri, MBBS1, Aashna Mehta, MD1, Daniel Cui2, Jonathan D'Angelo, PhD, MA1; 1Mayo Clinic, 2University of Wisconsin
Background: Artificial intelligence (AI) systems increasingly generate performance feedback for surgical trainees. While existing literature focuses on classification accuracy (novice vs. expert identification), no framework measures whether feedback adapts consistently to performance changes or accurately reflects performance level. We sought to develop a psychometric framework that measures AI feedback quality and identifies assessment design factors predicting AI performance through two complementary metrics: the Feedback Robustness Index (consistency in detecting performance changes) and Calibration (correlation between feedback sentiment and performance scores).
Methods: Data from 5 surgical simulation assessments with rubric designs ranging systematically from minimal (e.g., time/accuracy only) to detailed performance descriptors were used to generate feedback with a locally hosted open-source LLM (Qwen3 4B, Alibaba Cloud). For each assessment, 5 controlled performance perturbations were created for 9 learners: global improvement, global decline, targeted weakness correction, targeted strength deterioration, and minimal change. We measured: (1) Calibration (R²): correlation between feedback sentiment and performance scores; (2) Feedback Robustness Index (FRI): percentage of instances in which feedback direction correctly matched the performance change. Sentiment was quantified using the TextBlob (general-purpose) and VADER (lexicon- and rule-based) algorithms.
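A minimal sketch of how these two metrics could be computed, assuming feedback texts, rubric scores, and perturbation directions are already collected in parallel lists; the function names and data layout are illustrative and not the authors' pipeline:

```python
# Hypothetical sketch of the Calibration (R^2) and FRI metrics described above.
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np

def calibration_r2(feedback_texts, performance_scores):
    """Squared correlation between TextBlob sentiment polarity and rubric scores."""
    sentiments = [TextBlob(t).sentiment.polarity for t in feedback_texts]
    r = np.corrcoef(sentiments, performance_scores)[0, 1]
    return r ** 2

def feedback_robustness_index(baseline_texts, perturbed_texts, expected_directions):
    """Percentage of perturbations where the VADER sentiment shift matches the known
    direction of the performance change (+1 improvement, -1 decline)."""
    vader = SentimentIntensityAnalyzer()
    hits = 0
    for base, pert, direction in zip(baseline_texts, perturbed_texts, expected_directions):
        delta = (vader.polarity_scores(pert)["compound"]
                 - vader.polarity_scores(base)["compound"])
        hits += int(np.sign(delta) == np.sign(direction))
    return 100 * hits / len(expected_directions)
```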
Results: Analysis of 225 feedback instances (9 learners x 5 assessments x 5 perturbations) revealed feedback quality varied dramatically by rubric design (Figure 1). Top-performing assessments with detailed, behaviorally-anchored rubrics (FLS Knots: R²=0.40, FRI=84%; Vascular: R²=0.37, FRI=73%) demonstrated both strong calibration and high robustness. Mid-tier assessments with moderate rubric detail showed acceptable robustness but poor calibration (EndoStitch: R²=0.24, FRI=64%; Imaging: R²=0.07, FRI=69%). The lowest-performing assessment with minimal rubric design (FLS Circle: R²=0.03, FRI=49%) demonstrated near-zero calibration and at-chance robustness, indicating feedback direction was essentially random. Across all assessments, robustness exceeded calibration (mean FRI=68% vs mean R²=0.16).
Conclusions: This study presents the first psychometric validation framework explicitly measuring AI feedback appropriateness and consistency in surgical education, establishing standards analogous to validity and reliability measures used for human assessors. AI feedback quality depends critically on assessment rubric design rather than model sophistication, with minimalistic rubrics resulting in poorly calibrated or random feedback. Educators implementing AI feedback should validate systems using calibration and robustness metrics before deployment and ensure assessment rubrics include detailed performance criteria.

ARTIFICIAL INTELLIGENCE IN SURGICAL QUALITATIVE RESEARCH: A COMPARISON OF HUMAN AND AI-ASSISTED THEMATIC ANALYSIS
Bonnie E Laingen, MS1, Wesley H Iobst, BS1, Gabriel E Cambronero, MD1, Jarrett Dobbins, BSE1, Jake Johnson, BS1, Maggie E Bosley, MD2, Lucas P Neff, MD1; 1Wake Forest University School of Medicine, 2Oregon Health & Science University
Introduction: As artificial intelligence (AI), such as ChatGPT, becomes integrated into scientific research, there is increasing interest in applying AI to qualitative data analysis, a process historically dependent on labor-intensive human interpretation. While AI’s value in quantitative analysis is established, its performance in qualitative research necessitates further evaluation to ensure efficiency does not compromise interpretive depth. This study compares inter-rater reliability (IRR) of a human-derived codebook (HC) versus an AI-derived codebook (AIC) applied to the same qualitative dataset to assess the feasibility and limitations of AI-assisted thematic analysis in surgical research.
Methods: An open-ended survey identified barriers surgeons face when performing laparoscopic common bile duct exploration compared with endoscopic retrograde cholangiopancreatography. Two readers performed inductive coding of 200 responses to generate the HC. The same dataset was uploaded into ChatGPT for zero-shot inductive clustering to generate the AIC, which the same readers used to deductively analyze the dataset. IRR was evaluated with agreement rate and Cohen’s kappa coefficient with bootstrapped 95% confidence intervals (CIs) in RStudio for HC and AIC.
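The authors computed kappa and bootstrapped CIs in RStudio; the minimal Python sketch below only illustrates an equivalent calculation, assuming the two readers' per-response code assignments are available as parallel arrays (names are illustrative):

```python
# Hypothetical sketch of Cohen's kappa with a bootstrapped 95% CI.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(reader1_codes, reader2_codes, n_boot=2000, seed=0):
    r1, r2 = np.asarray(reader1_codes), np.asarray(reader2_codes)
    kappa = cohen_kappa_score(r1, r2)
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(r1), len(r1))   # resample responses with replacement
        boots.append(cohen_kappa_score(r1[idx], r2[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return kappa, (lo, hi)
```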
Results: Both codebooks identified 13 codes, five of which were identical (equipment, time constraints, cost constraints, support, and low case volume). Average agreement rate was 95.1% for HC and 92.3% for AIC. IRR (Cohen’s kappa) was 0.754 [95% CI: 0.708–0.794] for HC and 0.638 [95% CI: 0.589–0.685] for AIC. IRR for the five shared codes showed no statistically significant difference, as indicated by overlapping CIs. No IRR CIs in HC included 0, whereas four in AIC did. Qualitatively, both produced similar themes; however, HC tended to separate themes more distinctly, while AIC exhibited overlapping themes, anecdotally making coding more difficult.
Conclusion: Both methods yielded high coder agreement, with HC demonstrating higher IRR and clearer thematic separation. Despite lower IRR and greater thematic overlap resulting in coding ambiguity, AIC reliably identified broad thematic categories, suggesting that AI can highlight overarching barriers and surgeon-reported trends. This study supports a hybrid approach in which AI is used for initial clustering followed by human refinement. This approach may optimize efficiency and interpretive accuracy in surgical qualitative research.
ARTIFICIAL INTELLIGENCE AMPLIFIES GENDER BIAS IN GENERAL SURGERY RESIDENCY APPLICATIONS
Vikram Krishna, MD, Drew Bolster, MD, Raffaele Rocco, MD, Philicia Moonsamy, MD, Harmik J Soukiasian, MD, Farin Amersi, MD, Andrew R Brownlee; Cedars-Sinai Medical Center
Introduction:
Artificial intelligence (AI)-based writing tools are increasingly used throughout medical education, but their extent of use and their effects on tone, grammar, and gender bias are poorly understood. We aimed to evaluate the impact of AI-generated language in letters of recommendation (LoR) and personal statements in general surgery applications.
Methods:
LoRs for general surgery residency applications from the 2025 cycle were analyzed for linguistic tone, grammar, and gender-coded language using Microsoft Copilot (GPT-4). Copilot quantified the percentage of positive, negative, and neutral language and generated a grammatical score (0 to 100). Gender bias was calculated, using a published lexicon of agentic (e.g., assertive, confident) and communal (e.g., kind, caring) descriptors and their synonyms, as the percentage of total words. QuillBot Premium detected AI involvement, classifying each LoR and personal statement by AI use and estimating the percentage of AI-generated content. The primary outcome was gender bias by AI use. Secondary outcomes included AI prevalence and extent of use.
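A minimal sketch of the lexicon-based gender-bias metric, assuming the lexicon is available as simple word sets; the terms below are placeholders, not the published lexicon used in the study:

```python
# Hypothetical sketch: percentage of a letter's words matching agentic or communal terms.
import re

AGENTIC  = {"assertive", "confident", "independent", "decisive"}   # placeholder terms
COMMUNAL = {"kind", "caring", "warm", "helpful"}                    # placeholder terms

def gendered_word_percentages(letter_text):
    words = re.findall(r"[a-z']+", letter_text.lower())
    total = len(words) or 1
    agentic_pct  = 100 * sum(w in AGENTIC for w in words) / total
    communal_pct = 100 * sum(w in COMMUNAL for w in words) / total
    return agentic_pct, communal_pct
```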
Results:
A total of 306 LoRs from 104 applicants were included. AI use was detected in 43.8% of LoRs (134/306) and in 86.5% of personal statements (90/104). Applicant gender and letter-writer gender were not associated with AI use (female applicants, AI: 65.7% vs no AI: 59.7%, p=0.28; female writers: 30.6% vs 25.1%, p=0.29). AI-assisted LoRs had more positive language (81 [78–85] vs 79 [74–83], p=0.003), less neutral language (18 [15–22] vs 21 [17–25], p=0.003), and better grammatical scores (94 [91–95] vs 92 [90–94], p=0.001) compared with non-AI-assisted letters. Gender bias percentage was higher in AI-assisted LoRs (2.0 [1.1–2.4] vs 1.5 [0.9–2.0], p=0.002). Female letter-writers had higher gender bias scores than male writers (2.0 [1.0–2.4] vs 1.6 [1.0–2.0], p=0.018). Among personal statements, AI use was more common in female applicants than in male applicants (92% vs 77%, p=0.026).
Conclusions:
Most applicants used AI for personal statements. In LoRs, AI improved grammar and positive language but increased gender bias. These findings reveal unintended consequences of using AI-generated content. As its prevalence grows, programs should establish AI disclosure guidelines and promote awareness of potential bias.

HOW DOES AI ENHANCED FEEDBACK COMPARE TO SURGEON FEEDBACK FOR ASSESSMENT OF BASIC SUTURING SKILLS?
Kendall Gross, MD1, Nicholas Roth, MD1, Alexander Vorreyer, BS1, Traves Crabtree, MD1, Pinckney Benedict, MFA1, Janet Ketchum, CSFA1, Sajan Koirala, MPH2, Nicole Sommer, MD1, Benjamin Rejowski, MD1, Jessica Cantrall, MPH2, Sowmy Thuppal, PhD3; 1Department of Surgery, Southern Illinois University School of Medicine, 2Department of Population Science and Policy, Southern Illinois University School of Medicine, 3Departments of Surgery and Population Science and Policy, Southern Illinois University School of Medicine
Purpose
Surgical residents undergo verification of proficiency (VOP), a standardized process to assess baseline technical skills. A surgeon-trained artificial intelligence (AI) model was developed to provide real-time, rubric-based feedback on suturing technique. The aim was to determine whether AI-generated feedback is non-inferior to expert surgeon feedback on suturing tasks performed by surgical residents.
Methods
The AI model and faculty surgeons completed rubric-based assessments of basic suturing techniques performed by surgical residents. Residents were randomized to receive AI feedback followed by surgeon feedback (Group A) or surgeon feedback followed by AI feedback (Group B). Both groups completed a standardized Likert-scale questionnaire assessing feedback clarity, usefulness, and confidence improvement.
Results
AI feedback was shared within 24 hours of VOP completion and surgeon feedback on average 30 days afterward for Group A, while Group B first received surgeon feedback 30 days after VOP completion, followed by AI feedback. There was excellent agreement between AI and faculty for vertical mattress (91.2%), subcuticular (86%), and simple interrupted (91%) suturing. AI was noninferior to faculty assessment across all rubrics. Residents who received immediate AI feedback were satisfied with its quality, timing, and actionable, specific suggestions for improvement. AI was highly permissive, with 100% sensitivity for correctly identifying every surgeon “Pass” case, while also passing some surgeon “Fail” cases. Most rubrics showed strong AI-surgeon agreement, but differences appeared in rubrics where subjective judgment of skill nuance likely influences surgeon evaluation.
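A minimal sketch of the agreement and pass/fail sensitivity calculations reported above, assuming paired AI and surgeon ratings are available as arrays; variable names are illustrative, not the study's analysis code:

```python
# Hypothetical sketch of percent agreement and "Pass" sensitivity.
import numpy as np

def percent_agreement(ai_scores, surgeon_scores):
    """Percentage of rubric items where AI and surgeon ratings match exactly."""
    ai, surgeon = np.asarray(ai_scores), np.asarray(surgeon_scores)
    return 100 * np.mean(ai == surgeon)

def pass_sensitivity(ai_pass, surgeon_pass):
    """Proportion of surgeon 'Pass' cases that the AI also passed."""
    ai, surgeon = np.asarray(ai_pass, bool), np.asarray(surgeon_pass, bool)
    return np.sum(ai & surgeon) / np.sum(surgeon)
```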
Conclusion
This study demonstrates that AI can provide rubric-based assessments of suturing skills comparable to faculty evaluation. AI shows a pattern of being safe but lenient in its assessments. AI could improve its judgment of certain motion aspects, such as multiple forceps grasps and skin eversion. With the expanding capabilities of AI, these results show that, with adjustments to subjective criteria, AI can be an adjunct for providing efficient and effective feedback in surgical training.
AI AUGMENTED LEARNING: HOW AI CAN PERSONALIZE AND OPTIMIZE SURGICAL EDUCATION
Sara Saymuah, MD1, Carina McGuire, MD1, Griffin Feinberg2, Jessiel Castillo2, Julia Winschel2, Christina Raker3, Marcoandrea Giorgi1; 1The Miriam Hospital, Comprehensive Hernia Center, 2Brown University Warren Alpert School of Medicine, 3Brown University Biostatistics, Epidemiology, and Research Design Core
INTRODUCTION
Learning curves are valuable tools for surgical education, informing how much time trainees should spend intraoperatively to reach surgical proficiency. The advent of Entrustable Professional Activities indicates a shift toward more individualized training. This study uses AI evaluation of robotic retrorectus extended totally extraperitoneal (eTEP) hernia repair learning curves to provide individualized proficiency projections for surgical trainees, proposing the use of AI-augmentation as a unique tool in the armamentarium of personalized surgical education.
METHODS
Retrospective analysis was performed of 129 robotic eTEP abdominal wall hernia repairs performed by fellows between 11/8/2019 and 3/4/2025. Linear regression was used to compare first- and last-case operative time and mesh placement rate (cm²/hr), with adjustment for concomitant transversus abdominis release and lysis of adhesions. A separate prospective analysis was performed on 9 robotic eTEP hernia repairs performed by a fellow-in-training between 8/5/25 and 10/20/25. ChatGPT-5 analyzed the second data set (using the first data set as a benchmark) to predict the number of cases the training fellow would require to reach proficiency.
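A minimal sketch of the kind of adjusted linear regression described above, using statsmodels with an illustrative DataFrame; the column names and values are placeholders, not the study dataset:

```python
# Hypothetical sketch of a learning-curve regression: operative time vs. case
# number, adjusted for concomitant TAR and lysis of adhesions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "case_number":        [1, 2, 3, 4, 5, 6, 7, 8],
    "operative_time_min": [240, 232, 225, 230, 210, 205, 198, 190],
    "tar":                [0, 1, 0, 0, 1, 0, 1, 0],   # transversus abdominis release
    "lysis_of_adhesions": [1, 0, 0, 1, 0, 0, 1, 0],
})

model = smf.ols("operative_time_min ~ case_number + tar + lysis_of_adhesions", data=df).fit()
print(model.params["case_number"])   # estimated change in minutes per additional case
```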
RESULTS
Fellows completed a mean of 25.8 ± 8.7 cases, with a mean mesh size of 579.6 ± 212.9 cm². Operative time decreased from 228 ± 61.5 minutes in the first case to 186.4 ± 56.5 minutes in the last case (P = 0.04), representing an 18% improvement in efficiency. Mesh placement rate increased from 2.13 ± 0.65 cm²/hr to 3.49 ± 0.49 cm²/hr (P = 0.004), representing a 64% gain in technical efficiency. A fellow-in-training completed 9 cases with a mean mesh size of 584.7 cm², mean operative time of 217.4 minutes, and mean mesh placement rate of 2.67 ± 1.01 cm²/min.
CONCLUSION
Our study reveals significant improvement in operative time and mesh placement rate for fellows learning eTEP repair, with proficiency reached at 20-25 cases independent of case complexity. AI projected the fellow-in-training to reach operative time proficiency in 20 cases and mesh placement proficiency in 18 cases, indicating accelerated progression compared with the benchmark (Figure 1). AI-augmented analysis is a powerful tool for personalized operative curricula, adapting to individual trainee needs and optimizing intraoperative learning.

ACCURACY OF GENERATIVE AI IN TEACHING SURGICAL PROCEDURES: DO SURGICAL RESIDENTS CARE?
Blake T Beneville, MD1, Abigail J Hatcher, MD1, Michael M Awad, MD, PhD, MHPE1, Ganesh Sankaranarayanan, PhD2; 1Washington University in St. Louis, 2The University of Texas Southwestern Medical Center at Dallas
Introduction: Generative AI (GenAI) chatbots are easily accessible to surgical residents, yet the accuracy of their procedural guidance and how residents actually use them are not well described. We conducted a three-part study to compare chatbot accuracy on standardized prompts, evaluate performance during real resident use, and characterize perceptions and usage patterns.
Methods: First, four free chatbots (ChatGPT-4o, Claude, Gemini, Copilot) were prompted to generate responses to standardized prompts for six procedures (chest tube, central line, laparoscopic cholecystectomy, robotic hiatal hernia repair, open inguinal hernia repair, below-knee amputation), in both a “book-chapter” and an “operative note” format. Board-certified faculty scored each response on a four-point scale (1=completely incorrect to 4=comprehensive). One faculty member scored the inguinal hernia repair and two scored each of the other procedures, yielding 88 evaluations. Second, PGY 1–3 residents (n=11) interacted naturally with the chatbots to learn two operations (laparoscopic cholecystectomy, open inguinal hernia repair) using self-generated prompts. Faculty rated the resident conversations with ChatGPT using the same rubric to compare against the standardized versions, yielding 33 evaluations. Third, residents completed pre/post surveys on experience, trust, usability, and likelihood of future chatbot use. Analyses included descriptive statistics and Friedman chi-square tests.
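A minimal sketch of a Friedman chi-square comparison across the four chatbots; each list holds one faculty accuracy rating per standardized prompt, and the values are fabricated placeholders purely for illustration, not study data:

```python
# Hypothetical sketch of the Friedman chi-square test across four chatbots.
from scipy.stats import friedmanchisquare

chatgpt = [4, 3, 4, 3, 4, 3]
claude  = [3, 3, 4, 3, 3, 3]
gemini  = [4, 3, 3, 3, 4, 3]
copilot = [3, 3, 3, 3, 4, 3]

stat, p = friedmanchisquare(chatgpt, claude, gemini, copilot)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.2f}")
```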
Results: Faculty ratings showed no statistically significant difference in accuracy across chatbots for standardized prompts (χ², p=0.53). In contrast, residents ranked ChatGPT highest for accuracy (p=0.011) and—along with Claude—rated it most engaging and easiest to use. Across items referencing their top-ranked chatbot, all residents agreed it could provide correct operative information and understood their questions. Qualitative faculty comments suggested answers improved with iterative prompting, indicating that prompt quality influences output. After the exercise, residents’ intentions to use chatbots for operative preparation and other clinical tasks were mixed, reflecting enthusiasm tempered by caution.
Conclusion: Surgical residents are already using GenAI, but our findings suggest they prioritize user experience over objective accuracy. This creates an urgent need for educators to guide trainees in 'prompt engineering' and critical appraisal of AI-generated content. Future work will expand sample size, vary case complexity, and link chatbot-supported preparation to performance outcomes in simulation and the operating room.
LEVERAGING DIGITAL PLATFORMS TO ADVANCE GLOBAL ORTHOPAEDIC EDUCATION AND COLLABORATION
Emily Powis, Carlos Mercado, Samhita Kadiyala, Suhas Velichala, Sion Yu, Kiran Agarwal-Harding; Harvard Global Orthopaedics Collaborative
Background
Musculoskeletal injuries are a leading cause of death and disability worldwide, especially in low- and middle-income countries (LMICs). Improving orthopaedic surgical education and global collaboration may help address this burden. As an extension of our virtual education conferences, a WhatsApp community and YouTube channel were formed to make discussion and surgical lectures more accessible to our global orthopaedics network. Our study aimed to assess the utility of these digital platforms in providing accessible opportunities for orthopaedic education and real-time sharing of clinical knowledge.
Methods
We conducted a mixed-methods descriptive study across two digital platforms (WhatsApp and YouTube) utilized by our community. A Python script was used to track message volume, cases shared, and geographic distribution of participants in a WhatsApp community consisting of eight subgroups. A REDCap survey was also distributed within the community to assess engagement, educational value, and influence on clinical decision making. Data from our affiliated YouTube channel was analyzed to evaluate engagement and subscriber demographics.
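A minimal sketch of the kind of Python script described above, counting messages per sender from an exported WhatsApp chat .txt file; the export-format regex and file handling are assumptions about a standard Android-style export, not the authors' code:

```python
# Hypothetical sketch of a WhatsApp chat-export parser
# (assumes lines like "DD/MM/YYYY, HH:MM - Sender: message").
import re
from collections import Counter

MSG = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} - (?P<sender>[^:]+): ")

def message_counts(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = MSG.match(line)
            if match:
                counts[match.group("sender")] += 1
    return counts
```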
Results
Between March 2024 and October 2025, 1,166 cases from 26 countries were discussed in a WhatsApp community of 1,754 members across 95 countries, averaging 5.3 cases weekly. The REDCap survey was completed by 106 individuals from 42 countries, most of whom were male (95.3%), consultant surgeons (70.8%), orthopaedic specialists (91.5%), and based at academic institutions (54.7%). Nearly all participants found the WhatsApp community to be somewhat (24.5%) or extremely (67%) useful for clinical learning or decision making, and 80.2% reported utilizing the community to inform real clinical decisions. Since launching in August 2020, the YouTube channel has posted 193 lectures in English (48.7%), Spanish (25.9%), and French (25.4%), amassing 10,700 subscribers, 567,512 views, and 32,635 watch hours. Viewers from India, US, Pakistan, Bangladesh, and Kenya made up 55% of views.
Conclusion
Digital platforms such as WhatsApp and YouTube are powerful tools to support real-time clinical decision making and democratize access to high-quality medical education. By fostering international collaboration, these platforms enable true multidirectional learning and build durable global networks of support that may help improve musculoskeletal care, particularly in LMICs.

Figure 1. WhatsApp community membership
CREATING MOCK ORAL QUESTIONS FOR GENERAL SURGERY CERTIFYING EXAM - A POTENTIAL NOVEL APPLICATION OF ARTIFICIAL INTELLIGENCE (AI)
Brooke Bocklud, MD, Isabela Visintin, MD, Meghan Daly, Amy Crisp, PhD, Ruchir Puri, MD, T. Shane Hester, DO; University of Florida College of Medicine - Jacksonville
Background
The General Surgery Certifying Exam (GSCE) is a high-stakes oral examination, required by the American Board of Surgery for board certification, in which residents work through case-based vignettes with examiners. Residency programs simulate this experience during mock oral examinations. Creating oral exam questions is time-consuming for faculty. Artificial intelligence (AI) tools could potentially assist faculty in creating these scenarios.
Methods
This was a single-institution prospective study. Two separate mock oral examinations were administered by faculty to general surgery residents in 2025. Each exam had a total of 12 scenarios: 6 written by AI (with faculty oversight for accuracy) and 6 written by faculty (non-AI). Both the faculty and residents were blinded to the origin of the questions and completed a post-exam survey. Comparison analyses were performed using a chi-squared test or Fisher’s exact test as appropriate, and a ROC analysis was used to obtain sensitivity, specificity, and AUC. We hypothesized that the residents would not be able to differentiate between the AI and non-AI created scenarios.
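A minimal sketch of the ROC analysis described above, scoring participants' judgments of whether a scenario was AI-written against its true origin; the arrays below are illustrative placeholders, not study data:

```python
# Hypothetical sketch of AUC, sensitivity, and specificity for AI-origin detection.
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]   # 1 = scenario actually written by AI
y_pred = [1, 0, 0, 1, 1, 0]   # 1 = participant judged it to be AI-written

auc = roc_auc_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```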
Results
A total of 25 residents and 8 faculty completed the survey. Participants were predominantly male (52%), and 36% were white. For the AI-written questions, 34% of participants correctly concluded they were written by AI. For the non-AI-written questions, 28% of participants incorrectly assumed they were written by AI (p = 0.35). The AUC was 53.2% (95% CI: 47.3%-59.2%), indicating that participants were unable to identify when the questions were AI-generated. Questions deemed appropriate for the exam were similar at 87% vs. 84% (p = 0.64) in the AI and non-AI groups, respectively. The adequacy of information provided within the questions was similar at 94% vs. 89% (p = 0.23) in the AI and non-AI groups. Participants rated the questions as somewhat difficult at similar rates, 27% vs. 33% (p = 0.27), in the AI and non-AI groups.
Conclusion
Faculty-written and AI-created board scenarios could not be differentiated by surgical trainees or faculty during mock oral exams. AI may offer extensive possibilities for creating new oral board scenarios, potentially saving faculty valuable time.
