ePoster
THEMATIC ANALYSIS OF INTRAOPERATIVE TEACHER-LEARNER COMMUNICATION: HOW RELIABLE IS AI?
Michael J Ferzoco, Lauren Lewis, MEng; Washington University School of Medicine in St. Louis
Introduction
Evaluating intraoperative teaching and resident autonomy remains a significant challenge in surgical education. Dialogue between attendings and trainees is a rich data source, but analysis is constrained by labor-intensive qualitative coding. We therefore evaluated the reliability of a GPT-based artificial intelligence (AI) assistant in performing line-by-line qualitative coding of intraoperative dialogue as an automated, scalable assessment tool.
Methods
Five intraoperative audio recordings from distinct attending-trainee dyads were transcribed using automated speech recognition. A 35- to 65-minute de-identified segment from each recording, selected for dense interaction, was split into sentences. Using a 36-category codebook, the AI assistant coded each transcript twice. Reliability between AI runs was calculated using percent agreement and Cohen’s kappa (κ). Per-code precision was assessed with the Jaccard index.
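For illustration, the sketch below shows how the three reliability measures above can be computed from two coding runs; the data are hypothetical, and scikit-learn's metric functions are assumed to be an acceptable implementation.

```python
# Minimal sketch of the between-run reliability metrics (hypothetical codes,
# one label per transcript line from each AI run).
from sklearn.metrics import cohen_kappa_score, jaccard_score

run1 = ["Instruction", "Feedback", "Instrument Request", "Feedback", "Off Target Talking"]
run2 = ["Instruction", "Explanation/Education", "Instrument Request", "Feedback", "Off Target Talking"]

# Percent agreement: share of lines assigned the same code in both runs
agreement = sum(a == b for a, b in zip(run1, run2)) / len(run1)

# Cohen's kappa: chance-corrected agreement between the two runs
kappa = cohen_kappa_score(run1, run2)

# Per-code Jaccard index: overlap of the lines labeled with a given code in each run
code = "Feedback"
jaccard = jaccard_score([c == code for c in run1], [c == code for c in run2])

print(f"agreement={agreement:.2f}, kappa={kappa:.2f}, Jaccard({code})={jaccard:.2f}")
```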
Results
Across 2,245 transcript lines, between-run reliability demonstrated 79% agreement and κ of 0.75 (p<0.05). Per-code agreement varied by code complexity. Straightforward sentences were coded more consistently than those requiring implicit reasoning or contextual interpretation (Table 1).
Conclusions
The AI assistant achieved substantial agreement for intraoperative linguistic coding, supporting automated dialogue analysis as a reliable, scalable assessment tool. Lower reliability for abstract sentences highlights a need to refine code definitions and transcript accuracy to better capture nuanced teaching. This work represents a key step toward developing automated tools for real-time assessment of intraoperative teaching and providing structured feedback to trainees to advance autonomy.
| Code | Definition | Jaccard Index |
|---|---|---|
| Instrument Request | Attending or trainee asks for instrument | 0.82* |
| Acknowledgement/Affirmation | Confirmation or receipt of information | 0.81* |
| Off Target Talking | Conversations not related to the operation | 0.77* |
| Instruction | Directive guidance on next steps or actions | 0.76* |
| Explanation/Education | Explanations and teaching points | 0.57 |
| Feedback | Performance-oriented comments that reinforce or correct actions | 0.57 |
| Shared Mental Modeling | Explicit alignment on goals or next steps | 0.49 |
| Case Complexity/Anatomy | Characterizing difficulty/anatomical variation | 0.32* |
*p<0.05
FROM OR DIALOGUE TO ACTIONABLE FEEDBACK: A SCALABLE METHOD FOR ANALYZING INTRAOPERATIVE TEACHING
Blake T Beneville, MD, Katharine E Caldwell, MD, MSCI, Jenna Bennett, BS, Mohamed Jama, BS, Cory Fox, BS, Mike Ferzoco, BS, Lauren Lewis, BS, Jonathan Tong, BS, Michael M Awad, MD, PhD, MHPE; Washington University in St. Louis
Background: Feedback is central to surgical skill acquisition, yet trainees consistently report receiving less feedback than they need. Much of the most specific feedback occurs as real-time dialogue in the operating room and is quickly lost. Traditional qualitative review of these conversations is too time-intensive to scale. We sought to develop a practical method to capture, analyze, and return operative teaching feedback to trainees and programs.
Methods: Attending–trainee dialogue from 25 general surgery operations using open and minimally invasive techniques was audio-recorded, transcribed, manually corrected, and de-identified. Transcripts were reformatted to one sentence per line and coded using a 36-category codebook focused on feedback, coaching behaviors, questions, and off-target talk. A small set of pilot transcripts was manually coded to calibrate an AI assistant (GPT-based) within a human-in-the-loop workflow. The assistant then coded the remaining transcripts in reviewable batches, and investigators confirmed or corrected codes. Throughput and error rates were compared with manual coding alone.
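As a rough illustration of the human-in-the-loop workflow described above, the sketch below batches transcript lines, collects AI-suggested codes, and routes each suggestion through reviewer confirmation; both callables (`code_lines_with_llm`, `review`) are hypothetical placeholders, not the study's actual implementation.

```python
# Hedged sketch of batched, human-verified coding; the LLM call and the review step
# are stand-ins supplied by the caller.
from typing import Callable

def code_transcript(lines: list[str],
                    code_lines_with_llm: Callable[[list[str]], list[str]],
                    review: Callable[[str, str], str],
                    batch_size: int = 100) -> list[str]:
    """Code a transcript in reviewable batches, keeping a human in the loop."""
    accepted: list[str] = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        suggested = code_lines_with_llm(batch)   # AI assistant proposes one code per line
        for line, code in zip(batch, suggested):
            accepted.append(review(line, code))  # investigator confirms or corrects the code
    return accepted
```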
Results: The AI-assisted workflow required ~10 minutes per 100 transcript lines versus >60 minutes for manual coding, operating at <20% of the manual time and saving an estimated 239 hours on the full dataset (28,799 lines). Overall error rate was 2.0% (≈98% accuracy), with calibrated batch accuracy ≥95%. Most errors involved boundary cases (for example, distinguishing interjections from off-target talk, or labeling brief words like “good/okay” as feedback versus simple agreement) and occasional misattribution of teaching directed to a third party. These were readily corrected during review.
Conclusions/Implications: Capturing and analyzing OR dialogue with an AI-supported, human-verified process makes it feasible to return structured, case-specific feedback to trainees at scale. Programs can generate rapid post-case summaries (for example, number and type of feedback statements, coaching strategies used, and missed opportunities), support faculty development with objective teaching profiles, and monitor learning climate over time. This approach turns ephemeral intraoperative teaching into actionable feedback for learners while preserving rigor and faculty time.
PUBLICLY AVAILABLE CHATGPT SIGNIFICANTLY INCREASED ARTIFICIAL INTELLIGENCE AUTHORSHIP IN PERSONAL STATEMENTS USED IN GENERAL SURGERY RESIDENCY APPLICATIONS
Sana Khan, Dana Cooley, Miguel Tobon, Eliza Beal, David Bouwman, David Edelman; Wayne State University School of Medicine
Background
The use of artificial intelligence authorship (AIA) remains controversial throughout the educational spectrum and could be considered a disruptive tool in the generation of the personal statement (PS). We hypothesized that the use of AIA increased significantly in candidates' PSs in general surgery residency applications after ChatGPT became widely available.
Methods
All applications to a large, urban, academically affiliated general surgery residency program during three application cycles (2022-2025) were included. Since ChatGPT became available to the public on November 30, 2022, the 2022-2023 application cycle was considered the PRE-ChatGPT group (PRE) and the combined 2023-2024 and 2024-2025 cycles were considered the POST-ChatGPT group (POST). Appropriate IRB and organizational permissions were obtained for this study. PSs were extracted, blinded, and analyzed using commercially available artificial intelligence detector software (Originality.ai). Additional items analyzed included type of medical school (United States vs. international), USMLE Step 2CK scores, and visa status. Appropriate statistical analyses were applied; a p-value < 0.05 was considered significant.
Results
A total of 4596 PSs were included in this study, of which 1822 (40%) were PRE and 2774 (60%) were POST. Overall, AIA was used in 2256 (49%) PSs. AIA usage was significantly higher in the POST group (1853/2774 = 67%) than in the PRE group (403/1822 = 22%), p<0.01. Within the POST group, AIA was used significantly more by candidates attending international medical schools and by non-US citizens (p<0.05). No correlation in AIA usage was seen based on Step 2CK scores (p>0.05). When analyzing the two application cycles within the POST group, AIA usage remained stable among US applicants and increased among international applicants.
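Using the counts reported above, the PRE vs. POST difference can be reproduced with a standard chi-square test of independence; this is a sketch only, since the abstract does not specify which test was applied.

```python
# Sketch of the PRE vs POST comparison of AIA-positive personal statements,
# assuming a chi-square test of independence.
from scipy.stats import chi2_contingency

#            AIA-positive   AIA-negative
table = [[403,  1822 - 403],    # PRE-ChatGPT cycle
         [1853, 2774 - 1853]]   # POST-ChatGPT cycles

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3g}")  # p << 0.01, consistent with the reported difference
```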
Conclusions
AIA increased significantly after ChatGPT became publicly available and is now commonly used. Candidates applying from international medical schools and non-US citizens used AIA more frequently. AIA usage compromises the value of PSs in the consideration of a candidate for a general surgery residency position. A continued national conversation is warranted to develop policy around artificial intelligence authorship, including the integration of detection algorithms and redefining evaluation metrics for PSs.
MOCK ORALS REIMAGINED: CHATGPT’S ROLE IN BOARD EXAM PREPARATION
Jessica L Masch, MD, John R Austin, MD, Michelle Lippincott, MD, Allison E Berndtson, MD, Jarrett E Santorelli, MD; University of California San Diego
Background:
AI is rapidly transforming medical education, yet its role in surgical training remains uncertain. The ABS Certifying Exam demands real-time clinical reasoning, traditionally refined through faculty-led mock orals or expensive review courses. ChatGPT-4o, with its dynamic response capabilities, could serve as a scalable alternative. This study evaluates its effectiveness in simulating oral board scenarios compared to traditional techniques.
Methods:
Eight senior general surgery residents (PGY 4–5) completed two mock oral sessions: one with trauma faculty and one using ChatGPT-4o. ChatGPT was pre-programmed to present cases, await resident responses at decision points, and adapt scenarios accordingly without providing immediate feedback. Feedback was given only at the end of each session. Residents then rated authenticity, utility, and likelihood of future use.
Results:
ChatGPT reliably emulated the ABS exam format when appropriately prompted. There was no significant difference in residents' ratings of case similarity (p=0.13). However, residents found ChatGPT less valuable for preparation (p=0.03) and reported being less likely to use it for future preparation (p<0.01), citing concerns about authenticity (p=0.01), the usefulness of feedback (p=0.03), and information accuracy (p<0.01) compared with faculty-led sessions.
Conclusion:
ChatGPT can generate structured oral board scenarios, highlighting AI’s potential in surgical education. Nonetheless, skepticism regarding authenticity and accuracy limits its adoption. Refining AI-driven mock orals—potentially by integrating verified ABS study materials—could enhance trust and usability, making AI a scalable, cost-effective adjunct in board preparation.
EARLY EXPERIENCES AND PERCEIVED BARRIERS TO IMPLEMENTATION OF ENTRUSTABLE PROFESSIONAL ACTIVITIES (EPAS) IN GENERAL SURGERY RESIDENCY PROGRAMS
Tasha Posid, MA, PhD1, Osama Elsawy, DO2, Jenny Guido, MD3, Lan Vu, MD4, Leslie Haislip5, Lisa Cunningham, MD1, Theresa N Wang, MD6, Emily Huang, MD1, Ellen Hagopian, MD, MHPE, MEd, MESA, FACS, FSSO7, Minna M Wieck, MD8, Justine Broecker, MD9, Kshama Jaiswal, MD10; 1The Ohio State University Wexner Medical Center, 2Saint Joseph's University Medical Center, 3Sanford Health, 4UCSF Benioff Children’s Hospital, 5Eastern Carolina University, 6University of Washington, 7University of Toledo College of Medicine and Life Sciences, 8University of California-Davis Health, 9University of California San Francisco, 10University of Utah
Introduction: Entrustable Professional Activities (EPAs) are a core component of competency-based education (CBE), offering a structured, workplace-based assessment of observable trainee behaviors. Despite recent adoption by the American Board of Surgery, and initial validity evidence across surgical specialties, implementation challenges remain underexplored. This study examines early experiences and perceived barriers to EPA use among general surgery residents and faculty.
Methods: A preliminary survey was distributed via REDCap to members of two ASE committee working groups and their respective institutions. Respondents included general surgery residents (n=30) and faculty (n=7). The survey assessed frequency and modality of EPA use, perceived ease of use, feedback practices, and perceived barriers to implementation. Descriptive and comparative analyses were performed using summary statistics, independent-sample t-tests or chi-square tests to compare residents and faculty, and single-sample tests to evaluate survey responses against predetermined benchmarks for satisfaction and ease of use.
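A minimal sketch of the two comparison types named above, with hypothetical survey data: an independent-sample t-test comparing residents and faculty, and a single-sample test of ease-of-use ratings against an assumed neutral benchmark of 3 on a 1-5 scale.

```python
# Hypothetical data; the benchmark value and rating scale are assumptions.
from scipy.stats import ttest_ind, ttest_1samp

resident_epas = [22, 31, 28, 35, 19, 30]   # EPAs completed per resident (hypothetical)
faculty_epas = [29, 34, 27, 36]            # EPAs completed per faculty member (hypothetical)
ease_ratings = [4, 5, 4, 4, 5]             # ease-of-use ratings (hypothetical)

print(ttest_ind(resident_epas, faculty_epas))   # residents vs. faculty
print(ttest_1samp(ease_ratings, popmean=3))     # ratings vs. neutral benchmark
```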
Results: All respondents (100%) accessed their EPA platforms via mobile app, primarily for operative case assessments (Mean of EPAs completed: Residents = 27.6, Faculty = 31.4, p>0.1). Both groups reported the platform as “very easy” to use (p<0.05 vs. neutral). Residents reviewed completed EPAs with attendings less than half the time, and only 69% reported using EPA feedback for self-directed improvement. The most frequently cited barriers overall (>50% collectively) were difficulty developing the habit to complete EPAs, high cognitive load, and burnout (Figure 1). Residents more strongly perceived low faculty buy-in (p<0.05) and limited faculty response rates (p<0.05) as barriers. Faculty, conversely, identified the lack of linkage between EPAs and milestones (p<0.05) as a key limitation. Both groups were equally uncertain whether EPAs promote resident autonomy (p>0.1), though most residents (77%, p<0.05) viewed them as a valuable feedback tool. Qualitative comments highlighted that faculty reminders and structured expectations improved engagement.
Conclusion: Early implementation of EPAs in general surgery appears feasible and educationally valuable but is limited by workflow challenges and variable perceptions of faculty engagement and buy-in. While these preliminary findings are based on a small sample, we anticipate obtaining additional responses through a forthcoming ASE-wide survey.

ALIGNMENT BETWEEN OVERALL AND COMPONENT ENTRUSTMENT SCORES: INSIGHTS FROM GENERAL SURGERY EPA ASSESSMENTS
Kara L Faktor, MD, MSc1, Jessica R Santos-Parker, MD, MS, PhD1, Keli S Santos-Parker, MD, MS, PhD1, Patricia O'Sullivan, EdD1, Olle ten Cate, PhD2, Lan Vu, MD1; 1University of California San Francisco, 2Utrecht University
Introduction: A four-level entrustment-supervision scale (limited participation, direct supervision, indirect supervision, practice ready) is used to assess Entrustable Professional Activities (EPAs) in general surgery residency. The narrative description of each level of entrustment comprises several components. We aimed to understand how scores on discrete components of intra-operative performance contribute to the overall EPA assessment score, to gain insight into the aspects of resident intra-operative performance that most strongly influence faculty entrustment decisions across PGY levels.
Methods: Assessments on four common general surgery EPAs (gallbladder disease, appendicitis, inguinal hernia, small bowel obstruction) were selected from intra-operative EPA assessments for a single general surgery residency program in the 2024-2025 academic year. The American Board of Surgery (ABS) EPA narrative descriptions were deconstructed into component domains, including anatomy, operative steps and technical skill. We fit a linear regression model treating overall EPA assessment scores (1-4) as continuous with cluster-robust standard errors by attending. Case difficulty was coded as an ordered trend (straightforward, moderate, complex).
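A sketch of the model described above is shown below, assuming statsmodels and hypothetical column names; the key detail is the cluster-robust covariance grouped by attending.

```python
# Hedged sketch: overall EPA score (1-4) treated as continuous, component scores and
# case complexity as predictors, standard errors clustered by attending.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("epa_assessments.csv")  # hypothetical export of EPA assessments
df["complexity"] = df["complexity"].map(
    {"straightforward": 0, "moderate": 1, "complex": 2}  # ordered trend coding
)

model = smf.ols(
    "overall_score ~ technical_skill + operative_steps + anatomy + complexity + C(pgy_level)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["attending"]})
print(model.summary())
```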
Results: A total of 758 EPA assessments were analyzed, including 348 for gallbladder disease, 202 for appendicitis, 171 for inguinal hernia, and 37 for small bowel obstruction. Case complexity increased with PGY level (OR 2.92-6.12 for PGY-2 through PGY-5 compared with PGY-1, p<0.001). Each one-step increase in case complexity was associated with a -0.21 change in expected overall EPA assessment score (SE 0.04, p<0.001). Overall EPA assessment scores increased with PGY level, with significant increases in the PGY-4 (β = 0.47, p<0.001) and PGY-5 (β = 0.72, p<0.001) years. Holding all other component scores constant, technical skill had the strongest correlation with the overall EPA assessment score (β = 0.34, p<0.001), followed by operative steps (β = 0.27, p<0.001). Anatomy demonstrated a minimal and statistically borderline association with the overall EPA assessment score (β = 0.06, p = 0.051). The model explained 67.8% of the variance in overall EPA assessment scores.
Conclusion: Overall entrustment increased with PGY-level and decreased with case complexity. Technical skill and knowledge of operative steps demonstrated the strongest influence on overall faculty entrustment. However, 32.2% of the variance was unexplained, suggesting faculty rely on additional factors when making entrustment decisions.
MORE THAN A METRIC: SURGEONS IN THE EARLY ROBOTIC LEARNING CURVE REPORT VALUE AND IMPROVED CONFIDENCE FROM FEEDBACK AND TRAINING RECOMMENDATIONS BASED ON OBJECTIVE PERFORMANCE INDICATORS (OPIS)
Gretchen P Jackson, MD, PhD, FACS, FACMI, FAMIA1, Jeffrey Voien2, Karlis Draulis2, Andrew Yee, PhD2, Michael M Awad, MD, PhD, MHPE, FACS3; 1Intuitive Surgical / Vanderbilt University Medical Center, 2Intuitive Surgical, 3Washington University, Department of Surgery and Institute for Surgical Education (WISE)
Introduction: Providing objective, scalable feedback to new robotic surgeons is a critical challenge in surgical education. Traditional case observation is subjective and resource intensive. We hypothesized that objective performance indicator (OPI) reports, derived from robotic system data and delivered early in the learning curve, would be perceived by surgeons as valuable and effective educational tools.
Methods: A mixed-methods prospective study of robotic-naïve, practicing surgeons was conducted. Participants received reports detailing four case-level OPIs (clutch and camera metrics) after their 5th, 15th, and 25th robotic cases. Reports included percentile scores benchmarked against peers matched by training and experience. Surgeons scoring below the 50th percentile received training recommendations (e.g., simulation, video review). Perceptions were evaluated with qualitative surveys.
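The percentile benchmarking step can be illustrated as below; the metric values and the direction of scoring are hypothetical, and only the below-50th-percentile trigger for a training recommendation is taken from the description above.

```python
# Sketch of benchmarking one case-level OPI against matched peers (hypothetical data).
from scipy.stats import percentileofscore

peer_values = [12.4, 9.8, 15.1, 11.0, 13.7, 10.2]  # matched-peer values for one OPI
surgeon_value = 10.5

pct = percentileofscore(peer_values, surgeon_value)
recommend_training = pct < 50   # below the 50th percentile triggers a recommendation
print(f"percentile={pct:.0f}, recommend_training={recommend_training}")
```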
Results: 110 surgeons were enrolled; 107 completed the study, and 66 responded to surveys. OPI reports and recommendations were well received. Most surgeons agreed (agree/strongly agree) that reports provided useful information about technical skills (84%), identified areas for improvement (83%), and would positively impact their practice (80%). 88% reported that it was important to receive OPI reports in training pathways, with desired frequencies of annually (6%), quarterly (67%), monthly (20%), twice monthly (5%), and daily (2%). 84% wanted to receive training recommendations, with most agreeing that recommendations were valuable (82%) and improved technical skills (80%), efficiency (78%), and confidence (75%). Desired frequencies for recommendation delivery were quarterly (62%), monthly (22%), twice monthly (11%), weekly (2%), and daily (2%). 45% of surgeons receiving recommendations completed them, citing time (53%) and simulation access (25%) as barriers. No significant differences in OPI improvement were found between participants who completed recommendations and those who did not. Qualitative themes revealed a desire for objective data but a need to translate metrics into actionable practice.
Conclusions: This study demonstrated that OPI-based feedback and training recommendations were highly valued and effective educational tools for practicing surgeons on the early robotic surgery learning curve. A primary driver of this educational benefit, which improved confidence, is likely the delivery of objective, benchmarked feedback, rather than the completion of remedial tasks. This "learning loop" model represents a scalable paradigm for continuing surgical education.
ROBOTIC SIMULATION AS A MARKER OF RESIDENT PROGRESSION IN GENERAL SURGERY
Vikram Krishna, MD, Drew Bolster, MD, Raffaele Rocco, MD, Philicia Moonsamy, MD, Harmik J Soukiasian, MD, Farin Amersi, MD, Andrew R Brownlee, MD; Cedars-Sinai Medical Center
Introduction:
Competency-based assessment in general surgery training is lacking. Robotic simulation provides an opportunity to use objective metrics to assess proficiency. Our study evaluated robotic simulator performance as a marker of resident progression.
Methods:
All PGY-1 to PGY-3 general surgery residents at a single academic institution were enrolled in a curriculum-based robotics simulation program. Robotic skills data were prospectively collected between August and October 2025. Residents’ total number of attempts, mean composite score, proportion of average scores >75 per skill (“competent performance”), and proportion of average scores >90 (“high performance”) were recorded. The primary outcome was mean composite score progression by PGY year. Secondary outcomes included competent performance and high performance by PGY year.
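The outcome definitions above amount to simple per-PGY summaries of simulator attempts; a sketch with hypothetical data is shown below.

```python
# Hypothetical simulator export: mean composite score and proportions of attempts
# scoring >75 ("competent") and >90 ("high performance"), grouped by PGY year.
import pandas as pd

attempts = pd.DataFrame({
    "pgy": [1, 1, 2, 2, 3, 3],
    "composite_score": [58, 66, 70, 74, 81, 92],
})

summary = attempts.groupby("pgy")["composite_score"].agg(
    mean_score="mean",
    competent=lambda s: (s > 75).mean(),
    high_performance=lambda s: (s > 90).mean(),
)
print(summary)
```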
Results:
A total of 11 residents were included. Across all levels, 806 simulation attempts were recorded. Mean composite scores significantly increased with each PGY level (PGY-1: 61.6 vs PGY-2: 71.9 vs PGY-3: 82.1, p=0.048). The percentage of attempts with competent scores (>75) and high-performance scores (>90) also significantly increased with each PGY year (competent: 9.4% vs 15.1% vs 20.1%, p<0.001; high-performance: 10.8% vs 17.9% vs 23.1%, p<0.001). PGY-1 and PGY-2 residents also spent more time on the robotic console than PGY-3 residents (9 vs 9 vs 5 minutes; p=0.002).
Conclusions:
Robotic simulation data provide an objective method to assess resident performance and engagement. Our results show progressive improvement across PGY levels. Integration of simulator analytics into residency curricula may enhance competency-based training and promote data-driven feedback in surgical education.

A SIMULATION-BASED LAPAROSCOPIC VENTRAL HERNIA CURRICULUM IMPROVES SURGICAL RESIDENTS’ TECHNICAL PERFORMANCE
Sangrag Ganguli, MD1, Kristine Kuchta, MS2, Colin Johnson, MD2, Syed A Mehdi, MBBS2, Aram Rojas, MD2, Alessia Vallorani, MD2, Arjun Thapa Chhetri, BVSc2, Melissa E Hogg, MD2, Stephen Haggerty, MD2; 1University of Chicago Medical Center, 2Endeavor Health
Introduction
Abdominal wall hernias remain a common challenge for general surgeons. Laparoscopic ventral hernia repair offers lower wound complication rates with similar recurrence compared to open repair, making surgery resident proficiency essential. This study demonstrates the efficacy of a simulation-based laparoscopic ventral hernia repair curriculum at an academic surgical residency program.
Methods
General surgery residents completed a simulation-based module for laparoscopic ventral hernia repair. Junior residents (PGY2-3) participated early in training and repeated the module in their senior years (PGY4-5). Participants provided demographic, prior exposure, and sleep/fatigue data. Residents then completed a pre-test survey assessing confidence in key operative steps. They performed a simulated laparoscopic ventral hernia repair, self-scored their performance, and were concurrently evaluated by an experienced proctor. After a mentored feedback session, residents repeated the simulation with both self- and proctor-assessment.
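The pre/post design above lends itself to paired comparisons; the sketch below uses hypothetical scores and assumes paired t-tests, which the abstract does not specify.

```python
# Hypothetical paired scores per resident (proctor pre/post, and self vs. proctor at pre-test).
from scipy.stats import ttest_rel

proctor_pre = [21, 24, 22, 23, 25]
proctor_post = [30, 32, 29, 31, 33]
self_pre = [24, 26, 25, 24, 27]

print(ttest_rel(proctor_pre, proctor_post))  # improvement after mentored feedback
print(ttest_rel(self_pre, proctor_pre))      # self-score vs. proctor score at pre-test
```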
Results
55 general surgery residents participated in the study – 2 (3.6%) PGY-2, 24 (43.6%) PGY-3, 24 (43.6%) PGY-4, and 5 (9.1%) PGY-5 residents. Post-test scores were significantly higher for both self-evaluation (30.0 vs. 25.1; p<0.0001) and proctor evaluation (31.1 vs. 22.7; p<0.0001), as was resident confidence (43.1 vs. 36.4; p<0.0001). Residents scored themselves higher than evaluators on pre-test (25.1 vs. 22.7; p<0.001), but this difference disappeared on post-test (30.0 vs. 31.1; p=0.09). Evaluators reported greater comfort with residents performing the repair independently than residents’ self-evaluation (4.1 vs. 3.6; p<0.05). Performance on this module did not differ by PGY level, perceived difficulty, self-reported comfort with the procedure, prior hernia repair exposure, or video game experience. Residents with more sleep (> 7 hours) performed better on mesh positioning (4.8 vs. 4.4; p<0.05). Fatigued residents showed greater improvement overall (9.3 vs. 7.3, p=0.04). Qualitative feedback highlighted the operative practice as a strength with the adhesiolysis simulation as the main weakness of the module.
Conclusion
A simulation-based laparoscopic ventral hernia curriculum may be beneficial in improving performance on technical and anatomy-based tasks. Factors such as year in training, prior exposure, fatigue, and perceived difficulty were not associated with performance on the module. Subjective analysis showed the module was especially helpful for practicing key steps of a laparoscopic ventral hernia repair.
CHARACTERIZATION OF SURGICAL SKILL IMPROVEMENT USING GESTURES IN THE ADVANCED TRAINING IN LAPAROSCOPIC SKILLS (ATLAS) NEEDLE HANDLING TASK
Sofia Garces Palacios, MD1, Sharanya Vunnava, BS1, Shreya Vunnava, BS1, Madhuri Nagaraj, MD2, Kaustubh Gopal1, Daniel J Scott, MD1, Ganesh Sankaranarayanan, PhD1; 1University of Texas Southwestern Medical Center (SSO), 2University of Colorado Anschutz School of Medicine
Introduction
The Advanced Training in Laparoscopic Suturing Skills (ATLAS) is a structured curriculum designed to enhance laparoscopic suturing skills beyond the fundamental level. The needle handling task (Task 1) requires maneuvering a needle through six variably angled holes on a circular model. This study evaluates skill improvement during proficiency-based training on the ATLAS needle handling task, as measured by surgical gestures.
Methods
A retrospective video review was conducted using data from an IRB-approved proficiency-based study. Fifteen first-year medical students were randomized into a training group (n = 10) and a control group (n = 5). All participants completed pre- and post-tests. The training group proceeded through Fundamentals of Laparoscopic Surgery (FLS) to proficiency, followed by ATLAS training; the control group received no additional training. Trained independent raters scored all videos using eight predefined needle handling gestures (needle reposition, control, orientation, grasping, withdrawal, motion, force, and trajectory), rated on a 3-point scale (low, average, excellent). Using Messick’s unitary framework, internal structure validity was assessed using the intraclass correlation coefficient (ICC) for inter-rater agreement. A two-way mixed ANOVA with Bonferroni post-hoc tests analyzed between-group performance. A generalized additive mixed model (GAMM) was used to analyze the learning curve.
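The inter-rater agreement step can be sketched as follows, with hypothetical gesture ratings; pingouin's `intraclass_corr` is one readily available implementation and reports several ICC forms, so the specific form used in the study is not assumed.

```python
# Hypothetical ratings: two raters scoring the same videos on a 3-point gesture scale.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "video": ["v1", "v1", "v2", "v2", "v3", "v3"],
    "rater": ["r1", "r2", "r1", "r2", "r1", "r2"],
    "score": [2, 2, 3, 3, 1, 2],
})

icc = pg.intraclass_corr(data=ratings, targets="video", raters="rater", ratings="score")
print(icc[["Type", "ICC", "pval"]])
```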
Results
High inter-rater agreement was achieved between the graders (ICC = 0.98, p < 0.001). Mixed ANOVA showed significant main effects for group (p < 0.001) and time (p = 0.026), with no significant interaction (p = 0.236). Post-hoc tests indicated the training group significantly outperformed the control group at both pre-test (p = 0.002) and post-test (p < 0.001). Within-group comparisons revealed significant improvement over time in the training group (p = 0.007) but not in the control group (p = 0.535). The GAMM revealed significant non-linear improvement across training trials (p < 0.001), explaining 56.2% of the variance (R2 = 0.5). Scores increased sharply in early trials and plateaued at trial 13, with an overall gain of 30 points in gesture score.
Conclusion
ATLAS training significantly improves needle handling proficiency at the gesture level. Gesture-based assessment offers a sensitive method for tracking surgical skill acquisition across training sessions.

