ePoster
(P001) THEMATIC ANALYSIS OF INTRAOPERATIVE TEACHER-LEARNER COMMUNICATION: HOW RELIABLE IS AI?
Michael J Ferzoco, Lauren Lewis, MEng; Washington University School of Medicine in St. Louis
Introduction
Evaluating intraoperative teaching and resident autonomy remains a significant challenge in surgical education. Dialogue between attendings and trainees is a rich data source, but analysis is constrained by labor-intensive qualitative coding. We therefore evaluated the reliability of a GPT-based artificial intelligence (AI) assistant in performing line-by-line qualitative coding of intraoperative dialogue as an automated, scalable assessment tool.
Methods
Five intraoperative audio recordings from distinct attending-trainee dyads were transcribed using automated speech recognition. A 35- to 65-minute de-identified segment from each recording, selected for dense interaction, was split into sentences. Using a 36-category codebook, the AI assistant coded each transcript twice. Reliability between AI runs was calculated using percent agreement and Cohen’s kappa (κ). Per-code precision was assessed with the Jaccard index.
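The reliability metrics named above are standard and simple to compute. A minimal sketch, assuming each run assigns a single code per transcript line (the per-code Jaccard form treats each code as a set of line indices tagged with it in each run):

```python
from collections import Counter

def percent_agreement(run1, run2):
    """Fraction of transcript lines assigned the same code in both runs."""
    return sum(a == b for a, b in zip(run1, run2)) / len(run1)

def cohens_kappa(run1, run2):
    """Chance-corrected agreement between two coding runs."""
    n = len(run1)
    po = percent_agreement(run1, run2)
    c1, c2 = Counter(run1), Counter(run2)
    # Expected chance agreement from each run's marginal code frequencies.
    pe = sum(c1[code] * c2[code] for code in c1) / (n * n)
    return (po - pe) / (1 - pe)

def jaccard(lines_run1, lines_run2):
    """Per-code Jaccard index: overlap of the line sets tagged with a code."""
    s1, s2 = set(lines_run1), set(lines_run2)
    return len(s1 & s2) / len(s1 | s2)
```

For example, two runs coding four lines as `["a","a","b","b"]` and `["a","b","b","b"]` give 75% agreement and κ = 0.5; a code applied to lines {1, 2, 3} in one run and {2, 3, 4} in the other has a Jaccard index of 0.5.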
Results
Across 2,245 transcript lines, between-run reliability demonstrated 79% agreement and κ of 0.75 (p<0.05). Per-code agreement varied by code complexity. Straightforward sentences were coded more consistently than those requiring implicit reasoning or contextual interpretation (Table 1).
Conclusions
The AI assistant achieved substantial agreement for intraoperative linguistic coding, supporting the reliability of automated dialogue analysis as a scalable assessment tool. Lower reliability for abstract sentences highlights a need to refine code definitions and transcript accuracy to better capture nuanced teaching. This work represents a key step toward developing automated tools for real-time assessment of intraoperative teaching and providing structured feedback to trainees to advance autonomy.
| Code | Definition | Jaccard Index |
|---|---|---|
| Instrument Request | Attending or trainee asks for instrument | 0.82* |
| Acknowledgement/Affirmation | Confirmation or receipt of information | 0.81* |
| Off Target Talking | Conversations not related to the operation | 0.77* |
| Instruction | Directive guidance on next steps or actions | 0.76* |
| Explanation/Education | Explanations and teaching points | 0.57 |
| Feedback | Performance-oriented comments that reinforce or correct actions | 0.57 |
| Shared Mental Modeling | Explicit alignment on goals or next steps | 0.49 |
| Case Complexity/Anatomy | Characterizing difficulty/anatomical variation | 0.32* |
*p<0.05
(P002) FROM OR DIALOGUE TO ACTIONABLE FEEDBACK: A SCALABLE METHOD FOR ANALYZING INTRAOPERATIVE TEACHING
Blake T Beneville, MD, Katharine E Caldwell, MD, MSCI, Jenna Bennett, BS, Mohamed Jama, BS, Cory Fox, BS, Mike Ferzoco, BS, Lauren Lewis, BS, Jonathan Tong, BS, Michael M Awad, MD, PhD, MHPE; Washington University in St. Louis
Background: Feedback is central to surgical skill acquisition, yet trainees consistently report receiving less feedback than they need. Much of the most specific feedback occurs as real-time dialogue in the operating room and is quickly lost. Traditional qualitative review of these conversations is too time-intensive to scale. We sought to develop a practical method to capture, analyze, and return operative teaching feedback to trainees and programs.
Methods: Attending–trainee dialogue from 25 general surgery operations using open and minimally invasive techniques was audio-recorded, transcribed, manually corrected, and de-identified. Transcripts were reformatted to one sentence per line and coded using a 36-category codebook focused on feedback, coaching behaviors, questions, and off-target talk. A small set of pilot transcripts were manually coded to calibrate an AI assistant (GPT-based) within a human-in-the-loop workflow. The assistant then coded the remaining transcripts in reviewable batches, and investigators confirmed or corrected codes. Throughput and error rates were compared with manual coding alone.
Results: The AI-assisted workflow required ~10 minutes per 100 transcript lines versus >60 minutes for manual coding, operating at <20% of the manual time and saving an estimated 239 hours on the full dataset (28,799 lines). Overall error rate was 2.0% (≈98% accuracy), with calibrated batch accuracy ≥95%. Most errors involved boundary cases (for example, distinguishing interjections from off-target talk, or labeling brief words like “good/okay” as feedback versus simple agreement) and occasional misattribution of teaching directed to a third party. These were readily corrected during review.
Conclusions/Implications: Capturing and analyzing OR dialogue with an AI-supported, human-verified process makes it feasible to return structured, case-specific feedback to trainees at scale. Programs can generate rapid post-case summaries (for example, number and type of feedback statements, coaching strategies used, and missed opportunities), support faculty development with objective teaching profiles, and monitor learning climate over time. This approach turns ephemeral intraoperative teaching into actionable feedback for learners while preserving rigor and faculty time.
(P003) PUBLICLY AVAILABLE CHATGPT SIGNIFICANTLY INCREASED ARTIFICIAL INTELLIGENCE AUTHORSHIP IN PERSONAL STATEMENTS USED IN GENERAL SURGERY RESIDENCY APPLICATIONS
Sana Khan, Dana Cooley, Miguel Tobon, Eliza Beal, David Bouwman, David Edelman; Wayne State University School of Medicine
Background
The use of artificial intelligence authorship (AIA) remains controversial throughout the educational spectrum and could be considered a disruptive tool in the generation of the personal statement (PS). We hypothesized that the use of AIA in candidates' PSs for general surgery residency applications increased significantly after ChatGPT became widely available.
Methods
All applications to a large, urban, academically affiliated general surgery residency program during three application cycles (2022-2025) were included. Because ChatGPT became publicly available on November 30, 2022, the 2022-2023 application cycle was considered the PRE-ChatGPT group (PRE), and the 2023-2024 and 2024-2025 cycles combined were considered the POST-ChatGPT group (POST). Appropriate IRB and organizational permissions were obtained for this study. PSs were extracted, blinded, and analyzed using commercially available artificial intelligence detection software (Originality.ai). Additional items analyzed included type of medical school (United States vs. international), USMLE Step 2CK scores, and visa status. Appropriate statistical analyses were applied; p < 0.05 was considered significant.
Results
A total of 4596 PSs were included in this study of which 1822 (40%) were PRE and 2774 (60%) were POST. Overall, AIA was used in 2256 (49%) PSs. AIA usage was significantly higher in the POST group (1853/2774 = 67%) compared to the PRE group (403/1822 = 22%), p<0.01. Within the POST group, AIA was used significantly more in candidates attending international medical schools and non-US citizens (p<0.05). No correlation in AIA usage was seen based on Step 2CK scores (p>0.05). When analyzing the two application cycles within the POST group, AIA usage remained stable among US applicants and increased with international applicants.
Conclusions
AIA increased significantly after ChatGPT became publicly available and is now commonly used. Candidates applying from international medical schools and non-US citizens use AIA more often. AIA usage compromises the value of PSs in the consideration of a candidate for a general surgery residency position. A continued national conversation is warranted to develop policy around artificial intelligence authorship, including the integration of detection algorithms and the redefinition of evaluation metrics for PSs.
(P004) YOU DON’T SAY? AI FEEDBACK IMPROVES RESIDENT COMMUNICATION QUALITY
William D Rieger, MD1, Renee W Green, MD1, Marissa N Thibodeaux1, Anne R Jeckovich, MD1, Rogith Deevakar, MBBS, PhD2, Peggy H Hsieh, PhD, MEd1, Toufeeq Syed, PhD2, Allison R Ownby, PhD, MEd1, Sasha D Adams, MD1, Lillian S Kao, MD, MS, MBA, FACS1, Krislynn M Mueck, MD, MS, FACS1; 1McGovern Medical School at UTHealth Houston, 2McWilliams School of Biomedical Informatics at UTHealth Houston
Introduction: Effective physician-patient communication is crucial for optimal outcomes and is an entrustable professional activity for residents. However, few tools exist to both measure and improve that communication. We aimed to develop an artificial intelligence (AI)-based communication feedback tool and to test its ability to improve the complexity and quality of transcribed interactions between residents and standardized patients (SPs). We hypothesized that application of AI-based feedback would improve the complexity and quality of resident communications with SPs.
Methods: We performed a cross-sectional study of surgery residents at an urban academic program. Residents completed a simulated, beta-tested surgical case with SPs. Communications were recorded, transcribed, and analyzed for complexity using the Flesch Kincaid Grade Level (FKGL) readability instrument and quality using the Ensuring Quality Information for Patients (EQIP) assessment, with lower readability grade level and higher quality scores preferred. Two-thirds of the transcripts were used to refine an enterprise-grade Copilot (Microsoft) Generative Pre-trained Transformer (GPT)-4 AI agent to analyze resident-SP communication and then provide feedback to improve communication complexity and quality. AI-based feedback was applied to the remaining third of transcripts and these adjusted transcripts were re-analyzed for complexity via FKGL and quality via EQIP. Descriptive and univariate statistics were performed.
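The FKGL instrument used above is a fixed formula over sentence, word, and syllable counts. A minimal sketch follows; the syllable counter is a crude vowel-group heuristic (production readability tools use pronunciation dictionaries), so scores will differ slightly from any particular commercial implementation:

```python
import re

def count_syllables(word):
    """Crude heuristic: count vowel groups, dropping a trailing silent 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def fkgl(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Shorter sentences and fewer syllables per word lower the grade level, which is why a mean FKGL near 5 indicates patient-friendly speech.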
Results: Of 23 initial resident-SP transcripts, 16 were used for AI training and 7 were used for testing. Among the original testing transcripts, the mean readability grade level using FKGL was 5.2 (SD 0.8), while the mean EQIP score was 73.8% (SD 2.2). After feedback application and adjustment, the transcript mean readability grade level was 5.0 (SD 0.4) and mean EQIP score was 80.6% (SD 1.9%). The mean difference in readability after adjustment was -0.25 grade levels (CI -1.4 to 0.56, p=0.49), and the mean difference in quality was 6.8% (CI 5.2 to 8.4, p<0.01).
Conclusion: In this single-program study, an AI-based feedback tool was developed and showed promise at increasing the quality of resident communication transcripts. While speech complexity was not significantly decreased, additional AI model adjustments may improve this feedback. Further evaluation is required to assess AI feedback feasibility and effectiveness with residents in targeted communication education.
(P005) MOCK ORALS REIMAGINED: CHATGPT’S ROLE IN BOARD EXAM PREPARATION
Jessica L Masch, MD, John R Austin, MD, Michelle Lippincott, MD, Allison E Berndtson, MD, Jarrett E Santorelli, MD; University of California San Diego
Background:
AI is rapidly transforming medical education, yet its role in surgical training remains uncertain. The ABS Certifying Exam demands real-time clinical reasoning, traditionally refined through faculty-led mock orals or expensive review courses. ChatGPT-4o, with its dynamic response capabilities, could serve as a scalable alternative. This study evaluates its effectiveness in simulating oral board scenarios compared to traditional techniques.
Methods:
Eight senior general surgery residents (PGY 4–5) completed two mock oral sessions: one with trauma faculty and one using ChatGPT-4o. ChatGPT was pre-programmed to present cases, await resident responses at decision points, and adapt scenarios accordingly without providing immediate feedback. Feedback was given only at the end of each session. Residents then rated authenticity, utility, and likelihood of future use.
Results:
ChatGPT reliably emulated the ABS exam format when appropriately prompted. There was no significant difference in residents' ratings of case similarity (p=0.13). However, residents felt using ChatGPT was less valuable for preparation (p=0.03) and reported being less likely to use ChatGPT for future preparation (p<0.01), citing concerns about authenticity (p=0.01), beneficial feedback (p=0.03), and information accuracy compared to faculty-led sessions (p<0.01).
Conclusion:
ChatGPT can generate structured oral board scenarios, highlighting AI’s potential in surgical education. Nonetheless, skepticism regarding authenticity and accuracy limits its adoption. Refining AI-driven mock orals—potentially by integrating verified ABS study materials—could enhance trust and usability, making AI a scalable, cost-effective adjunct in board preparation.
(P006) EARLY EXPERIENCES AND PERCEIVED BARRIERS TO IMPLEMENTATION OF ENTRUSTABLE PROFESSIONAL ACTIVITIES (EPAS) IN GENERAL SURGERY RESIDENCY PROGRAMS
Tasha Posid, MA, PhD1, Osama Elsawy, DO2, Jenny Guido, MD3, Lan Vu, MD4, Leslie Haislip5, Lisa Cunningham, MD1, Theresa N Wang, MD6, Emily Huang, MD1, Ellen Hagopian, MD, MHPE, MEd, MESA, FACS, FSSO7, Minna M Wieck, MD8, Justine Broecker, MD9, Kshama Jaiswal, MD10; 1The Ohio State University Wexner Medical Center, 2Saint Joseph's University Medical Center, 3Sanford Health, 4UCSF Benioff Children’s Hospital, 5Eastern Carolina University, 6University of Washington, 7University of Toledo College of Medicine and Life Sciences, 8University of California-Davis Health, 9University of California San Francisco, 10University of Utah
Introduction: Entrustable Professional Activities (EPAs) are a core component of competency-based education (CBE), offering a structured, workplace-based assessment of observable trainee behaviors. Despite recent adoption by the American Board of Surgery, and initial validity evidence across surgical specialties, implementation challenges remain underexplored. This study examines early experiences and perceived barriers to EPA use among general surgery residents and faculty.
Methods: A preliminary survey was distributed via REDCap to members of two ASE committee working groups and their respective institutions. Respondents included general surgery residents (n=30) and faculty (n=7). The survey assessed frequency and modality of EPA use, perceived ease of use, feedback practices, and perceived barriers to implementation. Descriptive and comparative analyses were performed using summary statistics, independent-sample t-tests or chi-square tests to compare residents and faculty, and single-sample tests to evaluate survey responses against predetermined benchmarks for satisfaction and ease of use.
Results: All respondents (100%) accessed their EPA platforms via mobile app, primarily for operative case assessments (Mean of EPAs completed: Residents = 27.6, Faculty = 31.4, p>0.1). Both groups reported the platform as “very easy” to use (p<0.05 vs. neutral). Residents reviewed completed EPAs with attendings less than half the time, and only 69% reported using EPA feedback for self-directed improvement. The most frequently cited barriers overall (>50% collectively) were difficulty developing the habit to complete EPAs, high cognitive load, and burnout (Figure 1). Residents more strongly perceived low faculty buy-in (p<0.05) and limited faculty response rates (p<0.05) as barriers. Faculty, conversely, identified the lack of linkage between EPAs and milestones (p<0.05) as a key limitation. Both groups were equally uncertain whether EPAs promote resident autonomy (p>0.1), though most residents (77%, p<0.05) viewed them as a valuable feedback tool. Qualitative comments highlighted that faculty reminders and structured expectations improved engagement.
Conclusion: Early implementation of EPAs in general surgery appears feasible and educationally valuable but is limited by workflow challenges and perceptions of low faculty buy-in. While these preliminary findings are based on a small sample, we anticipate additional responses through a forthcoming ASE-wide survey.

(P007) ALIGNMENT BETWEEN OVERALL AND COMPONENT ENTRUSTMENT SCORES: INSIGHTS FROM GENERAL SURGERY EPA ASSESSMENTS
Kara L Faktor, MD, MSc1, Jessica R Santos-Parker, MD, MS, PhD1, Keli S Santos-Parker, MD, MS, PhD1, Patricia O'Sullivan, EdD1, Olle ten Cate, PhD2, Lan Vu, MD1; 1University of California San Francisco, 2Utrecht University
Introduction: A four-level entrustment-supervision scale (limited participation, direct supervision, indirect supervision, practice ready) is used to assess Entrustable Professional Activities (EPAs) in general surgery residency. The narrative description of each entrustment level comprises several components. We aimed to understand how scores on discrete components of intra-operative performance contribute to the overall EPA assessment score, to gain insight into the aspects of resident intra-operative performance that most strongly influence faculty entrustment decisions across PGY levels.
Methods: Assessments on four common general surgery EPAs (gallbladder disease, appendicitis, inguinal hernia, small bowel obstruction) were selected from intra-operative EPA assessments for a single general surgery residency program in the 2024-2025 academic year. The American Board of Surgery (ABS) EPA narrative descriptions were deconstructed into component domains, including anatomy, operative steps and technical skill. We fit a linear regression model treating overall EPA assessment scores (1-4) as continuous with cluster-robust standard errors by attending. Case difficulty was coded as an ordered trend (straightforward, moderate, complex).
Results: A total of 758 EPA assessments were analyzed, including 348 gallbladder disease, 202 appendicitis, 171 inguinal hernia and 37 small bowel obstruction. Case complexity increased with PGY-level (OR 2.92 - 6.12 PGY2-5 compared to PGY1, p<0.001). Each step in increased complexity was associated with a -0.21 change in expected overall EPA assessment score (SE 0.04, p<0.001). Overall EPA assessment scores increased with PGY-level with significant increases in the PGY-4 (β = 0.47, p<0.001) and PGY-5 (β = 0.72, p<0.001) years. When holding all other component scores constant, technical skill had the strongest correlation with the overall EPA assessment score (β = 0.34, p<0.001), followed by operative steps (β = 0.27, p<0.001). Anatomy demonstrated a minimal and statistically borderline association with the overall EPA assessment score (β = 0.06, p = 0.051). The model explained 67.8% of the variance in overall EPA assessment scores.
Conclusion: Overall entrustment increased with PGY-level and decreased with case complexity. Technical skill and knowledge of operative steps demonstrated the strongest influence on overall faculty entrustment. However, 32.2% of the variance was unexplained, suggesting faculty rely on additional factors when making entrustment decisions.
(P008) IMPLEMENTATION OF ENTRUSTABLE PROFESSIONAL ACTIVITIES IN COMPLEX GENERAL SURGICAL ONCOLOGY FELLOWSHIP
Anneliese N Hierl, MD, LaDonna Kearse, MD, Kristen Jogerst, MD, Paul Graham, MD, MS, Naruhiko Ikoma, MD, MS, Jessica Maxwell, MD, MBA, Christopher Scally, MD, MS, Ching-Wei Tzeng, MD, MS, Brian Bednarski, MD, MEHP, Heather Lillemoe, MD; University of Texas MD Anderson Cancer Center
Background:
Entrustable Professional Activities (EPAs) provide a structured framework for competency-based assessment. While increasingly utilized in residency, their adoption in surgical fellowships remains limited. This study evaluates EPA implementation in a Complex General Surgical Oncology (CGSO) fellowship, assessing perceived barriers and benefits, and measuring engagement during initial EPA adoption.
Methods:
A pre-implementation survey was distributed to all faculty and fellows in a single CGSO fellowship via REDCap. Usage data were then extracted from the Firefly application to assess frequency of EPA completion and entrustment level concordance between faculty and fellows. Data were analyzed using descriptive statistics. Post-implementation surveys and focus groups are planned to better understand EPA utilization and faculty assignment of entrustment levels.
Results:
The pre-implementation survey was completed by 100% (14/14) of fellows and 61% (41/67) of eligible faculty. Respondents anticipated the following barriers to EPAs: forgetting to initiate/reciprocate EPAs (64%), change in workflow (63%), time constraints (54%), and perceived burden to others (30%). Potential benefits included structured feedback across phases of care (75%), formalized feedback (73%), clearer learning objectives (57%), and improved communication (50%). Prior to implementation, most faculty moderately agreed they provided adequate feedback to trainees across the phases of care, while fellows reported moderate comfort requesting feedback.
In the initial two months following EPA implementation, 260 EPAs were completed: 74 pre/nonoperative, 113 intraoperative, and 73 postoperative. Twenty-nine of 63 eligible faculty and all 14 eligible clinical fellows participated. At least one assessment was completed for each of the 12 core CGSO EPA types: the majority initiated by faculty (Figure). Among assessments with both faculty and fellow entrustment ratings (n=53), 25 (47%) were not concordant, with all but one differing by just one level of entrustment. Fifteen (60%) had faculty ratings with lower entrustment than rated by the fellow.
Conclusions:
Preliminary findings highlight strong perceived value of CGSO EPAs and robust initial uptake in our program. However, practical implementation barriers remain. Ongoing analysis will evaluate engagement trends, concordance between fellow and faculty evaluations, and perceived barriers to implementation.
Figure 1: Entrustable Professional Activity Heatmap: Core EPA Type by Entrustment Level Rating

(P009) MORE THAN A METRIC: SURGEONS IN THE EARLY ROBOTIC LEARNING CURVE REPORT VALUE AND IMPROVED CONFIDENCE FROM FEEDBACK AND TRAINING RECOMMENDATIONS BASED ON OBJECTIVE PERFORMANCE INDICATORS (OPIS)
Gretchen P Jackson, MD, PhD, FACS, FACMI, FAMIA1, Jeffrey Voien2, Karlis Draulis2, Andrew Yee, PhD2, Michael M Awad, MD, PhD, MHPE, FACS3; 1Intuitive Surgical / Vanderbilt University Medical Center, 2Intuitive Surgical, 3Washington University, Department of Surgery and Institute for Surgical Education (WISE)
Introduction: Providing objective, scalable feedback to new robotic surgeons is a critical challenge in surgical education. Traditional case observation is subjective and resource intensive. We hypothesized that objective performance indicator (OPI) reports, derived from robotic system data and delivered early in the learning curve, would be perceived by surgeons as valuable and effective educational tools.
Methods: A mixed-methods prospective study of robotic-naïve, practicing surgeons was conducted. Participants received reports detailing four case-level OPIs (clutch and camera metrics) after their 5th, 15th, and 25th robotic cases. Reports included percentile scores benchmarked against peers matched by training and experience. Surgeons scoring below the 50th percentile received training recommendations (e.g., simulation, video review). Perceptions were evaluated with qualitative surveys.
Results: 110 surgeons were enrolled; 107 completed the study, and 66 responded to surveys. OPI reports and recommendations were well received. Most surgeons agreed (agree/strongly agree) that reports provided useful information about technical skills (84%), identified areas for improvement (83%), and would positively impact their practice (80%). 88% reported that it was important to receive OPI reports in training pathways, with the desired frequency of annually (6%), quarterly (67%), monthly (20%), twice monthly (5%), and daily (2%). 84% reported wanting to receive training recommendations, with most agreeing that recommendations were valuable (82%) and improved technical skills (80%), efficiency (78%), and confidence (75%). Desired frequency for recommendation delivery was quarterly (62%), monthly (22%), twice monthly (11%), weekly (2%) and daily (2%). 45% of surgeons receiving recommendations completed them, citing time (53%) and simulation access (25%) as barriers. No significant differences in OPI improvement were found between participants who completed recommendations and those who did not. Qualitative themes revealed a desire for objective data but a need to translate metrics into actionable practice.
Conclusions: This study demonstrated that OPI-based feedback and training recommendations were highly valued and effective educational tools for practicing surgeons on the early robotic surgery learning curve. A primary driver of this educational benefit, which improved confidence, is likely the delivery of objective, benchmarked feedback, rather than the completion of remedial tasks. This "learning loop" model represents a scalable paradigm for continuing surgical education.
(P010) WE CAN DO BETTER THAN SUMMING OR AVERAGING OSATS: A NOVEL COMPOSITE SKILL METRIC TO CORRECT FOR UNEQUAL ITEM WEIGHTS, RATER EFFECTS, AND OPERATIVE DIFFICULTY.
Ryan Chou1, Alexandra J Berges2, Kofi O Boahene2, Jessica H Maxwell3, John R Wanamaker4, Patrick J Byrne5, Ira D Papel2, Theda C Kontis2, Matthew S Holden6, Gregory D Hager1, Sonya Malekzadeh4, Lisa E Ishii2, S. Swaroop Vedula1, Masaru Ishii2; 1Johns Hopkins University, 2Johns Hopkins University School of Medicine, 3University of Pittsburgh Medical Center, 4MedStar Georgetown University Hospital, 5Cleveland Clinic, 6Carleton University
Introduction
Surgeons’ operating room skill is often evaluated using the Objective Structured Assessment of Technical Skills (OSATS) scale, with seven items each rated 1-5. Conventionally, OSATS scores are calculated as the sum or average of the item scores, implicitly assuming equal item weights. Although average OSATS correlates with short-term postoperative outcomes and with surgeons’ training or experience, the assumption of equal item weighting is unsubstantiated, and averaging does not correct for rater effects or operative difficulty. As a result, conventional scores are less useful for monitoring skill learning, entrustment, privileging, and credentialing. Our objective was to address these limitations with a novel composite metric of surgeons’ skill derived from OSATS.
Methods
We used data from a multisite prospective cohort study of residents performing nasal septoplasty. After each procedure, the supervising attending assessed the resident’s skill using a modified OSATS scale, three questions on operative difficulty, and three intraoperative errors (“elevating the flap in the wrong plane”, “flap tear(s)”, and “leaving residual deflection”). We fit structural equation models to estimate a latent skill measure from OSATS item scores and latent operative difficulty from the three questions. We simultaneously estimated random effects for rater effects. This allowed adjusting the latent skill score for both operative difficulty and rater effects. We fit multivariate probit regressions to predict the probability of any intraoperative error from the adjusted latent skill score and average OSATS. We used scatterplots to describe the probability of intraoperative error.
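The probit link in these error models maps a linear predictor to a probability through the standard normal CDF. A minimal sketch of how a slope coefficient translates into error probability; the intercept value below is hypothetical (the abstract reports only the slope coefficients):

```python
from math import erf, sqrt

def probit_probability(intercept, coef, skill):
    """P(error) = Phi(intercept + coef * skill), Phi = standard normal CDF."""
    z = intercept + coef * skill
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

For example, with a hypothetical intercept of -1.0 and the reported flap-plane slope of -0.7364, raising the adjusted latent skill score from 0 to 1 lowers the predicted error probability from roughly 0.16 to roughly 0.04, illustrating how a negative coefficient means "more skill, fewer errors."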
Results
We analyzed data from 41 trainees assessed by seven faculty raters in 188 procedures. The latent skill score had probit regression coefficients of -0.7364, -0.1691, -0.3301 for the three errors, respectively, i.e., as skill improved, the probability of error decreased. The coefficients for OSATS average were -0.0842, 0.2398, 0.1972, i.e., the probability of error may increase as skill improved. The latent skill score better discriminated probability of error than OSATS average (Figure 1).
Conclusions
A latent, weighted skill score from OSATS items, adjusted for rater effects and operative difficulty, predicts intraoperative error. Averaging OSATS is insufficient to model the probability of intraoperative error.

(P011) ROBOTIC SIMULATION AS A MARKER OF RESIDENT PROGRESSION IN GENERAL SURGERY
Vikram Krishna, MD, Drew Bolster, MD, Raffaele Rocco, MD, Philicia Moonsamy, MD, Harmik J Soukiasian, MD, Farin Amersi, MD, Andrew R Brownlee, MD; Cedars-Sinai Medical Center
Introduction:
Competency-based assessment in general surgery training is lacking. Robotic simulation provides an opportunity to use objective metrics to assess proficiency. Our study evaluated robotic simulator performance as a marker of resident progression.
Methods:
All PGY-1 to PGY-3 general surgery residents at a single academic institution were enrolled in a curriculum-based robotics simulation program. Robotic skills data were prospectively collected from August to October 2025. Residents’ total number of attempts, mean composite score, proportion of average scores >75 per skill (“competent performance”), and proportion of average scores >90 (“high performance”) were recorded. The primary outcome was mean composite score progression by PGY year. Secondary outcomes included competent performance and high performance by PGY year.
Results:
A total of 11 residents were included. Across all levels, 806 simulation attempts were recorded. Mean composite scores significantly increased with each PGY level (PGY-1: 61.6 vs PGY-2: 71.9 vs PGY-3: 82.1, p=0.048). The percentage of attempts with competent scores (>75) and high-performance scores (>90) also significantly increased with each PGY year (competent: 9.4% vs 15.1% vs 20.1%, p<0.001; high-performance: 10.8% vs 17.9% vs 23.1%, p<0.001). PGY-1 and PGY-2 residents also spent more time on the robotic console than PGY-3 residents (9 vs 9 vs 5 min; p=0.002).
Conclusions:
Robotic simulation data provide an objective method to assess resident performance and engagement. Our results show progressive improvement across PGY levels. Integration of simulator analytics into residency curricula may enhance competency-based training and promote data-driven feedback in surgical education.

(P012) A SIMULATION-BASED LAPAROSCOPIC VENTRAL HERNIA CURRICULUM IMPROVES SURGICAL RESIDENTS’ TECHNICAL PERFORMANCE
Sangrag Ganguli, MD1, Kristine Kuchta, MS2, Colin Johnson, MD2, Syed A Mehdi, MBBS2, Aram Rojas, MD2, Alessia Vallorani, MD2, Arjun Thapa Chhetri, BVSc2, Melissa E Hogg, MD2, Stephen Haggerty, MD2; 1University of Chicago Medical Center, 2Endeavor Health
Introduction
Abdominal wall hernias remain a common challenge for general surgeons. Laparoscopic ventral hernia repair offers lower wound complication rates with similar recurrence compared to open repair, making surgery resident proficiency essential. This study demonstrates the efficacy of a simulation-based laparoscopic ventral hernia repair curriculum at an academic surgical residency program.
Methods
General surgery residents completed a simulation-based module for laparoscopic ventral hernia repair. Junior residents (PGY2-3) participated early in training and repeated the module in their senior years (PGY4-5). Participants provided demographic, prior exposure, and sleep/fatigue data. Residents then completed a pre-test survey assessing confidence in key operative steps. They performed a simulated laparoscopic ventral hernia repair, self-scored their performance, and were concurrently evaluated by an experienced proctor. After a mentored feedback session, residents repeated the simulation with both self- and proctor-assessment.
Results
55 general surgery residents participated in the study – 2 (3.6%) PGY-2, 24 (43.6%) PGY-3, 24 (43.6%) PGY-4, and 5 (9.1%) PGY-5 residents. Post-test scores were significantly higher for both self-evaluation (30.0 vs. 25.1; p<0.0001) and proctor evaluation (31.1 vs. 22.7; p<0.0001), as was resident confidence (43.1 vs. 36.4; p<0.0001). Residents scored themselves higher than evaluators on pre-test (25.1 vs. 22.7; p<0.001), but this difference disappeared on post-test (30.0 vs. 31.1; p=0.09). Evaluators reported greater comfort with residents performing the repair independently than residents’ self-evaluation (4.1 vs. 3.6; p<0.05). Performance on this module did not differ by PGY level, perceived difficulty, self-reported comfort with the procedure, prior hernia repair exposure, or video game experience. Residents with more sleep (> 7 hours) performed better on mesh positioning (4.8 vs. 4.4; p<0.05). Fatigued residents showed greater improvement overall (9.3 vs. 7.3, p=0.04). Qualitative feedback highlighted the operative practice as a strength with the adhesiolysis simulation as the main weakness of the module.
Conclusion
A simulation-based laparoscopic ventral hernia curriculum may be beneficial in improving performance on technical and anatomy-based tasks. Factors such as year in training, prior exposure, fatigue, and perception of difficulty were not associated with performance on the module. Subjective analysis showed the module was especially helpful for practicing key steps of a laparoscopic ventral hernia repair.
(P013) CHARACTERIZATION OF SURGICAL SKILL IMPROVEMENT USING GESTURES IN THE ADVANCED TRAINING IN LAPAROSCOPIC SKILLS (ATLAS) NEEDLE HANDLING TASK
Sofia Garces Palacios, MD1, Sharanya Vunnava, BS1, Shreya Vunnava, BS1, Madhuri Nagaraj, MD2, Kaustubh Gopal1, Daniel J Scott, MD1, Ganesh Sankaranarayanan, PhD1; 1University of Texas Southwestern Medical Center (SSO), 2University of Colorado Anschutz School of Medicine
Introduction
The Advanced Training in Laparoscopic Suturing Skills (ATLAS) is a structured curriculum designed to enhance laparoscopic suturing skills beyond the fundamental level. The needle handling task (Task 1) requires maneuvering a needle through six variably angled holes on a circular model. This study evaluates skill improvement during proficiency-based training of the ATLAS needle handling task using surgical gestures.
Methods
A retrospective video review was conducted using data from an IRB-approved proficiency-based study. Fifteen first-year medical students were randomized into a training group (n = 10) and a control group (n = 5). All participants completed pre- and post-tests. The training group proceeded through Fundamentals of Laparoscopic Surgery (FLS) to proficiency, followed by ATLAS training; the control group received no additional training. Trained independent raters scored all videos using eight predefined needle handling gestures (needle reposition, control, orientation, grasping, withdrawal, motion, force, and trajectory), rated on a 3-point scale (low, average, excellent). Using Messick’s unitary framework, internal structure validity was assessed using the intraclass correlation coefficient (ICC) for inter-rater agreement. A two-way mixed ANOVA with Bonferroni post-hoc tests analyzed between-group performance. A generalized additive mixed model (GAMM) was used to analyze the learning curve.
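The ICC for inter-rater agreement can take several forms; a two-way random-effects, single-rater ICC(2,1) is a common choice for fixed panels of raters and can be sketched directly from the rating matrix. This is an illustrative assumption, since the abstract does not state which ICC form was used:

```python
def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement, single-rater ICC(2,1).

    ratings: list of n subjects, each a list of k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Perfectly concordant raters yield an ICC of 1.0; values near the reported 0.98 indicate the graders were nearly interchangeable.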
Results
High inter-rater agreement was achieved between the graders (ICC = 0.98, p < 0.001). Mixed ANOVA showed significant main effects for group (p < 0.001) and time (p = 0.026), with no significant interaction (p = 0.236). Post hoc tests indicated the training group significantly outperformed the control group at both pre- (p = 0.002) and post-test (p < 0.001). Within-group comparisons revealed significant improvement over time in the training group (p = 0.007) but not in the control group (p = 0.535). GAMM revealed a significant non-linear improvement across training trials (p < 0.001), explaining 56.2% of the variance (R2 = 0.56). Scores increased sharply in early trials and plateaued at trial 13, with an overall gain of 30 points in gesture score.
Conclusion
ATLAS training significantly improves needle handling proficiency at the gesture level. Gesture-based assessment offers a sensitive method for tracking surgical skill acquisition across training sessions.

