Quick Shot V - AI/Recruitment
(Q036) DEEP-LEARNING COMPUTER VISION ALGORITHM FOR HAND ROLL ESTIMATION IN SURGICAL SUTURING SKILL ASSESSMENT
Jianxin Gao, Amir Mehdi Shayan, Simar P. Singh, Joe Bible, Ravikiran Singapogu, Richard E. Groff; Clemson University
Purpose: Vascular surgery education includes evaluation of open surgery suturing skill. Sensor-embedded simulators offer an approach to surgical skill evaluation, but attaching sensors to hands or instruments may interfere with performance. Recently, deep-learning algorithms have been introduced to medical education, but these algorithms often lack clinical interpretability. The goal of this study is to use deep-learning computer vision algorithms to estimate hand roll angle, previously measured using a hand-mounted inertial measurement unit (IMU), and then to use hand roll to calculate metrics for suturing skill evaluation. This yields a deep learning-based assessment process that does not require sensors on hands or instruments.
Methods: The proposed deep-learning computer vision algorithm includes a hand detection algorithm and a hand roll estimation algorithm. In each video frame, the hand detection algorithm crops the dominant hand from the frame, and the hand roll estimation algorithm then estimates the roll angle of the hand. The estimated roll angles are used to calculate suturing skill metrics, which were previously developed and validated using IMU measurements. The performance of hand roll metrics calculated from IMU measurements was compared with that of the same metrics calculated from deep-learning vision.
Results: The two deep-learning algorithms can process video at 55 frames per second on a commodity PC, allowing real-time analysis. The hand detection algorithm achieves mAP@0.95=0.99 on the test dataset, while the hand roll estimation algorithm achieves roll angle estimation errors of around 10 degrees over the entire dataset. The roll angle error is predominantly bias rather than noise, and so has little effect on the calculated metrics. The metrics provide similar statistical performance whether calculated from IMU or vision. Specifically, 4 out of the 5 metrics show significant mean differences between novice and surgeon groups. Moreover, 3 out of the 5 metrics have significantly different means between resident surgeon and attending surgeon groups in the surface condition.
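The claim that a mostly constant bias has little effect on the calculated metrics can be illustrated with a small sketch: any metric built from the range or the frame-to-frame change of the roll trace is invariant to a constant offset. The two metric definitions below are illustrative assumptions for this sketch, not the study's actual skill metrics.

```python
import numpy as np

def roll_metrics(roll_deg):
    """Two illustrative roll-based metrics: range of motion and
    total angular path length (sum of frame-to-frame changes)."""
    roll = np.asarray(roll_deg, dtype=float)
    rom = roll.max() - roll.min()        # range of motion (degrees)
    path = np.abs(np.diff(roll)).sum()   # total rotation travelled
    return rom, path

# Synthetic roll trace (degrees) and the same trace with a +10 deg bias
t = np.linspace(0, 2 * np.pi, 200)
trace = 45 * np.sin(t)
rom_a, path_a = roll_metrics(trace)
rom_b, path_b = roll_metrics(trace + 10)  # constant estimation bias

# The constant bias cancels in both metrics
assert np.isclose(rom_a, rom_b) and np.isclose(path_a, path_b)
```

A zero-mean noise term, by contrast, would inflate the path-length metric, which is why bias-dominated error is the more benign failure mode here.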
Conclusion: Our analysis indicates that the proposed deep-learning computer vision algorithm can be used for surgical suturing skill assessment. Because the approach requires no sensors attached to hands or instruments, it avoids physical interference and sterility issues, laying a foundation for intra-operative surgical skill assessment.
(Q037) THE RESIDENCY SIGNAL AND THE NOISE
Kathryn Radulovacki, BA1, Devika A Shenoy, BS1, Brooke E Schroeder1, William C Eward, MD, DVM2; 1Duke University School of Medicine, 2Duke University Hospital
Introduction: Orthopaedic surgery applicants may send 30 preference signals to residency programs. Large academic programs are thought to receive the most signals, potentially leaving smaller community programs with fewer and making each of those signals more impactful. This study examines whether signal effectiveness varies by program type and size.
Methods: We conducted a retrospective cohort study of orthopaedic surgery applicants in 2024 using the AAMC Residency Explorer Tool. The primary outcome was the “interview offer rate,” defined as the number of interviews given to applicants who signaled the program divided by the number of signals sent. Programs were analyzed by type (academic or community) and size (small = ≤5 residents, medium = 6–8, large = ≥9). Data were evaluated via Kruskal-Wallis and Dunn post-hoc tests.
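As a sketch of the omnibus comparison, the Kruskal-Wallis test can be run on per-program interview offer rates grouped by size; the rates below are invented for illustration, and Dunn's post-hoc test would follow separately (e.g. via the scikit-posthocs package).

```python
from scipy import stats

# Hypothetical per-program interview offer rates (interviews to
# signalers / signals sent) for three size strata; values invented.
small = [0.25, 0.30, 0.18, 0.22, 0.27]
medium = [0.20, 0.24, 0.21, 0.19, 0.26]
large = [0.17, 0.23, 0.20, 0.25, 0.22]

# Kruskal-Wallis compares the three distributions without assuming normality
h, p = stats.kruskal(small, medium, large)
print(f"Kruskal-Wallis H = {h:.3f}, p = {p:.3f}")
```

A non-significant omnibus p-value, as reported here, would make pairwise Dunn comparisons moot.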
Results: Of 210 programs, 90 were excluded due to incomplete data, leaving 120 (57.1%). Across all institutions, applicants who signaled a program were significantly more likely to receive an interview offer (median 22% [IQR 17%-28%] versus 1% [0%-3%], p<0.001). For applicants who sent a signal, there was no significant difference in interview offer rate by program type, size, or type and size combined. There were also no significant differences across these metrics for applicants who did not signal a program.
Discussion/Conclusion: Interview rates for signals were consistent across program types and sizes, suggesting similar effects wherever signals are sent. All programs were unlikely to interview applicants who did not signal, highlighting the importance of careful signal selection.
(Q038) AUTOMATED FLS PEG TRANSFER ASSESSMENT USING EXPLAINABLE NEURAL ARCHITECTURE MODELS
Frank G Lee, MD, Mohamed S Baloul, MBBS, MD, Calvin Condon, Hang Yu, Johann Joseph, Erik A Clemens, MS, Benjamin L Lange, MS, Mariela Rivera, MD; Mayo Clinic Rochester
Introduction
All surgical trainees must complete the Fundamentals of Laparoscopic Surgery (FLS) peg transfer task for certification; however, this creates a large scoring burden on instructors. Automated scoring via computer vision (CV) offers a scalable, objective alternative. We developed and compared multiple CV approaches for automated FLS peg transfer scoring for feasibility, performance, and explainability.
Methods
We compared multiple object detection and tracking architectures on institutional demonstration FLS videos. Model architectures included You Only Look Once X (YOLOx) with ByteTrack for temporal tracking and Deep Object Pose Estimation (DOPE) for bounding box detection versus YOLO Neural Architecture Search (NAS) with custom object tracking algorithms. Models quantified FLS metrics: task completion time, successful disk transfers (12 total), disk drops, and instruments out-of-view. Performance was validated on a held-out real-world FLS assessment recording, with visualization of the embedding layer providing explainability.
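Detection models of this kind are typically validated by matching predicted bounding boxes to ground truth via intersection-over-union (IoU); a minimal sketch follows, with boxes as (x1, y1, x2, y2) corner tuples (our convention for illustration, not necessarily the authors' pipeline).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Overlap rectangle (may be empty)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Identical boxes overlap perfectly; disjoint boxes not at all
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```

A detection counts as correct when its IoU with a ground-truth box exceeds a threshold; frame-level disk and grasper accuracies are then the fraction of frames with a correct match.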
Results
YOLO NAS achieved superior performance and interpretability compared to more conventional architectures like YOLOx. Transfer detection accuracy was 92% (11/12 correct) versus 41% (5/12) for YOLOx with ByteTrack (Object Tracking). Frame-level accuracy for YOLO NAS was 93% disk detection, 81% grasper detection, 2 drops detected (1 false positive), and task timing within 3 seconds of actual timing. Embedding visualizations uniquely revealed interpretable learning patterns: distinct clustering emerged across different video sources (demonstrations versus assessments), with progressive refinement during training. This confirmed the model learned task-specific features rather than memorizing backgrounds or recording artifacts.
Conclusions
A preliminary YOLO NAS model shows automated assessment of FLS peg transfer is feasible. Our use of embedding-layer visualization provides a novel method for validating the learning process, building trust and demonstrating the model's application for real-world surgical assessment. However, additional annotated training data is needed before external validation and adoption. Despite suboptimal identification of disks and graspers, the model performs well on timing. This model could be a useful educational tool to track beginners’ progress in an automated fashion.
Figure
A: Representative annotated image frame with peg, disks, and graspers
B: Visual representation of the model embedding layer showing the effect of fine-tuning (from left to right) on interpretable clustering learning patterns.

(Q039) WHAT MOVES THE NEEDLE IN SUTURING? DOES AI ASSISTED ASSESSMENT IMPROVE STUDENT PERFORMANCE?
Enzo Castiglioni, Matias Aguilera, Bernardita Becker, Maria Gaete, Julian Varas Cohen; Center for Simulation and Experimental Surgery, Faculty of Medicine, Pontificia Universidad Catolica de Chile, UC-Christus Health Network, Santiago, Chile
Timely, specific feedback is central to learning procedural skills. Video-based assessment scales feedback and captures detailed performance indicators; within this context, artificial intelligence (AI) has been proposed to objectify and streamline faculty comments. The key question is whether AI, by itself, improves student performance compared with standard human feedback. Objective: To determine whether an AI tool that suggests feedback to faculty improves student performance versus standard human feedback.
Quasi-experimental assignment (intention to treat) in an undergraduate course (October 2025) teaching a single interrupted suture via an asynchronous video-based platform. Two student groups were assigned and assessed by the same faculty; the only difference between arms was access to editable AI-generated suggestions in the AI arm. No other instructional elements, grading criteria, or platform features differed. 85 students completed 132 attempts (maximum three). Outcomes included OSATS total, pass/fail by attempt, and critical subscales. For exploratory analyses, we also recorded the amount of early feedback as the total number of feedback items (text, drawings, error flags) delivered on Attempt 1. “Final” was defined as the first passing attempt or, if no pass, the last attempt. Group comparisons used Welch’s t-test for means and Fisher’s exact test with Wilson intervals for proportions. Linear regression modeled OSATS gain (OSATS_final − OSATS_Attempt1) adjusting for baseline OSATS; Kolmogorov–Smirnov assessed dispersion. Two-sided α = 0.05.
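The main comparisons can be sketched as follows; the OSATS gains, pass counts, and the Wilson-interval helper are illustrative assumptions, with Welch's test taken from SciPy.

```python
import math
from scipy import stats

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical per-student OSATS gains in each arm (values invented)
control_gain = [1.0, 2.5, 1.8, 0.9, 2.3, 1.7]
ai_gain = [2.1, 1.6, 2.4, 1.9, 2.2, 1.8]

# Welch's t-test: unequal variances allowed between arms
t, p = stats.ttest_ind(ai_gain, control_gain, equal_var=False)

# Wilson interval for an illustrative final pass rate (e.g. 40 of 44)
lo, hi = wilson_interval(successes=40, n=44)
```

The Wilson interval is preferred over the normal-approximation interval near proportions close to 0 or 1, which matters when pass rates are high, as reported here.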
AI showed no independent effect on improvement (AI coefficient ≈ +0.05, p ≈ 0.87). Both arms improved to a similar extent (Control +1.70 vs AI +2.00 OSATS points; Δ = +0.30, 95% CI −0.97 to +1.57). Final pass rates were high in both groups. In exploratory analyses, the first-attempt feedback amount was positively associated with subsequent score increase (β = +0.189 per unit; p = 0.007).
Providing faculty with AI-suggested feedback did not improve student performance beyond standard human feedback under otherwise identical conditions; both groups improved similarly. The educational lever appears to be early, substantive feedback, not AI per se. Accordingly, AI should be deployed as a complement to scale, structure, and enhance the quantity and quality of feedback, rather than as a substitute for faculty judgment.

(Q040) SIGNAL BOOST: EVALUATING THE EFFECT OF SIGNALING ON GENERAL SURGERY INTERVIEW RATES
Akshat Sanan, Nicholas J Iglesias, Talia R Arcieri, Ana M Reyes, Marina M Tabbara, Megan V Laurendeau, Nikita M Shah, Vanessa W Hui, Laurence R Sands, Chad M Thorson; University of Miami Miller School of Medicine
Introduction: The Electronic Residency Application Service has introduced signaling to help applicants express interest in specific residency programs, but its impact on interview offers remains unclear. We aimed to evaluate trends in interview rates associated with signaling across general surgery residency programs during the 2024–2025 application cycle.
Methodology: We queried Residency Explorer, extracting program variables including affiliation (university-affiliated, community-based university-affiliated, or community-based), location, and interview rate. Signal Boost (SB) was defined as the difference between the proportion of interviews offered to signaled and non-signaled applicants. Associations between SB and program characteristics were assessed using non-parametric tests.
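A minimal sketch of the SB computation and a non-parametric group comparison, using invented per-program rates (the grouping and values are ours, not the study's data):

```python
from scipy import stats

def signal_boost(signal_rate, nonsignal_rate):
    """SB = proportion interviewed among signalers minus non-signalers."""
    return signal_rate - nonsignal_rate

# Hypothetical (signal rate, non-signal rate) pairs per program
university = [signal_boost(s, n)
              for s, n in [(0.23, 0.03), (0.25, 0.04), (0.21, 0.02)]]
community = [signal_boost(s, n)
             for s, n in [(0.22, 0.06), (0.19, 0.05), (0.20, 0.07)]]

# Mann-Whitney U compares SB distributions between the two groups
u, p = stats.mannwhitneyu(university, community, alternative="two-sided")
```

Comparing medians of SB rather than raw interview rates isolates the incremental value of a signal from a program's baseline interview behavior.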
Results: 194 programs were included. University-affiliated programs demonstrated a higher median SB than community-based university-affiliated programs (p = 0.047). Doximity top-50 programs exhibited lower SB than programs ranked 51–100 (p = 0.033). Regional variation was also observed, with higher SB in the West North Central region compared to the Middle Atlantic (p = 0.046) and South Atlantic (p = 0.049).
Conclusions: The impact of residency program signaling varies significantly by program type, ranking, and geographic region, highlighting the need for more precise guidance on signaling practices to promote transparency in the residency selection process.
| | Signal Interview Rate | Non-Signal Interview Rate | Median Signal Boost | p-value |
| Affiliation | | | | 0.047 |
| University-Affiliated | 23% | 3% | 0.195 | |
| Community-Based University-Affiliated | 22% | 6% | 0.170 | |
| Region | | | | 0.007 |
| Pacific | 32% | 5% | 0.230 | |
| Mountain | 24% | 6.5% | 0.175 | |
| West North Central | 37% | 4% | 0.320 | |
| West South Central | 31% | 4% | 0.275 | |
| East North Central | 22% | 3.5% | 0.180 | |
| East South Central | 28% | 5% | 0.240 | |
| New England | 22% | 3% | 0.200 | |
| Middle Atlantic | 22% | 5% | 0.145 | |
| South Atlantic | 19% | 5% | 0.155 | |
(Q041) HARNESSING ARTIFICIAL INTELLIGENCE (AI) FOR SURGICAL EDUCATION: COMPARING AI- AND SURGEON-GENERATED FEEDBACK IN CONSENT TRAINING
Catherine Gbekie, BS, Brianna M Peet, BS, Viemma Nwigwe, MD, Grace B Simmons, AB, Daniel P Pacella, BS, Angel Rosario, MD, MPH; Columbia University Vagelos College of Physicians and Surgeons
Background: Effective surgical consent requires empathy, clarity, and cultural sensitivity. Traditional consent training has relied on in-person simulations, dyadic role-playing, and live feedback, methods that are variable and resource-intensive. Large language models now offer trainees controlled, judgment-free environments to practice communication skills with immediate, individualized feedback; however, their reliability and educational value remain insufficiently validated.
Objectives: To compare (1) the fidelity of AI-generated feedback to resident and attending surgeon-generated feedback and (2) the quality of AI-generated versus human-generated feedback.
Methods: Six medical students completed a simulated appendectomy consent involving a culturally sensitive component. Each encounter transcript received AI-generated and human-generated feedback using the same 17-item informed-consent rubric (0-3 scale; maximum score 51). The primary outcome was the difference in mean item-level rubric scores between AI and human evaluators (positive = higher AI), using human ratings as the reference. The secondary analysis evaluated feedback quality using a meta-evaluation model adapted from the modified Completed Clinical Evaluation Report Rating (CCERR) scale (8 items, 1-5 scale; maximum score 40) measuring Educational Utility and Constructiveness, Specificity and Justification, and Comprehensiveness. Data were collected October through December 2025, with additional responses pending.
Results: Preliminary results from 13 of 43 invited evaluators (N=26 evaluations) showed average rubric score discrepancies of 0.57–1.79 points, with largest differences for Explains Alternatives (Δ1.79), Assesses Capacity (Δ1.40), and Explains Benefits (Δ1.25). In secondary analysis, AI-generated feedback (N=6) universally outscored human-generated (N=20) feedback across all CCERR quality domains (Utility, 4.79 vs. 3.00; Specificity, 4.53 vs. 3.00; Comprehensiveness, 4.89 vs. 3.36), resulting in overall higher average scores for AI feedback (37.8 vs. 12.1 out of 40, Δ25.7). Notably, AI feedback was consistently complete, while surgeons frequently omitted qualitative or justification commentary, reflecting real-world time constraints for delivering live, constructive feedback.
Conclusions: AI feedback exhibited superior comprehensiveness, structural quality, and actionability but inflated performance scores and reduced scoring precision. These findings highlight AI’s potential as a scalable supplemental feedback tool, while underscoring the need for continued human oversight to preserve rubric fidelity and evaluative sophistication.
(Q042) DEFINING THE IDEAL SURGICAL TRAINEE: A NATIONAL SURVEY OF GENERAL SURGERY CHAIRS ON PREFERRED APPLICANT VIRTUES
Kevin I Ig-Izevbekhai1, Thomas C Howell, MD, MSHS1, Hima B Thota, MD2, Felix de Bie, MD, PhD1, Edwin Savage, MD3, Jacob A Greenberg, MD, EdM1, Theodore N Pappas, MD1, Ryan M Antiel, MD, MSME1; 1Department of Surgery, Duke University School of Medicine, 2Department of Surgery, Rutgers New Jersey Medical School, 3Department of Surgery, School of Medicine, University of North Carolina at Chapel Hill
Background: General surgery training is highly demanding, requiring both technical and non-technical excellence. While institutional preferences for core competencies have been explored [1-3], there remains no investigation of surgical leaders’ preferred character attributes in residency candidates.
Methods: This study was a discrete choice experiment within a 2025 electronic national survey of 154 U.S. General Surgery Chairs. Respondents were presented multiple combinations of the following virtues—courage, composure, empathy, gratitude, integrity, justice, resilience, wisdom—and were asked which attribute they valued most or least in a residency candidate. Multinomial logistic regression with robust standard errors was conducted to assess the relative preferences of surgery chairs with secondary analysis of covariates and demographic factors.
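Best/least ("valued most or least") choice data of this kind are often summarized with a simple best-worst count before any model fitting; a sketch on invented picks follows (the tallying scheme is an illustrative assumption, not the study's multinomial regression).

```python
from collections import Counter

# Hypothetical "most valued" and "least valued" picks across choice
# tasks; the data are invented for illustration.
best_picks = ["resilience", "empathy", "resilience", "courage", "empathy"]
worst_picks = ["justice", "gratitude", "integrity", "justice", "composure"]

best = Counter(best_picks)
worst = Counter(worst_picks)
virtues = set(best) | set(worst)

# Best-worst score: times chosen best minus times chosen worst
scores = {v: best[v] - worst[v] for v in virtues}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The multinomial logistic regression reported in the abstract refines this ranking by estimating preference weights with standard errors, rather than raw counts.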
Results: A total of 47 chairs (30.5% response rate) returned completed surveys. The highest-ranked virtue was resilience (31.9%), followed by empathy (25.5%), courage (19.1%) and wisdom (8.5%). Lower ranked virtues included composure (6.3%), gratitude (4.3%), integrity (2.1%) and justice (2.1%). Chairs who preferred empathy were also more likely to prioritize wisdom or composure (p = 0.005).
Conclusion: Among a list of non-technical virtues, General Surgery Chairs value resilience in their trainees in greater proportion than other virtues, while also prioritizing empathy. This preference may reflect the constraints of broad training and duty hour restrictions, whereby many programs prioritize technical over non-technical attributes. However, an intentional focus on non-technical attributes could increase virtue literacy among surgeons and encourage residency programs to recruit for and to cultivate key virtues among surgical trainees.
1. Maxfield CM, Montano-Campos JF, Chapman T, et al. Factors Influential in the Selection of Radiology Residents in the Post-Step 1 World: A Discrete Choice Experiment. J Am Coll Radiol. Nov 2021;18(11):1572–1580. doi:10.1016/j.jacr.2021.07.005
2. Melendez MM, Xu X, Sexton TR, Shapiro MJ, Mohan EP. The importance of basic science and clinical research as a selection criterion for general surgery residency programs. J Surg Educ. Mar–Apr 2008;65(2):151–4. doi:10.1016/j.jsurg.2007.08.009
3. Strausser SA, Dopke KM, Groff D, Boehmer S, Olympia RP. Importance of residency applicant factors based on specialty and demographics: a national survey of program directors. BMC Med Educ. Mar 2024;24(1):275. doi:10.1186/s12909-024-05267-8
(Q043) BRINGING THE TRAUMA BAY TO THE BACKROADS: IMPLEMENTING A PORTABLE SIMULATION CURRICULUM FOR RURAL HOSPITALS — THE PILOT YEAR
Zackery Aldaher, DO1, Gabrielle Moore, MD1, Kendall Via1, Cole Harp, DO1, Adnan Alseidi, MD, FACS2, The Georgia Trauma Commission2, Bao Ling Adam, PhD1, Erika Simmerman Mabes, MD, FACS1; 1Medical College of Georgia, 2The Georgia Trauma Commission
Introduction
Rural hospitals face significant gaps in trauma care due to limited access to specialists, procedural training, and standardized team-based education. In Georgia, over 20% of trauma patients come from rural areas where high-fidelity trauma training is unavailable. A statewide needs assessment identified deficiencies in procedural skills, interdisciplinary communication, and trauma system activation, guiding the design of a pilot curriculum presented at a multidisciplinary rural trauma conference. The curriculum was adapted to become a statewide, portable, interactive trauma simulation outreach course tailored for rural providers and piloted at participating hospitals.
Methods
In partnership with the Georgia Trauma Commission, we identified rural hospitals willing to participate in the pilot year of curriculum implementation. Providers at participating hospitals attended sessions combining interactive lectures and procedural skills stations (airway management, hemorrhage control, chest trauma). Participants completed pre- and post-course knowledge tests and Likert-scale confidence surveys, with paired t-tests performed for analysis. Qualitative feedback was also collected and analyzed using inductive coding.
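A paired t-test on pre/post confidence ratings can be sketched as below; the Likert values are invented for illustration and are paired by participant.

```python
from scipy import stats

# Hypothetical pre/post confidence ratings (1-5 Likert) for one skill
# station; index i in each list is the same participant.
pre = [2, 3, 2, 1, 3, 2, 2, 3]
post = [4, 4, 3, 3, 5, 4, 3, 4]

# Paired (repeated-measures) t-test on within-participant differences
t, p = stats.ttest_rel(post, pre)
print(f"t = {t:.2f}, p = {p:.4f}")
```

The pairing matters: testing within-participant differences removes between-participant variability in baseline confidence, which is why paired tests are standard for pre/post survey designs.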
Results
To date, 4 rural hospitals participated in the 2025 pilot year with 83 participants (9% MD/DO/APP, 68% RN, 16% EMS, 7% RT). Participants demonstrated significant improvement in procedural confidence across all skills (Table 1) and trauma stabilization knowledge. 86% of participants were strongly satisfied with the course, and over 90% strongly agreed it should continue. Qualitative themes emphasized the value of hands-on learning, efficient teaching, and clinical relevance. Suggested improvements included having more instructors and additional trauma scenarios.
Conclusion
The pilot year of this portable rural trauma simulation curriculum demonstrates strong feasibility, high engagement, and measurable educational benefits across varied rural settings. This program fills an important gap in trauma readiness and offers a scalable model for delivering tailored, multidisciplinary trauma education in low-resource environments. Ongoing growth includes regional partnerships, the development of a simulation toolkit, and train-the-trainer programs to ensure sustainability and wide dissemination.
Table 1: Participant (n=83) pre/post survey data indicating confidence in procedural skills following interactive trauma curriculum. All skills showed a significant increase in confidence (p<0.05) post training with largest effects in FAST Exam, Chest Tube and Needle Decompression.

(Q044) ASSESSING THE QUALITY OF SURGICAL SUBINTERNSHIPS: A MULTI-SITE ANALYSIS OF SUBINTERNSHIP DOMAIN-LEVEL PERFORMANCE AND VISITING VS. HOME STUDENT EXPERIENCES
Jonah D Thomas, MD, MS1, Claire Ferguson, MD2, Dandan Chen, PhD1, Peter Stehr, BS3, Emily Witt, MD3, Jonathan Greer, MD3, Motaz Qadan, MD, PhD3, Sophia McKinley, MD, EdM3; 1Massachusetts General Hospital, 2SUNY Downstate, 3Harvard Medical School
Introduction: High-quality surgical subinternships have the potential to provide learning experiences that prepare medical students for residency and allow students to audition for surgical training at a particular institution. Despite this importance, limited data exist on the quality of surgical subinternship experiences offered to senior medical students and on whether subinternships meet consensus recommendations. This study aims to quantify the educational quality of surgery subinternships across seven key domains and to compare the subinternship experience between home and visiting students.
Methods: From April through September 2025, medical students completing general surgery or general surgery-adjacent subinternships (e.g. vascular, cardiothoracic, transplant surgery) at three academic hospitals completed a 72-item, seven-domain survey on a 5-point Likert scale. Domains included rotation structure, rounding & patient care, operating room conduct, technical skills, knowledge base, clinic, and professionalism. Dunn’s post-hoc test with Benjamini–Hochberg correction was used to compare domain rating means. Differences between home and visiting students were compared using independent-samples t-tests.
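The Benjamini–Hochberg step-up correction used for the pairwise domain comparisons can be sketched in a few lines; the raw p-values below are invented for illustration.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (FDR step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending by p
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted

# Hypothetical raw p-values from pairwise domain comparisons
raw = [0.001, 0.008, 0.020, 0.040, 0.300]
adj = benjamini_hochberg(raw)
print(adj)
```

Compared with Bonferroni, BH controls the false discovery rate rather than the family-wise error rate, preserving power when many pairwise domain comparisons are made at once.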
Results: Fifty-five students (26 visiting, 29 home) completed the survey (response rate = 57%). The median overall satisfaction score was 9.5/10 [IQR 8.25-10]. Overall, mean domain scores (±SD) were: Professionalism = 4.72 ± 0.43, Rotation Structure = 4.50 ± 0.52, Operating Room = 4.31 ± 0.57, Knowledge = 4.19 ± 0.59, Technical = 3.93 ± 0.82, Clinic = 3.85 ± 0.99, and Rounding = 3.70 ± 1.00. Post-hoc comparisons revealed Professionalism scored significantly higher than nearly all other domains (p < 0.01), except Rotation Structure (Image 1). No significant differences were observed between home and visiting students in overall satisfaction or any domain mean (p > 0.05). Item comparison revealed that visiting students less frequently completed their subinternship early enough to receive a letter of recommendation (p = 0.047).
Conclusions: Subinternship quality was rated highly overall, with Professionalism and Rotation Structure most positively evaluated, while Clinic and Rounding & Patient Care experiences were identified as areas for potential improvement. Equivalent ratings between home and visiting students suggest consistent educational quality regardless of student home institution. These findings provide actionable insight for enhancing subinternship design and are the first step toward benchmarking national standards for surgical subinternship quality.

