Plenary Session
Background: In anticipation of the American Board of Surgery (ABS) rollout of Entrustable Professional Activities (EPAs), our program piloted an intraoperative EPA assessment tool in 2022. This analysis evaluates the assessment tool by exploring associations between entrustment, surgical skill and knowledge, and resident and faculty characteristics.
Methods: The tool has an overall entrustment score and four sub-scores, developed from the ABS narratives for the pilot EPAs: anatomy, steps of operation, recognition of potential errors, and surgical technique. Each assessment captured name, date, PGY level, rotation, procedure, and case difficulty (straightforward, moderate, or complex). Faculty years in practice (range: 0-37 years), resident gender, and resident underrepresented in medicine (URiM) status were collected for analysis. URiM includes African American or Hispanic/LatinX, consistent with our institutional definition. Given strong intercorrelations (r=0.45-0.69) and high reliability (α=0.84), the four sub-scores were summed to a composite score. A multivariable linear regression model assessed for associations between overall score and composite sub-score, PGY level, case difficulty, faculty years in practice, resident gender, and resident URiM status.
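As a rough illustration of the reliability step, Cronbach's α for k sub-scores can be computed from the per-item variances and the variance of the summed (composite) score. The sketch below is a minimal Python example under stated assumptions: the function name `cronbach_alpha` and all ratings are hypothetical and are not study data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of scores per item).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Composite score per assessment: sum across the k sub-scores.
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


# Hypothetical ratings (NOT study data): four sub-scores for five assessments.
anatomy = [2, 3, 3, 4, 5]
steps = [2, 3, 4, 4, 5]
errors = [1, 3, 3, 4, 4]
technique = [2, 2, 3, 4, 5]

alpha = cronbach_alpha([anatomy, steps, errors, technique])
print(f"alpha = {alpha:.2f}")
```

The same `totals` list is the composite sub-score that would enter a regression model as a single predictor.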
Results: Between June 2022 and June 2023, 46 faculty from 10 elective and emergent services completed 398 assessments for 44 residents (PGY1-PGY5). The linear regression model explained 55% of the variance. Overall entrustment score was most strongly associated with the composite sub-score, followed by case difficulty and PGY level. Faculty years in practice, resident gender, and resident URiM status were not significantly associated with entrustment (Table 1).
Conclusion: Our results suggest that faculty strongly consider resident intraoperative knowledge and technical skills when imparting autonomy. However, the 45% of variance left unexplained calls for further exploration of other contributors to entrustment, such as non-technical skills, interpersonal variables, or other unmeasured sources of bias.

Introduction
While previous studies demonstrated that generative artificial intelligence (AI) can pass medical licensing exams as well as oral and written boards as the examinee, the accuracy and safety of the technology as an examiner in complex, interactive assessments remains unknown. AI-powered chatbots could serve as educational tools simulating human examiners in oral case-based formats. We present initial validity evidence for the use of an AI-powered chatbot as an oral boards examiner.
Methods
We developed a chatbot based on GPT-4 that simulated oral board scenarios. We provided the chatbot with an introductory prompt explaining the flow of oral board scenarios, appropriate topics, and assessment frameworks. General surgery residents were recruited from six institutions through the Collaboration of Surgical Education Fellows and then completed oral board scenario simulations using the chatbot. An experienced general surgery mock oral board examiner evaluated chatbot transcripts using a rubric that assessed accuracy and safety through five domains: (1) inappropriate content, (2) missing content, (3) likelihood of harm, (4) extent of harm, and (5) hallucinations.
Results
Twenty oral board scenario simulations were completed. Commonly tested topics included small bowel obstruction (30%), diverticulitis (20%), and breast disease (15%). Twelve (60%) transcripts had no inappropriate content while three (15%) and five (25%) had inappropriate content of low and high clinical significance, respectively. Eleven (55%) transcripts had no missing content while five (25%) and four (20%) had missing content of low and high clinical significance, respectively. Sixteen (80%) had low likelihood of harm while three (15%) and one (5%) had medium and high likelihood of harm, respectively. Fifteen (75%) demonstrated no harm while four (20%) and one (5%) demonstrated moderate and severe harm, respectively. Only one (5%) transcript exhibited a hallucination with high clinical significance. Five (25%) transcripts exhibited the highest score possible on our rubric (Figure 1).
Conclusion
Our AI-powered chatbot has the potential to be a valuable educational tool for surgery residents preparing for oral boards. Further refinement through additional prompt engineering and large language model fine-tuning is needed to improve the accuracy and safety of the chatbot and mitigate the risk of harm.

Background: Inguinal hernia repairs are common procedures in general surgery. The robotic platform has been widely adopted for the procedure, which necessitates a proficiency-based curriculum for novice surgeon training and safe implementation. Most curricula rely on generalized feedback using global performance metrics. Task-specific metrics (TSM) have been shown to provide trainees with actionable feedback. In this study, we aim to evaluate the benefits of implementing a proficiency-based robotic hernia curriculum with task-specific feedback on trainee technical skills and perceptions of improvement.
Methods: Surgery residents from UT Southwestern Medical Center and NorthShore who completed the Intuitive SimNow© platform were included. Residents performed an inanimate hernia drill four times. Each attempt was graded according to the Objective Structured Assessment of Technical Skills (OSATS) and TSM. Attending surgeons completed the hernia drill to establish benchmarks. Videos were recorded and made available online with accompanying feedback for review. Entry and exit surveys were administered.
Results: In total, 22 residents started the curriculum and 16 completed it. Three attendings completed the hernia drill. For residents, median total OSATS and TSM scores on the 1st attempt were 22 (19.5-26) and 12 (9-17), respectively, and their median time to completion was 27 (22-32.5) minutes. Attending total OSATS, TSM, and time were 30 (29-32), 23 (23-25), and 14 (10-15) minutes, respectively, all significantly better than those of trainees (P<0.05). Comparing the 4th to the 1st attempt, residents' total OSATS (31.4 vs 22.4, P<0.05) and total TSM (22.7 vs 12.3, P<0.05) improved significantly, as did time (20.7 vs 26.8 minutes, P<0.05). Residents were able to achieve attending performance by their 4th attempt on OSATS and TSM (P<0.05), but not time (P>0.05). A thematic analysis was performed; the three most noted feedback items on the 1st attempt were peritoneal incision (62.5%), suturing (50%), and camera/instrument handling (43.75%). The most noted items on the 4th attempt were speed (18.75%), suturing (18.75%), and camera/instrument handling (12.5%).
Conclusion: The robotic hernia curriculum improves resident performance, with residents reaching attending-level OSATS and TSM scores, though not completion time, by their fourth attempt. More procedure-specific curricula are needed to better prepare trainees for the operating room.

Introduction: Due to pandemic-related restrictions, medical schools transitioned to virtual clinical rotations in 2020. Virtual learning is now an integral part of medical education, but questions remain as to whether it adequately prepares students for the rigors of surgical residency. We hypothesized that students exposed to virtual learning during medical school demonstrate inferior performance during residency compared to their predecessors.
Methods: Data were collected from 12 general surgery programs. Residents who began training in academic years (AY) 2018-2022 were included and followed for two years. Residents who started in AY2018-2020 (control group), prior to the introduction of virtual rotations, were compared to residents who started in AY2021-2022 (intervention group), after virtual clinical rotations were implemented. The primary outcome was the ‘overall’ milestone score; secondary outcomes included USMLE scores, ABSITE percent correct, remediation, and attrition. A linear mixed-effects model was used for the primary outcome and ABSITE percent correct. Chi-squared and Fisher's exact tests were used for remediation and attrition, respectively. Only years with complete sets of milestone scores were included.
Results: A total of 334 residents were included: 210 in the control group and 124 in the intervention group. USMLE scores did not differ between the control and intervention groups (Step 1: 239 vs 240, p=0.16; Step 2: 249 vs 252, p=0.13). There was a 24-point increase in overall milestones between PGY1 and PGY2 (95% CI 23-26, p<0.05). When adjusted for PGY, virtual learning was associated with an average decrease of 2.3 points in the sum of all milestones compared to the control group (95% CI -4.2 to -0.4, p<0.05). A 2% decrease in ABSITE percent correct was noted in the intervention group (95% CI 1-5%, p=0.07). Thirty-six residents underwent remediation: 16 (8%) in the control group and 20 (16%) in the intervention group (p=0.03). Attrition occurred in 12 residents: 6 (3%) in the control group and 6 (5%) in the intervention group (p=0.4).
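For readers who want to check the remediation comparison, the chi-squared test on a 2x2 table can be sketched with the standard library alone. This is an illustrative sketch, not the authors' analysis code: the helper name `chi2_2x2` is hypothetical, and because the abstract does not state whether a continuity correction was applied, the sketch assumes Yates' correction by default.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d, yates=True):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]] (df = 1).

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Continuity correction (assumed), never pushing the difference below 0.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))


# Counts from the abstract: remediation in 16 of 210 control residents
# and 20 of 124 intervention residents.
stat, p = chi2_2x2(16, 210 - 16, 20, 124 - 20)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

Passing `yates=False` gives the uncorrected statistic, which yields a somewhat smaller p-value for the same counts.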
Conclusion: Virtual learning during medical school is associated with worse performance as a junior surgical resident as indicated by lower milestone scores and increased remediation. In-person and hands-on learning experiences during medical school should be prioritized to adequately prepare medical students for surgical residency.

Purpose
Students’ perceptions of workplace culture affect the learning environment in the Surgery clerkship; however, the timepoint at which students develop these perceptions is unclear. We aimed to identify inflections in students’ preconceptions of the culture of surgery, which are relevant to the timing of interventions targeting the surgical learning environment.
Methods
Students at multiple levels received surveys between July 2021 and September 2023 soliciting words associated with the “Culture of Surgery.” We analyzed entries using a “bag of words” method, with each word representing a unique token, and determined the most prevalent words. In sentiment analysis, 2 raters independently assigned a positive, neutral, or negative valence to each word, and valence agreement was assessed. We compared proportions of valences and rater agreement among student levels with Chi-square contingency tests and determined inflections in sentiment.
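The word-counting and rater-agreement steps can be sketched as follows. This is a minimal illustration, not the study's pipeline: the survey entries and valence labels are hypothetical, and the sketch assumes agreement is measured as simple percent agreement, which the abstract does not specify.

```python
from collections import Counter

# Hypothetical survey entries (NOT study data): free-text words per student.
responses = [
    "intense, demanding, rewarding",
    "Intense hierarchical",
    "teamwork, intense",
]

# Bag of words: lowercase, split on commas/whitespace, count each token.
tokens = [
    word
    for entry in responses
    for word in entry.lower().replace(",", " ").split()
]
counts = Counter(tokens)
print(counts.most_common(3))

# Hypothetical valence labels from two raters for each unique word.
rater_a = {"intense": "negative", "demanding": "negative", "rewarding": "positive",
           "hierarchical": "negative", "teamwork": "positive"}
rater_b = {"intense": "neutral", "demanding": "negative", "rewarding": "positive",
           "hierarchical": "negative", "teamwork": "positive"}

# Simple percent agreement: share of unique words given the same valence by both raters.
agreed = sum(rater_a[w] == rater_b[w] for w in counts)
agreement = 100 * agreed / len(counts)
print(f"agreement = {agreement:.1f}%")
```

A chance-corrected statistic such as Cohen's kappa could replace percent agreement if the raters' label distributions are skewed.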
Results
Participants included 50 undergraduates, 111 first-year medical students, and 216 clerks beginning Surgery rotations. “Intense” was the most common word reported at all levels. Sentiment analysis results are shown in the table:
| | Undergraduate, Non-pre-med | Undergraduate, Pre-med | MS1 Orientation | MS2 Surgery Clerkship | MS3 Surgery Clerkship | P-value |
|---|---|---|---|---|---|---|
| Participants | 31 | 19 | 111 | 77 | 139 | |
| Unique Words | 76 | 62 | 141 | 159 | 197 | |
| Total Words | 156 | 95 | 263 | 302 | 443 | |
| Valence (%) | | | | | | |
| Positive | 71.6 | 58.9 | 13.3 | 16.3 | 21.1 | <0.001 |
| Neutral | 21.3 | 21.1 | 29.7 | 27.7 | 13.4 | <0.001 |
| Negative | 7.1 | 20.0 | 57.0 | 56.0 | 50.2 | <0.001 |
| Agreement | 73.7 | 62.3 | 75.2 | 78.6 | 74.1 | 0.183 |
Conclusions
Prior to medical school, students have predominantly positive preconceptions of surgical culture, while negative preconceptions predominate early in medical school and persist into clerkships. Future studies will focus on factors that account for these differences.
