Podium IB - Assessment
(S022) BUILDING A BETTER SUB-INTERNSHIP EVALUATION: A MULTI-STAKEHOLDER APPROACH
Anthony O Morada, MD, Jessica L Becker, MD, Michael J Furey, DO, Christian Hailey Summa, DO, Kristen R Richards, MD, Jacob Bodde, MD, Rebecca Michelle Jordan, DO, Joseph P Bannon, MD, Alexandra Falvo, MD; Geisinger Northeast General Surgery Residency
Background: Sub-internships serve as essential components in the residency selection process; however, many programs lack validated assessment frameworks that adhere to principles of educational science. Traditional evaluative methods heavily rely on unstructured, single-source assessments with uncertain reliability, thereby limiting both trainee development and programmatic decision-making. This study seeks to advance the field of surgical education assessment by implementing and validating a comprehensive evaluation system grounded in established literature concerning workplace-based assessments and psychometric principles.
Methods: Building upon validated surgical education assessment instruments, including the Objective Structured Assessment of Technical Skills (OSATS), Mini-Clinical Evaluation Exercise (Mini-CEX), and 360-degree feedback, we developed a structured, competency-based evaluation system employing behaviorally anchored 5-point rating scales targeting ACGME competencies. Our multi-stakeholder approach systematically gathered perspectives from attending surgeons, residents, and clinical staff. Implementation spanned five months, enabling prospective validation of reliability, feasibility, and discriminant validity. Variance decomposition was used to identify the optimal number of evaluators per student, and score patterns were analyzed across evaluator types and competency domains.
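The reliability and rater-number logic described above can be sketched numerically. The snippet below is a minimal illustration (not the study's analysis code) of how a single-rater ICC(2,1) is computed from a fully crossed student-by-rater matrix, and how the Spearman-Brown formula projects how many raters are needed to reach a target reliability; all data are simulated.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater ICC.
    ratings: (n_subjects, n_raters) array from a fully crossed design."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)          # per-subject means
    col_means = ratings.mean(axis=0)          # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subjects MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-raters MS
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def raters_needed(icc_single, target=0.80):
    """Spearman-Brown projection: raters whose averaged score reaches target."""
    return target * (1 - icc_single) / (icc_single * (1 - target))

# simulated data: 12 students rated by 5 raters on a 5-point-style scale
rng = np.random.default_rng(0)
true_skill = rng.normal(3.5, 0.6, (12, 1))            # latent student skill
ratings = true_skill + rng.normal(0, 0.4, (12, 5))    # rater noise
single = icc2_1(ratings)
print(f"single-rater ICC = {single:.2f}, raters for 0.80: {raters_needed(single):.1f}")
```

The same projection is how a "4-5 evaluations per student" threshold can fall out of a moderate single-rater ICC.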
Results: Twenty-four evaluations across 8 students demonstrated strong psychometric properties and practical feasibility. High completion rates (100%) and brief time investment (mean 4.2 minutes) supported stakeholder acceptability. Multi-source assessment included attending physicians (17%), residents (79%), and clinical staff (4%). Inter-rater reliability ranged from good to excellent: overall intern readiness ICC=0.93 (95% CI: 0.79-0.995), technical skills ICC=0.92 (0.65-1.0), initiative ICC=0.79 (0.48-0.98), communication ICC=0.75 (0.43-0.98), and clinical knowledge ICC=0.66 (0.30-0.97), with professionalism (ICC=0.47) showing adequate consistency for multi-rater assessment. Discriminant validity was supported by appropriate identification of performance concerns (17% of students flagged). Analysis revealed that 4-5 evaluations per student optimized reliability while maintaining feasibility.
Conclusions: This study demonstrates that systematic, evidence-based sub-internship evaluation informed by assessment science principles achieves both psychometric rigor and practical implementation success. By validating optimal evaluator thresholds and multi-stakeholder approaches, this work advances surgical education assessment science. The scalable framework enhances training environments through structured feedback while producing objective data that support residency selection decisions and help identify future skilled, compassionate surgeons.
(S023) EVALUATING NARRATIVE FEEDBACK OF ENTRUSTABLE PROFESSIONAL ACTIVITIES IN GENERAL SURGERY: A MIXED-METHODS APPROACH
Gabrielle M Moore1, Dalton Hegeholz, MD1, Ting Sun, PhD1, M. Libby Weaver, MD, MHPE1, Erin Ward, MD1, Erika Simmerman Mabes, DO2, Kshama Jaiswal, MD1; 1University of Utah, 2Wellstar MCG Health
Background
Entrustable Professional Activities (EPAs) are competency-based assessments designed to provide formative feedback for core activities of general surgery. This work aims to evaluate the characteristics of narrative EPA feedback provided to general surgery (GS) residents and investigate associated factors.
Methods
This is a single institution, retrospective study including assessments of GS trainees by faculty from 2023-2025. Qualitative analysis of feedback characteristics was performed utilizing deductive coding by two independent researchers from previously established frameworks (specific, coaching/formative, appreciative, or evaluative). Mixed-effects logistic regression examined the association between feedback characteristics with phase of care, case complexity, entrustment, and faculty and resident demographics.
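As a simplified illustration of the odds-ratio reporting that follows from this design: a mixed-effects logistic regression additionally adjusts for covariates and clustering within faculty and residents, but the core quantity, an odds ratio with a Wald confidence interval and p-value, can be computed directly from a 2x2 table. The counts below are hypothetical, not the study's data.

```python
import math

def odds_ratio_wald(a, b, c, d):
    """Odds ratio, 95% Wald CI, and two-sided p-value for a 2x2 table:
                 feature present   feature absent
    group 1            a                 b
    group 2            c                 d
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    z = math.log(or_) / se
    p = math.erfc(abs(z) / math.sqrt(2))            # two-sided normal p-value
    return or_, (lo, hi), p

# hypothetical counts: coaching language in pre-operative vs intra-operative EPAs
or_, (lo, hi), p = odds_ratio_wald(40, 60, 45, 130)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f}), p = {p:.3f}")
```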
Results
A total of 673 EPAs were submitted with narrative feedback by faculty. Of those, 535 (80%) were specific, 154 (23%) coaching, 198 (29%) formative, 501 (74%) appreciative, 215 (32%) evaluative, and 58 (9%) contained trainee personal characteristics (e.g., "confident"). Pre-operative feedback was more likely than intra-operative to include coaching (OR = 1.90, p = 0.034), formative (OR = 1.73, p = 0.049), or evaluative (OR = 3.31, p < 0.001) language (Figure 1). Complex cases were more likely to include formative feedback (OR = 2.23, p = 0.009) while less likely to be specific (OR = 0.39, p = 0.013) or contain trainee personal characteristics (OR = 0.45, p = 0.046). When compared to practice-ready trainees, non-practice-ready residents were less likely to receive specific (OR = 0.27, p = 0.001), coaching (OR = 0.07, p < 0.001), formative (OR = 0.11, p < 0.001), or appreciative feedback (OR = 0.50, p = 0.050) while more likely to receive evaluative feedback (OR = 2.91, p = 0.002). Junior faculty were less likely than senior faculty to provide appreciative feedback (OR = 0.36, p = 0.024). There were no significant differences in feedback characteristics by faculty or trainee demographics.
Conclusion
There is no evidence that EPA narrative assessments are biased by demographic characteristics. Practice-ready residents and those participating in more complex cases receive more coaching and formative feedback. It may be beneficial to focus future faculty development on increasing constructive feedback given to non-practice-ready residents.
Figure 1: Factors associated with EPA feedback characteristics.
(S024) A BAYESIAN FRAMEWORK FOR HOLISTIC ASSESSMENT OF SURGICAL RESIDENT TECHNICAL SKILL
Jessica R Santos-Parker, MD, PhD, Keli S Santos-Parker, MD, PhD, Shareef Syed, MD, Adnan Alseidi, MD, Hueylan Chern, MD, Patricia S O’Sullivan, PhD; Department of Surgery, University of California San Francisco
Background: Technical skills assessments often include measures of speed and proficiency. However, task-specific measures limit holistic performance interpretation due to the constraints of traditional analytic methods. We analyzed surgical intern skills assessments using a joint Bayesian model of speed and rated proficiency to estimate latent composite performance and quantify ranking reliability across multiple tasks. This framework identifies holistic speed-proficiency patterns, supporting targeted feedback.
Methods: Forty-two surgical interns were timed on nine standardized technical tasks (iterations of open knot tying, suturing, superficial and deep), rated by 18 faculty on a 5-point ordinal proficiency rubric during a foundational skills assessment at a large academic institution. Analyses were conducted in R (v4.4.1) using brms and lme4. A hierarchical Bayesian mixed model jointly estimated log-time and proficiency rating across residents, tasks, and faculty raters. Resident-specific speed and proficiency were estimated as random intercepts and summed to define a composite score. Composite score reliability was quantified with Bayesian generalizability analysis.
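The ranking-stability idea behind the Bayesian generalizability analysis can be sketched without the full brms model: given posterior draws of each resident's composite score, one can measure how often each pair of residents keeps the same order as the posterior-mean ranking. The snippet below is a toy sketch with synthetic draws, not the study's model.

```python
import numpy as np

def ranking_agreement(draws):
    """draws: (n_draws, n_residents) posterior draws of composite scores.
    Returns the mean fraction of resident pairs whose ordering within each
    draw matches the ordering of the posterior-mean scores."""
    mean_scores = draws.mean(axis=0)
    mean_order = np.sign(mean_scores[:, None] - mean_scores[None, :])
    iu = np.triu_indices(draws.shape[1], k=1)             # unique resident pairs
    agree = [np.mean(np.sign(d[:, None] - d[None, :])[iu] == mean_order[iu])
             for d in draws]
    return float(np.mean(agree))

# synthetic posterior: 42 residents, moderate posterior uncertainty
rng = np.random.default_rng(1)
truth = rng.normal(0, 1, 42)                              # latent composites
draws = truth + rng.normal(0, 0.3, (2000, 42))            # posterior draws
print(f"ranking agreement = {ranking_agreement(draws):.2f}")
```

A perfectly certain posterior would give 100% agreement; wider posteriors shrink the figure toward chance.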
Results: Resident composite score rankings were stable, with 85% [95% CrI 81-89%] agreement across posterior draws. Bayesian generalizability analysis yielded moderate reliability (G = 0.65 [95% CrI 0.42-0.78]) for nine tasks. Latent speed and proficiency were positively correlated (ρ = 0.49 [95% CrI 0.17-0.74]), with speed accounting for 44% (95% CrI 17-68%) of variation in proficiency, indicating substantial proficiency differences among similarly fast residents. The holistic skill plane (Figure 1) displays composite scores by speed and proficiency for each resident, interpreted as resident performance archetypes. For example, an intern in the bottom-left quadrant to the right of the trend line has low overall skill and is relatively proficient but slow, whereas one in the top-right quadrant to the left of the line has high overall skill and is relatively fast but less proficient.
Conclusions: Joint Bayesian modeling enables holistic interpretation of resident technical performance by integrating speed and proficiency into a single latent composite score with quantified reliability. This approach identifies resident archetypes that reveal deeper speed-proficiency patterns, with implications for coaching advice on how to practice: emphasizing speed or technique. Educators can move beyond single-task metrics to provide personalized overall feedback and identify ineffective practice habits early.

(S025) ANALYSIS OF SURGICAL RESIDENTS IN THE AAMC RESIDENT READINESS SURVEY
Chase C Marso, MD, Roy Phitayakorn, MD, MHPE, Sophia McKinley, MD, MEd, Dandan Chen, PhD; Massachusetts General Hospital
Background: The AAMC initiated the Resident Readiness Survey (RRS) program in 2019 to support and assess the transition of trainees from medical school to intern year. Of all specialties, general surgery programs have among the highest rates of indicating that incoming residents do not meet overall performance expectations. The aim of this study was to analyze RRS responses to understand which specific clinical and non-clinical skills are rated as deficient among general surgical interns and residents from other surgical subspecialties.
Methods: De-identified RRS quantitative data from 2019-2024 was obtained from the AAMC. Resident overall performance and performance in specific clinical and non-clinical domains were analyzed between general surgery and other surgical subspecialties (neurosurgery, orthopedic surgery, integrated plastic surgery, integrated thoracic surgery, and integrated vascular surgery). Descriptive statistics and Chi square tests were performed to identify specific patterns in assessments of residents failing to meet expectations.
Results: From 2019-2024, 303 general surgery residency programs completed 4500 assessments; 441 surgical specialty programs completed 3107 assessments. In that time, significantly more general surgery interns (224/4495, 4.9%) were rated by their program directors as not meeting overall performance expectations compared to surgical subspecialty interns (53/3100, 1.7%) (p<0.001) (Figure). Among specific skills, general surgery interns were significantly more likely to receive ratings of “Did not meet expectations” compared with interns in surgical subspecialties, particularly for organization and timeliness (5.95% vs 2.27%, p<0.001), procedures/consent (3.27% vs 1.18%, p<0.001), and prioritizing a differential diagnosis (3.22% vs 0.93%, p<0.001), as shown in the Figure.
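The overall comparison above reduces to a 2x2 chi-square test, which can be reproduced from the reported counts (224 of 4,495 general surgery interns vs 53 of 3,100 subspecialty interns rated as not meeting expectations). The sketch below is an illustration of the test, not the study's analysis code.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df) and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))   # survival function of chi-square, 1 df
    return chi2, p

# counts from the abstract: interns rated "not meeting expectations" vs the rest
gs_flagged, gs_total = 224, 4495
sub_flagged, sub_total = 53, 3100
chi2, p = chi2_2x2(gs_flagged, gs_total - gs_flagged,
                   sub_flagged, sub_total - sub_flagged)
print(f"chi2(1) = {chi2:.1f}, p = {p:.1e}")
```

The resulting p-value is far below 0.001, consistent with the reported significance.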
Conclusions: The overall proportion of general surgery interns not meeting expectations as rated by their program directors is low, but the percentage is higher than for interns at surgical subspecialty programs. Additional qualitative analysis of program directors’ comments in the RRS may provide further insight into intern deficiencies. Understanding specific deficiencies may create opportunities for targeted education interventions at medical schools and residency programs during the UME to GME transition.

(S026) CHARACTERIZING HOW SURGICAL RESIDENTS EVALUATE FACULTY: A DOCUMENT ANALYSIS OF FACULTY TEACHING EVALUATION TOOLS
Yichuan Yan, MSED, Nathan G Behrens, MD, Dimitrios Stefanidis, MD, PhD; Indiana University School of Medicine
Purpose
Feedback on faculty teaching is crucial to improving faculty performance. Surgical training programs obtain such feedback from their residents and provide it to their faculty. However, the content and quality of the faculty teaching evaluations in use are unknown, as no standardized tool exists. The aim of this study was to identify the domains and variation of faculty teaching evaluation tools used in surgical residency programs.
Methods
Using convenience sampling, blank copies of the faculty teaching evaluation forms that residents complete were collected from various U.S. surgery residency programs. A document analysis was performed on individual question items to identify common themes and evaluation domains. Descriptive statistics and tabulations were used to summarize the data.
Results
Fourteen faculty teaching evaluation tools from thirteen surgical residency programs were collected and analyzed. Content analysis of all question items revealed four major evaluation domains: teaching effectiveness (55.1% of codes), professionalism (25.9%), clinical performance (12.1%), and overall performance (6.9%). Figure 1 demonstrates the variability across the 14 tools based on the distribution of question items across the four evaluation domains and three question types. The mean number of questions per evaluation was 14.7 (range 4-23), composed of rating questions (9.6), free-text comments (4.4), and checklist questions (0.6).
Conclusions
Substantial variability in faculty teaching evaluations used across a variety of surgical training programs was identified. Besides teaching effectiveness, faculty professionalism and, less frequently, clinical performance were also evaluated. Standardization of faculty teaching evaluation tools can ensure assessment validity and actionable faculty feedback and enable comparisons across institutions.

(S027) STANDARD SETTING IN THE ERA OF WORKPLACE-BASED ASSESSMENT: A PILOT EXPLORATION
Alyssa A Pradarelli1, Kayla M Marcotte, MD, PhD1, Brian C George, MD, MAEd1, Tyler J Loftus, MD2, James R Korndorffer Jr, MD, MHPE3, David T Hughes, MD1, Gifty Kwakye, MD, MPH1, Sophia K McKinley, MD, EdM4, Erin M White, MD, MBS, MHS5, Jordan D Bohnen, MD, MBA6, Andrew E Krumm, PhD1; 1University of Michigan, 2University of Florida Health, 3The University of Texas at Austin Dell Medical School, 4Massachusetts General Hospital, 5University of Alabama, 6Beth Israel Deaconess Medical Center
Background: Surgical training has seen a significant increase in the use of workplace-based assessments (WBAs), especially to assess entrustable professional activities (EPAs). Performance standards are necessary to realize the full potential of WBAs for competency-based education, and are currently lacking. This study explored 3 approaches to establishing WBA performance standards using operative performance WBA data.
Methods: An expert panel of surgical educators, representing diverse specialties, programs, and educational roles, participated in a two-hour virtual session. Panelists were trained to understand the Society for Improving Medical Professional Learning (SIMPL) OR WBA tool and SIMPL Operative Performance (OP) score, which summarizes cumulative operative performance ratings using methods adapted from computer adaptive testing and Bayesian inference networks. The panel applied 3 established standard setting methods (Angoff, Bookmark, and Construct Mapping) to define a general surgery graduation standard (i.e., ready for independent practice) for the OP score. The output from each method was assessed for congruency using a one-way repeated-measures analysis of variance (ANOVA), with standard setting method as the within-subjects factor. The session transcript was analyzed inductively using interpretive description methodology to identify themes for future iterations.
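The within-subjects comparison described above (each panelist produces a cut score under each method) can be sketched as a one-way repeated-measures ANOVA computed from sums of squares. The panelist scores below are simulated for illustration; they are not the study's data.

```python
import numpy as np

def rm_anova_1way(x):
    """One-way repeated-measures ANOVA.
    x: (n_subjects, k_conditions) array. Returns F and (df1, df2)."""
    n, k = x.shape
    grand = x.mean()
    ss_cond = n * np.sum((x.mean(axis=0) - grand) ** 2)    # between-conditions
    ss_subj = k * np.sum((x.mean(axis=1) - grand) ** 2)    # between-subjects
    ss_err = np.sum((x - grand) ** 2) - ss_cond - ss_subj  # condition x subject
    df1, df2 = k - 1, (n - 1) * (k - 1)
    return (ss_cond / df1) / (ss_err / df2), (df1, df2)

# hypothetical cut scores: 10 panelists x 3 methods, no true method effect
rng = np.random.default_rng(7)
leniency = rng.normal(580, 15, (10, 1))                    # per-panelist leniency
scores = leniency + rng.normal(0, 10, (10, 3))
F, (df1, df2) = rm_anova_1way(scores)
print(f"F({df1},{df2}) = {F:.2f}")
```

With 10 panelists and 3 methods, the degrees of freedom match the F(2,18) reported in the results.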
Results: Ten of 13 invited surgeons participated. The mean SIMPL OP cut score for graduation did not differ significantly across the 3 standard setting methods (F(2,18) = 0.63, p = 0.54). A graduation standard of 582 on the SIMPL OP scale was proposed, anchored in the Construct Mapping method due to panelists' perception of its efficiency, utility in combining criterion- and norm-referencing, and ability to translate the standard across procedures. Figure 1 illustrates how the 582 cut score translates to predicted probabilities of a trainee achieving a "practice-ready" rating for common procedures. Other key themes included: 1) defining a "minimally competent graduating trainee"; 2) balancing disease management vs procedural competence; and 3) challenges in addressing variability in case mix and complexity.
Conclusions: This study demonstrated the feasibility and reliability of adapting existing standard setting methods to WBA data, with qualitative support for Construct Mapping. These findings can inform future competency standard setting efforts using single- and multi-source WBA data, facilitating full implementation of CBME in surgery.
(S028) THE RELATIONSHIP BETWEEN ENTRUSTMENT AND NARRATIVE EPA FEEDBACK
Gabrielle M Moore, MD1, Dalton Hegeholz, MD1, Ting Sun, PhD1, M. Libby Weaver, MD, MHPE1, Erin Ward, MD1, Erika Simmerman Mabes, DO2, Kshama Jaiswal1; 1University of Utah, 2Wellstar MCG Health
Background
Entrustable Professional Activities (EPAs) provide formative feedback for core activities of general surgery and can create a performance portfolio for resident assessment to inform practice-readiness. Preliminary work in vascular surgery has demonstrated that high-performing trainees are more likely to receive low-quality or no feedback as compared to lower performing trainees. This has yet to be studied in general surgery (GS). Here, we examine the relationship between GS trainees' entrustment levels and receipt of narrative EPA feedback, as well as other assessment factors.
Methods
This is a single institution, retrospective study including all EPA assessments of GS trainees by faculty from 2023-2025. Mixed-effects ordinal regression was performed to examine the association between entrustment and receiving narrative feedback, year of training, case complexity, assessment phase of care, and faculty and resident demographics.
Results
A total of 806 assessments were analyzed, with 673 EPAs (83%) containing narrative feedback and 133 EPAs (17%) without any feedback. Trainees with higher entrustment were less likely to receive narrative feedback than trainees with lower entrustment (OR = 0.48, p = 0.002). The likelihood of receiving higher entrustment increased significantly with each year of training, peaking in the senior years. Trainees who participated in straightforward and moderate level cases were more likely to receive higher entrustment than those participating in complex cases (OR = 1.88, p = 0.007 and OR = 1.62, p = 0.031, respectively). Pre-operative phase assessments were more likely to result in higher entrustment than intraoperative phases (OR = 6.25, p < 0.001) (Figure 1). There was no difference in entrustment based on trainee or faculty demographics after controlling for post-graduate year and case complexity.
Conclusion
Higher entrusted residents receive less narrative feedback than their lower entrusted peers, limiting their ability to reach their fullest potential. Efforts should be made to ensure all residents receive actionable feedback along the continuum of their training.
Figure 1: Entrustment level based on receipt of feedback (A) and case complexity/phase of care (B). *p < 0.05, **p < 0.01, ***p < 0.001
(S029) FROM BEST TO NEXT: IMPROVING TESTING OF CLINICAL JUDGEMENT AND KNOWLEDGE ON A NATIONWIDE EXAM
Maryam Wagner, PhD, Paola Fata, Carlos Gomez-Garibello; McGill University
Background
Clinical judgement, decision-making, clinical reasoning, and problem-solving are several phrases used to describe health professionals' application of knowledge and skills to clinical problems for making treatment decisions. This judgement is an interpretative practice; accordingly, there are differences of opinion among health professionals' decisions. This complexity is addressed on assessments of clinical knowledge through the use of the term 'best management'. We conducted a study to better understand how test-takers interpret and apply 'best management' on a national exam of surgical knowledge.
Summary of Work
We analyzed the multiple choice questions used on three administrations of an annual mandatory national exam for general surgery residents. We identified all the items using ‘best management’ in the stems. Using this classification, we calculated the frequency of its use, and investigated test-takers’ performance on these items based on item difficulty and discrimination index. Additionally, we examined test-takers’ perception of ‘best management’ through a post-exam survey (N=432) using two questions probing their interpretation of the term and its application, and a third, open-ended question, probing their opinions and challenges with the use of ‘best management’ on the exam.
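The two item-level statistics used above come from classical test theory: difficulty is the proportion of examinees answering correctly, and discrimination is the point-biserial correlation between an item and the rest-of-test score. The sketch below illustrates both on simulated 0/1 response data (not the exam's data), including one item deliberately unrelated to ability, which should show near-zero discrimination.

```python
import numpy as np

def item_stats(responses):
    """Classical item analysis. responses: (n_examinees, n_items) 0/1 matrix.
    Returns per-item difficulty (proportion correct) and discrimination
    (point-biserial correlation of each item with the rest-of-test score)."""
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    disc = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]        # exclude the item from its total
        disc[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, disc

# simulated exam: 5 ability-linked items plus 1 item unrelated to ability
rng = np.random.default_rng(3)
ability = rng.normal(0, 1, 800)
b = np.linspace(-1, 1, 5)                     # item difficulties, logit scale
linked = (rng.random((800, 5)) < 1 / (1 + np.exp(b - ability[:, None]))).astype(int)
noise = (rng.random((800, 1)) < 0.5).astype(int)
difficulty, disc = item_stats(np.hstack([linked, noise]))
```

An item that is hard but non-discriminating, like the 'best management' items reported below in aggregate, behaves much like the noise item here.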
Summary of Results
Item analyses revealed that ‘best management’ is used in approximately 50% of items, and these items are more difficult than those without it (F(1,415) = 6.97, p = .009). However, items using the phrase do not discriminate learners’ performance (F(1,415) = .726, ns). The post-exam survey revealed two popular interpretations of the meaning of ‘best management’: “next most appropriate investigation or intervention” and “the definitive overall clinical decision”. Coding of the narrative comments revealed several emergent themes related to ‘best management’, including: multiple interpretations contribute to ambiguity; necessity for use of timeframe and context; and interpretation dependent on response options.
Conclusions
‘Best management’ increases the difficulty of exams testing clinical knowledge but does not discriminate performance. Contributing to the difficulty are test-takers’ differing interpretations of the phrase and its subsequent application. Test-takers highlighted the ambiguity of ‘best management’. We used this information to revise the exam, substituting the term with less ambiguous phrasing such as ‘next best step’.
(S030) PGY-1 DECISION-MAKING SKILLS: EVIDENCE FROM SEVEN YEARS OF FORMATIVE ASSESSMENT RESULTS
Kathleen R Liscum, MD, FACS1, Yoon Soo Park, PhD2, Patrice Gabler Blair, DrPH, MPH3, Kevin Wasielewski, MPH, MBA3, Enjae Jung, MD, FACS4, Edgardo Salcedo, MD, FACS5, Ajit K Sachdeva, MD, FACS, FRCSC, FSACME, MAMSE3; 1Retired, 2University of Illinois Chicago, 3American College of Surgeons, 4Oregon Health and Sciences University, 5University of California, Davis
Background/Objectives
A formative online assessment is being used nationally to measure the clinical decision-making skills of PGY-1 residents during the first week of residency. The assessment employs a key-features approach focused on clinical topics and decision points prone to error among entering residents. Results provide percent-correct scores, identify potentially harmful actions, and offer peer comparisons at individual and program levels. Program directors and residents use these results to create individualized learning plans and modify curricular plans, as needed.
Methods
Results from seven annual administrations (2018–2024) were analyzed, including participant demographics, resident and program-level scores, and outcomes for 140 decision points spanning 20 clinical topics. Data from the 2025 administration will be added when finalized. Descriptive statistics and regression analyses were used to examine longitudinal trends and identify areas of consistently high or low performance.
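The longitudinal trend analysis described above amounts, in its simplest form, to fitting a least-squares slope to each topic's mean score across the seven administrations. The sketch below uses hypothetical topic scores for illustration, not the program's data.

```python
import numpy as np

def annual_trend(years, scores):
    """Least-squares slope of mean topic scores, in percentage points per year."""
    slope, _ = np.polyfit(years, scores, 1)
    return float(slope)

years = np.arange(2018, 2025)
# hypothetical mean scores (%) for one declining topic across administrations
topic_scores = np.array([55, 54, 53, 52, 51, 50, 51])
print(f"trend: {annual_trend(years, topic_scores):+.2f} points/year")
```

A negative slope of roughly half a point to a point per year over seven years is the kind of pattern behind the 3%-8% declines reported below.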
Results
National data from 5,092 residents across 626 cohorts showed stable overall performance at 65% correct (SD = 8%) over the seven-year period. Topics with consistently lower mean scores included change in respiratory rate (51%), abdominal pain (57%), and irregular heartbeat (58%), whereas uniformly higher scores were observed for somnolence (73%), hypotension (73%), and chest pain (73%). Mean scores for change in respiratory rate, traumatic extremity ischemia, oliguria, and fever etiologies declined significantly in recent years (3%–8% decrease).
See Figure 1 for score distributions by clinical topic.
Conclusion
Findings identify clinical topics where decision-making skills may require targeted reinforcement in medical school or early residency. Comparisons with national benchmarks provide valuable feedback for developing individualized learning plans and refining residency curricula to enhance readiness for safe patient care.
Figure 1. Score Distribution by Clinical Topic: Box Plots (n = 5,092)

Note: National score distribution by clinical topic area. Box plots represent the range, interquartile range (25th to 75th percentiles), and median (50th percentile).
