Podium IB - Assessment
BUILDING A BETTER SUB-INTERNSHIP EVALUATION: A MULTI-STAKEHOLDER APPROACH
Anthony O Morada, MD, Jessica L Becker, MD, Michael J Furey, DO, Christian Hailey Summa, DO, Kristen R Richards, MD, Jacob Bodde, MD, Rebecca Michelle Jordan, DO, Joseph P Bannon, MD, Alexandra Falvo, MD; Geisinger Northeast General Surgery Residency
Background: Sub-internships are essential components of the residency selection process; however, many programs lack validated assessment frameworks that adhere to principles of educational science. Traditional evaluative methods rely heavily on unstructured, single-source assessments of uncertain reliability, limiting both trainee development and programmatic decision-making. This study seeks to advance surgical education assessment by implementing and validating a comprehensive evaluation system grounded in the established literature on workplace-based assessment and psychometric principles.
Methods: Building on validated surgical education assessment instruments, including the Objective Structured Assessment of Technical Skills (OSATS), the Mini-Clinical Evaluation Exercise (Mini-CEX), and 360-degree feedback, we developed a structured, competency-based evaluation system employing behaviorally anchored 5-point rating scales targeting ACGME competencies. Our multi-stakeholder approach systematically gathered perspectives from attending surgeons, residents, and clinical staff. Implementation spanned five months, enabling prospective assessment of reliability, feasibility, and discriminant validity. Variance decomposition identified optimal evaluator thresholds, and score patterns were analyzed across evaluator types and competency domains.
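The abstract does not specify the exact model behind the variance decomposition, so the sketch below is only an illustration of one common approach in R: a crossed random-effects model (lme4) followed by a Spearman-Brown projection to estimate how many evaluations per student reach a target reliability. The data frame `evals` and its column names are hypothetical.

```r
library(lme4)

# Crossed random effects for students and raters; `evals` has one row per
# completed evaluation with columns student, rater, and score (1-5).
m <- lmer(score ~ 1 + (1 | student) + (1 | rater), data = evals)

vc        <- as.data.frame(VarCorr(m))
v_student <- vc$vcov[vc$grp == "student"]
v_other   <- sum(vc$vcov[vc$grp != "student"])   # rater + residual variance

# Reliability of a single evaluation, then the Spearman-Brown projection of
# the reliability of the mean of k evaluations per student.
icc_1 <- v_student / (v_student + v_other)
k     <- 1:10
icc_k <- (k * icc_1) / (1 + (k - 1) * icc_1)
data.frame(k, reliability = round(icc_k, 2))   # smallest k reaching a target (e.g., 0.80)
```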
Results: Twenty-four evaluations across eight students demonstrated strong psychometric properties and practical feasibility. A 100% completion rate and a brief time investment (mean 4.2 minutes) supported stakeholder acceptability. Multi-source assessments came from attending physicians (17%), residents (79%), and clinical staff (4%). Inter-rater reliability ranged from good to excellent: overall intern-readiness ICC=0.93 (95% CI: 0.79-0.995), technical skills ICC=0.92 (0.65-1.0), initiative ICC=0.79 (0.48-0.98), communication ICC=0.75 (0.43-0.98), and clinical knowledge ICC=0.66 (0.30-0.97), with professionalism (ICC=0.47) showing adequate consistency for multi-rater assessment. Discriminant validity was supported by appropriate identification of performance concerns (17% flagged). Evidence-based analysis revealed that 4-5 evaluations per student optimized reliability while maintaining feasibility.
Conclusions: This study demonstrates that systematic, evidence-based sub-internship evaluation informed by assessment science achieves both psychometric rigor and practical implementation success. By validating optimal evaluator thresholds and a multi-stakeholder approach, this work advances surgical education assessment science. The scalable framework enhances the training environment through structured feedback while producing objective data that support residency selection decisions and help identify future skilled, compassionate surgeons.

A BAYESIAN FRAMEWORK FOR HOLISTIC ASSESSMENT OF SURGICAL RESIDENT TECHNICAL SKILL
Jessica R Santos-Parker, MD, PhD, Keli S Santos-Parker, MD, PhD, Shareef Syed, MD, Adnan Alseidi, MD, Hueylan Chern, MD, Patricia S O’Sullivan, PhD; Department of Surgery, University of California San Francisco
Background: Technical skills assessments often include measures of speed and proficiency. However, task-specific measures limit holistic interpretation of performance because of the constraints of traditional analytic methods. We analyzed surgical intern skills assessments using a joint Bayesian model of speed and rated proficiency to estimate a latent composite performance score and quantify ranking reliability across multiple tasks. This framework identifies holistic speed-proficiency patterns, supporting targeted feedback.
Methods: Forty-two surgical interns were timed on nine standardized technical tasks (iterations of open knot tying and of superficial and deep suturing) and rated by 18 faculty on a 5-point ordinal proficiency rubric during a foundational skills assessment at a large academic institution. Analyses were conducted in R (v4.4.1) using brms and lme4. A hierarchical Bayesian mixed model jointly estimated log-time and proficiency rating across residents, tasks, and faculty raters. Resident-specific speed and proficiency were estimated as random intercepts and summed to define a composite score. Composite score reliability was quantified with Bayesian generalizability analysis.
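A minimal sketch of the kind of joint speed-proficiency model described, written with brms (which the Methods name). The long-format data frame `d` and its columns (resident, task, rater, time_sec, rating) are assumptions, and the authors' actual specification may differ.

```r
library(brms)

d$log_time <- log(d$time_sec)   # model speed on the log-time scale

# Joint model: Gaussian log-time plus ordinal proficiency rating, with
# correlated resident intercepts (the shared |p| term) and task and rater
# random intercepts in each sub-model.
fit <- brm(
  bf(log_time ~ 1 + (1 | p | resident) + (1 | task) + (1 | rater),
     family = gaussian()) +
    bf(rating ~ 1 + (1 | p | resident) + (1 | task) + (1 | rater),
       family = cumulative("logit")),
  data = d, chains = 4, cores = 4
)

# The correlated resident-level intercepts (latent speed and proficiency) can
# then be extracted with ranef(fit)$resident and summed, negating the log-time
# intercept so that faster residents score higher, to form a composite score.
```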
Results: Resident composite score rankings were stable, with 85% [95% CrI 81–89%] agreement across posterior draws. Bayesian generalizability analysis yielded moderate reliability (G = 0.65 [95% CrI 0.42–0.78]) for nine tasks. Latent speed and proficiency were positively correlated (ρ = 0.49 [95% CrI 0.17–0.74]), with speed accounting for 44% (95% CrI 17–68%) of the variation in proficiency, indicating substantial proficiency differences among similarly fast residents. The holistic skill plane (Figure 1) plots each resident's composite score by speed and proficiency and can be read as performance archetypes: an intern in the bottom-left quadrant but right of the trend line has low overall skill and is relatively proficient but slow, whereas an intern in the top-right quadrant but left of the line has high overall skill and is relatively fast but less proficient.
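A small sketch of how a skill plane like Figure 1 could be drawn, assuming hypothetical per-resident posterior-mean vectors `speed` (coded so higher = faster) and `proficiency` extracted from the fitted model; the composite is their sum, and the dashed line is the fitted speed-proficiency trend. The actual figure's axes and trend line may be constructed differently.

```r
library(ggplot2)

# Hypothetical per-resident posterior means from the joint model.
df <- data.frame(speed, proficiency, composite = speed + proficiency)

ggplot(df, aes(x = speed, y = proficiency, color = composite)) +
  geom_point(size = 3) +
  # residents off the trend line are relatively faster or relatively more
  # proficient than their overall composite score would suggest
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "grey40") +
  labs(x = "Latent speed (higher = faster)", y = "Latent proficiency",
       color = "Composite score", title = "Holistic skill plane")
```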
Conclusions: Joint Bayesian modeling enables holistic interpretation of resident technical performance by integrating speed and proficiency into a single latent composite score with quantified reliability. The approach identifies resident archetypes within overall speed-proficiency trends, with implications for coaching advice on how to practice: whether to emphasize speed or technique. Educators can move beyond single-task metrics to provide personalized, holistic feedback and identify ineffective practice habits early.

ANALYSIS OF SURGICAL RESIDENTS IN THE AAMC RESIDENT READINESS SURVEY
Chase C Marso, MD, Roy Phitayakorn, MD, MHPE, Sophia McKinley, MD, MEd, Dandan Chen, PhD; Massachusetts General Hospital
Background: The AAMC initiated the Resident Readiness Survey (RRS) program in 2019 to support and assess the transition of trainees from medical school to intern year. Among all specialties, general surgery programs report some of the highest rates of incoming residents not meeting overall performance expectations. The aim of this study was to analyze RRS responses to understand which specific clinical and non-clinical skills are rated as deficient among general surgery interns compared with interns in other surgical subspecialties.
Methods: De-identified RRS quantitative data from 2019-2024 were obtained from the AAMC. Overall resident performance and performance in specific clinical and non-clinical domains were compared between general surgery and other surgical subspecialties (neurosurgery, orthopedic surgery, integrated plastic surgery, integrated thoracic surgery, and integrated vascular surgery). Descriptive statistics and chi-square tests were used to identify patterns in assessments of residents failing to meet expectations.
Results: From 2019-2024, 303 general surgery residency programs completed 4500 assessments; 441 surgical specialty programs completed 3107 assessments. In that time, significantly more general surgery interns (224/4495, 4.9%) were rated by their program directors as not meeting overall performance expectations compared to surgical subspecialty interns (53/3100, 1.7%) (p<0.001) (Figure). Among specific skills, general surgery interns were significantly more likely to receive ratings of “Did not meet expectations” compared with interns in surgical subspecialties, particularly for organization and timeliness (5.95% vs 2.27%, p<0.001), procedures/consent (3.27% vs 1.18%, p<0.001), and prioritizing a differential diagnosis (3.22% vs 0.93%, p<0.001), as shown in the Figure.
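As an illustration, the overall comparison can be reproduced in R from the counts reported above; this is a sketch of the standard chi-square test of independence, not the authors' code.

```r
# 2x2 table of "did not meet overall expectations" vs. "met expectations"
# for general surgery and surgical subspecialty interns, 2019-2024.
tab <- matrix(c(224, 4495 - 224,
                 53, 3100 - 53),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("General surgery", "Surgical subspecialty"),
                              c("Did not meet", "Met")))
prop.table(tab, margin = 1)   # proportions not meeting expectations per group
chisq.test(tab)               # p < 0.001, consistent with the reported comparison
```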
Conclusions: The overall proportion of general surgery interns not meeting expectations as rated by their program directors is low, but it is higher than for interns in surgical subspecialty programs. Additional qualitative analysis of program directors' comments in the RRS may provide further insight into intern deficiencies. Understanding specific deficiencies may create opportunities for targeted educational interventions at medical schools and residency programs during the UME-to-GME transition.

CHARACTERIZING HOW SURGICAL RESIDENTS EVALUATE FACULTY: A DOCUMENT ANALYSIS OF FACULTY TEACHING EVALUATION TOOLS
Yichuan Yan, MSED, Nathan G Behrens, MD, Dimitrios Stefanidis, MD, PhD; Indiana University School of Medicine
Purpose: Feedback on faculty teaching is crucial for improving faculty performance. Surgical training programs obtain such feedback from their residents and provide it to their faculty. However, the content and quality of the faculty teaching evaluations in use are unknown, as no standardized tool exists. The aim of this study was to identify the domains covered by, and the variation among, faculty teaching evaluation tools used in surgical residency programs.
Methods: Using convenience sampling, blank copies of the faculty teaching evaluation forms that residents complete were collected from U.S. surgery residency programs. A document analysis of the individual question items was performed to identify common themes and evaluation domains. Descriptive statistics and tabulations were used to summarize the data.
Results: Fourteen faculty teaching evaluation tools from thirteen surgical residency programs were collected and analyzed. Content analysis of all question items revealed four major evaluation domains: teaching effectiveness (55.1% of codes), professionalism (25.9%), clinical performance (12.1%), and overall performance (6.9%). Figure 1 demonstrates the variability across the 14 tools in the distribution of question items across the four evaluation domains and three question types. The mean number of questions per evaluation was 14.7 (range 4-23), comprising rating questions (mean 9.6), free-text comments (4.4), and checklist questions (0.6).
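For illustration, summaries of this kind can be produced with simple tabulations; the sketch below assumes a hypothetical coded data set `items` with one row per question item and columns tool, domain, and question_type.

```r
# Share of coded question items falling in each evaluation domain
round(100 * prop.table(table(items$domain)), 1)

# Number of questions per evaluation tool (mean and range)
n_per_tool <- table(items$tool)
c(mean = mean(n_per_tool), min = min(n_per_tool), max = max(n_per_tool))

# Per-tool distribution of items across domains (a Figure-1-style summary)
round(prop.table(table(items$tool, items$domain), margin = 1), 2)
```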
Conclusions: Substantial variability was identified in the faculty teaching evaluations used across surgical training programs. Besides teaching effectiveness, faculty professionalism and, less frequently, clinical performance were also evaluated. Standardizing faculty teaching evaluation tools could strengthen assessment validity, yield more actionable faculty feedback, and enable comparisons across institutions.

STANDARD SETTING IN THE ERA OF WORKPLACE-BASED ASSESSMENT: A PILOT EXPLORATION
Alyssa A Pradarelli1, Kayla M Marcotte, MD, PhD1, Brian C George, MD, MAEd1, Tyler J Loftus, MD2, James R Korndorffer Jr, MD, MHPE3, David T Hughes, MD1, Gifty Kwakye, MD, MPH1, Sophia K McKinley, MD, EdM4, Erin M White, MD, MBS, MHS5, Jordan D Bohnen, MD, MBA6, Andrew E Krumm, PhD1; 1University of Michigan, 2University of Florida Health, 3The University of Texas at Austin Dell Medical School, 4Massachusetts General Hospital, 5University of Alabama, 6Beth Israel Deaconess Medical Center
Background: Surgical training has seen a significant increase in the use of workplace-based assessments (WBAs), especially to assess entrustable professional activities (EPAs). Performance standards are necessary to realize the full potential of WBAs for competency-based education but are currently lacking. This study explored three approaches to establishing WBA performance standards using operative performance WBA data.
Methods: An expert panel of surgical educators, representing diverse specialties, programs, and educational roles, participated in a two-hour virtual session. Panelists were trained on the Society for Improving Medical Professional Learning (SIMPL) OR WBA tool and the SIMPL Operative Performance (OP) score, which summarizes cumulative operative performance ratings using methods adapted from computer adaptive testing and Bayesian inference networks. The panel applied three established standard-setting methods (Angoff, Bookmark, and Construct Mapping) to define a general surgery graduation standard (i.e., readiness for independent practice) for the OP score. The congruency of the outputs of the three methods was assessed using a one-way repeated-measures analysis of variance (ANOVA), with standard-setting method as the within-subjects factor. The session transcript was analyzed inductively using interpretive description methodology to identify themes for future iterations.
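A minimal sketch of the repeated-measures comparison described, assuming a hypothetical long-format data frame `cuts` with one proposed cut score per panelist per method (factor columns panelist and method, numeric cut_score):

```r
# One-way repeated-measures ANOVA with standard-setting method as the
# within-subjects factor; with 10 panelists and 3 methods this yields the
# F(2, 18) test of congruency across Angoff, Bookmark, and Construct Mapping.
summary(aov(cut_score ~ method + Error(panelist / method), data = cuts))
```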
Results: Ten of the 13 invited surgeons participated. The mean SIMPL OP cut score for graduation did not differ significantly across the three standard-setting methods (F(2,18) = 0.63, p = 0.54). A graduation standard of 582 on the SIMPL OP scale was proposed, anchored in the Construct Mapping method because of panelists’ perception of its efficiency, its utility in combining criterion- and norm-referencing, and its ability to translate the standard across procedures. Figure 1 illustrates how the 582 cut score translates to predicted probabilities of a trainee achieving a “practice-ready” rating for common procedures. Other key themes included: 1) defining a “minimally competent graduating trainee”; 2) balancing disease management versus procedural competence; and 3) challenges in addressing variability in case mix and complexity.
Conclusions: This study demonstrated the feasibility and reliability of adapting existing standard-setting methods to WBA data, with qualitative support for Construct Mapping. These findings can inform future competency standard-setting efforts using single- and multi-source WBA data, facilitating full implementation of competency-based medical education (CBME) in surgery.
