Podium Session I B - Education Technology
Introduction:
Artificial intelligence (AI)-powered language models, such as ChatGPT-4, are promising healthcare education tools for both patients and professionals. For instance, ChatGPT-4 may soon be used to develop quality, cost-effective alternatives for surgery trainee examination preparation and streamline patient education to combat hospital readmission. However, skepticism surrounding the model's accuracy and reliability remains a significant barrier to its widespread adoption. Although ChatGPT-4 has demonstrated proficiency in answering patient-posed questions and written and oral neurosurgery “board-style” multiple-choice questions, the model’s declarative general surgery knowledge and clinical judgment are unknown. This study subjects ChatGPT-4 to written and oral assessments modeled after the traditional two-part board certification process in general surgery. The results may be used to bridge the gap between AI and practical applications in healthcare and revolutionize how surgical care is learned and delivered.
Methods:
250 multiple-choice questions (MCQs) were randomly selected from the Surgical Council on Resident Education (SCORE) web portal’s question bank and input into ChatGPT-4, which provided an answer choice for each MCQ. Two former board examiners evaluated ChatGPT-4’s clinical decision-making skills via four crafted mock oral case scenarios derived from the Entrustable Professional Activities (EPA) topic and assessment framework: acute abdomen, gallbladder disease, right lower quadrant pain and appendicitis, and benign and malignant breast disease.
Results:
ChatGPT-4 answered 197 of 250 SCORE MCQs (78.8%) correctly, which approximately corresponds to scores in the 94th percentile of first-year general surgery residents and the 90th percentile of fifth-year residents on the American Board of Surgery In-Training Examination (ABSITE). ChatGPT-4 committed critical failures precluding practice-ready entrustability in 3 of 4 clinical scenarios. The most common reasons for failure were an incorrect type or timing of the suggested operative intervention.
Conclusion:
The limitations in ChatGPT-4's clinical judgment necessitate a cautious approach to its use in surgical education and care delivery. Future research should focus on developing strategies to enhance the model's contextual understanding and clinical judgment, possibly through the integration of specialized datasets or the implementation of advanced reinforcement learning techniques.
Figure 1: Interact with Language Model (Registration Required)

Despite the easing of travel restrictions and health risks associated with the spread of COVID-19, virtual general surgery interviews have persisted. This is due to their many benefits, primarily time and cost savings to applicants and programs alike. However, limitations exist, and some programs, as well as applicants, have advocated for a return to the in-person setting due to challenges of assessing potential fit and program culture. In the 2022-2023 application cycle, our institution offered applicants the choice to interview virtually or in-person.
This study describes our methods of comparing in-person and virtual interviewees and identifies overall trends among applicants who were given a choice of interview format.
Applicants who received an interview invitation for a categorical general surgery residency position at our institution were offered the option of an in-person or virtual interview. Applicants were assured that their interview type preference would not impact their ranking. Four in-person interview dates and four virtual interview dates were offered and filled. Applicants were scored by day, thereby only being compared to those interviewing in the same modality. Conglomerate scores were then used to generate an overall rank list. After rank lists were submitted, all US MD applicants to our institution were contacted to complete an anonymous electronic survey assessing their interview modality preference and reasoning.
Of the 131 who interviewed, 72 (55%) elected to interview in-person. 162 of 727 total US general surgery applicants completed the survey (22.3%). In the broader applicant pool, 52% of survey respondents indicated they were given the option to choose their interview medium for at least one program. Of those, 73% of applicants elected to complete at least one interview in-person and 60% planned to visit programs after their interview.
Our program implemented a hybrid interview system that allowed applicants to choose how they interviewed. When afforded the choice, most applicants elected to complete at least one interview in-person and planned to visit programs in person after their interview. This preference should not be ignored. Our study demonstrates that a more individualized interview process, which offers candidates both options, can be implemented successfully.
INTRODUCTION
Given the limited availability of experienced instructors for procedural teaching, artificial intelligence (AI) offers promising tools for simulation-based training. In a previous study, we developed an AI-based object detection algorithm designed for image recognition in simulated laparoscopic training assessment. We further integrated this tool into our remote and asynchronous video-platform, offering exercise time measurement and error pattern suggestions based on image recognition. The purpose of this research is to test the algorithm’s accuracy in judging these tasks and compare its performance to that of expert teachers.
METHODS AND PROCEDURES
Three basic laparoscopy training exercises (bean drop (BD), peg transfer (PT) and precision cutting (PC)) were selected because the embedded deep learning model (YOLOv4 with a logic implementation algorithm) had already received substantial exposure to them. Sixty new videos of each exercise were collected, with five random frames extracted from each video. Data on the algorithm's image detection and triangulation were collected from these video-frames. Expert instructors reclassified the elements within each frame. Exercise completion times proposed by the algorithm were compared to official times assigned by instructors during conventional evaluations. Algorithm accuracy of element detection and mean absolute time error were calculated for each exercise. Confusion matrices were constructed.
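The metrics named above can be illustrated with a minimal sketch. It assumes that detections are tallied against the expert labels as true positives, false positives, and false negatives; the `Counts` class, the function names, and all numbers are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Detection tallies for one exercise (hypothetical structure)."""
    tp: int  # detections that the expert labels confirmed
    fp: int  # detections that the expert labels rejected
    fn: int  # expert-labelled elements that the algorithm missed


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp)


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


def mean_absolute_error(predicted, reference) -> float:
    """Mean absolute difference between algorithm and instructor times."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up example values, for illustration only.
example_counts = Counts(tp=90, fp=10, fn=5)
algorithm_times = [30.0, 42.5, 51.0]   # seconds, hypothetical
instructor_times = [32.0, 41.0, 50.0]  # seconds, hypothetical

if __name__ == "__main__":
    print(f"precision={precision(example_counts):.3f}")
    print(f"recall={recall(example_counts):.3f}")
    print(f"F1={f1_score(example_counts):.3f}")
    print(f"MAE={mean_absolute_error(algorithm_times, instructor_times):.2f} s")
```

Because F1 combines precision and recall, it can be higher or lower than precision alone, depending on how many elements the algorithm misses.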
RESULTS
The algorithm examined the full dataset of 900 video-frames from the three exercises. The algorithm's precision in element detection was 97.5% for BD, 97.1% for PT, and 75.4% for PC exercises when compared to human assessments. F1-scores of 0.955, 0.976 and 0.814 were achieved for BD, PT and PC exercises, respectively. Using human-measured times as references, the algorithm exhibited mean absolute time errors of 2.2 seconds for BD, 1.8 for PT, and 9.9 for PC exercises (Table 1).
CONCLUSIONS
An AI algorithm integrated into our remote and asynchronous video-platform demonstrates significant accuracy in element detection and time measurement across the assessed exercises. Validation of these basic tools emerges as a promising step towards automated AI-driven assessment.

Introduction
Cognitive load (CL), or the demand incurred on working memory, has been identified as a potential barrier to surgical performance and skill acquisition. However, CL measurement often involves subjective measures or invasive physiological measurements (e.g., electroencephalogram). Eye tracking has been proposed as a novel, non-invasive tool for measuring CL. This study aimed to generate validity evidence for the use of eye-tracking metrics to evaluate CL during the performance of simulated laparoscopic and robotic surgical tasks.
Methods
A crossover randomized study was conducted with all participants completing robotic and laparoscopic tasks of varying complexity (basic task: peg transfer, advanced task: suturing under tension). Objective metrics for cognitive load were captured using a mobile eye-tracker (using pupil diameter, gaze fixation duration, eye movement velocity, and saccadic amplitude, i.e., the distance traveled during eye movements), and a visuospatial secondary task to gauge spare attention capacity (using response time and correct hit rates). Participants' self-reported workload was also captured with the NASA-TLX survey. Differences in measurements were compared between tasks using paired t-tests and Pearson’s correlation was used to compare eye metrics with other measures of cognitive load.
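As a rough illustration of the two statistical tools named above, the sketch below computes a paired t statistic and Pearson's correlation coefficient using only the Python standard library. The function names and sample values are hypothetical, and p-values are omitted because they require the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(condition_a, condition_b):
    """Return (t, degrees_of_freedom) for paired measurements."""
    if len(condition_a) != len(condition_b) or len(condition_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    # Note: raises ZeroDivisionError if all differences are identical.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical pupil diameters (mm) for the same participants in two tasks.
advanced_task = [3.9, 4.1, 4.4, 4.0]
basic_task = [3.8, 3.9, 4.3, 3.9]

if __name__ == "__main__":
    t, dof = paired_t_statistic(advanced_task, basic_task)
    print(f"t={t:.2f} with {dof} degrees of freedom")
```

The paired form is used because each participant contributes a measurement under both conditions, so the test operates on within-participant differences rather than on two independent groups.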
Results
Data were captured from 4 junior surgery residents with limited laparoscopic and robotic experience. Participants demonstrated larger mean pupil diameter for both eyes (MD=0.13±0.08 mm, p=0.002), more whole-fixations (MD=0.11±0.09 fixations, p=0.011), and lower saccadic amplitude (MD=-0.356±3.8 degrees, p=0.048) during the advanced task than during the basic task, indicating heightened cognitive load. Further, saccadic amplitude was positively correlated with hit rate on the secondary visuospatial task (r=0.625, p=0.017) and negatively correlated with subjective mental demand (r=-0.626, p=0.017), both consistent with increased CL. No differences were seen between robotic and laparoscopic tasks.
Conclusion
The results from our pilot study add validity evidence to the use of eye-tracking metrics to measure surgical novice CL. In addition to discriminating CL between basic and advanced surgical tasks, these metrics correlated with perceived mental demand and secondary task metrics suggesting they can be effective measures of CL.
Background:
Conventional methods of preparation for oral board exams predominantly involve a combination of independent self-study and "mock orals" practice sessions, which mimic the real exam using conversational scenarios. Advancements in artificial intelligence (AI) and large language models (LLMs) have introduced unique opportunities in medical education. The most popular and available AI model at this time is OpenAI's Chat Generative Pre-trained Transformer (ChatGPT). The aim of this study was to evaluate the potential of ChatGPT as an examiner in mock oral board scenarios.
Methods:
We used the ChatGPT 4.0 model to examine PGY4 and PGY5 residents from two general surgery programs. Each participant was given 3 scenarios with 8 minutes per scenario. The scenarios were randomly selected by the LLM. At the completion of the scenarios, participants were given an adapted version of the reduced Students' Evaluation of Education Quality Questionnaire (r-SEEQ) survey. The transcripts from the mock orals scenarios were then printed and reviewed by two independent reviewers to assess both the examiner and examinee for relevance, quality, conversation flow, and accuracy.
Results:
Six residents completed the pilot project for a total of 18 scenarios. The most common scenarios encountered were appendicitis (16.7%) and diverticulitis (16.7%). Surgical management was indicated in 15 of the 18 scenarios (83.3%). The survey showed excellent reliability (α = 0.93) but yielded mixed feedback regarding the effectiveness of the platform. Participants gave high scores for the platform’s medical accuracy, usefulness, and effect on their interest in studying. The lowest-scoring items related to the platform giving too much information and presenting scenarios that were too easy.
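The reliability figure above is reported only as α. Assuming it refers to Cronbach's alpha, which the abstract does not state explicitly, a minimal sketch of the standard formula looks like this. The function name and the response matrix are hypothetical and are not the study's survey data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("each respondent needs the same number (>= 2) of items")
    # Transpose so that each tuple holds all respondents' scores for one item.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(item) for item in items)
    # Note: raises ZeroDivisionError if every respondent has the same total.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical Likert-style responses: rows are respondents, columns are items.
example_responses = [
    [4, 5, 4],
    [3, 3, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 4],
    [3, 4, 3],
]

if __name__ == "__main__":
    print(f"alpha={cronbach_alpha(example_responses):.2f}")
```

Alpha approaches 1 when items rise and fall together across respondents, which is why it is commonly read as a measure of internal consistency.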
Discussion:
LLM platforms present a unique opportunity for oral examinations due to the conversational nature of the exams. Our use of ChatGPT allowed for easy access and adoption of the technology among participants. However, the general-purpose nature of ChatGPT may limit its ability to present complex scenarios and perform the more nuanced role of oral board examiner. Independent review of the conversations is in progress. Improvements to the clinical scenarios and a more specialized AI platform would likely refine the user experience without compromising quality.
Introduction:
Prior studies have identified factors, such as proficiency with video games or musical instruments, that predict baseline aptitude on laparoscopic and endoscopic skills, though few have elucidated predictors for novice success on robotic simulation. This study investigated physical and recreational determinants that may correlate with baseline aptitude on the da Vinci Surgical Skills Simulator amongst medical students.
Methods:
Medical students at a large academic institution provided consent and were enrolled. After instruction on basic operational skills of the da Vinci Surgical Skills Simulator, they completed two attempts of the SeaSpikes Skills Test, which involves placing rings on cones and is scored on time and accuracy. Performance parameters measured included total scores, score change between attempts, economy of motion, completion time, number of items dropped, and use of excessive force. Participants completed a survey querying baseline demographics, physical factors (e.g., handedness, hand size), and prior experiences including video game usage, knitting, chopsticks usage, musical instrument proficiency, and sports participation. Data were analyzed using Chi-squared and Mann-Whitney U tests, with significance set at p < 0.05.
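For readers unfamiliar with the Mann-Whitney U test named above, the following sketch computes the U statistic for two independent groups, using average ranks for ties. It is an illustration only: the function name and scores are hypothetical, and the p-value calculation is not included.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples, averaging tied ranks."""
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    # Pool the samples, remembering which group each value came from.
    combined = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Positions i..j-1 share ranks i+1..j; assign their average.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    n_a, n_b = len(group_a), len(group_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical simulator scores for two groups of students.
frequent_players = [81, 90, 75]
infrequent_players = [60, 67, 70]

if __name__ == "__main__":
    print(mann_whitney_u(frequent_players, infrequent_players))
```

A rank-based test of this kind is a common choice when scores may not be normally distributed or when group sizes are very unequal.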
Results:
Eighty medical students participated in the study. Demographics such as gender, medical school year, handedness, and hand size did not correlate with surgical simulator success. Frequent video game usage, either weekly or daily within the past 6 months, was associated with higher total scores compared to infrequent or never players (81 [n=12] vs. 67 [n=68]; p=0.04). Prior virtual reality experience was the most predictive of higher total scores (79 [n=23] vs. 63 [n=57]; p=0.02), amongst other parameters. Active musicians had improved economy of motion (403 [n=19] vs. 350 [n=61]; p=0.041), but not higher scores. Prior childhood experience with video games, chopsticks proficiency, knitting experience, and playing sports did not correlate with success.
Conclusion:
Frequent use of video games within the past 6 months and prior virtual reality experience were factors correlated with improved total scores on the da Vinci Surgical Skills Simulator, while prior childhood video game experiences were not. This suggests a potential role for video games and virtual reality in surgical training.

Background:
Video-based education (VBE) facilitates understanding of medical knowledge and technical skills and is an expanding field within surgical education. However, there is a paucity of research comparing the effectiveness of novel video techniques to traditional textbook learning among medical student learners.
Objective:
The primary objective of this study was to assess participants’ understanding of surgical content through multiple choice (MC) and simulated intraoperative questions when using video-based versus text-based education as preparation resources. Additional objectives included evaluating time spent on preparation, perceived preparedness, and the likelihood of recommending the resource to peers.
Methods:
Medical students were enrolled in a randomized controlled trial with crossover. For each participant, a VBE resource was randomly assigned to either laparoscopic cholecystectomy or thyroidectomy surgical procedures, with a textbook resource assigned to the remaining procedure. After a 24-hour study period, participants watched both prerecorded surgical procedures via proctored video conference. MC assessments and simulated intraoperative questions assessed knowledge of surgical anatomy and pathology. Student’s t-tests were utilized to compare differences in study outcomes between arms.
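The crossover assignment described above can be sketched as follows. This is an illustrative assumption about how such an allocation might be generated, not the study's actual randomization procedure; the function name, seed, and participant identifiers are hypothetical.

```python
import random

PROCEDURES = ("laparoscopic cholecystectomy", "thyroidectomy")


def assign_resources(participant_ids, seed=0):
    """Randomly give each participant video for one procedure, text for the other."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    assignments = {}
    for pid in participant_ids:
        video_procedure = rng.choice(PROCEDURES)
        text_procedure = (
            PROCEDURES[1] if video_procedure == PROCEDURES[0] else PROCEDURES[0]
        )
        assignments[pid] = {video_procedure: "video", text_procedure: "text"}
    return assignments


if __name__ == "__main__":
    for pid, plan in assign_resources(range(1, 6)).items():
        print(pid, plan)
```

Because every participant uses both resource types, each person serves as their own comparison, although the two procedures still differ in content.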
Results:
Forty-seven participants were enrolled, of whom 39 completed the study. There was no significant difference in performance on the intraoperative questions or in the change from pre- to post-intervention MC assessments between the VBE and text-based groups for either surgical procedure (p > 0.05). Participants who received VBE as preparation for the cholecystectomy had significantly higher post-intervention MC scores than those who received text-based resources (78% vs. 65%, p = 0.02). Surveys indicated a preference for VBE over text-based resources, including increased perceived helpfulness and preparedness, reduced anxiety, improved ability to follow surgical steps, and increased likelihood to recommend VBE to peers (all p < .001). VBE was perceived to be significantly more efficient (p < .0001), and participants spent significantly less time on VBE versus text-based preparation (35.3 minutes vs. 44.7 minutes, p = 0.004).
Conclusion:
Medical student performance did not differ significantly based on VBE versus text-based preparation. However, despite similar test results, VBE required less preparation time and was strongly preferred by students. VBE resources should be further developed and explored to advance medical student surgical education.
Introduction:
It is unclear how medical students use different modalities of educational technologies (EdTech) such as websites, software, and other applications to enhance their learning. We sought to assess medical students’ current perceptions and use of various EdTech, and specifically of artificial intelligence (AI) chatbots as potential educational resources, throughout medical school.
Methods:
We administered an online survey to all medical students at a single urban academic institution in Fall 2023. It was advertised via an all-student listserv and posters in student work areas. Students were incentivized by a gift card raffle. Differences in quantitative outcomes between demographic groups were assessed with Chi-squared tests. Two researchers (MK & MP) performed content analysis of free-response data via both inductive and deductive coding schemes. Major themes were identified, and representative quotations selected.
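As an illustration of the Chi-squared comparison mentioned above, the sketch below computes the Pearson chi-squared statistic and degrees of freedom for a contingency table. The function name and counts are hypothetical, no continuity correction is applied, and the p-value lookup is left out.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    if len(table) < 2 or len(table[0]) < 2:
        raise ValueError("table needs at least two rows and two columns")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are demographic groups, columns are
# "positive experience" versus "other".
example_table = [
    [10, 20],
    [20, 10],
]

if __name__ == "__main__":
    stat, dof = chi_squared_statistic(example_table)
    print(f"chi-squared={stat:.2f}, dof={dof}")
```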
Results:
A total of 143 students participated (20% of all students, 62% female). Question banks, flashcard applications, and video libraries were used across all phases of medical school (Figure), and average personal expenditure on EdTech was $1600 by the end of medical school. Almost all respondents (99%) had either personally used or observed the use of an AI chatbot; 57% used it for academic purposes, and 70% indicated a positive experience. There were no differences in use by gender or ethnicity, but men were more likely to indicate a positive experience (p<0.01). When asked how they might use AI in their surgical clerkships, students’ responses related to preparing for operations, streamlining clinical workflow, and supporting self-directed learning. Students indicated educators could help them by providing guidance on appropriate use of AI, practical skills training including information verification, and material resources. Interestingly, approximately 5% of students strongly opposed the use of AI chatbots in medical school contexts.
Discussion:
Medical students at our institution use diverse EdTech resources throughout medical school at significant personal expense. AI use is very common, and students identified several ways that educators can help them make the best use of these technologies, particularly in their surgery clerkships. Next steps should include the establishment of guidelines on acceptable AI use and development of resources to enhance their utility.

Introduction
Increasing utilization has made exposure to robotic surgery paramount during surgical training. Compared to open or laparoscopic surgery, the configurational relationship between trainer and trainee in robotic surgery reduces face-to-face interaction and ability to directly physically manipulate a trainee’s actions while performing a task. We conducted a mixed-methods study to examine factors facilitating training dynamics (teaching, feedback, and autonomy) and trainer-trainee communication in robotic surgery.
Methods
Robotic procedures were video and audio recorded, capturing both the endoscopic view and a broad view of the operating consoles with trainer/trainee interactions (Figure 1). Audio was transcribed, and the synchronized video streams were then analyzed using a thematic analysis approach. Trainee console operative time, representing autonomy, was extracted from the robotic system. Trust was measured via a modified Leader Member Exchange (LMX) questionnaire distributed to the trainer and trainee at the end of each case. We analyzed the correlation between autonomy and trust using Pearson’s coefficient.
Results
This represents pilot analysis of nine recorded robotic procedures in colorectal, general surgery, urology, and thoracic surgery. Thematic analysis revealed a robust educational environment, with good teaching techniques represented by verbal articulation, preoperative face-to-face huddle, and constant active feedback. Feedback “escalation” was commonly used, progressing from direct verbalization, through use of on-screen guidance markers, to intraoperative “head out” pauses with direct face-to-face discussion, or console takeover. Feedback “escalation” was frequently associated with difficult-to-explain or critical intraoperative sections. Trainees utilized strategies for fostering trust, including constant verbalization of intent and physical indication of maneuvers before undertaking. Quantitative analysis revealed trainee console time was directly correlated with average trainer LMX trust scores (rho=0.65), but not trainee scores (rho=0.34). Average trainer LMX trust score was 4.37 vs. 3.16 (p=0.035) for trainees with >50% vs. <50% console time; trainee LMX was not significantly different between the same groups (p=0.37).
Conclusion
Higher levels of trainer trust are associated with higher resident autonomy, and trainers/trainees can use specific strategies for teaching, feedback, and fostering trust in the robotic operating room. These findings could be utilized to help trainers/trainees communicate better in the robotic operating room, improve trainee autonomy, and inform future robotic surgical educational curricula.

Introduction: Large language models have the potential to be an adjunctive medical education tool. Although these models have had relative success on standardized medical exams in other specialties, their performance on surgical exams has not been studied. The objective of this study is to assess the accuracy of commonly used chatbots, ChatGPT, Bard, and Claude, in answering National Board of Medical Examiners (NBME) surgery practice questions. Character count, as a measure of question complexity, was also assessed in relation to accuracy.
Methods: 20 practice questions from NBME Surgery Sample Items were assessed. ChatGPT-3.5, ChatGPT-4.0, Bard, and Claude were prompted to answer the multiple choice question and provide justification for their choice. The character count, answer choice, and explanation were recorded for each question. A logistic regression model was used to assess the relationship between the accuracy of the AI models and the character count of the questions.
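To make the regression step concrete, here is a minimal sketch of a one-predictor logistic regression fitted by Newton-Raphson with only the standard library. It assumes the outcome is coded 1 for a correct answer and 0 otherwise, with character count as the single predictor; the function name and data are hypothetical, and no standard errors or p-values are computed.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) and return (b0, b1)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    if len(set(y)) < 2:
        # With every answer correct (or every answer wrong) the slope
        # cannot be estimated.
        raise ValueError("outcome has no variation; coefficient is not estimable")
    x_mean = sum(x) / len(x)
    xc = [xi - x_mean for xi in x]  # centre the predictor for numerical stability
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(xc, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to the intercept
            g1 += (yi - p) * xi     # gradient with respect to the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Convert the intercept back to the uncentred scale; the slope is unchanged.
    return b0 - b1 * x_mean, b1


# Hypothetical character counts and correctness flags (1 = correct).
char_counts = [200, 300, 400, 500, 600, 700, 800, 900]
correct = [1, 1, 1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    intercept, slope = fit_logistic(char_counts, correct)
    print(f"intercept={intercept:.3f}, slope per character={slope:.5f}")
```

The slope is expressed per character, so even a meaningful effect would appear as a small number. The sketch also does not guard against perfectly separated data, where the iterations would not converge.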
Results: ChatGPT-4.0 scored the highest of the four models at 100%; because of its perfect score, a character count coefficient was not calculated. ChatGPT-3.5 scored 90% with a character count coefficient of -0.00107. Claude and Bard each scored 70%, with character count coefficients of 0.00016 and -0.00366, respectively. None of the character count coefficients were statistically significant (Table 1).

Table 1. Accuracy and Character Count Relationship across each chatbot.
Conclusions: ChatGPT-4.0 outperformed its counterparts in responding to standardized surgical exam questions, showcasing its potential as a beneficial supplementary tool for surgery students. ChatGPT-3.5, Claude, and Bard may need improvement before they can serve this purpose. The apparent lack of correlation between question length and response accuracy may indicate that question complexity does not significantly affect the performance of these models.
