BEA 2026 Schedule

This page features the workshop schedule. In the schedule below, clicking on a paper will take you to its dedicated page on Underline where pre-recorded videos are available.

Add to Calendar: Download ICS
Schedule Changes: Please check for any last-minute changes here.

Friday, July 3, 2026

Day 1 of the workshop will be completely virtual. You can attend all oral presentations live on Zoom, while the interactive poster sessions will take place on Gather Town.

Location: Virtual: Day 1 Session on Underline
Time Zone: America/Los_Angeles: PDT (Pacific Daylight Time), UTC-7

Time	Description
06:00 - 07:30	Early Poster Session
	Using k-Shot Prompting with Large k for the Automated Scoring of a German Written Elicited Imitation Test (Malte Sternik, Ronja Laarmann-Quante, Anastasia Drackert)
	From Questions to Assessment Tuples: A Multi-Agent Framework with Bloom-Specialized Agents and Automated Verification (Gee-Lyle Wong, Runcong Zhao, Yulan He, Jiazheng Li)
	Fine-Grained Content Zone Prediction in German Argumentative Essays Using LLMs (Xiaoyu Bai, Manfred Stede)
	Sharing is Caring: Advantages of Sharing a Language Background with Learners as an Annotator of Learner Data in UD (Caroline Grand-Clement, Arianna Masciolini)
	The Effects of Structured LLM-Generated Feedback on Programming Assignment Performance (Tsvetomila Mihaylova, Evanfiya Logacheva, Arto Hellas, Jing Fan, Francisco Castro, Bita Akram, Narges Norouzi, Peter Brusilovsky, Juho Leinonen)
	HFT at BEA 2026 Shared Task 2: Blunt-Edge Models for Hybrid Grading (Ulrike Pado)
07:30 - 08:45	Early Oral Session
07:30 - 07:45	Domain-Adaptive Pre-training for Automated Short Answer Grading in Conceptual Physics: Reliability, Question-Level Analysis, and Error Reduction (Shirin Lade, Alistair Willis, Jonathan Nylk, Oli Howson)
07:45 - 08:00	Using Interaction Log Data to Evaluate and Improve Feedback Accuracy in an Intelligent Language Tutoring System (Mariia Soliar, Leona Colling, Stephen Bodnar, Detmar Meurers)
08:00 - 08:15	Towards Pedagogically Aligned LLM Tutors for Math Mistake Remediation (Kseniia Petukhova, Tien Dat Nguyen, Ekaterina Kochmar)
08:15 - 08:30	What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction (Anna Smirnova, Artyom Kopan, Vladislav Makeev, George Chernishev)
08:30 - 08:45	IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring (Kate Belcher, Marius De Kuthy Meurers, Kordula De Kuthy, Detmar Meurers)
09:00 - 10:30	Tutorial Session A: Theory of Mind and Application in Educational Context (Effat Farhana, Maha Zainab, Qiaosi Wang, Niloofar Mireshghallah, Ramira van der Meulen, Max van Duijn)
09:00 - 09:10	Introduction and Background (Effat Farhana)
09:10 - 09:20	Online Tutoring Systems and Cognitive Modeling (Effat Farhana)
09:20 - 09:35	LLM Adaptation in AI Tutoring (Maha Zainab)
09:35 - 09:50	ToM Integration and Research Gaps (Effat Farhana)
09:50 - 10:00	Synthesis, Reflection, and Key Themes of the Tutorial (Effat Farhana)
10:00 - 10:30	ToM and User Privacy (Niloofar Mireshghallah)
10:30 - 11:00	Coffee Break
11:00 - 12:30	Tutorial Session B: Theory of Mind and Application in Educational Context (Effat Farhana, Maha Zainab, Qiaosi Wang, Niloofar Mireshghallah, Ramira van der Meulen, Max van Duijn)
11:00 - 11:25	Addressing Misconceptions (Ramira van der Meulen)
11:25 - 11:50	Mutual ToM in AI Tutoring (Qiaosi Wang; Effat Farhana will play a recording of the talk because Qiaosi has a time conflict)
11:50 - 12:30	Q&A and Machine ToM in Source Code Comprehension Demo (Maha Zainab)
12:30 - 14:00	Lunch Break / Birds of a Feather
14:00 - 15:45	Oral Session A
14:00 - 14:15	Instruction-Following LLMs for Grammatical Error Correction: Analyzing Neutral-Anchored Instructional Sensitivity Across Editing Modes (Tolgahan Türker, Gülşen Eryiğit)
14:15 - 14:30	Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory (Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne)
14:30 - 14:45	LLM-Powered but Rule-Grounded: Pedagogically Relevant Grammatical Error Characterization for Learner Model Construction (Soroosh Akef, Amália Mendes, P Rebuschat, Detmar Meurers)
14:45 - 15:00	KEYSCORE — Keystroke-enhanced Automated Essay Scoring (Nils-Jonathan Schaller, Daniel Mora Melanchthon, Thorben Jansen, Olaf Köller, Andrea Horbach)
15:15 - 17:30	Poster Session A
	Inferring Student Engagement via Real-Time Thermal–Visual Voice Activity Detection (Bradley Goodman)
	Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization (Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan)
	A Bigger Catch: Fine-Grained Curriculum Standards Alignment on the MathFish Benchmark (Xinman Liu, Mayank Sharma, Xinyu Shi)
	Through the Sentence Lens: Explainable Essay Scoring through Fine-Grained Predictions (Daniel Mora Melanchthon, Stefan Keller, Andrea Horbach)
	Kelvi: A Morphological Parser to Support Tamil Literacy (Shankhalika Srikanth, Sabrina Yu, Sophia Chan, Madeline Solis de Ovando)
	Multi-step Large Language Model for Fine-Grained Feedback in Stepwise Linear Equation Solutions (Imran Chamieh, Torsten Zesch, Klaus Giebermann)
	Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment (Yiyun Zhou, Francis O’Donnell, Victoria Yaneva)
	PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs (Ravi Kumar, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou)
	Policy-Sensitive Fairness Evaluation in Automated Scoring of Clinical Communication (Saed Rezayi, Le An Ha, Victoria Yaneva, Polina Harik, Janet Mee, Jason Snyder)
	Assessment of L2 speech global dimensions using large audio language models (Elsayed Issa, Mahmoud Ali)
	Children’s English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety (Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie Dorr, Walter Leite)
	Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment (Sarh Alzu’bi, Robert Reynolds)
	Predicting Item Difficulty and Generating Reading Comprehension Items via an Annotated Repository (Radhika Kapoor, Mayank Sharma, Sang Truong, Nick Haber, Ben Domingue, Maria Ruiz-Primo)
	Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment (Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh)
	Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation (Mariam Barakat, Ekaterina Kochmar)
	From Dialogue to Learner Modeling: Identifying Candidate Signals of Productive Use in LLM-Based Grammar Practice (Luisa Ribeiro-Flucht, Lanhua Huang, Xiaobin Chen)
	SAAKTH at BEA 2026 Shared Task 1: L1-Aware English Vocabulary Difficulty Prediction with Hybrid Transformer and Psycholinguistic Features (Karthik Mattu, Adit Dhall, Arshad Naguru, Shubh Sehgal, Thejas Gowda, Hakyung Sung)
	Jinnie’s Lab at BEA 2026 Shared Task 1: Precalibration of Vocabulary Item Difficulty with Multilingual Transformers and Multi-Task Learning (Zhe Li, Pauline Aguinalde, Jinnie Shin)
	TOEBM at BEA 2026 Shared Task 1: Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling (Wicaksono M., Joanito Lopo, Tsamarah Nugraha, Ahmad Adi, Muhamad Nurfajri)

Saturday, July 4, 2026

Day 2 of the workshop will be hybrid, allowing you to participate either in person or online. All presentations (oral and poster) will be delivered on-site. Remote attendees can follow all oral sessions via the live stream, but please keep in mind that poster sessions are limited to on-site participants in the posters area.

Location: In-person: Harbor D
Virtual: Day 2 Session on Underline
Time Zone: America/Los_Angeles: PDT (Pacific Daylight Time), UTC-7

Time	Description
09:00 - 10:30	Oral Session B
09:00 - 09:15	The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors (Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo)
09:15 - 09:30	Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues (Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Andrew Lan)
09:30 - 09:45	Measuring Optimal Challenge: Trajectory-Based Difficulty Alignment in Open-Ended Language Tutoring (Ziqi Shu, Shuman Wang, Michael Hardy)
09:45 - 10:00	Findings of the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners (Mariano Felice, Lucy Skidmore)
10:00 - 10:15	Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult? (Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Riera Machin, Yi-Ning Chang)
10:15 - 10:30	Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German (Sebastian Gombert, Zhifan Sun, Fabian Zehner, Jannik Lossjew, Tobias Wyrwich, Berrit Czinczel, David Bednorz, Sascha Bernholt, Knut Neumann, Ute Harms, Aiso Heinze, Hendrik Drachsler)
10:30 - 11:00	Coffee Break
11:00 - 12:30	Oral Session C
11:00 - 11:15	EduMUSE: A Multimodal Educational Dataset with Automatically Extracted Instructional Context (Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu, Danielle McNamara)
11:15 - 11:30	Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most (Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Tithi, Xiaoyi Tian, Tiffany Barnes)
11:30 - 11:45	Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM (Younghun Lee, Amir Bralin, Nobel Sanjay Rebello, Dan Goldwasser)
11:45 - 12:00	Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts (Christopher Runyon, Peter Baldwin, Ian Micir, Kevin Frome, Stephanie Mann, Saed Rezayi, Keelan Evanini, Victoria Yaneva)
12:00 - 12:15	Zero Shot Phonics: Evaluating Constraint-Adherent Phonics Story Generation in Large Language Models (Maria Monica Manlises, Ethel Ong)
12:15 - 12:30	Evaluating Adaptive Personalization of Educational Readings with Simulated Learners (Ryan Woo, Anmol Rao, Aryan Keluskar, Yinong Chen)
12:30 - 14:00	Lunch Break / Birds of a Feather
14:00 - 15:30	Poster Session B
	Investigating Context-aware CTC for Pronunciation Assessment: Mitigating Peaky Behavior and Context Independency Assumption (Jiun-Ting Li, Tien-Hong Lo, Bi-Cheng Yan, Shih-Hsuan Chiu, Fu-An Chao, Berlin Chen)
	A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges (Wen Liang, Li Siyan, Zackary Rackauckas, Julia Hirschberg)
	Criterial Features in German: Towards Interpretable NLP in Readability Assessment (Denise Loefflad, Sofia Kathmann, Heiko Holz, Detmar Meurers)
	RABIT: Rationale-Based Distillation Towards Interpretable Automatic Speaking Assessment via a Small Language Model (Bi-Cheng Yan, Hong-Yun Lin, Fu-An Chao, Jiun-Ting Li, Berlin Chen)
	Challenges in Machine Translation of Interactive Multimodal Exercises (Lucie Polakova, Miroslav Hrabal, Věra Kloudová, Michal Novák, Mariia Anisimova, Martin Popel)
	Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs (Stefano Banno, Kate Knill, Mark Gales)
	Assessing the Quality and Consistency of Automated Knowledge Component Generation using Instructor-generated Questions and LLMs (Jordan Esiason, Priyanka Khare, Wookhee Min, Seung Lee, Gamze Ozogul, Xiaoying Zheng, Yeil Jeong)
	Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training (Seongjin Park)
	Opportunities and Challenges of LLMs in Education: An NLP Perspective (Sowmya Vajjala, Bashar Alhafni, Stefano Banno, Kaushal Maurya, Ekaterina Kochmar)
	Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation (Abigail Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron)
	Using LLMs for item creation: Validating the potential of automatically generated sentence repetition test items for language assessment (Sarah Löber, Björn Rudzewitz, Yuan Chu, Mengyuan He, Shiqin Liu, Yushan Ye, Xiaobin Chen)
	FinnGEC: Benchmarking Grammatical Error Correction for Finnish (Anh-Duc Vu, Mikhail Zolotilin, Jue Hou, Anisia Katinskaia, Yiheng Wu, Roman Yangarber)
	From Metrics to Meaning: Rule-Grounded LLM Explanations for Data Literacy in the Case of Youth Football (Tomasz Piłka, Tomasz Kuczyński, Mateusz Czajka)
	Classification of Student Struggle in Mathematics (Hannah Levin, Madhura Padwal, Nchimunya Mwiinga)
	Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation (Beata Beigman Klebanov, Andrew Hoang, Jamie Mikeska, Benny Longwill, Sanjna Kashyap, Shreyashi Halder, Aakanksha Bhatia)
	Multi-component student writing profiles for expert-aligned automated evaluation of English learner essays (Russell Moore, Andrew Caines, Paula Buttery)
	Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation (Haziq Khalid, Salsabeel Shapsough, Imran Zualkernan)
	Rubrics as Semantic Subspaces: A Unified Approach to Rubric-based Constructed Response Scoring across Short Answers and Essays (Sebastian Gombert, Sonja Hahn, Nico Andersen, Leon Camus, Zhifan Sun, Ngoc Nhu Hao Nguyen, Fabian Zehner, Longwei Cong, Alexander Mehler, Hendrik Drachsler)
	PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving (Murong Yue, Desmond Mcglone, Emily Slutz, Wenhan Lyu, Yixuan Zhang, Jennifer Suh, Ziyu Yao)
	Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory (Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Inigo Serjeant, Stephen Heslip)
	Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education (Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram)
	Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction (Takumi Goto, Yusuke Sakai, Taro Watanabe)
	Toward Cross-Domain Automated Feedback: A Comparative Evaluation of Open-Source Models across Diverse Student Assessment Types (Muhammad Haseeb, Min Paing Hmue, Ahmad Imam Amjad, Maaz Amjad, Victor Sheng)
	AIDA at BEA 2026 Shared Task 1: A Two-Stage Framework for L1-Aware Vocabulary Difficulty Prediction with Representation Diversity and Residual Calibration (Seok Hyeon Cho, JunHyeok Choi, Sangeun Ji, Sung Won Han)
	BoostedCats at BEA 2026 Shared Task 1: What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty (Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn)
	uogal at BEA 2026 Shared Task 1: Ensemble of Multilingual Encoders with NMT Augmentation for L1-Aware Vocabulary Difficulty Prediction (Bernardo Stearns, John P. McCrae, Thomas Gaillat, Jefkine Kafunah)
	IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring (Kate Belcher, Marius De Kuthy Meurers, Kordula De Kuthy, Detmar Meurers)
	RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German (Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora)
	UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction (Nouran Khallaf, Serge Sharoff)
15:30 - 16:00	Coffee Break
16:00 - 16:45	Panel: Transitioning from Academia to the EdTech Industry (Christine Bagarino, Kai North, Keelan Evanini, Mariano Felice)
16:45 - 17:15	Oral Session D
16:45 - 17:00	Incentives Of EdTech: A Systematic Review Of EduNLP Research (Gabrielle Gaudeau, Aoife O’Driscoll, Jasper Degraeuwe, Andrew Caines, Donya Rooein, Zeerak Talat)
17:00 - 17:15	Effects of Varying LLM Access on Essay Writing Behavior (Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang)
17:15 - 17:30	Closing Remarks