BEA 2026 Schedule

Time Zone
America/Los_Angeles: PDT (Pacific Daylight Time), UTC-7

The workshop page on Underline is not live yet. Access to the links below will be available at the start of the workshop.

Friday, July 3, 2026

Time Description
06:00 - 07:30 Early Poster Session
  Using k-Shot Prompting with Large k for the Automated Scoring of a German Written Elicited Imitation Test (Malte Sternik, Ronja Laarmann-Quante, Anastasia Drackert)
  Fine-Grained Content Zone Prediction in German Argumentative Essays Using LLMs (Xiaoyu Bai, Manfred Stede)
  Sharing is Caring: Advantages of Sharing a Language Background with Learners as an Annotator of Learner Data in UD (Caroline Grand-Clement, Arianna Masciolini)
  The Effects of Structured LLM-Generated Feedback on Programming Assignment Performance (Tsvetomila Mihaylova, Evanfiya Logacheva, Arto Hellas, Jing Fan, Francisco Castro, Bita Akram, Narges Norouzi, Peter Brusilovsky, Juho Leinonen)
  HFT at BEA 2026 Shared Task 2: Blunt-Edge Models for Hybrid Grading (Ulrike Pado)
  WSE Research at BEA 2026 Shared Task 2: Multi-Strategy Rubric-Based Short Answer Scoring for German (Jonas Gwozdz, Andreas Both)
07:30 - 08:45 Early Oral Session
07:30 - 07:45 Domain-Adaptive Pre-training for Automated Short Answer Grading in Conceptual Physics: Reliability, Question-Level Analysis, and Error Reduction (Shirin Lade, Alistair Willis, Jonathan Nylk, Oli Howson)
07:45 - 08:00 Using Interaction Log Data to Evaluate and Improve Feedback Accuracy in an Intelligent Language Tutoring System (Mariia Soliar, Leona Colling, Stephen Bodnar, Detmar Meurers)
08:00 - 08:15 Towards Pedagogically Aligned LLM Tutors for Math Mistake Remediation (Kseniia Petukhova, Tien Dat Nguyen, Ekaterina Kochmar)
08:15 - 08:30 What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction (Anna Smirnova, Artyom Kopan, Vladislav Makeev, George Chernishev)
08:30 - 08:45 IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring (Kate Belcher, Marius De Kuthy Meurers, Kordula De Kuthy, Detmar Meurers)
09:00 - 10:30 Tutorial Session A
  Theory of Mind and Application in Educational Context (Effat Farhana, Maha Zainab, Qiaosi Wang, Niloofar Mireshghallah, Ramira van der Meulen, Max van Duijn)
10:30 - 11:00 Coffee Break
11:00 - 12:30 Tutorial Session B
  Theory of Mind and Application in Educational Context (Effat Farhana, Maha Zainab, Qiaosi Wang, Niloofar Mireshghallah, Ramira van der Meulen, Max van Duijn)
12:30 - 14:00 Lunch Break / Birds of a Feather
14:00 - 15:45 Oral Session A
14:00 - 14:15 Using Interaction Log Data to Evaluate and Improve Feedback Accuracy in an Intelligent Language Tutoring System (Mariia Soliar, Leona Colling, Stephen Bodnar, Detmar Meurers)
14:15 - 14:30 Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory (Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne)
14:30 - 14:45 LLM-Powered but Rule-Grounded: Pedagogically Relevant Grammatical Error Characterization for Learner Model Construction (Soroosh Akef, Amália Mendes, P. Rebuschat, Detmar Meurers)
14:45 - 15:00 KEYSCORE — Keystroke-enhanced Automated Essay Scoring (Nils-Jonathan Schaller, Daniel Mora Melanchthon, Thorben Jansen, Olaf Köller, Andrea Horbach)
15:45 - 16:00 Coffee Break
16:00 - 17:30 Poster Session A
  Inferring Student Engagement via Real-Time Thermal–Visual Voice Activity Detection (Bradley Goodman)
  Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization (Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan)
  A Bigger Catch: Fine-Grained Curriculum Standards Alignment on the MathFish Benchmark (Xinman Liu, Mayank Sharma, Xinyu Shi)
  Through the Sentence Lens: Explainable Essay Scoring through Fine-Grained Predictions (Daniel Mora Melanchthon, Stefan Keller, Andrea Horbach)
  Kelvi: A Morphological Parser to Support Tamil Literacy (Shankhalika Srikanth, Sabrina Yu, Sophia Chan, Madeline Solis de Ovando)
  From Questions to Assessment Tuples: A Multi-Agent Framework with Bloom-Specialized Agents and Automated Verification (Gee-Lyle Wong, Runcong Zhao, Yulan He, Jiazheng Li)
  Multi-step Large Language Model for Fine-Grained Feedback in Stepwise Linear Equation Solutions (Imran Chamieh, Torsten Zesch, Klaus Giebermann)
  Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment (Yiyun Zhou, Francis O’Donnell, Victoria Yaneva)
  Multi-component student writing profiles for expert-aligned automated evaluation of English learner essays (Russell Moore, Andrew Caines, Paula Buttery)
  Policy-Sensitive Fairness Evaluation in Automated Scoring of Clinical Communication (Saed Rezayi, Le An Ha, Victoria Yaneva, Polina Harik, Janet Mee, Jason Snyder)
  Rubrics as Semantic Subspaces: A Unified Approach to Rubric-based Constructed Response Scoring across Short Answers and Essays (Sebastian Gombert, Sonja Hahn, Nico Andersen, Leon Camus, Zhifan Sun, Ngoc Nhu Hao Nguyen, Fabian Zehner, Longwei Cong, Alexander Mehler, Hendrik Drachsler)
  Assessment of L2 speech global dimensions using large audio language models (Elsayed Issa, Mahmoud Ali)
  Children’s English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety (Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie Dorr, Walter Leite)
  Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment (Sarh Alzu’bi, Robert Reynolds)
  Predicting Item Difficulty and Generating Reading Comprehension Items via an Annotated Repository (Radhika Kapoor, Mayank Sharma, Sang Truong, Nick Haber, Ben Domingue, Maria Ruiz-Primo)
  Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment (Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh)
  Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation (Mariam Barakat, Ekaterina Kochmar)
  From Dialogue to Learner Modeling: Identifying Candidate Signals of Productive Use in LLM-Based Grammar Practice (Luisa Ribeiro-Flucht, Lanhua Huang, Xiaobin Chen)
  Toward Cross-Domain Automated Feedback: A Comparative Evaluation of Open-Source Models across Diverse Student Assessment Types (Muhammad Haseeb, Min Paing Hmue, Ahmad Imam Amjad, Maaz Amjad, Victor Sheng)
  SAAKTH at BEA 2026 Shared Task 1: L1-Aware English Vocabulary Difficulty Prediction with Hybrid Transformer and Psycholinguistic Features (Karthik Mattu, Adit Dhall, Arshad Naguru, Shubh Sehgal, Thejas Gowda, Hakyung Sung)
  SDPA at BEA 2026 Shared Task 2: Efficient LLM Fine-Tuning for Rubric-based Short Answer Scoring (Zhexiong Liu, Jing Zhang)
  TOEBM at BEA 2026 Shared Task 1: Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling (Wicaksono M., Joanito Lopo, Tsamarah Nugraha, Ahmad Adi, Muhamad Nurfajri)

Saturday, July 4, 2026

Time Description
09:00 - 10:30 Oral Session B
09:00 - 09:15 The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors (Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo)
09:15 - 09:30 Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues (Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Andrew Lan)
09:30 - 09:45 Measuring Optimal Challenge: Trajectory-Based Difficulty Alignment in Open-Ended Language Tutoring (Ziqi Shu, Shuman Wang, Michael Hardy)
09:45 - 10:00 Findings of the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners (Mariano Felice, Lucy Skidmore)
10:00 - 10:15 Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult? (Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Riera Machin, Yi-Ning Chang)
10:15 - 10:30 Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German (Sebastian Gombert, Zhifan Sun, Fabian Zehner, Jannik Lossjew, Tobias Wyrwich, Berrit Czinczel, David Bednorz, Sascha Bernholt, Knut Neumann, Ute Harms, Aiso Heinze, Hendrik Drachsler)
10:30 - 11:00 Coffee Break
11:00 - 12:30 Oral Session C
11:00 - 11:15 EduMUSE: A Multimodal Educational Dataset with Automatically Extracted Instructional Context (Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu, Danielle McNamara)
11:15 - 11:30 Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most (Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Tithi, Xiaoyi Tian, Tiffany Barnes)
11:30 - 11:45 Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM (Younghun Lee, Amir Bralin, Nobel Sanjay Rebello, Dan Goldwasser)
11:45 - 12:00 Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts (Christopher Runyon, Peter Baldwin, Ian Micir, Kevin Frome, Stephanie Mann, Saed Rezayi, Keelan Evanini, Victoria Yaneva)
12:00 - 12:15 Zero Shot Phonics: Evaluating Constraint-Adherent Phonics Story Generation in Large Language Models (Maria Monica Manlises, Ethel Ong)
12:15 - 12:30 Evaluating Adaptive Personalization of Educational Readings with Simulated Learners (Ryan Woo, Anmol Rao, Aryan Keluskar, Yinong Chen)
12:30 - 14:00 Lunch Break / Birds of a Feather
14:00 - 15:30 Poster Session B
  Investigating Context-aware CTC for Pronunciation Assessment: Mitigating Peaky Behavior and Context Independency Assumption (Jiun-Ting Li, Tien-Hong Lo, Bi-Cheng Yan, Shih-Hsuan Chiu, Fu-An Chao, Berlin Chen)
  A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges (Wen Liang, Li Siyan, Zackary Rackauckas, Julia Hirschberg)
  Criterial Features in German: Towards Interpretable NLP in Readability Assessment (Denise Loefflad, Sofia Kathmann, Heiko Holz, Detmar Meurers)
  RABIT: Rationale-Based Distillation Towards Interpretable Automatic Speaking Assessment via a Small Language Model (Bi-Cheng Yan, Hong-Yun Lin, Fu-An Chao, Jiun-Ting Li, Berlin Chen)
  Challenges in Machine Translation of Interactive Multimodal Exercises (Lucie Polakova, Miroslav Hrabal, Věra Kloudová, Michal Novák, Mariia Anisimova, Martin Popel)
  Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs (Stefano Banno, Kate Knill, Mark Gales)
  Assessing the Quality and Consistency of Automated Knowledge Component Generation using Instructor-generated Questions and LLMs (Jordan Esiason, Priyanka Khare, Wookhee Min, Seung Lee, Gamze Ozogul, Xiaoying Zheng, Yeil Jeong)
  Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training (Seongjin Park)
  Opportunities and Challenges of LLMs in Education: An NLP Perspective (Sowmya Vajjala, Bashar Alhafni, Stefano Banno, Kaushal Maurya, Ekaterina Kochmar)
  Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation (Abigail Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron)
  Using LLMs for item creation: Validating the potential of automatically generated sentence repetition test items for language assessment (Sarah Löber, Björn Rudzewitz, Yuan Chu, Mengyuan He, Shiqin Liu, Yushan Ye, Xiaobin Chen)
  FinnGEC: Benchmarking Grammatical Error Correction for Finnish (Anh-Duc Vu, Mikhail Zolotilin, Jue Hou, Anisia Katinskaia, Yiheng Wu, Roman Yangarber)
  From Metrics to Meaning: Rule-Grounded LLM Explanations for Data Literacy in the Case of Youth Football (Tomasz Piłka, Tomasz Kuczyński, Mateusz Czajka)
  Classification of Student Struggle in Mathematics (Hannah Levin, Madhura Padwal, Nchimunya Mwiinga)
  PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs (Ravi Kumar, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou)
  Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation (Beata Beigman Klebanov, Andrew Hoang, Jamie Mikeska, Benny Longwill, Sanjna Kashyap, Shreyashi Halder, Aakanksha Bhatia)
  Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation (Haziq Khalid, Salsabeel Shapsough, Imran Zualkernan)
  PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving (Murong Yue, Desmond Mcglone, Emily Slutz, Wenhan Lyu, Yixuan Zhang, Jennifer Suh, Ziyu Yao)
  Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory (Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Inigo Serjeant, Stephen Heslip)
  Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education (Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram)
  Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction (Takumi Goto, Yusuke Sakai, Taro Watanabe)
  AIDA at BEA 2026 Shared Task 1: A Two-Stage Framework for L1-Aware Vocabulary Difficulty Prediction with Representation Diversity and Residual Calibration (Seok Hyeon Cho, JunHyeok Choi, Sangeun Ji, Sung Won Han)
  BoostedCats at BEA 2026 Shared Task 1: What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty (Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn)
  uogal at BEA 2026 Shared Task 1: Ensemble of Multilingual Encoders with NMT Augmentation for L1-Aware Vocabulary Difficulty Prediction (Bernardo Stearns, John P. McCrae, Thomas Gaillat, Jefkine Kafunah)
  Jinnie’s Lab at BEA 2026 Shared Task 1: Precalibration of Vocabulary Item Difficulty with Multilingual Transformers and Multi-Task Learning (Zhe Li, Pauline Aguinalde, Jinnie Shin)
  IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring (Kate Belcher, Marius De Kuthy Meurers, Kordula De Kuthy, Detmar Meurers)
  RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German (Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora)
  UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction (Nouran Khallaf, Serge Sharoff)
15:30 - 16:00 Coffee Break
16:00 - 16:45 Panel
16:45 - 17:15 Oral Session D
16:45 - 17:00 Incentives Of EdTech: A Systematic Review Of EduNLP Research (Gabrielle Gaudeau, Aoife O’Driscoll, Jasper Degraeuwe, Andrew Caines, Donya Rooein, Zeerak Talat)
17:00 - 17:15 Effects of Varying LLM Access on Essay Writing Behavior (Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang)
17:15 - 17:30 Closing Remarks