BEA 2026 Papers

This page lists the papers accepted to the workshop, each linked to its publication in the ACL Anthology.

Tutorial Track

Theory of Mind and Application in Educational Context (Effat Farhana, Maha Zainab, Qiaosi Wang, Niloofar Mireshghallah, Ramira van der Meulen, Max van Duijn)

Main Track

The main workshop track received 132 submissions. Following a rigorous peer-review process conducted by the program committee, 63 papers were accepted, corresponding to an overall acceptance rate of 48%.

Inferring Student Engagement via Real-Time Thermal–Visual Voice Activity Detection (Bradley Goodman)
Investigating Context-aware CTC for Pronunciation Assessment: Mitigating Peaky Behavior and Context Independency Assumption (Jiun-Ting Li, Tien-Hong Lo, Bi-Cheng Yan, Shih-Hsuan Chiu, Fu-An Chao, Berlin Chen)
A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges (Wen Liang, Li Siyan, Zackary Rackauckas, Julia Hirschberg)
The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors (Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo)
Criterial Features in German: Towards Interpretable NLP in Readability Assessment (Denise Loefflad, Sofia Kathmann, Heiko Holz, Detmar Meurers)
Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization (Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan)
Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM (Younghun Lee, Amir Bralin, Nobel Sanjay Rebello, Dan Goldwasser)
RABIT: Rationale-Based Distillation Towards Interpretable Automatic Speaking Assessment via a Small Language Model (Bi-Cheng Yan, Hong-Yun Lin, Fu-An Chao, Jiun-Ting Li, Berlin Chen)
Towards Pedagogically Aligned LLM Tutors for Math Mistake Remediation (Kseniia Petukhova, Tien Dat Nguyen, Ekaterina Kochmar)
Challenges in Machine Translation of Interactive Multimodal Exercises (Lucie Polakova, Miroslav Hrabal, Věra Kloudová, Michal Novák, Mariia Anisimova, Martin Popel)
Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts (Christopher Runyon, Peter Baldwin, Ian Micir, Kevin Frome, Stephanie Mann, Saed Rezayi, Keelan Evanini, Victoria Yaneva)
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs (Stefano Banno, Kate Knill, Mark Gales)
Using Interaction Log Data to Evaluate and Improve Feedback Accuracy in an Intelligent Language Tutoring System (Mariia Soliar, Leona Colling, Stephen Bodnar, Detmar Meurers)
A Bigger Catch: Fine-Grained Curriculum Standards Alignment on the MathFish Benchmark (Xinman Liu, Mayank Sharma, Xinyu Shi)
Through the Sentence Lens: Explainable Essay Scoring through Fine-Grained Predictions (Daniel Mora Melanchthon, Stefan Keller, Andrea Horbach)
Instruction-Following LLMs for Grammatical Error Correction: Analyzing Neutral-Anchored Instructional Sensitivity Across Editing Modes (Tolgahan Türker, Gülşen Eryiğit)
Assessing the Quality and Consistency of Automated Knowledge Component Generation using Instructor-generated Questions and LLMs (Jordan Esiason, Priyanka Khare, Wookhee Min, Seung Lee, Gamze Ozogul, Xiaoying Zheng, Yeil Jeong)
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory (Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne)
Using k-Shot Prompting with Large k for the Automated Scoring of a German Written Elicited Imitation Test (Malte Sternik, Ronja Laarmann-Quante, Anastasia Drackert)
Kelvi: A Morphological Parser to Support Tamil Literacy (Shankhalika Srikanth, Sabrina Yu, Sophia Chan, Madeline Solis de Ovando)
From Questions to Assessment Tuples: A Multi-Agent Framework with Bloom-Specialized Agents and Automated Verification (Gee-Lyle Wong, Runcong Zhao, Yulan He, Jiazheng Li)
Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training (Seongjin Park)
KEYSCORE — Keystroke-enhanced Automated Essay Scoring (Nils-Jonathan Schaller, Daniel Mora Melanchthon, Thorben Jansen, Olaf Köller, Andrea Horbach)
EduMUSE: A Multimodal Educational Dataset with Automatically Extracted Instructional Context (Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu, Danielle McNamara)
Opportunities and Challenges of LLMs in Education: An NLP Perspective (Sowmya Vajjala, Bashar Alhafni, Stefano Banno, Kaushal Maurya, Ekaterina Kochmar)
Fine-Grained Content Zone Prediction in German Argumentative Essays Using LLMs (Xiaoyu Bai, Manfred Stede)
Multi-step Large Language Model for Fine-Grained Feedback in Stepwise Linear Equation Solutions (Imran Chamieh, Torsten Zesch, Klaus Giebermann)
Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation (Abigail Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron)
Using LLMs for item creation: Validating the potential of automatically generated sentence repetition test items for language assessment (Sarah Löber, Björn Rudzewitz, Yuan Chu, Mengyuan He, Shiqin Liu, Yushan Ye, Xiaobin Chen)
Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment (Yiyun Zhou, Francis O’Donnell, Victoria Yaneva)
What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction (Anna Smirnova, Artyom Kopan, Vladislav Makeev, George Chernishev)
FinnGEC: Benchmarking Grammatical Error Correction for Finnish (Anh-Duc Vu, Mikhail Zolotilin, Jue Hou, Anisia Katinskaia, Yiheng Wu, Roman Yangarber)
From Metrics to Meaning: Rule-Grounded LLM Explanations for Data Literacy in the Case of Youth Football (Tomasz Piłka, Tomasz Kuczyński, Mateusz Czajka)
Sharing is Caring: Advantages of Sharing a Language Background with Learners as an Annotator of Learner Data in UD (Caroline Grand-Clement, Arianna Masciolini)
Classification of Student Struggle in Mathematics (Hannah Levin, Madhura Padwal, Nchimunya Mwiinga)
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs (Ravi Kumar, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou)
Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation (Beata Beigman Klebanov, Andrew Hoang, Jamie Mikeska, Benny Longwill, Sanjna Kashyap, Shreyashi Halder, Aakanksha Bhatia)
Multi-component student writing profiles for expert-aligned automated evaluation of English learner essays (Russell Moore, Andrew Caines, Paula Buttery)
Policy-Sensitive Fairness Evaluation in Automated Scoring of Clinical Communication (Saed Rezayi, Le An Ha, Victoria Yaneva, Polina Harik, Janet Mee, Jason Snyder)
Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation (Haziq Khalid, Salsabeel Shapsough, Imran Zualkernan)
The Effects of Structured LLM-Generated Feedback on Programming Assignment Performance (Tsvetomila Mihaylova, Evanfiya Logacheva, Arto Hellas, Jing Fan, Francisco Castro, Bita Akram, Narges Norouzi, Peter Brusilovsky, Juho Leinonen)
Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues (Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Andrew Lan)
Rubrics as Semantic Subspaces: A Unified Approach to Rubric-based Constructed Response Scoring across Short Answers and Essays (Sebastian Gombert, Sonja Hahn, Nico Andersen, Leon Camus, Zhifan Sun, Ngoc Nhu Hao Nguyen, Fabian Zehner, Longwei Cong, Alexander Mehler, Hendrik Drachsler)
Domain-Adaptive Pre-training for Automated Short Answer Grading in Conceptual Physics: Reliability, Question-Level Analysis, and Error Reduction (Shirin Lade, Alistair Willis, Jonathan Nylk, Oli Howson)
Measuring Optimal Challenge: Trajectory-Based Difficulty Alignment in Open-Ended Language Tutoring (Ziqi Shu, Shuman Wang, Michael Hardy)
PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving (Murong Yue, Desmond Mcglone, Emily Slutz, Wenhan Lyu, Yixuan Zhang, Jennifer Suh, Ziyu Yao)
Effects of Varying LLM Access on Essay Writing Behavior (Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang)
Assessment of L2 speech global dimensions using large audio language models (Elsayed Issa, Mahmoud Ali)
Incentives Of EdTech: A Systematic Review Of EduNLP Research (Gabrielle Gaudeau, Aoife O’Driscoll, Jasper Degraeuwe, Andrew Caines, Donya Rooein, Zeerak Talat)
Children’s English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety (Qian Shen, Fanghua Cao, Min Yao, Shlok Gilda, Bonnie Dorr, Walter Leite)
Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment (Sarh Alzu’bi, Robert Reynolds)
Predicting Item Difficulty and Generating Reading Comprehension Items via an Annotated Repository (Radhika Kapoor, Mayank Sharma, Sang Truong, Nick Haber, Ben Domingue, Maria Ruiz-Primo)
Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment (Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh)
Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory (Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Inigo Serjeant, Stephen Heslip)
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most (Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Tithi, Xiaoyi Tian, Tiffany Barnes)
Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education (Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram)
LLM-Powered but Rule-Grounded: Pedagogically Relevant Grammatical Error Characterization for Learner Model Construction (Soroosh Akef, Amália Mendes, P Rebuschat, Detmar Meurers)
Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation (Mariam Barakat, Ekaterina Kochmar)
Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction (Takumi Goto, Yusuke Sakai, Taro Watanabe)
Zero Shot Phonics: Evaluating Constraint-Adherent Phonics Story Generation in Large Language Models (Maria Monica Manlises, Ethel Ong)
From Dialogue to Learner Modeling: Identifying Candidate Signals of Productive Use in LLM-Based Grammar Practice (Luisa Ribeiro-Flucht, Lanhua Huang, Xiaobin Chen)
Evaluating Adaptive Personalization of Educational Readings with Simulated Learners (Ryan Woo, Anmol Rao, Aryan Keluskar, Yinong Chen)
Toward Cross-Domain Automated Feedback: A Comparative Evaluation of Open-Source Models across Diverse Student Assessment Types (Muhammad Haseeb, Min Paing Hmue, Ahmad Imam Amjad, Maaz Amjad, Victor Sheng)

Shared Task Track

We accepted 28 shared task papers, including two shared task overview papers and 26 system description papers:

Findings of the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners (Mariano Felice, Lucy Skidmore)
- SATLab at BEA 2026 Shared Task 1: Predicting the Difficulty of English Words for Three L1 Learners Using Primarily Psycholinguistic Features (Yves Bestgen)
- UGA Threshold at BEA 2026 Shared Task 1: Predicting Vocabulary Acquisition Difficulty with Hand-Crafted SLA-Based Features (Emma Dalbo)
- TeamXBC at BEA 2026 Shared Task 1: How AI (and I) won the shared task: Vibe and agentic coding solutions for practical machine learning problems (Xiaobin Chen)
- SAAKTH at BEA 2026 Shared Task 1: L1-Aware English Vocabulary Difficulty Prediction with Hybrid Transformer and Psycholinguistic Features (Karthik Mattu, Adit Dhall, Arshad Naguru, Shubh Sehgal, Thejas Gowda, Hakyung Sung)
- SurreyCTS at BEA 2026 Shared Task 1: Semantic Funnelling and Entropy-based Multilingual Lexical Difficulty Prediction (Georgina Willoughby, Jordan Painter, Diptesh Kanojia, Emily Wells, Constantin Orasan)
- EduNLP at BEA 2026 Shared Task 1: Multi-Model Ensemble with Feature-Augmented Transformers for Vocabulary Difficulty Prediction (Avinash Kumar Sharma)
- AIDA at BEA 2026 Shared Task 1: A Two-Stage Framework for L1-Aware Vocabulary Difficulty Prediction with Representation Diversity and Residual Calibration (Seok Hyeon Cho, JunHyeok Choi, Sangeun Ji, Sung Won Han)
- Failure at BEA 2026 Shared Task 1: One Pipeline, Three L1s: A Unified Language-Agnostic System for Vocabulary Difficulty Prediction (Abid Hossain, Kamruzzaman Khan Alve)
- BoostedCats at BEA 2026 Shared Task 1: What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty (Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn)
- uogal at BEA 2026 Shared Task 1: Ensemble of Multilingual Encoders with NMT Augmentation for L1-Aware Vocabulary Difficulty Prediction (Bernardo Stearns, John P. McCrae, Thomas Gaillat, Jefkine Kafunah)
- Jinnie’s Lab at BEA 2026 Shared Task 1: Precalibration of Vocabulary Item Difficulty with Multilingual Transformers and Multi-Task Learning (Zhe Li, Pauline Aguinalde, Jinnie Shin)
- Glite at BEA 2026 Shared Task 1: Holistic Difficulty Models Dominate, Feature Engineering Closes the Gap in L1-Aware Vocabulary Difficulty Prediction (Vassili Philippov, Dmitrii Andreev, Pavel Katunin, Anton Nikolaev)
- NLP-Explorers at BEA 2026 Shared Task 1: DeBERTa–CatBoost Weighted Ensemble Approach for L1-Specific Vocabulary Difficulty Prediction (Tayyab Latif, Asifa Bibi, Sabur Butt, Grigori Sidorov, Alexander Gelbukh)
- RETUYT-INCO at BEA 2026 Shared Task 1: Feature-Enriched mDeBERTa for Word Difficulty Prediction (Santiago Robaina, Aiala Rosá, Luis Chiruzzo)
- Token Titans at BEA 2026 Shared Task 1: Multilingual Lexical Complexity Prediction via Fine-Tuned XLM-RoBERTa with Ensemble Decoding (Anubhab Parashar, Sandeep Mathias)
- TOEBM at BEA 2026 Shared Task 1: Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling (Wicaksono M., Joanito Lopo, Tsamarah Nugraha, Ahmad Adi, Muhamad Nurfajri)
- Data Asgardians at BEA 2026 Shared Task 1: A Hybrid Transformer–Feature Ensemble for L1-Aware English Vocabulary Difficulty Prediction (Adrian Pineda, Sabur Butt, Héctor Ceballos Cancino)
- UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction (Nouran Khallaf, Serge Sharoff)
- Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult? (Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Riera Machin, Yi-Ning Chang)
Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German (Sebastian Gombert, Zhifan Sun, Fabian Zehner, Jannik Lossjew, Tobias Wyrwich, Berrit Czinczel, David Bednorz, Sascha Bernholt, Knut Neumann, Ute Harms, Aiso Heinze, Hendrik Drachsler)
- HFT at BEA 2026 Shared Task 2: Blunt-Edge Models for Hybrid Grading (Ulrike Pado)
- ASLAN at BEA 2026 Shared Task 2: Voting Across Scoring Paradigms (Marie Bexte, Yuning Ding, Josef Ruppenhofer, Nils-Jonathan Schaller, Daniel Mora Melanchthon, Torsten Zesch, Andrea Horbach)
- WSE Research at BEA 2026 Shared Task 2: Multi-Strategy Rubric-Based Short Answer Scoring for German (Jonas Gwozdz, Andreas Both)
- AMATI at BEA 2026 Shared Task 2: Automatic Short Answer Grading with Inductive Logic Programming and a Large Language Model (Alistair Willis, Aisling Third)
- IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring (Kate Belcher, Marius De Kuthy Meurers, Kordula De Kuthy, Detmar Meurers)
- RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German (Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora)
- SDPA at BEA 2026 Shared Task 2: Efficient LLM Fine-Tuning for Rubric-based Short Answer Scoring (Zhexiong Liu, Jing Zhang)