20th Workshop on Innovative Use of NLP for Building Educational Applications: Papers
This page lists the accepted papers from the workshop; each title links to its publication in the ACL Anthology.
Tutorial Track
- Large Language Models for Education: Understanding the Needs of Stakeholders, Current Capabilities and the Path Forward (Sankalan Pal Chowdhury, Nico Daheim, Ekaterina Kochmar, Jakub Macina, Donya Rooein, Mrinmaya Sachan, Shashank Sonkar)
Main Track
The main workshop track received 169 submissions, of which 75 were accepted following peer review by the program committee, for an overall acceptance rate of 44%.
- Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features (Hakyung Sung, Karla Csuros, Min-Chang Sung)
- MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks (Marius Dumitran, Mihnea Buca, Theodor Moroianu)
- Unsupervised Automatic Short Answer Grading and Essay Scoring: A Weakly Supervised Explainable Approach (Felipe Urrutia, Cristian Buc, Roberto Araya, Valentin Barriere)
- A Survey on Automated Distractor Evaluation in Multiple-Choice Tasks (Luca Benedetto, Shiva Taslimipoor, Paula Buttery)
- Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring (Mina Almasi, Ross Kristensen-McLachlan)
- Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests (Stefan Dascalescu, Marius Dumitran, Mihai Alexandru Vasiluta)
- Can LLMs Effectively Simulate Human Learners? Teachers’ Insights from Tutoring LLM Students (Daria Martynova, Jakub Macina, Nico Daheim, Nilay Yalcin, Xiaoyu Zhang, Mrinmaya Sachan)
- Adapting LLMs for Minimal-edit Grammatical Error Correction (Ryszard Staruch, Filip Gralinski, Daniel Dzienisiewicz)
- COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content (Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy Chen)
- Is Lunch Free Yet? Overcoming the Cold-Start Problem in Supervised Content Scoring using Zero-Shot LLM-Generated Training Data (Marie Bexte, Torsten Zesch)
- Transformer Architectures for Vocabulary Test Item Difficulty Prediction (Lucy Skidmore, Mariano Felice, Karen Dunn)
- Automatic concept extraction for learning domain modeling: A weakly supervised approach using contextualized word embeddings (Kordula De Kuthy, Leander Girrbach, Detmar Meurers)
- Towards a Real-time Swedish Speech Analyzer for Language Learning Games: A Hybrid AI Approach to Language Assessment (Tianyi Geng, David Alfter)
- Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility (Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, Jungyeul Park)
- LLM-based post-editing as reference-free GEC evaluation (Robert Östling, Murathan Kurfali, Andrew Caines)
- Increasing the Generalizability of Similarity-Based Essay Scoring Through Cross-Prompt Training (Marie Bexte, Yuning Ding, Andrea Horbach)
- Automated Scoring of a German Written Elicited Imitation Test (Mihail Chifligarov, Jammila Laâguidi, Max Schellenberg, Alexander Dill, Anna Timukova, Anastasia Drackert, Ronja Laarmann-Quante)
- LLMs Protégés: Tutoring LLMs with Knowledge Gaps Improves Student Learning Outcome (Andrei Kucharavy, Cyril Vallez, Dimitri Percia David)
- LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages (Karthika N J, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi)
- Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? (Andreas Säuberli, Diego Frassinelli, Barbara Plank)
- Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison (Aymeric de Chillaz, Anna Sotnikova, Patrick Jermann, Antoine Bosselut)
- LookAlike: Consistent Distractor Generation in Math MCQs (Nisarg Parikh, Alexander Scarlatos, Nigel Fernandez, Simon Woodhead, Andrew Lan)
- You Shall Know a Word’s Difficulty by the Family It Keeps: Word Family Features in Personalised Word Difficulty Classifiers for L2 Spanish (Jasper Degraeuwe)
- The Need for Truly Graded Lexical Complexity Prediction (David Alfter)
- Towards Automatic Formal Feedback on Scientific Documents (Louise Bloch, Johannes Rückert, Christoph Friedrich)
- Don’t Score too Early! Evaluating Argument Mining Models on Incomplete Essays (Nils-Jonathan Schaller, Yuning Ding, Thorben Jansen, Andrea Horbach)
- Educators’ Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting (Sankalan Pal Chowdhury, Terry Jingchen Zhang, Donya Rooein, Dirk Hovy, Tanja Käser, Mrinmaya Sachan)
- Transformer-Based Real-Word Spelling Error Feedback with Configurable Confusion Sets (Torsten Zesch, Dominic Gardner, Marie Bexte)
- Automated L2 Proficiency Scoring: Weak Supervision, Large Language Models, and Statistical Guarantees (Aitor Arronte Alvarez, Naiyi Xie Fincham)
- Automatic Generation of Inference Making Questions for Reading Comprehension Assessments (Wanjing (Anya) Ma, Michael Flor, Zuowei Wang)
- Investigating Methods for Mapping Learning Objectives to Bloom’s Revised Taxonomy in Course Descriptions for Higher Education (Zahra Kolagar, Frank Zalkow, Alessandra Zarcone)
- LangEye: Toward ‘Anytime’ Learner-Driven Vocabulary Learning From Real-World Objects (Mariana Shimabukuro, Deval Panchal, Christopher Collins)
- Costs and Benefits of AI-Enabled Topic Modeling in P-20 Research: The Case of School Improvement Plans (Syeda Sabrina Akter, Seth Hunter, David Woo, Antonios Anastasopoulos)
- Advances in Auto-Grading with Large Language Models: A Cross-Disciplinary Survey (Tania Amanda Nkoyo Frederick Eneye, Chukwuebuka Fortunate Ijezue, Ahmad Imam Amjad, Maaz Amjad, Sabur Butt, Gerardo Castañeda-Garza)
- Unsupervised Sentence Readability Estimation Based on Parallel Corpora for Text Simplification (Rina Miyata, Toru Urakawa, Hideaki Tamori, Tomoyuki Kajiwara)
- From End-Users to Co-Designers: Lessons from Teachers (Martina Galletti, Valeria Cesaroni)
- LLMs in alliance with Edit-based models: advancing In-Context Learning for Grammatical Error Correction by Specific Example Selection (Alexey Sorokin, Regina Nasyrova)
- Explaining Holistic Essay Scores in Comparative Judgment Assessments by Predicting Scores on Rubrics (Michiel De Vrindt, Renske Bouwer, Wim Van Den Noortgate, Marije Lesterhuis, Anaïs Tack)
- Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection (Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe)
- Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback (Charles Koutcheme, Nicola Dainese, Arto Hellas)
- Analyzing Interview Questions via Bloom’s Taxonomy to Enhance the Design Thinking Process (Fatemeh Kazemi Vanhari, Christopher Anand, Charles Welch)
- Estimation of Text Difficulty in the Context of Language Learning (Anisia Katinskaia, Anh-Duc Vu, Jue Hou, Ulla Vanhatalo, Yiheng Wu, Roman Yangarber)
- Are Large Language Models for Education Reliable Across Languages? (Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan)
- Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs (Stefano Banno, Kate Knill, Mark Gales)
- Advancing Question Generation with Joint Narrative and Difficulty Control (Bernardo Leite, Henrique Lopes Cardoso)
- Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments (Fabian Zehner, Hyo Jeong Shin, Emily Kerzabi, Andrea Horbach, Sebastian Gombert, Frank Goldhammer, Torsten Zesch, Nico Andersen)
- Lessons Learned in Assessing Student Reflections with LLMs (Mohamed Elaraby, Diane Litman)
- Using NLI to Identify Potential Collocation Transfer in L2 English (Haiyin Yang, Zoey Liu, Stefanie Wulff)
- Name of Thrones: How Do LLMs Rank Student Names in Status Hierarchies Based on Race and Gender? (Annabella Sakunkoo, Jonathan Sakunkoo)
- Exploring LLM-Based Assessment of Italian Middle School Writing: A Pilot Study (Adriana Mirabella, Dominique Brunato)
- Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o (Yuya Asano, Beata Beigman Klebanov, Jamie Mikeska)
- A Bayesian Approach to Inferring Prerequisite Structures and Topic Difficulty in Language Learning (Anh-Duc Vu, Jue Hou, Anisia Katinskaia, Ching-Fan Sheu, Roman Yangarber)
- Improving In-context Learning Example Retrieval for Classroom Discussion Assessment with Re-ranking and Label Ratio Regulation (Nhat Tran, Diane Litman, Benjamin Pierce, Richard Correnti, Lindsay Clare Matsumura)
- Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues (Fareya Ikram, Alexander Scarlatos, Andrew Lan)
- Assessing Critical Thinking Components in Romanian Secondary School Textbooks: A Data Mining Approach to the ROTEX Corpus (Madalina Chitez, Liviu Dinu, Marius Micluta-Campeanu, Ana-Maria Bucur, Roxana Rogobete)
- Improving AI assistants embedded in short e-learning courses with limited textual content (Jacek Marciniak, Marek Kubis, Michał Gulczyński, Adam Szpilkowski, Adam Wieczarek, Marcin Szczepański)
- Beyond Linear Digital Reading: An LLM-Powered Concept Mapping Approach for Reducing Cognitive Load (Junzhi Han, Jinho D. Choi)
- GermDetect: Verb Placement Error Detection Datasets for Learners of Germanic Languages (Noah-Manuel Michael, Andrea Horbach)
- Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems (Sahar Yarmohammadtoosky, Yiyun Zhou, Victoria Yaneva, Peter Baldwin, Saed Rezayi, Brian Clauser, Polina Harik)
- EyeLLM: Using Lookback Fixations to Enhance Human-LLM Alignment for Text Completion (Astha Singh, Mark Torrance, Evgeny Chukharev)
- Span Labeling with Large Language Models: Shell vs. Meat (Phoebe Mulcaire, Nitin Madnani)
- Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation (Kseniia Petukhova, Ekaterina Kochmar)
- Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset (Aayush Kucheria, Nitin Sawhney, Arto Hellas)
- Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic (Zhenjiang Mao, Artem Bisliouk, Rohith Nama, Ivan Ruchkin)
- Automated Scoring of Communication Skills in Physician-Patient Interaction: Balancing Performance and Scalability (Saed Rezayi, Le An Ha, Yiyun Zhou, Andrew Houriet, Angelo D’Addario, Peter Baldwin, Polina Harik, Ann King, Victoria Yaneva)
- Decoding Actionability: A Computational Analysis of Teacher Observation Feedback (Mayank Sharma, Jason Zhang)
- EduCSW: Building a Mandarin-English Code-Switched Generation Pipeline for Computer Science Learning (Ruishi Chen, Yiling Zhao)
- STAIR-AIG: Optimizing the Automated Item Generation Process through Human-AI Collaboration for Critical Thinking Assessment (Euigyum Kim, Seewoo Li, Salah Khalil, Hyo Jeong Shin)
- UPSC2M: Benchmarking Adaptive Learning from Two Million MCQ Attempts (Kevin Shi, Karttikeya Mangalam)
- Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays? (Veronica Schmalz, Anaïs Tack)
- Paragraph-level Error Correction and Explanation Generation: Case Study for Estonian (Martin Vainikko, Taavi Kamarik, Karina Kert, Krista Liin, Silvia Maine, Kais Allkivi, Annekatrin Kaivapalu, Mark Fishel)
- End-to-End Automated Item Generation and Scoring for Adaptive English Writing Assessment with Large Language Models (Kamel Nebhi, Amrita Panesar, Hans Bantilan)
- A Framework for Proficiency-Aligned Grammar Practice in LLM-Based Dialogue Systems (Luisa Ribeiro-Flucht, Xiaobin Chen, Detmar Meurers)
- Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension? (KV Aditya Srivatsa, Kaushal Maurya, Ekaterina Kochmar)
- LLM-Assisted, Iterative Curriculum Writing: A Human-Centered AI Approach in Finnish Higher Education (Leo Huovinen, Mika Hämäläinen)
Shared Task Track
The shared task track accepted 27 papers: one shared task overview paper and 26 system description papers.
- Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors (Ekaterina Kochmar, Kaushal Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, Justin Vasselli)
- Jinan Smart Education at BEA 2025 Shared Task: Dual Encoder Architecture for Tutor Identification via Semantic Understanding of Pedagogical Conversations (Lei Chen)
- Wonderland_EDU@HKU at BEA 2025 Shared Task: Fine-tuning Large Language Models to Evaluate the Pedagogical Ability of AI-powered Tutors (Deliang Wang, Chao Yang, Gaowei Chen)
- bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning (Jihyeon Roh, Jinhyun Bang)
- CU at BEA 2025 Shared Task: A BERT-Based Cross-Attention Approach for Evaluating Pedagogical Responses in Dialogue (Zhihao Lyu)
- BJTU at BEA 2025 Shared Task: Task-Aware Prompt Tuning and Data Augmentation for Evaluating AI Math Tutors (Yuming Fan, Chuangchuang Tan, Wenyu Song)
- SYSUpporter Team at BEA 2025 Shared Task: Class Compensation and Assignment Optimization for LLM-generated Tutor Identification (Longfeng Chen, Zeyu Huang, Zheng Xiao, Yawen Zeng, Jin Xu)
- BLCU-ICALL at BEA 2025 Shared Task: Multi-Strategy Evaluation of AI Tutors (Jiyuan An, Xiang Fu, Bo Liu, Xuquan Zong, Cunliang Kong, Shuliang Liu, Shuo Wang, Zhenghao Liu, Liner Yang, Hanghang Fan, Erhong Yang)
- Phaedrus at BEA 2025 Shared Task: Assessment of Mathematical Tutoring Dialogues through Tutor Identity Classification and Actionability Evaluation (Rajneesh Tiwari, Pranshu Rastogi)
- Emergent Wisdom at BEA 2025 Shared Task: From Lexical Understanding to Reflective Reasoning for Pedagogical Ability Assessment (Raunak Jain, Srinivasan Rengarajan)
- Averroes at BEA 2025 Shared Task: Verifying Mistake Identification in Tutor, Student Dialogue (Mazen Yasser, Mariam Saeed, Hossam Elkordi, Ayman Khalafallah)
- SmolLab_SEU at BEA 2025 Shared Task: A Transformer-Based Framework for Multi-Track Pedagogical Evaluation of AI-Powered Tutors (Md. Abdur Rahman, Md Al Amin, Sabik Aftahee, Muhammad Junayed, Md Ashiqur Rahman)
- RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? (Santiago Góngora, Ignacio Sastre, Santiago Robaina, Ignacio Remersaro, Luis Chiruzzo, Aiala Rosá)
- K-NLPers at BEA 2025 Shared Task: Evaluating the Quality of AI Tutor Responses with GPT-4.1 (Geon Park, Jiwoo Song, Gihyeon Choi, Juoh Sun, Harksoo Kim)
- Henry at BEA 2025 Shared Task: Improving AI Tutor’s Guidance Evaluation Through Context-Aware Distillation (Henry Pit)
- TBA at BEA 2025 Shared Task: Transfer-Learning from DARE-TIES Merged Models for the Pedagogical Ability Assessment of LLM-Powered Math Tutors (Sebastian Gombert, Fabian Zehner, Hendrik Drachsler)
- LexiLogic at BEA 2025 Shared Task: Fine-tuning Transformer Language Models for the Pedagogical Skill Evaluation of LLM-based tutors (Souvik Bhattacharyya, Billodal Roy, Niranjan M, Pranav Gupta)
- IALab UC at BEA 2025 Shared Task: LLM-Powered Expert Pedagogical Feature Extraction (Sofía Correa Busquets, Valentina Córdova Véliz, Jorge Baier)
- MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors (Baraa Hikal, Mohmaed Basem, Islam Oshallah, Ali Hamdi)
- TutorMind at BEA 2025 Shared Task: Leveraging Fine-Tuned LLMs and Data Augmentation for Mistake Identification (Fatima Dekmak, Christian Khairallah, Wissam Antoun)
- Two Outliers at BEA 2025 Shared Task: Tutor Identity Classification using DiReC, a Two-Stage Disentangled Contrastive Representation (Eduardus Tjitrahardja, Ikhlasul Hanif)
- Archaeology at BEA 2025 Shared Task: Are Simple Baselines Good Enough? (Ana Roșu, Iani Ispas, Sergiu Nisioi)
- NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors (Trishita Saha, Shrenik Ganguli, Maunendra Sankar Desarkar)
- NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors (Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal)
- DLSU at BEA 2025 Shared Task: Towards Establishing Baseline Models for Pedagogical Response Evaluation Tasks (Maria Monica Manlises, Mark Edward Gonzales, Lanz Lim)
- BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses (Shadman Rohan, Ishita Sur Apan, Muhtasim Shochcho, Md Fahim, Mohammad Rahman, AKM Mahbubur Rahman, Amin Ali)
- Thapar Titan/s: Fine-Tuning Pretrained Language Models with Contextual Augmentation for Mistake Identification in Tutor–Student Dialogues (Harsh Dadwal, Sparsh Rastogi, Jatin Bedi)