09:00 - 09:45 | Keynote Talk by Kostia Omelianchuk
09:45 - 10:30 | Oral Session B
09:45 - 10:00 | LLMs in alliance with Edit-based models: advancing In-Context Learning for Grammatical Error Correction by Specific Example Selection (Alexey Sorokin, Regina Nasyrova)
10:00 - 10:15 | Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors (Ekaterina Kochmar, Kaushal Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, Justin Vasselli)
10:15 - 10:30 | MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors (Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi)
10:30 - 11:00 | Coffee Break
11:00 - 12:30 | Poster Session B
20 | Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests (Stefan Dascalescu, Marius Dumitran, Mihai Alexandru Vasiluta)
31 | Is Lunch Free Yet? Overcoming the Cold-Start Problem in Supervised Content Scoring using Zero-Shot LLM-Generated Training Data (Marie Bexte, Torsten Zesch)
35 | Towards a Real-time Swedish Speech Analyzer for Language Learning Games: A Hybrid AI Approach to Language Assessment (Tianyi Geng, David Alfter)
50 | LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages (Karthika N J, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi)
62 | The Need for Truly Graded Lexical Complexity Prediction (David Alfter)
66 | Educators’ Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting (Sankalan Pal Chowdhury, Terry Jingchen Zhang, Donya Rooein, Dirk Hovy, Tanja Käser, Mrinmaya Sachan)
73 | Costs and Benefits of AI-Enabled Topic Modeling in P-20 Research: The Case of School Improvement Plans (Syeda Sabrina Akter, Seth Hunter, David Woo, Antonios Anastasopoulos)
93 | Are Large Language Models for Education Reliable Across Languages? (Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan)
138 | Span Labeling with Large Language Models: Shell vs. Meat (Phoebe Mulcaire, Nitin Madnani)
153 | STAIR-AIG: Optimizing the Automated Item Generation Process through Human-AI Collaboration for Critical Thinking Assessment (Euigyum Kim, Seewoo Li, Salah Khalil, Hyo Jeong Shin)
164 | End-to-End Automated Item Generation and Scoring for Adaptive English Writing Assessment with Large Language Models (Kamel Nebhi, Amrita Panesar, Hans Bantilan)
175 | bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning (Jihyeon Roh, Jinhyun Bang)
185 | K-NLPers at BEA 2025 Shared Task: Evaluating the Quality of AI Tutor Responses with GPT-4.1 (Geon Park, Jiwoo Song, Gihyeon Choi, Juoh Sun, Harksoo Kim)
190 | IALab UC at BEA 2025 Shared Task: LLM-Powered Expert Pedagogical Feature Extraction (Sofía Correa Busquets, Valentina Córdova Véliz, Jorge Baier)
188 | TBA at BEA 2025 Shared Task: Transfer-Learning from DARE-TIES Merged Models for the Pedagogical Ability Assessment of LLM-Powered Math Tutors (Sebastian Gombert, Fabian Zehner, Hendrik Drachsler)
30 | COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content (Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy Chen)
88 | Analyzing Interview Questions via Bloom’s Taxonomy to Enhance the Design Thinking Process (Fatemeh Kazemi Vanhari, Christopher Anand, Charles Welch)
110 | Exploring LLM-Based Assessment of Italian Middle School Writing: A Pilot Study (Adriana Mirabella, Dominique Brunato)
129 | Beyond Linear Digital Reading: An LLM-Powered Concept Mapping Approach for Reducing Cognitive Load (Junzhi Han, Jinho D. Choi)
179 | BLCU-ICALL at BEA 2025 Shared Task: Multi-Strategy Evaluation of AI Tutors (Jiyuan An, Xiang Fu, Bo Liu, Xuquan Zong, Cunliang Kong, Shuliang Liu, Shuo Wang, Zhenghao Liu, Liner Yang, Hanghang Fan, Erhong Yang)
173 | Jinan Smart Education at BEA 2025 Shared Task: Dual Encoder Architecture for Tutor Identification via Semantic Understanding of Pedagogical Conversations (Lei Chen)
176 | CU at BEA 2025 Shared Task: A BERT-Based Cross-Attention Approach for Evaluating Pedagogical Responses in Dialogue (Zhihao Lyu)
178 | SYSUpporter Team at BEA 2025 Shared Task: Class Compensation and Assignment Optimization for LLM-generated Tutor Identification (Longfeng Chen, Zeyu Huang, Zheng Xiao, Yawen Zeng, Jin Xu)
181 | Emergent Wisdom at BEA 2025 Shared Task: From Lexical Understanding to Reflective Reasoning for Pedagogical Ability Assessment (Raunak Jain, Srinivasan Rengarajan)
186 | Henry at BEA 2025 Shared Task: Improving AI Tutor’s Guidance Evaluation Through Context-Aware Distillation (Henry Pit)
192 | TutorMind at BEA 2025 Shared Task: Leveraging Fine-Tuned LLMs and Data Augmentation for Mistake Identification (Fatima Dekmak, Christian Khairallah, Wissam Antoun)
198 | BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses (Shadman Rohan, Ishita Sur Apan, Muhtasim Shochcho, Md Fahim, Mohammad Rahman, AKM Mahbubur Rahman, Amin Ali)
168 | LLM-Assisted, Iterative Curriculum Writing: A Human-Centered AI Approach in Finnish Higher Education (Leo Huovinen, Mika Hämäläinen)
12:30 - 14:00 | Lunch Break / Birds of a Feather
14:00 - 15:30 | Poster Session C
27 | Can LLMs Effectively Simulate Human Learners? Teachers’ Insights from Tutoring LLM Students (Daria Martynova, Jakub Macina, Nico Daheim, Nilay Yalcin, Xiaoyu Zhang, Mrinmaya Sachan)
29 | Adapting LLMs for Minimal-edit Grammatical Error Correction (Ryszard Staruch, Filip Gralinski, Daniel Dzienisiewicz)
53 | Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? (Andreas Säuberli, Diego Frassinelli, Barbara Plank)
63 | Towards Automatic Formal Feedback on Scientific Documents (Louise Bloch, Johannes Rückert, Christoph Friedrich)
67 | Transformer-Based Real-Word Spelling Error Feedback with Configurable Confusion Sets (Torsten Zesch, Dominic Gardner, Marie Bexte)
75 | Unsupervised Sentence Readability Estimation Based on Parallel Corpora for Text Simplification (Rina Miyata, Toru Urakawa, Hideaki Tamori, Tomoyuki Kajiwara)
100 | Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs (Stefano Banno, Kate Knill, Mark Gales)
128 | Improving AI assistants embedded in short e-learning courses with limited textual content (Jacek Marciniak, Marek Kubis, Michał Gulczyński, Adam Szpilkowski, Adam Wieczarek, Marcin Szczepański)
134 | GermDetect: Verb Placement Error Detection Datasets for Learners of Germanic Languages (Noah-Manuel Michael, Andrea Horbach)
147 | Automated Scoring of Communication Skills in Physician-Patient Interaction: Balancing Performance and Scalability (Saed Rezayi, Le An Ha, Yiyun Zhou, Andrew Houriet, Angelo D’Addario, Peter Baldwin, Polina Harik, Ann King, Victoria Yaneva)
156 | Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays? (Veronica Schmalz, Anaïs Tack)
166 | A Framework for Proficiency-Aligned Grammar Practice in LLM-Based Dialogue Systems (Luisa Ribeiro-Flucht, Xiaobin Chen, Detmar Meurers)
184 | RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? (Santiago Góngora, Ignacio Sastre, Santiago Robaina, Ignacio Remersaro, Luis Chiruzzo, Aiala Rosá)
194 | Archaeology at BEA 2025 Shared Task: Are Simple Baselines Good Enough? (Ana Roșu, Iani Ispas, Sergiu Nisioi)
195 | NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors (Trishita Saha, Shrenik Ganguli, Maunendra Sankar Desarkar)
40 | LLM-based post-editing as reference-free GEC evaluation (Robert Östling, Murathan Kurfali, Andrew Caines)
91 | Estimation of Text Difficulty in the Context of Language Learning (Anisia Katinskaia, Anh-Duc Vu, Jue Hou, Ulla Vanhatalo, Yiheng Wu, Roman Yangarber)
115 | Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o (Yuya Asano, Beata Beigman Klebanov, Jamie Mikeska)
137 | EyeLLM: Using Lookback Fixations to Enhance Human-LLM Alignment for Text Completion (Astha Singh, Mark Torrance, Evgeny Chukharev)
148 | Decoding Actionability: A Computational Analysis of Teacher Observation Feedback (Mayank Sharma, Jason Zhang)
199 | Thapar Titan/s: Fine-Tuning Pretrained Language Models with Contextual Augmentation for Mistake Identification in Tutor–Student Dialogues (Harsh Dadwal, Sparsh Rastogi, Jatin Bedi)
174 | Wonderland_EDU@HKU at BEA 2025 Shared Task: Fine-tuning Large Language Models to Evaluate the Pedagogical Ability of AI-powered Tutors (Deliang Wang, Chao Yang, Gaowei Chen)
177 | BJTU at BEA 2025 Shared Task: Task-Aware Prompt Tuning and Data Augmentation for Evaluating AI Math Tutors (Yuming Fan, Chuangchuang Tan, Wenyu Song)
183 | SmolLab_SEU at BEA 2025 Shared Task: A Transformer-Based Framework for Multi-Track Pedagogical Evaluation of AI-Powered Tutors (Md. Abdur Rahman, Md Al Amin, Sabik Aftahee, Muhammad Junayed, Md Ashiqur Rahman)
189 | LexiLogic at BEA 2025 Shared Task: Fine-tuning Transformer Language Models for the Pedagogical Skill Evaluation of LLM-based tutors (Souvik Bhattacharyya, Billodal Roy, Niranjan M, Pranav Gupta)
197 | DLSU at BEA 2025 Shared Task: Towards Establishing Baseline Models for Pedagogical Response Evaluation Tasks (Maria Monica Manlises, Mark Edward Gonzales, Lanz Lim)
58 | LookAlike: Consistent Distractor Generation in Math MCQs (Nisarg Parikh, Alexander Scarlatos, Nigel Fernandez, Simon Woodhead, Andrew Lan)
77 | From End-Users to Co-Designers: Lessons from Teachers (Martina Galletti, Valeria Cesaroni)
15:30 - 16:00 | Coffee Break
16:00 - 17:15 | Oral Session C
16:00 - 16:15 | Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments (Fabian Zehner, Hyo Jeong Shin, Emily Kerzabi, Andrea Horbach, Sebastian Gombert, Frank Goldhammer, Torsten Zesch, Nico Andersen)
16:15 - 16:30 | Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback (Charles Koutcheme, Nicola Dainese, Arto Hellas)
16:30 - 16:45 | Advancing Question Generation with Joint Narrative and Difficulty Control (Bernardo Leite, Henrique Lopes Cardoso)
16:45 - 17:00 | Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation (Kseniia Petukhova, Ekaterina Kochmar)
17:00 - 17:15 | LLMs Protégés: Tutoring LLMs with Knowledge Gaps Improves Student Learning Outcome (Andrei Kucharavy, Cyril Vallez, Dimitri Percia David)
17:15 - 17:30 | Closing Remarks