BEA 2025 Shared Task
Pedagogical Ability Assessment of AI-powered Tutors
Motivation
Conversational agents offer promising opportunities for education as they can fulfill various roles (e.g., intelligent tutors and service-oriented assistants) and pursue different objectives (e.g., improving student skills and increasing instructional efficiency) (Wollny et al., 2021), among which serving as an AI tutor is one of the most prevalent tasks (Tack et al., 2023). Recent advances in the development of Large Language Models (LLMs) provide our field with promising ways of building AI-based conversational tutors, which can generate human-sounding dialogues on the fly. The key question posed in previous research (Tack and Piech, 2022; Tack et al., 2023), however, remains: how can we test whether state-of-the-art generative models are good AI teachers, capable of replying to a student in an educational dialogue?
Evaluating dialogue systems in general presents a significant challenge. While human evaluation is still considered the most reliable method for assessing dialogue quality, its high cost and lack of reproducibility have led to the adaptation of both reference-based and reference-free automatic metrics, originally used in machine translation and summary evaluation, for dialogue evaluation (Lin, 2004; Popovic, 2017; Post, 2018; Gao et al., 2020; Liu et al., 2023). When it comes to Intelligent Tutoring Systems (ITSs), which also function as dialogue systems with the specific role of acting as tutors, these general metrics are insufficient. In the educational context, we need to assess complex pedagogical aspects and abilities of such systems, ensuring that they provide students with sufficient, helpful, and factually correct guidance and do not simply reveal answers when the student makes a mistake, among other aspects. Therefore, developing automatic metrics to evaluate these nuanced aspects is essential for creating effective and helpful tutoring systems.
Due to the lack of a standardized evaluation taxonomy, previous work has used different criteria for evaluation. For example, Tack and Piech (2022) and Tack et al. (2023) evaluated models’ responses in terms of whether they speak like a teacher, understand a student, and help a student, while in Macina et al. (2023), responses of models playing the role of tutors were evaluated by human annotators for coherence, correctness, and equitable tutoring. At the same time, Wang et al. (2024) assess usefulness, care, and human-likeness, and Daheim et al. (2024) use targetedness, correctness, and actionability of a tutor response as quality evaluation criteria. This lack of standardization makes it difficult to compare different systems; therefore, defining evaluation criteria and developing automatic metrics for them is a crucial step in advancing the field, which we aim to address in this shared task.
Task Goals & Description
Following the successful BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues (Tack et al., 2023), we revisit the question of quality assessment of tutor responses generated by AI models (specifically, LLMs) in the context of educational dialogues. We believe that (1) the topic is timely and important, and the shared task will attract the attention of the BEA community; (2) LLMs have significantly advanced in the past couple of years, making it important to revisit this topic after the competition run in 2023; and (3) there is a need to establish a pedagogically motivated benchmark for this task. In contrast to the BEA 2023 shared task, our focus is not on the generation of educational dialogues using state-of-the-art LLMs, but rather on the comprehensive evaluation of AI-tutor responses using a set of pedagogically motivated metrics.
In this shared task, we will focus on educational dialogues between a student and a tutor in the mathematical domain grounded in student mistakes or confusion, where the AI tutor aims to remediate such mistakes or confusion. Dialogues in the datasets provided in this shared task include:
- The context consisting of several prior turns from both the tutor and the student. These are extracted from two popular datasets of educational dialogues in the mathematical domain – MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024);
- The last utterance from the student containing a mistake; and
- A set of possible responses to the student’s last utterance from a range of LLM-based tutors and, where available, human tutors, aimed at mistake remediation.
The LLM-based tutor responses are generated by the organizers of the shared task using a set of state-of-the-art LLMs of various sizes and capabilities, including: GPT-4 (Achiam et al., 2023), Gemini (Reid et al., 2024), Sonnet (Anthropic), Mistral (Jiang et al., 2023), Llama-3.1-8B and Llama-3.1-405B (Dubey et al., 2024), and Phi-3 (Abdin et al., 2024).
The identities of the tutors will be included in the development set provided to the task participants, but not in the test set. In addition to the responses themselves, the development set contains annotations of their quality along the following pedagogically motivated dimensions (Maurya et al., 2025):
- Mistake identification: Since all dialogues in the dataset contain a mistake made by the student, a good quality response from the tutor should include the relevant mistake identification. This corresponds to student understanding in the schema of Tack and Piech (2022) and correctness in the schemata of Macina et al. (2023) and Daheim et al. (2024).
- Mistake location: A good tutor response should not only notify the student of the committed error, but also point to its location in the answer and outline what the error is, to help the student remediate it in their next response. This corresponds to targetedness in Daheim et al. (2024).
- Providing guidance: A good tutor response should provide the student with relevant and helpful guidance, such as a hint, an explanation, a supporting question, and the like. This aspect corresponds to helping a student in Tack and Piech (2022) and usefulness in Wang et al. (2024).
- Actionability: Finally, once the guidance is provided to a student, it should be clear from a good tutor response what the student should do next; in other words, the tutor response should not be vague, unclear or a conversation stopper. This aspect in our schema corresponds to actionability in Daheim et al. (2024).
For more details on the annotation and data collection, please refer to Maurya et al. (2025).
Data
Both development and test sets are distributed as JSON files (Maurya et al., 2025).
Development set
The development set consists of 300 dialogues from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets, including the context of several prior turns from both the tutor and the student, the last utterance from the student containing a mistake, and a set of responses to the student’s last utterance from 7 LLM-based tutors and human tutors (a single tutor in MathDial, and an expert and a novice tutor in Bridge), annotated for their pedagogical quality. In total, the development set contains over 2,480 tutor responses, presented and annotated as follows:
{
  "conversation_id": "29640377",
  "conversation history": "Tutor: Here is our first question.\nTutor: What is the value of 10/5?\nStudent: 50",
  "tutor_responses": {
    "Llama318B": {
      "response": "I think there might be a small mistake there, let's take a look at the problem together – when you divide 10 by 5, we actually get 2, not 50.",
      "annotation": {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "Yes",
        "Providing_Guidance": "To some extent",
        "Actionability": "No"
      }
    },
    "Sonnet": {
      "response": "I appreciate your effort, but let's think about this carefully: if we divide 10 into 5 equal groups, how many would be in each group?",
      "annotation": {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "Yes",
        "Providing_Guidance": "Yes",
        "Actionability": "Yes"
      }
    },
    "Gemini": {
      "response": "Remember, dividing by 5 is like sharing 10 things equally among 5 groups. How many would be in each group? Let's try that again.",
      "annotation": {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "Yes",
        "Providing_Guidance": "Yes",
        "Actionability": "Yes"
      }
    },
    "..."
  }
}
The following fields are included in the JSON:
- conversation_id: a unique identifier for the instance
- conversation history: the context of several prior turns from the tutor and the student, extracted from the original datasets
- tutor_responses: the set of human tutor responses extracted from the original datasets, as well as responses generated by 7 LLMs-as-tutors, each with a unique identifier
  - response: the response from a particular tutor
  - annotation: the set of annotations, including:
    - Mistake_Identification: marking whether the tutor recognized a mistake in a student’s response
    - Mistake_Location: marking whether the tutor’s response accurately points to a genuine mistake and its location
    - Providing_Guidance: marking whether the tutor offers correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on
    - Actionability: marking if it is clear from the tutor’s feedback what the student should do next
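For illustration, the following minimal Python sketch shows one way to load the development set and flatten it into per-response examples. The file name dev_set.json and the assumption that the file contains a JSON list of dialogue objects in the format shown above are ours, not part of the official release.

import json

# Minimal sketch: load the development set and flatten it into per-response
# examples. "dev_set.json" is a placeholder file name, and the file is assumed
# to contain a JSON list of dialogue objects in the format shown above.
with open("dev_set.json", encoding="utf-8") as f:
    dialogues = json.load(f)

examples = []
for dialogue in dialogues:
    context = dialogue["conversation history"]
    for tutor_id, entry in dialogue["tutor_responses"].items():
        examples.append({
            "conversation_id": dialogue["conversation_id"],
            "tutor": tutor_id,              # tutor identities are available in the development set only
            "context": context,
            "response": entry["response"],
            "labels": entry["annotation"],  # the four pedagogical dimensions
        })

print(f"{len(examples)} annotated tutor responses")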
Test set
The test set consists of 200 dialogues from MathDial and Bridge, also including the context of several prior turns from both the tutor and the student, the last utterance from the student containing a mistake, and a set of unannotated responses to the student’s last utterance from the same set of tutors as in the development set. The key difference is that the test set includes neither the tutor identities nor the annotations of the pedagogical quality of their responses.
As a result, the test set is organized as follows:
{
  "conversation_id": "616653340",
  "conversation history": "Tutor: 39 is a prime number or composite number?\nStudent: prime",
  "tutor_responses": {
    "Tutor_1": {
      "response": "That's a good try, but remember, prime numbers have only two factors: 1 and itself."
    },
    "Tutor_2": {
      "response": "Hmm, what makes you think prime?"
    },
    "Tutor_3": {
      "response": "That's correct, 39 is actually a composite number because it has factors other than 1 and itself, such as 3 and 13."
    },
    "Tutor_4": {
      "response": "That was an incorrect answer."
    },
    "Tutor_5": {
      "response": "Let's check that again. Can you find two numbers that multiply to give 39?"
    },
    "..."
  }
}
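As a rough sketch of how these test instances could be fed to a classifier for Tracks 1-4 (again assuming the file is a JSON list of dialogue objects; predict_labels below is a hypothetical placeholder for a participant's own model, and the official submission format will be announced by the organizers):

import json

def predict_labels(context: str, response: str) -> dict:
    """Hypothetical placeholder classifier for Tracks 1-4; replace with an actual model."""
    return {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "To some extent",
        "Providing_Guidance": "To some extent",
        "Actionability": "No",
    }

# "test_set.json" is a placeholder file name, assumed to hold a list of
# dialogue objects in the format shown above (anonymized tutor IDs, no annotations).
with open("test_set.json", encoding="utf-8") as f:
    test_dialogues = json.load(f)

predictions = []
for dialogue in test_dialogues:
    for tutor_id, entry in dialogue["tutor_responses"].items():
        labels = predict_labels(dialogue["conversation history"], entry["response"])
        predictions.append({"conversation_id": dialogue["conversation_id"],
                            "tutor": tutor_id, **labels})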
Shared Task
This shared task will include five tracks. Participating teams are welcome to take part in any number of tracks.
Tracks
- Track 1 - Mistake Identification:
Teams are invited to develop systems to detect whether tutors’ responses recognize mistakes in students’ responses. The following categories are included:
  - Yes: the mistake is clearly identified/recognized in the tutor’s response
  - To some extent: the tutor’s response suggests that there may be a mistake, but it sounds as if the tutor is not certain
  - No: the tutor does not recognize the mistake (e.g., they proceed to simply provide the answer to the asked question)
- Track 2 - Mistake Location:
Teams are invited to develop systems to assess whether tutors’ responses accurately point to a genuine mistake and its location in the students’ responses. The following categories are included:
  - Yes: the tutor clearly points to the exact location of a genuine mistake in the student’s solution
  - To some extent: the response demonstrates some awareness of the exact mistake, but is vague, unclear, or easy to misunderstand
  - No: the response does not provide any details related to the mistake
- Track 3 - Pedagogical Guidance:
Teams are invited to develop systems to evaluate whether tutors’ responses offer correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on. The following categories are included:
  - Yes: the tutor provides guidance that is correct and relevant to the student’s mistake
  - To some extent: guidance is provided, but it is fully or partially incorrect, incomplete, or somewhat misleading
  - No: the tutor’s response does not include any guidance, or the guidance provided is irrelevant to the question or factually incorrect
- Track 4 - Actionability:
Teams are invited to develop systems to assess whether tutors’ feedback is actionable, i.e., it makes it clear what the student should do next. The following categories are included:
  - Yes: the response provides clear suggestions on what the student should do next
  - To some extent: the response indicates that something needs to be done, but it is not clear what exactly that is
  - No: the response does not suggest any action on the part of the student (e.g., it simply reveals the final answer)
- Track 5 - Guess the tutor identity: Teams are invited to develop systems to identify which tutors the anonymized responses in the test set originated from. This track will address 9 classes: expert and novice tutors, and 7 LLMs included in the tutor set.
Evaluation
Tracks 1-4 will use accuracy and macro F1 as the main metrics. These will be used in two settings:
- Exact evaluation: predictions submitted by the teams will be evaluated for the exact prediction of the three classes (“Yes”, “To some extent”, and “No”)
- Lenient evaluation: since for these dimensions tutor responses annotated as “Yes” and “To some extent” share a certain amount of qualitative value, we will consider “Yes” and “To some extent” as a single class, and evaluate predictions under the 2-class setting (“Yes + To some extent” vs. “No”)
Track 5 will use accuracy of the tutor identity prediction as its main metric.
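To make the two settings concrete, the sketch below computes accuracy and macro F1 with scikit-learn on a toy set of gold and predicted labels; the official scoring scripts on CodaLab may differ in implementation details.

from sklearn.metrics import accuracy_score, f1_score

# Toy example labels for one of Tracks 1-4.
gold = ["Yes", "To some extent", "No", "Yes", "No"]
pred = ["Yes", "Yes", "No", "To some extent", "Yes"]

# Exact evaluation: all three classes are scored as-is.
exact_acc = accuracy_score(gold, pred)
exact_f1 = f1_score(gold, pred, average="macro")

def merge(labels):
    """Lenient evaluation: treat 'Yes' and 'To some extent' as a single class."""
    return ["Yes + To some extent" if lab != "No" else "No" for lab in labels]

lenient_acc = accuracy_score(merge(gold), merge(pred))
lenient_f1 = f1_score(merge(gold), merge(pred), average="macro")

print(f"Exact:   accuracy={exact_acc:.2f}, macro-F1={exact_f1:.2f}")
print(f"Lenient: accuracy={lenient_acc:.2f}, macro-F1={lenient_f1:.2f}")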
Participation
- Access to the development set will be provided upon registration. To register for the shared task, please fill in the form: https://forms.gle/fKJcdvL2kCrPcu8X6
- All updates about the shared task will be shared via the email addresses indicated in the registration form
- The test phase will be run via CodaLab; all registered participants will be provided with instructions closer to the date
- All teams officially participating in the test phase on CodaLab will be invited to publish their system papers in the BEA 2025 proceedings and present their work at the BEA 2025 workshop
Submission
Submissions will be made via CodaLab, with the number of submissions from each team capped at 5 per track. More information on the submission requirements will be provided closer to the date.
Important Dates
All deadlines are 11:59pm UTC-12 (anywhere on Earth).
- March 12, 2025: Development data release
- April 9, 2025: Test data release
- April 23, 2025: System submissions from teams due
- April 30, 2025: Evaluation of the results by the organizers
- May 21, 2025: System papers due
- May 28, 2025: Paper reviews returned
- June 9, 2025: Final camera-ready submissions
- July 31 and August 1, 2025: BEA 2025 workshop at ACL
FAQ
Questions about this shared task should be sent to bea.sharedtask.2025@gmail.com. We will share answers to frequently asked questions on this page.
Organizers
- Ekaterina Kochmar (MBZUAI)
- Kaushal Kumar Maurya (MBZUAI)
- Kseniia Petukhova (MBZUAI)
- KV Aditya Srivatsa (MBZUAI)
- Justin Vasselli (Nara Institute of Science and Technology)
- Anaïs Tack (KU Leuven)
Contact: bea.sharedtask.2025@gmail.com
Dataset Reference
- Kaushal Kumar Maurya, KV Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of NAACL 2025 (main).
References
- Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku.
- Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2024. Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors. arXiv preprint arXiv:2407.09136
- Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783
- Xiang Gao, Yizhe Zhang, Michel Galley, Chris Brockett, and Bill Dolan. 2020. Dialogue response ranking training with large-scale human feedback data. arXiv preprint arXiv:2009.06978
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634
- Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. arXiv preprint arXiv:2305.14536.
- Kaushal Kumar Maurya, KV Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of NAACL 2025 (main).
- Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618
- Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771
- Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530
- Anaïs Tack and Chris Piech. 2022. The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues. arXiv preprint arXiv:2205.07540
- Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, and Chris Piech. 2023. The BEA 2023 shared task on generating AI teacher responses in educational dialogues. arXiv preprint arXiv:2306.06941.
- Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. Bridging the novice-expert gap via models of decision-making: A case study on remediating math mistakes. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2174–2199
- Sebastian Wollny, Jan Schneider, Daniele Di Mitri, Joshua Weidlich, Marc Rittberger, and Hendrik Drachsler. 2021. Are We There Yet? - A Systematic Literature Review on Chatbots in Education. Frontiers in Artificial Intelligence, 4:654924. (doi:10.3389/frai.2021.654924)