BEA 2025 Shared Task
Pedagogical Ability Assessment of AI-powered Tutors
Motivation
Conversational agents offer promising opportunities for education as they can fulfill various roles (e.g., intelligent tutors and service-oriented assistants) and pursue different objectives (e.g., improving student skills and increasing instructional efficiency) (Wollny et al., 2021), among which serving as an AI tutor is one of the most prevalent tasks (Tack et al., 2023). Recent advances in the development of Large Language Models (LLMs) provide our field with promising ways of building AI-based conversational tutors, which can generate human-sounding dialogues on the fly. The key questions posed in previous research (Tack and Piech, 2022; Tack et al., 2023), however, remain: How can we test whether state-of-the-art generative models are good AI teachers, capable of replying to a student in an educational dialogue?
Evaluating dialogue systems in general presents a significant challenge. While human evaluation is still considered the most reliable method for assessing dialogue quality, its high cost and lack of reproducibility have led to the adaptation of both reference-based and reference-free automatic metrics, originally used in machine translation and summary evaluation, for dialogue evaluation (Lin, 2004; Popovic, 2017; Post, 2018; Gao et al., 2020; Liu et al., 2023). When it comes to Intelligent Tutoring Systems (ITSs), which also function as dialogue systems with the specific role of acting as tutors, these general metrics are insufficient. In the educational context, we need to assess complex pedagogical aspects and abilities of such systems, ensuring that they provide students with sufficient, helpful, and factually correct guidance and do not simply reveal answers when the student makes a mistake, among other aspects. Therefore, developing automatic metrics to evaluate these nuanced aspects is essential for creating effective and helpful tutoring systems.
Due to the lack of a standardized evaluation taxonomy, previous work has used different criteria for evaluation. For example, Tack and Piech (2022) and Tack et al. (2023) evaluated models’ responses in terms of whether they speak like a teacher, understand a student, and help a student, while in Macina et al. (2023), responses of models playing roles of tutors were evaluated by human annotators using coherence, correctness, and equitable tutoring. At the same time, Wang et al. (2024) assess usefulness, care, and human-likeness, and Daheim et al. (2024) use targetedness, correctness, and actionability of a tutor response as quality evaluation criteria. Such lack of standardization makes it difficult to compare different systems, and, therefore, defining evaluation criteria and developing automatic metrics for them is a crucial task for advancing the field, which we aim to address in this task.
Task Goals & Description
Following the successful BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues (Tack et al., 2023), we revisit the question of quality assessment of the tutor responses generated with the AI models (specifically, LLMs) in the context of educational dialogues. We believe that (1) the topic is timely and important, and the shared task will attract BEA community attention; (2) LLMs have significantly advanced in the past couple of years, making it important to revisit this topic after the competition run in 2023; and (3) there is a need to establish a pedagogically motivated benchmark for this task. In contrast to the BEA 2023 shared task, our focus is not on the generation of educational dialogues using state-of-the-art LLMs, but rather on comprehensive evaluation of AI-tutor responses using a set of pedagogically motivated metrics.
In this shared task, we will focus on educational dialogues between a student and a tutor in the mathematical domain grounded in student mistakes or confusion, where the AI tutor aims to remediate such mistakes or confusions. Dialogues in the datasets provided in this shared task include:
- The context consisting of several prior turns from both the tutor and the student and containing a mistake in the student’s solution. These are extracted from two popular datasets of educational dialogues in the mathematical domain – MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024);
- The last utterance from the student; and
- A set of possible responses following the last student’s utterance from a range of LLM-based tutors and, where available, human tutors, aimed at mistake remediation.
The LLM-based tutor responses are generated by the organizers of the shared task using a set of state-of-the-art LLMs of various sizes and capabilities, including: GPT-4 (Achiam et al., 2023), Gemini (Reid et al., 2024), Sonnet (Anthropic), Mistral (Jiang et al., 2023), Llama-3.1-8B and Llama-3.1-405B (Dubey et al., 2024), and Phi-3 (Abdin et al., 2024).
The identities of the tutors will be included in the development set provided to the task participants, but not in the test set. In addition to the responses themselves, the development set contains annotation of their quality along the following pedagogically motivated dimensions (Maurya et al., 2025):
- Mistake identification: Since all dialogues in the dataset contain a mistake made by the student, a good quality response from the tutor should include the relevant mistake identification. This corresponds to student understanding in the schema of Tack and Piech (2022) and correctness in the schemata of Macina et al. (2023) and Daheim et al. (2024).
- Mistake location: A good tutor response should not only notify the student of the committed error, but also point to its location in the answer and outline what the error is to help the student remediate it in their next response. This corresponds to targetedness in Daheim et al. (2024).
- Providing guidance: A good tutor response should provide the student with relevant and helpful guidance, such as a hint, an explanation, a supporting question, and the like. This aspect corresponds to helping a student in Tack and Piech (2022) and usefulness in Wang et al. (2024).
- Actionability: Finally, once the guidance is provided to a student, it should be clear from a good tutor response what the student should do next; in other words, the tutor response should not be vague, unclear or a conversation stopper. This aspect in our schema corresponds to actionability in Daheim et al. (2024).
For more details on the annotation and data collection, please refer to Maurya et al. (2025).
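For teams that prefer to work with the schema programmatically, the snippet below simply restates the four annotated dimensions and the three label values described above as Python constants. It is a convenience sketch, not part of the official data release; the strings mirror the JSON keys shown in the Data section.

# Convenience constants restating the annotation schema described above;
# the strings mirror the JSON keys and label values used in the released data.
DIMENSIONS = [
    "Mistake_Identification",  # does the tutor recognize the student's mistake?
    "Mistake_Location",        # does the response point to where the mistake is?
    "Providing_Guidance",      # does it offer correct and relevant guidance?
    "Actionability",           # is it clear what the student should do next?
]
LABELS = ["Yes", "To some extent", "No"]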
Data
Both development and test sets are distributed as JSON files (Maurya et al., 2025).
Development set
The development set consists of 300 dialogues from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets. Each dialogue includes the context of several prior turns from both the tutor and the student, containing a mistake or expressing confusion on the part of the student; the last utterance from the student; and a set of responses to the student from 7 LLM-based tutors and human tutors (a single tutor in MathDial, and an expert and a novice tutor in Bridge), annotated for their pedagogical quality. In total, the development set consists of over 2,480 tutor responses, presented and annotated as follows:
{
"conversation_id": "29640377",
"conversation history": "Tutor: Here is our first question.\nTutor: What is the value of 10/5?\nStudent: 50",
"tutor_responses": {
"Llama318B": {
"response": "I think there might be a small mistake there, let's take a look at the problem together – when you divide 10 by 5, we actually get 2, not 50.",
"annotation": {
"Mistake_Identification": "Yes",
"Mistake_Location": "Yes",
"Providing_Guidance": "To some extent",
"Actionability": "No"
}
},
"Sonnet": {
"response": "I appreciate your effort, but let's think about this carefully: if we divide 10 into 5 equal groups, how many would be in each group?",
"annotation": {
"Mistake_Identification": "Yes",
"Mistake_Location": "Yes",
"Providing_Guidance": "Yes",
"Actionability": "Yes"
}
},
"Gemini": {
"response": "Remember, dividing by 5 is like sharing 10 things equally among 5 groups. How many would be in each group? Let's try that again.",
"annotation": {
"Mistake_Identification": "Yes",
"Mistake_Location": "Yes",
"Providing_Guidance": "Yes",
"Actionability": "Yes"
}
},
"..."
}
}
The following fields are included in the JSON:
- conversation_id: a unique identifier for the instance
- conversation history: the context of several prior turns from the tutor and the student extracted from the original datasets
- tutor_responses: the set of human tutor responses extracted from the original datasets, as well as responses generated by 7 LLMs-as-tutors, each with a unique identifier
- response: the response from a particular tutor
- annotation: the set of annotations, including:
  - Mistake_Identification: marking whether the tutor recognized a mistake in a student’s response
  - Mistake_Location: marking whether the tutor’s response accurately points to a genuine mistake and its location
  - Providing_Guidance: marking whether the tutor offers correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on
  - Actionability: marking if it is clear from the tutor’s feedback what the student should do next
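As a starting point, the sketch below shows one way to load the development set and flatten it into one record per tutor response. The filename is a placeholder, and the sketch assumes the released file contains a list of dialogue objects in the format shown above.

# A minimal loading sketch (assumptions: "dev_set.json" is a placeholder filename,
# and the file holds a list of dialogue objects in the format shown above).
import json

with open("dev_set.json", encoding="utf-8") as f:
    dialogues = json.load(f)

records = []
for dialogue in dialogues:
    # the context key appears as "conversation history" in the example above and as
    # "conversation_history" in the FAQ excerpts, so both are handled here
    context = dialogue.get("conversation_history") or dialogue.get("conversation history")
    for tutor_id, entry in dialogue["tutor_responses"].items():
        records.append({
            "conversation_id": dialogue["conversation_id"],
            "tutor": tutor_id,                  # e.g., "Llama318B", "Sonnet", "Gemini"
            "context": context,
            "response": entry["response"],
            **entry["annotation"],              # the four dimensions described above
        })

print(len(records))  # over 2,480 annotated tutor responses in the development set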
Test set
The test set consists of 200 dialogues from MathDial and Bridge. As in the development set, each dialogue includes the context of several prior turns from both the tutor and the student, containing a mistake or expressing confusion on the part of the student; the last utterance from the student; and a set of unannotated responses to the last student’s utterance from the same set of tutors as in the development set. The key difference is that the test set includes neither the tutor identities nor annotations of the pedagogical quality of their responses.
As a result, the test set is organized as follows:
{
"conversation_id": "616653340",
"conversation history": "Tutor: 39 is a prime number or composite number?\nStudent: prime",
"tutor_responses": {
"Tutor_1": {
"response": "That's a good try, but remember, prime numbers have only two factors: 1 and itself.",
},
"Tutor_2": {
"response": "Hmm, what makes you think prime?",
},
"Tutor_3": {
"response": "That's correct, 39 is actually a composite number because it has factors other than 1 and itself, such as 3 and 13.",
},
"Tutor_4": {
"response": "That was an incorrect answer.",
},
"Tutor_5": {
"response": "Let's check that again. Can you find two numbers that multiply to give 39?",
},
"..."
}
}
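The test set can be traversed in the same way; the sketch below just attaches a placeholder label to every anonymized response. The filename and the output layout here are illustrative only: the required submission file format is described on each track’s CodaBench page (see the Submission section and FAQ 1 below).

# Illustrative traversal of the test set ("test_set.json" is a placeholder filename);
# the actual submission format is specified on the CodaBench track pages.
import json

with open("test_set.json", encoding="utf-8") as f:
    test_dialogues = json.load(f)

predictions = []
for dialogue in test_dialogues:
    for tutor_id, entry in dialogue["tutor_responses"].items():   # "Tutor_1", "Tutor_2", ...
        predictions.append({
            "conversation_id": dialogue["conversation_id"],
            "tutor": tutor_id,
            "Mistake_Identification": "Yes",   # placeholder; replace with your model's prediction
        })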
Shared Task
This shared task will include five tracks. Participating teams are welcome to take part in any number of tracks.
Tracks
- Track 1 - Mistake Identification:
Teams are invited to develop systems to detect whether tutors’ responses recognize mistakes in students’ responses. The following categories are included:
  - Yes: the mistake is clearly identified/recognized in the tutor’s response
  - To some extent: the tutor’s response suggests that there may be a mistake, but it sounds as if the tutor is not certain
  - No: the tutor does not recognize the mistake (e.g., they proceed to simply provide the answer to the asked question)
- Track 2 - Mistake Location:
Teams are invited to develop systems to assess whether tutors’ responses accurately point to a genuine mistake and its location in the students’ responses. The following categories are included:
  - Yes: the tutor clearly points to the exact location of a genuine mistake in the student’s solution
  - To some extent: the response demonstrates some awareness of the exact mistake, but is vague, unclear, or easy to misunderstand
  - No: the response does not provide any details related to the mistake
- Track 3 - Providing guidance:
Teams are invited to develop systems to evaluate whether tutors’ responses offer correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on. The following categories are included:
  - Yes: the tutor provides guidance that is correct and relevant to the student’s mistake
  - To some extent: guidance is provided but it is fully or partially incorrect, incomplete, or somewhat misleading
  - No: the tutor’s response does not include any guidance, or the guidance provided is irrelevant to the question or factually incorrect
- Track 4 - Actionability:
Teams are invited to develop systems to assess whether tutors’ feedback is actionable, i.e., it makes it clear what the student should do next. The following categories are included:
  - Yes: the response provides clear suggestions on what the student should do next
  - To some extent: the response indicates that something needs to be done, but it is not clear what exactly that is
  - No: the response does not suggest any action on the part of the student (e.g., it simply reveals the final answer)
- Track 5 - Guess the tutor identity: Teams are invited to develop systems to identify which tutors the anonymized responses in the test set originated from. This track will address 9 classes: expert and novice tutors, and 7 LLMs included in the tutor set.
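To make the Track 5 setup concrete, here is a deliberately simple baseline sketch (not an approach used or endorsed by the organizers): it trains a TF-IDF + logistic regression classifier on the development-set responses, using the tutor identifiers from the development set as labels, and assumes the records list built in the loading sketch in the Data section.

# Toy Track 5 baseline (illustrative only): predict the tutor identity from the
# response text alone, reusing the `records` list built in the loading sketch above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [r["response"] for r in records]
labels = [r["tutor"] for r in records]   # tutor identifiers: the 7 LLMs and the human tutors

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)

# At test time, the same pipeline would be applied to the anonymized responses.
print(baseline.predict(["Let's check that again. Can you find two numbers that multiply to give 39?"]))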
Evaluation
Tracks 1-4 will use accuracy and macro F1 as the main metrics. These will be used in two settings:
- Exact evaluation: predictions submitted by the teams will be evaluated for the exact prediction of the three classes (“Yes”, “To some extent”, and “No”)
- Lenient evaluation: since for these dimensions tutor responses annotated as “Yes” and “To some extent” share a certain amount of qualitative value, we will consider “Yes” and “To some extent” as a single class, and evaluate predictions under the 2-class setting (“Yes + To some extent” vs. “No”)
Track 5 will use accuracy of the tutor identity prediction as its main metric.
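For reference, the sketch below shows one way to reproduce the exact and lenient scores for Tracks 1-4 with scikit-learn; y_true and y_pred are toy label lists, and the merging of “Yes” and “To some extent” follows the lenient setting described above.

# Sketch of exact vs. lenient scoring for Tracks 1-4, on toy label lists.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["Yes", "To some extent", "No", "Yes"]   # toy gold labels
y_pred = ["Yes", "No", "No", "To some extent"]    # toy predictions

# Exact evaluation: three classes
exact_acc = accuracy_score(y_true, y_pred)
exact_f1 = f1_score(y_true, y_pred, average="macro")

# Lenient evaluation: "Yes" and "To some extent" are treated as a single class
def merge(labels):
    return ["Yes+To some extent" if label in ("Yes", "To some extent") else "No" for label in labels]

lenient_acc = accuracy_score(merge(y_true), merge(y_pred))
lenient_f1 = f1_score(merge(y_true), merge(y_pred), average="macro")

print(exact_acc, exact_f1, lenient_acc, lenient_f1)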
Participation
- To register for the shared task and subscribe to updates, please fill in the form: https://forms.gle/fKJcdvL2kCrPcu8X6
- All updates about the shared task will be sent to the email addresses indicated in the registration form
- Development and test data can be found in the Shared Task Repository
- All teams officially participating in the test phase on CodaBench will be invited to publish their system papers in the BEA 2025 proceedings and present their work at the BEA 2025 workshop
Submission
Submissions will be run via CodaBench, with the number of submissions from each team capped at 5 per track.
The five tracks can be accessed at the following links:
- Track 1 - Mistake Identification: https://www.codabench.org/competitions/7195/
- Track 2 - Mistake Location: https://www.codabench.org/competitions/7200/
- Track 3 - Providing guidance: https://www.codabench.org/competitions/7202/
- Track 4 - Actionability: https://www.codabench.org/competitions/7203/
- Track 5 - Guess the tutor identity: https://www.codabench.org/competitions/7206/
Results
Multiple teams registered and participated in the various tracks of the shared task. The official final leaderboards are presented below; comprehensive results for all submissions per team can be found on the respective shared task track pages on CodaBench.
Track 1: Mistake Identification
Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). Secondary metrics for the winning submissions – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.
Rank | Team | Submission | Ex. F1 | Ex. Acc | Len. F1 | Len. Acc |
---|---|---|---|---|---|---|
1 | BJTU | 269726 | 0.7181 | 0.8623 | 0.8957 | 0.9457 |
2 | TutorMind | 269923 | 0.7163 | 0.8759 | 0.9108 | 0.9528 |
3 | Averroes | 273668 | 0.7155 | 0.8675 | 0.8997 | 0.9425 |
4 | MSA | 272382 | 0.7154 | 0.8759 | 0.9152 | 0.9535 |
5 | BD | 276025 | 0.7110 | 0.8772 | 0.8966 | 0.9412 |
6 | Gooby-Snoob Guysz | 265229 | 0.7105 | 0.8481 | 0.8901 | 0.9373 |
7 | Wonderland_EDU@HKU | 267613 | 0.6983 | 0.8675 | 0.9109 | 0.9496 |
8 | Archaeology | 275559 | 0.6976 | 0.8675 | 0.8959 | 0.9405 |
9 | test | 273097 | 0.6948 | 0.8400 | 0.8947 | 0.9451 |
10 | Someone | 276087 | 0.6926 | 0.8520 | 0.8964 | 0.9438 |
11 | TBA | 270039 | 0.6858 | 0.8740 | 0.9060 | 0.9476 |
12 | BLCU-ICALL | 276231 | 0.6822 | 0.8578 | 0.8909 | 0.9418 |
13 | bea-jh | 276260 | 0.6802 | 0.8708 | 0.9069 | 0.9457 |
14 | JiNan_Smart Education | 274191 | 0.6790 | 0.8688 | 0.9052 | 0.9470 |
15 | jeez | 273727 | 0.6735 | 0.8623 | 0.8957 | 0.9418 |
16 | MT-NLP | 275793 | 0.6677 | 0.8636 | 0.8885 | 0.9354 |
17 | K-NLPers | 275401 | 0.6669 | 0.8113 | 0.8671 | 0.9270 |
18 | Thapar Titan/s | 276130 | 0.6647 | 0.8520 | 0.8840 | 0.9328 |
19 | Squirrel Ai Learning | 275298 | 0.6646 | 0.8539 | 0.8748 | 0.9315 |
20 | SmolLab_SEU | 268996 | 0.6617 | 0.8397 | 0.8782 | 0.9315 |
21 | bnl | 272027 | 0.6578 | 0.8494 | 0.8806 | 0.9302 |
22 | LexiLogic | 276232 | 0.6549 | 0.8487 | 0.8806 | 0.9302 |
23 | Retuyt-InCo | 275694 | 0.6535 | 0.8449 | 0.8395 | 0.9192 |
24 | 777 | 273165 | 0.6534 | 0.8526 | 0.8731 | 0.9283 |
25 | CU | 274027 | 0.6514 | 0.8701 | 0.8957 | 0.9425 |
26 | NLP Group 7 | 273738 | 0.6499 | 0.8462 | 0.8605 | 0.9276 |
27 | NLIP | 276186 | 0.6438 | 0.8546 | 0.8723 | 0.9257 |
28 | ALA | 276228 | 0.6361 | 0.8423 | 0.8493 | 0.9140 |
29 | mucai | 272784 | 0.6285 | 0.8067 | 0.8354 | 0.8985 |
30 | AGS | 272032 | 0.6251 | 0.8390 | 0.8640 | 0.9211 |
31 | Tutorify | 270221 | 0.6247 | 0.8261 | 0.8502 | 0.9173 |
32 | Promptly Educated | 269660 | 0.6196 | 0.7104 | 0.8479 | 0.9224 |
33 | wyn | 273790 | 0.6184 | 0.8384 | 0.8434 | 0.9095 |
34 | Emergent Wisdom | 274950 | 0.6100 | 0.8546 | 0.8799 | 0.9321 |
35 | Georgia Tech EDU | 271339 | 0.6049 | 0.8171 | 0.8386 | 0.9102 |
36 | SG | 276040 | 0.5896 | 0.7919 | 0.8258 | 0.8875 |
37 | NeuralNexus | 271085 | 0.5840 | 0.8268 | 0.8142 | 0.8972 |
38 | presidency | 273920 | 0.5807 | 0.7570 | 0.8070 | 0.8804 |
39 | NLP_UNH | 273124 | 0.5708 | 0.8358 | 0.8358 | 0.9089 |
40 | letstea | 273885 | 0.5376 | 0.6593 | 0.8109 | 0.8681 |
41 | Patriots | 275702 | 0.5345 | 0.8028 | 0.7923 | 0.8921 |
42 | AUST_NLP | 275376 | 0.4819 | 0.7085 | 0.6929 | 0.7576 |
43 | WhyIamHere | 273409 | 0.4562 | 0.7931 | 0.7126 | 0.8824 |
44 | RAGthoven | 267114 | 0.2949 | 0.4350 | 0.4349 | 0.5365 |
For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.
Track 2: Mistake Location
Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). Secondary metrics for the winning submissions – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.
Rank | Team | Submission | Ex. F1 | Ex. Acc | Len. F1 | Len. Acc |
---|---|---|---|---|---|---|
1 | BLCU-ICALL | 271997 | 0.5983 | 0.7679 | 0.8386 | 0.8630 |
2 | BJTU | 272600 | 0.5940 | 0.7330 | 0.7848 | 0.8261 |
3 | K-NLPers | 275803 | 0.5880 | 0.7641 | 0.8404 | 0.8610 |
4 | MSA | 275733 | 0.5743 | 0.6975 | 0.7848 | 0.8209 |
5 | SG | 276042 | 0.5692 | 0.7602 | 0.8118 | 0.8416 |
6 | bea-jh | 272508 | 0.5658 | 0.6723 | 0.7792 | 0.8197 |
7 | bd | 275513 | 0.5543 | 0.7143 | 0.7699 | 0.8054 |
8 | TBA | 270051 | 0.5490 | 0.7091 | 0.7702 | 0.8035 |
9 | Wonderland_EDU@HKU | 267804 | 0.5450 | 0.7104 | 0.7649 | 0.8003 |
10 | Averroes | 274370 | 0.5366 | 0.6348 | 0.7587 | 0.7822 |
11 | Whyamher | 273410 | 0.5325 | 0.6910 | 0.7370 | 0.7802 |
12 | NLIP | 270231 | 0.5319 | 0.6878 | 0.7495 | 0.7951 |
12 | Archaeology | 276257 | 0.5319 | 0.6568 | 0.7558 | 0.8009 |
13 | JiNan_Smart Education | 275308 | 0.5274 | 0.6968 | 0.7502 | 0.7809 |
14 | Squirrel Ai Learning | 275300 | 0.5272 | 0.6904 | 0.7306 | 0.7692 |
15 | Thapar Titans | 276208 | 0.5215 | 0.6943 | 0.7374 | 0.7796 |
16 | jeez | 273748 | 0.5187 | 0.6833 | 0.7416 | 0.7854 |
17 | CU | 273951 | 0.5148 | 0.6807 | 0.7358 | 0.7789 |
18 | 777 | 275190 | 0.5114 | 0.6710 | 0.7195 | 0.7486 |
19 | Someone | 276094 | 0.5009 | 0.7208 | 0.7590 | 0.8074 |
20 | Retuyt-InCo | 272995 | 0.4959 | 0.5863 | 0.7200 | 0.7608 |
21 | NLP Group 7 | 273889 | 0.4936 | 0.6348 | 0.6944 | 0.7524 |
22 | SmolLab_SEU | 273328 | 0.4935 | 0.6057 | 0.7051 | 0.7401 |
23 | lexilogic | 273346 | 0.4844 | 0.6548 | 0.7138 | 0.7447 |
24 | mucai | 276243 | 0.4828 | 0.5495 | 0.7086 | 0.7343 |
25 | Emergent Wisdom | 274966 | 0.4773 | 0.7188 | 0.7436 | 0.7893 |
26 | 2 | 269373 | 0.4749 | 0.7279 | 0.7397 | 0.8003 |
27 | Promptly Educated | 269746 | 0.4717 | 0.6432 | 0.6900 | 0.7337 |
28 | Tutorify | 271877 | 0.4666 | 0.6626 | 0.7116 | 0.7447 |
29 | NLP_UNH | 273632 | 0.4515 | 0.6994 | 0.6962 | 0.7725 |
30 | Patriots | 275703 | 0.4450 | 0.6328 | 0.6548 | 0.7007 |
31 | AUST_NLP | 275398 | 0.3044 | 0.4163 | 0.4759 | 0.4848 |
For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.
Track 3: Providing Guidance
Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). Secondary metrics for the winning submissions – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.
Rank | Team | Submission | Ex. F1 | Ex. Acc | Len. F1 | Len. Acc |
---|---|---|---|---|---|---|
1 | MSA | 273707 | 0.5834 | 0.6613 | 0.7798 | 0.8190 |
2 | SG | 276043 | 0.5785 | 0.7052 | 0.7860 | 0.8216 |
3 | BLCU-ICALL | 276153 | 0.5741 | 0.6716 | 0.7487 | 0.8061 |
4 | BJTU | 268532 | 0.5725 | 0.6490 | 0.7445 | 0.8100 |
5 | K-NLPers | 275779 | 0.5606 | 0.6270 | 0.7446 | 0.8003 |
6 | bea-jh | 273896 | 0.5451 | 0.6387 | 0.7253 | 0.7977 |
7 | Wonderland_EDU@HKU | 268531 | 0.5416 | 0.6464 | 0.7456 | 0.7886 |
8 | IALab UC | 276134 | 0.5369 | 0.6244 | 0.7379 | 0.7822 |
9 | JiNan_Smart Education | 275815 | 0.5275 | 0.6432 | 0.7336 | 0.7893 |
10 | Henry | 274852 | 0.5265 | 0.6238 | 0.7196 | 0.7744 |
11 | TBA | 270042 | 0.5212 | 0.6219 | 0.7299 | 0.7906 |
12 | MT-NLP | 275791 | 0.5211 | 0.6141 | 0.7142 | 0.7699 |
13 | Archaeology | 276149 | 0.5208 | 0.5734 | 0.7171 | 0.7770 |
14 | Averroes | 274718 | 0.5134 | 0.6309 | 0.7095 | 0.7751 |
15 | Squirrel Ai Learning | 275301 | 0.5087 | 0.6005 | 0.7059 | 0.7763 |
16 | jeez | 273743 | 0.5071 | 0.5831 | 0.7234 | 0.7763 |
16 | bd | 276255 | 0.5071 | 0.5831 | 0.7234 | 0.7763 |
17 | Retuyt-InCo | 275740 | 0.5049 | 0.5947 | 0.7057 | 0.7751 |
18 | woaiyuanshen | 268828 | 0.4974 | 0.5798 | 0.7034 | 0.7841 |
19 | SmolLab_SEU | 270872 | 0.4933 | 0.5695 | 0.6990 | 0.7608 |
20 | CU | 275009 | 0.4926 | 0.5850 | 0.7031 | 0.7692 |
21 | Emergent Wisdom | 274850 | 0.4903 | 0.6102 | 0.6919 | 0.7725 |
22 | NLIP | 273908 | 0.4888 | 0.6025 | 0.6927 | 0.7647 |
23 | batikbabu | 275555 | 0.4873 | 0.6147 | 0.7001 | 0.7615 |
24 | Whyiamhere | 273412 | 0.4856 | 0.6231 | 0.6880 | 0.7738 |
25 | isistanNiem | 271507 | 0.4805 | 0.5844 | 0.6715 | 0.7589 |
26 | Thapar Titans | 276213 | 0.4777 | 0.5624 | 0.6846 | 0.7479 |
27 | DLSU | 275255 | 0.4776 | 0.5669 | 0.6755 | 0.7382 |
28 | Tutorify | 271153 | 0.4731 | 0.5753 | 0.6709 | 0.7511 |
29 | 777 | 273323 | 0.4711 | 0.6432 | 0.7075 | 0.7725 |
30 | Promptly Educated | 269678 | 0.4674 | 0.6102 | 0.6785 | 0.7647 |
31 | lexiLogic | 275350 | 0.4656 | 0.5869 | 0.6803 | 0.7473 |
32 | GGEZ Lab | 275438 | 0.4596 | 0.5714 | 0.6652 | 0.7492 |
33 | Patriots | 275704 | 0.4508 | 0.5663 | 0.6422 | 0.7311 |
34 | NLP_UNH | 273633 | 0.4301 | 0.6380 | 0.6895 | 0.7692 |
35 | AUST_NLP | 275410 | 0.4045 | 0.5973 | 0.6094 | 0.7259 |
For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.
Track 4: Actionability
Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). Secondary metrics for the winning submissions – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.
Rank | Team | Submission | Ex. F1 | Ex. Acc | Len. F1 | Len. Acc |
---|---|---|---|---|---|---|
1 | bea-jh | 270711 | 0.7085 | 0.7298 | 0.8527 | 0.8837 |
2 | BJTU | 267463 | 0.6992 | 0.7363 | 0.8633 | 0.8940 |
3 | MSA | 275695 | 0.6984 | 0.7537 | 0.8659 | 0.8908 |
4 | lexiLogic | 271382 | 0.6930 | 0.7162 | 0.8393 | 0.8675 |
5 | Phaedrus | 276245 | 0.6907 | 0.7298 | 0.8346 | 0.8656 |
6 | Wonderland_EDU@HKU | 268430 | 0.6843 | 0.7285 | 0.8613 | 0.8888 |
7 | Archaeology | 275635 | 0.6776 | 0.7214 | 0.8302 | 0.8565 |
8 | BLCU-ICALL | 276254 | 0.6735 | 0.7363 | 0.8596 | 0.8856 |
9 | TBA | 270061 | 0.6671 | 0.7324 | 0.8499 | 0.8752 |
10 | 4 | 268908 | 0.6668 | 0.7033 | 0.8160 | 0.8520 |
10 | JiNan_Smart Education | 273231 | 0.6668 | 0.7033 | 0.8160 | 0.8520 |
11 | bnl | 271353 | 0.6655 | 0.6813 | 0.8172 | 0.8597 |
12 | woaiyuanshen | 270797 | 0.6651 | 0.7124 | 0.8191 | 0.8533 |
13 | love-peace | 270809 | 0.6562 | 0.6839 | 0.8051 | 0.8352 |
14 | bd | 276244 | 0.6554 | 0.7182 | 0.8461 | 0.8707 |
15 | Thapar Titans | 276218 | 0.6324 | 0.6774 | 0.7936 | 0.8248 |
16 | SmolLab_SEU | 273372 | 0.6284 | 0.6955 | 0.8223 | 0.8565 |
17 | Retuyt-InCo | 273776 | 0.6129 | 0.7033 | 0.8272 | 0.8559 |
18 | NLIP | 273476 | 0.6055 | 0.6897 | 0.8205 | 0.8468 |
19 | Squirrel Ai Learning | 275302 | 0.5954 | 0.6516 | 0.7639 | 0.8022 |
20 | Tutorify | 273722 | 0.5681 | 0.6425 | 0.7749 | 0.8190 |
21 | K-NLPers | 275440 | 0.5664 | 0.5773 | 0.7346 | 0.8061 |
22 | Emergent Wisdom | 274934 | 0.5661 | 0.6645 | 0.7782 | 0.8054 |
23 | SG | 276044 | 0.5465 | 0.6341 | 0.7545 | 0.7725 |
24 | SAI | 275023 | 0.5398 | 0.6277 | 0.7564 | 0.8022 |
25 | DLSU | 275259 | 0.5294 | 0.6089 | 0.7351 | 0.7738 |
26 | Patriots | 275705 | 0.4630 | 0.5727 | 0.6943 | 0.7537 |
27 | whyiamhere | 273413 | 0.4306 | 0.6044 | 0.7143 | 0.7938 |
28 | AUST_NLP | 275425 | 0.4196 | 0.5262 | 0.6077 | 0.6833 |
29 | NLP_UNH | 273635 | 0.3798 | 0.5546 | 0.6530 | 0.7524 |
For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.
Track 5: Tutor Identification
Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). The secondary metric for the winning submissions – exact accuracy (Ex. Acc) – is also included.
Rank | Team | Submission | Ex. F1 | Ex. Acc |
---|---|---|---|---|
1 | Phaedru | 275108 | 0.9698 | 0.9664 |
2 | SYSUpporter | 274782 | 0.9692 | 0.9657 |
3 | Two Outliers | 275385 | 0.9172 | 0.9412 |
4 | JInan_Smart Education | 266729 | 0.8965 | 0.8940 |
5 | BLCU-ICALL | 276219 | 0.8930 | 0.8908 |
6 | Archaeology | 275710 | 0.8866 | 0.8882 |
7 | Wonderland_EDU@HKU | 268436 | 0.8795 | 0.8778 |
8 | MSA | 275876 | 0.8697 | 0.8649 |
9 | SmolLab_SEU | 273478 | 0.8621 | 0.8604 |
10 | mucai | 274180 | 0.8602 | 0.8675 |
11 | Squirrel Ai Learning | 274001 | 0.8432 | 0.8390 |
12 | Retuyt-InCo | 275706 | 0.8385 | 0.8475 |
13 | whyiamhere | 273500 | 0.8356 | 0.8345 |
14 | bnl | 272827 | 0.8247 | 0.8216 |
15 | Tutorify | 274405 | 0.8212 | 0.8100 |
16 | LexiLogic | 276033 | 0.8207 | 0.8145 |
17 | Georgia Tech EDU | 271359 | 0.6468 | 0.6296 |
18 | DLSU | 275260 | 0.6420 | 0.6231 |
19 | letstea | 273939 | 0.1749 | 0.1635 |
20 | zet-epsilon | 275908 | 0.1140 | 0.1965 |
For leaderboards according to the secondary metric, please refer to the table on exact accuracy.
Important Dates
All deadlines are 11:59pm UTC-12 (anywhere on Earth).
- March 12, 2025: Development data release
- April 10, 2025 (updated from April 9): Test data release
- April 24, 2025 (updated from April 23): System submissions from teams due
- April 30, 2025: Evaluation of the results by the organizers
- May 21, 2025: System papers due
- May 28, 2025: Paper reviews returned
- June 9, 2025: Final camera-ready submissions
- July 31 and August 1, 2025: BEA 2025 workshop at ACL
FAQs
Questions about this shared task should be sent to bea.sharedtask.2025@gmail.com. Below, we share the answers to the most frequently asked questions.
1. How do I submit my system predictions and what is the expected format of the submission file?
Please check the “Evaluation and Submission File” tab in the “Get Started” panel on the relevant CodaBench track page (e.g., https://www.codabench.org/competitions/7195/ for the Mistake Identification track). All the information about the format and how to upload your submission is provided there. Feel free to check the other tabs in the “Get Started” panel for more information about the task.
2. Our team is composed of multiple team members. Who should submit the system predictions to the CodaBench page? How do we make sure all team members are acknowledged in the submission?
Our advice is that a single team member creates an account and makes submissions for the whole team on CodaBench. Unfortunately, we have not found an easy way on this platform to add other users to a team, so a submission from a single team member is fine. When the test phase is completed, we will ask each team to confirm their composition for the official final leaderboard on the shared task website (https://sig-edu.org/sharedtask/2025), we will summarize the results from teams in the official shared task paper, and we will also invite teams to submit their own system reports – these papers will be included in the official workshop proceedings. Therefore, team members should not worry that any of them will be left out or unacknowledged.
Another thing to keep in mind: multiple team members can register for the task, and the platform allows any registered team member to submit system predictions; however, regardless of who submits, we cap the number of submissions from each team in each track at 5. When a user makes a submission, the platform asks them which team they are submitting for, so submissions for the same team can come from different members. In the end, we will include only the first 5 submissions from each team (the earliest made by each team according to the time stamp) in our official leaderboard, to ensure a level playing field for all teams regardless of their composition.
3. Can you clarify how it was decided whether the student utterance contains a mistake? I am seeing examples like the following, where the last student utterance does not contain a mistake, yet most of the tutors’ responses are annotated as “yes” for Mistake Identification.
"conversation_history": "Tutor: Hi, could you please provide a step-by-step solution for the question below? The question is: Tyson decided to make muffaletta sandwiches for the big game. Each sandwich required 1 pound each of meat and cheese and would serve 4 people. There would be 20 people in total watching the game. The meat cost $7.00 per pound and the cheese cost $3.00 per pound. How much money would he spend on the meat and cheese to make enough sandwiches to serve 20 people? \n Student: To serve 20 people, Tyson needs to make 20/4 = 5 sandwiches.\nEach sandwich requires 1+1 = 2 pounds of meat and cheese.\nFor 5 sandwiches, he needs a total of 2 x 5 = 10 pounds of meat and cheese.\nThe cost of 10 pounds of meat is 10 x $7.00 = $70.\nThe cost of 10 pounds of cheese is 10 x $3.00 = $30.\nThe total cost of meat and cheese is $70 + $30 = $100.\n 100 \n Tutor: How many pounds of meat are needed for each sandwich? \n Student: Each sandwich requires 1 pound of meat and 1 pound of cheese. \n Tutor: What is the cost of 1 pound of meat? \n Student: The cost of 1 pound of meat is $7.00."
In the dialogue referenced here, the actual error in the student’s solution is that the student counted the amount of meat and cheese needed for the sandwiches twice. So, despite the fact that the very last response from the student is just a repetition of what the task says (i.e., that one pound of meat costs $7.00), the original mistake is still not corrected. What you see in the responses from the tutors included in the dataset are the tutors’ attempts at remediating the original mistake of overestimating the amount needed.
For example:
"response": "You're close, but I notice that you calculated the cost of 10 pounds of meat, when actually 5 pounds of meat are needed for 5 sandwiches, since each sandwich requires 1 pound of meat; can you recalculate the cost of the meat?",
"annotation": {
"Mistake_Identification": "Yes",
"Mistake_Location": "Yes",
"Providing_Guidance": "Yes",
"Actionability": "Yes"
}
You can see here that the tutor clearly points out to the student that there is a mistake in their solution (the tutor says “You are close, but …”); the tutor also points out where the mistake is (“you calculated the cost of 10 pounds of meat, when actually 5 pounds of meat are needed for 5 sandwiches”); this feedback both provides guidance (specifically, an explanation of where the error is and how to fix it is included) and is actionable (it asks “can you recalculate the cost of the meat?”, which would lead to a concrete action from the student). Thus, all dimensions receive “Yes” labels. Other tutors’ responses are annotated using similar principles: for mistake identification, in particular, annotation tells us whether the tutor, judging by their response, recognized that there is an error in the student’s solution.
When classifying the responses, you need to take into account the whole dialogue and whether tutors are successfully addressing a genuine error / confusion from the student. The goal is to evaluate whether the responses you are seeing from the AI tutors are something that you would expect from a good tutor, whose goal is to remediate the original mistake in the student’s solution.
For more information and background on the annotation approach, you can refer to the paper behind the dataset and its annotation – Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors.
4. I have spotted a few dialogues that end with a tutor’s turn rather than a student’s. Why are they included?
There are, indeed, a small number of dialogues where the conversation history ends with a turn from the tutor. This data originates from the MathDial dataset, and the shared task organizers did not modify the original data. We note that in such cases the tutor seems to summarize the student’s solution or explain their chain of reasoning before providing further feedback. For example:
"conversation_history": "Tutor: Hi, could you please provide a step-by-step solution for the question below? The question is: The dog toys Samantha buys for her dog are \"buy one get one half off\" and all cost $12.00 each. She buys 4 toys. How much does she spend on dog toys? \n Student: Samantha buys 4 toys at $12.00 each.\nFor each pair of toys, she gets one at full price and one at half price.\nSo she pays for three full price toys and one half price toy.\nThree full price toys cost 3 x $12.00 = $36.00.\nOne half price toy costs $12.00 / 2 = $6.00.\nTherefore, Samantha spends $36.00 + $6.00 = $42.00 on dog toys.\n 42.00 \n Tutor: I added the two amounts together to get a total of $36.00 + $6.00 = $42.00."
While compiling the dataset for the shared task, we decided to keep such dialogues unmodified, as the tutor responses that follow are still valid and aimed at addressing the previous student turns.
Organizers
- Ekaterina Kochmar (MBZUAI)
- Kaushal Kumar Maurya (MBZUAI)
- Kseniia Petukhova (MBZUAI)
- KV Aditya Srivatsa (MBZUAI)
- Justin Vasselli (Nara Institute of Science and Technology)
- Anaïs Tack (KU Leuven)
Contact: bea.sharedtask.2025@gmail.com
Shared Task Reference
- Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, and Justin Vasselli. 2025. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications.
Dataset Reference
- Kaushal Kumar Maurya, Kv Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1234–1251.
References
- Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku.
- Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2024. Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors. arXiv preprint arXiv:2407.09136
- Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783
- Xiang Gao, Yizhe Zhang, Michel Galley, Chris Brockett, and Bill Dolan. 2020. Dialogue response ranking training with large-scale human feedback data. arXiv preprint arXiv:2009.06978
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634
- Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. arXiv preprint arXiv:2305.14536.
- Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of NAACL 2025 (main).
- Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the second conference on machine translation, pages 612–618
- Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771
- Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530
- Anaïs Tack and Chris Piech. 2022. The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues. arXiv preprint arXiv:2205.07540
- Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, and Chris Piech. 2023. The BEA 2023 shared task on generating AI teacher responses in educational dialogues. arXiv preprint arXiv:2306.06941.
- Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. Bridging the novice-expert gap via models of decision-making: A case study on remediating math mistakes. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2174–2199
- Sebastian Wollny, Jan Schneider, Daniele Di Mitri, Joshua Weidlich, Marc Rittberger, and Hendrik Drachsler. 2021. Are We There Yet? - A Systematic Literature Review on Chatbots in Education. Frontiers in Artificial Intelligence, 4:654924. doi:10.3389/frai.2021.654924