BEA 2025 Shared Task

Pedagogical Ability Assessment of AI-powered Tutors

Motivation

Conversational agents offer promising opportunities for education as they can fulfill various roles (e.g., intelligent tutors and service-oriented assistants) and pursue different objectives (e.g., improving student skills and increasing instructional efficiency) (Wollny et al., 2021), among which serving as an AI tutor is one of the most prevalent tasks (Tack et al., 2023). Recent advances in the development of Large Language Models (LLMs) provide our field with promising ways of building AI-based conversational tutors, which can generate human-sounding dialogues on the fly. The key question posed in previous research (Tack and Piech, 2022; Tack et al., 2023), however, remains: how can we test whether state-of-the-art generative models are good AI teachers, capable of replying to a student in an educational dialogue?

Evaluating dialogue systems in general presents a significant challenge. While human evaluation is still considered the most reliable method for assessing dialogue quality, its high cost and lack of reproducibility have led to the adaptation of both reference-based and reference-free automatic metrics, originally used in machine translation and summarization evaluation, for dialogue evaluation (Lin, 2004; Popovic, 2017; Post, 2018; Gao et al., 2020; Liu et al., 2023). When it comes to Intelligent Tutoring Systems (ITSs), which also function as dialogue systems with the specific role of acting as tutors, these general metrics are insufficient. In the educational context, we need to assess complex pedagogical aspects and abilities of such systems, ensuring, among other things, that they provide students with sufficient, helpful, and factually correct guidance and do not simply reveal answers when the student makes a mistake. Therefore, developing automatic metrics to evaluate these nuanced aspects is essential for creating effective and helpful tutoring systems.

Due to the lack of a standardized evaluation taxonomy, previous work has used different criteria for evaluation. For example, Tack and Piech (2022) and Tack et al. (2023) evaluated models’ responses in terms of whether they speak like a teacher, understand a student, and help a student, while in Macina et al. (2023), the responses of models playing the role of tutors were evaluated by human annotators for coherence, correctness, and equitable tutoring. At the same time, Wang et al. (2024) assess usefulness, care, and human-likeness, and Daheim et al. (2024) use targetedness, correctness, and actionability of a tutor response as quality evaluation criteria. Such a lack of standardization makes it difficult to compare different systems; defining evaluation criteria and developing automatic metrics for them is therefore a crucial step for advancing the field, and one that we aim to address in this shared task.

Task Goals & Description

Following the successful BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues (Tack et al., 2023), we revisit the question of quality assessment of tutor responses generated by AI models (specifically, LLMs) in the context of educational dialogues. We believe that (1) the topic is timely and important, and the shared task will attract the attention of the BEA community; (2) LLMs have advanced significantly in the past couple of years, making it important to revisit this topic after the competition run in 2023; and (3) there is a need to establish a pedagogically motivated benchmark for this task. In contrast to the BEA 2023 shared task, our focus is not on the generation of educational dialogues using state-of-the-art LLMs, but rather on the comprehensive evaluation of AI tutor responses using a set of pedagogically motivated metrics.

In this shared task, we will focus on educational dialogues between a student and a tutor in the mathematical domain grounded in student mistakes or confusion, which the AI tutor aims to remediate. Dialogues in the datasets provided in this shared task include:

  • The context consisting of several prior turns from both the tutor and the student and containing a mistake in the student’s solution. These are extracted from two popular datasets of educational dialogues in the mathematical domain – MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024);
  • The last utterance from the student; and
  • A set of possible responses to the student’s last utterance, produced by a range of LLM-based tutors and, where available, human tutors, and aimed at mistake remediation.

The LLM-based tutor responses are generated by the organizers of the shared task using a set of state-of-the-art LLMs of various sizes and capabilities, including: GPT-4 (Achiam et al., 2023), Gemini (Reid et al., 2024), Sonnet (Anthropic), Mistral (Jiang et al., 2023), Llama-3.1-8B and Llama-3.1-405B (Dubey et al., 2024), and Phi-3 (Abdin et al., 2024).

The identities of the tutors will be included in the development set provided to the task participants, but not in the test set. In addition to the responses themselves, the development set contains annotations of their quality along the following pedagogically motivated dimensions (Maurya et al., 2025):

  • Mistake identification: Since all dialogues in the dataset contain a mistake made by the student, a good-quality tutor response should identify the relevant mistake. This corresponds to student understanding in the schema of Tack and Piech (2022) and correctness in the schemata of Macina et al. (2023) and Daheim et al. (2024).
  • Mistake location: A good tutor response should not only notify the student of the committed error, but also point to its location in the answer and outline what the error is, to help the student remediate it in their next response. This corresponds to targetedness in Daheim et al. (2024).
  • Providing guidance: A good tutor response should provide the student with relevant and helpful guidance, such as a hint, an explanation, a supporting question, and the like. This aspect corresponds to helping a student in Tack and Piech (2022) and usefulness in Wang et al. (2024).
  • Actionability: Finally, once the guidance is provided to a student, it should be clear from a good tutor response what the student should do next; in other words, the tutor response should not be vague, unclear, or a conversation stopper. This aspect in our schema corresponds to actionability in Daheim et al. (2024).

For more details on the annotation and data collection, please refer to Maurya et al. (2025).

Data

Both development and test sets are distributed as JSON files (Maurya et al., 2025).

Development set

The development set consists of 300 dialogues from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets. Each instance includes the context of several prior turns from both the tutor and the student, containing a mistake or expressing confusion on the part of the student; the last utterance from the student; and a set of responses to the student from 7 LLM-based tutors and from human tutors (a single tutor in MathDial, and an expert and a novice tutor in Bridge), annotated for their pedagogical quality. In total, the development set consists of over 2,480 tutor responses, presented and annotated as follows:

{
    "conversation_id": "29640377",
    "conversation history": "Tutor: Here is our first question.\nTutor: What is the value of 10/5?\nStudent: 50",
    "tutor_responses": {
      "Llama318B": {
        "response": "I think there might be a small mistake there, let's take a look at the problem together – when you divide 10 by 5, we actually get 2, not 50.",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Mistake_Location": "Yes",
          "Providing_Guidance": "To some extent",
          "Actionability": "No"
        }
      },
      "Sonnet": {
        "response": "I appreciate your effort, but let's think about this carefully: if we divide 10 into 5 equal groups, how many would be in each group?",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Mistake_Location": "Yes",
          "Providing_Guidance": "Yes",
          "Actionability": "Yes"
        }
      },
      "Gemini": {
        "response": "Remember, dividing by 5 is like sharing 10 things equally among 5 groups. How many would be in each group? Let's try that again.",
        "annotation": {
          "Mistake_Identification": "Yes",
          "Mistake_Location": "Yes",
          "Providing_Guidance": "Yes",
          "Actionability": "Yes"
        }
      },
      "..."
    }
}

The following fields are included in the JSON files (a minimal Python sketch for loading and iterating the data is given after the field descriptions):

  • conversation_id: a unique identifier for the instance
  • conversation history: the context of several prior turns from the tutor and the student extracted from the original datasets
  • tutor_responses: the set of human tutor responses extracted from the original datasets, as well as responses generated by 7 LLMs-as-tutors, each with a unique identifier
  • response: the response from a particular tutor
  • annotation: the set of annotations, including:
    • Mistake_Identification marking whether the tutor recognized a mistake in a student’s response
    • Mistake_Location marking whether the tutor’s response accurately points to a genuine mistake and its location
    • Providing_Guidance marking whether the tutor offers correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on
    • Actionability marking if it is clear from the tutor’s feedback what the student should do next
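
To make the structure above concrete, below is a minimal Python sketch for loading the development set and iterating over the annotated tutor responses. The file name dev_set.json is a placeholder for the actual file in the shared task repository, and the top level is assumed to be a JSON array of dialogue objects like the one shown above.

import json
from collections import Counter

DEV_PATH = "dev_set.json"  # placeholder name; use the actual file from the shared task repository

with open(DEV_PATH, encoding="utf-8") as f:
    dialogues = json.load(f)  # assumed: a JSON array of dialogue objects as shown above

label_counts = Counter()
for dialogue in dialogues:
    history = dialogue["conversation history"]  # prior tutor and student turns
    for tutor_id, entry in dialogue["tutor_responses"].items():
        response = entry["response"]
        annotation = entry["annotation"]
        label_counts[annotation["Mistake_Identification"]] += 1

print(label_counts)  # label distribution ("Yes" / "To some extent" / "No") for one dimension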

Test set

The test set consists of 200 dialogues from MathDial and Bridge, also including the context of several prior turns from both the tutor and the student, containing a mistake or expressing confusion on the part of the student; the last utterance from the student; and a set of unannotated responses to the student’s last utterance from the same set of tutors as in the development set. The key difference is that the test set includes neither tutor identities nor annotations of the pedagogical quality of their responses.

As a result, the test set is organized as follows:

{
    "conversation_id": "616653340",
    "conversation history": "Tutor: 39 is a prime number or composite number?\nStudent: prime",
    "tutor_responses": {
      "Tutor_1": {
        "response": "That's a good try, but remember, prime numbers have only two factors: 1 and itself.",
      },
      "Tutor_2": {
        "response": "Hmm, what makes you think prime?",
      },
      "Tutor_3": {
        "response": "That's correct, 39 is actually a composite number because it has factors other than 1 and itself, such as 3 and 13.",
      },
      "Tutor_4": {
        "response": "That was an incorrect answer.",
      },
      "Tutor_5": {
        "response": "Let's check that again. Can you find two numbers that multiply to give 39?",
      },
      "..."
    }
}
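
The official submission format for each track is described on the “Evaluation and Submission File” tab of the corresponding CodaBench page (see the FAQs below); the snippet here is therefore only an illustrative sketch of collecting the (conversation, anonymized tutor) pairs that require a prediction, with test_set.json as a placeholder file name.

import json

TEST_PATH = "test_set.json"  # placeholder name; use the actual file from the shared task repository

with open(TEST_PATH, encoding="utf-8") as f:
    dialogues = json.load(f)  # assumed: a JSON array of dialogue objects as shown above

# Each (conversation_id, anonymized tutor) pair requires one predicted label
# per track that a team participates in.
targets = [
    (dialogue["conversation_id"], tutor_id, entry["response"])
    for dialogue in dialogues
    for tutor_id, entry in dialogue["tutor_responses"].items()
]
print(len(targets), "tutor responses to classify")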

Shared Task

This shared task will include five tracks, and participating teams are welcome to take part in any number of them. For Tracks 1–4, a minimal sketch of deriving per-track labels from the development set is given after the track descriptions below.

Tracks

  • Track 1 - Mistake Identification: Teams are invited to develop systems to detect whether tutors’ responses recognize mistakes in students’ responses. The following categories are included:
    • Yes: the mistake is clearly identified/recognized in the tutor’s response
    • To some extent: the tutor’s response suggests that there may be a mistake, but it sounds as if the tutor is not certain
    • No: the tutor does not recognize the mistake (e.g., they proceed to simply provide the answer to the asked question)
  • Track 2 - Mistake Location: Teams are invited to develop systems to assess whether tutors’ responses accurately point to a genuine mistake and its location in the students’ responses. The following categories are included:
    • Yes: the tutor clearly points to the exact location of a genuine mistake in the student’s solution
    • To some extent: the response demonstrates some awareness of the exact mistake, but is vague, unclear, or easy to misunderstand
    • No: the response does not provide any details related to the mistake
  • Track 3 - Providing guidance: Teams are invited to develop systems to evaluate whether tutors’ responses offer correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on. The following categories are included:
    • Yes: the tutor provides guidance that is correct and relevant to the student’s mistake
    • To some extent: guidance is provided but it is fully or partially incorrect, incomplete, or somewhat misleading
    • No: the tutor’s response does not include any guidance, or the guidance provided is irrelevant to the question or factually incorrect
  • Track 4 - Actionability: Teams are invited to develop systems to assess whether tutors’ feedback is actionable, i.e., it makes it clear what the student should do next. The following categories are included:
    • Yes: the response provides clear suggestions on what the student should do next
    • To some extent: the response indicates that something needs to be done, but it is not clear what exactly that is
    • No: the response does not suggest any action on the part of the student (e.g., it simply reveals the final answer)
  • Track 5 - Guess the tutor identity: Teams are invited to develop systems to identify which tutors the anonymized responses in the test set originated from. This track will address 9 classes: expert and novice tutors, and 7 LLMs included in the tutor set.
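
As mentioned above, the sketch below shows one way to derive (dialogue history, tutor response, label) examples for Tracks 1–4 from the development set. The annotation keys match those in the development set example; the file, function, and variable names are illustrative.

import json

# Annotation keys as they appear in the development set; the short track names are illustrative.
TRACK_FIELDS = {
    "mistake_identification": "Mistake_Identification",
    "mistake_location": "Mistake_Location",
    "providing_guidance": "Providing_Guidance",
    "actionability": "Actionability",
}

def build_examples(dev_path, annotation_key):
    """Collect (history, response, label) examples for one of Tracks 1-4."""
    with open(dev_path, encoding="utf-8") as f:
        dialogues = json.load(f)
    examples = []
    for dialogue in dialogues:
        for tutor_id, entry in dialogue["tutor_responses"].items():
            examples.append({
                "history": dialogue["conversation history"],
                "response": entry["response"],
                "label": entry["annotation"][annotation_key],  # "Yes", "To some extent", or "No"
            })
    return examples

# Example usage (placeholder file name):
# track1_examples = build_examples("dev_set.json", TRACK_FIELDS["mistake_identification"])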

Evaluation

Tracks 1-4 will use accuracy and macro F1 as the main metrics. These will be used in two settings:

  • Exact evaluation: predictions submitted by the teams will be evaluated for the exact prediction of the three classes (“Yes”, “To some extent”, and “No”)
  • Lenient evaluation: since, for these dimensions, tutor responses annotated as “Yes” and “To some extent” share a certain amount of qualitative value, we will treat “Yes” and “To some extent” as a single class and evaluate predictions in a 2-class setting (“Yes + To some extent” vs. “No”); a minimal sketch of both settings is given below
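
This is only an illustration of the setup described above, not the official scoring script: the sketch assumes gold and predicted labels are given as parallel lists of the strings “Yes”, “To some extent”, and “No”, and uses scikit-learn for accuracy and macro F1.

from sklearn.metrics import accuracy_score, f1_score

def lenient(labels):
    # Collapse "Yes" and "To some extent" into a single class.
    return ["Yes" if label in ("Yes", "To some extent") else "No" for label in labels]

def evaluate(gold, pred):
    return {
        "exact_acc": accuracy_score(gold, pred),
        "exact_macro_f1": f1_score(gold, pred, average="macro"),
        "lenient_acc": accuracy_score(lenient(gold), lenient(pred)),
        "lenient_macro_f1": f1_score(lenient(gold), lenient(pred), average="macro"),
    }

# Toy example:
print(evaluate(
    ["Yes", "To some extent", "No", "Yes"],
    ["Yes", "Yes", "No", "To some extent"],
))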

Track 5 will use accuracy of the tutor identity prediction as its main metric.

Participation

  • To register for the shared task and subscribe to updates, please fill in the form: https://forms.gle/fKJcdvL2kCrPcu8X6
  • All updates about the shared task will be shared with the email addresses indicated in the registration form
  • Development and test data can be found in the Shared Task Repository
  • All teams officially participating in the test phase on CodaBench will be invited to publish their system papers in the BEA 2025 proceedings and present their work at the BEA 2025 workshop

Submission

Submissions will be run via CodaBench, with the number of submissions from each team capped at 5 per track.

The five tracks can be accessed via their respective CodaBench competition pages (e.g., https://www.codabench.org/competitions/7195/ for Track 1: Mistake Identification).

Results

Multiple teams registered and participated in various tracks of the shared task. Below are the official final leaderboards; comprehensive results for all submissions per team can be found on the respective shared task track pages on CodaBench.

Track 1: Mistake Identification

Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). The secondary metrics – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.

Rank Team Submission Ex. F1 Ex. Acc Len. F1 Len. Acc
1 BJTU 269726 0.7181 0.8623 0.8957 0.9457
2 TutorMind 269923 0.7163 0.8759 0.9108 0.9528
3 Averroes 273668 0.7155 0.8675 0.8997 0.9425
4 MSA 272382 0.7154 0.8759 0.9152 0.9535
5 BD 276025 0.7110 0.8772 0.8966 0.9412
6 Gooby-Snoob Guysz 265229 0.7105 0.8481 0.8901 0.9373
7 Wonderland_EDU@HKU 267613 0.6983 0.8675 0.9109 0.9496
8 Archaeology 275559 0.6976 0.8675 0.8959 0.9405
9 test 273097 0.6948 0.8400 0.8947 0.9451
10 Someone 276087 0.6926 0.8520 0.8964 0.9438
11 TBA 270039 0.6858 0.8740 0.9060 0.9476
12 BLCU-ICALL 276231 0.6822 0.8578 0.8909 0.9418
13 bea-jh 276260 0.6802 0.8708 0.9069 0.9457
14 JiNan_Smart Education 274191 0.6790 0.8688 0.9052 0.9470
15 jeez 273727 0.6735 0.8623 0.8957 0.9418
16 MT-NLP 275793 0.6677 0.8636 0.8885 0.9354
17 K-NLPers 275401 0.6669 0.8113 0.8671 0.9270
18 Thapar Titan/s 276130 0.6647 0.8520 0.8840 0.9328
19 Squirrel Ai Learning 275298 0.6646 0.8539 0.8748 0.9315
20 SmolLab_SEU 268996 0.6617 0.8397 0.8782 0.9315
21 bnl 272027 0.6578 0.8494 0.8806 0.9302
22 LexiLogic 276232 0.6549 0.8487 0.8806 0.9302
23 Retuyt-InCo 275694 0.6535 0.8449 0.8395 0.9192
24 777 273165 0.6534 0.8526 0.8731 0.9283
25 CU 274027 0.6514 0.8701 0.8957 0.9425
26 NLP Group 7 273738 0.6499 0.8462 0.8605 0.9276
27 NLIP 276186 0.6438 0.8546 0.8723 0.9257
28 ALA 276228 0.6361 0.8423 0.8493 0.9140
29 mucai 272784 0.6285 0.8067 0.8354 0.8985
30 AGS 272032 0.6251 0.8390 0.8640 0.9211
31 Tutorify 270221 0.6247 0.8261 0.8502 0.9173
32 Promptly Educated 269660 0.6196 0.7104 0.8479 0.9224
33 wyn 273790 0.6184 0.8384 0.8434 0.9095
34 Emergent Wisdom 274950 0.6100 0.8546 0.8799 0.9321
35 Georgia Tech EDU 271339 0.6049 0.8171 0.8386 0.9102
36 SG 276040 0.5896 0.7919 0.8258 0.8875
37 NeuralNexus 271085 0.5840 0.8268 0.8142 0.8972
38 presidency 273920 0.5807 0.7570 0.8070 0.8804
39 NLP_UNH 273124 0.5708 0.8358 0.8358 0.9089
40 letstea 273885 0.5376 0.6593 0.8109 0.8681
41 Patriots 275702 0.5345 0.8028 0.7923 0.8921
42 AUST_NLP 275376 0.4819 0.7085 0.6929 0.7576
43 WhyIamHere 273409 0.4562 0.7931 0.7126 0.8824
44 RAGthoven 267114 0.2949 0.4350 0.4349 0.5365

For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.

Track 2: Mistake Location

Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). The secondary metrics – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.

Rank Team Submission Ex. F1 Ex. Acc Len. F1 Len. Acc
1 BLCU-ICALL 271997 0.5983 0.7679 0.8386 0.8630
2 BJTU 272600 0.5940 0.7330 0.7848 0.8261
3 K-NLPers 275803 0.5880 0.7641 0.8404 0.8610
4 MSA 275733 0.5743 0.6975 0.7848 0.8209
5 SG 276042 0.5692 0.7602 0.8118 0.8416
6 bea-jh 272508 0.5658 0.6723 0.7792 0.8197
7 bd 275513 0.5543 0.7143 0.7699 0.8054
8 TBA 270051 0.5490 0.7091 0.7702 0.8035
9 Wonderland_EDU@HKU 267804 0.5450 0.7104 0.7649 0.8003
10 Averroes 274370 0.5366 0.6348 0.7587 0.7822
11 Whyamher 273410 0.5325 0.6910 0.7370 0.7802
12 NLIP 270231 0.5319 0.6878 0.7495 0.7951
  Archaeology 276257 0.5319 0.6568 0.7558 0.8009
13 JiNan_Smart Education 275308 0.5274 0.6968 0.7502 0.7809
14 Squirrel Ai Learning 275300 0.5272 0.6904 0.7306 0.7692
15 Thapar Titans 276208 0.5215 0.6943 0.7374 0.7796
16 jeez 273748 0.5187 0.6833 0.7416 0.7854
17 CU 273951 0.5148 0.6807 0.7358 0.7789
18 777 275190 0.5114 0.6710 0.7195 0.7486
19 Someone 276094 0.5009 0.7208 0.7590 0.8074
20 Retuyt-InCo 272995 0.4959 0.5863 0.7200 0.7608
21 NLP Group 7 273889 0.4936 0.6348 0.6944 0.7524
22 SmolLab_SEU 273328 0.4935 0.6057 0.7051 0.7401
23 lexilogic 273346 0.4844 0.6548 0.7138 0.7447
24 mucai 276243 0.4828 0.5495 0.7086 0.7343
25 Emergent Wisdom 274966 0.4773 0.7188 0.7436 0.7893
26 2 269373 0.4749 0.7279 0.7397 0.8003
27 Promptly Educated 269746 0.4717 0.6432 0.6900 0.7337
28 Tutorify 271877 0.4666 0.6626 0.7116 0.7447
29 NLP_UNH 273632 0.4515 0.6994 0.6962 0.7725
30 Patriots 275703 0.4450 0.6328 0.6548 0.7007
31 AUST_NLP 275398 0.3044 0.4163 0.4759 0.4848

For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.

Track 3: Providing Guidance

Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). The secondary metrics – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.

Rank Team Submission Ex. F1 Ex. Acc Len. F1 Len. Acc
1 MSA 273707 0.5834 0.6613 0.7798 0.8190
2 SG 276043 0.5785 0.7052 0.7860 0.8216
3 BLCU-ICALL 276153 0.5741 0.6716 0.7487 0.8061
4 BJTU 268532 0.5725 0.6490 0.7445 0.8100
5 K-NLPers 275779 0.5606 0.6270 0.7446 0.8003
6 bea-jh 273896 0.5451 0.6387 0.7253 0.7977
7 Wonderland_EDU@HKU 268531 0.5416 0.6464 0.7456 0.7886
8 IALab UC 276134 0.5369 0.6244 0.7379 0.7822
9 JiNan_Smart Education 275815 0.5275 0.6432 0.7336 0.7893
10 Henry 274852 0.5265 0.6238 0.7196 0.7744
11 TBA 270042 0.5212 0.6219 0.7299 0.7906
12 MT-NLP 275791 0.5211 0.6141 0.7142 0.7699
13 Archaeology 276149 0.5208 0.5734 0.7171 0.7770
14 Averroes 274718 0.5134 0.6309 0.7095 0.7751
15 Squirrel Ai Learning 275301 0.5087 0.6005 0.7059 0.7763
16 jeez 273743 0.5071 0.5831 0.7234 0.7763
  bd 276255 0.5071 0.5831 0.7234 0.7763
17 Retuyt-InCo 275740 0.5049 0.5947 0.7057 0.7751
18 woaiyuanshen 268828 0.4974 0.5798 0.7034 0.7841
19 SmolLab_SEU 270872 0.4933 0.5695 0.6990 0.7608
20 CU 275009 0.4926 0.5850 0.7031 0.7692
21 Emergent Wisdom 274850 0.4903 0.6102 0.6919 0.7725
22 NLIP 273908 0.4888 0.6025 0.6927 0.7647
23 batikbabu 275555 0.4873 0.6147 0.7001 0.7615
24 Whyiamhere 273412 0.4856 0.6231 0.6880 0.7738
25 isistanNiem 271507 0.4805 0.5844 0.6715 0.7589
26 Thapar Titans 276213 0.4777 0.5624 0.6846 0.7479
27 DLSU 275255 0.4776 0.5669 0.6755 0.7382
28 Tutorify 271153 0.4731 0.5753 0.6709 0.7511
29 777 273323 0.4711 0.6432 0.7075 0.7725
30 Promptly Educated 269678 0.4674 0.6102 0.6785 0.7647
31 lexiLogic 275350 0.4656 0.5869 0.6803 0.7473
32 GGEZ Lab 275438 0.4596 0.5714 0.6652 0.7492
33 Patriots 275704 0.4508 0.5663 0.6422 0.7311
34 NLP_UNH 273633 0.4301 0.6380 0.6895 0.7692
35 AUST_NLP 275410 0.4045 0.5973 0.6094 0.7259

For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.

Track 4: Actionability

Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). The secondary metrics – exact accuracy (Ex. Acc), lenient macro F1 score (Len. F1), and lenient accuracy (Len. Acc) – are also included.

Rank Team Submission Ex. F1 Ex. Acc Len. F1 Len. Acc
1 bea-jh 270711 0.7085 0.7298 0.8527 0.8837
2 BJTU 267463 0.6992 0.7363 0.8633 0.8940
3 MSA 275695 0.6984 0.7537 0.8659 0.8908
4 lexiLogic 271382 0.6930 0.7162 0.8393 0.8675
5 Phaedrus 276245 0.6907 0.7298 0.8346 0.8656
6 Wonderland_EDU@HKU 268430 0.6843 0.7285 0.8613 0.8888
7 Archaeology 275635 0.6776 0.7214 0.8302 0.8565
8 BLCU-ICALL 276254 0.6735 0.7363 0.8596 0.8856
9 TBA 270061 0.6671 0.7324 0.8499 0.8752
10 4 268908 0.6668 0.7033 0.8160 0.8520
  JiNan_Smart Education 273231 0.6668 0.7033 0.8160 0.8520
11 bnl 271353 0.6655 0.6813 0.8172 0.8597
12 woaiyuanshen 270797 0.6651 0.7124 0.8191 0.8533
13 love-peace 270809 0.6562 0.6839 0.8051 0.8352
14 bd 276244 0.6554 0.7182 0.8461 0.8707
15 Thapar Titans 276218 0.6324 0.6774 0.7936 0.8248
16 SmolLab_SEU 273372 0.6284 0.6955 0.8223 0.8565
17 Retuyt-InCo 273776 0.6129 0.7033 0.8272 0.8559
18 NLIP 273476 0.6055 0.6897 0.8205 0.8468
19 Squirrel Ai Learning 275302 0.5954 0.6516 0.7639 0.8022
20 Tutorify 273722 0.5681 0.6425 0.7749 0.8190
21 K-NLPers 275440 0.5664 0.5773 0.7346 0.8061
22 Emergent Wisdom 274934 0.5661 0.6645 0.7782 0.8054
23 SG 276044 0.5465 0.6341 0.7545 0.7725
24 SAI 275023 0.5398 0.6277 0.7564 0.8022
25 DLSU 275259 0.5294 0.6089 0.7351 0.7738
26 Patriots 275705 0.4630 0.5727 0.6943 0.7537
27 whyiamhere 273413 0.4306 0.6044 0.7143 0.7938
28 AUST_NLP 275425 0.4196 0.5262 0.6077 0.6833
29 NLP_UNH 273635 0.3798 0.5546 0.6530 0.7524

For leaderboards according to the secondary metrics, please refer to the tables on exact accuracy, lenient F1, and lenient accuracy.

Track 5: Tutor Identification

Team ranks are based on the main shared task metric – exact macro F1 score (Ex. F1). The secondary metric – exact accuracy (Ex. Acc) – is also included.

Rank Team Submission Ex. F1 Ex. Acc
1 Phaedru 275108 0.9698 0.9664
2 SYSUpporter 274782 0.9692 0.9657
3 Two Outliers 275385 0.9172 0.9412
4 JInan_Smart Education 266729 0.8965 0.8940
5 BLCU-ICALL 276219 0.8930 0.8908
6 Archaeology 275710 0.8866 0.8882
7 Wonderland_EDU@HKU 268436 0.8795 0.8778
8 MSA 275876 0.8697 0.8649
9 SmolLab_SEU 273478 0.8621 0.8604
10 mucai 274180 0.8602 0.8675
11 Squirrel Ai Learning 274001 0.8432 0.8390
12 Retuyt-InCo 275706 0.8385 0.8475
13 whyiamhere 273500 0.8356 0.8345
14 bnl 272827 0.8247 0.8216
15 Tutorify 274405 0.8212 0.8100
16 LexiLogic 276033 0.8207 0.8145
17 Georgia Tech EDU 271359 0.6468 0.6296
18 DLSU 275260 0.6420 0.6231
19 letstea 273939 0.1749 0.1635
20 zet-epsilon 275908 0.1140 0.1965

For leaderboards according to the secondary metric, please refer to the table on exact accuracy.

Important Dates

All deadlines are 11:59pm UTC-12 (anywhere on Earth).

  • March 12, 2025: Development data release
  • April 10, 2025 (updated from April 9): Test data release
  • April 24, 2025 (updated from April 23): System submissions from teams due
  • April 30, 2025: Evaluation of the results by the organizers
  • May 21, 2025: System papers due
  • May 28, 2025: Paper reviews returned
  • June 9, 2025: Final camera-ready submissions
  • July 31 and August 1, 2025: BEA 2025 workshop at ACL

FAQs

Questions about this shared task should be sent to bea.sharedtask.2025@gmail.com. Below, we share the answers to the most frequently asked questions.

1. How do I submit my system predictions and what is the expected format of the submission file?

Please check the “Evaluation and Submission File” tab in the “Get Started” panel on the relevant CodaBench track page (e.g., https://www.codabench.org/competitions/7195/ for the Mistake Identification track). All the information about the format and how to upload your submission is provided there. Feel free to check the other tabs in the “Get Started” panel for more information about the task.

2. Our team is composed of multiple team members. Who should submit the system predictions to the CodaBench page? How do we make sure all team members are acknowledged in the submission?

Our advice is that a single team member creates an account and makes submissions for the whole team on CodaBench. Unfortunately, we have not found an easy way on this platform to add other users to a team, so submissions from a single team member are fine. When the test phase is completed, we will ask each team to confirm its composition for the official final leaderboard on the shared task website (https://sig-edu.org/sharedtask/2025); we will also summarize the results in the official shared task paper and invite teams to submit their own system reports, which will be included in the official workshop proceedings. Therefore, team members should not worry that any of them will be left out or unacknowledged.

Another thing to keep in mind: multiple team members can register for the task, and the platform allows any registered member to submit system predictions; however, regardless of who submits, the upper cap on submissions from each team in each track is 5. When a user makes a submission, the platform asks which team they are submitting for, so a team’s submissions may come from different members. In the end, we will include only the first 5 submissions from each team (the earliest according to the time stamp) in our official leaderboard to ensure a level playing field for all teams regardless of their composition.

3. Can you clarify how it was decided whether the student utterance contains a mistake? I am seeing examples like the following, where the student’s last utterance does not contain a mistake, yet most of the tutors’ responses are annotated as “yes” for Mistake Identification.

"conversation_history": "Tutor: Hi, could you please provide a step-by-step solution for the question below? The question is: Tyson decided to make muffaletta sandwiches for the big game.  Each sandwich required 1 pound each of meat and cheese and would serve 4 people.  There would be 20 people in total watching the game.  The meat cost $7.00 per pound and the cheese cost $3.00 per pound.  How much money would he spend on the meat and cheese to make enough sandwiches to serve 20 people? \n Student: To serve 20 people, Tyson needs to make 20/4 = 5 sandwiches.\nEach sandwich requires 1+1 = 2 pounds of meat and cheese.\nFor 5 sandwiches, he needs a total of 2 x 5 = 10 pounds of meat and cheese.\nThe cost of 10 pounds of meat is 10 x $7.00 = $70.\nThe cost of 10 pounds of cheese is 10 x $3.00 = $30.\nThe total cost of meat and cheese is $70 + $30 = $100.\n 100 \n Tutor: How many pounds of meat are needed for each sandwich? \n Student: Each sandwich requires 1 pound of meat and 1 pound of cheese. \n Tutor: What is the cost of 1 pound of meat? \n Student: The cost of 1 pound of meat is $7.00."

In the dialogue referenced here, the actual error in the student’s solution is that the student counted the amount of meat and cheese needed for the sandwiches twice. So, even though the very last response from the student simply repeats what the task states (i.e., that one pound of meat costs $7.00), the original mistake has still not been corrected. What you see in the tutor responses included in the dataset are the tutors’ attempts at remediating the original mistake of overestimating the amount needed.

For example:

"response": "You're close, but I notice that you calculated the cost of 10 pounds of meat, when actually 5 pounds of meat are needed for 5 sandwiches, since each sandwich requires 1 pound of meat; can you recalculate the cost of the meat?",
"annotation": {
  "Mistake_Identification": "Yes",
  "Mistake_Location": "Yes",
  "Providing_Guidance": "Yes",
  "Actionability": "Yes"
}

You can see here that the tutor clearly points out to the student that there is a mistake in their solution (the tutor says “You're close, but …”); the tutor also points out where the mistake is (“you calculated the cost of 10 pounds of meat, when actually 5 pounds of meat are needed for 5 sandwiches”). This feedback both provides guidance (specifically, it includes an explanation of where the error is and how to fix it) and is actionable (it asks “can you recalculate the cost of the meat?”, which would lead to a concrete action from the student). Thus, all dimensions receive “Yes” labels. Other tutors’ responses are annotated using similar principles: for mistake identification, in particular, the annotation tells us whether the tutor, judging by their response, recognized that there is an error in the student’s solution.

When classifying the responses, you need to take the whole dialogue into account and consider whether the tutors successfully address a genuine error or confusion on the part of the student. The goal is to evaluate whether the responses you are seeing from the AI tutors are what you would expect from a good tutor whose aim is to remediate the original mistake in the student’s solution.

For more information and background on the annotation approach, you can refer to the paper behind the dataset and its annotation – Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors.

4. I have spotted a few dialogues that end with a tutor’s turn rather than a student’s. Why are they included?

There are, indeed, a small number of dialogues where the conversation history ends with a turn from the tutor. This data originates from the MathDial dataset, and the shared task organizers did not modify the original data. We note that in such cases the tutor seems to summarize the student’s solution or explain their chain of reasoning before providing further feedback. For example:

"conversation_history": "Tutor: Hi, could you please provide a step-by-step solution for the question below? The question is: The dog toys Samantha buys for her dog are \"buy one get one half off\" and all cost $12.00 each. She buys 4 toys.  How much does she spend on dog toys? \n Student: Samantha buys 4 toys at $12.00 each.\nFor each pair of toys, she gets one at full price and one at half price.\nSo she pays for three full price toys and one half price toy.\nThree full price toys cost 3 x $12.00 = $36.00.\nOne half price toy costs $12.00 / 2 = $6.00.\nTherefore, Samantha spends $36.00 + $6.00 = $42.00 on dog toys.\n 42.00 \n Tutor: I added the two amounts together to get a total of $36.00 + $6.00 = $42.00."

While compiling the dataset for the shared task, we decided to keep such dialogues unmodified, as the subsequent tutor responses are still valid and aimed at addressing the previous student turns.

Organizers

  • Ekaterina Kochmar (MBZUAI)
  • Kaushal Kumar Maurya (MBZUAI)
  • Kseniia Petukhova (MBZUAI)
  • KV Aditya Srivatsa (MBZUAI)
  • Justin Vasselli (Nara Institute of Science and Technology)
  • Anaïs Tack (KU Leuven)

Contact: bea.sharedtask.2025@gmail.com

Shared Task Reference

  • Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, and Justin Vasselli. 2025. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications.

Dataset Reference

  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1234–1251.

References

  • Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219
  • Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774
  • Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku.
  • Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2024. Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors. arXiv preprint arXiv:2407.09136
  • Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783
  • Xiang Gao, Yizhe Zhang, Michel Galley, Chris Brockett, and Bill Dolan. 2020. Dialogue response ranking training with large-scale human feedback data. arXiv preprint arXiv:2009.06978
  • Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81
  • Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634
  • Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. arXiv preprint arXiv:2305.14536.
  • Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of NAACL 2025 (main)
  • Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the second conference on machine translation, pages 612–618
  • Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771
  • Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530
  • Anaïs Tack and Chris Piech. 2022. The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues. arXiv preprint arXiv:2205.07540
  • Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, and Chris Piech. 2023. The BEA 2023 shared task on generating AI teacher responses in educational dialogues. arXiv preprint arXiv:2306.06941.
  • Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. Bridging the novice-expert gap via models of decision-making: A case study on remediating math mistakes. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2174–2199
  • Wollny, Sebastian & Schneider, Jan & Di Mitri, Daniele & Weidlich, Joshua & Rittberger, Marc & Drachsler, Hendrik. 2021. Are We There Yet? - A Systematic Literature Review on Chatbots in Education. Frontiers in Artificial Intelligence 4. 654924. (doi:10.3389/frai.2021.654924)