BEA 2025 Shared Task
Pedagogical Ability Assessment of AI-powered Tutors
Motivation
Conversational agents offer promising opportunities for education as they can fulfill various roles (e.g., intelligent tutors and service-oriented assistants) and pursue different objectives (e.g., improving student skills and increasing instructional efficiency) (Wollny et al., 2021), among which serving as an AI tutor is one of the most prevalent tasks (Tack et al., 2023). Recent advances in the development of Large Language Models (LLMs) provide our field with promising ways of building AI-based conversational tutors, which can generate human-sounding dialogues on the fly. The key question posed in previous research (Tack and Piech, 2022; Tack et al., 2023), however, remains: how can we test whether state-of-the-art generative models are good AI teachers, capable of replying to a student in an educational dialogue?
Evaluating dialogue systems in general presents a significant challenge. While human evaluation is still considered the most reliable method for assessing dialogue quality, its high cost and lack of reproducibility have led to the adaptation of both reference-based and reference-free automatic metrics, originally used in machine translation and summary evaluation, for dialogue evaluation (Lin, 2004; Popovic, 2017; Post, 2018; Gao et al., 2020; Liu et al., 2023). When it comes to Intelligent Tutoring Systems (ITSs), which also function as dialogue systems with the specific role of acting as tutors, these general metrics are insufficient. In the educational context, we need to assess complex pedagogical aspects and abilities of such systems, ensuring that they provide students with sufficient, helpful, and factually correct guidance and do not simply reveal answers when the student makes a mistake, among other aspects. Therefore, developing automatic metrics to evaluate these nuanced aspects is essential for creating effective and helpful tutoring systems.
Due to the lack of a standardized evaluation taxonomy, previous work has used different criteria for evaluation. For example, Tack and Piech (2022) and Tack et al. (2023) evaluated models’ responses in terms of whether they speak like a teacher, understand a student, and help a student, while in Macina et al. (2023), responses of models playing the role of tutors were evaluated by human annotators for coherence, correctness, and equitable tutoring. At the same time, Wang et al. (2024) assess usefulness, care, and human-likeness, and Daheim et al. (2024) use targetedness, correctness, and actionability of a tutor response as quality evaluation criteria. This lack of standardization makes it difficult to compare different systems; therefore, defining evaluation criteria and developing automatic metrics for them is a crucial step in advancing the field, which we aim to address in this shared task.
Task Goals & Description
Following the successful BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues (Tack et al., 2023), we revisit the question of quality assessment of tutor responses generated by AI models (specifically, LLMs) in the context of educational dialogues. We believe that (1) the topic is timely and important, and the shared task will attract the attention of the BEA community; (2) LLMs have significantly advanced in the past couple of years, making it important to revisit this topic after the competition run in 2023; and (3) there is a need to establish a pedagogically motivated benchmark for this task. In contrast to the BEA 2023 shared task, our focus is not on the generation of educational dialogues using state-of-the-art LLMs, but rather on the comprehensive evaluation of AI-tutor responses using a set of pedagogically motivated metrics.
In this shared task, we will focus on educational dialogues between a student and a tutor in the mathematical domain grounded in student mistakes or confusion, where the AI tutor aims to remediate such mistakes or confusion. Dialogues in the datasets provided in this shared task include:
- The context consisting of several prior turns from both the tutor and the student. These are extracted from two popular datasets of educational dialogues in the mathematical domain – MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024);
- The last utterance from the student containing a mistake; and
- A set of possible responses to the student’s last utterance from a range of LLM-based tutors and, where available, human tutors, aimed at mistake remediation.
The LLM-based tutor responses are generated by the organizers of the shared task using a set of state-of-the-art LLMs of various sizes and capabilities, including: GPT-4 (Achiam et al., 2023), Gemini (Reid et al., 2024), Sonnet (Anthropic), Mistral (Jiang et al., 2023), Llama-3.1-8B and Llama-3.1-405B (Dubey et al., 2024), and Phi-3 (Abdin et al., 2024).
The identities of the tutors will be included in the development set provided to the task participants, but not in the test set. In addition to the responses themselves, the development set contains annotations of their quality along the following pedagogically motivated dimensions (Maurya et al., 2025):
- Mistake identification: Since all dialogues in the dataset contain a mistake made by the student, a good quality response from the tutor should include the relevant mistake identification. This corresponds to student understanding in the schema of Tack and Piech (2022) and correctness in the schemata of Macina et al. (2023) and Daheim et al. (2024).
- Mistake location: A good tutor response should not only notify the student of the committed error, but also point to its location in the answer and outline what the error is, to help the student remediate it in their next response. This corresponds to targetedness in Daheim et al. (2024).
- Providing guidance: A good tutor response should provide the student with relevant and helpful guidance, such as a hint, an explanation, a supporting question, and the like. This aspect corresponds to helping a student in Tack and Piech (2022) and usefulness in Wang et al. (2024).
- Actionability: Finally, once the guidance is provided to a student, it should be clear from a good tutor response what the student should do next; in other words, the tutor response should not be vague, unclear or a conversation stopper. This aspect in our schema corresponds to actionability in Daheim et al. (2024).
For more details on the annotation and data collection, please refer to Maurya et al. (2025).
Data
Both development and test sets are distributed as JSON files (Maurya et al., 2025).
Development set
The development set consists of 300 dialogues from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets, including the context of several prior turns from both the tutor and the student, the last utterance from the student containing a mistake, and a set of responses to the student’s last utterance from 7 LLM-based tutors and human tutors (a single tutor in MathDial, and an expert and a novice tutor in Bridge), annotated for their pedagogical quality. In total, the development set contains over 2,480 tutor responses, presented and annotated as follows:
{
  "conversation_id": "29640377",
  "conversation history": "Tutor: Here is our first question.\nTutor: What is the value of 10/5?\nStudent: 50",
  "tutor_responses": {
    "Llama318B": {
      "response": "I think there might be a small mistake there, let's take a look at the problem together – when you divide 10 by 5, we actually get 2, not 50.",
      "annotation": {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "Yes",
        "Providing_Guidance": "To some extent",
        "Actionability": "No"
      }
    },
    "Sonnet": {
      "response": "I appreciate your effort, but let's think about this carefully: if we divide 10 into 5 equal groups, how many would be in each group?",
      "annotation": {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "Yes",
        "Providing_Guidance": "Yes",
        "Actionability": "Yes"
      }
    },
    "Gemini": {
      "response": "Remember, dividing by 5 is like sharing 10 things equally among 5 groups. How many would be in each group? Let's try that again.",
      "annotation": {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "Yes",
        "Providing_Guidance": "Yes",
        "Actionability": "Yes"
      }
    },
    "..."
  }
}
The following fields are included in the JSON:
- conversation_id: a unique identifier for the instance
- conversation history: the context of several prior turns from the tutor and the student, extracted from the original datasets
- tutor_responses: the set of human tutor responses extracted from the original datasets, as well as responses generated by 7 LLMs-as-tutors, each with a unique identifier
  - response: the response from a particular tutor
  - annotation: the set of annotations, including:
    - Mistake_Identification: marking whether the tutor recognized a mistake in a student’s response
    - Mistake_Location: marking whether the tutor’s response accurately points to a genuine mistake and its location
    - Providing_Guidance: marking whether the tutor offers correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on
    - Actionability: marking if it is clear from the tutor’s feedback what the student should do next
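For illustration, the following minimal Python sketch shows one way to load the development set and flatten it into per-response examples. The file name dev_set.json and the assumption that the file contains a JSON list of dialogue objects in the format shown above are ours, not part of the official release.

import json

# Minimal sketch: load the development set and flatten it into per-response
# examples. "dev_set.json" is a placeholder file name, and the file is assumed
# to contain a JSON list of dialogue objects in the format shown above.
with open("dev_set.json", encoding="utf-8") as f:
    dialogues = json.load(f)

examples = []
for dialogue in dialogues:
    context = dialogue["conversation history"]
    for tutor_id, entry in dialogue["tutor_responses"].items():
        examples.append({
            "conversation_id": dialogue["conversation_id"],
            "tutor": tutor_id,              # tutor identities are available in the development set only
            "context": context,
            "response": entry["response"],
            "labels": entry["annotation"],  # the four pedagogical dimensions
        })

print(f"{len(examples)} annotated tutor responses")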
Test set
The test set consists of 200 dialogues from MathDial and Bridge, also including the context of several prior turns from both the tutor and the student, the last utterance from the student containing a mistake, and a set of unannotated responses to the student’s last utterance from the same set of tutors as in the development set. The key difference is that the test set includes neither the tutor identities nor the annotations of the pedagogical quality of their responses.
As a result, the test set is organized as follows:
{
  "conversation_id": "616653340",
  "conversation history": "Tutor: 39 is a prime number or composite number?\nStudent: prime",
  "tutor_responses": {
    "Tutor_1": {
      "response": "That's a good try, but remember, prime numbers have only two factors: 1 and itself."
    },
    "Tutor_2": {
      "response": "Hmm, what makes you think prime?"
    },
    "Tutor_3": {
      "response": "That's correct, 39 is actually a composite number because it has factors other than 1 and itself, such as 3 and 13."
    },
    "Tutor_4": {
      "response": "That was an incorrect answer."
    },
    "Tutor_5": {
      "response": "Let's check that again. Can you find two numbers that multiply to give 39?"
    },
    "..."
  }
}
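As a rough sketch of how these test instances could be fed to a classifier for Tracks 1-4 (again assuming the file is a JSON list of dialogue objects; predict_labels below is a hypothetical placeholder for a participant's own model, and the official submission format will be announced by the organizers):

import json

def predict_labels(context: str, response: str) -> dict:
    """Hypothetical placeholder classifier for Tracks 1-4; replace with an actual model."""
    return {
        "Mistake_Identification": "Yes",
        "Mistake_Location": "To some extent",
        "Providing_Guidance": "To some extent",
        "Actionability": "No",
    }

# "test_set.json" is a placeholder file name, assumed to hold a list of
# dialogue objects in the format shown above (anonymized tutor IDs, no annotations).
with open("test_set.json", encoding="utf-8") as f:
    test_dialogues = json.load(f)

predictions = []
for dialogue in test_dialogues:
    for tutor_id, entry in dialogue["tutor_responses"].items():
        labels = predict_labels(dialogue["conversation history"], entry["response"])
        predictions.append({"conversation_id": dialogue["conversation_id"],
                            "tutor": tutor_id, **labels})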
Shared Task
This shared task will include five tracks. Participating teams are welcome to take part in any number of tracks.
Tracks
- Track 1 - Mistake Identification:
Teams are invited to develop systems to detect whether tutors’ responses recognize mistakes in students’ responses. The following categories are included:
  - Yes: the mistake is clearly identified/recognized in the tutor’s response
  - To some extent: the tutor’s response suggests that there may be a mistake, but it sounds as if the tutor is not certain
  - No: the tutor does not recognize the mistake (e.g., they proceed to simply provide the answer to the asked question)
- Track 2 - Mistake Location:
Teams are invited to develop systems to assess whether tutors’ responses accurately point to a genuine mistake and its location in the students’ responses. The following categories are included:
  - Yes: the tutor clearly points to the exact location of a genuine mistake in the student’s solution
  - To some extent: the response demonstrates some awareness of the exact mistake, but is vague, unclear, or easy to misunderstand
  - No: the response does not provide any details related to the mistake
- Track 3 - Pedagogical Guidance:
Teams are invited to develop systems to evaluate whether tutors’ responses offer correct and relevant guidance, such as an explanation, elaboration, hint, examples, and so on. The following categories are included:
  - Yes: the tutor provides guidance that is correct and relevant to the student’s mistake
  - To some extent: guidance is provided, but it is fully or partially incorrect, incomplete, or somewhat misleading
  - No: the tutor’s response does not include any guidance, or the guidance provided is irrelevant to the question or factually incorrect
- Track 4 - Actionability:
Teams are invited to develop systems to assess whether tutors’ feedback is actionable, i.e., it makes it clear what the student should do next. The following categories are included:
  - Yes: the response provides clear suggestions on what the student should do next
  - To some extent: the response indicates that something needs to be done, but it is not clear what exactly that is
  - No: the response does not suggest any action on the part of the student (e.g., it simply reveals the final answer)
- Track 5 - Guess the tutor identity: Teams are invited to develop systems to identify which tutors the anonymized responses in the test set originated from. This track will address 9 classes: expert and novice tutors, and 7 LLMs included in the tutor set.
Evaluation
Tracks 1-4 will use accuracy and macro F1 as the main metrics. These will be used in two settings:
- Exact evaluation: predictions submitted by the teams will be evaluated for the exact prediction of the three classes (“Yes”, “To some extent”, and “No”)
- Lenient evaluation: since for these dimensions tutor responses annotated as “Yes” and “To some extent” share a certain amount of qualitative value, we will consider “Yes” and “To some extent” as a single class, and evaluate predictions under the 2-class setting (“Yes + To some extent” vs. “No”)
Track 5 will use accuracy of the tutor identity prediction as its main metric.
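To make the two settings concrete, the sketch below computes accuracy and macro F1 with scikit-learn on a toy set of gold and predicted labels; the official scoring scripts on CodaLab may differ in implementation details.

from sklearn.metrics import accuracy_score, f1_score

# Toy example labels for one of Tracks 1-4.
gold = ["Yes", "To some extent", "No", "Yes", "No"]
pred = ["Yes", "Yes", "No", "To some extent", "Yes"]

# Exact evaluation: all three classes are scored as-is.
exact_acc = accuracy_score(gold, pred)
exact_f1 = f1_score(gold, pred, average="macro")

def merge(labels):
    """Lenient evaluation: treat 'Yes' and 'To some extent' as a single class."""
    return ["Yes + To some extent" if lab != "No" else "No" for lab in labels]

lenient_acc = accuracy_score(merge(gold), merge(pred))
lenient_f1 = f1_score(merge(gold), merge(pred), average="macro")

print(f"Exact:   accuracy={exact_acc:.2f}, macro-F1={exact_f1:.2f}")
print(f"Lenient: accuracy={lenient_acc:.2f}, macro-F1={lenient_f1:.2f}")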
Participation
- Access to the development set will be provided upon registration. To register for the shared task, please fill in the form: https://forms.gle/fKJcdvL2kCrPcu8X6
- All updates about the shared task will be shared via the email addresses indicated in the registration form
- The test phase will be run via CodaLab; all registered participants will be provided with instructions closer to the date
- All teams officially participating in the test phase on CodaLab will be invited to publish their system papers in the BEA 2025 proceedings and present their work at the BEA 2025 workshop
Submission
Submissions will be made via CodaLab, with the number of submissions from each team capped at 5 per track. More information on the submission requirements will be provided closer to the date.
Important Dates
All deadlines are 11:59pm UTC-12 (anywhere on Earth).
- March 12, 2025: Development data release
- April 9, 2025: Test data release
- April 23, 2025: System submissions from teams due
- April 30, 2025: Evaluation of the results by the organizers
- May 21, 2025: System papers due
- May 28, 2025: Paper reviews returned
- June 9, 2025: Final camera-ready submissions
- July 31 and August 1, 2025: BEA 2025 workshop at ACL
FAQ
Questions about this shared task should be sent to bea.sharedtask.2025@gmail.com. We will share answers to frequently asked questions on this page.
Organizers
- Ekaterina Kochmar (MBZUAI)
- Kaushal Kumar Maurya (MBZUAI)
- Kseniia Petukhova (MBZUAI)
- KV Aditya Srivatsa (MBZUAI)
- Justin Vasselli (Nara Institute of Science and Technology)
- Anaïs Tack (KU Leuven)
Contact: bea.sharedtask.2025@gmail.com
Dataset Reference
- Kaushal Kumar Maurya, KV Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of NAACL 2025 (main).
References
- Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku.
- Nico Daheim, Jakub Macina, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2024. Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors. arXiv preprint arXiv:2407.09136
- Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783
- Xiang Gao, Yizhe Zhang, Michel Galley, Chris Brockett, and Bill Dolan. 2020. Dialogue response ranking training with large-scale human feedback data. arXiv preprint arXiv:2009.06978
- Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634
- Jakub Macina, Nico Daheim, Sankalan Pal Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. arXiv preprint arXiv:2305.14536.
- Kaushal Kumar Maurya, KV Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar. 2025. Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors. In Proceedings of NAACL 2025 (main).
- Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618
- Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771
- Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530
- Anaïs Tack and Chris Piech. 2022. The AI teacher test: Measuring the pedagogical ability of blender and GPT-3 in educational dialogues. arXiv preprint arXiv:2205.07540
- Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, and Chris Piech. 2023. The BEA 2023 shared task on generating AI teacher responses in educational dialogues. arXiv preprint arXiv:2306.06941.
- Rose Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2024. Bridging the novice-expert gap via models of decision-making: A case study on remediating math mistakes. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2174–2199
- Sebastian Wollny, Jan Schneider, Daniele Di Mitri, Joshua Weidlich, Marc Rittberger, and Hendrik Drachsler. 2021. Are We There Yet? - A Systematic Literature Review on Chatbots in Education. Frontiers in Artificial Intelligence, 4:654924. (doi:10.3389/frai.2021.654924)