BEA 2024 Shared Task

Automated Prediction of Item Difficulty and Item Response Time

Overview Paper

For a detailed overview of the shared task, the main solutions proposed by the teams, and their implications for educational assessment, please read the Shared Task overview paper referenced below.

Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions (Yaneva et al., BEA 2024)

Motivation

For standardized exams to be fair and valid, test items must meet certain criteria. One important criterion is that the questions should cover a wide range of difficulty levels to gather information about the abilities of test-takers effectively. Additionally, it is essential to allocate an appropriate amount of time for each question: too little time can make the exam speeded, while too much time can make it inefficient. Typically, item difficulty and response time data are collected via a process called pretesting, where new items are embedded in live exams alongside scored items. While robust, this process of collecting item characteristics data is time-consuming and expensive. As noted by Settles et al. (2020), “This labor-intensive process often restricts the number of items that can feasibly be created, which in turn poses a threat to security: Items may be copied and leaked, or simply used too often”.

To address this challenge, also referred to as the “cold-start parameter estimation problem” (McCarthy et al., 2021), there is growing interest in predicting item characteristics such as difficulty and response time based on the item text. Such estimates can be used to “jump-start” parameter estimation by exposing the item to a smaller sample of test-takers, or to improve fairness by reducing the time variance for forms that include pretest items (Baldwin et al., 2020).

Due to difficulties with sharing exam data, efforts to advance the state of the art in item parameter prediction have been fragmented and confined to individual institutions, with no transparent evaluation on a publicly available dataset. In this Shared Task, we bridge this gap by sharing practice item content and characteristics from a high-stakes medical exam, the United States Medical Licensing Examination® (USMLE®), for the exploration of two topics: predicting item difficulty (Track 1) and item response time (Track 2) based on item text.

Data

The data for the Shared Task consists of 667 previously used and now retired Multiple Choice Questions (MCQs) from USMLE Steps 1, 2 CK, and 3. The USMLE is a series of examinations (called Steps) that supports medical licensure decisions in the United States and is developed by the National Board of Medical Examiners (NBME) and the Federation of State Medical Boards (FSMB). An example practice item from the dataset is given in Table 1.

Q	A 65-year-old woman comes to the physician for a follow-up examination after blood pressure measurements were 175/105 mm Hg and 185/110 mm Hg 1 and 3 weeks ago, respectively. She has well-controlled type 2 diabetes mellitus. Her blood pressure now is 175/110 mm Hg. Physical examination shows no other abnormalities. Antihypertensive therapy is started, but her blood pressure remains elevated at her next visit 3 weeks later. Laboratory studies show increased plasma renin activity; the erythrocyte sedimentation rate and serum electrolytes are within the reference ranges. Angiography shows a high-grade stenosis of the proximal right renal artery; the left renal artery appears normal. Which of the following is the most likely diagnosis?
(A)	Atherosclerosis
(B)	Congenital renal artery hypoplasia
(C)	Fibromuscular dysplasia
(D)	Takayasu arteritis
(E)	Temporal arteritis

Table 1: An example of a practice item from the USMLE Step 1 Sample Test Questions at usmle.org

The part describing the case is referred to as the stem, the correct answer is referred to as the key, and the incorrect answer options are known as distractors. All items are MCQs that test medical knowledge and were written by experienced subject matter experts following a set of guidelines that stipulate adherence to a standard structure. These guidelines require avoidance of “window dressing” (extraneous material not needed to answer the item), “red herrings” (information designed to mislead the test-taker), and grammatical cues (e.g., correct answers that are longer or more specific than the other options). The goal of standardizing items in this manner is to produce items that vary in their difficulty and discriminating power due only to differences in the medical content they assess.

The items were administered within a standard nine-hour exam. For this shared task, the item characteristic data was derived from first-time examinees from accredited US and Canadian medical schools.

Each item is tagged with the following item characteristics:

Item difficulty: A measure of how difficult the item is, where higher values indicate more difficult items.
Time intensity: The arithmetic mean response time, measured in seconds, across all examinees who attempted a given item in a live exam. This includes all time spent on the item from the moment it is presented on the screen until the examinee moves to the next item, as well as any revisits.

The data is structured as follows:

ItemNum<\t>ItemStem_Text<\t>Answer__A<\t>Answer__B<\t>Answer__C<\t>Answer__D<\t>Answer__E<\t>Answer__F<\t>Answer__G<\t>Answer__H<\t>Answer__I<\t>Answer__J<\t>Answer_Key<\t>Answer_Text<\t>ItemType<\t>EXAM<\t>Difficulty<\t>Response_Time

ItemNum denotes the consecutive number of the item in the dataset (e.g., 1, 2, 3, 4, 5).

ItemStem_Text contains the text data for the item stem (the part of the item describing the clinical case).

Answer__A contains the text for response option A.

Answer__B contains the text for response option B.

Answer__C contains the text for response option C.

(…)

Answer__J contains the text for response option J. For items that have fewer than J response options, the remaining columns are left blank. For example, if an item contains response options A to E, the fields for columns F to J are left blank for that item.

Answer_Key contains the letter of the correct answer for that item.

Answer_Text contains the text of the correct response for the item.

ItemType denotes whether the item contained an image (e.g., an x-ray image, picture of a skin lesion, etc.) or not. The value “Text” denotes text-only items that do not contain images and the value “PIX” denotes items that contain an image. Note that the images are not part of the dataset.

EXAM denotes the Step of the USMLE exam the item belongs to (Step 1, Step 2, or Step 3). For more information on the Steps of the USMLE see https://www.usmle.org/step-exams.

Difficulty contains the item difficulty measure. Higher values indicate more difficult items.

Response_Time contains the mean response time for the item measured in seconds.
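The column layout described above can be read with a standard tab-separated parser. The following is a minimal sketch using only Python's standard library; the sample row is invented for illustration and is not taken from the dataset, the column names follow the header line given above, and the "Step 1" value for EXAM is an assumption about how the Step is encoded. The function name load_items and the output dictionary keys are hypothetical.

```python
import csv
import io

# Response option columns A to J, as listed in the data description.
OPTION_COLUMNS = [f"Answer__{letter}" for letter in "ABCDEFGHIJ"]

# Hypothetical sample (header plus one invented item), NOT real task data.
SAMPLE_TSV = (
    "ItemNum\tItemStem_Text\t" + "\t".join(OPTION_COLUMNS)
    + "\tAnswer_Key\tAnswer_Text\tItemType\tEXAM\tDifficulty\tResponse_Time\n"
    "1\tA placeholder stem.\tOption one\tOption two\tOption three\tOption four\tOption five"
    "\t\t\t\t\t\tC\tOption three\tText\tStep 1\t0.5\t80.0\n"
)


def load_items(handle):
    """Parse tab-separated item rows into dictionaries with typed targets."""
    items = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Columns for unused options are blank, so keep only the filled ones.
        options = {
            col[-1]: row[col] for col in OPTION_COLUMNS if (row[col] or "").strip()
        }
        items.append({
            "item_num": int(row["ItemNum"]),
            "stem": row["ItemStem_Text"],
            "options": options,
            "key": row["Answer_Key"],
            "has_image": row["ItemType"] == "PIX",  # "Text" means no image
            "exam": row["EXAM"],
            "difficulty": float(row["Difficulty"]),
            "response_time": float(row["Response_Time"]),
        })
    return items


items = load_items(io.StringIO(SAMPLE_TSV))
print(items[0]["options"])
```

For a real file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.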

Prior work related to modeling item difficulty and time intensity for clinical MCQs from the USMLE includes the following articles: Ha et al. (2019), Baldwin et al. (2020), Yaneva et al. (2020), Xue et al. (2020), Yaneva et al. (2021) (see the References section below).

Participation

We frame the proposed task in two separate tracks as follows:

Track 1: Given the item text and metadata, predict the item difficulty variable.
Track 2: Given the item text and metadata, predict the time intensity variable.

Using one target variable to predict the other is not allowed, because neither the difficulty nor the time intensity parameter is available at the time an item is written.

Training data outside of the specified training set is allowed, provided that it is publicly available.

In both tracks, the evaluation will be based on the Root Mean Squared Error metric (RMSE).
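For reference, RMSE is the square root of the mean squared difference between gold and predicted values: RMSE = sqrt((1/n) * sum_i (y_i - yhat_i)^2). The sketch below is a plain-Python illustration of that definition, not the official scoring script; the gold and predicted values are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented values for illustration only.
gold = [0.2, 0.5, 0.9]
pred = [0.3, 0.5, 0.7]
print(round(rmse(gold, pred), 4))  # 0.1291
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones, and lower values are better.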

Registration

Registration is now closed. For questions, please email Victoria Yaneva at vyaneva@nbme.org

Data Access

Access to the data for the purposes of the Shared Task is no longer available. If you wish to request access to the data for a different research study, please follow the application process outlined at https://www.nbme.org/bea-2024-shared-task-data. For questions, please contact ORS@nbme.org.

If you publish results using this data, please cite the Shared Task overview paper as follows: Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions (Yaneva et al., BEA 2024)

Submission and Evaluation

Submissions must be separate .csv files, one per track, named “Difficulty_Predictions.csv” and “Response_time_predictions.csv”, respectively. Each submission should contain the item number (Item_num) and the predicted value, as in the following example:

Item_num, Prediction
143, 0.9
423, 0.1

Teams can submit up to three attempts for each track, differentiated by adding run1, run2, or run3 to the name of their uploaded .csv file. However, we ask that the participants explain how these attempts/submissions are different within their system report paper, i.e., changes in methodology, parameters, models used, prediction strategy, etc.
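A submission file in the format shown above could be produced along the following lines. This is a sketch under stated assumptions: the function name write_submission and the track labels are hypothetical, the writer emits standard CSV without a space after the comma, and the run1/run2/run3 suffix is not handled because the exact naming pattern is not specified here.

```python
import csv
import os
import tempfile

# File names as given in the submission instructions.
TRACK_FILENAMES = {
    "difficulty": "Difficulty_Predictions.csv",
    "response_time": "Response_time_predictions.csv",
}


def write_submission(predictions, track, out_dir):
    """Write a {item number: predicted value} mapping to the track's .csv file."""
    path = os.path.join(out_dir, TRACK_FILENAMES[track])
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Item_num", "Prediction"])
        for item_num in sorted(predictions):
            writer.writerow([item_num, predictions[item_num]])
    return path


# The two rows from the example above, written to a temporary directory.
out_dir = tempfile.mkdtemp()
path = write_submission({143: 0.9, 423: 0.1}, "difficulty", out_dir)
with open(path, encoding="utf-8") as handle:
    written = handle.read()
print(written)
```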

There will be two separate leaderboards for Track 1 and Track 2. In both, submissions are ranked according to the Root Mean Squared Error metric from Python’s scikit-learn library.

Submissions from each team are expected to be accompanied by a system report paper. To allow time for writing, the papers are due on March 10. The papers must use the official ACL style templates. The accepted papers will be published as part of the official BEA proceedings (ACL Anthology). Both long and short papers are welcome. The system papers will be summarized and discussed in an overview paper.

We kindly request that team members be available to serve as reviewers for system report papers from other teams if needed. If you are unavailable for reviewing between March 11th and March 31st, please inform us promptly.

Results

Of the 48 teams that registered, 17 submitted results. As the leaderboards below show, predicting item difficulty remains highly challenging: the best system surpassed the DummyRegressor baseline by only a small margin (RMSE of 0.299 compared to 0.311). Predicting item response time proved more tractable, with many teams outperforming the DummyRegressor baseline (best RMSE of 23.927 compared to 31.68).
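To put these margins in perspective, the sketch below computes the relative RMSE reduction of the top-ranked run over the baseline in each track, using the leaderboard figures. It also illustrates what a mean-predicting baseline looks like, on the assumption that the DummyRegressor baseline used scikit-learn's default "mean" strategy; the configuration actually used is not stated here, and the toy targets are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(train_targets, test_targets):
    """Score a baseline that predicts the training mean for every test item."""
    mean_value = sum(train_targets) / len(train_targets)
    return rmse(test_targets, [mean_value] * len(test_targets))


def relative_reduction(baseline_rmse, system_rmse):
    """Fraction by which a system lowers RMSE relative to the baseline."""
    return (baseline_rmse - system_rmse) / baseline_rmse


# Toy targets (invented): training mean is 2.0, test errors are 0 and 2.
toy_score = mean_baseline_rmse([1.0, 2.0, 3.0], [2.0, 4.0])

# Leaderboard figures: top-ranked run versus baseline in each track.
difficulty_gain = relative_reduction(0.311, 0.299)
time_gain = relative_reduction(31.68, 23.927)
print(f"{difficulty_gain:.1%} {time_gain:.1%}")  # 3.9% 24.5%
```

On these figures, the best difficulty run lowers the baseline error by roughly 4 percent, whereas the best response time run lowers it by roughly a quarter.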

Track 1: Item Difficulty Prediction

DummyRegressor Baseline Difficulty: 0.311

Rank	Team Name	Target	Run	RMSE
1	EduTec	Difficulty	electra	0.299
2	UPN-ICC	Difficulty	run1	0.303
3	EduTec	Difficulty	roberta	0.304
4	ITEC	Difficulty	RandomForest	0.305
5	BC	Difficulty	ENSEMBLE	0.305
6	Scalar	Difficulty	Predictions	0.305
7	BC	Difficulty	FEAT	0.305
8	BC	Difficulty	ROBERTA	0.306
9	UnibucLLM	Difficulty	run1	0.308
10	EDU	Difficulty	Run3	0.308
11	EDU	Difficulty	Run1	0.308
12	ITEC	Difficulty	Ensemble	0.308
13	UNED	Difficulty	run3	0.308
14	Rishikesh	Difficulty	1	0.31
15	Iran-Canada	Difficulty	run2	0.311
16	Baseline	Difficulty	DummyRegressor	0.311
17	EduTec	Difficulty	deberta	0.312
18	Iran-Canada	Difficulty	run3	0.313
19	SCaLARlab	Difficulty	run1	0.315
20	BRG	Difficulty	PubMedBert	0.318
21	Iran-Canada	Difficulty	run1	0.322
22	SCaLARlab	Difficulty	run2	0.322
23	ml4ed	Difficulty	run1	0.323
24	ml4ed	Difficulty	run3	0.325
25	edtec	Difficulty	run1	0.325
26	edtec	Difficulty	run3	0.326
27	edtec	Difficulty	run2	0.326
28	Pitt	Difficulty	run1	0.326
29	ml4ed	Difficulty	run2	0.328
30	UnibucLLM	Difficulty	run3	0.328
31	EDU	Difficulty	Run2	0.329
32	ED	Difficulty	run1	0.332
33	SCaLARlab	Difficulty	run3	0.336
34	UnibucLLM	Difficulty	run2	0.337
35	UNED	Difficulty	run1	0.337
36	BRG	Difficulty	run1	0.34
37	Daniel	Difficulty	run2	0.348
38	BRG	Difficulty	run2	0.348
39	ED	Difficulty	run2	0.353
40	UNED	Difficulty	run2	0.363
41	Daniel	Difficulty	run1	0.364
42	ED	Difficulty	run3	0.367
43	ITEC	Difficulty	BERT-ClinicalQA	0.393

Track 2: Response Time Prediction

DummyRegressor Baseline Response Time: 31.68

Rank	Team Name	Target	Run	RMSE
1	UNED	Response_Time	run2	23.927
2	ITEC	Response_Time	Lasso	24.116
3	UNED	Response_Time	run1	24.777
4	UNED	Response_Time	run3	25.365
5	EduTec	Response_Time	roberta	25.64
6	EduTec	Response_Time	electra	25.875
7	UnibucLLM	Response_Time	run3	26.073
8	ED	Response_Time	run1	26.57
9	Rishikesh	Response_Time	1	26.651
10	UnibucLLM	Response_Time	run2	26.768
11	UnibucLLM	Response_Time	run1	26.846
12	SCaLARlab	Response_Time	run3	26.945
13	Scalar	Response_Time	predictions	26.982
14	EduTec	Response_Time	deberta	27.302
15	EDU	Response_Time	Run1	27.474
16	SCaLARlab	Response_Time	run2	27.481
17	EDU	Response_Time	Run3	28.191
18	Iran-Canada	Response_Time	run3	28.714
19	SCaLARlab	Response_Time	run1	28.768
20	Iran-Canada	Response_Time	run2	28.88
21	Iran-Canada	Response_Time	run1	29.394
22	Daniel	Response_Time	run1	29.967
23	UPN-ICC	Response_Time	run1	30.981
24	BRG	Response_Time	run2	31.48
25	Baseline	Response_Time	DummyRegressor	31.68
26	EDU	Response_Time	Run2	31.962
27	BRG	Response_Time	run1	31.996
28	ED	Response_Time	run2	33.281
29	BRG	Response_Time	PubMedBert	33.412
30	ED	Response_Time	run3	35.476
31	Daniel	Response_Time	run2	36.421
32	ITEC	Response_Time	BERT-ClinicalQA	53.844
33	ITEC	Response_Time	emrqa	54.719
34	Pitt	Response_Time	run1	70.488

Important Dates

Training data release: January 15

Test data release: February 10

Results due: February 16

Announcement of winners: February 21

Paper submissions due: March 10

Camera-ready papers due: April 22

Organizers

Victoria Yaneva, National Board of Medical Examiners

Peter Baldwin, National Board of Medical Examiners

Kai North, George Mason University

Brian Clauser, National Board of Medical Examiners

Saed Rezayi, National Board of Medical Examiners

Yiyun Zhou, National Board of Medical Examiners

Le An Ha, Ho Chi Minh City University of Foreign Languages - Information Technology (HUFLIT)

Polina Harik, National Board of Medical Examiners

References

Baldwin, P., Yaneva, V., Mee, J., Clauser, B. E., Ha, L. A. 2020. Using Natural Language Processing to Predict Item Response Times and Improve Test Construction. Journal of Educational Measurement, Wiley. DOI: https://doi.org/10.1111/jedm.12264

McCarthy, A.D., Yancey, K.P., LaFlair, G.T., Egbert, J., Liao, M. and Settles, B., 2021, November. Jump-starting item parameters for adaptive language tests. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 883-899).

Ha, L. A., Yaneva, V., Baldwin, P. and Mee, J. 2019. Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam. Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), held in conjunction with ACL 2019, Florence, Italy, 2 August, 2019.

Settles, B., LaFlair, G. T. and Hagiwara, M., 2020. Machine learning–driven language assessment. Transactions of the Association for Computational Linguistics, 8, pp.247-263.

Xue, K., Yaneva, V., Runyon, C. and Baldwin, P., 2020, July. Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 193-197).

Yaneva, V., Ha, L. A., Baldwin, P. and Mee, J. 2020. Predicting Item Survival for Multiple Choice Questions in a High-stakes Medical Exam. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 6812-6818).

Yaneva, V., Jurich, D. P., Ha, L. A., and Baldwin, P. 2021. Using Linguistic Features to Predict the Response Process Complexity Associated with Answering Clinical MCQs. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 223-232).