BEA 2024 Shared Task

Automated Prediction of Item Difficulty and Item Response Time


For standardized exams to be fair and valid, test items must meet certain criteria. One important criterion is that the questions should cover a wide range of difficulty levels to gather information about the abilities of test takers effectively. Additionally, it is essential to allocate an appropriate amount of time for each question: too little time can make the exam speeded, while too much time can make it inefficient. Typically, item difficulty and response time data are collected via a process called pretesting, where new items are embedded in live exams alongside scored items. While robust, this process of collecting item characteristics data is time-consuming and expensive. As noted by Settles et al. (2020), “This labor-intensive process often restricts the number of items that can feasibly be created, which in turn poses a threat to security: Items may be copied and leaked, or simply used too often”.

To address this challenge (also referred to as the “cold-start parameter estimation problem” (McCarthy et al., 2021)), there is growing interest in predicting item characteristics such as difficulty and response time based on the item text. Such estimates can be used to “jump-start” parameter estimation by exposing the item to a smaller sample of test-takers, or improve fairness by reducing the time variance for forms that include pretest items (Baldwin et al., 2020).

Due to difficulties with sharing exam data, efforts to advance the state-of-the-art in item parameter prediction have been fragmented and conducted in individual institutions, with no transparent evaluation on a publicly available dataset. In this Shared Task, we bridge this gap by sharing practice item content and characteristics from a high-stakes medical exam called the United States Medical Licensing Examination® (USMLE®) for the exploration of two topics: predicting item difficulty (Track 1) and item response time (Track 2) based on item text.


The data for the Shared Task consists of 667 previously used and now retired Multiple Choice Questions (MCQs) from USMLE Steps 1, 2 CK, and 3. The USMLE is a series of examinations (called Steps) to support medical licensure decisions in the United States that is developed by the National Board of Medical Examiners (NBME) and Federation of State Medical Boards (FSMB). An example practice item from the dataset is given in Table 1.

Q A 65-year-old woman comes to the physician for a follow-up examination after blood pressure measurements were 175/105 mm Hg and 185/110 mm Hg 1 and 3 weeks ago, respectively. She has well-controlled type 2 diabetes mellitus. Her blood pressure now is 175/110 mm Hg. Physical examination shows no other abnormalities. Antihypertensive therapy is started, but her blood pressure remains elevated at her next visit 3 weeks later. Laboratory studies show increased plasma renin activity; the erythrocyte sedimentation rate and serum electrolytes are within the reference ranges. Angiography shows a high-grade stenosis of the proximal right renal artery; the left renal artery appears normal. Which of the following is the most likely diagnosis?
(A) Atherosclerosis
(B) Congenital renal artery hypoplasia
(C) Fibromuscular dysplasia
(D) Takayasu arteritis
(E) Temporal arteritis

Table 1: An example of a practice item from the USMLE Step 1 Sample Test Questions at

The part describing the case is referred to as the stem, the correct answer is referred to as the key, and the incorrect answer options are known as distractors. All items are MCQs that test medical knowledge and were written by experienced subject matter experts following a set of guidelines that stipulate adherence to a standard structure. These guidelines require avoidance of “window dressing” (extraneous material not needed to answer the item), “red herrings” (information designed to mislead the test-taker), and grammatical cues (e.g., correct answers that are longer or more specific than the other options). The goal of standardizing items in this manner is to produce items that vary in their difficulty and discriminating power due only to differences in the medical content they assess.

The items were administered within a standard nine-hour exam. For this shared task, the item characteristic data was derived from first-time examinees from accredited US and Canadian medical schools.

Each item is tagged with the following item characteristics:

  • Item difficulty: A measure of item difficulty where higher values indicate more difficult items.

  • Time intensity: The arithmetic mean response time, measured in seconds, across all examinees who attempted a given item in a live exam. This includes all time spent on the item from the moment it is presented on the screen until the examinee moves to the next item, as well as any revisits.

The data is structured as follows:


ItemNum denotes the consecutive number of the item in the dataset (e.g., 1, 2, 3, 4, 5, etc.).

ItemStem_Text contains the text data for the item stem (the part of the item describing the clinical case).

Answer__A contains the text for response option A.

Answer__B contains the text for response option B.

Answer__C contains the text for response option C.


Answer__J contains the text for response option J. For items that have fewer than J response options, the remaining columns are left blank. For example, if an item contains response options A to E, the fields for columns F to J are left blank for that item.

Answer__Key contains the letter of the correct answer for that item.

Answer_Text contains the text of the correct response for the item.

ItemType denotes whether the item contained an image (e.g., an x-ray image, picture of a skin lesion, etc.) or not. The value “Text” denotes text-only items that do not contain images and the value “PIX” denotes items that contain an image. Note that the images are not part of the dataset.

EXAM denotes the Step of the USMLE exam the item belongs to (Step 1, Step 2, or Step 3). For more information on the Steps of the USMLE see

Difficulty contains the item difficulty measure. Higher values indicate more difficult items.

Response_Time contains the mean response time for the item measured in seconds.
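
The column layout above can be turned into a single text input per item. The following is a minimal sketch, assuming the data is delivered as a comma-separated file with exactly the column names listed above; the sample row, the `item_text` helper, and the "(A) ..." formatting of the options are illustrative choices, not part of the task definition.

```python
import csv
import io

# Option columns run from Answer__A to Answer__J, as described above.
OPTION_COLUMNS = [f"Answer__{letter}" for letter in "ABCDEFGHIJ"]


def item_text(row):
    """Join the stem with the non-blank answer options of one item."""
    parts = [row["ItemStem_Text"].strip()]
    for column in OPTION_COLUMNS:
        option = (row.get(column) or "").strip()
        if option:  # columns beyond the last option are left blank
            letter = column[-1]
            parts.append(f"({letter}) {option}")
    return "\n".join(parts)


# Made-up sample standing in for the real file (hypothetical values,
# and only a subset of the columns).
SAMPLE = io.StringIO(
    "ItemNum,ItemStem_Text,Answer__A,Answer__B,Answer__C,Answer__Key,ItemType,EXAM\n"
    "1,Which of the following is the most likely diagnosis?,Atherosclerosis,"
    "Fibromuscular dysplasia,,A,Text,Step 1\n"
)

rows = list(csv.DictReader(SAMPLE))
example_text = item_text(rows[0])
print(example_text)
```

Skipping blank option columns keeps items with fewer than ten options from carrying empty "(F)" to "(J)" entries into the model input.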

Prior work related to modeling item difficulty and time intensity for clinical MCQs from the USMLE includes the following articles: Ha et al. (2019), Baldwin et al. (2020), Yaneva et al. (2020), Xue et al. (2020), Yaneva et al. (2021) (see the References section below).


We frame the proposed task in two separate tracks as follows:

  • Track 1: Given the item text and metadata, predict the item difficulty variable.

  • Track 2: Given the item text and metadata, predict the time intensity variable.

Use of one target variable in the prediction of another is not allowed, since at the time each item is written neither the difficulty nor the time intensity parameter is available.
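
One way to respect this restriction is to strip both target columns from the feature set before training, whichever track is being modeled. The sketch below assumes each item is held as a dictionary keyed by the column names from the data description; the helper name and the sample values are hypothetical.

```python
# Both target columns are excluded from the features in both tracks.
TARGET_COLUMNS = {"Difficulty", "Response_Time"}
TRACK_TARGET = {1: "Difficulty", 2: "Response_Time"}


def split_features_and_target(row, track):
    """Return (features, target) for one item with no target leakage."""
    target = float(row[TRACK_TARGET[track]])
    features = {
        name: value for name, value in row.items() if name not in TARGET_COLUMNS
    }
    return features, target


# Made-up item (hypothetical values, subset of the columns).
sample_row = {
    "ItemNum": "1",
    "ItemStem_Text": "Which of the following is the most likely diagnosis?",
    "EXAM": "Step 1",
    "Difficulty": "0.5",
    "Response_Time": "80.0",
}

features_t1, target_t1 = split_features_and_target(sample_row, track=1)
features_t2, target_t2 = split_features_and_target(sample_row, track=2)
```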

Training data outside of the specified training set is allowed, provided that it is publicly available.

In both tracks, the evaluation will be based on the Root Mean Squared Error metric (RMSE).
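
For reference, the standard definition of RMSE is given below. The symbols are notation introduced here rather than taken from the task description: N is the number of test items, y_i is the observed value for item i (difficulty or mean response time), and \hat{y}_i is the predicted value.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}}
```

Lower values are better, and the metric is expressed in the same units as the target, so Track 2 scores are in seconds.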


Registration is now closed. For questions, please email Victoria Yaneva at

Data Access

Access to the data for the purposes of the Shared Task is no longer available. If you wish to request access to the data for a different research study, please follow the application process outlined at NBME’s Data Sharing Portal. For questions, please contact

Submission and Evaluation

Submissions must be separate .csv files for each track, named “Difficulty_Predictions.csv” and “Response_time_predictions.csv”, respectively. Each submission should contain the item number (Item_Num) and the predicted value, as in the following example:

Item_num, Prediction
143, 0.9
423, 0.1

Teams can submit up to three attempts for each track, differentiated by adding run1, run2, or run3 to the name of their uploaded .csv file. However, we ask that the participants explain how these attempts/submissions are different within their system report paper, i.e., changes in methodology, parameters, models used, prediction strategy, etc.
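
A submission file in this shape can be produced with Python's standard csv module. This is only a sketch under stated assumptions: the underscore-separated run suffix is a guess, since the instructions say only that run1, run2, or run3 is added to the file name, and the header is written without the space after the comma that appears in the example above.

```python
import csv
import os
import tempfile

BASE_NAMES = {1: "Difficulty_Predictions", 2: "Response_time_predictions"}


def write_submission(predictions, track, run, directory="."):
    """Write {item number: predicted value} to a per-track, per-run .csv file."""
    # Assumption: the run label is appended with an underscore.
    path = os.path.join(directory, f"{BASE_NAMES[track]}_run{run}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Item_num", "Prediction"])
        for item_num, value in sorted(predictions.items()):
            writer.writerow([item_num, value])
    return path


# Usage with the two example rows shown above, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_submission({143: 0.9, 423: 0.1}, track=1, run=1, directory=tmp)
    with open(path, newline="") as handle:
        written = handle.read()
```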

There will be two separate leaderboards for Track 1 and Track 2. In both, submissions are ranked according to the Root Mean Squared Error metric from Python’s scikit-learn library.
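
To make the ranking criterion concrete, the sketch below computes RMSE with the standard library only and compares a set of predictions against a constant baseline that always predicts the mean of the training targets. That is the default behaviour of scikit-learn's DummyRegressor, but the strategy used for the official baseline is not stated here, so treat it as an assumption. All numbers are made up for illustration.

```python
import math
from statistics import fmean


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return math.sqrt(fmean((p - t) ** 2 for t, p in zip(y_true, y_pred)))


def mean_baseline(train_targets, n_test):
    """Predict the training-set mean for every test item."""
    return [fmean(train_targets)] * n_test


# Made-up values, not taken from the shared task data.
train_difficulty = [0.2, 0.4, 0.6]
test_difficulty = [0.3, 0.7]
model_predictions = [0.35, 0.6]

baseline_rmse = rmse(test_difficulty, mean_baseline(train_difficulty, len(test_difficulty)))
model_rmse = rmse(test_difficulty, model_predictions)
print(f"baseline RMSE: {baseline_rmse:.3f}, model RMSE: {model_rmse:.3f}")
```

A system is only informative about the items if its RMSE falls below that of such a constant predictor.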

Submissions from each team are expected to be accompanied by a system report paper. To allow time for writing, the papers are due on March 10. The papers must use the official ACL style templates, which are available here. The accepted papers will be published as part of the official BEA proceedings (ACL Anthology). Both long and short papers are welcome. The system papers will be summarized and discussed in an overview paper.

We kindly request that team members be available to serve as reviewers for system report papers from other teams if needed. If you are unavailable for reviewing between March 11th and March 31st, please inform us promptly.


Of the 48 teams that registered, 17 submitted results. As can be seen from the leaderboard below, predicting item difficulty remains a highly challenging task, with the best result surpassing the DummyRegressor baseline by a minimal margin (an RMSE of 0.299 compared to 0.311). The task of predicting item response time was easier to address, with many teams outperforming the DummyRegressor baseline (the best RMSE was 23.927, compared to 31.68).

Track 1: Item Difficulty Prediction

DummyRegressor Baseline Difficulty: 0.311

Rank Team Name Target Run RMSE
1 EduTec Difficulty electra 0.299
2 UPN-ICC Difficulty run1 0.303
3 EduTec Difficulty roberta 0.304
4 ITEC Difficulty RandomForest 0.305
5 BC Difficulty ENSEMBLE 0.305
6 Scalar Difficulty Predictions 0.305
7 BC Difficulty FEAT 0.305
8 BC Difficulty ROBERTA 0.306
9 UnibucLLM Difficulty run1 0.308
10 EDU Difficulty Run3 0.308
11 EDU Difficulty Run1 0.308
12 ITEC Difficulty Ensemble 0.308
13 UNED Difficulty run3 0.308
14 Rishikesh Difficulty 1 0.31
15 Iran-Canada Difficulty run2 0.311
16 Baseline Difficulty DummyRegressor 0.311
17 EduTec Difficulty deberta 0.312
18 Iran-Canada Difficulty run3 0.313
19 SCaLARlab Difficulty run1 0.315
20 BRG Difficulty PubMedBert 0.318
21 Iran-Canada Difficulty run1 0.322
22 SCaLARlab Difficulty run2 0.322
23 ml4ed Difficulty run1 0.323
24 ml4ed Difficulty run3 0.325
25 edtec Difficulty run1 0.325
26 edtec Difficulty run3 0.326
27 edtec Difficulty run2 0.326
28 Pitt Difficulty run1 0.326
29 ml4ed Difficulty run2 0.328
30 UnibucLLM Difficulty run3 0.328
31 EDU Difficulty Run2 0.329
32 ED Difficulty run1 0.332
33 SCaLARlab Difficulty run3 0.336
34 UnibucLLM Difficulty run2 0.337
35 UNED Difficulty run1 0.337
36 BRG Difficulty run1 0.34
37 Daniel Difficulty run2 0.348
38 BRG Difficulty run2 0.348
39 ED Difficulty run2 0.353
40 UNED Difficulty run2 0.363
41 Daniel Difficulty run1 0.364
42 ED Difficulty run3 0.367
43 ITEC Difficulty BERT-ClinicalQA 0.393

Track 2: Response Time Prediction

DummyRegressor Baseline Response Time: 31.68

Rank Team Name Target Run RMSE
1 UNED Response_Time run2 23.927
2 ITEC Response_Time Lasso 24.116
3 UNED Response_Time run1 24.777
4 UNED Response_Time run3 25.365
5 EduTec Response_Time roberta 25.64
6 EduTec Response_Time electra 25.875
7 UnibucLLM Response_Time run3 26.073
8 ED Response_Time run1 26.57
9 Rishikesh Response_Time 1 26.651
10 UnibucLLM Response_Time run2 26.768
11 UnibucLLM Response_Time run1 26.846
12 SCaLARlab Response_Time run3 26.945
13 Scalar Response_Time predictions 26.982
14 EduTec Response_Time deberta 27.302
15 EDU Response_Time Run1 27.474
16 SCaLARlab Response_Time run2 27.481
17 EDU Response_Time Run3 28.191
18 Iran-Canada Response_Time run3 28.714
19 SCaLARlab Response_Time run1 28.768
20 Iran-Canada Response_Time run2 28.88
21 Iran-Canada Response_Time run1 29.394
22 Daniel Response_Time run1 29.967
23 UPN-ICC Response_Time run1 30.981
24 BRG Response_Time run2 31.48
25 Baseline Response_Time DummyRegressor 31.68
26 EDU Response_Time Run2 31.962
27 BRG Response_Time run1 31.996
28 ED Response_Time run2 33.281
29 BRG Response_Time PubMedBert 33.412
30 ED Response_Time run3 35.476
31 Daniel Response_Time run2 36.421
32 ITEC Response_Time BERT-ClinicalQA 53.844
33 ITEC Response_Time emrqa 54.719
34 Pitt Response_Time run1 70.488

Important Dates

Training data release: January 15

Test data release: February 10

Results due: February 16

Announcement of winners: February 21

Paper submissions due: March 10

Camera-ready papers due: April 22


Victoria Yaneva, National Board of Medical Examiners

Peter Baldwin, National Board of Medical Examiners

Kai North, George Mason University

Brian Clauser, National Board of Medical Examiners

Saed Rezayi, National Board of Medical Examiners

Yiyun Zhou, National Board of Medical Examiners

Le An Ha, Ho Chi Minh City University of Foreign Languages - Information Technology (HUFLIT)

Polina Harik, National Board of Medical Examiners


References


Baldwin, P., Yaneva, V., Mee, J., Clauser, B. E. and Ha, L. A. 2020. Using Natural Language Processing to Predict Item Response Times and Improve Test Construction. Journal of Educational Measurement, Wiley. DOI:

McCarthy, A. D., Yancey, K. P., LaFlair, G. T., Egbert, J., Liao, M. and Settles, B. 2021, November. Jump-starting item parameters for adaptive language tests. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 883-899).

Ha, L. A., Yaneva, V., Baldwin, P. and Mee, J. 2019. Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam. Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), held in conjunction with ACL 2019, Florence, Italy, 2 August, 2019.

Settles, B., LaFlair, G. T. and Hagiwara, M. 2020. Machine learning–driven language assessment. Transactions of the Association for Computational Linguistics, 8, pp. 247-263.

Xue, K., Yaneva, V., Runyon, C. and Baldwin, P., 2020, July. Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 193-197).

Yaneva, V., Ha, L. A., Baldwin, P. and Mee, J. 2020. Predicting Item Survival for Multiple Choice Questions in a High-stakes Medical Exam. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 6812-6818).

Yaneva, V., Jurich, D. P., Ha, L. A. and Baldwin, P. 2021. Using Linguistic Features to Predict the Response Process Complexity Associated with Answering Clinical MCQs. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 223-232).