BEA 2024 Shared Task
Automated Prediction of Item Difficulty and Item Response Time
Overview Paper
For a detailed overview of the shared task, the main solutions proposed by the teams, and their implications for educational assessment, please read the Shared Task overview paper referenced below.
Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions (Yaneva et al., BEA 2024)
Motivation
For standardized exams to be fair and valid, test items must meet certain criteria. One important criterion is that the questions should cover a wide range of difficulty levels to gather information about the abilities of test takers effectively. Additionally, it is essential to allocate an appropriate amount of time for each question: too little time can make the exam speeded, while too much time can make it inefficient. Typically, item difficulty and response time data are collected via a process called pretesting, where new items are embedded in live exams alongside scored items. While robust, this process of collecting item characteristics data is time-consuming and expensive. As noted by Settles et al. (2020), “This labor-intensive process often restricts the number of items that can feasibly be created, which in turn poses a threat to security: Items may be copied and leaked, or simply used too often”.
To address this challenge (also referred to as the “cold-start parameter estimation problem” (McCarthy et al., 2021)), there is growing interest in predicting item characteristics such as difficulty and response time based on the item text. Such estimates can be used to “jump-start” parameter estimation by exposing the item to a smaller sample of test-takers, or improve fairness by reducing the time variance for forms that include pretest items (Baldwin et al., 2020).
Due to difficulties with sharing exam data, efforts to advance the state-of-the-art in item parameter prediction have been fragmented and conducted in individual institutions, with no transparent evaluation on a publicly available dataset. In this Shared Task, we bridge this gap by sharing practice item content and characteristics from a high-stakes medical exam called the United States Medical Licensing Examination® (USMLE®) for the exploration of two topics: predicting item difficulty (Track 1) and item response time (Track 2) based on item text.
Data
The data for the Shared Task consists of 667 previously used and now retired Multiple Choice Questions (MCQs) from USMLE Steps 1, 2 CK, and 3. The USMLE is a series of examinations (called Steps) to support medical licensure decisions in the United States that is developed by the National Board of Medical Examiners (NBME) and Federation of State Medical Boards (FSMB). An example practice item from the dataset is given in Table 1.
| Q | A 65-year-old woman comes to the physician for a follow-up examination after blood pressure measurements were 175/105 mm Hg and 185/110 mm Hg 1 and 3 weeks ago, respectively. She has well-controlled type 2 diabetes mellitus. Her blood pressure now is 175/110 mm Hg. Physical examination shows no other abnormalities. Antihypertensive therapy is started, but her blood pressure remains elevated at her next visit 3 weeks later. Laboratory studies show increased plasma renin activity; the erythrocyte sedimentation rate and serum electrolytes are within the reference ranges. Angiography shows a high-grade stenosis of the proximal right renal artery; the left renal artery appears normal. Which of the following is the most likely diagnosis? |
|---|---|
| (A) | Atherosclerosis |
| (B) | Congenital renal artery hypoplasia |
| (C) | Fibromuscular dysplasia |
| (D) | Takayasu arteritis |
| (E) | Temporal arteritis |
Table 1: An example of a practice item from the USMLE Step 1 Sample Test Questions at usmle.org
The part describing the case is referred to as the stem, the correct answer as the key, and the incorrect answer options as distractors. All items are MCQs that test medical knowledge and were written by experienced subject matter experts following a set of guidelines stipulating adherence to a standard structure. These guidelines require avoidance of “window dressing” (extraneous material not needed to answer the item), “red herrings” (information designed to mislead the test-taker), and grammatical cues (e.g., correct answers that are longer or more specific than the other options). The goal of standardizing items in this manner is to produce items that vary in their difficulty and discriminating power due only to differences in the medical content they assess.
The items were administered within a standard nine-hour exam. For this shared task, the item characteristic data was derived from first-time examinees from accredited US and Canadian medical schools.
Each item is tagged with the following item characteristics:
- Item difficulty: A measure of item difficulty where higher values indicate more difficult items.
- Time intensity: The arithmetic mean response time, measured in seconds, across all examinees who attempted a given item in a live exam. This includes all time spent on the item from the moment it is presented on the screen until the examinee moves to the next item, as well as any revisits.
The data is structured as follows:
ItemNum<\t>ItemStem_Text<\t>Answer__A<\t>Answer__B<\t>Answer__C<\t>Answer__D<\t>Answer__E<\t>Answer__F<\t>Answer__G<\t>Answer__H<\t>Answer__I<\t>Answer__J<\t>Answer_Key<\t>Answer_Text<\t>ItemType<\t>EXAM<\t>Difficulty<\t>Response_Time
ItemNum denotes the consecutive number of the item in the dataset (e.g., 1, 2, 3, etc.).
ItemStem_Text contains the text data for the item stem (the part of the item describing the clinical case).
Answer__A contains the text for response option A.
Answer__B contains the text for response option B.
Answer__C contains the text for response option C.
(…)
Answer__J contains the text for response option J. For items with fewer than ten response options, the remaining columns are left blank; for example, if an item contains response options A to E, the fields for columns F to J are left blank for that item.
Answer_Key contains the letter of the correct answer for that item.
Answer_Text contains the text of the correct response for the item.
ItemType denotes whether the item contained an image (e.g., an x-ray image, picture of a skin lesion, etc.) or not. The value “Text” denotes text-only items that do not contain images and the value “PIX” denotes items that contain an image. Note that the images are not part of the dataset.
EXAM denotes the Step of the USMLE exam the item belongs to (Step 1, Step 2, or Step 3). For more information on the Steps of the USMLE see https://www.usmle.org/step-exams.
Difficulty contains the item difficulty measure. Higher values indicate more difficult items.
Response_Time contains the mean response time for the item measured in seconds.
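For convenience, a minimal sketch of loading this tab-separated format with pandas is shown below; the file name is a placeholder, and the column names follow the header described above.

```python
import pandas as pd

# Placeholder file name; the actual training file is distributed by the organizers.
df = pd.read_csv("bea2024_training_data.tsv", sep="\t")

# Answer-option columns that are unused for a given item are blank in the file
# and appear as NaN after loading.
option_cols = [f"Answer__{letter}" for letter in "ABCDEFGHIJ"]

# One simple text representation: the stem followed by all available options.
df["FullText"] = (
    df["ItemStem_Text"].fillna("")
    + " "
    + df[option_cols].fillna("").agg(" ".join, axis=1)
)

print(df[["ItemNum", "ItemType", "EXAM", "Difficulty", "Response_Time"]].head())
```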
Prior work related to modeling item difficulty and time intensity for clinical MCQs from the USMLE includes the following articles: Ha et al. (2019), Baldwin et al. (2020), Yaneva et al. (2020), Xue et al. (2020), Yaneva et al. (2021) (see the References section below).
Participation
We frame the proposed task in two separate tracks as follows:
- Track 1: Given the item text and metadata, predict the item difficulty variable.
- Track 2: Given the item text and metadata, predict the time intensity variable.
Use of one target variable in the prediction of the other is not allowed, since at the time an item is written neither the difficulty nor the time intensity parameter is available.
The use of training data outside of the specified training set is allowed, provided that it is publicly available.
In both tracks, the evaluation will be based on the Root Mean Squared Error metric (RMSE).
Registration
Registration is now closed. For questions, please email Victoria Yaneva at vyaneva@nbme.org
Data Access
Access to the data for the purposes of the Shared Task is no longer available. If you wish to request access to the data for a different research study, please follow the application process outlined at https://www.nbme.org/bea-2024-shared-task-data. For questions, please contact ORS@nbme.org.
If you publish results using this data, please cite the Shared Task overview paper as follows: Findings from the First Shared Task on Automated Prediction of Difficulty and Response Time for Multiple-Choice Questions (Yaneva et al., BEA 2024)
Submission and Evaluation
Submissions must be separate .csv files for each track, named “Difficulty_Predictions.csv” and “Response_time_predictions.csv”, respectively. Each submission should contain the item number (Item_num) and the predicted value, as in the following example:
Item_num, Prediction
143, 0.9
423, 0.1
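As an illustration only (the item numbers and prediction values below are the hypothetical ones from the example above), a file in this format can be written with pandas as follows:

```python
import pandas as pd

# Hypothetical predictions for two test items.
submission = pd.DataFrame({
    "Item_num": [143, 423],
    "Prediction": [0.9, 0.1],
})

# index=False keeps only the two required columns in the output file.
submission.to_csv("Difficulty_Predictions.csv", index=False)
```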
Teams can submit up to three attempts for each track, differentiated by adding run1, run2, or run3 to the name of the uploaded .csv file. However, we ask that participants explain in their system report paper how these submissions differ, e.g., changes in methodology, parameters, models used, or prediction strategy.
There will be two separate leaderboards for Track 1 and Track 2. In both, submissions are ranked according to the Root Mean Squared Error metric from Python’s scikit-learn library.
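For reference, a minimal sketch of how this metric can be computed with scikit-learn is given below; the gold and predicted values are placeholders.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder gold-standard and predicted difficulty values.
y_true = np.array([0.55, 0.72, 0.31, 0.64])
y_pred = np.array([0.60, 0.70, 0.40, 0.58])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f}")
```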
Submissions from each team are expected to be accompanied by a system report paper. To allow time for writing, the papers are due on March 10. The papers must use the official ACL style templates, which are available here. The accepted papers will be published as part of the official BEA proceedings (ACL Anthology). Both long and short papers are welcome. The system papers will be summarized and discussed in an overview paper.
We kindly request that team members be available to serve as reviewers for system report papers from other teams if needed. If you are unavailable for reviewing between March 11th and March 31st, please inform us promptly.
Results
Of the 48 teams that registered, 17 submitted results. As can be seen from the leaderboards below, predicting item difficulty remains a highly challenging task, with the best submission surpassing the DummyRegressor baseline by only a small margin (RMSE of 0.299 compared to 0.311). Predicting item response time proved more tractable, with many teams outperforming the DummyRegressor baseline (best RMSE of 23.927 compared to 31.68).
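As a sketch of how a baseline of this kind can be reproduced (not the organizers' exact script; the labels below are placeholders), scikit-learn's DummyRegressor with the default mean strategy always predicts the mean of the training labels:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Placeholder training and test labels (e.g., the Difficulty column of each split).
y_train = np.array([0.45, 0.62, 0.71, 0.50, 0.58])
y_test = np.array([0.66, 0.48, 0.59])

# The features are ignored by DummyRegressor, so dummy zero arrays suffice here.
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train), 1)), y_train)
y_pred = baseline.predict(np.zeros((len(y_test), 1)))

print("Baseline RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
```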
Track 1: Item Difficulty Prediction
DummyRegressor Baseline Difficulty: 0.311
Rank | Team Name | Target | Run | RMSE |
---|---|---|---|---|
1 | EduTec | Difficulty | electra | 0.299 |
2 | UPN-ICC | Difficulty | run1 | 0.303 |
3 | EduTec | Difficulty | roberta | 0.304 |
4 | ITEC | Difficulty | RandomForest | 0.305 |
5 | BC | Difficulty | ENSEMBLE | 0.305 |
6 | Scalar | Difficulty | Predictions | 0.305 |
7 | BC | Difficulty | FEAT | 0.305 |
8 | BC | Difficulty | ROBERTA | 0.306 |
9 | UnibucLLM | Difficulty | run1 | 0.308 |
10 | EDU | Difficulty | Run3 | 0.308 |
11 | EDU | Difficulty | Run1 | 0.308 |
12 | ITEC | Difficulty | Ensemble | 0.308 |
13 | UNED | Difficulty | run3 | 0.308 |
14 | Rishikesh | Difficulty | 1 | 0.31 |
15 | Iran-Canada | Difficulty | run2 | 0.311 |
16 | Baseline | Difficulty | DummyRegressor | 0.311 |
17 | EduTec | Difficulty | deberta | 0.312 |
18 | Iran-Canada | Difficulty | run3 | 0.313 |
19 | SCaLARlab | Difficulty | run1 | 0.315 |
20 | BRG | Difficulty | PubMedBert | 0.318 |
21 | Iran-Canada | Difficulty | run1 | 0.322 |
22 | SCaLARlab | Difficulty | run2 | 0.322 |
23 | ml4ed | Difficulty | run1 | 0.323 |
24 | ml4ed | Difficulty | run3 | 0.325 |
25 | edtec | Difficulty | run1 | 0.325 |
26 | edtec | Difficulty | run3 | 0.326 |
27 | edtec | Difficulty | run2 | 0.326 |
28 | Pitt | Difficulty | run1 | 0.326 |
29 | ml4ed | Difficulty | run2 | 0.328 |
30 | UnibucLLM | Difficulty | run3 | 0.328 |
31 | EDU | Difficulty | Run2 | 0.329 |
32 | ED | Difficulty | run1 | 0.332 |
33 | SCaLARlab | Difficulty | run3 | 0.336 |
34 | UnibucLLM | Difficulty | run2 | 0.337 |
35 | UNED | Difficulty | run1 | 0.337 |
36 | BRG | Difficulty | run1 | 0.34 |
37 | Daniel | Difficulty | run2 | 0.348 |
38 | BRG | Difficulty | run2 | 0.348 |
39 | ED | Difficulty | run2 | 0.353 |
40 | UNED | Difficulty | run2 | 0.363 |
41 | Daniel | Difficulty | run1 | 0.364 |
42 | ED | Difficulty | run3 | 0.367 |
43 | ITEC | Difficulty | BERT-ClinicalQA | 0.393 |
Track 2: Response Time Prediction
DummyRegressor Baseline Response Time: 31.68
Rank | Team Name | Target | Run | RMSE |
---|---|---|---|---|
1 | UNED | Response_Time | run2 | 23.927 |
2 | ITEC | Response_Time | Lasso | 24.116 |
3 | UNED | Response_Time | run1 | 24.777 |
4 | UNED | Response_Time | run3 | 25.365 |
5 | EduTec | Response_Time | roberta | 25.64 |
6 | EduTec | Response_Time | electra | 25.875 |
7 | UnibucLLM | Response_Time | run3 | 26.073 |
8 | ED | Response_Time | run1 | 26.57 |
9 | Rishikesh | Response_Time | 1 | 26.651 |
10 | UnibucLLM | Response_Time | run2 | 26.768 |
11 | UnibucLLM | Response_Time | run1 | 26.846 |
12 | SCaLARlab | Response_Time | run3 | 26.945 |
13 | Scalar | Response_Time | predictions | 26.982 |
14 | EduTec | Response_Time | deberta | 27.302 |
15 | EDU | Response_Time | Run1 | 27.474 |
16 | SCaLARlab | Response_Time | run2 | 27.481 |
17 | EDU | Response_Time | Run3 | 28.191 |
18 | Iran-Canada | Response_Time | run3 | 28.714 |
19 | SCaLARlab | Response_Time | run1 | 28.768 |
20 | Iran-Canada | Response_Time | run2 | 28.88 |
21 | Iran-Canada | Response_Time | run1 | 29.394 |
22 | Daniel | Response_Time | run1 | 29.967 |
23 | UPN-ICC | Response_Time | run1 | 30.981 |
24 | BRG | Response_Time | run2 | 31.48 |
25 | Baseline | Response_Time | DummyRegressor | 31.68 |
26 | EDU | Response_Time | Run2 | 31.962 |
27 | BRG | Response_Time | run1 | 31.996 |
28 | ED | Response_Time | run2 | 33.281 |
29 | BRG | Response_Time | PubMedBert | 33.412 |
30 | ED | Response_Time | run3 | 35.476 |
31 | Daniel | Response_Time | run2 | 36.421 |
32 | ITEC | Response_Time | BERT-ClinicalQA | 53.844 |
33 | ITEC | Response_Time | emrqa | 54.719 |
34 | Pitt | Response_Time | run1 | 70.488 |
Important Dates
Training data release: January 15
Test data release: February 10
Results due: February 16
Announcement of winners: February 21
Paper submissions due: March 10
Camera-ready papers due: April 22
Organizers
Victoria Yaneva, National Board of Medical Examiners
Peter Baldwin, National Board of Medical Examiners
Kai North, George Mason University
Brian Clauser, National Board of Medical Examiners
Saed Rezayi, National Board of Medical Examiners
Yiyun Zhou, National Board of Medical Examiners
Le An Ha, Ho Chi Minh City University of Foreign Languages - Information Technology (HUFLIT)
Polina Harik, National Board of Medical Examiners
References
Baldwin, P., Yaneva, V., Mee, J., Clauser, B. E., and Ha, L. A. 2020. Using Natural Language Processing to Predict Item Response Times and Improve Test Construction. Journal of Educational Measurement, Wiley. DOI: https://doi.org/10.1111/jedm.12264
McCarthy, A. D., Yancey, K. P., LaFlair, G. T., Egbert, J., Liao, M., and Settles, B. 2021. Jump-Starting Item Parameters for Adaptive Language Tests. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 883-899.
Ha, L. A., Yaneva, V., Baldwin, P., and Mee, J. 2019. Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), held in conjunction with ACL 2019, Florence, Italy.
Settles, B., LaFlair, G. T., and Hagiwara, M. 2020. Machine Learning-Driven Language Assessment. Transactions of the Association for Computational Linguistics, 8, pp. 247-263.
Xue, K., Yaneva, V., Runyon, C., and Baldwin, P. 2020. Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 193-197.
Yaneva, V., Ha, L. A., Baldwin, P., and Mee, J. 2020. Predicting Item Survival for Multiple Choice Questions in a High-stakes Medical Exam. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6812-6818.
Yaneva, V., Jurich, D. P., Ha, L. A., and Baldwin, P. 2021. Using Linguistic Features to Predict the Response Process Complexity Associated with Answering Clinical MCQs. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 223-232.