Criterion-referenced adaptive university exams: Effects of different linking designs on ability estimates

Aron Fink, Sebastian Born, Andreas Frey, & Christian Spoden

Session 2B, 12:45 - 14:15, VIA

The increasing digitalization of the educational sector opens up new opportunities not only for the teaching process, but also for the design of written university exams. Digital technologies make it possible to use innovative item formats and have the potential to increase the efficiency of scoring and data handling. Furthermore, and even more importantly from a scientific point of view, the shift to digital technology for testing purposes in higher education provides the opportunity to implement state-of-the-art methods from Psychometrics and Educational Measurement in day-to-day practice. In particular, criterion-referenced computerized adaptive testing (CR-CAT) has the potential to make university exams more individualized, more accurate, and fairer. From a practical point of view, however, the calibration of the item pool needed for CR-CAT poses a critical challenge, since a separate calibration study is often not feasible and/or the sample sizes of university exams are too small to allow stable estimation of item parameters. We therefore suggest a new method for continuous item pool calibration during the operational CR-CAT phase. This method enables a step-by-step build-up of the item pool across several time points without a separate calibration study. To keep the scale constant across time points, link items are used. Due to the novelty of the method, the impact of the proportion of link items used and of their difficulty distribution on the quality of the person ability estimates (θ) is unclear. To shed light on this, a simulation study based on a fully crossed design with the four factors "proportion of link items" (1/6, 1/4, 1/3 of test length), "difficulty distribution of link items" (normal, uniform, bimodal with only very low and very high difficulties), "test length" (36, 48, 60 items), and "sample size" per time point (50, 100, 300) is being carried out. Evaluation criteria for the quality of the θ estimates are the bias conditional on θ and the standard error of θ conditional on θ. The study is currently running, but will be completed before the conference. Regarding the results, we expect that a higher proportion of extremely difficult link items will reduce both bias and standard error for persons at the margins of the ability distribution. Longer tests and larger sample sizes should lead to less bias and lower standard errors for all persons.
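The two evaluation criteria can be operationalized as the mean estimation error and the empirical standard deviation of the θ estimates within narrow bands of the true θ scale, computed separately for each cell of the fully crossed design. The Python sketch below illustrates one plausible way to set up the 3 × 3 × 3 × 3 design grid and these conditional criteria; the variable names, the binning of the θ scale, and the placeholder data are illustrative assumptions, not the authors' actual simulation code.

```python
import itertools
import numpy as np

# Fully crossed simulation design: 3 x 3 x 3 x 3 = 81 conditions.
link_proportions = [1/6, 1/4, 1/3]                 # proportion of link items
link_difficulty  = ["normal", "uniform", "bimodal"]  # difficulty distribution of link items
test_lengths     = [36, 48, 60]                    # number of items per exam
sample_sizes     = [50, 100, 300]                  # examinees per time point

conditions = list(itertools.product(link_proportions, link_difficulty,
                                    test_lengths, sample_sizes))

def conditional_bias_and_se(theta_true, theta_hat, bin_edges):
    """Bias and empirical standard error of theta estimates conditional on
    true theta, computed within bins of the true ability scale."""
    centers, bias, se = [], [], []
    bin_index = np.digitize(theta_true, bin_edges)
    for b in range(1, len(bin_edges)):
        mask = bin_index == b
        if mask.sum() < 2:          # skip bins with too few replications
            continue
        err = theta_hat[mask] - theta_true[mask]
        centers.append(0.5 * (bin_edges[b - 1] + bin_edges[b]))
        bias.append(err.mean())     # conditional bias
        se.append(err.std(ddof=1))  # conditional (empirical) standard error
    return np.array(centers), np.array(bias), np.array(se)

# Placeholder data for illustration only; in the study the estimates would
# come from the adaptive testing and calibration simulation.
rng = np.random.default_rng(1)
theta_true = rng.normal(size=1000)
theta_hat = theta_true + rng.normal(scale=0.3, size=1000)
grid = np.linspace(-3, 3, 13)
centers, bias, se = conditional_bias_and_se(theta_true, theta_hat, grid)
```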
