Resetting the standard with IRT concurrent calibration and circle-arc equating in small samples

Monika Vaheoja, Norman D. Verhelst, & Theo J. H. M. Eggen

Session 2B, 12:45 - 14:15, VIA

Resetting a performance standard statistically for tests with small samples is challenging because small-sample statistics often contain bias caused by sampling error. In practice, therefore, standard-setting procedures that rely on expert judgment, such as Angoff (1971), are applied, and the empirical information that could be used to reset the standard statistically is neglected. But standard-setting methods based on expert judgment are biased too, and often expensive (Cizek & Bunch, 2007).

Livingston and Kim (2009) proposed circle-arc equating for small samples. This method assumes a curvilinear relationship between the reference test and the new test, which prevents scores from being transformed beyond the range of possible scores. Several studies have shown promising results for the circle-arc method (Dwyer, 2016; LaFlair, Isbell, May, Gutierrez & Jamieson, 2015), but because it is a solution within the classical test theory approach, it has its limitations too, especially in contexts where population ability and test difficulty interact. In the latter case, Item Response Theory (IRT) outperforms classical test theory, but until now it has not been advised for small samples (Kolen & Brennan, 2014).
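To make the geometry concrete, the following is a minimal Python sketch of the symmetric circle-arc idea: the equating function is the circle through the two endpoints of the score range and an empirical middle point (here taken from mean equating). The function names, variable names, and example numbers are ours for illustration, not Livingston and Kim's.

    import numpy as np

    def circle_through(p1, p2, p3):
        # Centre (xc, yc) and radius of the circle through three points.
        (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
        a = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
                      [2 * (x3 - x1), 2 * (y3 - y1)]])
        b = np.array([x2**2 - x1**2 + y2**2 - y1**2,
                      x3**2 - x1**2 + y3**2 - y1**2])
        xc, yc = np.linalg.solve(a, b)
        return xc, yc, np.hypot(x1 - xc, y1 - yc)

    def circle_arc_equate(x, low, mid, high):
        # Map new-form scores x to the reference scale along the arc
        # through the lowest point, the empirical middle point, and the
        # highest point; the arc cannot leave the possible score range.
        xc, yc, r = circle_through(low, mid, high)
        sign = 1.0 if mid[1] > yc else -1.0  # branch containing the mid point
        return yc + sign * np.sqrt(r**2 - (np.asarray(x, float) - xc)**2)

    # Hypothetical 40-item forms: new-form mean 24 equated to reference mean 26
    equated = circle_arc_equate(np.arange(41), (0, 0), (24, 26), (40, 40))

By construction the arc passes through (0, 0) and (40, 40), so every transformed score stays within the range of possible scores, which is the property the method was designed for.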

IRT is a theory about the responses of participants to a given test or exam. In this theory, the probability that a respondent answers an item correctly is modeled under the assumption that the score on an item depends on the ability of the respondent and on the item characteristics. One of the IRT models is the One Parameter Logistic Model (OPLM; Verhelst & Glas, 1995). In OPLM, the item difficulty parameters are estimated with conditional maximum likelihood, which means that no assumption about the population ability has to be made and the sample does not have to be representative of the population.
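In a common notation (ours, not necessarily that of Verhelst & Glas), the OPLM gives the probability of a correct response as

\[
P(X_{ij} = 1 \mid \theta_j) \;=\; \frac{\exp\bigl(a_i(\theta_j - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_j - b_i)\bigr)},
\]

where \(\theta_j\) is the ability of person \(j\), \(b_i\) the difficulty of item \(i\), and \(a_i\) a fixed integer discrimination index rather than an estimated parameter. Because the \(a_i\) are fixed, the weighted score \(\sum_i a_i X_{ij}\) is a sufficient statistic for \(\theta_j\); conditioning on it removes the person parameters from the likelihood, which is what makes conditional maximum likelihood estimation of the \(b_i\) possible without population assumptions.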

In the present simulation study, we compare circle-arc equating and IRT concurrent calibration with OPLM in transferring a cut-score from a reference test to a new test in three different contexts: in the first, the reference-sample item parameters are fixed during calibration; in the second, they are estimated freely; and in the third, the population ability on the new test is varied with a low- and a high-ability group. Within each context, data are simulated while varying sample size, test length, and test difficulty (one such cell is sketched below). The results demonstrate that even in small samples (50 subjects taking both tests), the IRT method outperforms the classical test theory approach when test difficulty and population ability interact. The discussion includes suggestions for further research, such as the influence of the anchor test and of the reliability of the tests on the equating.
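To illustrate one cell of the design, here is a minimal data-generation sketch in Python under an OPLM-type model; the parameter values, form lengths, and group means are hypothetical choices of ours, not the ones used in the study.

    import numpy as np

    rng = np.random.default_rng(2018)

    def simulate_oplm(n_persons, b, a, mu=0.0, sigma=1.0):
        # Dichotomous responses: P(correct) follows a logistic curve in
        # a_i * (theta_j - b_i), with fixed integer discriminations a_i.
        theta = rng.normal(mu, sigma, size=(n_persons, 1))
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return (rng.uniform(size=p.shape) < p).astype(int)

    # Small-sample cell: 50 persons per form, 20 items per form,
    # a harder new form taken by a lower-ability group.
    ref = simulate_oplm(50, b=np.linspace(-1.0, 1.0, 20), a=np.ones(20, int))
    new = simulate_oplm(50, b=np.linspace(-0.5, 1.5, 20), a=np.ones(20, int),
                        mu=-0.5)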
