Article

A Comparison of Linking Methods for Two Groups for the Two-Parameter Logistic Item Response Model in the Presence and Absence of Random Differential Item Functioning

1 IPN–Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany
2 Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany
Submission received: 12 July 2021 / Revised: 3 September 2021 / Accepted: 10 September 2021 / Published: 15 September 2021
(This article belongs to the Section Mathematical Sciences)

Abstract

This article investigates the comparison of two groups based on the two-parameter logistic item response model. It is assumed that there is random differential item functioning (DIF) in item difficulties and item discriminations. The group difference is estimated using separate calibration with subsequent linking, as well as concurrent calibration. The following linking methods are compared: mean-mean linking, log-mean-mean linking, invariance alignment, Haberman linking, asymmetric and symmetric Haebara linking, different recalibration linking methods, anchored item parameters, and concurrent calibration. It is analytically shown that log-mean-mean linking and mean-mean linking provide consistent estimates if random DIF effects have zero means. The performance of the linking methods was evaluated in a simulation study. It turned out that (log-)mean-mean and Haberman linking performed best, followed by symmetric Haebara linking and a newly proposed recalibration linking method. Interestingly, linking methods frequently found in applications (i.e., asymmetric Haebara linking, the recalibration linking variant used in current large-scale assessment studies, anchored item parameters, and concurrent calibration) performed worse in the presence of random DIF. In line with the previous literature, differences between linking methods turned out to be negligible in the absence of random DIF. The different linking methods were also applied in an empirical example that linked PISA 2006 to PISA 2009 for Austrian students. This application showed that the estimated trends in the means and standard deviations depended on the chosen linking method and the employed item response model.

1. Introduction

The analysis of educational and psychological tests is an important field in the social sciences. The test items (i.e., the tasks presented in these tests) are often modeled using item response theory (IRT; [1,2,3]; for applications, see, e.g., [4,5,6,7,8,9,10,11,12,13,14]) models. In this article, the two-parameter logistic (2PL; [15]) IRT model is investigated to compare two groups on test items. For example, groups could be demographic groups, countries, studies, or time points. Group comparisons are carried out using linking methods [16]. A significant obstacle in applying linking methods is that the test items can behave differently in the two groups (i.e., differential item functioning); that is, it cannot be expected that the two groups share a common set of statistical parameters for the test items. Such a situation is particularly important in educational large-scale assessment (LSA; [17,18,19]) studies in which several countries are compared. It can be expected that test items function differently because there are curricular differences between those countries.
The paper is structured as follows. In Section 2, the 2PL model with differential item functioning is introduced. In Section 3, several linking methods are discussed. In Section 4, we present the results of a simulation study in which these linking methods are compared. In Section 5, an empirical example using the PISA datasets from 2006 and 2009 for Austria is presented. Finally, the paper closes with a discussion in Section 6.

2. Linking Two Groups with the 2PL Model

2.1. 2PL Model

For dichotomous items $i = 1, \ldots, I$, the item response function (IRF) of the 2PL model is given by [3,20]:
$P(X_i = x \mid \theta) = P_i(x; a_i, b_i) = \Psi\big( (2x - 1)\, a_i (\theta - b_i) \big), \quad x = 0, 1, \qquad (1)$
where $\Psi(y) = \exp(y) / (1 + \exp(y))$ is the logistic link function, $a_i$ is the item discrimination, and $b_i$ is the item difficulty. The one-parameter logistic (1PL) IRT model (also referred to as the Rasch model; [21]) fixes all item discriminations $a_i$ to one.
The 1PL or the 2PL model is usually estimated under a local independence assumption, that is:
$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^{I} P_i(x_i; a_i, b_i) \, f(\theta) \, \mathrm{d}\theta, \qquad (2)$
where $f$ is the density of $\theta$ and $\mathbf{x} = (x_1, \ldots, x_I)$. In many applications, $\theta$ is assumed to be normally distributed (i.e., $\theta \sim N(\mu, \sigma^2)$). In (2), the multivariate contingency table of $\mathbf{X}$ with corresponding probabilities $P(\mathbf{X} = \mathbf{x})$ involving $2^I$ item response patterns is summarized by a unidimensional latent variable $\theta$.
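The marginal probability in (2) can be sketched numerically. The following minimal illustration (not the authors' implementation; function names and item parameters are made up) approximates the integral with Gauss-Hermite quadrature:

```python
import numpy as np

def irf_2pl(theta, a, b, x):
    # P(X_i = x | theta) = Psi((2x - 1) * a * (theta - b)), x in {0, 1}
    return 1.0 / (1.0 + np.exp(-(2 * x - 1) * a * (theta - b)))

def pattern_probability(x, a, b, mu=0.0, sigma=1.0, n_nodes=61):
    # Approximate P(X = x) = int prod_i P_i(x_i; a_i, b_i) f(theta) dtheta
    # for a N(mu, sigma^2) ability density via probabilists' Gauss-Hermite
    # quadrature (weight function exp(-t^2 / 2)).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    theta = mu + sigma * nodes               # quadrature nodes on the theta scale
    w = weights / np.sqrt(2.0 * np.pi)       # normalized weights (sum to one)
    like = np.ones_like(theta)
    for xi, ai, bi in zip(x, a, b):
        like *= irf_2pl(theta, ai, bi, xi)
    return float(np.sum(w * like))

# Under local independence, the 2^I pattern probabilities sum to one.
a, b = [1.0, 1.2], [0.0, 0.5]
total = sum(pattern_probability([x1, x2], a, b)
            for x1 in (0, 1) for x2 in (0, 1))
```

For a single item with $b_i = 0$ and a standard normal ability distribution, the marginal probability of a correct response is 0.5 by symmetry, which provides a quick sanity check of the quadrature.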

2.2. Linking Design

To assess distributional differences between two groups, an appropriate linking design must be established. Such a linking design is displayed in Figure 1. For the two groups $g = 1, 2$, there is a set $\mathcal{I}_0$ of common items (also referred to as anchor items or link items) that are administered in both groups. There are also group-specific items $\mathcal{I}_g$ that are uniquely administered in each of the two groups.
This linking design is also referred to as a common items nonequivalent group design [22]. In equating, no common items exist in many applications, but two test forms are administered to equivalent groups [23]. Equating is often performed with the goal of producing an equivalency table for the sum score in a test, while linking determines linking constants in order to identify group differences with respect to the latent variable θ . In this article, we are interested in determining distributional differences between the two groups for the latent variable θ .
It has to be emphasized that differences between the two groups are mainly determined by the set of common items if the linking relies on an IRT model. The employed IRT models can predict the expected performance of students on items that were not administered. The crucial assumption is that item responses on nonadministered unique items can be inferred (or imputed) from common items. This corresponds to an ignorability assumption in the missing data literature and translates into the conditional independence of item responses on unique items conditional on item responses on common items (see [24,25]). If a unidimensional IRT model with a local independence assumption (2) holds, this condition is automatically fulfilled.

2.3. Random Differential Item Functioning

Common items in the set $\mathcal{I}_0$ are administered in both groups $g = 1, 2$ (see Figure 1). It is assumed that the 2PL model (see Section 2.1) holds in both groups. The situation in which item parameters do not differ between groups is called measurement invariance [26,27,28]. However, it is likely in many applications that items function differently in the two groups. The existence of group-specific item parameters is labeled differential item functioning (DIF; [29,30,31,32,33]; for applications, see, e.g., [34,35,36,37,38]). In this case, items possess group-specific item discriminations $a_{ig}$ and item difficulties $b_{ig}$. DIF in item discriminations is referred to as nonuniform DIF (NUDIF), while DIF in item difficulties is referred to as uniform DIF (UDIF) if there is no DIF in item discriminations (see [32,39,40]). Note that the local independence assumption (2) holds within each group even though the item parameters can differ across groups.
For the rest of this article, we assume random DIF [41,42,43,44,45,46,47,48,49,50]. The main idea is that the difference of item parameters between groups is modeled by a distribution. Hence, a population perspective on a universe of items is adopted. Random DIF for item discriminations $a_{ig}$ ($i = 1, \ldots, I$; $g = 1, 2$) is defined as:
$a_{i1} = a_i - f_i \quad \text{and} \quad a_{i2} = a_i + f_i, \qquad (3)$
where the DIF effects $f_i$ are considered as random variables that follow a distribution $F_f$. The common item discriminations $a_i$ can be considered either fixed or random. In the following, we consider them as fixed. Item difficulties $b_{ig}$ are also subject to random DIF, with DIF effects $e_i$ that follow a distribution $F_e$:
$b_{i1} = b_i - e_i \quad \text{and} \quad b_{i2} = b_i + e_i, \qquad (4)$
where $b_i$ are common item difficulties.
It is important to emphasize that there is an inherent unidentifiability in the simultaneous determination of average DIF effects and average group differences [51,52,53,54]. Hence, some structural assumptions must be posed on the distributions of DIF effects $F_e$ and $F_f$. One possible choice is to assume $E(e_i) = E(f_i) = 0$; that is, random DIF is described by unsystematic differences in the item parameters (see Appendix A for further details). A frequent assumption is that DIF effects are normally distributed with zero means:
$e_i \sim N(0, \tau_b^2) \quad \text{and} \quad f_i \sim N(0, \tau_a^2). \qquad (5)$
It is impossible to assume means of DIF effects different from zero because average DIF effects are confounded with average group differences. Note that it holds that $a_{i2} - a_{i1} = 2 f_i \sim N(0, 4 \tau_a^2)$ and $b_{i2} - b_{i1} = 2 e_i \sim N(0, 4 \tau_b^2)$.
As an alternative, a sparsity assumption might be posed on DIF effects. In this case, the majority of items have DIF effects of zero, while only a few items have DIF effects different from zero [44,55]. If DIF effects are considered as fixed, this situation is known as partial invariance [56,57,58]. The random DIF distribution is a mixture distribution with two classes: one class with zero effects and the other class containing DIF effects different from zero. The mixture distribution can be simultaneously estimated in an IRT model with two groups ([55]; see also [59]). Alternatively, DIF detection methods can be used to identify items whose DIF effects differ from zero [32,60,61]. These items can be removed from linking in subsequent analysis (see, e.g., [62]). However, some research has shown that simultaneous treatment of the linking and modeling of DIF effects can result in superior statistical performance [54,63,64,65,66].
We would also like to point out that the DIF effects in item discriminations in Equation (3) follow an additive model. Alternatively, a multiplicative model for DIF effects can be assumed:
$a_{i1} = a_i / f_i \quad \text{and} \quad a_{i2} = a_i \cdot f_i, \qquad (6)$
which corresponds to an additive model in logarithmized item discriminations:
$\log a_{i1} = \log a_i - \log f_i \quad \text{and} \quad \log a_{i2} = \log a_i + \log f_i. \qquad (7)$
In empirical applications, it is hard to decide whether DIF effects for item discriminations should be modeled in the untransformed metric (see Equation (3)) or in a logarithmized metric (see Equation (7)).
The simulation study only considered normally distributed DIF effects with zero means and an additive model for DIF effects in item discriminations.

2.3.1. Identified Item Parameters in Separate Calibrations in the Two Groups

In Section 3, we discuss linking methods that rely on item parameters obtained from separate calibrations; that is, the 2PL model is separately fitted for the two groups. For reasons of identifiability, it has to be assumed that $\mu_1 = 0$ and $\sigma_1 = 1$ in the first group. In a separate estimation for the first group with an infinite sample size, the estimated item discriminations $\hat{a}_{i1}$ are equal to the data-generating parameters $a_{i1}$. The same holds for the estimated item difficulties, that is, $\hat{b}_{i1} = b_{i1}$.
In the second group, there are group-specific item parameters $a_{i2}$ and $b_{i2}$ and distribution parameters $\mu_2 \neq 0$ and $\sigma_2 \neq 1$. In a separate estimation for the second group, the mean is again set to zero and the standard deviation (SD) is set to one. Hence, the estimated item parameters absorb the distribution parameters. We obtain:
$a_{i2} (\theta - b_{i2}) = a_{i2} (\sigma_2 \theta^\ast + \mu_2 - b_{i2}) = (a_{i2} \sigma_2) \big( \theta^\ast - \sigma_2^{-1} (b_{i2} - \mu_2) \big), \qquad (8)$
where the standardized ability $\theta^\ast$ is standard normally distributed (i.e., $\theta^\ast \sim N(0, 1)$). From Equation (8), it follows that:
$\hat{a}_{i2} = a_{i2} \sigma_2 \quad \text{and} \quad \hat{b}_{i2} = \sigma_2^{-1} (b_{i2} - \mu_2). \qquad (9)$
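This identity can be verified numerically. In the following sketch (all parameter values are made up), the logit of the 2PL IRF is evaluated in the original metric and in the standardized metric:

```python
import numpy as np

# Check of the identified item parameters from Section 2.3.1: in the
# standardized metric theta* = (theta - mu2) / sigma2, the 2PL item has
# discrimination a_i2 * sigma2 and difficulty (b_i2 - mu2) / sigma2.
mu2, sigma2 = 0.3, 1.2            # distribution parameters of group 2
a_i2, b_i2 = 1.4, -0.5            # group-specific item parameters

a_hat = a_i2 * sigma2             # identified discrimination
b_hat = (b_i2 - mu2) / sigma2     # identified difficulty

theta_star = np.linspace(-3.0, 3.0, 13)   # standardized ability grid
theta = sigma2 * theta_star + mu2         # ability in the original metric
lhs = a_i2 * (theta - b_i2)               # logit in the original metric
rhs = a_hat * (theta_star - b_hat)        # logit in the standardized metric
```

Both logits agree on the whole grid, so the two parameterizations describe the same IRF.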

2.3.2. The Role of Normally Distributed Random DIF in Educational Assessment

It should be noted that the presence of random DIF implies that there is no single item for which the two groups share the same item parameters. Hence, all items are allowed to have different item parameters. By posing a normal distribution on DIF effects, it is assumed that DIF across groups vanishes on average for an infinite number of items. Normally distributed random DIF is strongly different from the situation of partial invariance, in which only a few items (or a few item parameters) possess DIF, while the rest of the items (or item parameters) do not. The two kinds of DIF effects can also occur simultaneously [65]. In our experience, we have mainly observed random fluctuations of group-specific item parameters, which corresponds to normally distributed random DIF rather than to partial invariance. However, it seems that ensuring partial invariance is seen as a measurement ideal in educational LSA studies [67,68,69,70]. We have argued against such a perspective elsewhere [54,65,71,72] and think that the random DIF perspective with normally distributed DIF effects is more relevant in real-world applications. Moreover, we think that removing items from group comparisons due to DIF is unwise because DIF can be interpreted as construct-relevant [32,52,73,74,75]. In this situation, relying on a purified set of items can bias the estimated group differences with respect to the means and standard deviations.

3. Linking Methods

In the following, we discuss linking methods that allow estimating the distribution parameters $\mu_2 = E(\theta)$ and $\sigma_2 = \mathrm{SD}(\theta)$ of the second group. Let $\gamma_0$ denote the item parameters $a_i$ and $b_i$ of the common items $\mathcal{I}_0$, and $\gamma_g$ the item parameters of the group-specific sets of unique items $\mathcal{I}_g$ for $g = 1, 2$. Furthermore, denote by $\mathbf{X}_g$ the matrix of observed item responses from group $g$ for items $i \in \mathcal{I}_0 \cup \mathcal{I}_g$. For group $g = 1, 2$, the log-likelihood function is defined by:
$l(\mu, \sigma, \gamma_0, \gamma_g; \mathbf{X}_g) = \sum_{p=1}^{N_g} \log \int \prod_{i \in \mathcal{I}_0 \cup \mathcal{I}_g} P_i(x_{pgi}; a_i, b_i) \, f(\theta; \mu, \sigma) \, \mathrm{d}\theta, \qquad (10)$
where $x_{pgi}$ is the item response of person $p$ in group $g$ on item $i$. The IRT model in (10) can be estimated by marginal maximum likelihood (MML) estimation with an expectation–maximization algorithm [76,77,78,79].
For reasons of identification, we define $\mu_1 = 0$ and $\sigma_1 = 1$ and identify the distribution parameters $\mu_2$ and $\sigma_2$ of the second group. Separate calibrations for the two groups can be carried out and result in group-specific item parameter estimates $\hat{a}_{ig}$ and $\hat{b}_{ig}$ (see Section 2.3.1). These item parameters are subsequently transformed by linking methods in order to obtain the estimates $\hat{\mu}_2$ and $\hat{\sigma}_2$. See [16,22,80,81,82] for an overview of linking methods. In the next subsections, we discuss several linking methods.

3.1. Log-Mean-Mean Linking

In log-mean-mean linking (logMM; [16]), the means of the logarithmized item discriminations and the item difficulties in the two groups are set equal in order to identify the group means and group SDs. The SD $\sigma_2$ of the second group is estimated as:
$\hat{\sigma}_2 = \exp\left( \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \log \hat{a}_{i2} - \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \log \hat{a}_{i1} \right), \qquad (11)$
where $|\mathcal{I}_0|$ is the number of items in the set $\mathcal{I}_0$. The estimation in (11) corresponds to the assumption of additive DIF effects in logarithmized item discriminations (see (7)). The mean $\mu_2$ of the second group is estimated by:
$\hat{\mu}_2 = -\hat{\sigma}_2 \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{b}_{i2} + \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{b}_{i1}. \qquad (12)$
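The two logMM formulas (11) and (12) can be sketched in a few lines of code. The function name and item parameter values below are hypothetical; the check constructs identified group-2 parameters for known distribution parameters and recovers them:

```python
import numpy as np

def log_mean_mean(a1, b1, a2, b2):
    # Equations (11) and (12): sigma2 from logarithmized item
    # discriminations, mu2 from item difficulties; the inputs are
    # separate-calibration estimates for the common items.
    a1, b1, a2, b2 = map(np.asarray, (a1, b1, a2, b2))
    sigma2 = np.exp(np.mean(np.log(a2)) - np.mean(np.log(a1)))
    mu2 = -sigma2 * np.mean(b2) + np.mean(b1)
    return mu2, sigma2

# DIF-free check: identified group-2 parameters for true mu2 = 0.3 and
# sigma2 = 1.2 (a_hat_i2 = 1.2 * a_i, b_hat_i2 = (b_i - 0.3) / 1.2).
a = np.array([0.8, 1.0, 1.2, 1.5])
b = np.array([-1.0, -0.2, 0.4, 1.0])
mu2_hat, sigma2_hat = log_mean_mean(a, b, 1.2 * a, (b - 0.3) / 1.2)
```

Without DIF, the estimators reproduce the data-generating values exactly, which mirrors the consistency statement in Proposition 1.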
It can be shown that logMM estimates are consistent under weak conditions. Here, consistency means that the number of common items $|\mathcal{I}_0|$ tends to infinity. Moreover, it is assumed that the group-specific item parameters have been estimated with an infinite sample size (i.e., $N \to \infty$). The consistency proofs can easily be modified to the case of finite sample sizes (we use $\to_p$ as the mathematical symbol for convergence in probability in the following). Then, consistency is meant under a double sampling scheme in which the number of persons and the number of items tend to infinity (i.e., $N \to \infty$ and $|\mathcal{I}_0| \to \infty$). We now present the consistency result.
Proposition 1. 
Assume that DIF effects e i fulfill E ( e i ) = 0 . For DIF effects f i , one of the following conditions holds:
(i) 
For additive DIF effects f i (Equation (3)), it holds that E ( f i ) = 0 and f i has a symmetric distribution;
(ii) 
For multiplicative DIF effects f i (Equation (7)), it holds that E ( log f i ) = 0 .
Then, logMM estimators μ ^ 2 and σ ^ 2 are consistent for μ 2 and σ 2 , respectively:
$\hat{\mu}_2 \to_p \mu_2 \quad \text{and} \quad \hat{\sigma}_2 \to_p \sigma_2 \quad \text{for } |\mathcal{I}_0| \to \infty. \qquad (13)$
Proof. 
See Appendix B. □

3.2. Mean-Mean Linking

In mean-mean linking (MM; [16]), the means of the untransformed item discriminations are matched. Hence, the estimation in (11) is substituted by:
$\hat{\sigma}_2 = \frac{\frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{a}_{i2}}{\frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{a}_{i1}}. \qquad (14)$
This estimation corresponds to additive DIF effects in untransformed item discriminations (see (3)). The estimation formula for $\hat{\mu}_2$ in (12) remains unaltered.
It can also be shown that MM estimates are consistent under weak conditions.
Proposition 2. 
Assume that DIF effects e i fulfill E ( e i ) = 0 . For DIF effects f i , one of the following conditions holds:
(i) 
For additive DIF effects f i (Equation (3)), it holds that E ( f i ) = 0 ;
(ii) 
For multiplicative DIF effects f i (Equation (7)), it holds that E ( log f i ) = 0 and log f i has a symmetric distribution.
Then, MM estimators μ ^ 2 and σ ^ 2 are consistent for μ 2 and σ 2 , respectively:
$\hat{\mu}_2 \to_p \mu_2 \quad \text{and} \quad \hat{\sigma}_2 \to_p \sigma_2 \quad \text{for } |\mathcal{I}_0| \to \infty. \qquad (15)$
Proof. 
See Appendix C. □
Finally, it is instructive to study the effect on the MM linking estimates if the DIF effects do not have zero means. For additive DIF effects, we assume $E(f_i) = \delta_f$ and $E(e_i) = \delta_e$. Using similar derivations as in Appendix C, we obtain:
$\hat{\sigma}_2 \to_p \sigma_2 \frac{A + \delta_f}{A - \delta_f} = \sigma_2 \frac{1 + \delta_f / A}{1 - \delta_f / A}, \qquad (16)$
where $A = \lim_{|\mathcal{I}_0| \to \infty} \frac{1}{|\mathcal{I}_0|} \sum_{i=1}^{|\mathcal{I}_0|} a_i$. Setting $B = \lim_{|\mathcal{I}_0| \to \infty} \frac{1}{|\mathcal{I}_0|} \sum_{i=1}^{|\mathcal{I}_0|} b_i$, we obtain:
$\hat{\mu}_2 \to_p \mu_2 \frac{1 + \delta_f / A}{1 - \delta_f / A} - 2 B \frac{\delta_f / A}{1 - \delta_f / A} - 2 \delta_e \frac{1}{1 - \delta_f / A}. \qquad (17)$
If DIF effects do not have zero means, Equations (16) and (17) show that biased estimates for the group mean and the group standard deviation can be expected.
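The limiting values in Equations (16) and (17) can be checked numerically. In this sketch (all values are arbitrary), every DIF effect is set equal to its nonzero mean, so the large-sample MM estimates can be written down directly from the identified item parameters (Section 2.3.1) and compared to the predicted limits:

```python
import numpy as np

# Numeric check of the bias formulas (16) and (17) for mean-mean linking.
# delta_f and delta_e are the nonzero means of the DIF effects; A and B
# are the limiting means of the common item parameters a_i and b_i.
mu2, sigma2 = 0.3, 1.2
A, B = 1.1, 0.2
delta_f, delta_e = 0.1, 0.05

# Large-sample MM estimates, obtained by replacing every DIF effect by
# its mean and using the identified parameters of Section 2.3.1:
a1_mean = A - delta_f                    # limit of mean a_hat_i1
a2_mean = sigma2 * (A + delta_f)         # limit of mean a_hat_i2
b1_mean = B - delta_e                    # limit of mean b_hat_i1
b2_mean = (B + delta_e - mu2) / sigma2   # limit of mean b_hat_i2
sigma_mm = a2_mean / a1_mean             # mean-mean SD estimate
mu_mm = b1_mean - sigma_mm * b2_mean     # mean-mean mean estimate

# Limits predicted by Equations (16) and (17):
r = (1 + delta_f / A) / (1 - delta_f / A)
sigma_pred = sigma2 * r
mu_pred = mu2 * r - 2 * B * (delta_f / A) / (1 - delta_f / A) \
          - 2 * delta_e / (1 - delta_f / A)
```

Both routes yield the same biased values (here about 1.44 instead of 1.2 for the SD), illustrating that nonzero mean DIF effects distort MM linking.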

3.3. Haberman Linking (HAB and HAB-nolog)

Haberman linking (HAB; [71,83,84]) provides a generalization of logMM and MM to multiple groups while simultaneously estimating common item parameters $\mathbf{a}$ and $\mathbf{b}$. HAB consists of two steps: in the first step, the group-specific SDs are estimated; in the second step, the group-specific means are estimated.
The originally proposed HAB linking [83] operates on logarithmized item discriminations. To estimate $\sigma_2$, the linking function in HAB for the particular case of two groups is:
$H_{1,\log}(\sigma_2, \mathbf{a}) = \sum_{i \in \mathcal{I}_0 \cup \mathcal{I}_1} \big( \log \hat{a}_{i1} - \log a_i \big)^2 + \sum_{i \in \mathcal{I}_0 \cup \mathcal{I}_2} \big( \log \hat{a}_{i2} - \log a_i - \log \sigma_2 \big)^2. \qquad (18)$
The parameters in HAB linking are estimated as the minimizer of (18):
$(\hat{\sigma}_2, \hat{\mathbf{a}}) = \arg\min_{\sigma_2, \mathbf{a}} H_{1,\log}(\sigma_2, \mathbf{a}). \qquad (19)$
For $i \in \mathcal{I}_g$ ($g = 1, 2$), we obtain $\hat{a}_i = \hat{a}_{ig}$. Furthermore, for $i \in \mathcal{I}_0$, the minimization of (18) corresponds to a two-way analysis of variance. We obtain (see Equation (A31) in Appendix D):
$\hat{\sigma}_2 = \exp\left( \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \log \hat{a}_{i2} - \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \log \hat{a}_{i1} \right), \qquad (20)$
which shows the equivalence to logMM linking with respect to the estimation of σ 2 .
In the second step, the group means are estimated. The linking function is given as:
$H_2(\mu_2, \mathbf{b}) = \sum_{i \in \mathcal{I}_0 \cup \mathcal{I}_1} \big( \hat{b}_{i1} - b_i \big)^2 + \sum_{i \in \mathcal{I}_0 \cup \mathcal{I}_2} \big( \hat{\sigma}_2 \hat{b}_{i2} - b_i + \mu_2 \big)^2. \qquad (21)$
The parameters are estimated as:
$(\hat{\mu}_2, \hat{\mathbf{b}}) = \arg\min_{\mu_2, \mathbf{b}} H_2(\mu_2, \mathbf{b}). \qquad (22)$
Using the same derivation as for $H_{1,\log}$, we obtain:
$\hat{\mu}_2 = -\hat{\sigma}_2 \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{b}_{i2} + \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{b}_{i1}. \qquad (23)$
Notably, Equation (23) coincides with the estimation in logMM linking (see Equation (12)).
As an alternative, Haberman linking can also be conducted based on untransformed item discriminations [71]. This method is labeled HAB-nolog. It turned out that HAB-nolog outperformed HAB in some situations with multiple groups [71,84]. The linking function of HAB-nolog is given by:
$H_{1,\mathrm{nolog}}(\sigma_2, \mathbf{a}) = \sum_{i \in \mathcal{I}_0 \cup \mathcal{I}_1} \big( \hat{a}_{i1} - a_i \big)^2 + \sum_{i \in \mathcal{I}_0 \cup \mathcal{I}_2} \big( \hat{a}_{i2} - a_i \sigma_2 \big)^2. \qquad (24)$
The parameter estimates in HAB-nolog linking are determined by:
$(\hat{\sigma}_2, \hat{\mathbf{a}}) = \arg\min_{\sigma_2, \mathbf{a}} H_{1,\mathrm{nolog}}(\sigma_2, \mathbf{a}). \qquad (25)$
The SD of the second group is given by (see Equation (A36) in Appendix D):
$\hat{\sigma}_2 = 1 + \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{a}_{i2} - \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} \hat{a}_{i1}. \qquad (26)$
The linking function for $\mu_2$ in HAB-nolog is the same as in HAB (see Equation (21)).

3.4. Invariance Alignment with p = 2

Asparouhov and Muthén [85,86] proposed the method of invariance alignment (IA) to define a linking that maximizes the extent of invariant item parameters. IA can also be regarded as a linking method that handles noninvariant item parameters [71].
The IA method is based on the estimated group-specific item intercepts $\hat{\nu}_{ig} = -\hat{a}_{ig} \hat{b}_{ig}$ and item discriminations $\hat{a}_{ig}$ obtained from separate calibrations. IA was originally formulated to detect only a few noninvariant items. It has been pointed out that IA in its original proposal is not an acceptable linking method in the presence of normally distributed DIF effects [87,88,89]. In [71], IA was studied using a general class of so-called $L_p$-type robust linking functions. The original IA formulation used $p = 0.5$ [85]. For normally distributed DIF effects, $p = 2$ is an adequate choice [71,89]. Hence, we investigate IA with the loss function using $p = 2$ (IA2).
It has been shown that IA estimation, originally formulated as a joint estimation problem, can be reformulated as a two-step estimation method [71]. In the first step, the group SDs are computed. In the second step, the group means are computed. For two groups, $\sigma_2$ is estimated in the first step and $\mu_2$ in the second step. Note that the first group serves as the reference group (i.e., it holds that $\mu_1 = 0$ and $\sigma_1 = 1$). The formulas in [71] can be simplified to the case of two groups, providing the estimates $\hat{\mu}_2$ and $\hat{\sigma}_2$:
$\hat{\sigma}_2 = \arg\min_{\sigma_2} \sum_{i \in \mathcal{I}_0} \big( \hat{a}_{i1} - \hat{a}_{i2} / \sigma_2 \big)^2 \quad \text{and} \qquad (27)$
$\hat{\mu}_2 = \arg\min_{\mu_2} \sum_{i \in \mathcal{I}_0} \big( \hat{\nu}_{i1} - \hat{\nu}_{i2} + \mu_2 \, \hat{a}_{i2} / \hat{\sigma}_2 \big)^2. \qquad (28)$
The estimates in (27) and (28) have closed-form expressions (see Equations (A41) and (A45) in Appendix E).
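The two IA2 minimization problems in (27) and (28) are least-squares problems with explicit solutions. The sketch below is our own simplification (hypothetical function name, made-up values; the substitution $t = 1/\sigma_2$ in the first step is not necessarily the form used in Appendix E), using the logit-metric intercept convention $\hat{\nu}_{ig} = -\hat{a}_{ig} \hat{b}_{ig}$:

```python
import numpy as np

def ia2_link(a1, b1, a2, b2):
    # Step 1 solves (27) via the substitution t = 1 / sigma2, which turns
    # it into ordinary least squares in t. Step 2 solves (28), which is
    # linear in mu2. Intercepts follow the convention nu = -a * b.
    a1, b1, a2, b2 = map(np.asarray, (a1, b1, a2, b2))
    nu1, nu2 = -a1 * b1, -a2 * b2
    t = np.sum(a1 * a2) / np.sum(a2 ** 2)   # minimizes sum (a1 - t * a2)^2
    sigma2 = 1.0 / t
    z = a2 / sigma2
    mu2 = -np.sum((nu1 - nu2) * z) / np.sum(z ** 2)
    return mu2, sigma2

# DIF-free check with true mu2 = 0.3 and sigma2 = 1.2:
a = np.array([0.8, 1.0, 1.3])
b = np.array([-0.5, 0.1, 0.7])
mu2_hat, sigma2_hat = ia2_link(a, b, 1.2 * a, (b - 0.3) / 1.2)
```

In the DIF-free case, the two steps recover the distribution parameters exactly.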

3.5. Haebara Linking Methods (HAE-Asymm, HAE-Symm, HAE-Joint)

In contrast to the MM, logMM, HAB, HAB-nolog, and IA linking methods, Haebara (HAE) linking [90] aligns IRFs instead of directly aligning item parameters. The linking function in asymmetric HAE linking (HAE-asymm; [90]) is given as:
$H_{\mathrm{asymm}}(\mu_2, \sigma_2) = \sum_{i \in \mathcal{I}_0} \int \Big[ \Psi\big( \hat{a}_{i1} (\theta - \hat{b}_{i1}) \big) - \Psi\big( \sigma_2^{-1} \hat{a}_{i2} (\theta - \sigma_2 \hat{b}_{i2} - \mu_2) \big) \Big]^2 \, \omega(\theta) \, \mathrm{d}\theta. \qquad (29)$
The estimated distribution parameters are obtained by minimizing (29):
$(\hat{\mu}_2, \hat{\sigma}_2) = \arg\min_{\mu_2, \sigma_2} H_{\mathrm{asymm}}(\mu_2, \sigma_2). \qquad (30)$
The linking function (29) aligns the IRFs of the second group to those of the first group. Hence, it is asymmetric because the parameters of the second group are expected to behave similarly to the first group. However, the first group could alternatively be aligned to the second group. To robustify HAE linking, both directions of alignment are considered in symmetric Haebara linking (HAE-symm; [91,92]), which employs the linking function:
$H_{\mathrm{symm}}(\mu_2, \sigma_2) = \sum_{i \in \mathcal{I}_0} \int \Big[ \Psi\big( \hat{a}_{i1} (\theta - \hat{b}_{i1}) \big) - \Psi\big( \sigma_2^{-1} \hat{a}_{i2} (\theta - \sigma_2 \hat{b}_{i2} - \mu_2) \big) \Big]^2 \, \omega(\theta) \, \mathrm{d}\theta + \sum_{i \in \mathcal{I}_0} \int \Big[ \Psi\big( \hat{a}_{i1} \sigma_2 (\theta - \sigma_2^{-1} (\hat{b}_{i1} - \mu_2)) \big) - \Psi\big( \hat{a}_{i2} (\theta - \hat{b}_{i2}) \big) \Big]^2 \, \omega(\theta) \, \mathrm{d}\theta. \qquad (31)$
In a similar vein, the estimated mean and SD of the second group are defined as:
$(\hat{\mu}_2, \hat{\sigma}_2) = \arg\min_{\mu_2, \sigma_2} H_{\mathrm{symm}}(\mu_2, \sigma_2). \qquad (32)$
A generalization of HAE linking to the general case of multiple groups was proposed in [93]. In this joint Haebara (HAE-joint) linking approach, the distribution parameters are simultaneously estimated with common item parameters. The linking function in HAE-joint is defined as [65,93,94]:
$H_{\mathrm{joint}}(\mu_2, \sigma_2, \mathbf{a}, \mathbf{b}) = \sum_{i \in \mathcal{I}_0} \int \Big[ \Psi\big( \hat{a}_{i1} (\theta - \hat{b}_{i1}) \big) - \Psi\big( a_i (\theta - b_i) \big) \Big]^2 \, \omega(\theta) \, \mathrm{d}\theta + \sum_{i \in \mathcal{I}_0} \int \Big[ \Psi\big( \hat{a}_{i2} (\theta - \hat{b}_{i2}) \big) - \Psi\big( a_i (\sigma_2 \theta - b_i + \mu_2) \big) \Big]^2 \, \omega(\theta) \, \mathrm{d}\theta, \qquad (33)$
where $\mathbf{a}$ and $\mathbf{b}$ denote common item parameters. The parameter estimates are given by minimizing (33):
$(\hat{\mu}_2, \hat{\sigma}_2, \hat{\mathbf{a}}, \hat{\mathbf{b}}) = \arg\min_{\mu_2, \sigma_2, \mathbf{a}, \mathbf{b}} H_{\mathrm{joint}}(\mu_2, \sigma_2, \mathbf{a}, \mathbf{b}). \qquad (34)$
A variant of joint Haebara linking was proposed in [84]. Note that in joint Haebara linking, the common IRFs are aligned to the group-specific IRFs.
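Asymmetric Haebara linking can be illustrated with a toy implementation. The following sketch (not the numerical method used in the paper) evaluates the loss in (29) on a discrete theta grid, which plays the role of the weight function $\omega$, and minimizes it by a coarse grid search over made-up parameter values:

```python
import numpy as np

def psi(y):
    # Logistic link function Psi(y) = exp(y) / (1 + exp(y)).
    return 1.0 / (1.0 + np.exp(-y))

def hae_asymm_loss(mu2, sigma2, a1, b1, a2, b2, theta):
    # Equation (29): squared distance between group-1 IRFs and rescaled
    # group-2 IRFs, averaged over the theta grid.
    loss = 0.0
    for ai1, bi1, ai2, bi2 in zip(a1, b1, a2, b2):
        p1 = psi(ai1 * (theta - bi1))
        p2 = psi(ai2 / sigma2 * (theta - sigma2 * bi2 - mu2))
        loss += np.mean((p1 - p2) ** 2)
    return loss

def hae_asymm_link(a1, b1, a2, b2):
    # Minimize the loss by a simple grid search (for illustration only).
    theta = np.linspace(-4.0, 4.0, 81)
    best = (np.inf, None, None)
    for mu2 in np.linspace(-1.0, 1.0, 201):
        for sigma2 in np.linspace(0.5, 2.0, 151):
            val = hae_asymm_loss(mu2, sigma2, a1, b1, a2, b2, theta)
            if val < best[0]:
                best = (val, mu2, sigma2)
    return best[1], best[2]

# DIF-free check with true mu2 = 0.3 and sigma2 = 1.2 (both on the grid):
a = np.array([0.8, 1.2])
b = np.array([-0.4, 0.6])
mu2_hat, sigma2_hat = hae_asymm_link(a, b, 1.2 * a, (b - 0.3) / 1.2)
```

In practice, the loss is minimized with a proper numerical optimizer and quadrature weights instead of a grid search; the sketch only makes the alignment of IRFs concrete.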

3.6. Recalibration Linking (RC1, RC2, and RC3)

A further linking technique is recalibration (RC) linking. This is based on item parameters obtained from separate calibrations. The core idea is that group differences in distributions can be inferred by recalibrating a study with one group using item parameters from the other group. RC methods are mainly used in LSA studies such as the Programme for International Student Assessment (PISA; [95]), Progress in International Reading Literacy Study (PIRLS; [96]), and Trends in International Mathematics and Science Study (TIMSS; [97,98]).
In PISA, RC linking was employed until PISA 2012 for the 1PL model to handle model misspecifications (i.e., the data-generating IRT model deviates from the 1PL model) in linking that could artificially impact the estimated SDs [99]. RC linking in PIRLS and TIMSS operates on the 3PL model [98,100,101]. Notably, recalibration methods have also been proposed for determining linking errors in PIRLS/TIMSS [101], as well as in PISA ([69], p. 176ff.), but the two approaches turned out to be different.
RC linking is based on estimated item parameters from the first and the second group. The item parameters are obtained assuming $\mu_1 = \mu_2 = 0$ and $\sigma_1 = \sigma_2 = 1$ in separate calibrations. The group-specific parameter estimates are defined as:
$(\hat{\gamma}_0^{(1)}, \hat{\gamma}_1^{(1)}) = \arg\max_{\gamma_0, \gamma_1} l\big( 0, 1, (\gamma_0, \gamma_1); \mathbf{X}_1 \big) \quad \text{and} \qquad (35)$
$(\hat{\gamma}_0^{(2)}, \hat{\gamma}_2^{(2)}) = \arg\max_{\gamma_0, \gamma_2} l\big( 0, 1, (\gamma_0, \gamma_2); \mathbf{X}_2 \big). \qquad (36)$
To obtain the mean and the SD of the second group, the data of the first group (i.e., $\mathbf{X}_1$) are recalibrated using the item parameters from the second group (i.e., $\hat{\gamma}_0^{(2)}$):
$(m_1, s_1, \hat{g}_1) = \arg\max_{\mu, \sigma, \gamma_1} l\big( \mu, \sigma, (\hat{\gamma}_0^{(2)}, \gamma_1); \mathbf{X}_1 \big). \qquad (37)$
A recalibrated mean $m_1$ and a recalibrated SD $s_1$ are obtained. These two parameters indicate differences between the two groups. Similarly, the data of the second group (i.e., $\mathbf{X}_2$) can be recalibrated using the item parameters from the first group (i.e., $\hat{\gamma}_0^{(1)}$):
$(m_2, s_2, \hat{g}_2) = \arg\max_{\mu, \sigma, \gamma_2} l\big( \mu, \sigma, (\hat{\gamma}_0^{(1)}, \gamma_2); \mathbf{X}_2 \big). \qquad (38)$
Based on these estimates, the distribution parameters of the second group are defined as:
$\hat{\mu}_2 = -s \, m_1 \quad \text{and} \quad \hat{\sigma}_2 = s, \qquad (39)$
where the scaling factor $s$ can take different forms. Note that $\hat{\mu}_2$ relies on the recalibrated mean $m_1$ of the first group and the scaling factor $s$. The three RC linking methods differ with respect to the scaling factor used:
$\text{Method RC1:} \;\; s = \frac{1}{s_1}, \qquad \text{Method RC2:} \;\; s = s_2, \qquad \text{Method RC3:} \;\; s = \sqrt{\frac{s_2}{s_1}}. \qquad (40)$
The linking methods RC1 and RC2 are asymmetric, while method RC3 relies on both recalibrated SDs. The scaling factor in RC3 linking is defined as the geometric mean of the scaling factors in RC1 and RC2; this definition is motivated by treating the impact on the recalibrated SDs symmetrically. The linking method RC1 is currently used in PIRLS and TIMSS [96,98] and was used in PISA until 2012 [99]. To our knowledge, linking methods RC2 and RC3 have not yet been investigated in the literature.
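The three scaling factors can be illustrated with simple arithmetic. We use the large-sample identities $m_1 = -\mu_2/\sigma_2$, $s_1 = 1/\sigma_2$ (group 1 recalibrated on the scale of group 2) and $m_2 = \mu_2$, $s_2 = \sigma_2$; these identities are our reading of the recalibration logic in the DIF-free, infinite-sample limit and are stated here as an assumption:

```python
import math

# Illustration of the RC scaling factors with made-up values. In the
# DIF-free, infinite-sample limit we assume m1 = -mu2 / sigma2,
# s1 = 1 / sigma2 and m2 = mu2, s2 = sigma2 (see the lead-in).
mu2, sigma2 = 0.3, 1.2
m1, s1 = -mu2 / sigma2, 1.0 / sigma2
m2, s2 = mu2, sigma2

s_rc1 = 1.0 / s1                 # Method RC1
s_rc2 = s2                       # Method RC2
s_rc3 = math.sqrt(s2 / s1)       # Method RC3: geometric mean of RC1 and RC2

mu2_rc1 = -s_rc1 * m1            # recovered group mean under RC1
```

Under these assumptions, all three scaling factors equal $\sigma_2$ and the recovered mean equals $\mu_2$; the three methods differ only when sampling error or DIF makes $1/s_1$ and $s_2$ diverge.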

3.7. Anchored Item Parameters

The linking method based on anchored item parameters (ANCH; [16,22,102]) estimates the distribution parameters in the second group by fixing the item parameters of the common items to those of the first group. Assume that the item parameter estimates in the first group are computed as:
$(\hat{\gamma}_0, \hat{\gamma}_1) = \arg\max_{\gamma_0, \gamma_1} l\big( 0, 1, (\gamma_0, \gamma_1); \mathbf{X}_1 \big). \qquad (41)$
In ANCH linking, $\mu_2$ and $\sigma_2$ are estimated by maximizing the log-likelihood function while fixing $\gamma_0$ (i.e., $\gamma_0 = \hat{\gamma}_0$):
$(\hat{\mu}_2, \hat{\sigma}_2, \hat{\gamma}_2) = \arg\max_{\mu_2, \sigma_2, \gamma_2} l\big( \mu_2, \sigma_2, (\hat{\gamma}_0, \gamma_2); \mathbf{X}_2 \big). \qquad (42)$
It should be noted that RC linking can be considered a variant of ANCH linking because, in RC linking, the distribution parameters of the first group are re-estimated using anchored item parameters from the second group. However, the distribution parameters of the second group are indirectly obtained by transforming the re-estimated distribution parameters of the first group (see Section 3.6). In contrast, ANCH provides an estimate of μ 2 and σ 2 directly.

3.8. Concurrent Calibration

Concurrent calibration (CC; [65,70,103]) is based on a multiple-group IRT model and presupposes invariant item parameters across groups. The distribution parameters of the second group are determined by maximizing the joint likelihood:
$(\hat{\mu}_2, \hat{\sigma}_2, \hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2) = \arg\max_{\mu_2, \sigma_2, \gamma_0, \gamma_1, \gamma_2} \Big[ l\big( 0, 1, (\gamma_0, \gamma_1); \mathbf{X}_1 \big) + l\big( \mu_2, \sigma_2, (\gamma_0, \gamma_2); \mathbf{X}_2 \big) \Big]. \qquad (43)$
In the presence of random DIF, the log-likelihood function in (43) is misspecified, and the estimated mean and SD can be biased as a consequence. It has frequently been pointed out that separate calibration with subsequent linking can be more robust to the presence of DIF than CC [16,54,102]. CC can only be expected to be more efficient than linking based on separate calibration in small to moderate sample sizes and in the absence of DIF [65]. Notably, CC is also more computationally demanding than linking based on separate calibration [65,104].

4. Simulation Study

4.1. Purpose

The purpose of this simulation study was to investigate the performance of linking methods in the two-group case for the 2PL model under different sample sizes, different numbers of items, and different amounts of uniform and nonuniform DIF. Most simulation studies either assume invariant item parameters (i.e., no DIF) or presuppose partial invariance in which only a few item parameters differ between groups (e.g., [63,105,106,107,108,109,110,111,112]). There is a lack of research on the presence of random DIF, although there is some initial work for continuous items [88,89]. Moreover, although recalibration linking is in operational use in LSA studies, it has not yet been systematically compared to alternative linking methods.
We expected from the previous research and our analytical findings in Propositions 1 and 2 that moment-based linking methods could be competitive with the CC and HAE linking methods [54,65]. We did not have specific hypotheses regarding the performance of recalibration linking.

4.2. Design

We simulated a design with two groups and only common items (i.e., no group-specific unique items). Data were simulated according to the 2PL model with different amounts of random DIF, following Equations (3) and (4). Table A1 in Appendix F shows the item parameters $a_i$ and $b_i$ of the 20 items used in the simulation. The first group served as the reference group with $\mu_1 = 0$ and $\sigma_1 = 1$, while the distribution parameters of the second group were $\mu_2 = 0.3$ and $\sigma_2 = 1.2$.
Four factors were varied in the simulation. First, we chose sample sizes of $N = 500$, 1000, and 5000 (3 factor levels). Second, the number of items was either $I = 20$ or $I = 40$ (2 factor levels); for 40 items, the item parameters from Table A1 were duplicated. Third, the random DIF SD $\tau_b$ for item difficulties was 0, 0.1, 0.3, or 0.5 (4 factor levels). Fourth, the random DIF SD $\tau_a$ for item discriminations was 0, 0.15, or 0.25 (3 factor levels). In total, there were $3 \times 2 \times 4 \times 3 = 96$ conditions.
In total, 1000 datasets were simulated and analyzed in each condition.
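As a concrete illustration of this design, the data-generating step can be sketched in a few lines (a minimal sketch in Python; the item parameters below are drawn at random rather than taken from Table A1, and the additive random DIF parameterization only schematically mirrors Equations (3) and (4)):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_2pl_random_dif(n, a, b, mu, sigma, tau_a, tau_b, rng):
    """Simulate one group's 0/1 responses under the 2PL model with
    additive random DIF on item discriminations and difficulties."""
    a_g = a + rng.normal(0.0, tau_a, size=a.size) if tau_a > 0 else a
    b_g = b + rng.normal(0.0, tau_b, size=b.size) if tau_b > 0 else b
    theta = rng.normal(mu, sigma, size=n)                    # abilities
    logits = a_g[None, :] * (theta[:, None] - b_g[None, :])  # n x I array
    prob = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(prob.shape) < prob).astype(int)

a = rng.uniform(0.5, 1.4, size=20)       # illustrative values, not Table A1
b = rng.normal(0.0, 1.0, size=20)
x1 = simulate_2pl_random_dif(1000, a, b, 0.0, 1.0, 0.00, 0.0, rng)  # reference group
x2 = simulate_2pl_random_dif(1000, a, b, 0.3, 1.2, 0.15, 0.3, rng)  # second group
```

Each simulated dataset then consists of two such response matrices, which are calibrated separately before linking.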

4.3. Analysis Methods

The 2PL model was separately estimated in the two groups. Afterward, 13 linking methods (logMM, HAB-log, MM, HAB-nolog, IA2, HAE-asymm, HAE-symm, HAE-joint, RC1, RC2, RC3, ANCH, CC; see Section 3) were applied.
The parameters of interest were the estimated mean μ̂_2 and SD σ̂_2 of the second group. For both parameters, the bias and root-mean-squared error (RMSE) were computed. To decrease the dependence of the RMSE on the sample size and the number of items, we computed a relative RMSE, in which the RMSE of a linking method was divided by the RMSE of the best-performing linking method and multiplied by 100. Hence, the relative RMSE attains its lowest value of 100 for the best-performing linking method.
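The bias, RMSE, and relative RMSE computations described above can be sketched as follows (the method names and RMSE values are invented for illustration):

```python
import numpy as np

def bias_rmse(estimates, true_value):
    """Bias and RMSE of a vector of replicated estimates."""
    e = np.asarray(estimates, dtype=float)
    return e.mean() - true_value, float(np.sqrt(np.mean((e - true_value) ** 2)))

def relative_rmse(rmse_by_method):
    """Divide each method's RMSE by the smallest RMSE; the best method gets 100."""
    best = min(rmse_by_method.values())
    return {m: 100.0 * r / best for m, r in rmse_by_method.items()}

rel = relative_rmse({"MM": 0.050, "logMM": 0.055, "CC": 0.080})
# the best method ("MM") is scaled to 100; all other methods are >= 100
```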
To summarize the contribution of each of the manipulated factors in the simulation, we conducted an analysis of variance (ANOVA) and used a variance decomposition to assess the importance.
Moreover, we classified the linking methods according to whether or not they showed satisfactory performance in a particular condition. We defined performance as satisfactory with respect to the bias if the absolute bias of a parameter (i.e., the estimated mean μ̂_2 or estimated SD σ̂_2) was smaller than 0.01. In LSA studies such as the Programme for International Student Assessment (PISA), standard errors are about 0.02 or 0.03 for standardized ability variables θ. The bias should only be a fraction of the variability introduced by the sampling error, which motivated the cutoff of 0.01. An estimator had satisfactory performance with respect to the RMSE if the relative RMSE was smaller than 120; that is, an alternative to the best-performing estimator should not lose too much precision. With a relative RMSE of 120, the relative mean-squared error (MSE) is 144 (note that 1.2² = 1.44), which corresponds to a loss in precision of 44%. Estimators with worse performance might not be considered satisfactory in such a situation.
In all the analyses, the statistical software R [113] was used. The R package TAM [114] was used to estimate the 2PL model with marginal maximum likelihood as the estimation method. The linking methods were estimated using dedicated R functions or the existing functionality in the R packages sirt [115] and TAM [114]. The ANOVA model was estimated with the R package lme4 [116].

4.4. Results

In Table 1, the variance decomposition of the ANOVA is presented. All terms up to three-way interactions are included. From the size of the residual variance, it can be concluded that the first three orders are sufficient to capture the most important factors in the simulation.
It turned out that the sample size (N) and the number of items (I) were only of minor importance among the terms up to third order in the ANOVA. However, the linking method, as well as the size of random DIF, were important factors. Importantly, the random DIF SD (τ_b) in item difficulties b_i had a large influence on the estimated means, while the random DIF SD (τ_a) in item discriminations a_i strongly impacted the estimated SDs.
Due to these observations, we decided to provide aggregated results for three important groups of cells in the simulation. First, we aggregated the results for six conditions with no DIF (NODIF; τ b = 0 and τ a = 0 ). Because there were two estimated parameters (mean and SD), this resulted in aggregation across twelve measures for the bias and RMSE, respectively. Second, we considered all 18 conditions with uniform DIF (UDIF; τ b > 0 and τ a = 0 ). Third, we provide summaries across all 48 conditions with nonuniform DIF (NUDIF; τ a > 0 ).
In Table 2, the performance and the linking methods are summarized across these three groups of conditions by classifying all linking methods as either satisfactory or nonsatisfactory. Table 2 shows the proportion of conditions in which a linking method provided satisfactory results. In the absence of DIF (columns “NODIF”), all methods performed well in most of the conditions. Notably, IA2, the asymmetric recalibration methods RC1 and RC2, as well as ANCH provided biases in some instances. In the case of uniform DIF (columns “UDIF”), HAE-asymm, HAE-joint and CC cannot be recommended in terms of the bias. Moreover, only moment-based methods (logMM, HAB, MM, HAB-nolog) can be recommended in terms of the RMSE. Finally, in the presence of nonuniform DIF (columns “NUDIF”), moment-based methods, as well as HAE-symm and RC3 were satisfactory in terms of the bias. If the RMSE is considered as an additional criterion, utilizing linking methods MM, HAB-nolog, HAE-symm, and RC3 can be suggested.
In Table 3, the bias and RMSE of the mean μ̂_2 and the SD σ̂_2 of the second group for N = 1000 students and I = 40 items are shown. It can be seen that all linking methods performed well in the absence of DIF (see columns “NODIF”). In the case of uniform DIF (columns “UDIF”), HAE-asymm, HAE-joint, and CC produced nonsatisfactory results in terms of the bias for the mean or the SD. Interestingly, the RMSE for the SD was much larger for HAE, RC, ANCH, and CC than for the moment-based methods logMM, HAB, MM, and HAB-nolog. In the case of nonuniform DIF (columns “NUDIF”), only the moment-based methods (except IA2), HAE-symm, and the newly proposed recalibration linking method RC3 produced satisfactory results. Furthermore, note that the RMSE for the SD was larger for logMM and HAB than for MM and HAB-nolog. This finding can be explained by the fact that the data-generating model for DIF was an additive model that operated on nontransformed item discriminations, which favored MM and HAB-nolog. To conclude, linking methods that are not moment-based can only be recommended as a default linking method when the mean is of interest and random DIF is absent. Whether the additional variance introduced by moment-based methods is compensated by smaller biases depends on the extent of UDIF (i.e., the size of τ_b).

5. Empirical Example: Linking PISA 2006 and PISA 2009 for Austria

5.1. Method

In order to illustrate the consequences of different scaling models (i.e., 1PL and 2PL models), as well as different linking methods (see Section 3), we analyzed the data for Austrian students from the PISA study conducted in 2006 (PISA 2006; [95]) and 2009 (PISA 2009; [117]). In Table 4, the sample sizes (i.e., N) and item numbers (i.e., I) used are presented. There were common items in the two PISA studies (Mathematics: I 0 = 35 ; Reading: I 0 = 27 ; Science: I 0 = 53 ). Moreover, the officially reported means and SDs for Austrian students in PISA 2006 and PISA 2009 are displayed. Note that PISA 2006 and PISA 2009 employed the 1PL model and utilized the MM linking method with a subsequent recalibration linking (method RC1; [117]).
In both PISA studies, we included only those students who received a test booklet with at least one item in the respective domain. For simplicity, all polytomous items (i.e., items with maximum scores larger than one) were dichotomously recoded, with only the highest category counted as correct. The 1PL and the 2PL models were used as scaling models, and student weights were taken into account. In total, 13 linking methods (logMM, HAB-log, MM, HAB-nolog, IA2, HAE-asymm, HAE-symm, HAE-joint, RC1, RC2, RC3, ANCH, CC; see Section 3) were applied. In the linking procedure, the distribution parameters of the first study (PISA 2006) were fixed (i.e., μ_1 = 0, σ_1 = 1), and the distribution parameters of the second study (PISA 2009) were estimated. The two ability distributions for the three domains (and the distribution parameters) were linearly transformed such that the mean and SD equaled the officially reported mean and SD in PISA 2006 in the respective domain. For example, for Mathematics, it holds that μ̂_1 = M = 506.8 and σ̂_1 = SD = 96.8 in PISA 2006 for all linking methods for the 1PL and the 2PL models.
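The final linear transformation onto the reported metric is an affine map (a sketch; the PISA 2009 input values below are hypothetical placeholders, while the PISA 2006 anchor values are the official ones quoted above):

```python
def to_reported_metric(mu, sigma, ref_mean, ref_sd):
    """Map (mu, sigma) from the linking metric, in which the reference
    group is N(0, 1), onto the officially reported metric."""
    return ref_mean + ref_sd * mu, ref_sd * sigma

# PISA 2006 Mathematics is fixed at mu_1 = 0, sigma_1 = 1 and is mapped
# onto the official values M = 506.8, SD = 96.8.
m1, s1 = to_reported_metric(0.0, 1.0, 506.8, 96.8)
# A hypothetical PISA 2009 estimate on the linking metric:
m2, s2 = to_reported_metric(-0.15, 1.02, 506.8, 96.8)
```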

5.2. Results

In Table 5, the trend estimates for Austria from PISA 2006 to PISA 2009 are shown. For the linearly transformed scores, the trend estimate is given as Δ̂ = μ̂_2 − μ̂_1. There was a significant and large negative trend for Mathematics and Science, while the trend in Reading turned out to be smaller. Across both models (1PL and 2PL) and all linking methods, the average trend in Mathematics was M = −14.4 (SD = 1.1, Range = 3.6). There were slight differences between the 1PL and the 2PL models for the moment-based linking methods (i.e., logMM, HAB, MM, HAB-nolog). Notably, the variation of estimates across linking methods was larger for the 2PL model (SD = 1.3) than for the 1PL model (SD = 0.7). The trend in Reading was smaller (M = −5.4) and less variable (SD = 0.8, Range = 2.4) than in Mathematics. Finally, the trend in Science was also strongly negative (M = −14.5). The variability (SD = 1.3, Range = 5.2) was mainly caused by the variability across linking methods for the 2PL model. In contrast, the linking methods for the 1PL model yielded very similar estimates (SD = 0.3, Range = 0.8).
In Table 6, the estimated SD σ̂_2 in PISA 2009 is displayed for the 1PL and the 2PL models as a function of the linking method. Interestingly, the variability across linking methods turned out to be larger than for the mean estimates. Relatively large differences between the 1PL and the 2PL models were observed for Reading (1PL: M = 100.4; 2PL: M = 105.2) and Science (1PL: M = 104.1; 2PL: M = 107.4). Note that the differences between the 1PL and the 2PL models were particularly pronounced for the IA2 method. The difference between the two scaling models might be explained by different average item discriminations for common items and unique items. Reading was a minor domain in PISA 2006 (no unique items) and a major domain in PISA 2009, in which a large set of unique items was introduced. Science was a major domain in PISA 2006 (many unique items) and a minor domain in PISA 2009 (no unique items). In contrast, Mathematics was a minor domain in both PISA 2006 and PISA 2009, with only a small number of unique items in PISA 2006 and none in PISA 2009. These findings indicate that an adjustment for the average differences in item discriminations has to be conducted to avoid biased estimates of the means and SDs when a misspecified 1PL model is employed (see [99]).

6. Discussion

In this article, several linking methods for two groups were evaluated for the 2PL model in the case of normally distributed random DIF effects with zero means for item difficulties and item discriminations. Somewhat surprisingly, and contrary to the recommendations in the literature, moment-based linking methods (mean-mean and log-mean-mean linking, as well as Haberman linking) performed best in terms of the bias and RMSE. The unbiasedness of the moment-based methods in the case of many items was expected due to the consistency results for these estimators presented in Section 3. When the primary criterion is the bias, only symmetric Haebara (HAE-symm) linking and the newly proposed recalibration method RC3 can be recommended among the nonmoment-based methods. In contrast, the commonly used asymmetric Haebara (HAE-asymm) linking and the recalibration linking RC1 used in PIRLS and TIMSS had substantially worse performance. Furthermore, note that concurrent calibration (which incorrectly assumes invariant item parameters across groups) and the anchored item parameters method provided biased estimates and cannot be recommended for operational use. Concurrent calibration can only achieve the promised highest efficiency [67,70] with small-to-moderate sample sizes, in the absence of random DIF, and with a correctly specified IRT model. If data in educational large-scale assessment studies were indeed to follow a random DIF distribution, the currently used linking methods (concurrent calibration, recalibration linking RC1) could be replaced by the better alternatives identified in this study (moment-based linking, recalibration linking RC3). Of course, the extent of the bias and the loss in precision of the current methods depend on the variance of the DIF effects and can vary from study to study.
Our study assumed that the utilized scaling model (i.e., the 2PL model) was correctly specified. This assumption might be unrealistic in practice, and data could have been generated with much more complex item response functions [118,119,120,121,122] or multidimensional IRT models [123,124]. The performance of the linking methods for misspecified IRT models [125,126] in the presence of random DIF might be an exciting topic for future research [127,128].
In this article, we only considered random DIF effects with a normal distribution and zero means. For relevance in practical applications, it might be interesting for future research to study DIF effects under different data-generating models. First, random DIF could be simulated in a partial invariance situation in which only a few items have DIF effects, while the majority of items do not. Second, normally distributed random DIF and partial invariance could occur in tandem. DIF effects could then be simulated from a mixture distribution in which the first class includes normally distributed DIF effects with zero means, while the second, smaller class includes outlying and large DIF effects. Due to the presence of outliers, the average of the DIF effects will typically differ from zero. For these kinds of DIF effects, robust linking methods must be employed to remove the outlying items from group comparisons [65,71,89,94,107].
In this article, we only studied linking in the two-group case. In the case of multiple groups, different linking methods can be employed [65,84,93,129,130]. However, the findings from two groups are likely to generalize to the multiple group case if the linking were to be performed based on sequences of pairwise linking approaches [131,132,133]. Comparing the performance of pairwise linking approaches and simultaneous linking of multiple groups would be an exciting topic for future research.
Our findings are likely to have even more impact on vertical scaling, in which groups constitute time points or grades in the school career [134]. In this situation, differences in the means and SDs between groups are expected to be larger, and the consequences of choosing an inappropriate linking method can be much more pronounced [128,135,136,137,138]. As pointed out by an anonymous reviewer, vertical linking rests on stronger assumptions than cross-sectional linking because younger test-takers could adopt guessing strategies when responding to harder items designed for older test-takers (see also [25]).
It should be noted that we did not investigate the computation of standard errors for the linking methods. There is a rich literature that derives standard error formulas for linking due to sampling of persons (e.g., [85,104,132,139,140,141]). In addition, variability in estimated group means and SDs due to selecting items has been studied as linking errors in the literature [66,72,99,142,143,144,145]. It might be interesting in future research to investigate standard errors that reflect these sources of uncertainty [72,84,132,146]. Procedures that rely on resampling of persons and items will likely correctly reflect uncertainty due to persons and items in the parameters of interest.

Funding

This research received no external funding.

Data Availability Statement

The PISA 2006 and 2009 datasets are available from https://www.oecd.org/pisa/data/ (accessed on 12 July 2021).

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
1PL	one-parameter logistic model
2PL	two-parameter logistic model
ANCH	anchored item parameters
CC	concurrent calibration
DIF	differential item functioning
HAB	Haberman linking with logarithmized item discriminations
HAB-nolog	Haberman linking with untransformed item discriminations
HAE	Haebara linking
HAE-asymm	asymmetric Haebara linking
HAE-joint	Haebara linking with joint item parameters
HAE-symm	symmetric Haebara linking
IA2	invariance alignment with power p = 2
IRF	item response function
IRT	item response theory
logMM	log-mean-mean linking
LSA	large-scale assessment
MM	mean-mean linking
MML	marginal maximum likelihood
MSE	mean-squared error
NUDIF	nonuniform differential item functioning
PIRLS	Progress in International Reading Literacy Study
PISA	Programme for International Student Assessment
RC	recalibration linking
RMSE	root-mean-squared error
SD	standard deviation
TIMSS	Trends in International Mathematics and Science Study
UDIF	uniform differential item functioning

Appendix A. Nonidentifiability of DIF Effects Distributions

In this Appendix, we show that some constraints on the distributions of the DIF effects e_i and f_i have to be imposed in order to disentangle group differences in the ability distribution from average DIF effects. In Appendices A.1 and A.2, we discuss nonidentifiability for item difficulties and item discriminations, respectively.

Appendix A.1. DIF Effects for Item Difficulties

First, we show that the mean E(e_i) must be set to zero for reasons of identification. Assume that there are additive DIF effects for item difficulties (see Equation (4)) and that the DIF effects e_i have a mean different from zero (i.e., E(e_i) = \delta_e). Then, we can write e_i = \delta_e + \tilde{e}_i with E(\tilde{e}_i) = 0. The IRF for item i for persons in the first group (i.e., g = 1) can be written as:

P(X_i = 1 \,|\, \theta) = \Psi\big( a_i ( \theta - \tilde{b}_i + \tilde{e}_i ) \big), \quad \theta \sim N(0, 1), \qquad (A1)

where \tilde{b}_i = b_i - \delta_e. For persons in the second group, the ability distribution is given by \theta \sim N(\mu_2, \sigma_2^2). However, the IRF can be equivalently formulated as:

P(X_i = 1 \,|\, \theta) = \Psi\big( a_i ( \theta - \tilde{b}_i - \tilde{e}_i ) \big), \quad \theta \sim N(\mu_2 - 2\delta_e, \sigma_2^2). \qquad (A2)

Hence, the parameterization involving common item difficulties b_i, DIF effects e_i with E(e_i) = \delta_e \neq 0, and \theta \sim N(\mu_2, \sigma_2^2) for persons in the second group can be equivalently parameterized with common item difficulties \tilde{b}_i, DIF effects \tilde{e}_i with E(\tilde{e}_i) = 0, and \theta \sim N(\mu_2 - 2\delta_e, \sigma_2^2) for persons in the second group. This demonstrates that group mean differences in abilities cannot be identified if the average DIF effect E(e_i) is allowed to differ from zero.
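This equivalence can be verified numerically for a few ability values (a sketch with arbitrary illustrative parameter values):

```python
import math

def psi(x):
    """Logistic item response function."""
    return 1.0 / (1.0 + math.exp(-x))

a_i, b_i, e_raw, delta_e = 1.2, 0.4, 0.35, 0.25   # DIF effect with mean delta_e
e_tilde = e_raw - delta_e                          # centered DIF effect
b_tilde = b_i - delta_e                            # shifted difficulty absorbs delta_e

for theta in (-1.0, 0.0, 1.5):
    # group 1: the IRF is unchanged by shifting delta_e into the difficulty
    p1 = psi(a_i * (theta - b_i + e_raw))
    p1_repar = psi(a_i * (theta - b_tilde + e_tilde))
    assert abs(p1 - p1_repar) < 1e-12
    # group 2: shifting the ability by -2 * delta_e absorbs the nonzero DIF mean
    p2 = psi(a_i * (theta - b_i - e_raw))
    p2_repar = psi(a_i * ((theta - 2 * delta_e) - b_tilde - e_tilde))
    assert abs(p2 - p2_repar) < 1e-12
```

Because the two parameterizations produce identical response probabilities everywhere, no dataset can distinguish them.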

Appendix A.2. DIF Effects for Item Discriminations

Now, we show nonidentifiability for the DIF effects distribution of item discriminations. We assumed multiplicative DIF effects (see Equation (6)). It is shown that E(\log f_i) cannot be identified from the data if the standard deviation \sigma_2 is simultaneously estimated. Assume E(\log f_i) = \delta_f \neq 0. We can reparametrize the DIF effects as f_i = \exp(\delta_f) \tilde{f}_i with E(\log \tilde{f}_i) = 0. Then, we obtain the IRF for item i in the first group:

P(X_i = 1 \,|\, \theta) = \Psi\big( a_i f_i^{-1} ( \theta - b_i + e_i ) \big) = \Psi\big( \tilde{a}_i \tilde{f}_i^{-1} ( \theta - b_i + e_i ) \big), \quad \theta \sim N(0, 1), \qquad (A3)

where \tilde{a}_i = a_i \exp(-\delta_f). The IRF for persons in the second group is given as:

P(X_i = 1 \,|\, \theta) = \Psi\big( \tilde{a}_i \tilde{f}_i ( \theta - b_i - e_i ) \big), \quad \theta \sim N\big( \mu_2, [\sigma_2 \exp(2\delta_f)]^2 \big), \qquad (A4)

because a_i f_i = \tilde{a}_i \tilde{f}_i \exp(2\delta_f); that is, \sigma_2 and \exp(2\delta_f) are confounded. Hence, the mean of the DIF effects f_i (or \log f_i, respectively) must be fixed for identification reasons.

Appendix B. Proof of Proposition 1

Appendix B.1. Consistency of Additive DIF Effects f_i with Condition (I)

Define:
\hat{T}_g = \frac{1}{|I_0|} \sum_{i \in I_0} \log \hat{a}_{ig} \quad \text{for } g = 1, 2. \qquad (A5)

From Section 2.3.1 (Equation (9)), we know that \hat{a}_{i1} = a_{i1} = a_i - f_i and \hat{a}_{i2} = a_{i2} \sigma_2 = \sigma_2 (a_i + f_i). Hence, we obtain:

E(\hat{T}_2) = \frac{1}{|I_0|} \sum_{i \in I_0} E\big[ \log\big( \sigma_2 (a_i + f_i) \big) \big] = \log \sigma_2 + \frac{1}{|I_0|} \sum_{i \in I_0} E\big[ \log( a_i + f_i ) \big] \quad \text{and} \qquad (A6)

E(\hat{T}_1) = \frac{1}{|I_0|} \sum_{i \in I_0} E\big[ \log( a_i - f_i ) \big]. \qquad (A7)

Because the distribution of f_i is symmetric with E(f_i) = 0, we obtain E[\log(a_i + f_i)] = E[\log(a_i - f_i)]. Therefore, we obtain:

E(\hat{T}_2 - \hat{T}_1) = \log \sigma_2. \qquad (A8)

Trivially, it follows that \hat{T}_2 - \hat{T}_1 \overset{p}{\to} \log \sigma_2 as |I_0| \to \infty. Because \hat{\sigma}_2 = \exp(\hat{T}_2 - \hat{T}_1), we obtain by the continuous mapping theorem ([147], p. 7):

\hat{\sigma}_2 = \exp\big( \hat{T}_2 - \hat{T}_1 \big) \overset{p}{\to} \exp( \log \sigma_2 ) = \sigma_2. \qquad (A9)

To derive the consistency of \hat{\mu}_2, define:

\hat{B}_g = \frac{1}{|I_0|} \sum_{i \in I_0} \hat{b}_{ig} \quad \text{for } g = 1, 2. \qquad (A10)

From Equation (9) in Section 2.3.1, it follows that \hat{b}_{i1} = b_i - e_i and \hat{b}_{i2} = \sigma_2^{-1} ( b_i + e_i - \mu_2 ). Thus,

E(\hat{B}_2) = -\frac{\mu_2}{\sigma_2} + \frac{1}{\sigma_2} \frac{1}{|I_0|} \sum_{i \in I_0} b_i \quad \text{and} \qquad (A11)

E(\hat{B}_1) = \frac{1}{|I_0|} \sum_{i \in I_0} b_i. \qquad (A12)

We can rewrite:

\hat{\mu}_2 = -\hat{\sigma}_2 \hat{B}_2 + \hat{B}_1. \qquad (A13)

Assume the existence of B = \lim_{|I_0| \to \infty} \frac{1}{|I_0|} \sum_{i \in I_0} b_i. It holds that \hat{B}_2 \overset{p}{\to} -\mu_2/\sigma_2 + B/\sigma_2 and \hat{B}_1 \overset{p}{\to} B. Hence, we obtain from (A13):

\hat{\mu}_2 = -\hat{\sigma}_2 \hat{B}_2 + \hat{B}_1 \overset{p}{\to} -\sigma_2 \big( -\mu_2/\sigma_2 + B/\sigma_2 \big) + B = \mu_2. \qquad (A14)
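The consistency of the log-mean-mean (logMM) estimate under additive DIF can be illustrated with a quick Monte Carlo check (a sketch; the group-wise item parameters are generated directly from the identified-parameter relations used above instead of being estimated from response data, and all distributional choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2, mu2, n_items = 1.2, 0.3, 100_000   # consistency is a limit in the number of items

a = rng.uniform(0.7, 1.3, n_items)          # true discriminations
b = rng.normal(0.0, 1.0, n_items)           # true difficulties
f = rng.uniform(-0.2, 0.2, n_items)         # symmetric additive DIF, E(f_i) = 0
e = rng.normal(0.0, 0.3, n_items)           # random DIF on difficulties

a1, a2 = a - f, sigma2 * (a + f)            # group-wise identified discriminations
b1, b2 = b - e, (b + e - mu2) / sigma2      # group-wise identified difficulties

sigma2_hat = np.exp(np.mean(np.log(a2)) - np.mean(np.log(a1)))   # exp(T2 - T1)
mu2_hat = -sigma2_hat * np.mean(b2) + np.mean(b1)
# both estimates land close to the true values sigma2 = 1.2 and mu2 = 0.3
```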

Appendix B.2. Consistency for Multiplicative DIF Effects f_i with Condition (II)

From Section 2.3.1 (Equation (9)), we know that \hat{a}_{i1} = a_{i1} = a_i / f_i and \hat{a}_{i2} = a_{i2} \sigma_2 = \sigma_2 a_i f_i. Then, we obtain:

\hat{T}_2 - \hat{T}_1 = \log \sigma_2 + \frac{2}{|I_0|} \sum_{i \in I_0} \log f_i. \qquad (A15)

With E(\log f_i) = 0, we obtain from (A15):

E(\hat{T}_2 - \hat{T}_1) = \log \sigma_2. \qquad (A16)

This proves \hat{\sigma}_2 \overset{p}{\to} \sigma_2 using the same reasoning as in (A9).
The derivations for \hat{B}_g (g = 1, 2) are the same as in Appendix B.1. Due to the consistency of \hat{\sigma}_2, we also obtain the consistency of \hat{\mu}_2 as in (A14).

Appendix C. Proof of Proposition 2

Appendix C.1. Consistency for Additive DIF Effects f_i with Condition (I)

Define:
\hat{U}_g = \frac{1}{|I_0|} \sum_{i \in I_0} \hat{a}_{ig} \quad \text{for } g = 1, 2. \qquad (A17)

The expected values of \hat{U}_1 and \hat{U}_2 are given as:

E(\hat{U}_2) = \frac{1}{|I_0|} \sum_{i \in I_0} E\big[ \sigma_2 ( a_i + f_i ) \big] = \sigma_2 \frac{1}{|I_0|} \sum_{i \in I_0} a_i \quad \text{and} \qquad (A18)

E(\hat{U}_1) = \frac{1}{|I_0|} \sum_{i \in I_0} a_i. \qquad (A19)

Using the notation A = \lim_{|I_0| \to \infty} \frac{1}{|I_0|} \sum_{i \in I_0} a_i, we obtain \hat{U}_1 \overset{p}{\to} A and \hat{U}_2 \overset{p}{\to} \sigma_2 A. By the continuous mapping theorem ([147], p. 7), we obtain:

\hat{\sigma}_2 = \frac{\hat{U}_2}{\hat{U}_1} \overset{p}{\to} \sigma_2. \qquad (A20)
The steps for deriving the consistency of μ ^ 2 are the same as in Appendix B.1.

Appendix C.2. Consistency for Multiplicative DIF Effects f_i with Condition (II)

For multiplicative DIF effects, we have:

\hat{U}_2 = \frac{1}{|I_0|} \sum_{i \in I_0} \sigma_2 a_i f_i = \sigma_2 \frac{1}{|I_0|} \sum_{i \in I_0} a_i \exp( \log f_i ) \quad \text{and} \qquad (A21)

\hat{U}_1 = \frac{1}{|I_0|} \sum_{i \in I_0} a_i / f_i = \frac{1}{|I_0|} \sum_{i \in I_0} a_i \exp( -\log f_i ). \qquad (A22)

Because \log f_i has a symmetric distribution with zero mean, it follows that \alpha \equiv E[ \exp( \log f_i ) ] = E[ \exp( -\log f_i ) ]. Then, we have:

E(\hat{U}_2) = \sigma_2 \alpha \frac{1}{|I_0|} \sum_{i \in I_0} a_i \quad \text{and} \quad E(\hat{U}_1) = \alpha \frac{1}{|I_0|} \sum_{i \in I_0} a_i. \qquad (A23)

Assuming the existence of A = \lim_{|I_0| \to \infty} \frac{1}{|I_0|} \sum_{i \in I_0} a_i, one obtains \hat{U}_2 \overset{p}{\to} \sigma_2 \alpha A and \hat{U}_1 \overset{p}{\to} \alpha A. By the continuous mapping theorem ([147], p. 7), we obtain:

\hat{\sigma}_2 = \frac{\hat{U}_2}{\hat{U}_1} \overset{p}{\to} \frac{\sigma_2 \alpha A}{\alpha A} = \sigma_2. \qquad (A24)
As in Appendix C.1, the derivation for the consistency of μ ^ 2 does not require new calculations.
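An analogous Monte Carlo check applies to the mean-mean (MM) estimate under multiplicative DIF (a sketch; all parameter values and distributional choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
sigma2, n_items = 1.2, 100_000

a = rng.uniform(0.7, 1.3, n_items)
log_f = rng.normal(0.0, 0.25, n_items)     # log f_i symmetric with zero mean
f = np.exp(log_f)

a1, a2 = a / f, sigma2 * a * f             # group-wise identified discriminations
sigma2_hat = np.mean(a2) / np.mean(a1)     # mean-mean estimate U2 / U1
# the common factor alpha = E(exp(log f_i)) cancels in the ratio,
# so sigma2_hat lands close to the true value sigma2 = 1.2
```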

Appendix D. Estimates in Haberman Linking

For the HAB linking with logarithmized item loadings, the linking function for the standard deviation is given as:
H_{1,\log}(\sigma_2, \boldsymbol{a}) = \sum_{i \in I_0 \cup I_1} \big( \log \hat{a}_{i1} - \log a_i \big)^2 + \sum_{i \in I_0 \cup I_2} \big( \log \hat{a}_{i2} - \log a_i - \log \sigma_2 \big)^2. \qquad (A25)

Taking the derivative of H_{1,\log} with respect to \log a_i for i \in I_0 and setting it to zero provides (up to multiplication with a constant):

\big( \log \hat{a}_{i1} - \log a_i \big) + \big( \log \hat{a}_{i2} - \log a_i - \log \sigma_2 \big) = 0. \qquad (A26)

Similarly, setting the derivative of H_{1,\log} with respect to \log \sigma_2 to zero provides:

\sum_{i \in I_0} \big( \log \hat{a}_{i2} - \log a_i - \log \sigma_2 \big) = 0. \qquad (A27)

Summing the equations (A26) over all |I_0| items and subtracting (A27) provides:

\sum_{i \in I_0} \big( \log \hat{a}_{i1} - \log a_i \big) = 0. \qquad (A28)

With an estimate \log \hat{\sigma}_2 of \log \sigma_2, we obtain from (A26):

\log \hat{a}_i = \frac{1}{2} \big( \log \hat{a}_{i1} + \log \hat{a}_{i2} - \log \hat{\sigma}_2 \big). \qquad (A29)

Plugging (A29) into (A27) provides:

\log \hat{\sigma}_2 = \frac{1}{|I_0|} \sum_{i \in I_0} \log \hat{a}_{i2} - \frac{1}{|I_0|} \sum_{i \in I_0} \log \hat{a}_{i1}. \qquad (A30)

Hence, it holds that:

\hat{\sigma}_2 = \exp\Big( \frac{1}{|I_0|} \sum_{i \in I_0} \log \hat{a}_{i2} - \frac{1}{|I_0|} \sum_{i \in I_0} \log \hat{a}_{i1} \Big). \qquad (A31)
For Haberman linking with untransformed item discriminations (HAB-nolog), the linking function for the standard deviation of the second group is given as:
H_{1,\text{nolog}}(\sigma_2, \boldsymbol{a}) = \sum_{i \in I_0 \cup I_1} \big( \hat{a}_{i1} - a_i - 1 \big)^2 + \sum_{i \in I_0 \cup I_2} \big( \hat{a}_{i2} - a_i - \sigma_2 \big)^2. \qquad (A32)

Taking the derivative of H_{1,\text{nolog}} with respect to a_i for i \in I_0 and setting it to zero provides:

\big( \hat{a}_{i1} - a_i - 1 \big) + \big( \hat{a}_{i2} - a_i - \sigma_2 \big) = 0. \qquad (A33)

Estimates \hat{a}_i are then given by:

\hat{a}_i = \frac{1}{2} \big( \hat{a}_{i1} + \hat{a}_{i2} - ( 1 + \sigma_2 ) \big). \qquad (A34)

Setting the derivative of H_{1,\text{nolog}} with respect to \sigma_2 to zero provides:

\sum_{i \in I_0} \big( \hat{a}_{i2} - a_i - \sigma_2 \big) = 0. \qquad (A35)

Substituting (A34) into (A35) provides:

\hat{\sigma}_2 = 1 + \frac{1}{|I_0|} \sum_{i \in I_0} \hat{a}_{i2} - \frac{1}{|I_0|} \sum_{i \in I_0} \hat{a}_{i1}. \qquad (A36)
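Both closed-form estimates can be checked numerically; for the log version, a grid search over the profiled linking function recovers the closed form (a sketch with only common items, i.e., assuming I_1 and I_2 are empty, and arbitrary loading values):

```python
import numpy as np

rng = np.random.default_rng(3)
a1 = rng.uniform(0.6, 1.4, 30)                    # group-1 discriminations
a2 = 1.2 * a1 * rng.uniform(0.9, 1.1, 30)         # group-2: scaled plus noise

# closed-form Haberman estimates derived above
sig_log = np.exp(np.mean(np.log(a2)) - np.mean(np.log(a1)))     # HAB (log)
sig_nolog = 1.0 + np.mean(a2) - np.mean(a1)                     # HAB-nolog

def H_log(sig):
    """Linking function profiled over a_i: plug in the optimal log a_i per item."""
    la = 0.5 * (np.log(a1) + np.log(a2) - np.log(sig))
    return np.sum((np.log(a1) - la) ** 2) + np.sum((np.log(a2) - la - np.log(sig)) ** 2)

grid = np.linspace(0.8, 1.8, 2001)
best = grid[int(np.argmin([H_log(s) for s in grid]))]
# the grid minimizer agrees with the closed-form sig_log up to the grid resolution
```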

Appendix E. Estimates in Invariance Alignment

We now derive closed expressions for the estimates from IA2 (see Section 3.4). To estimate σ 2 , define:
f(\sigma_2) = \sum_{i \in I_0} \Big( \hat{a}_{i1} - \frac{ \hat{a}_{i2} }{ \sigma_2 } \Big)^2. \qquad (A37)

Then, it follows that:

f(\sigma_2) = \sum_{i \in I_0} \Big( \hat{a}_{i1}^2 + \frac{ \hat{a}_{i2}^2 }{ \sigma_2^2 } - 2 \frac{ \hat{a}_{i1} \hat{a}_{i2} }{ \sigma_2 } \Big). \qquad (A38)

The derivative of f in (A38) is:

\frac{ \partial f }{ \partial \sigma_2 } = \sum_{i \in I_0} \Big( -2 \frac{ \hat{a}_{i2}^2 }{ \sigma_2^3 } + 2 \frac{ \hat{a}_{i1} \hat{a}_{i2} }{ \sigma_2^2 } \Big). \qquad (A39)

To find the minimum, setting the derivative to zero provides:

\sum_{i \in I_0} \big( -\hat{a}_{i2}^2 + \hat{a}_{i1} \hat{a}_{i2} \sigma_2 \big) = 0, \qquad (A40)

and we have:

\hat{\sigma}_2 = \frac{ \sum_{i \in I_0} \hat{a}_{i2}^2 }{ \sum_{i \in I_0} \hat{a}_{i1} \hat{a}_{i2} }. \qquad (A41)

To estimate \mu_2, define:

g(\mu_2) = \sum_{i \in I_0} \Big( \hat{\nu}_{i1} - \hat{\nu}_{i2} - \mu_2 \frac{ \hat{a}_{i2} }{ \hat{\sigma}_2 } \Big)^2. \qquad (A42)

We obtain:

g(\mu_2) = \sum_{i \in I_0} \Big( ( \hat{\nu}_{i1} - \hat{\nu}_{i2} )^2 + \mu_2^2 \frac{ \hat{a}_{i2}^2 }{ \hat{\sigma}_2^2 } - 2 \mu_2 \frac{ \hat{a}_{i2} }{ \hat{\sigma}_2 } ( \hat{\nu}_{i1} - \hat{\nu}_{i2} ) \Big). \qquad (A43)

Setting the derivative of g with respect to \mu_2 to zero provides:

\sum_{i \in I_0} \Big( 2 \mu_2 \frac{ \hat{a}_{i2}^2 }{ \hat{\sigma}_2^2 } - 2 \frac{ \hat{a}_{i2} }{ \hat{\sigma}_2 } ( \hat{\nu}_{i1} - \hat{\nu}_{i2} ) \Big) = 0. \qquad (A44)

Then, we obtain using (A41):

\hat{\mu}_2 = \frac{ \sum_{i \in I_0} \frac{ \hat{a}_{i2} }{ \hat{\sigma}_2 } ( \hat{\nu}_{i1} - \hat{\nu}_{i2} ) }{ \sum_{i \in I_0} \frac{ \hat{a}_{i2}^2 }{ \hat{\sigma}_2^2 } } = \hat{\sigma}_2 \frac{ \sum_{i \in I_0} \hat{a}_{i2} ( \hat{\nu}_{i1} - \hat{\nu}_{i2} ) }{ \sum_{i \in I_0} \hat{a}_{i2}^2 } = \frac{ \sum_{i \in I_0} \hat{a}_{i2} ( \hat{\nu}_{i1} - \hat{\nu}_{i2} ) }{ \sum_{i \in I_0} \hat{a}_{i1} \hat{a}_{i2} }. \qquad (A45)

Appendix F. Item Parameters Used in the Simulation Study

In Table A1, the item parameters used in the simulation study are shown. Item discriminations a_i had a mean of 1.00 (SD = 0.28, Min = 0.50, Max = 1.42), and item difficulties b_i had a mean of 0.00 (SD = 1.00, Min = −1.62, Max = 1.39).
Table A1. Item parameters used in the simulation study.
Item	a_i	b_i
1	0.95	−0.97
2	0.88	0.59
3	0.75	0.75
4	1.29	−0.79
5	1.28	1.23
6	1.29	−1.10
7	1.25	−0.67
8	0.97	0.20
9	0.73	1.26
10	1.27	0.05
11	1.42	1.22
12	0.75	−0.01
13	0.50	0.20
14	0.81	1.39
15	1.12	0.61
16	0.78	−1.00
17	1.30	−1.58
18	0.70	−1.62
19	1.29	1.06
20	0.74	−0.81
Note. a_i = item discrimination; b_i = item difficulty.

References

  1. Cai, L.; Choi, K.; Hansen, M.; Harrell, L. Item response theory. Annu. Rev. Stat. Appl. 2016, 3, 297–321.
  2. van der Linden, W.J.; Hambleton, R.K. (Eds.) Handbook of Modern Item Response Theory; Springer: New York, NY, USA, 1997.
  3. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
  4. Battauz, M. Regularized estimation of the four-parameter logistic model. Psych 2020, 2, 269–278.
  5. Bürkner, P.C. Analysing standard progressive matrices (SPM-LS) with Bayesian item response models. J. Intell. 2020, 8, 5.
  6. Chang, H.H.; Wang, C.; Zhang, S. Statistical applications in educational measurement. Annu. Rev. Stat. Appl. 2021, 8, 439–461.
  7. Genge, E. LC and LC-IRT models in the identification of Polish households with similar perception of financial position. Sustainability 2021, 13, 4130.
  8. Jefmański, B.; Sagan, A. Item response theory models for the fuzzy TOPSIS in the analysis of survey data. Symmetry 2021, 13, 223.
  9. Karwowski, M.; Milerski, B. Who supports Polish educational reforms? Exploring actors’ and observers’ attitudes. Educ. Sci. 2021, 11, 120.
  10. Medová, J.; Páleníková, K.; Rybanský, L.; Naštická, Z. Undergraduate students’ solutions of modeling problems in algorithmic graph theory. Mathematics 2019, 7, 572.
  11. Mousavi, A.; Cui, Y. The effect of person misfit on item parameter estimation and classification accuracy: A simulation study. Educ. Sci. 2020, 10, 324.
  12. Palma-Vasquez, C.; Carrasco, D.; Hernando-Rodriguez, J.C. Mental health of teachers who have teleworked due to COVID-19. Eur. J. Investig. Health Psychol. Educ. 2021, 11, 515–528.
  13. Storme, M.; Myszkowski, N.; Baron, S.; Bernard, D. Same test, better scores: Boosting the reliability of short online intelligence recruitment tests with nested logit item response theory models. J. Intell. 2019, 7, 17.
  14. Tsutsumi, E.; Kinoshita, R.; Ueno, M. Deep item response theory as a novel test theory based on deep learning. Electronics 2021, 10, 1020.
  15. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479.
  16. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014.
  17. Lietz, P.; Cresswell, J.C.; Rust, K.F.; Adams, R.J. (Eds.) Implementation of Large-Scale Education Assessments; Wiley: New York, NY, USA, 2017.
  18. Maehler, D.B.; Rammstedt, B. (Eds.) Large-Scale Cognitive Assessment; Springer: New York, NY, USA, 2020.
  19. Wagemaker, H. International large-scale assessments: From research to policy. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2014; pp. 11–36.
  20. van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume One: Models; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30.
  21. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
  22. von Davier, A.A.; Carstensen, C.H.; von Davier, M. Linking Competencies in Educational Settings and Measuring Growth; Research Report No. RR-06-12; Educational Testing Service: Princeton, NJ, USA, 2006.
  23. von Davier, A.A.; Holland, P.W.; Thayer, D.T. The Kernel Method of Test Equating; Springer: New York, NY, USA, 2004.
  24. Bolsinova, M.; Maris, G. Can IRT solve the missing data problem in test equating? Front. Psychol. 2016, 6, 1956.
  25. Liou, M.; Cheng, P.E. Equipercentile equating via data-imputation techniques. Psychometrika 1995, 60, 119–136.
  26. Meredith, W. Measurement invariance, factor analysis and factorial invariance. Psychometrika 1993, 58, 525–543.
  27. Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011.
  28. van de Vijver, F.J.R. (Ed.) Invariance Analyses in Large-Scale Studies; OECD: Paris, France, 2019.
  29. Mellenbergh, G.J. Item bias and item response theory. Int. J. Educ. Res. 1989, 13, 127–143.
  30. Millsap, R.E.; Everson, H.T. Methodology review: Statistical approaches for assessing measurement bias. Appl. Psychol. Meas. 1993, 17, 297–334.
  31. Osterlind, S.J.; Everson, H.T. Differential Item Functioning; Sage Publications: New York, NY, USA, 2009.
  32. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167.
  33. Uyar, S.; Kelecioglu, H.; Dogan, N. Comparing differential item functioning based on manifest groups and latent classes. Educ. Sci. Theory Pract. 2017, 17, 1977–2000.
  34. Lee, S.Y.; Hong, A.J. Psychometric investigation of the cultural intelligence scale using the Rasch measurement model in South Korea. Sustainability 2021, 13, 3139.
  35. Mylona, I.; Aletras, V.; Ziakas, N.; Tsinopoulos, I. Rasch validation of the VF-14 scale of vision-specific functioning in Greek patients. Int. J. Environ. Res. Public Health 2021, 18, 4254.
  36. Pichette, F.; Béland, S.; Leśniewska, J. Detection of gender-biased items in the Peabody picture vocabulary test. Languages 2019, 4, 27.
  37. Shibaev, V.; Grigoriev, A.; Valueva, E.; Karlin, A. Differential item functioning on Raven’s SPM+ amongst two convenience samples of Yakuts and Russian. Psych 2020, 2, 44–51.
  38. Silvia, P.J.; Rodriguez, R.M. Time to renovate the humor styles questionnaire? An item response theory analysis of the HSQ. Behav. Sci. 2020, 10, 173.
  39. Hanson, B.A. Uniform DIF and DIF defined by differences in item response functions. J. Educ. Behav. Stat. 1998, 23, 244–253.
  40. Teresi, J.A.; Ramirez, M.; Lai, J.S.; Silver, S. Occurrences and sources of differential item functioning (DIF) in patient-reported outcome measures: Description of DIF methods, and review of measures of depression, quality of life and general health. Psychol. Sci. 2008, 50, 538–612.
  41. Buchholz, J.; Hartig, J. Measurement invariance testing in questionnaires: A comparison of three multigroup-CFA and IRT-based approaches. Psych. Test Assess. Model. 2020, 62, 29–53.
  42. Chalmers, R.P. Extended mixed-effects item response models with the MH-RM algorithm. J. Educ. Meas. 2015, 52, 200–222.
  43. De Boeck, P.; Wilson, M. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach; Springer: New York, NY, USA, 2004.
  44. De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559. [Google Scholar] [CrossRef]
  45. de Jong, M.G.; Steenkamp, J.B.E.M.; Fox, J.P. Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. J. Consum. Res. 2007, 34, 260–278. [Google Scholar] [CrossRef]
  46. Doran, H.; Bates, D.; Bliese, P.; Dowling, M. Estimating the multilevel Rasch model: With the lme4 package. J. Stat. Softw. 2007, 20, 1–18. [Google Scholar] [CrossRef]
  47. Fox, J.P.; Verhagen, A.J. Random item effects modeling for cross-national survey data. In Cross-Cultural Analysis: Methods and Applications; Davidov, E., Schmidt, P., Billiet, J., Eds.; Routledge: London, UK, 2010; pp. 461–482. [Google Scholar] [CrossRef]
  48. Van den Noortgate, W.; De Boeck, P. Assessing and explaining differential item functioning using logistic mixed models. J. Educ. Behav. Stat. 2005, 30, 443–464. [Google Scholar] [CrossRef]
  49. Muthén, B.; Asparouhov, T. Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychol. Methods 2012, 17, 313–335. [Google Scholar] [CrossRef] [PubMed]
  50. van de Schoot, R.; Kluytmans, A.; Tummers, L.; Lugtig, P.; Hox, J.; Muthén, B. Facing off with scylla and charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Front. Psychol. 2013, 4, 770. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  51. Bechger, T.M.; Maris, G. A statistical test for differential item pair functioning. Psychometrika 2015, 80, 317–340. [Google Scholar] [CrossRef]
  52. Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417. [Google Scholar]
  53. Doebler, A. Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Appl. Psychol. Meas. 2019, 43, 303–321. [Google Scholar] [CrossRef] [PubMed]
  54. Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psych. Test Assess. Model. 2020, 62, 233–279. [Google Scholar]
  55. Frederickx, S.; Tuerlinckx, F.; De Boeck, P.; Magis, D. RIM: A random item mixture model to detect differential item functioning. J. Educ. Meas. 2010, 47, 432–457. [Google Scholar] [CrossRef]
  56. Byrne, B.M.; Shavelson, R.J.; Muthén, B. Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychol. Bull. 1989, 105, 456–466. [Google Scholar] [CrossRef]
  57. Magis, D.; Tuerlinckx, F.; De Boeck, P. Detection of differential item functioning using the lasso approach. J. Educ. Behav. Stat. 2015, 40, 111–135. [Google Scholar] [CrossRef]
  58. Tutz, G.; Schauberger, G. A penalty approach to differential item functioning in Rasch models. Psychometrika 2015, 80, 21–43. [Google Scholar] [CrossRef] [Green Version]
  59. Soares, T.M.; Gonçalves, F.B.; Gamerman, D. An integrated Bayesian model for DIF analysis. J. Educ. Behav. Stat. 2009, 34, 348–377. [Google Scholar] [CrossRef]
  60. Kopf, J.; Zeileis, A.; Strobl, C. Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educ. Psychol. Meas. 2015, 75, 22–56. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  61. Magis, D.; Béland, S.; Tuerlinckx, F.; De Boeck, P. A general framework and an R package for the detection of dichotomous differential item functioning. Behav. Res. Methods 2010, 42, 847–862. [Google Scholar] [CrossRef] [Green Version]
  62. Teresi, J.A.; Ramirez, M.; Jones, R.N.; Choi, S.; Crane, P.K. Modifying measures based on differential item functioning (DIF) impact analyses. J. Aging Health 2012, 24, 1044–1076. [Google Scholar] [CrossRef] [Green Version]
  63. DeMars, C.E. Alignment as an alternative to anchor purification in DIF analyses. Struct. Equ. Model. 2020, 27, 56–72. [Google Scholar] [CrossRef]
  64. Lai, M.H.C.; Liu, Y.; Tse, W.W.Y. Adjusting for partial invariance in latent parameter estimation: Comparing forward specification search and approximate invariance methods. Behav. Res. Methods 2021. [Google Scholar] [CrossRef] [PubMed]
  65. Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2021. [Google Scholar] [CrossRef]
  66. Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171. [Google Scholar] [CrossRef]
  67. Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psych. Test Assess. Model. 2011, 53, 315–333. [Google Scholar]
  68. Oliveri, M.E.; von Davier, M. Toward increasing fairness in score scale calibrations employed in international large-scale assessments. Int. J. Test. 2014, 14, 1–21. [Google Scholar] [CrossRef]
  69. OECD. PISA 2015. Technical Report; OECD: Paris, France, 2017. [Google Scholar]
  70. von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
  71. Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 246–283. [Google Scholar] [CrossRef]
  72. Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
  73. El Masri, Y.H.; Andrich, D. The trade-off between model fit, invariance, and validity: The case of PISA science assessments. Appl. Meas. Educ. 2020, 33, 174–188. [Google Scholar] [CrossRef]
  74. Shealy, R.; Stout, W. A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika 1993, 58, 159–194. [Google Scholar] [CrossRef]
  75. Zwitser, R.J.; Glaser, S.S.F.; Maris, G. Monitoring countries in a changing world: A new look at DIF in international surveys. Psychometrika 2017, 82, 210–232. [Google Scholar] [CrossRef]
  76. Aitkin, M. Expectation maximization algorithm and extensions. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 217–236. [Google Scholar] [CrossRef]
  77. Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
  78. von Davier, M.; Sinharay, S. Analytics in international large-scale assessments: Item response theory and population models. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2014; pp. 155–174. [Google Scholar] [CrossRef]
  79. Robitzsch, A. A note on a computationally efficient implementation of the EM algorithm in item response models. Quant. Comput. Methods Behav. Sci. 2021, 1, e3783. [Google Scholar] [CrossRef]
  80. González, J.; Wiberg, M. Applying Test Equating Methods. Using R; Springer: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
  81. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673. [Google Scholar] [CrossRef]
  82. Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
  83. Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model through Separate Calibrations; (Research Report No. RR-09-40); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
  84. Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636. [Google Scholar] [CrossRef] [PubMed]
  85. Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508. [Google Scholar] [CrossRef]
  86. Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978. [Google Scholar] [CrossRef] [Green Version]
  87. Muthén, B.; Asparouhov, T. Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociol. Methods Res. 2018, 47, 637–664. [Google Scholar] [CrossRef]
  88. Pokropek, A.; Davidov, E.; Schmidt, P. A Monte Carlo simulation study to assess the appropriateness of traditional and newer approaches to test for measurement invariance. Struct. Equ. Model. 2019, 26, 724–744. [Google Scholar] [CrossRef] [Green Version]
  89. Pokropek, A.; Lüdtke, O.; Robitzsch, A. An extension of the invariance alignment method for scale linking. Psych. Test Assess. Model. 2020, 62, 303–334. [Google Scholar]
  90. Haebara, T. Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 1980, 22, 144–149. [Google Scholar] [CrossRef] [Green Version]
  91. Kim, S.; Kolen, M.J. Effects on scale linking of different definitions of criterion functions for the IRT characteristic curve methods. J. Educ. Behav. Stat. 2007, 32, 371–397. [Google Scholar] [CrossRef]
  92. Weeks, J.P. plink: An R package for linking mixed-format tests using IRT-based methods. J. Stat. Softw. 2010, 35, 1–33. [Google Scholar] [CrossRef]
  93. Arai, S.; Mayekawa, S.i. A comparison of equating methods and linking designs for developing an item pool under item response theory. Behaviormetrika 2011, 38, 1–16. [Google Scholar] [CrossRef]
  94. Robitzsch, A. Robust Haebara linking for many groups: Performance in the case of uniform DIF. Psych 2020, 2, 155–173. [Google Scholar] [CrossRef]
  95. OECD. PISA 2006. Technical Report; OECD: Paris, France, 2009. [Google Scholar]
  96. Foy, P.; Yin, L. Scaling the PIRLS 2016 achievement data. In Methods and Procedures in PIRLS 2016; Martin, M.O., Mullis, I.V., Hooper, M., Eds.; IEA: Newton, MA, USA, 2017. [Google Scholar]
  97. Foy, P.; Yin, L. Scaling the TIMSS 2015 achievement data. In Methods and Procedures in TIMSS 2015; Martin, M.O., Mullis, I.V., Hooper, M., Eds.; IEA: Newton, MA, USA, 2016. [Google Scholar]
  98. Foy, P.; Fishbein, B.; von Davier, M.; Yin, L. Implementing the TIMSS 2019 scaling methodology. In Methods and Procedures: TIMSS 2019 Technical Report; Martin, M.O., von Davier, M., Mullis, I.V., Eds.; IEA: Newton, MA, USA, 2020. [Google Scholar]
  99. Gebhardt, E.; Adams, R.J. The influence of equating methodology on reported trends in PISA. J. Appl. Meas. 2007, 8, 305–322. [Google Scholar]
  100. Fishbein, B.; Martin, M.O.; Mullis, I.V.S.; Foy, P. The TIMSS 2019 item equivalence study: Examining mode effects for computer-based assessment and implications for measuring trends. Large-Scale Assess. Educ. 2018, 6, 11. [Google Scholar] [CrossRef] [Green Version]
  101. Martin, M.O.; Mullis, I.V.S.; Foy, P.; Brossman, B.; Stanco, G.M. Estimating linking error in PIRLS. IERI Monogr. Ser. 2012, 5, 35–47. [Google Scholar]
  102. Kim, S.H.; Cohen, A.S. A comparison of linking and concurrent calibration under item response theory. Appl. Psychol. Meas. 1998, 22, 131–143. [Google Scholar] [CrossRef] [Green Version]
  103. Hanson, B.A.; Béguin, A.A. Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Appl. Psychol. Meas. 2002, 26, 3–24. [Google Scholar] [CrossRef]
  104. Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205. [Google Scholar] [CrossRef]
  105. Demirus, K.; Gelbal, S. The study of the effect of anchor items showing or not showing differantial item functioning to test equating using various methods. J. Meas. Eval. Educ. Psychol. 2016, 7, 182–201. [Google Scholar] [CrossRef] [Green Version]
  106. Gübes, N.; Uyar, S. Comparing performance of different equating methods in presence and absence of DIF Items in anchor test. Int. J. Progress. Educ. 2020, 16, 111–122. [Google Scholar] [CrossRef]
  107. He, Y.; Cui, Z. Evaluating robust scale transformation methods with multiple outlying common items under IRT true score equating. Appl. Psychol. Meas. 2020, 44, 296–310. [Google Scholar] [CrossRef]
  108. Inal, H.; Anil, D. Investigation of group invariance in test equating under different simulation conditions. Eurasian J. Educ. Res. 2018, 18, 67–86. [Google Scholar] [CrossRef]
  109. Kabasakal, K.A.; Kelecioğlu, H. Effect of differential item functioning on test equating. Educ. Sci. Theory Pract. 2015, 15, 1229–1246. [Google Scholar] [CrossRef] [Green Version]
  110. Tulek, O.K.; Kose, I.A. Comparison of different forms of a test with or without items that exhibit DIF. Eurasian J. Educ. Res. 2019, 19, 167–182. [Google Scholar] [CrossRef]
  111. Pohl, S.; Schulze, D. Assessing group comparisons or change over time under measurement non-invariance: The cluster approach for nonuniform DIF. Psych. Test Assess. Model. 2020, 62, 281–303. [Google Scholar]
  112. Yurtçu, M.; Güzeller, C.O. Investigation of equating error in tests with differential item functioning. Int. J. Assess. Tool. Educ. 2018, 5, 50–57. [Google Scholar] [CrossRef]
  113. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 20 August 2020).
  114. Robitzsch, A.; Kiefer, T.; Wu, M. TAM: Test Analysis Modules; R Package Version 3.7-6. 2021. Available online: https://CRAN.R-project.org/package=TAM (accessed on 25 June 2021).
  115. Robitzsch, A. Sirt: Supplementary Item Response Theory Models; R Package Version 3.9-4. 2020. Available online: https://CRAN.R-project.org/package=sirt (accessed on 17 February 2020).
  116. Bates, D.; Mächler, M.; Bolker, B.; Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 2015, 67, 1–48. [Google Scholar] [CrossRef]
  117. OECD. PISA 2009. Technical Report; OECD: Paris, France, 2012. [Google Scholar]
  118. Falk, C.F.; Cai, L. Semiparametric item response functions in the context of guessing. J. Educ. Meas. 2016, 53, 229–247. [Google Scholar] [CrossRef]
  119. Feuerstahler, L.M. Metric transformations and the filtered monotonic polynomial item response model. Psychometrika 2019, 84, 105–123. [Google Scholar] [CrossRef] [PubMed]
  120. Feuerstahler, L. Flexible item response modeling in R with the flexmet package. Psych 2021, 3, 447–478. [Google Scholar] [CrossRef]
  121. Ramsay, J.O.; Winsberg, S. Maximum marginal likelihood estimation for semiparametric item analysis. Psychometrika 1991, 56, 365–379. [Google Scholar] [CrossRef]
  122. Rossi, N.; Wang, X.; Ramsay, J.O. Nonparametric item response function estimates with the EM algorithm. J. Educ. Behav. Stat. 2002, 27, 291–317. [Google Scholar] [CrossRef] [Green Version]
  123. Anderson, D.; Kahn, J.D.; Tindal, G. Exploring the robustness of a unidimensional item response theory model with empirically multidimensional data. Appl. Meas. Educ. 2017, 30, 163–177. [Google Scholar] [CrossRef]
  124. Martineau, J.A. Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. J. Educ. Behav. Stat. 2006, 31, 35–62. [Google Scholar] [CrossRef] [Green Version]
  125. Köhler, C.; Hartig, J. Practical significance of item misfit in educational assessments. Appl. Psychol. Meas. 2017, 41, 388–400. [Google Scholar] [CrossRef] [PubMed]
  126. Sinharay, S.; Haberman, S.J. How often is the misfit of item response theory models practically significant? Educ. Meas. 2014, 33, 23–35. [Google Scholar] [CrossRef]
  127. Zhao, Y.; Hambleton, R.K. Practical consequences of item response theory model misfit in the context of test equating with mixed-format test data. Front. Psychol. 2017, 8, 484. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  128. Bolt, D.M.; Deng, S.; Lee, S. IRT model misspecification and measurement of growth in vertical scaling. J. Educ. Meas. 2014, 51, 141–162. [Google Scholar] [CrossRef]
  129. Guo, H.; Liu, J.; Dorans, N.; Feigenbaum, M. Multiple Linking in Equating and Random Scale Drift; (Research Report No. RR-11-46); Educational Testing Service: Princeton, NJ, USA, 2011. [Google Scholar] [CrossRef]
  130. Puhan, G. Detecting and correcting scale drift in test equating: An illustration from a large scale testing program. Appl. Meas. Educ. 2008, 22, 79–103. [Google Scholar] [CrossRef]
  131. Battauz, M. IRT test equating in complex linkage plans. Psychometrika 2013, 78, 464–480. [Google Scholar] [CrossRef] [PubMed]
  132. Battauz, M. Factors affecting the variability of IRT equating coefficients. Stat. Neerl. 2015, 69, 85–101. [Google Scholar] [CrossRef]
  133. Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22. [Google Scholar] [CrossRef] [Green Version]
  134. Briggs, D.C.; Weeks, J.P. The sensitivity of value-added modeling to the creation of a vertical score scale. Educ. Financ. Policy 2009, 4, 384–414. [Google Scholar] [CrossRef]
  135. Bjermo, J.; Miller, F. Efficient estimation of mean ability growth using vertical scaling. Appl. Meas. Educ. 2021. [Google Scholar] [CrossRef]
  136. Fischer, L.; Rohm, T.; Carstensen, C.H.; Gnambs, T. Linking of Rasch-scaled tests: Consequences of limited item pools and model misfit. Front. Psychol. 2021, 12, 633896. [Google Scholar] [CrossRef] [PubMed]
  137. Pohl, S.; Haberkorn, K.; Carstensen, C.H. Measuring competencies across the lifespan-challenges of linking test scores. In Dependent Data in Social Sciences Research; Stemmler, M., von Eye, A., Wiedermann, W., Eds.; Springer: Cham, Switzerland, 2015; pp. 281–308. [Google Scholar] [CrossRef]
  138. Tong, Y.; Kolen, M.J. Comparisons of methodologies and results in vertical scaling for educational achievement tests. Appl. Meas. Educ. 2007, 20, 227–253. [Google Scholar] [CrossRef]
  139. Barrett, M.D.; van der Linden, W.J. Estimating linking functions for response model parameters. J. Educ. Behav. Stat. 2019, 44, 180–209. [Google Scholar] [CrossRef]
  140. Jewsbury, P.A. Error Variance in Common Population Linking Bridge Studies; (Research Report No. RR-19-42); Educational Testing Service: Princeton, NJ, USA, 2019. [Google Scholar] [CrossRef] [Green Version]
  141. Ogasawara, H. Standard errors of item response theory equating/linking by response function methods. Appl. Psychol. Meas. 2001, 25, 53–67. [Google Scholar] [CrossRef]
  142. Haberman, S.J.; Lee, Y.H.; Qian, J. Jackknifing Techniques for Evaluation of Equating Accuracy; (Research Report No. RR-09-02); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
  143. Michaelides, M.P. A review of the effects on IRT item parameter estimates with a focus on misbehaving common items in test equating. Front. Psychol. 2010, 1, 167. [Google Scholar] [CrossRef] [Green Version]
  144. Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. [Google Scholar] [PubMed]
  145. Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122. [Google Scholar]
  146. Xu, X.; von Davier, M. Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study; (Research Report No. RR-10-10); Educational Testing Service: Princeton, NJ, USA, 2010. [Google Scholar] [CrossRef]
  147. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 1998; Volume 3. [Google Scholar] [CrossRef]
Figure 1. Linking design for two groups with common items I0 and group-specific unique items I1 and I2.
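The common-item design of Figure 1 can be made concrete with a small sketch that enumerates which items each group is administered. The item labels and counts below are hypothetical and chosen only for illustration; a real linking study would substitute its own item sets.

```python
# Hypothetical item labels: I0 is administered to both groups (the link),
# while I1 and I2 are unique to group 1 and group 2, respectively.
I0 = [f"C{i}" for i in range(1, 11)]    # 10 common (anchor) items
I1 = [f"U1_{i}" for i in range(1, 6)]   # 5 items unique to group 1
I2 = [f"U2_{i}" for i in range(1, 6)]   # 5 items unique to group 2

# Test booklet per group: the common items plus the group-specific items.
booklet = {1: I0 + I1, 2: I0 + I2}

# Only the items taken by both groups carry information for the linking.
link_items = sorted(set(booklet[1]) & set(booklet[2]))
```

Separate calibration estimates item parameters per booklet; the linking methods compared below then align the two scales through `link_items` only.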
Table 1. Variance proportions (in %) of different factors in the simulation study for the bias and RMSE of the estimated mean μ̂2 and estimated SD σ̂2 for the second group.

| Source | Bias (μ̂2) | RMSE (μ̂2) | Bias (σ̂2) | RMSE (σ̂2) |
|---|---|---|---|---|
| N | 0.3 | **1.1** | 0.6 | **3.9** |
| I | 0.0 | 0.3 | 0.0 | 0.0 |
| Meth | **10.2** | **14.9** | **19.1** | 0.0 |
| τb | **13.0** | 0.0 | 0.8 | **1.8** |
| τa | **4.3** | **9.0** | **12.3** | 0.0 |
| N × I | 0.0 | 0.0 | 0.0 | 0.0 |
| N × Meth | 0.0 | **3.7** | 0.8 | 0.0 |
| N × τb | 0.0 | **2.4** | 0.0 | 0.0 |
| N × τa | 0.0 | 0.6 | 0.0 | **5.0** |
| I × Meth | 0.4 | 0.1 | 0.1 | 0.0 |
| I × τb | 0.0 | 0.1 | 0.0 | 0.0 |
| I × τa | 0.0 | 0.0 | 0.0 | 0.6 |
| Meth × τb | **58.1** | **13.1** | **17.5** | **14.2** |
| Meth × τa | **8.2** | **12.1** | **47.7** | **13.2** |
| τa × τb | 0.0 | **4.1** | 0.0 | **17.7** |
| N × I × Meth | 0.0 | 0.0 | 0.1 | 0.0 |
| N × I × τb | 0.0 | 0.2 | 0.0 | 0.0 |
| N × I × τa | 0.0 | 0.1 | 0.0 | 0.4 |
| N × Meth × τb | 0.2 | **7.5** | 0.0 | **4.2** |
| N × Meth × τa | 0.0 | **4.0** | 0.0 | **9.1** |
| N × τa × τb | 0.1 | **10.0** | 0.0 | **13.8** |
| I × Meth × τb | 0.5 | 0.0 | 0.0 | 0.4 |
| I × Meth × τa | 0.1 | 0.0 | 0.2 | **1.1** |
| I × τa × τb | 0.1 | 0.3 | 0.0 | 0.7 |
| Meth × τa × τb | 1.0 | **10.1** | 0.1 | **8.2** |
| Residual | **3.7** | **6.4** | 0.6 | **5.7** |

Note. N = sample size; I = number of items; Meth = linking method; τa = standard deviation of DIF effects in item discriminations ai; τb = standard deviation of DIF effects in item difficulties bi. Percentage values larger than 1.0 are printed in bold.
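Variance proportions of this kind can be obtained by treating the simulation conditions as a balanced factorial design and decomposing the sums of squares of the outcome (e.g., the RMSE of μ̂2 per condition). The following sketch shows the idea for two factors only; the function name is made up, and the actual analysis involves more factors (N, I, Meth, τa, τb).

```python
from statistics import mean

def variance_proportions(cells):
    """Share of total variance attributable to factors A and B and their
    interaction, for a balanced two-factor design with one outcome per cell.

    cells maps (level_a, level_b) -> outcome, e.g. the RMSE of the
    estimated group-2 mean in one simulation condition."""
    grand = mean(cells.values())
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    # marginal means per factor level
    mean_a = {a: mean(v for (x, _), v in cells.items() if x == a) for a in levels_a}
    mean_b = {b: mean(v for (_, y), v in cells.items() if y == b) for b in levels_b}
    ss_a = len(levels_b) * sum((mean_a[a] - grand) ** 2 for a in levels_a)
    ss_b = len(levels_a) * sum((mean_b[b] - grand) ** 2 for b in levels_b)
    ss_total = sum((v - grand) ** 2 for v in cells.values())
    ss_ab = ss_total - ss_a - ss_b  # remainder = interaction
    return {"A": ss_a / ss_total, "B": ss_b / ss_total, "AxB": ss_ab / ss_total}

# Purely additive outcomes: the interaction share is (numerically) zero.
cells = {(a, b): 2.0 * a + b for a in (0, 1) for b in (0, 1, 2)}
props = variance_proportions(cells)
```

Large interaction shares, such as Meth × τb in Table 1, indicate that the ranking of linking methods depends on the amount of random DIF.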
Table 2. Summary of the satisfactory performance of linking methods for the absolute bias and RMSE across parameters (mean μ̂2 and standard deviation σ̂2) and conditions.

| Method | Bias: NODIF | Bias: UDIF | Bias: NUDIF | RMSE: NODIF | RMSE: UDIF | RMSE: NUDIF |
|---|---|---|---|---|---|---|
| logMM | 100 | 97 | 94 | 100 | 100 | **45** |
| HAB | 100 | 97 | 94 | 100 | 100 | **44** |
| MM | 100 | 94 | 95 | 92 | 100 | 72 |
| HAB-nolog | 100 | 94 | 96 | 100 | 100 | 78 |
| IA2 | 75 | 78 | **8** | 100 | 100 | **4** |
| HAE-asymm | 100 | **42** | **42** | 100 | **61** | 78 |
| HAE-symm | 100 | 97 | 94 | 100 | **61** | 81 |
| HAE-joint | 100 | **42** | **60** | 100 | **42** | **61** |
| RC1 | 83 | 78 | **16** | 100 | **61** | **29** |
| RC2 | 83 | 78 | **8** | 100 | **61** | **48** |
| RC3 | 100 | 94 | 96 | 100 | **61** | 79 |
| ANCH | 83 | 78 | **13** | 100 | **61** | **48** |
| CC | 100 | **50** | **45** | 100 | **33** | **46** |

Note. DIF = differential item functioning; NODIF = no DIF; UDIF = uniform DIF; NUDIF = nonuniform DIF; logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration. Values smaller than 70 are printed in bold.
Table 3. Bias and RMSE for mean μ̂2 and standard deviation σ̂2 for the second group for a sample size N = 1000 and I = 40 items as a function of the type of differential item functioning and linking method.

Mean μ̂2

| Method | Bias: NODIF | Bias: UDIF | Bias: NUDIF | RMSE: NODIF | RMSE: UDIF | RMSE: NUDIF |
|---|---|---|---|---|---|---|
| logMM | 0.000 | 0.007 | 0.008 | 108.2 | 104.4 | 106.1 |
| HAB | 0.000 | 0.007 | 0.008 | 108.2 | 104.4 | 106.1 |
| MM | 0.000 | 0.007 | 0.007 | 108.1 | 103.7 | 104.7 |
| HAB-nolog | 0.001 | 0.007 | 0.007 | 108.5 | 103.5 | 104.5 |
| IA2 | −0.001 | 0.001 | **0.045** | 103.2 | 107.5 | **133.3** |
| HAE-asymm | −0.002 | **−0.030** | **−0.032** | 102.3 | 100.0 | 100.0 |
| HAE-symm | −0.001 | 0.002 | 0.005 | 102.7 | 105.0 | 105.2 |
| HAE-joint | −0.002 | **0.067** | **0.064** | 100.9 | **136.1** | **132.4** |
| RC1 | −0.001 | 0.001 | **0.028** | 100.2 | 104.8 | **120.5** |
| RC2 | −0.006 | −0.004 | **−0.022** | 100.0 | 104.0 | 100.1 |
| RC3 | −0.003 | −0.001 | 0.002 | 100.1 | 103.9 | 109.4 |
| ANCH | −0.003 | −0.004 | **−0.021** | 101.4 | 104.2 | 103.9 |
| CC | −0.002 | **0.095** | **0.109** | 101.3 | **149.2** | **157.7** |

Standard deviation σ̂2

| Method | Bias: NODIF | Bias: UDIF | Bias: NUDIF | RMSE: NODIF | RMSE: UDIF | RMSE: NUDIF |
|---|---|---|---|---|---|---|
| logMM | 0.000 | 0.003 | 0.008 | 110.2 | 112.6 | **128.9** |
| HAB | 0.000 | 0.003 | 0.008 | 110.2 | 112.6 | **129.4** |
| MM | −0.001 | 0.001 | 0.005 | 108.5 | 109.4 | 107.7 |
| HAB-nolog | 0.001 | 0.002 | 0.007 | 100.0 | 100.0 | 100.0 |
| IA2 | 0.009 | 0.009 | **0.147** | 113.2 | 111.6 | **197.9** |
| HAE-asymm | −0.002 | **−0.120** | **−0.134** | 107.2 | **378.8** | **185.6** |
| HAE-symm | 0.001 | −0.003 | 0.003 | 108.3 | **233.7** | 119.9 |
| HAE-joint | −0.001 | 0.020 | **0.029** | 107.5 | **317.0** | **146.6** |
| RC1 | 0.006 | 0.008 | **0.105** | 109.8 | **243.8** | **174.5** |
| RC2 | −0.009 | −0.008 | **−0.097** | 108.5 | **217.2** | **148.3** |
| RC3 | −0.002 | 0.000 | 0.002 | 106.6 | **228.3** | 110.2 |
| ANCH | −0.009 | −0.008 | **−0.097** | 108.5 | **217.2** | **148.3** |
| CC | −0.001 | 0.015 | **0.029** | 107.4 | **220.4** | **129.0** |

Note: DIF = differential item functioning; NODIF = no DIF (τb = 0, τa = 0); UDIF = uniform DIF (τb = 0.5, τa = 0); NUDIF = nonuniform DIF (τb = 0.5, τa = 0.25); τa = standard deviation of DIF effects in item discriminations ai; τb = standard deviation of DIF effects in item difficulties bi; logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration. Absolute biases larger than 0.02 are printed in bold. RMSE values larger than 120 are printed in bold.
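For two groups, the mean-mean and log-mean-mean estimates have simple closed forms. If each group is calibrated separately with θ ∼ N(0, 1), then under invariance the group-2 item parameters satisfy a2i = σ2·a1i and b2i = (b1i − μ2)/σ2, so σ2 is identified from the (averaged) discriminations and μ2 from the difficulties. A minimal sketch of this idea, not the sirt implementation (function name hypothetical):

```python
import math
from statistics import mean

def two_group_linking(a1, b1, a2, b2, log_discriminations=False):
    """Recover (mu2, sigma2) of group 2 on the group-1 metric from two
    separate 2PL calibrations that each fixed theta ~ N(0, 1).

    Under invariance, a2[i] = sigma2 * a1[i] and b2[i] = (b1[i] - mu2) / sigma2.
    """
    if log_discriminations:
        # log-mean-mean: ratio of geometric means of the discriminations
        sigma2 = math.exp(mean(map(math.log, a2)) - mean(map(math.log, a1)))
    else:
        # mean-mean: ratio of arithmetic means of the discriminations
        sigma2 = mean(a2) / mean(a1)
    mu2 = mean(b1) - sigma2 * mean(b2)
    return mu2, sigma2

# Invariant item parameters with true mu2 = 0.3 and sigma2 = 1.2:
a1 = [0.8, 1.0, 1.4]
b1 = [-0.5, 0.0, 0.5]
a2 = [1.2 * a for a in a1]
b2 = [(b - 0.3) / 1.2 for b in b1]
mu2, sigma2 = two_group_linking(a1, b1, a2, b2)
```

With random DIF, the DIF effects simply enter these averages, which illustrates the article's consistency result: when the DIF effects have zero means, the averaged effects vanish as the number of items grows, so (log-)mean-mean linking remains consistent.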
Table 4. Sample information and descriptive results for PISA 2006 and PISA 2009 Austria.

| Domain | N (P06) | N (P09) | I (P06) | I (P09) | M (P06) | M (P09) | SD (P06) | SD (P09) |
|---|---|---|---|---|---|---|---|---|
| Mathematics | 3784 | 4575 | 48 | 35 | 506.8 | 495.9 | 96.8 | 96.1 |
| Reading | 2646 | 6585 | 27 | 99 | 491.2 | 470.3 | 107.7 | 100.1 |
| Science | 4927 | 4577 | 103 | 53 | 511.7 | 494.3 | 97.3 | 101.8 |

Note: P06 = PISA 2006; P09 = PISA 2009; N = number of students; I = number of items; M = mean; SD = standard deviation.
Table 5. Trend estimate for Austrian students in average achievement from PISA 2006 to PISA 2009.

| Method | Mathematics: 1PL | Mathematics: 2PL | Reading: 1PL | Reading: 2PL | Science: 1PL | Science: 2PL |
|---|---|---|---|---|---|---|
| logMM | −15.5 | −12.4 | −5.8 | −6.3 | −14.7 | −16.8 |
| HAB | −15.5 | −12.4 | −5.8 | −6.3 | −14.7 | −16.8 |
| MM | −15.5 | −12.4 | −5.8 | −6.3 | −14.7 | −16.7 |
| HAB-nolog | −15.5 | −12.3 | −6.0 | −6.3 | −14.5 | −16.6 |
| IA2 | −15.5 | −15.9 | −5.8 | −6.1 | −14.7 | −11.6 |
| HAE-asymm | −14.4 | −14.6 | −4.9 | −6.4 | −14.2 | −15.9 |
| HAE-symm | −14.6 | −15.0 | −5.0 | −6.6 | −14.2 | −15.7 |
| HAE-joint | −13.5 | −14.1 | −4.1 | −5.0 | −13.9 | −14.0 |
| RC1 | −14.3 | −14.5 | −4.4 | −5.1 | −14.0 | −13.2 |
| RC2 | −14.3 | −14.3 | −4.3 | −5.0 | −14.2 | −12.9 |
| RC3 | −14.3 | −14.4 | −4.4 | −5.0 | −14.1 | −13.1 |
| ANCH | −14.4 | −15.7 | −4.5 | −5.4 | −14.5 | −14.1 |
| CC | −14.3 | −14.9 | −4.3 | −5.3 | −14.2 | −13.6 |
| M | −14.8 | −14.1 | −5.0 | −5.8 | −14.3 | −14.7 |
| SD | 0.7 | 1.3 | 0.7 | 0.6 | 0.3 | 1.8 |
| Min | −15.5 | −15.9 | −6.0 | −6.6 | −14.7 | −16.8 |
| Max | −13.5 | −12.3 | −4.1 | −5.0 | −13.9 | −11.6 |

Note. logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration.
Table 6. Standard deviation for Austrian students in PISA 2009 for the domains mathematics, reading, and science for the 1PL and the 2PL model as a function of the linking method.

| Method | Mathematics: 1PL | Mathematics: 2PL | Reading: 1PL | Reading: 2PL | Science: 1PL | Science: 2PL |
|---|---|---|---|---|---|---|
| logMM | 97.7 | 98.3 | 98.6 | 103.2 | 103.2 | 106.8 |
| HAB | 97.7 | 98.3 | 98.6 | 103.2 | 103.2 | 106.8 |
| MM | 97.7 | 98.7 | 98.6 | 103.8 | 103.2 | 106.9 |
| HAB-nolog | 97.9 | 99.3 | 94.6 | 102.0 | 103.9 | 108.1 |
| IA2 | 97.7 | 99.5 | 98.6 | 104.6 | 103.2 | 109.2 |
| HAE-asymm | 94.1 | 95.0 | 102.6 | 105.4 | 105.0 | 107.5 |
| HAE-symm | 95.0 | 96.2 | 103.1 | 105.9 | 105.3 | 107.8 |
| HAE-joint | 95.0 | 95.7 | 105.1 | 107.5 | 104.7 | 107.4 |
| RC1 | 96.0 | 96.9 | 103.1 | 107.2 | 103.9 | 108.6 |
| RC2 | 96.0 | 95.6 | 99.9 | 106.2 | 104.7 | 105.9 |
| RC3 | 96.0 | 96.3 | 101.5 | 106.7 | 104.3 | 107.2 |
| ANCH | 96.0 | 95.6 | 99.9 | 106.2 | 104.7 | 105.9 |
| CC | 95.9 | 96.7 | 101.3 | 106.4 | 104.1 | 107.5 |
| M | 96.3 | 97.1 | 100.4 | 105.2 | 104.1 | 107.4 |
| SD | 1.2 | 1.5 | 2.7 | 1.7 | 0.7 | 0.9 |
| Min | 94.1 | 95.0 | 94.6 | 102.0 | 103.2 | 105.9 |
| Max | 97.9 | 99.5 | 105.1 | 107.5 | 105.3 | 109.2 |

Note. logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration.
Share and Cite

MDPI and ACS Style

Robitzsch, A. A Comparison of Linking Methods for Two Groups for the Two-Parameter Logistic Item Response Model in the Presence and Absence of Random Differential Item Functioning. Foundations 2021, 1, 116-144. https://0-doi-org.brum.beds.ac.uk/10.3390/foundations1010009
