Chapter 2: What are the weaknesses of current approaches to evaluating model fit?
In terms of exact fit, the \({\chi}^2\) test performs poorly in small samples and may be overly sensitive to minor misspecifications at large sample sizes9,10. Additionally, with extremely small samples and smaller loadings, the \({\chi}^2\) test can be underpowered and unable to detect misfit11. Some models (such as bifactor models) inherently have a higher “fit propensity”, meaning that they are more likely to fit the data regardless of the true nature of the data-generating model12–14. Further, even a non-significant p-value for a \({\chi}^2\) test at a reasonable sample size does not guarantee that the true model has been recovered, as there are often multiple equivalent models that can fit the data15. As such, all modeling decisions should have strong theoretical grounding.
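To make this sample-size sensitivity concrete, recall the standard construction of the exact-fit test statistic under maximum likelihood estimation (some software multiplies by \(N\) rather than \(N-1\)):

\[
T = (N - 1)\,\hat{F}_{ML}, \qquad T \overset{a}{\sim} \chi^2_{df}, \qquad df = \frac{p(p+1)}{2} - q,
\]

where \(\hat{F}_{ML}\) is the minimized maximum likelihood discrepancy, \(p\) is the number of observed variables, and \(q\) is the number of free parameters. Because \(T\) scales directly with sample size, even a trivial discrepancy \(\hat{F}_{ML} > 0\) will eventually yield a significant test when \(N\) is large enough, whereas at small \(N\) the statistic may fail to exceed the critical value even for a meaningful misspecification.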
Unlike the \({\chi}^2\) test, approximate fit indices do not have a corresponding p-value. Given that these indices essentially function as effect size measures that capture the degree of misfit in the model, the difficulty lies in how to interpret them and which cutoff values to use (if any). Researchers often rely on a set of fixed cutoff values derived from a simulation study conducted by Hu and Bentler16, which has over 96,000 citations as of 2022. This simulation study produced the well-known cutoff values of SRMR < .08, RMSEA < .06, and CFI > .95. However, interpretations of the results of a simulation study are limited to the conditions sampled in that study. Hu and Bentler manipulated the sample size (250–5000), the number and type of misspecifications (omitted crossloadings and omitted factor covariances), and the normality of the factors and errors. They did not manipulate the number of factors (3), the number of items (15), the magnitude of the factor loadings (.7–.8), or the model type (single-level CFA estimated with maximum likelihood).
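For reference, common formulations of the three indices attached to these cutoffs are as follows (exact denominators vary slightly across software and estimators):

\[
\text{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N - 1)}}, \qquad
\text{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_M - df_M,\; \chi^2_B - df_B,\; 0)},
\]

\[
\text{SRMR} = \sqrt{\frac{\sum_{i \le j} \left( \dfrac{s_{ij}}{s_i s_j} - \dfrac{\hat{\sigma}_{ij}}{\hat{\sigma}_i \hat{\sigma}_j} \right)^2}{p(p+1)/2}},
\]

where the subscript \(M\) denotes the fitted model, \(B\) denotes the baseline (independence) model, \(s_{ij}\) and \(\hat{\sigma}_{ij}\) are the observed and model-implied covariances, and \(p\) is the number of observed variables. All three are continuous measures of misfit; none carries an inherent decision threshold, which is precisely why researchers have leaned on simulation-derived cutoffs.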
Several studies have demonstrated that these fixed cutoff values cannot reliably be extrapolated to other model subspaces (e.g., one-factor models, multi-factor models with stronger or weaker loadings, or models with fewer or more items or factors). In other words, if a researcher evaluates the fit of a single-level CFA model that does not have 15 items, 3 factors, and a sample size between 250 and 5000, the cutoff values derived from Hu and Bentler’s study cannot reliably determine whether there is substantial misfit in the model1. Most concerning is the “reliability paradox”, whereby lower loadings (i.e., a smaller reliability coefficient) are associated with “better” values of approximate fit indices11,17–20. In other words, holding all else equal, as factor loadings decrease, the SRMR and the RMSEA also decrease, mistakenly leading researchers to conclude that less reliable models fit the data better (when compared to a set of fixed cutoff values). Had Hu and Bentler varied the factor loadings in their original simulation study, it is possible that they would have been unable to recommend any set of fixed cutoff values, because fit indices are so sensitive to loading magnitude.
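The mechanism behind the reliability paradox can be illustrated with a small population-level sketch (the parameter values are illustrative assumptions, not drawn from any of the cited studies): a two-factor model whose factor correlation is omitted from the fitted model is evaluated at high versus low loadings. The minimal numpy/scipy code below fits the misspecified orthogonal model to each population covariance matrix by minimizing the ML discrepancy and then computes SRMR and RMSEA.

```python
# Minimal population-level sketch of the "reliability paradox".
# A two-factor model with a factor correlation of .30 is fitted as an
# orthogonal model (the correlation is omitted). The identical structural
# misspecification produces much smaller SRMR/RMSEA when loadings are low.
# All parameter values are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

P = 6  # six standardized items, three per factor


def implied_cov(params):
    """Model-implied covariance for the (misspecified) orthogonal two-factor model."""
    lam, theta = params[:P], params[P:]
    L = np.zeros((P, 2))
    L[:3, 0], L[3:, 1] = lam[:3], lam[3:]
    return L @ L.T + np.diag(theta)


def f_ml(params, S):
    """Maximum likelihood discrepancy function F_ML(S, Sigma(theta))."""
    Sigma = implied_cov(params)
    return (np.linalg.slogdet(Sigma)[1]
            + np.trace(S @ np.linalg.inv(Sigma))
            - np.linalg.slogdet(S)[1] - P)


def fit_indices(loading, psi=0.30, n=500):
    # Population covariance implied by the TRUE model (correlated factors).
    L = np.zeros((P, 2))
    L[:3, 0], L[3:, 1] = loading, loading
    Psi = np.array([[1.0, psi], [psi, 1.0]])
    S = L @ Psi @ L.T + np.diag(np.full(P, 1.0 - loading ** 2))

    # Fit the misspecified orthogonal model by minimizing F_ML.
    start = np.concatenate([np.full(P, 0.5), np.full(P, 0.5)])
    bounds = [(-1.0, 1.0)] * P + [(1e-3, 1.0)] * P
    res = minimize(f_ml, start, args=(S,), method="L-BFGS-B", bounds=bounds)
    Sigma = implied_cov(res.x)

    # SRMR: root mean squared standardized residual over unique elements.
    ds, dm = np.sqrt(np.diag(S)), np.sqrt(np.diag(Sigma))
    resid = S / np.outer(ds, ds) - Sigma / np.outer(dm, dm)
    tri = np.tril_indices(P)
    srmr = np.sqrt(np.mean(resid[tri] ** 2))

    # RMSEA from the minimized discrepancy; df = 21 moments - 12 parameters = 9.
    df = P * (P + 1) // 2 - 2 * P
    chi2 = (n - 1) * res.fun
    rmsea = np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
    return srmr, rmsea


for lam in (0.8, 0.4):
    srmr, rmsea = fit_indices(lam)
    print(f"loadings = {lam:.1f}: SRMR = {srmr:.3f}, RMSEA = {rmsea:.3f}")
```

With these illustrative values, the high-loading condition should produce SRMR and RMSEA well above the conventional cutoffs while the low-loading condition falls near or below them, even though the omitted factor correlation is identical in both cases; the lower loadings simply attenuate the observed covariances and therefore shrink the residuals that the indices summarize.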