Chapter 2: What are the weaknesses of current approaches to evaluating model fit?
In terms of exact fit, the \({\chi}^2\) test performs poorly in small samples and may be overly sensitive to minor misspecifications at large sample sizes9,10. Additionally, with extremely small samples and smaller loadings, the \({\chi}^2\) test can be underpowered and unable to detect misfit11. Some models (such as bifactor models) inherently have a higher “fit propensity”, meaning that they are more likely to fit the data regardless of the true nature of the data-generating model12–14. Further, even a non-significant p-value for a \({\chi}^2\) test at a reasonable sample size does not guarantee that the true model has been recovered, as there are often multiple equivalent models that can fit the data15. As such, all modeling decisions should have strong theoretical grounding.
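To make this sample-size sensitivity concrete, recall the standard construction of the exact-fit test statistic under maximum likelihood estimation (some software multiplies by \(N\) rather than \(N-1\)):

\[
T = (N - 1)\,\hat{F}_{ML}, \qquad T \overset{a}{\sim} \chi^2_{df}, \qquad df = \frac{p(p+1)}{2} - q,
\]

where \(\hat{F}_{ML}\) is the minimized maximum likelihood discrepancy, \(p\) is the number of observed variables, and \(q\) is the number of free parameters. Because \(T\) scales directly with sample size, even a trivial discrepancy \(\hat{F}_{ML} > 0\) will eventually yield a significant test when \(N\) is large enough, whereas at small \(N\) the statistic may fail to exceed the critical value even for a meaningful misspecification.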
Unlike the \({\chi}^2\) test, approximate fit indices do not have a corresponding p-value. Given that these indices essentially function as effect size measures that capture the degree of misfit in the model, the difficulty lies in how to interpret them and which cutoff values to use (if any). Researchers often rely on a set of fixed cutoff values derived from a simulation study conducted by Hu and Bentler16, which has over 96,000 citations as of 2022. This simulation study produced the well-known cutoff values of SRMR < .08, RMSEA < .06, and CFI > .95. However, interpretations of the results of a simulation study are limited to the conditions sampled in that study. Hu and Bentler manipulated the sample size (250–5000), the number and type of misspecifications (omitted crossloadings and omitted factor covariances), and the normality of the factors and errors. They did not manipulate the number of factors (3), the number of items (15), the magnitude of the factor loadings (.7–.8), or the model type (single-level CFA estimated with maximum likelihood).
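For reference, common formulations of the three indices attached to these cutoffs are as follows (exact denominators vary slightly across software and estimators):

\[
\text{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N - 1)}}, \qquad
\text{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_M - df_M,\; \chi^2_B - df_B,\; 0)},
\]

\[
\text{SRMR} = \sqrt{\frac{\sum_{i \le j} \left( \dfrac{s_{ij}}{s_i s_j} - \dfrac{\hat{\sigma}_{ij}}{\hat{\sigma}_i \hat{\sigma}_j} \right)^2}{p(p+1)/2}},
\]

where the subscript \(M\) denotes the fitted model, \(B\) denotes the baseline (independence) model, \(s_{ij}\) and \(\hat{\sigma}_{ij}\) are the observed and model-implied covariances, and \(p\) is the number of observed variables. All three are continuous measures of misfit; none carries an inherent decision threshold, which is precisely why researchers have leaned on simulation-derived cutoffs.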
Several studies have demonstrated that these fixed cutoff values cannot reliably be extrapolated to other model subspaces (e.g., one-factor models, multi-factor models with stronger or weaker loadings, or models with fewer or more items or factors). In other words, if a researcher evaluates the fit of a single-level CFA model that does not have 15 items, 3 factors, and a sample size between 250 and 5000, the cutoff values derived from Hu and Bentler’s study cannot reliably determine whether there is substantial misfit in the model1. Most concerning is the “reliability paradox”, whereby lower loadings (i.e., a smaller reliability coefficient) are associated with “better” values of approximate fit indices11,17–20. In other words, holding all else equal, as factor loadings decrease, the SRMR and the RMSEA also decrease, mistakenly leading researchers to conclude that less reliable models fit the data better (when compared to a set of fixed cutoff values). Had Hu and Bentler varied the factor loadings in their original simulation study, it is possible that they would have been unable to recommend any set of fixed cutoff values, because fit indices are so sensitive to loading magnitude.
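The mechanism behind the reliability paradox can be illustrated with a small population-level sketch (the parameter values are illustrative assumptions, not drawn from any of the cited studies): a two-factor model whose factor correlation is omitted from the fitted model is evaluated at high versus low loadings. The minimal numpy/scipy code below fits the misspecified orthogonal model to each population covariance matrix by minimizing the ML discrepancy and then computes SRMR and RMSEA.

```python
# Minimal population-level sketch of the "reliability paradox".
# A two-factor model with a factor correlation of .30 is fitted as an
# orthogonal model (the correlation is omitted). The identical structural
# misspecification produces much smaller SRMR/RMSEA when loadings are low.
# All parameter values are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

P = 6  # six standardized items, three per factor


def implied_cov(params):
    """Model-implied covariance for the (misspecified) orthogonal two-factor model."""
    lam, theta = params[:P], params[P:]
    L = np.zeros((P, 2))
    L[:3, 0], L[3:, 1] = lam[:3], lam[3:]
    return L @ L.T + np.diag(theta)


def f_ml(params, S):
    """Maximum likelihood discrepancy function F_ML(S, Sigma(theta))."""
    Sigma = implied_cov(params)
    return (np.linalg.slogdet(Sigma)[1]
            + np.trace(S @ np.linalg.inv(Sigma))
            - np.linalg.slogdet(S)[1] - P)


def fit_indices(loading, psi=0.30, n=500):
    # Population covariance implied by the TRUE model (correlated factors).
    L = np.zeros((P, 2))
    L[:3, 0], L[3:, 1] = loading, loading
    Psi = np.array([[1.0, psi], [psi, 1.0]])
    S = L @ Psi @ L.T + np.diag(np.full(P, 1.0 - loading ** 2))

    # Fit the misspecified orthogonal model by minimizing F_ML.
    start = np.concatenate([np.full(P, 0.5), np.full(P, 0.5)])
    bounds = [(-1.0, 1.0)] * P + [(1e-3, 1.0)] * P
    res = minimize(f_ml, start, args=(S,), method="L-BFGS-B", bounds=bounds)
    Sigma = implied_cov(res.x)

    # SRMR: root mean squared standardized residual over unique elements.
    ds, dm = np.sqrt(np.diag(S)), np.sqrt(np.diag(Sigma))
    resid = S / np.outer(ds, ds) - Sigma / np.outer(dm, dm)
    tri = np.tril_indices(P)
    srmr = np.sqrt(np.mean(resid[tri] ** 2))

    # RMSEA from the minimized discrepancy; df = 21 moments - 12 parameters = 9.
    df = P * (P + 1) // 2 - 2 * P
    chi2 = (n - 1) * res.fun
    rmsea = np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
    return srmr, rmsea


for lam in (0.8, 0.4):
    srmr, rmsea = fit_indices(lam)
    print(f"loadings = {lam:.1f}: SRMR = {srmr:.3f}, RMSEA = {rmsea:.3f}")
```

With these illustrative values, the high-loading condition should produce SRMR and RMSEA well above the conventional cutoffs while the low-loading condition falls near or below them, even though the omitted factor correlation is identical in both cases; the lower loadings simply attenuate the observed covariances and therefore shrink the residuals that the indices summarize.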