through those design decisions, with small runnable checks you can do on your own data first to evaluate dimensions, the number of strata, and the analytic sample.
MAIHDA is for questions of the form “how much of the variation in an outcome lies between people’s intersectional social positions, and how much of that is more than the sum of its parts?” It is well suited when:
Strata are the cross-product of the dimensions, so
cell counts fall off fast as you add dimensions.
make_strata() builds the strata and returns a
strata_info table of counts you can inspect before
modelling:
s2 <- make_strata(maihda_health_data, vars = c("Gender", "Race"))
nrow(s2$strata_info) # number of strata
#> [1] 10
summary(s2$strata_info$n) # cell-size distribution
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 75 102 127 300 175 1044

Add education and the same sample splits into many more, smaller cells:
s3 <- make_strata(maihda_health_data, vars = c("Gender", "Race", "Education"))
nrow(s3$strata_info)
#> [1] 50
summary(s3$strata_info$n)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 13.25 25.50 60.00 45.50 349.00
sum(s3$strata_info$n < 10) # how many strata have < 10 people
#> [1] 5

Each extra dimension multiplies the number of strata and divides the people among them. Small cells are not fatal (partial-pooling shrinkage is exactly what protects MAIHDA against noisy small strata), but they have consequences (next section). A useful rule: choose the fewest dimensions that answer your question, and look at the cell-size distribution before committing.
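
The same check can be sketched without the package, which may be handy when comparing several candidate sets of dimensions side by side. The base-R snippet below counts people per stratum for two candidate sets. The data are simulated, and the column names, level counts, and the helper `cell_sizes()` are illustrative assumptions rather than part of the package.

```r
# Simulated stand-in data; column names and level counts are illustrative only.
set.seed(1)
n <- 3000
d <- data.frame(
  Gender    = sample(c("F", "M"), n, replace = TRUE),
  Race      = sample(paste0("R", 1:5), n, replace = TRUE),
  Education = sample(paste0("E", 1:5), n, replace = TRUE)
)

# Hypothetical helper: number of people in each stratum for a set of dimensions.
# drop = FALSE keeps empty cells, so they are counted as zeros.
cell_sizes <- function(data, vars) {
  as.vector(table(interaction(data[vars], drop = FALSE)))
}

candidates <- list(
  two   = c("Gender", "Race"),
  three = c("Gender", "Race", "Education")
)

# One summary row per candidate set of dimensions.
report <- do.call(rbind, lapply(names(candidates), function(nm) {
  n_cell <- cell_sizes(d, candidates[[nm]])
  data.frame(
    set        = nm,
    n_strata   = length(n_cell),
    min_n      = min(n_cell),
    median_n   = median(n_cell),
    n_under_10 = sum(n_cell < 10)
  )
}))
report
```

Keeping empty cells in the count is a deliberate choice here: a stratum with nobody in it is still part of the cross-product and is worth knowing about before modelling.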
When cells get very small, the maximum-likelihood (lme4)
estimate of the between-stratum variance can collapse to the boundary
(a singular fit) and report a VPC of (near) zero with no uncertainty. The
package records this and surfaces it in a “Fit diagnostics” note rather
than letting it pass silently:
over <- fit_maihda(
BMI ~ 1 + (1 | Gender:Race:Education),
data = maihda_health_data[1:60, ] # deliberately too few people per stratum
)
#> boundary (singular) fit: see help('isSingular')
over
#> MAIHDA Model
#> ============
#>
#> Engine: lme4
#> Family: gaussian
#> Formula: BMI ~ (1 | stratum)
#>
#> Fit diagnostics:
#> Singular fit: at least one variance component is estimated at (or near) zero.
#> The between-stratum variance and any VPC/PCV derived from it may be unreliable.
#> Convergence warnings reported by lme4:
#> - boundary (singular) fit: see help('isSingular')
#>
#>
#> Underlying model:
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: BMI ~ (1 | stratum)
#> Data: data
#> REML criterion at convergence: 386.8857
#> Random effects:
#> Groups Name Std.Dev.
#> stratum (Intercept) 0.000
#> Residual 6.203
#> Number of obs: 60, groups: stratum, 24
#> Fixed Effects:
#> (Intercept)
#> 28.8
#> optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings

If you see a singular-fit note, do not read the VPC as a clean zero.
The remedy is either to collapse dimensions or categories (fewer, larger
cells) or to use engine = "brms", whose weakly informative
priors regularise the variance away from the boundary and which returns a
posterior interval; see the Bayesian sparse vignette.
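
For a model fitted directly with lme4, a similar check might look like the sketch below. It uses simulated data with no true between-stratum variance; the variable names are hypothetical, and whether this particular fit turns out singular depends on the random draw, so no output is shown.

```r
library(lme4)

set.seed(42)
# Illustrative data: 60 people spread thinly over up to 24 strata, with no
# true between-stratum variance, so the estimate may land on the boundary.
d <- data.frame(
  stratum = factor(sample(1:24, 60, replace = TRUE)),
  y       = rnorm(60, mean = 28, sd = 6)
)

fit <- lmer(y ~ 1 + (1 | stratum), data = d)

# Pull the two variance components out of the fitted model.
vc          <- as.data.frame(VarCorr(fit))
var_between <- vc$vcov[vc$grp == "stratum"]
var_resid   <- vc$vcov[vc$grp == "Residual"]
vpc         <- var_between / (var_between + var_resid)

# isSingular() flags variance components estimated at (or near) zero.
if (isSingular(fit)) {
  message("Singular fit: VPC = ", round(vpc, 3),
          " is a boundary estimate, not evidence of zero variance.")
} else {
  message("VPC = ", round(vpc, 3))
}
```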
make_strata() will auto-bin a numeric dimension into
tertiles (with a message()), but a continuous
covariate belongs in the fixed part of the formula, not the
strata.

| Quantity | Answers | Does not answer |
|---|---|---|
| VPC/ICC | share of variance between strata | the absolute amount of between-stratum variation (a share can rise just because the residual variance fell) |
| PCV | additive share of the between-stratum variance | a causal decomposition; a negative PCV is not proof of hidden inequality |
| Discriminatory accuracy (AUC/MOR) | how well strata predict the individual outcome | how large the group differences are (a high VPC can go with modest AUC) |
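
A little arithmetic may make the first two rows concrete. The sketch below assumes the usual definitions, VPC as between-stratum variance over total variance and PCV as the proportional drop in between-stratum variance from the null model to the model with additive main effects. The helper names and the variance values are invented for illustration.

```r
# Variance partition coefficient: share of total variance lying between strata.
vpc <- function(var_between, var_resid) {
  var_between / (var_between + var_resid)
}

# Same between-stratum variance, smaller residual variance: the share doubles
# although the between-stratum variance itself has not changed.
vpc(var_between = 2, var_resid = 18)  # 0.1
vpc(var_between = 2, var_resid = 8)   # 0.2

# Proportional change in variance between the null model and the model that
# adds the dimensions' additive main effects (assumed standard definition).
pcv <- function(var_null, var_adjusted) {
  (var_null - var_adjusted) / var_null
}

pcv(var_null = 2, var_adjusted = 0.5)  # 0.75: most of the variance is additive
pcv(var_null = 2, var_adjusted = 2.4)  # -0.2: the variance grew after adjustment
```

A negative value in the last line only says that the estimated between-stratum variance increased once main effects were added; on its own it does not identify the reason.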

- lme4 (default) – fast frequentist fits for adequately-sized cells.
- brms – Bayesian; preferred when cells are sparse or dimensions have few levels (regularising priors, posterior intervals).

For extensions beyond the cross-sectional case, see the crossed random effects (dimensions/contexts) and longitudinal vignettes.