Modern machine learning imputation algorithms (like
missForest) excel at minimizing point-wise prediction error
(RMSE). However, predictions that minimize RMSE approximate conditional
means, so the imputed values vary less than the data they replace. The
result is structural variance collapse. In longitudinal Growth Curve
Models (GCM), this deflates the latent slope variance (\(\sigma^2_S\)),
eroding the statistical power needed to detect differences in patient
trajectories over time.
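The shrinkage is easy to reproduce. The sketch below is a minimal illustration, not part of smriti: it uses simulated data and an ordinary linear regression as a stand-in for a point-wise imputer such as missForest (an assumption made to keep the example dependency-free), then compares the variance of the imputed values with the variance of the true values they replace.

```r
# Illustration only: simulated data, with lm() standing in for a
# point-wise imputer. None of this is smriti code.
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)      # y has signal plus noise

# Remove 200 values of y completely at random
miss <- sample(n, size = 200)
y_obs <- y
y_obs[miss] <- NA

# Fit on the observed rows (lm drops NA rows by default) and
# impute each missing y with its predicted conditional mean
fit <- lm(y_obs ~ x)
y_imp <- y_obs
y_imp[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

# Compare spread of the true values with spread of their imputations
var_true    <- var(y[miss])
var_imputed <- var(y_imp[miss])
cat("variance ratio (imputed / true):", var_imputed / var_true, "\n")
```

Under these simulation settings the imputed values carry only the variance explained by x, so the ratio falls well below 1.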
The smriti package resolves this by decoupling
prediction from structural geometry. It uses a two-stage
architecture:
1. Initialization: Non-parametric imputation fills the missing entries to produce a complete data matrix.
2. Lagrangian Projection: A C++ gradient descent layer projects the imputed data toward a target covariance manifold while preserving fidelity to the initial imputed values.
The augmented loss function is
\[L(X) = \frac{1}{2}\|X - X_{\text{imp}}\|_F^2 + \frac{\lambda}{2}\|\operatorname{cov}(X) - \Sigma_{\text{target}}\|_F^2\]
where the first term anchors the solution near the initial imputation and the second (governed by \(\lambda\)) enforces the covariance structure.
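To make the projection step concrete, here is a minimal base-R sketch of plain gradient descent on the loss above. It is not the package's C++ implementation, which may scale terms or update differently; the function names project_covariance and loss_fn, the simulated inputs, and the identity target are all hypothetical. It assumes cov(X) is the sample covariance with the n - 1 denominator, in which case the gradient is \(\nabla L(X) = (X - X_{\text{imp}}) + \frac{2\lambda}{n-1} X_c\,(\operatorname{cov}(X) - \Sigma_{\text{target}})\), where \(X_c\) is the column-centered \(X\).

```r
# Hypothetical helper: value of the augmented loss L(X)
loss_fn <- function(X, X_imp, Sigma_target, lambda) {
  sum((X - X_imp)^2) / 2 + lambda / 2 * sum((cov(X) - Sigma_target)^2)
}

# Hypothetical helper: fixed-step gradient descent on L(X)
project_covariance <- function(X_imp, Sigma_target, lambda = 1.0,
                               learning_rate = 0.001, max_iter = 2000) {
  X <- X_imp
  n <- nrow(X)
  for (i in seq_len(max_iter)) {
    Xc <- scale(X, center = TRUE, scale = FALSE)   # column-centered copy
    D  <- cov(X) - Sigma_target                    # covariance mismatch
    grad <- (X - X_imp) + lambda * (2 / (n - 1)) * (Xc %*% D)
    X <- X - learning_rate * grad
  }
  X
}

# Simulated "shrunken" imputation and an assumed target covariance
set.seed(1)
n <- 100
X_imp <- matrix(rnorm(n * 4, sd = 0.5), nrow = n)
Sigma_target <- diag(4)
lambda <- 10   # larger than the default, chosen for this demo only

X_ref <- project_covariance(X_imp, Sigma_target, lambda = lambda)

loss_before <- loss_fn(X_imp, X_imp, Sigma_target, lambda)
loss_after  <- loss_fn(X_ref, X_imp, Sigma_target, lambda)
dist_before <- sqrt(sum((cov(X_imp) - Sigma_target)^2))
dist_after  <- sqrt(sum((cov(X_ref) - Sigma_target)^2))
```

Because the anchor term starts at zero, any decrease in the loss means the covariance mismatch has shrunk; raising lambda pulls cov(X) closer to the target at the cost of larger departures from X_imp.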
Real-world clinical data often contains heavy-tailed skew or
corrupted sensor artifacts. The smriti_impute() function
handles this via the robust routing toggle:
robust = FALSE: Uses pairwise-complete Pearson covariance, projected to the nearest positive-semidefinite matrix to correct any non-PSD artefacts from pairwise deletion. Best for well-behaved, approximately-Normal data.
robust = TRUE: Constructs the target from pairwise Spearman correlations (rank-based, outlier-resistant) and column-wise MAD scale estimates. The resulting matrix is projected to the nearest PSD manifold, producing a target that is structurally robust to severe outliers (e.g., broken EHR sensors).
The penalty weight lambda controls the trade-off between
preserving the original imputation values and matching the target
covariance. At lambda = 1.0 (the default) both objectives
are weighted equally. Increasing lambda enforces the
covariance constraint more strictly but allows greater deviation from
the initial imputation. The learning_rate (default
0.001) governs gradient step size; max_iter
(default 2000) bounds the optimisation.
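The robust = TRUE target described above can be approximated in base R. The sketch below is an assumption about how such a target might be assembled, not smriti's internal code: the function name robust_target is hypothetical, the Spearman correlations are rescaled by MAD estimates exactly as they come (no rank-to-Pearson conversion), and the PSD projection is done by clipping negative eigenvalues, which gives the nearest PSD matrix in Frobenius norm for a symmetric input.

```r
# Hypothetical helper: Spearman correlations + MAD scales, then PSD projection
robust_target <- function(X) {
  X <- as.matrix(X)
  # Rank-based correlations using all pairwise-complete observations
  R <- cor(X, method = "spearman", use = "pairwise.complete.obs")
  # Column-wise robust scale (mad() is scaled for consistency with the
  # standard deviation under normality by default)
  s <- apply(X, 2, mad, na.rm = TRUE)
  S <- diag(s, nrow = length(s)) %*% R %*% diag(s, nrow = length(s))
  S <- (S + t(S)) / 2
  # Project to the PSD cone by clipping negative eigenvalues at zero
  eig  <- eigen(S, symmetric = TRUE)
  vals <- pmax(eig$values, 0)
  S_psd <- eig$vectors %*% diag(vals, nrow = length(vals)) %*% t(eig$vectors)
  S_psd <- (S_psd + t(S_psd)) / 2
  dimnames(S_psd) <- dimnames(R)
  S_psd
}

# Simulated data with missing values and one broken-sensor reading
set.seed(7)
X <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, c("T1", "T2", "T3", "T4")))
X[sample(length(X), 40)] <- NA
X[1, "T4"] <- 1e4                      # extreme artifact

Sigma_robust <- robust_target(X)
naive_var_T4 <- var(X[, "T4"], na.rm = TRUE)
```

In this simulation the single artifact inflates the ordinary variance of T4 by several orders of magnitude, while the corresponding diagonal entry of Sigma_robust stays close to the scale of the clean data.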
library(smriti)
library(missForest)

# Load clinical data with structural missingness and sensor artifacts
data <- read.csv("clinical_proxy.csv")

# Execute robust refinement to isolate the structural manifold
clean_data <- smriti_impute(
  data = data,
  time_cols = c("T1", "T2", "T3", "T4"),
  robust = TRUE,
  lambda = 1.0
)