Introduction to smriti: Structural Variance Preservation

The Imputation Uncertainty Principle

Modern machine-learning imputation algorithms such as missForest excel at minimizing point-wise prediction error (RMSE). However, the prediction that minimizes squared error is a conditional mean, so imputed values cluster toward the center of their conditional distribution and their variance shrinks. This is the structural variance collapse. In longitudinal Growth Curve Models (GCM), the shrinkage deflates the latent slope variance (\(\sigma^2_S\)) and erodes the statistical power needed to track patient trajectories over time.
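The shrinkage can be reproduced in a few lines of base R. The sketch below is an illustration on simulated data, not smriti code: it uses linear-regression (conditional-mean) imputation as a simple stand-in for an RMSE-minimizing imputer such as missForest, and every variable name in it is hypothetical.

```r
set.seed(42)

# Simulate two correlated measurements (hypothetical data, not from the package).
n  <- 2000
t1 <- rnorm(n)
t2 <- 0.5 * t1 + rnorm(n, sd = sqrt(0.75))  # var(t2) is about 1, cor(t1, t2) about 0.5

# Remove half of t2 completely at random.
missing <- sample(n, size = n / 2)
t2_obs <- t2
t2_obs[missing] <- NA

# Conditional-mean imputation: predict the missing t2 values from t1.
fit <- lm(t2_obs ~ t1)  # rows with NA in t2_obs are dropped by default
t2_imp <- t2_obs
t2_imp[missing] <- predict(fit, newdata = data.frame(t1 = t1[missing]))

# Compare the variance of the true column with the completed column.
var_true <- var(t2)
var_completed <- var(t2_imp)
cat("true variance:     ", round(var_true, 3), "\n")
cat("completed variance:", round(var_completed, 3), "\n")
```

Because the predictions lie on the regression line, the imputed half of the column carries only the variance explained by t1, so the completed column is less variable than the truth even though each prediction is optimal in the RMSE sense.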

The smriti package resolves this by decoupling prediction from structural geometry. It uses a two-stage architecture:

1. Initialization: Non-parametric imputation fills the missing entries to produce a dense matrix.
2. Lagrangian Projection: A C++ gradient descent layer moves the imputed data toward a target covariance manifold while preserving fidelity to the initial imputed values.

The augmented loss function is

\[L(X) = \frac{1}{2}\|X - X_{\text{imp}}\|_F^2 + \frac{\lambda}{2}\|\operatorname{cov}(X) - \Sigma_{\text{target}}\|_F^2\]

where the first term anchors the solution near the initial imputation and the second (governed by \(\lambda\)) enforces the covariance structure.
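To make the two terms concrete, the following base-R sketch evaluates them for a small matrix. The helper name loss_terms and its signature are hypothetical and are not part of the smriti API, and it assumes that cov(X) denotes the usual sample covariance (denominator n - 1), which is what R's cov() computes.

```r
# Hypothetical helper: evaluates the two terms of L(X) for numeric matrices.
# The name and signature are illustrative and are not part of the smriti API.
loss_terms <- function(X, X_imp, Sigma_target, lambda = 1.0) {
  # (1/2) * ||X - X_imp||_F^2
  fidelity <- 0.5 * sum((X - X_imp)^2)
  # (lambda/2) * ||cov(X) - Sigma_target||_F^2
  structure <- 0.5 * lambda * sum((cov(X) - Sigma_target)^2)
  c(fidelity = fidelity, structure = structure, total = fidelity + structure)
}

# Toy check: at X = X_imp the fidelity term is zero,
# so only the covariance gap contributes to the total.
set.seed(1)
X_imp <- matrix(rnorm(40), nrow = 10, ncol = 4)
Sigma_target <- diag(4)  # assumed target: unit variances, zero covariances
terms_at_start <- loss_terms(X_imp, X_imp, Sigma_target)
print(terms_at_start)
```

Evaluating the terms separately, rather than only their sum, makes it easy to see which side of the objective dominates for a given \(\lambda\).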

The Robustness-Efficiency Tradeoff

Real-world clinical data often contain heavy-tailed, skewed values or corrupted sensor artifacts. The smriti_impute() function handles this via the robust routing toggle, exposed as the robust argument.

Fidelity-Constraint Balance

The penalty weight lambda controls the trade-off between preserving the original imputation values and matching the target covariance. At lambda = 1.0 (the default) both terms of the loss are weighted equally. Increasing lambda enforces the covariance constraint more strictly but allows greater deviation from the initial imputation. The learning_rate (default 0.001) sets the gradient step size, and max_iter (default 2000) caps the number of optimization iterations.
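The interplay of lambda, learning_rate and max_iter can be explored with a pure-R gradient descent on the loss. This is a minimal sketch standing in for the package's C++ layer, not its actual implementation: refine_sketch is a hypothetical name, the data are simulated, and the gradient \(\nabla L(X) = (X - X_{\text{imp}}) + \frac{2\lambda}{n-1} X_c\,(\operatorname{cov}(X) - \Sigma_{\text{target}})\), with \(X_c\) the column-centered \(X\) and \(n\) the number of rows, is derived under the assumption that cov is the sample covariance with denominator n - 1.

```r
# Illustrative pure-R gradient descent on L(X); a stand-in for the C++ layer.
# Argument names mirror the text, but this is not the smriti implementation.
refine_sketch <- function(X_imp, Sigma_target, lambda = 1.0,
                          learning_rate = 0.001, max_iter = 2000) {
  n <- nrow(X_imp)
  X <- X_imp
  for (iter in seq_len(max_iter)) {
    Xc <- sweep(X, 2, colMeans(X))   # column-centered copy of X
    D  <- cov(X) - Sigma_target      # covariance residual
    grad <- (X - X_imp) + lambda * (2 / (n - 1)) * (Xc %*% D)
    X <- X - learning_rate * grad
  }
  X
}

set.seed(7)
n <- 30
X_true <- matrix(rnorm(n * 4), nrow = n, ncol = 4)
Sigma_target <- cov(X_true)

# Mimic variance collapse by shrinking every column toward its mean.
centered <- sweep(X_true, 2, colMeans(X_true))
X_imp <- sweep(0.6 * centered, 2, colMeans(X_true), "+")
initial_gap <- sqrt(sum((cov(X_imp) - Sigma_target)^2))

# Sweep lambda and record both sides of the trade-off.
lambdas <- c(0.1, 1, 10)
results <- t(sapply(lambdas, function(lam) {
  X <- refine_sketch(X_imp, Sigma_target, lambda = lam)
  c(lambda = lam,
    fidelity_gap = sqrt(sum((X - X_imp)^2)),
    covariance_gap = sqrt(sum((cov(X) - Sigma_target)^2)))
}))
print(results)
```

In this toy setting, larger lambda values leave a smaller covariance gap at the cost of a larger departure from X_imp. With the default learning_rate and max_iter the loop takes 2000 small steps, so it may stop before the loss has fully converged.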

Example: Shielding Against Corrupted EHR Data

library(smriti)
library(missForest)

# Load clinical data with structural missingness and sensor artifacts
data <- read.csv("clinical_proxy.csv")

# Execute robust refinement to isolate the structural manifold
clean_data <- smriti_impute(
  data       = data,
  time_cols  = c("T1", "T2", "T3", "T4"),
  robust     = TRUE,
  lambda     = 1.0
)
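
After refinement it is worth checking whether the variance at each time point moved in the intended direction. The helper below is a hypothetical diagnostic written in base R, not a smriti function. It accepts any two data frames that share the listed columns and is demonstrated on simulated stand-in tables, since the return type of smriti_impute() is not specified here.

```r
# Hypothetical diagnostic (not part of smriti): per-column variances before and after.
compare_variances <- function(before, after, cols) {
  col_var <- function(df) {
    vapply(cols, function(cl) var(df[[cl]], na.rm = TRUE),
           numeric(1), USE.NAMES = FALSE)
  }
  data.frame(column = cols,
             var_before = col_var(before),
             var_after = col_var(after))
}

# Simulated stand-ins for the raw and refined tables.
set.seed(3)
time_cols <- c("T1", "T2", "T3", "T4")
raw <- as.data.frame(matrix(rnorm(200), ncol = 4,
                            dimnames = list(NULL, time_cols)))
raw$T3[sample(50, 10)] <- NA  # introduce missing values in one column

# Placeholder fill with the column mean; this is NOT smriti output.
refined <- raw
refined$T3[is.na(refined$T3)] <- mean(raw$T3, na.rm = TRUE)

variance_table <- compare_variances(raw, refined, time_cols)
print(variance_table)
```

The mean fill used here as a placeholder lowers the T3 variance, which is exactly the pattern the diagnostic is meant to flag. Assuming smriti_impute() returns a data frame with the same time columns, the same call could be applied to data and clean_data.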