Does Mechanistic Interpretability Transfer Across Data Modalities? A Cross-Domain Causal Circuit Analysis of Variational Autoencoders

cs.LG Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy · Mar 22, 2026

AI Review

Plain-language introduction

This paper investigates whether mechanistic interpretability findings from image-domain VAEs transfer to tabular data, using 75 independent training runs that span five architectures, five benchmarks (four tabular, one image), and three random seeds. It introduces posterior-calibrated Causal Effect Strength (CES) and Feature-Group Disentanglement (FGD) to compare circuit structures across modalities. It finds that tabular VAEs exhibit roughly 50% lower modularity than image VAEs and that β-VAEs suffer catastrophic capacity collapse on heterogeneous tabular data (a 260-fold CES reduction relative to the standard VAE on the Adult dataset).

Critical review

Verdict

Bottom line

The paper makes a valuable contribution by systematically testing the cross-modality transfer of mechanistic interpretability findings, challenging the assumption that image-domain VAE insights generalize to tabular data. The methodological innovations—particularly posterior-calibrated CES to address distributional mismatch and the identification that causal mediation analysis saturates in sequential architectures—are theoretically sound and practically useful. However, the strong correlation between CES and reconstruction MSE (r = -0.886) complicates interpretation of whether CES measures genuine circuit structure or merely reconstruction fidelity, and the reliance on a single synthetic image benchmark (dSprites) limits the generalizability of cross-modality claims to real-world images.

“Tabular VAEs have circuits with modularity that is approximately 50% lower than their image counterparts”

paper · Abstract

“β-VAE CES on the Adult dataset is approximately 0.0003... a 260 fold decrease from standard VAE”

paper · Section 5.3

What holds up

The three methodological refinements are well justified theoretically. Posterior-calibrated CES corrects a distributional mismatch: for well-regularized dimensions with σ^eff_d ≪ 1, fixed-range sweeps overestimate causal influence by a factor of 1/σ^eff_d (Proposition 1). Path-specific activation patching provides an exact telescoping decomposition for sequential architectures (Proposition 2). FGD generalizes DCI to correlated feature groups and reduces to standard DCI when features are independent (Proposition 3). The experimental scale (75 independent runs with multiple-comparison correction) supports the key finding that specificity, not modularity, predicts downstream AUC (r = 0.460, p < 0.001), offering practical guidance for tabular VAE selection. The finding that β-VAE capacity collapse is modality-dependent and reconstruction-mediated is explained mechanistically via the information bottleneck framework.

“For a well-regularized dimension with σ^eff_d ≪ 1, the fixed-range CES overestimates causal influence by a factor of 1/σ^eff_d”

paper · Section 3.2.2
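
The 1/σ^eff_d factor quoted above can be illustrated with a toy sketch. This is not the paper's CES implementation: the linear decoder, the `sweep_effect` helper, the ±3 sweep range, and the value `sigma_eff = 0.1` are all assumptions chosen so that the scaling is exact and easy to check.

```python
import numpy as np

def sweep_effect(decode, z_base, dim, half_width, n_steps=11):
    """Mean absolute output change when latent `dim` is swept over
    [-half_width, +half_width] around z_base (hypothetical CES-like score)."""
    base_out = decode(z_base)
    effects = []
    for delta in np.linspace(-half_width, half_width, n_steps):
        z = z_base.copy()
        z[dim] += delta
        effects.append(np.mean(np.abs(decode(z) - base_out)))
    return float(np.mean(effects))

# Toy linear decoder (an assumption; real VAE decoders are nonlinear).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))           # 3 latent dims -> 5 output features
decode = lambda z: z @ W

z_base = np.zeros(3)
sigma_eff = 0.1                       # hypothetical posterior std of latent dim 0

# Fixed-range sweep ignores how narrow the posterior actually is.
fixed = sweep_effect(decode, z_base, dim=0, half_width=3.0)
# Calibrated sweep scales the range by the posterior std.
calibrated = sweep_effect(decode, z_base, dim=0, half_width=3.0 * sigma_eff)

ratio = fixed / calibrated            # equals 1 / sigma_eff for a linear decoder
print(f"fixed / calibrated = {ratio:.3f}")
```

For a nonlinear decoder the ratio would only approximate 1/σ^eff_d, but the direction of the bias is the same: the narrower the posterior, the more a fixed-range sweep probes regions the decoder never sees in practice.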

“Intervention specificity emerges as the strongest predictor: higher specificity correlates with better AUC (r = 0.460, p < 0.001)”

paper · Section 5.5

Main concerns

The primary limitation is the image benchmark selection: dSprites is a synthetic dataset with independent generative factors and binary pixels, which the authors acknowledge differs from "the complexities present within real world images" (Section 6.6). This idealized structure likely inflates the observed cross-modality gap compared to natural images with correlated features. The strong negative correlation between CES and reconstruction MSE (r = -0.886) creates ambiguity—while the authors interpret this as evidence that the β-VAE bottleneck operates through reconstruction degradation, it is equally plausible that CES simply measures decoder capacity rather than meaningful circuit structure, a confound the authors partially acknowledge in Section 6.2. Additionally, each architecture uses a single hyperparameter setting (e.g., β = 4.0) without ablation studies, making it impossible to determine if the observed β-VAE collapse is continuous or represents a phase transition at that specific value.

“dSprites does not contain the complexities present within real world images... conclusions regarding the 'image-domain' MI properties are best viewed as 'dSprites-domain' MI properties”

paper · Section 6.6

“each architecture uses one setting for a hyperparameter... it is possible that the collapse of β-VAE is β-specific”

paper · Section 6.6
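
To make the single-β concern concrete, the standard β-VAE objective (reconstruction plus β times the KL term) can be sketched as follows. The squared-error reconstruction term, the function name `beta_vae_loss`, and the sweep values in `betas` are assumptions for illustration; the review reports only β = 4.0 for the paper.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Batch-averaged beta-VAE objective: reconstruction + beta * KL.

    KL is the closed form for a diagonal Gaussian posterior N(mu, exp(logvar))
    against a standard normal prior. Squared-error reconstruction is an
    assumption here, not necessarily the loss used in the paper.
    """
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    return recon + beta * kl, recon, kl

# Hypothetical ablation grid around the single reported value beta = 4.0.
betas = [0.5, 1.0, 2.0, 4.0, 8.0]

# Dummy batch: 2 samples, 3 features, 10 latent dimensions.
x = np.ones((2, 3))
x_recon = np.zeros((2, 3))
mu = np.full((2, 10), 0.5)
logvar = np.zeros((2, 10))

losses = {b: beta_vae_loss(x, x_recon, mu, logvar, beta=b)[0] for b in betas}
for b, value in losses.items():
    print(f"beta={b}: loss={value:.2f}")
```

A real ablation would retrain the model at each β and recompute CES and MSE; the snippet only shows where the knob enters the objective, which is why a gradual collapse and a sharp transition cannot be told apart from a single setting.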

Evidence and comparison

The evidence generally supports the core claims, though with noted caveats. The comparison to prior work by Roy and Misra [11] is fair and explicitly acknowledges the limitation that their convolutional encoder study avoided the sequential architecture saturation issue identified here. The statistical analysis appropriately uses non-parametric Wilcoxon tests with Holm–Šídák correction for multiple comparisons, though pooling across datasets assumes homogeneous architectural effects which Table 5 suggests may not hold (β-VAE CES varies dramatically by dataset). The random grouping ablation effectively validates that semantic groupings capture genuine structure (semantic modularity 0.099 vs random 0.080). However, the downstream evaluation uses only logistic regression on latent representations rather than domain-specific tasks (imputation, anomaly detection) that would more directly validate practical utility.

“semantic groupings yield mean modularity of 0.099 versus 0.080 for random groupings (Δ = +0.019)”

paper · Section 5.6

“downstream evaluation using logistic regression... doesn't cover all the ways that tabular VAEs can be used (e.g. how well they perform at imputation quality, anomaly detection AUROC)”

paper · Section 6.6
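
The Holm–Šídák step-down correction named in the statistical analysis can be sketched in a few lines. This is a generic implementation, not the authors' code; the raw p-values below are made up, and in practice they would come from the pairwise Wilcoxon tests (for example via `scipy.stats.wilcoxon`).

```python
import numpy as np

def holm_sidak(pvals):
    """Step-down Holm-Sidak adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value (0-based) is corrected for the m - i
    # hypotheses that have not yet been rejected.
    adj_sorted = 1.0 - (1.0 - p[order]) ** (m - np.arange(m))
    # Adjusted p-values must be non-decreasing along the sorted order.
    adj_sorted = np.maximum.accumulate(adj_sorted)
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

raw = [0.01, 0.04, 0.03]              # illustrative p-values, not from the paper
adjusted = holm_sidak(raw)
print(adjusted)
```

A comparison counts as significant at level α when its adjusted p-value is below α. Pooling runs across datasets before this step, as the review notes, changes which p-values enter the correction.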

Reproducibility

Reproducibility is mixed. The paper specifies training details (Adam optimizer with lr=10⁻³, ReduceLROnPlateau scheduling, early stopping patience=20, batch size 256, latent dimension 10) and seeds (42, 123, 456), and uses public UCI datasets and dSprites. Architecture specifications are detailed in Table 1. However, code and trained model checkpoints are promised "upon publication" rather than being currently available. The deterministic CUDA settings are mentioned but specific hardware (NVIDIA L40S) and software dependencies (PyTorch version, etc.) are not listed. The limited hyperparameter exploration (single β value per architecture) and fixed latent dimensionality without ablation would make it difficult to verify if the dramatic β-VAE collapse generalizes across the hyperparameter space or is specific to the chosen configuration.

“Each configuration is trained with 3 random seeds (42, 123, 456), yielding 75 independent runs”

paper · Section 4.3
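
The run count follows from the experimental grid: 5 architectures × 5 datasets × 3 seeds = 75. A minimal sketch of that grid is below. The `RunConfig` class and the `arch_*` and `tabular_*` labels are placeholders, since the review does not list every architecture and dataset by name; the seeds and training defaults are the ones reported above.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    """One training run; defaults mirror the settings quoted in the review."""
    architecture: str
    dataset: str
    seed: int
    lr: float = 1e-3
    batch_size: int = 256
    latent_dim: int = 10
    early_stopping_patience: int = 20

# Placeholder labels (assumptions): five architectures, four tabular
# benchmarks plus the dSprites image benchmark.
ARCHITECTURES = [f"arch_{i}" for i in range(1, 6)]
DATASETS = [f"tabular_{i}" for i in range(1, 5)] + ["dSprites"]
SEEDS = (42, 123, 456)

runs = [RunConfig(a, d, s) for a, d, s in product(ARCHITECTURES, DATASETS, SEEDS)]
print(len(runs))
```

Enumerating the grid this way also makes the gap visible: β and the latent dimension appear only as fixed defaults, so no run in the grid varies them.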

“Code and trained model checkpoints will be made available upon publication”

paper · Data Availability

Abstract

Although mechanism-based interpretability has generated an abundance of insight for discriminative network analysis, generative models are less understood -- particularly outside of image-related applications. We investigate how much of the causal circuitry found within image-related variational autoencoders (VAEs) will generalize to tabular data, as VAEs are increasingly used for imputation, anomaly detection, and synthetic data generation. In addition to extending a four-level causal intervention framework to four tabular and one image benchmark across five different VAE architectures (with 75 individual training runs per architecture and three random seed values for each run), this paper introduces three new techniques: posterior-calibration of Causal Effect Strength (CES), path-specific activation patching, and Feature-Group Disentanglement (FGD). The results from our experiments demonstrate that: (i) Tabular VAEs have circuits with modularity that is approximately 50% lower than their image counterparts. (ii) β-VAE experiences nearly complete collapse in CES scores when applied to heterogeneous tabular features (0.043 CES score for tabular data compared to 0.133 CES score for images), which can be directly attributed to reconstruction quality degradation (r = -0.886 correlation coefficient between CES and MSE). (iii) CES successfully captures nine of eleven statistically significant architecture differences using Holm–Šídák corrections. (iv) Interventions with high specificity predict the highest downstream AUC values (r = 0.460, p < .001). This study challenges the common assumption that architectural guidance from image-related studies can be transferred to tabular datasets.
