SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

cs.LG cs.CV Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif · Mar 23, 2026
AI Review
Plain-language introduction

SSAM tackles the problem of merging independently trained multimodal large language models (e.g., vision-language and audio-language specialists) into a single model capable of processing arbitrary modality combinations without any paired multimodal training data. The core idea is to project language-specific parameter updates (task vectors) onto a shared low-rank subspace identified via SVD, thereby aligning consistent update directions while filtering conflicting ones before merging. This is significant because it offers a training-free alternative to expensive joint multimodal training, achieving state-of-the-art results on four benchmarks.

Critical review
Verdict
Bottom line

SSAM is a strong empirical contribution that convincingly demonstrates the value of subspace alignment for model merging. The method is well-motivated by the low-rank structure of task vectors and consistently outperforms both prior training-free merging techniques and jointly trained multimodal models across diverse benchmarks. However, the approach is limited to models with identical architectures (specifically those sharing the same language decoder) and requires manual tuning of a scaling coefficient $\lambda$.

“Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models.”
paper · Abstract
“$\Delta_{merged} = \lambda \sum_{i=1}^{n} \Delta_{i}^{p}$ where $\lambda$ is a scaling coefficient... In practice, $\lambda$ can be determined using a small validation set for the target task.”
paper · Section 3.4
What holds up

The subspace alignment mechanism is theoretically grounded: SSAM constructs covariance matrices $A = \sum_{i=1}^{n} \Delta_{i} \Delta_{i}^{\top}$ and $B = \sum_{i=1}^{n} \Delta_{i}^{\top} \Delta_{i}$ to extract orthonormal bases $U_c$ and $V_c$ that explicitly minimize projection error (Equations 8-9). The empirical results are robust: SSAM achieves the best average accuracy on MUSIC-AVQA (54.97%), AVQA (81.29%), and MCUB (60.35% average), while Table 4 confirms that merging preserves or improves specialist capabilities on the MMLU, OCRBench, and MMAU benchmarks.

“$U_{c} = \arg\min_{U^{\top}U=I_{k}} \sum_{i=1}^{n} \|\Delta_{i}-UU^{\top}\Delta_{i}\|_{F}^{2}$”
paper · Equation 8-9
“SSAM slightly outperforms all specialist models indicating that merging does not degrade language capabilities of the original specialist models.”
paper · Table 4
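
The link between the quoted objective and the matrix $A$ can be made explicit with a short derivation. It is not taken from the paper; it uses only the definitions quoted above, the constraint $U^{\top}U = I_k$, and the standard fact that $UU^{\top}$ is then an orthogonal projector.

```latex
\begin{aligned}
\sum_{i=1}^{n} \|\Delta_{i} - UU^{\top}\Delta_{i}\|_{F}^{2}
  &= \sum_{i=1}^{n} \left( \|\Delta_{i}\|_{F}^{2} - \|U^{\top}\Delta_{i}\|_{F}^{2} \right) \\
  &= \sum_{i=1}^{n} \|\Delta_{i}\|_{F}^{2} - \operatorname{tr}\!\left( U^{\top} A \, U \right),
  \qquad A = \sum_{i=1}^{n} \Delta_{i}\Delta_{i}^{\top}.
\end{aligned}
```

The first term does not depend on $U$, so minimizing the projection error is equivalent to maximizing $\operatorname{tr}(U^{\top} A U)$ over orthonormal $U$, which is attained by the eigenvectors of $A$ with the $k$ largest eigenvalues. Assuming Equation 9 is the right-sided analogue with $\Delta_{i} V V^{\top}$, the same argument applied to $B$ yields $V_c$.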
Main concerns

The method assumes all specialist models share the same pretrained language decoder (Vicuna-7B-v1.5) and LoRA fine-tuning structure, limiting applicability to heterogeneous architectures. The scaling coefficient $\lambda$ is not automatically determined. While the paper mentions it can be set to $\frac{1}{n}$ when vector norms are similar, the reported results likely used task-specific tuning (implied by "selected based on the performance... on individual modality-language pairs"), yet specific values are not reported. Additionally, the rank $k=128$ is selected empirically without theoretical justification for why this dimension captures the "consensus" subspace, and robustness to model scale beyond 7B parameters is unexplored.

“$\lambda$ can be determined using a small validation set for the target task or selected based on the performance of the merged model on individual modality-language pairs.”
paper · Section 3.4
“Our experiments focused on models with approximately 7B parameters. We focus on merging models with the same architecture; extending to heterogeneous architectures remains an open challenge.”
paper · Section S6
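
To make the $\lambda$ concern concrete, here is a minimal sketch of what a validation-based selection could look like. It is not the paper's procedure, which the paper leaves unspecified. The names `default_lambda`, `select_lambda`, and `score_fn` are hypothetical, and `score_fn` stands in for "merge with this $\lambda$, then evaluate on a small validation set".

```python
from typing import Callable, Iterable, Tuple


def default_lambda(n_specialists: int) -> float:
    """Fallback lambda = 1/n, which the review says the paper suggests
    when the task-vector norms are similar."""
    return 1.0 / n_specialists


def select_lambda(
    candidates: Iterable[float],
    score_fn: Callable[[float], float],
) -> Tuple[float, float]:
    """Grid search over candidate lambdas (hypothetical helper).

    score_fn(lam) is assumed to return a validation score where
    higher is better, e.g. accuracy of the merged model.
    """
    best_lam, best_score = None, float("-inf")
    for lam in candidates:
        score = score_fn(lam)
        if score > best_score:
            best_lam, best_score = lam, score
    if best_lam is None:
        raise ValueError("candidates must not be empty")
    return best_lam, best_score


# Toy usage: a made-up score curve, NOT a result from the paper.
toy_score = lambda lam: -(lam - 0.5) ** 2
best_lam, _ = select_lambda([0.25, 0.5, 1.0], toy_score)
```

Reporting the candidate grid, the validation split, and the chosen value per benchmark would be enough to close this gap.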
Evidence and comparison

The comparisons appear fair: SSAM uses the same specialist checkpoints (from DAMC [3]) as baseline merging methods, ensuring controlled evaluation. The paper convincingly shows that simple averaging or addition of language vectors (NaiveMC, Task Arithmetic) suffers from parameter interference, while SSAM's subspace projection mitigates this. The margins over jointly trained models are substantial (+7.98 points on MUSIC-AVQA over Proj-Only, +19.03 on MCUB-4), though the paper does not clarify whether these jointly trained models used equivalent compute budgets or data scales.

“SSAM achieves the highest average accuracy of 54.97%, surpassing the best finetuned baseline, Proj-Only (46.99%), by 7.98 percentage points.”
paper · Table 1
“SSAM achieves an accuracy of 62.03%, outperforming all training-free baselines, including DAMC (60.08%), and surpassing the best jointly trained model, Proj-Only (43.00%), by 19.03 percentage points.”
paper · Section 4.4.2
Reproducibility

The algorithm is well-specified in Algorithm 1, with clear matrix operations for computing $A$, $B$, $U_c$, $V_c$, and the projection steps. The paper uses publicly available checkpoints from prior work (DAMC) and reports the critical hyperparameter $k=128$. However, the procedure for selecting $\lambda$ is underspecified (selection on a validation set is mentioned, but no implementation details are given), and no code repository URL is provided in the main text. The reliance on specific LoRA configurations (rank 128, $\alpha=256$) from the prior work also limits generalization to models with different adaptation schemes.

“Input: Pretrained model weight $W_0$; Language-specific weights from specialist models $\{W_i^t\}_{i=1}^n$; Rank $k$... Output: Merged language-specific weight $W_{merged}^t$”
paper · Algorithm 1
“We use the publicly released checkpoints from [3] directly for all the experiments without any additional fine-tuning or modification.”
paper · Section S2
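
The inputs and outputs quoted from Algorithm 1 can be sketched for a single weight matrix as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the function names are hypothetical, the two-sided projection $\Delta_{i}^{p} = U_c U_c^{\top} \Delta_{i} V_c V_c^{\top}$ and the final step $W_{merged}^t = W_0 + \Delta_{merged}$ are assumptions about details the review does not quote, and $\lambda$ defaults to $\frac{1}{n}$.

```python
import numpy as np


def top_k_eigvecs(sym: np.ndarray, k: int) -> np.ndarray:
    """Eigenvectors of a symmetric matrix for its k largest eigenvalues."""
    _, eigvecs = np.linalg.eigh(sym)      # eigenvalues come back ascending
    return eigvecs[:, ::-1][:, :k]        # reorder to descending, keep k columns


def ssam_merge_sketch(w0, specialist_weights, k, lam=None):
    """Merge one weight matrix from several specialists (hypothetical helper).

    w0: pretrained weight W_0, shape (m, d)
    specialist_weights: list of language-specific weights W_i^t, same shape
    k: rank of the shared subspace, k <= min(m, d)
    lam: scaling coefficient lambda; defaults to 1/n (assumption)
    """
    # Task vectors: Delta_i = W_i^t - W_0
    deltas = [w - w0 for w in specialist_weights]

    # A = sum_i Delta_i Delta_i^T and B = sum_i Delta_i^T Delta_i
    a = sum(d @ d.T for d in deltas)
    b = sum(d.T @ d for d in deltas)

    # Shared orthonormal bases U_c (left) and V_c (right)
    u_c = top_k_eigvecs(a, k)
    v_c = top_k_eigvecs(b, k)

    # Assumed two-sided projection onto the shared subspace
    projected = [u_c @ (u_c.T @ d @ v_c) @ v_c.T for d in deltas]

    if lam is None:
        lam = 1.0 / len(deltas)

    # Delta_merged = lambda * sum_i Delta_i^p, added back to W_0 (assumption)
    return w0 + lam * sum(projected)


# Toy usage with random matrices; shapes and k are illustrative only.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 6))
specialists = [w0 + 0.1 * rng.normal(size=(8, 6)) for _ in range(3)]
merged = ssam_merge_sketch(w0, specialists, k=2)
```

In the paper's setting the updates come from LoRA adapters applied across many layers, so a faithful reproduction would apply a step like this per adapted matrix. The sketch leaves that bookkeeping out.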
Abstract

Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities? Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately and identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.
