SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models
SSAM tackles the problem of merging independently trained multimodal large language models (e.g., vision-language and audio-language specialists) into a single model capable of processing arbitrary modality combinations without any paired multimodal training data. The core idea is to project language-specific parameter updates (task vectors) onto a shared low-rank subspace identified via SVD, thereby aligning consistent update directions while filtering conflicting ones before merging. This is significant because it offers a training-free alternative to expensive joint multimodal training, achieving state-of-the-art results on four benchmarks.
SSAM is a strong empirical contribution that convincingly demonstrates the value of subspace alignment for model merging. The method is well-motivated by the low-rank structure of task vectors and consistently outperforms both prior training-free merging techniques and jointly trained multimodal models across diverse benchmarks. However, the approach is limited to models with identical architectures (specifically those sharing the same language decoder) and requires manual tuning of a scaling coefficient $\lambda$.
The subspace alignment mechanism is theoretically grounded: SSAM constructs covariance matrices $A = \sum_{i=1}^{n} \Delta_{i} \Delta_{i}^{\top}$ and $B = \sum_{i=1}^{n} \Delta_{i}^{\top} \Delta_{i}$ and extracts from them orthonormal bases $U_c$ and $V_c$ that explicitly minimize projection error (Equations 8-9). The empirical results are robust: SSAM achieves the best average accuracy on MUSIC-AVQA (54.97%), AVQA (81.29%), and MCUB (60.35%), while Table 4 confirms that merging preserves or improves specialist capabilities on the MMLU, OCRBench, and MMAU benchmarks.
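As a rough illustration of the construction described above, the following NumPy sketch builds $A$ and $B$ from a list of task-vector matrices, takes their top-$k$ eigenvectors as $U_c$ and $V_c$, and merges the projected updates. The two-sided projection $U_c U_c^{\top} \Delta_i V_c V_c^{\top}$, the merge rule $\lambda \sum_i$ over projected updates, and all function names are assumptions made for illustration; the paper's Algorithm 1 may differ in detail.

```python
import numpy as np

def consensus_bases(deltas, k):
    """Return (U_c, V_c): top-k eigenvectors of A = sum(D D^T) and B = sum(D^T D)."""
    A = sum(d @ d.T for d in deltas)  # (m, m), symmetric positive semidefinite
    B = sum(d.T @ d for d in deltas)  # (n, n), symmetric positive semidefinite
    # eigh returns eigenvalues in ascending order, so reverse the columns
    # before keeping the k leading eigenvectors.
    _, vecs_a = np.linalg.eigh(A)
    _, vecs_b = np.linalg.eigh(B)
    U_c = vecs_a[:, ::-1][:, :k]
    V_c = vecs_b[:, ::-1][:, :k]
    return U_c, V_c

def project(delta, U_c, V_c):
    """Assumed two-sided projection of one task vector onto the shared subspace."""
    return U_c @ (U_c.T @ delta @ V_c) @ V_c.T

def merge(deltas, k, lam):
    """Assumed merge rule: scale the sum of projected task vectors by lam."""
    U_c, V_c = consensus_bases(deltas, k)
    return lam * sum(project(d, U_c, V_c) for d in deltas)

if __name__ == "__main__":
    # Toy data: two "specialists" sharing a rank-2 update plus small noise.
    rng = np.random.default_rng(0)
    shared = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
    deltas = [shared + 0.1 * rng.standard_normal((8, 6)) for _ in range(2)]
    merged = merge(deltas, k=2, lam=0.5)
    print(merged.shape, np.linalg.matrix_rank(merged))
```

The top-$k$ eigenvectors of $A$ are the rank-$k$ orthonormal basis that minimizes $\sum_i \lVert \Delta_i - U U^{\top} \Delta_i \rVert_F^2$, which is the sense in which such bases minimize projection error; the same holds for $B$ on the right-hand side.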
The method assumes all specialist models share the same pretrained language decoder (Vicuna-7B-v1.5) and LoRA fine-tuning structure, limiting applicability to heterogeneous architectures. The scaling coefficient $\lambda$ is not automatically determined. While the paper mentions that it can be set to $\frac{1}{n}$ when vector norms are similar, the reported results likely used task-specific tuning (implied by "selected based on the performance... on individual modality-language pairs"), yet specific values are not reported. Additionally, the rank $k=128$ is selected empirically, without theoretical justification for why this dimension captures the "consensus" subspace, and robustness to model scale beyond 7B parameters is unexplored.
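To make the two tuning concerns above concrete, here is a small diagnostic sketch; it is not taken from the paper. It returns $\lambda = \frac{1}{n}$ only when the Frobenius norms of the task vectors are close to one another, and it reports how much of the spectrum of $A = \sum_i \Delta_i \Delta_i^{\top}$ a given rank $k$ retains. The function names and the `rel_tol` threshold are hypothetical choices for illustration.

```python
import numpy as np

def default_lambda(deltas, rel_tol=0.25):
    """Return 1/n if the Frobenius norms are similar, else None.

    rel_tol is a hypothetical threshold on (max - min) / mean of the norms.
    """
    norms = np.array([np.linalg.norm(d) for d in deltas])  # Frobenius for 2-D input
    spread = (norms.max() - norms.min()) / norms.mean()
    return 1.0 / len(deltas) if spread <= rel_tol else None

def captured_energy(deltas, k):
    """Fraction of trace(A) held by the top-k eigenvalues of A = sum(D D^T)."""
    A = sum(d @ d.T for d in deltas)
    eigvals = np.linalg.eigvalsh(A)[::-1]  # reorder to descending
    return float(eigvals[:k].sum() / eigvals.sum())
```

Plotting `captured_energy` against $k$ for the actual language task vectors would be one way to check whether $k=128$ sits near a natural elbow in the spectrum, and a `None` from `default_lambda` would signal that $\lambda$ needs explicit selection.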
The comparisons appear fair: SSAM uses the same specialist checkpoints (from DAMC [3]) as baseline merging methods, ensuring controlled evaluation. The paper convincingly shows that simple averaging or addition of language vectors (NaiveMC, Task Arithmetic) suffers from parameter interference, while SSAM's subspace projection mitigates this. The margins over jointly trained models are substantial (+7.98 points on MUSIC-AVQA over Proj-Only, +19.03 on MCUB-4), though the paper does not clarify whether these jointly trained models used equivalent compute budgets or data scales.
The algorithm is well-specified in Algorithm 1, with clear matrix operations for computing $A$, $B$, $U_c$, $V_c$, and the projection steps. The paper uses publicly available checkpoints from prior work (DAMC) and reports the critical hyperparameter $k=128$. However, the procedure for selecting $\lambda$ is underspecified (validation set selection is mentioned, but no implementation details are given), and no code repository URL is provided in the main text. The reliance on specific LoRA configurations (rank 128, $\alpha=256$) from the prior work also limits generalization to models with different adaptation schemes.
Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately and identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.