Learning Progressive Adaptation for Multi-Modal Tracking

cs.CV cs.AI He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler · Mar 22, 2026
Plain-language introduction

Multi-modal tracking suffers from scarce paired training data, forcing reliance on RGB pre-trained models with lightweight fine-tuning. PATrack proposes a progressive adaptation framework using three complementary adapters—Modality-Dependent (MDA), Cross-Modality Entangled (CEA), and Head Adaptation (HA)—to bridge the domain gap between RGB and auxiliary modalities (Thermal, Depth, Event) at the intra-modal, inter-modal, and task levels. The approach decomposes features into frequency bands and uses fusion-guided cross-attention, yielding state-of-the-art results on LasHeR, RGBT234, and VisEvent benchmarks.

Critical review
Verdict
Bottom line

The paper presents a technically sound and empirically strong contribution to parameter-efficient multi-modal tracking. The three-level adaptation strategy provides a systematic way to handle modality-specific and shared information. Extensive experiments across five benchmarks demonstrate consistent improvements over strong adapter-based baselines like BAT and ViPT. However, the "progressive" nature is descriptive rather than algorithmically enforced, and the efficiency claims should be weighed against the modest parameter increase over the baseline (95.51M versus 92.13M parameters, as cited from Table IV). The synthesis of frequency-domain processing in MDA and fusion-guided cross-attention in CEA represents a meaningful advance over symmetric adapter architectures.

“This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy.”
paper · Abstract
What holds up

The Modality-Dependent Adaptation (MDA) effectively decomposes features into high and low-frequency components using max-pooling, depthwise convolution ($DWConv$), and average pooling, validated by ablations showing the combination achieves $SR=0.578$ versus $0.536$ without MDA. The Cross-Modality Entangled Adaptation (CEA) addresses a genuine gap in prior symmetric adapters by introducing cross-attention guided by fused representations: $H_{RGB}' = H_{RGB} + CA(Conv_v(\hat{H}_X), Conv_k(\hat{H}_X), fus)$. The Head Adaptation module demonstrates that even lightweight bottleneck adapters ($S_{RGBX}' = Up(Act(Down(S_{RGBX})))$) can bridge the RGB-pretrained prediction head to multi-modal fusion features, with Table IV confirming the parameter efficiency (95.51M vs 92.13M for baseline).

“When Maxpool, Avgpool, and DWConv are combined, the evaluation metrics for SR and PR reach higher values of 0.578 and 0.718, respectively”
paper · Table V
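To illustrate why pooling operators can separate frequency bands, here is a minimal 1-D sketch in Python. It is not the paper's MDA: the helper names, the window size of 3, the edge replication, and the residual definition of the high band are all assumptions made for illustration, and the depthwise convolution branch is omitted. It shows only that average pooling behaves as a low-pass filter while max pooling keeps local peaks.

```python
def _window(x, i, k):
    """Return the k values centred on index i, replicating the edges."""
    r = k // 2
    n = len(x)
    return [x[min(max(i + d, 0), n - 1)] for d in range(-r, r + 1)]


def avg_pool1d(x, k=3):
    """Stride-1 average pooling: smooths the signal (low-pass)."""
    return [sum(_window(x, i, k)) / k for i in range(len(x))]


def max_pool1d(x, k=3):
    """Stride-1 max pooling: keeps the strongest local response."""
    return [max(_window(x, i, k)) for i in range(len(x))]


def swing(values):
    """Peak-to-peak amplitude, used here as a crude measure of oscillation."""
    return max(values) - min(values)


# Constant offset plus a fast alternating component: 1.5, 0.5, 1.5, ...
signal = [1.0 + 0.5 * (-1) ** i for i in range(8)]

low_band = avg_pool1d(signal)
# Assumption: the high band is taken as the residual after low-pass filtering.
high_band = [s - l for s, l in zip(signal, low_band)]
peaks = max_pool1d(signal)

print("input swing:   ", swing(signal))
print("low-band swing:", swing(low_band))
```

The low band oscillates far less than the input, and the two bands sum back to the original signal, which is the property a frequency-aware adapter can exploit when treating each band separately.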
“$H_{RGB}' = H_{RGB} + CA(Conv_v(\hat{H}_X), Conv_k(\hat{H}_X), fus)$”
paper · Section III-D, Equation 14
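The quoted update can be sketched in plain Python under several assumptions that the review does not confirm: the fused representation `fus` is read as the attention query, `Conv_k` and `Conv_v` are modelled as plain projection matrices (`w_k`, `w_v`), attention is single-head, and scores are scaled by the square root of the feature width. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key, value):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(query[0])
    scores = matmul(query, transpose(key))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, value)


def cea_update(h_rgb, h_x, fus, w_k, w_v):
    """H_RGB' = H_RGB + CA(Conv_v(H_X), Conv_k(H_X), fus), with fus as query."""
    key = matmul(h_x, w_k)      # stands in for Conv_k
    value = matmul(h_x, w_v)    # stands in for Conv_v
    attended = cross_attention(fus, key, value)
    return [[r + a for r, a in zip(r_row, a_row)]
            for r_row, a_row in zip(h_rgb, attended)]


# Toy tokens: 2 tokens, 2 channels each.
h_rgb = [[0.0, 1.0], [1.0, 0.0]]
h_x = [[1.0, 2.0], [1.0, 2.0]]
fus = [[0.5, 0.5], [0.2, 0.8]]
eye = [[1.0, 0.0], [0.0, 1.0]]

out = cea_update(h_rgb, h_x, fus, eye, eye)
print(out)
```

Because the update is residual, the RGB stream is left unchanged whenever the value projection contributes nothing, so the auxiliary modality can only add information on top of the pre-trained RGB features.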
“$S_{RGBX}' = Up(Act(Down(S_{RGBX})))$”
paper · Section III-E, Equation 17
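A bottleneck adapter of the quoted form can be sketched as follows. This is an illustration, not the authors' implementation: ReLU as the activation, bias terms in both projections, the zero-initialised up-projection, and the feature width of 768 (a typical ViT-Base token size) are assumptions. Only the bottleneck width $C'=8$ comes from the review.

```python
def linear(x, weight, bias):
    """Apply y = x W + b to one feature vector; weight is indexed [in][out]."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def head_adapter(s, w_down, b_down, w_up, b_up):
    """S' = Up(Act(Down(S))) for a single token."""
    return linear(relu(linear(s, w_down, b_down)), w_up, b_up)


def adapter_param_count(dim, bottleneck):
    """Weights and biases of the down- and up-projections."""
    down = dim * bottleneck + bottleneck
    up = bottleneck * dim + dim
    return down + up


# Toy sizes so the forward pass is easy to follow.
dim, bottleneck = 4, 2
w_down = [[0.1] * bottleneck for _ in range(dim)]
b_down = [0.0] * bottleneck
# Assumption: zero-initialised up-projection, so the adapter starts as a no-op.
w_up = [[0.0] * dim for _ in range(bottleneck)]
b_up = [0.0] * dim

token = [1.0, -2.0, 3.0, 0.5]
adapted = head_adapter(token, w_down, b_down, w_up, b_up)
print(adapted)

# Assumed width of 768 with the reported bottleneck C' = 8.
print(adapter_param_count(768, 8))
```

Under these assumptions a single adapter adds on the order of ten thousand parameters, which is why bottleneck designs are attractive when only a small fraction of the model may be trained.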
Main concerns

The "progressive" adaptation strategy is primarily a conceptual grouping of three adapter types rather than a sequentially dependent learning process; the paper does not demonstrate that training must proceed from intra-modal to inter-modal to task-level. While PATrack achieves competitive results, it underperforms SDSTrack on DepthTrack (F-score $0.600$ vs $0.614$), with the authors attributing this to depth data sparsity—a limitation suggesting the frequency-decomposition in MDA may be less effective for sparse modalities. The claim that "the strong inductive bias of the prediction head does not adapt to the fused information" is asserted but not experimentally isolated from general domain shift effects. Furthermore, the layer selection for CEA (layers 4, 7, 10) is motivated by prior work but not rigorously ablated against random placements beyond the limited comparison in Table VII.

“Although our results on DepthTrack dataset do not surpass SDSTrack, our understanding is that the depth data in DepthTrack dataset is sparse, unlike the RGB-T tracking tasks where thermal can provide richer target information, thus leading to a significant improvement in the metrics.”
paper · Section IV-C
“Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced.”
paper · Section I
Evidence and comparison

The evidence broadly supports superiority over asymmetric prompt-tuning (ViPT) and symmetric adapters (BAT), with PATrack achieving $SR=0.578$ and $PR=0.718$ on LasHeR compared to BAT's $0.563$ and $0.702$. However, the comparison to SDSTrack reveals context-dependent performance: PATrack excels on RGB-T and RGB-E but lags on RGB-D, indicating the proposed adapters are not universally superior across all modality combinations. The ablation studies in Table III provide convincing evidence that the combination of MDA, CEA, and HA yields the best performance ($0.683$ NPR on LasHeR versus $0.555$ for the baseline). The analysis of single-modality information entropy (Fig. 4) effectively contextualizes why depth and event data (entropy 3.76 and 3.48) benefit less than thermal (4.68) from the proposed fusion mechanisms.

“PATrack Our 0.578 0.718 0.683 ... BAT* 0.563 0.702”
paper · Table I
“RGB images yield the highest entropy score (5.34)... In contrast, depth and event data display comparatively lower entropy values—3.76 and 3.48, respectively—indicating a higher degree of sparsity and reduced information density.”
paper · Section IV-C
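The entropy argument can be made concrete with a short sketch of Shannon entropy over pixel intensities. How the paper computes its scores (bin count, colour handling, averaging over frames) is not stated in the review, so the function below is a generic histogram entropy in bits, and the two toy "images" are invented for illustration.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy in bits of the empirical intensity distribution."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: a varied image and a sparse one.
dense = [0, 64, 128, 192] * 4   # four equally likely intensities
sparse = [0] * 15 + [255]       # almost every pixel shares one value

print("dense: ", shannon_entropy(dense))
print("sparse:", shannon_entropy(sparse))
```

A modality whose frames are dominated by a single value, as sparse depth or event maps often are, scores lower on this measure, which matches the direction of the figures quoted above.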
Reproducibility

The authors provide a GitHub repository link and specify the base architecture (OSTrack), optimizer (AdamW with weight decay $10^{-4}$), and learning rate schedule (initial $4\times 10^{-4}$, exponential decay $0.8$). However, critical details such as batch size, total training iterations, and data augmentation pipelines are omitted. The hardware specification (Intel i9-9980XE, NVIDIA 3090Ti) is mentioned, but training time is not reported. The exact parameter counts are provided ($95.51$M for the full model vs $92.13$M for the base), which helps readers verify a reimplementation, though the initialization strategy for the adapter weights (for example, those of the $C'=8$ bottleneck) is not detailed.

“Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.”
paper · Abstract
“The backbone OSTrack is trained by an initial learning rate of 4e-4, subject to an exponential decay with a decay ratio of 0.8.”
paper · Section IV-A
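The reported schedule can be written down as a small sketch. The decay interval is not stated in the review, so `step` below is a hypothetical unit (it might be an epoch or a fixed number of epochs); the constant and function names are also illustrative.

```python
INITIAL_LR = 4e-4     # reported initial learning rate
DECAY_RATIO = 0.8     # reported exponential decay ratio
WEIGHT_DECAY = 1e-4   # reported AdamW weight decay


def learning_rate(step):
    """Exponential decay: lr = INITIAL_LR * DECAY_RATIO ** step."""
    return INITIAL_LR * DECAY_RATIO ** step


# Assumption: one decay per step, whatever interval a step represents.
schedule = [learning_rate(s) for s in range(4)]
for s, lr in enumerate(schedule):
    print(f"step {s}: lr = {lr:.3e}")
```

Because the rate falls by 20% per step, it roughly halves every three steps, so the unreported step length and total training duration determine how small the rate becomes by the end of training.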
“Our-base 95.51M 58.24G 8 & 192 0.578 0.718 16.40”
paper · Table IV
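The parameter overhead implied by the two counts cited in this review can be worked out directly. The variable names are illustrative; the two figures are the ones the review reports.

```python
BASELINE_PARAMS_M = 92.13   # baseline count cited in the review
PATRACK_PARAMS_M = 95.51    # "Our-base" row quoted from Table IV

extra_params_m = PATRACK_PARAMS_M - BASELINE_PARAMS_M
relative_increase = extra_params_m / BASELINE_PARAMS_M

print(f"extra parameters: {extra_params_m:.2f}M")
print(f"relative increase: {relative_increase:.2%}")
```

This comes to about 3.38M additional parameters, an increase of roughly 3.67% over the baseline.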
Abstract

Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptations for applying RGB pre-trained models and fail to modulate a single specific modality, cross-modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi-Modal Tracking (PATrack). This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced through the modality-dependent adapter, decomposing the high- and low-frequency components, which ensures a more robust feature representation within each modality. The inter-modal interactions are introduced in the modality-entangled adapter, which implements a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
