Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning
This paper addresses weakly-supervised video scene graph generation (WS-VSGG), where models must parse videos into structured relational triplets using only sparse unlocalized annotations without bounding boxes. The core insight is that off-the-shelf object detectors indiscriminately detect all visible objects, overwhelming relation models with noisy non-interactive pairs, while fully-supervised detectors implicitly filter relationally irrelevant objects. To bridge this gap, the authors propose a three-component framework: Relation-Aware Matching (RAM) refines pseudo-labels via vision-language grounding, Pair Affinity Learning and Scoring (PALS) learns to distinguish interactive from non-interactive pairs, and Pair Affinity Modulation (PAM) gates attention based on affinity scores. This substantially narrows the gap to full supervision while reducing annotation costs.
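To make the gating idea behind PAM concrete, the following is a minimal sketch of one way affinity scores could modulate attention. The review does not state the exact form the authors use, so the additive log-affinity bias, the function name `affinity_gated_attention`, and the `eps` floor are all assumptions made for illustration only.

```python
import math


def affinity_gated_attention(logits, key_affinities, eps=1e-6):
    """Attention weights for one query, biased by each key's pair affinity.

    Hypothetical formulation (not taken from the paper): log(affinity) is
    added to each attention logit before the softmax, so the resulting
    weight is proportional to affinity * exp(logit).
    """
    # Floor the affinity at eps so log() is defined for a score of 0.
    biased = [l + math.log(max(a, eps)) for l, a in zip(logits, key_affinities)]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(biased)
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Two pair tokens with identical content logits but different affinities.
weights = affinity_gated_attention([0.0, 0.0], [0.9, 0.1])
```

Under this assumed form, a pair with low affinity contributes little context to other pairs even when its raw attention logit is high, which is the suppression behaviour the summary attributes to PAM.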
The paper presents a technically sound and well-motivated solution to a genuine problem in WS-VSGG. The identification of the train-test distribution gap, in which models train exclusively on interactive pairs but must run inference over candidate spaces dominated by non-interactive pairs, is a crisp insight that justifies the proposed mechanisms. The extensive ablations across two backbones (STTran and DSG-DETR) and the demonstration of synergy between components (RAM, PALS, PAM) provide strong empirical support for the approach.
The problem formulation is compelling: the authors clearly articulate that while fully-supervised detectors implicitly filter out non-interactive objects, off-the-shelf detectors indiscriminately detect all visible objects, shifting the filtering burden to the relation model. PALS effectively leverages the train-test mismatch by retaining unmatched pairs as negative supervision for pair affinity learning rather than discarding them, using a class-balanced formulation that prevents the majority negative class from dominating. The empirical gains are consistent and substantial, with the best configuration achieving 88.3% and 94.3% of fully-supervised upper bound performance at R@10 under With Constraint and No Constraint protocols respectively.
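The class-balanced formulation mentioned above can be illustrated with a short sketch. This is one common way to balance a binary cross-entropy loss (averaging over positives and negatives separately), not necessarily the exact $\mathcal{L}_{PA}$ used in the paper; the function name and the clamping constant `eps` are assumptions.

```python
import math


def class_balanced_bce(affinities, labels, eps=1e-7):
    """Binary cross-entropy averaged per class, then summed.

    affinities: predicted pair affinities in [0, 1].
    labels: 1 for matched (interactive) pairs, 0 for unmatched pairs.
    Assumed formulation for illustration; the paper's loss may differ.
    """
    pos_losses, neg_losses = [], []
    for p, y in zip(affinities, labels):
        # Clamp the prediction so log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            pos_losses.append(-math.log(p))
        else:
            neg_losses.append(-math.log(1.0 - p))
    # Each class contributes its own mean, so the number of negatives
    # does not change the relative weight of the positive term.
    pos_term = sum(pos_losses) / len(pos_losses) if pos_losses else 0.0
    neg_term = sum(neg_losses) / len(neg_losses) if neg_losses else 0.0
    return pos_term + neg_term
```

With this form, adding more unmatched pairs at the same predicted affinity leaves the loss unchanged, which is the property that keeps the majority negative class from dominating.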
Several limitations temper the contribution. First, RAM depends critically on a pretrained vision-language model (GroundingDINO), introducing an external dependency and computational overhead; Table 3 shows that even after refinement, pseudo-label precision remains moderate (0.7255 for PLA), and the reliability threshold $\tau_r=0.3$ is a heuristic proxy that fails when objects are occluded or visually ambiguous (Appendix C.3). Second, the two-step training pipeline with knowledge distillation and distance-aware supervision (Appendix A.2) adds significant complexity, where label propagation for non-middle frames accumulates errors with temporal distance. Third, the method assumes that binary pair affinity adequately captures interaction likelihood, but complex relations may require finer-grained modeling beyond a single scalar score $PA_{(s,o)} \in [0,1]$.
The evidence generally supports the claims, though some comparisons warrant scrutiny. The ablation study in Table 2 demonstrates clear synergies: PALS provides the largest individual gain (+4.40 R@10), RAM amplifies PALS when combined (+2.24 R@10), and PAM adds complementary improvements. However, the comparison to TRKT is problematic; the authors note they could not reproduce TRKT's reported results (Table 1 footnote indicates TRKT$^\dagger$), and the reproduced baseline underperforms the original paper, potentially inflating the relative gain. The analysis of failure modes in Appendix C is laudable, showing that grounding fails under occlusion or categorical ambiguity, but the paper does not compare against recent fully-supervised methods beyond the vanilla backbone upper bounds.
The authors provide substantial implementation detail including hyperparameters in Table 5, specific embedding dimensions ($d_R=1936$, $d_P=128$), loss formulations ($\mathcal{L}_{PA}$ with class-balanced BCE, $\mathcal{L}_{PAM}$ with margin $m=1.0$), and threshold values ($\tau_r=0.3$, $\tau_{gs}=0.2$). The computational overhead is minimal (+1.27% parameters, +0.01 to +0.04% FLOPs). However, reproducibility faces challenges: the RAM preprocessing requires specific vision-language checkpoints (GroundingDINO with Swin-B) and careful attention map extraction from the final feature enhance layer, which may be sensitive to implementation details. The two-step training with teacher models, distance-aware blending weights ($\alpha=3.0$), and adaptive PAM margins requires substantial codebase infrastructure that is not trivial to replicate from the description alone.
Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
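As an illustration of what incorporating pair affinity into inference-time ranking might look like, here is a minimal sketch. The multiplicative combination of scores, the dictionary field names, and the example values are assumptions for illustration; the abstract does not specify the scoring rule.

```python
def rank_triplets(candidates, top_k=10):
    """Rank candidate triplets by a score that includes pair affinity.

    Each candidate is a dict with hypothetical keys: "triplet",
    "subject_score", "object_score", "relation_score", "pair_affinity".
    The product of the four scores is an assumed scoring rule.
    """
    scored = []
    for c in candidates:
        score = (
            c["pair_affinity"]
            * c["subject_score"]
            * c["object_score"]
            * c["relation_score"]
        )
        scored.append((score, c["triplet"]))
    # Highest combined score first; keep only the top_k triplets.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [triplet for _, triplet in scored[:top_k]]


# Made-up example: the sofa pair has the higher relation and detection
# scores, but a low affinity pushes it below the interactive cup pair.
example = [
    {"triplet": ("person", "in_front_of", "sofa"), "subject_score": 0.9,
     "object_score": 0.9, "relation_score": 0.7, "pair_affinity": 0.1},
    {"triplet": ("person", "holding", "cup"), "subject_score": 0.9,
     "object_score": 0.8, "relation_score": 0.6, "pair_affinity": 0.9},
]
ranked = rank_triplets(example)
```

Without the affinity factor the sofa triplet would rank first in this toy example, which mirrors the abstract's point that non-interactive pairs can crowd out relationally meaningful ones under Recall@K evaluation.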