Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning

cs.CV Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim, Sungmin Woo, Inseok Jeon, Sangyoun Lee · Mar 23, 2026

AI Review

Plain-language introduction

This paper addresses weakly-supervised video scene graph generation (WS-VSGG), where models must parse videos into structured relational triplets using only sparse unlocalized annotations without bounding boxes. The core insight is that off-the-shelf object detectors indiscriminately detect all visible objects, overwhelming relation models with noisy non-interactive pairs, while fully-supervised detectors implicitly filter relationally irrelevant objects. To bridge this gap, the authors propose a three-component framework: Relation-Aware Matching (RAM) refines pseudo-labels via vision-language grounding, Pair Affinity Learning and Scoring (PALS) learns to distinguish interactive from non-interactive pairs, and Pair Affinity Modulation (PAM) gates attention based on affinity scores. This substantially narrows the gap to full supervision while reducing annotation costs.
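A minimal sketch of how a pair affinity score could enter inference-time ranking. The review does not state how the two scores are fused, so multiplying the relation score by the affinity is an assumption here, and `PairPrediction` and `rank_by_affinity` are hypothetical names rather than the authors' code.

```python
from dataclasses import dataclass


@dataclass
class PairPrediction:
    subject: str
    obj: str
    predicate: str
    relation_score: float  # predicate confidence in [0, 1]
    pair_affinity: float   # estimated interaction likelihood in [0, 1]


def rank_by_affinity(predictions, top_k):
    """Sort candidates by relation_score * pair_affinity (assumed fusion rule)."""
    ranked = sorted(
        predictions,
        key=lambda p: p.relation_score * p.pair_affinity,
        reverse=True,
    )
    return ranked[:top_k]


candidates = [
    PairPrediction("person", "cup", "holding", 0.70, 0.90),         # fused 0.63
    PairPrediction("person", "shelf", "in_front_of", 0.80, 0.10),   # fused 0.08
    PairPrediction("person", "laptop", "looking_at", 0.60, 0.80),   # fused 0.48
]
top = rank_by_affinity(candidates, top_k=2)
print([(p.obj, p.predicate) for p in top])
# prints [('cup', 'holding'), ('laptop', 'looking_at')]
```

The shelf pair has the highest relation score but drops out of the top two because its affinity is low, which is the suppression effect the framework is after.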

Critical review

Verdict

Bottom line

The paper presents a technically sound and well-motivated solution to a genuine problem in WS-VSGG. The identification of the train-test distribution gap—where models train exclusively on interactive pairs but must infer over spaces dominated by non-interactive pairs—is a crisp insight that justifies the proposed mechanisms. The extensive ablations across two backbones (STTran and DSG-DETR) and the demonstration of synergy between components (RAM, PALS, PAM) provide strong empirical support for the approach.

“The relation prediction model is trained exclusively on interactive pairs, yet at inference it must operate over a detection space dominated by non-interactive pairs that were never observed during training.”

paper · Section 1

What holds up

The problem formulation is compelling: the authors clearly articulate that while fully-supervised detectors implicitly filter out non-interactive objects, off-the-shelf detectors indiscriminately detect all visible objects, shifting the filtering burden to the relation model. PALS addresses the train-test mismatch by retaining unmatched pairs as negative supervision for pair affinity learning rather than discarding them, and its class-balanced formulation prevents the majority negative class from dominating the loss. The empirical gains are consistent and substantial, with the best configuration reaching 88.3% and 94.3% of the fully-supervised upper bound at R@10 under the With Constraint and No Constraint protocols, respectively.

“Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs.”

paper · Section 1

“Our best configuration (PLA + Ours on STTran) achieves 88.3% and 94.3% of the fully-supervised upper bound at R@10 under the With-Constraint and No-Constraint protocols, respectively.”

paper · Section 4.2
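To illustrate why class balancing matters when negatives dominate, here is a small sketch of one common class-balanced binary cross-entropy: the positive and negative terms are averaged separately and then summed. This per-class averaging is an assumption; the review does not give the paper's exact $\mathcal{L}_{PA}$, and `class_balanced_bce` is a hypothetical helper.

```python
import math


def class_balanced_bce(probs, labels, eps=1e-7):
    """BCE with the positive and negative terms each averaged over their own class.

    probs  : predicted pair affinities in [0, 1]
    labels : 1 for interactive (matched) pairs, 0 for non-interactive pairs
    """
    pos = [-math.log(max(p, eps)) for p, y in zip(probs, labels) if y == 1]
    neg = [-math.log(max(1.0 - p, eps)) for p, y in zip(probs, labels) if y == 0]
    pos_term = sum(pos) / len(pos) if pos else 0.0
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return pos_term + neg_term


# One positive pair against 9 or 99 equally scored negatives.
loss_9 = class_balanced_bce([0.5] + [0.5] * 9, [1] + [0] * 9)
loss_99 = class_balanced_bce([0.5] + [0.5] * 99, [1] + [0] * 99)
print(math.isclose(loss_9, loss_99))
# prints True
```

With a plain mean over all pairs, the single positive would contribute 1/10 or 1/100 of the loss; here its weight stays fixed no matter how many negatives the detector produces.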

Main concerns

Several limitations temper the contribution. First, RAM depends critically on a pretrained vision-language model (GroundingDINO), introducing an external dependency and computational overhead; Table 3 shows that even after refinement, pseudo-label precision remains moderate (0.7255 for PLA), and the reliability threshold $\tau_r=0.3$ is a heuristic proxy that fails when objects are occluded or visually ambiguous (Appendix C.3). Second, the two-step training pipeline with knowledge distillation and distance-aware supervision (Appendix A.2) adds significant complexity, and label propagation for non-middle frames accumulates errors with temporal distance. Third, the method assumes that binary pair affinity adequately captures interaction likelihood, but complex relations may require finer-grained modeling than a single scalar score $PA_{(s,o)} \in [0,1]$ can provide.

“When the object satisfying the relational condition cannot be confirmed categorically with sufficient confidence, the grounding model substitutes it with a more visually unambiguous candidate of the same category, even if that candidate does not satisfy the relational condition.”

paper · Appendix C.3

“Label propagation inherently accumulates errors with temporal distance, as the IoU-based propagation criterion becomes increasingly unreliable for frames far from the annotated middle frame.”

paper · Appendix A.2

Evidence and comparison

The evidence generally supports the claims, though some comparisons warrant scrutiny. The ablation study in Table 2 demonstrates clear synergies: PALS provides the largest individual gain (+4.40 R@10), RAM amplifies PALS when combined (+2.24 R@10), and PAM adds complementary improvements. However, the comparison to TRKT is problematic: the TRKT numbers are the authors' own reproduction (marked TRKT$^\dagger$ in the Table 1 footnote) rather than the originally reported results, and the reproduced baseline underperforms the original paper, potentially inflating the relative gain. The analysis of failure modes in Appendix C is laudable, showing that grounding fails under occlusion or categorical ambiguity, but the paper does not compare against recent fully-supervised methods beyond the vanilla backbone upper bounds.

“TRKT$^\dagger$ indicates results reproduced with official code under identical settings.”

paper · Table 1

“RAM alone yields only a modest improvement; although it reduces label noise, it simultaneously reduces the number of matched pairs while the model still trains exclusively on them. However, RAM substantially amplifies PALS.”

paper · Section 4.3
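The R@10 figures quoted in this section are Recall@K values. As a rough sketch of the metric, Recall@K is the fraction of ground-truth triplets recovered among the top K ranked predictions. Scene graph benchmarks normally also require the predicted boxes to overlap the ground-truth boxes; this simplified version compares labels only, and `recall_at_k` is a hypothetical helper, not the evaluation code used in the paper.

```python
def recall_at_k(ranked_triplets, gt_triplets, k):
    """Fraction of ground-truth triplets found in the top-k ranked predictions.

    Triplets are (subject, predicate, object) label tuples; box localization
    is deliberately ignored in this sketch.
    """
    gt = set(gt_triplets)
    if not gt:
        return 0.0
    hits = gt & set(ranked_triplets[:k])
    return len(hits) / len(gt)


ranked = [
    ("person", "holding", "cup"),
    ("person", "in_front_of", "shelf"),
    ("person", "looking_at", "laptop"),
    ("person", "sitting_on", "chair"),
]
ground_truth = [
    ("person", "holding", "cup"),
    ("person", "looking_at", "laptop"),
    ("person", "sitting_on", "sofa"),
]

r_at_2 = recall_at_k(ranked, ground_truth, k=2)  # 1 of 3 ground-truth triplets
r_at_3 = recall_at_k(ranked, ground_truth, k=3)  # 2 of 3 ground-truth triplets
```

A non-interactive pair ranked second (the shelf) consumes one of the K slots, which is why suppressing such pairs raises recall at small K.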

Reproducibility

The authors provide substantial implementation detail, including hyperparameters in Table 5, specific embedding dimensions ($d_R=1936$, $d_P=128$), loss formulations ($\mathcal{L}_{PA}$ with class-balanced BCE, $\mathcal{L}_{PAM}$ with margin $m=1.0$), and threshold values ($\tau_r=0.3$, $\tau_{gs}=0.2$). The computational overhead is minimal (+1.27% parameters, +0.01% to +0.04% FLOPs). However, reproducibility faces challenges: the RAM preprocessing requires specific vision-language checkpoints (GroundingDINO with Swin-B) and careful attention map extraction from the final feature enhance layer, which may be sensitive to implementation details. The two-step training with teacher models, distance-aware blending weights ($\alpha=3.0$), and adaptive PAM margins requires substantial codebase infrastructure that is not trivial to replicate from the description alone.

“The relation embedding $\mathbf{R}_{0}\in\mathbb{R}^{d_{R}}$ ... yielding $d_{R}=1936$. The pair affinity embedding $\mathbf{P}_{0}\in\mathbb{R}^{d_{P}}$ ... projecting to $d_{P}=128$.”

paper · Appendix A.3

“We adopt GroundingDINO in particular, as its encoder architecture provides readily accessible dense cross-modal attention maps.”

paper · Section 3.3
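For readers trying to reimplement the margin term, the sketch below shows one generic margin-based objective: a pairwise hinge that pushes interactive pair scores above non-interactive ones by at least $m$. The review only reports that $\mathcal{L}_{PAM}$ uses a margin $m=1.0$, so the hinge form, the averaging over all pairs, and the name `pairwise_margin_loss` are assumptions, not the paper's definition.

```python
def pairwise_margin_loss(interactive_scores, non_interactive_scores, margin=1.0):
    """Mean hinge penalty over every (interactive, non-interactive) score pair.

    A pair contributes zero once the interactive score exceeds the
    non-interactive score by at least `margin`.
    """
    terms = [
        max(0.0, margin - (s_pos - s_neg))
        for s_pos in interactive_scores
        for s_neg in non_interactive_scores
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Every interactive score beats every non-interactive score by >= 1.0.
well_separated = pairwise_margin_loss([2.5, 3.0], [0.5, 1.0])
# Gaps of 0.5 and -0.5 give penalties of 0.5 and 1.5.
overlapping = pairwise_margin_loss([1.0], [0.5, 1.5])
print(well_separated, overlapping)
# prints 0.0 1.0
```

An adaptive margin, as mentioned in the review, would replace the fixed `margin` argument with a value computed per pair; how that value is chosen is not described here.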

Abstract

Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
