Most road extraction benchmarks focus on binary segmentation, lacking the hierarchical attributes critical for transport infrastructure planning and management. This paper introduces SYSU-HiRoads, a large-scale dataset spanning 3,631 km² with aligned pixel masks, vector centerlines, and three-level road grades, alongside RoadReasoner—a framework that combines frequency-domain feature extraction with vision-language models to infer road hierarchy from geometric descriptors. The work bridges a significant gap in automated mapping by moving beyond "where are the roads" to "what roles do these roads play."
This paper addresses privacy-preserving facial expression recognition (FER) in video without requiring identity labels—a critical gap since real-world deployment often lacks identity annotations. The core idea leverages intra- and inter-video knowledge priors to train an identity suppression network followed by a denoising module, enabling open-set privacy preservation. This matters because current methods either require closed-set identity supervision or suffer from entangled privacy-utility trade-offs that degrade performance.
The paper tackles semi-supervised 3D rotation regression from monocular images, addressing the rigidity of fixed entropy thresholds in pseudo-label filtering used by prior work like FisherMatch. It proposes HACMatch, a hardness-aware curriculum learning framework that dynamically selects unlabeled samples by difficulty using either multi-stage or adaptive strategies, paired with PoseMosaic, a patch-based augmentation that applies diverse transformations while preserving geometric integrity. This matters because rotation annotations are expensive to obtain, and effectively leveraging unlabeled data could reduce costs for autonomous driving and robotics applications.
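The contrast between a fixed entropy threshold and a curriculum over sample hardness can be sketched as follows. This is an illustration only, not HACMatch's actual selection rule: it assumes predictive entropy is the hardness measure and uses a hypothetical linear schedule in which stage s of S admits the easiest s/S of the unlabeled pool. The function names and the toy entropies are made up.

```python
def select_fixed_threshold(entropies, threshold):
    """Keep every unlabeled sample whose entropy is at or below a fixed cut-off."""
    return [i for i, h in enumerate(entropies) if h <= threshold]


def select_curriculum(entropies, stage, n_stages):
    """Keep the easiest (lowest-entropy) share of samples for the given stage.

    Assumption: the admitted share grows linearly, stage / n_stages.
    """
    if not 1 <= stage <= n_stages:
        raise ValueError("stage must lie in 1..n_stages")
    n_keep = max(1, (stage * len(entropies)) // n_stages)
    easiest_first = sorted(range(len(entropies)), key=lambda i: entropies[i])
    return sorted(easiest_first[:n_keep])


# Toy entropies for six unlabeled samples (hypothetical values).
entropies = [0.2, 1.5, 0.7, 2.4, 0.4, 1.1]

print(select_fixed_threshold(entropies, threshold=0.5))  # [0, 4]
for stage in (1, 2, 3):
    print(stage, select_curriculum(entropies, stage, n_stages=3))
```

With the fixed threshold, samples 1, 2, 3 and 5 never contribute a pseudo-label. Under the staged schedule they are admitted progressively, and the full pool is in use by the last stage.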
Federated video action recognition faces a dual challenge: gradient sharing risks leaking sensitive motion patterns, while synchronizing high-dimensional video models incurs prohibitive bandwidth costs. This paper proposes FedDP-STECAR, which selectively fine-tunes only task-relevant layers under differential privacy and transmits only those layers, claiming over 99% communication reduction alongside strong privacy guarantees ($\epsilon \leq 1.33$). The work matters for enabling practical privacy-preserving video analysis in healthcare and surveillance where data cannot be centralized.
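The two ingredients in the summary above, privatizing an update and sending only the fine-tuned layers, can be sketched with the standard Gaussian mechanism (clip to a norm bound, then add noise). This is a minimal sketch, not FedDP-STECAR itself: the layer names and sizes are hypothetical, and the code does not compute the resulting $\epsilon$, which would need a privacy accountant.

```python
import math
import random


def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm `clip_norm`, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm, as in the
    usual Gaussian mechanism. The privacy budget is not computed here.
    """
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [x * scale + rng.gauss(0.0, sigma) for x in update]


def communication_reduction(layer_sizes, transmitted_layers):
    """Fraction of parameters NOT sent when only some layers are transmitted."""
    total = sum(layer_sizes.values())
    sent = sum(layer_sizes[name] for name in transmitted_layers)
    return 1.0 - sent / total


# Hypothetical parameter counts: a large frozen backbone and a small head.
layer_sizes = {"backbone": 30_000_000, "task_head": 200_000}
reduction = communication_reduction(layer_sizes, ["task_head"])
print(f"communication reduction: {reduction:.2%}")

rng = random.Random(0)
noisy_update = clip_and_noise([3.0, 4.0], clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(noisy_update)
```

In this toy configuration the transmitted head is well under 1% of the model, which is the kind of ratio that makes a reduction above 99% plausible.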
Open-vocabulary panoptic segmentation aims to recognize and segment arbitrary object categories beyond training vocabularies, but suffers from two coupled failures: mask transformers discard proposals for unseen categories due to biased objectness scoring, while CLIP's global image-text alignment poorly localizes to image regions. OVRCOAT addresses both via COAT—which adjusts foreground probabilities using CLIP's classification confidence to rescue out-of-vocabulary masks—and OVR, a memory-efficient fine-tuning protocol for region-text alignment. The approach achieves +5.5% PQ gains on ADE20K and reduces training memory by 56% versus prior SOTA.
This paper tackles the challenge of deploying traffic forecasting models in resource-constrained Wi-Fi controllers that manage thousands of access points (APs). The core idea is to use feature-based clustering (k-means on PCA-reduced features) to group APs by traffic behavior, then deploy cluster-specific LSTM models only to high-activity clusters while using a lightweight global model for low-activity clusters. The approach reduces memory footprint by approximately 40% compared to deploying complex models for all clusters, while preserving prediction accuracy through selective specialization.
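The clustering-then-selective-deployment idea can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the per-AP feature vectors are synthetic, the farthest-point initialisation is chosen here only to keep k-means deterministic, and the model labels `cluster_lstm` and `global_model` and the activity threshold are hypothetical.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project mean-centred features onto their top principal directions."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def assign_models(labels, mean_traffic, activity_threshold):
    """Give busy clusters their own LSTM and quiet clusters the shared model."""
    plan = {}
    for cluster in np.unique(labels):
        activity = mean_traffic[labels == cluster].mean()
        busy = activity >= activity_threshold
        plan[int(cluster)] = "cluster_lstm" if busy else "global_model"
    return plan


# Synthetic traffic features: 20 quiet APs followed by 20 busy APs.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(1.0, 0.1, (20, 4)), rng.normal(10.0, 0.1, (20, 4))])

labels = kmeans(pca_reduce(features, n_components=2), k=2)
plan = assign_models(labels, features.mean(axis=1), activity_threshold=5.0)
print(plan)
```

The memory saving in this scheme comes from `assign_models`: only clusters above the threshold carry a dedicated model, while all remaining APs share one lightweight model.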
This paper investigates why linear steering methods for transformers sometimes fail silently by leaking probability mass to unintended tokens. The authors show that softmax induces a Bregman geometry governed by the Hessian $H(\lambda) = \operatorname{Cov}[\gamma \mid \lambda]$, and when this Hessian is degenerate at intermediate layers, Euclidean steering becomes unreliable. Using a carefully controlled $2 \times 2$ factorial design crossing stream separation (CASCADE architecture) with per-layer supervision, they find that maintaining a frozen token stream improves Hessian conditioning by up to $22\times$ compared to standard single-stream transformers. The work provides both a diagnostic tool (cosine similarity between primal and dual directions with threshold $\sim$0.3) and an architectural fix for safer linear interventions.
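The covariance form of the Hessian follows from the softmax log-partition function. The derivation below is a sketch under an assumption not stated in the summary: that the logits are linear in the steering variable, $z_i = \gamma_i^\top \lambda$, with $\gamma_i$ the vector associated with token $i$. The cosine in the last line is one plausible reading of the diagnostic, comparing a primal direction $v$ with its dual image $H(\lambda)\,v$.

```latex
\begin{align}
A(\lambda) &= \log \sum_i \exp\!\big(\gamma_i^\top \lambda\big), \qquad
p_i(\lambda) = \exp\!\big(\gamma_i^\top \lambda - A(\lambda)\big), \\
\nabla A(\lambda) &= \sum_i p_i(\lambda)\,\gamma_i = \mathbb{E}[\gamma \mid \lambda], \\
H(\lambda) = \nabla^2 A(\lambda) &= \sum_i p_i\,\gamma_i \gamma_i^\top
  - \Big(\sum_i p_i\,\gamma_i\Big)\Big(\sum_i p_i\,\gamma_i\Big)^{\!\top}
  = \operatorname{Cov}[\gamma \mid \lambda], \\
\cos\theta &= \frac{v^\top H(\lambda)\, v}{\lVert v \rVert \, \lVert H(\lambda)\, v \rVert}.
\end{align}
```

Because $H(\lambda)$ is positive semidefinite, $\cos\theta \ge 0$ whenever $H(\lambda)\,v \neq 0$. A small value means a Euclidean step along $v$ moves the expected output mostly in other directions, which is consistent with the leakage to unintended tokens that the paper describes.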
This paper addresses a key gap in language model research by conducting the first tightly controlled comparison between autoregressive (AR) and masked diffusion language models (MDLM). The author trains both models on identical data (50M tokens from TinyStories), identical compute budget (20K steps, batch size 32), and identical hardware (NVIDIA H100), isolating the generation paradigm as the sole variable. The work is significant because prior studies compared these paradigms at different scales or with different datasets, making it impossible to attribute observed differences to the core architectural distinction itself.
Existing visual privacy benchmarks treat privacy as a binary property, but this work argues that privacy is fundamentally compositional: benign attributes in isolation can combine to create severe violations. The authors introduce the Compositional Privacy Risk Taxonomy (CPRT), a four-level framework aligned with regulations like GDPR and HIPAA that assigns continuous severity scores based on attribute interactions. They construct a dataset of 6,736 images annotated for 22 privacy attributes and evaluate frontier vision-language models, finding that while structured taxonomic guidance improves alignment, models systematically underestimate composition-driven risks.
UniMotion addresses the fragmentation in human motion modeling by unifying motion, text, and RGB understanding/generation within a single 1.5B parameter architecture. Unlike prior work relying on discrete tokenization or handling only partial modality subsets, it treats motion as a continuous first-class modality via a Cross-Modal Aligned Motion VAE (CMA-VAE). The framework introduces Dual-Posterior KL Alignment to distill visual semantics into motion representations without requiring images at inference, and Latent Reconstruction Alignment to bootstrap the motion pathway through dense self-supervision before sparse text calibration.
StreamingEval introduces a unified evaluation framework for Video-LLMs under realistic streaming constraints, moving beyond offline benchmarks to assess continuous, real-time video understanding with limited memory. The protocol enforces a fixed-capacity memory bank and jointly measures encoding throughput (MaxFPS), decoding latency (TTFT), memory usage, and task accuracy via a composite StreamingScore. Experiments reveal that current "online" models often fail under strict streaming constraints, while offline models adapted with FIFO memory banks frequently outperform specialized streaming architectures at the cost of higher resource consumption.
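A fixed-capacity FIFO memory bank of the kind the protocol enforces can be sketched in a few lines. This is an illustration, not StreamingEval's implementation: the class name and the string placeholders standing in for frame features are hypothetical.

```python
from collections import deque


class FifoMemoryBank:
    """Fixed-capacity store of frame features; the oldest entry is evicted first."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen drops items from the opposite end once it is full.
        self._frames = deque(maxlen=capacity)

    def push(self, frame_feature):
        """Add the newest frame feature, evicting the oldest if the bank is full."""
        self._frames.append(frame_feature)

    def contents(self):
        """Return the retained features from oldest to newest."""
        return list(self._frames)


bank = FifoMemoryBank(capacity=3)
for t in range(5):  # frames 0..4 arrive one at a time
    bank.push(f"frame_{t}")
print(bank.contents())  # ['frame_2', 'frame_3', 'frame_4']
```

Memory use stays bounded by `capacity` regardless of stream length, at the cost of forgetting everything older than the last `capacity` frames.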
The paper tackles the challenge of enhancing long-context reasoning in Large Language Models (LLMs), a critical capability as real-world tasks grow more complex. It proposes structured table data as a solution, demonstrating via mutual information analysis that tables exhibit periodic, non-vanishing dependencies, whereas dependencies in natural language decay polynomially with distance, which makes tables well suited to training long-context reasoning. The authors present TableLong, a scalable pipeline for synthesizing diverse, verifiable table data for reinforcement learning, showing significant performance gains across benchmarks.
This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.
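The arithmetic behind the fixed-memory claim can be made concrete. This sketch only counts frames. It does not perform any audio time-stretching, and the 50 Hz latent frame rate and the 4,500-frame budget are hypothetical values chosen for illustration.

```python
import math


def compressed_frames(duration_s, frame_rate_hz, speed):
    """Latent frames needed for `duration_s` of audio after a `speed`x speed-up."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return math.ceil(duration_s * frame_rate_hz / speed)


def max_real_duration_s(frame_budget, frame_rate_hz, speed):
    """Normal-speed duration that fits in a fixed budget of latent frames."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return frame_budget * speed / frame_rate_hz


FRAME_RATE_HZ = 50  # hypothetical latent frame rate

# Sequence length for a ten-minute piece at each speed-up factor.
for speed in (1, 2, 4, 8):
    print(speed, compressed_frames(600, FRAME_RATE_HZ, speed))

# With a hypothetical 4,500-frame budget, 8x covers 720 s (12 minutes).
print(max_real_duration_s(4500, FRAME_RATE_HZ, speed=8))
```

The sequence length falls in proportion to the speed factor, so a model whose memory only allows a 90-second window at normal speed covers twelve minutes at 8×, before accounting for any quality lost when restoring normal speed.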
F4Splat tackles inefficient Gaussian allocation in feed-forward 3D Gaussian Splatting (3DGS), where existing methods assign Gaussians uniformly per pixel or voxel, which produces redundant Gaussians and locks in a fixed budget. The core idea is a learnable densification score that predicts which spatial regions need additional Gaussians based on geometric complexity and multi-view overlap, enabling adaptive allocation and explicit budget control without retraining. This matters because it delivers compact scene representations, using 10–28% of the Gaussians of prior work, while maintaining or improving rendering fidelity.
Bayesian neural networks (BNNs) suffer from fragmented, high-dimensional posteriors due to weight-space symmetries, raising doubts about the practicality of sampling-based inference. This paper demonstrates that overparametrization—using more hidden units than necessary—actually transforms the posterior geometry in beneficial ways. The authors identify three key phenomena induced by redundancy: balancedness (norm equalization across layers), weight reallocation on equal-probability manifolds (following Dirichlet distributions), and prior conformity (marginals aligning with zero-mean Gaussian priors). Through theory for ReLU networks and extensive experiments with up to 10 million posterior samples, the work explains why recent sampling methods succeed and provides a principled foundation for understanding weight priors in overparametrized regimes.
PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.
This paper addresses long-tailed (LT) learning by proposing that the head-tail performance trade-off stems from "tail performance degradation"—where models overfit to head classes and forget tail classes. The core idea reframes LT learning as continual learning, using a Grouped Knowledge Preservation (GKP) module to maintain class-specific optimal parameters and a Grouped Sharpness Aware (GSA) module to find flatter minima. The method operates without external data or pre-trained models, showing improvements on CIFAR-LT, ImageNet-LT, and iNaturalist benchmarks.
This paper addresses stability and bifurcation analysis for nonlinear PDEs using Physics-Informed Random Projection Neural Networks (PI-RPNNs). The core innovation is a matrix-free shift-invert Krylov-Arnoldi method operating directly in weight space to circumvent the exponential singular value decay of the random collocation matrix $\Psi$. This enables reliable computation of leading eigenpairs for detecting saddle-node, Hopf, and pitchfork bifurcations without requiring additional PDE solves beyond the initial training.
This paper introduces a new computer vision task called Anytime Interframe Semantic Segmentation: predicting dense semantic segmentation at arbitrary timestamps between low-frame-rate (LFR) RGB frames using only a past frame and asynchronous event data. The core idea is feature propagation via event-driven motion fields rather than direct multi-modal fusion. The method is motivated by the perceptual gaps that LFR cameras create in high-speed autonomous driving scenarios, where critical events (e.g., pedestrians entering paths) may be missed between frames.
Single-view reference-to-video methods struggle to preserve identity when faces rotate through large angles. This paper proposes Mv2ID, a multi-view conditioning framework that uses region-masking and a decoupled positional encoding scheme to prevent view-dependent copy-paste artifacts without requiring expensive cross-paired training data. The work is relevant for digital character creation and visual effects where identity must remain consistent across extreme viewpoints.