Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
0
cs.CV Ting Han, Xiangyi Xie, Yiping Chen et al. · Mar 22, 2026

Most road extraction benchmarks focus on binary segmentation, lacking the hierarchical attributes critical for transport infrastructure planning and management. This paper introduces SYSU-HiRoads, a large-scale dataset spanning 3,631 km² with aligned pixel masks, vector centerlines, and three-level road grades, alongside RoadReasoner—a framework that combines frequency-domain feature extraction with vision-language models to infer road hierarchy from geometric descriptors. The work bridges a significant gap in automated mapping by moving beyond "where are the roads" to "what roles do these roads play."

In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km2 in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.
0
cs.CV Feng Xu, Xun Li, Lars Petersson et al. · Mar 22, 2026

This paper addresses privacy-preserving facial expression recognition (FER) in video without requiring identity labels—a critical gap since real-world deployment often lacks identity annotations. The core idea leverages intra- and inter-video knowledge priors to train an identity suppression network followed by a denoising module, enabling open-set privacy preservation. This matters because current methods either require closed-set identity supervision or suffer from entangled privacy-utility trade-offs that degrade performance.

Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown, and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.
0
cs.CV Mei Li, Huayi Zhou, Suizhi Huang et al. · Mar 23, 2026

The paper tackles semi-supervised 3D rotation regression from monocular images, addressing the rigidity of fixed entropy thresholds in pseudo-label filtering used by prior work like FisherMatch. It proposes HACMatch, a hardness-aware curriculum learning framework that dynamically selects unlabeled samples by difficulty using either multi-stage or adaptive strategies, paired with PoseMosaic, a patch-based augmentation that applies diverse transformations while preserving geometric integrity. This matters because rotation annotations are expensive to obtain, and effectively leveraging unlabeled data could reduce costs for autonomous driving and robotics applications.

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
0
cs.CV Idris Zakariyya, Pai Chet Ng, Kaushik Bhargav Sivangi et al. · Mar 22, 2026

Federated video action recognition faces a dual challenge: gradient sharing risks leaking sensitive motion patterns, while synchronizing high-dimensional video models incurs prohibitive bandwidth costs. This paper proposes FedDP-STECAR, which selectively fine-tunes only task-relevant layers under differential privacy and transmits only those layers, claiming over 99% communication reduction alongside strong privacy guarantees ($\epsilon \leq 1.33$). The work matters for enabling practical privacy-preserving video analysis in healthcare and surveillance where data cannot be centralized.

Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: \textit{model exposure} and \textit{communication overhead}. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose \textit{Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition}, namely \textit{FedDP-STECAR}. Our \textit{FedDP-STECAR} framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99\% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that \textit{FedDP-STECAR} achieves up to \textbf{70.2\% higher accuracy} under strict privacy ($\epsilon=0.65$) in centralized settings and \textbf{48\% faster training} with \textbf{73.1\% accuracy} in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at https://github.com/izakariyya/mvit-federated-videodp
0
cs.CV Nikolay Kormushev, Josip \v{S}ari\'c, Matej Kristan · Mar 22, 2026

Open-vocabulary panoptic segmentation aims to recognize and segment arbitrary object categories beyond training vocabularies, but suffers from two coupled failures: mask transformers discard proposals for unseen categories due to biased objectness scoring, while CLIP's global image-text alignment poorly localizes to image regions. OVRCOAT addresses both via COAT—which adjusts foreground probabilities using CLIP's classification confidence to rescue out-of-vocabulary masks—and OVR, a memory-efficient fine-tuning protocol for region-text alignment. The approach achieves +5.5% PQ gains on ADE20K and reduces training memory by 56% versus prior SOTA.

Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: https://github.com/nickormushev/OVRCOAT
0
eess.SPcs.LG Gianluca Fontanesi, Luca Barbieri, Lorenzo Galati Giordano et al. · Mar 23, 2026

This paper tackles the challenge of deploying traffic forecasting models in resource-constrained Wi-Fi controllers that manage thousands of access points (APs). The core idea is to use feature-based clustering (k-means on PCA-reduced features) to group APs by traffic behavior, then deploy cluster-specific LSTM models only to high-activity clusters while using a lightweight global model for low-activity clusters. The approach reduces memory footprint by approximately 40% compared to deploying complex models for all clusters, while preserving prediction accuracy through selective specialization.

This manuscript presents a comprehensive analysis of predictive modeling optimization in managed Wi-Fi networks through the integration of clustering algorithms and model evaluation techniques. The study addresses the challenges of deploying forecasting algorithms in large-scale environments managed by a central controller constrained by memory and computational resources. Feature-based clustering, supported by Principal Component Analysis (PCA) and advanced feature engineering, is employed to group time series data based on shared characteristics, enabling the development of cluster-specific predictive models. Comparative evaluations between global models (GMs) and cluster-specific models demonstrate that cluster-specific models consistently achieve superior accuracy in terms of Mean Absolute Error (MAE) values in high-activity clusters. The trade-offs between model complexity (and accuracy) and resource utilization are analyzed, highlighting the scalability of tailored modeling approaches. The findings advocate for adaptive network management strategies that optimize resource allocation through selective model deployment, enhance predictive accuracy, and ensure scalable operations in large-scale, centrally managed Wi-Fi environments.
0
cs.LG James Clayton Kerce · Mar 22, 2026

This paper investigates why linear steering methods for transformers sometimes fail silently by leaking probability mass to unintended tokens. The authors show that softmax induces a Bregman geometry governed by the Hessian $H(\lambda) = \operatorname{Cov}[\gamma \mid \lambda]$, and when this Hessian is degenerate at intermediate layers, Euclidean steering becomes unreliable. Using a carefully controlled $2 \times 2$ factorial design crossing stream separation (CASCADE architecture) with per-layer supervision, they find that maintaining a frozen token stream improves Hessian conditioning by up to $22\times$ compared to standard single-stream transformers. The work provides both a diagnostic tool (cosine similarity between primal and dual directions with threshold $\sim$0.3) and an architectural fix for safer linear interventions.

Linear methods for steering transformer representations, including probing, activation engineering, and concept erasure, implicitly assume the geometry of representation space is Euclidean. Park et al. [Park et al., 2026] showed that softmax induces a curved Bregman geometry whose metric tensor is the Hessian of the log-normalizer, $H({\lambda}) = Cov[{\gamma} | {\lambda}]$. Ignoring this curvature causes Euclidean steering to leak probability mass to unintended tokens. Their analysis applies at the output layer. We measure this Hessian at intermediate layers in a controlled 2x2 design crossing stream separation with per-layer supervision (vocabulary decoding loss at each layer), all at matched vocabulary and parameter count. In standard single-stream transformers, H is severely degenerate at intermediate layers (effective rank 8 in 516 dimensions). Stream separation improves conditioning by up to 22 in effective rank, even without auxiliary supervision. Per-layer supervision helps, but less. The cosine similarity between primal and dual concept directions predicts per-layer steering effectiveness on downstream tasks, with a threshold near 0.3. These results bear on the reliability of linear safety interventions, which depend on the geometry being well-conditioned at the layer where they are applied.
0
cs.CL Caio Vicentino · Mar 23, 2026

This paper addresses a key gap in language model research by conducting the first tightly controlled comparison between autoregressive (AR) and masked diffusion language models (MDLM). The author trains both models on identical data (50M tokens from TinyStories), identical compute budget (20K steps, batch size 32), and identical hardware (NVIDIA H100), isolating the generation paradigm as the sole variable. The work is significant because prior studies compared these paradigms at different scales or with different datasets, making it impossible to attribute observed differences to the core architectural distinction itself.

We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
0
cs.CV Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna et al. · Mar 23, 2026

Existing visual privacy benchmarks treat privacy as a binary property, but this work argues that privacy is fundamentally compositional: benign attributes in isolation can combine to create severe violations. The authors introduce the Compositional Privacy Risk Taxonomy (CPRT), a four-level framework aligned with regulations like GDPR and HIPAA that assigns continuous severity scores based on attribute interactions. They construct a dataset of 6,736 images annotated for 22 privacy attributes and evaluate frontier vision-language models, finding that while structured taxonomic guidance improves alignment, models systematically underestimate composition-driven risks.

Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional. Attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.
0
cs.CVcs.AI Ziyi Wang, Xinshun Wang, Shuang Chen et al. · Mar 23, 2026

UniMotion addresses the fragmentation in human motion modeling by unifying motion, text, and RGB understanding/generation within a single 1.5B parameter architecture. Unlike prior work relying on discrete tokenization or handling only partial modality subsets, it treats motion as a continuous first-class modality via a Cross-Modal Aligned Motion VAE (CMA-VAE). The framework introduces Dual-Posterior KL Alignment to distill visual semantics into motion representations without requiring images at inference, and Latent Reconstruction Alignment to bootstrap the motion pathway through dense self-supervision before sparse text calibration.

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
0
cs.CVcs.MM Guowei Tang, Tianwen Qian, Huanran Zheng et al. · Mar 23, 2026

StreamingEval introduces a unified evaluation framework for Video-LLMs under realistic streaming constraints, moving beyond offline benchmarks to assess continuous, real-time video understanding with limited memory. The protocol enforces a fixed-capacity memory bank and jointly measures encoding throughput (MaxFPS), decoding latency (TTFT), memory usage, and task accuracy via a composite StreamingScore. Experiments reveal that current "online" models often fail under strict streaming constraints, while offline models adapted with FIFO memory banks frequently outperform specialized streaming architectures at the cost of higher resource consumption.

Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Codes will be released at https://github.com/wwgTang-111/StreamingEval1.
0
cs.CL Huaibing Xie, Guoliang Zhao, Yang Liu et al. · Mar 23, 2026

The paper tackles the challenge of enhancing long-context reasoning in Large Language Models (LLMs), a critical capability as real-world tasks grow more complex. It proposes structured table data as a solution, mathematically demonstrating via mutual information analysis that tables possess periodic non-vanishing dependencies—unlike natural language which decays polynomially—making them ideal for training long-context reasoning. The authors present TableLong, a scalable pipeline for synthesizing diverse, verifiable table data for reinforcement learning, showing significant performance gains across benchmarks.

As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline(TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24\% on average), and even improves performance on out-of-domain benchmarks (+8.06\% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
0
eess.AScs.CLcs.SD Jianyi Chen, Rongxiu Zhong, Shilei Zhang et al. · Mar 22, 2026

This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
0
cs.CV Injae Kim, Chaehyeon Kim, Minseong Bae et al. · Mar 22, 2026

F4Splat tackles inefficient Gaussian allocation in feed-forward 3D Gaussian Splatting (3DGS), where existing methods uniformly assign Gaussians per pixel or voxel, causing redundancy and fixed budgets. The core idea is a learnable densification score that predicts spatial regions needing additional Gaussians based on geometric complexity and multi-view overlap, enabling adaptive allocation and explicit budget control without retraining. This matters because it delivers compact scene representations—using 10–28% of the Gaussians of prior work—while maintaining or improving rendering fidelity.

Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.
0
cs.LGstat.ML Julius Kobialka, Emanuel Sommer, Chris Kolb et al. · Mar 23, 2026

Bayesian neural networks (BNNs) suffer from fragmented, high-dimensional posteriors due to weight-space symmetries, raising doubts about the practicality of sampling-based inference. This paper demonstrates that overparametrization—using more hidden units than necessary—actually transforms the posterior geometry in beneficial ways. The authors identify three key phenomena induced by redundancy: balancedness (norm equalization across layers), weight reallocation on equal-probability manifolds (following Dirichlet distributions), and prior conformity (marginals aligning with zero-mean Gaussian priors). Through theory for ReLU networks and extensive experiments with up to 10 million posterior samples, the work explains why recent sampling methods succeed and provides a principled foundation for understanding weight priors in overparametrized regimes.

Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.
0
cs.ROcs.CV Yuheng Ji, Yuyang Liu, Huajie Tan et al. · Mar 23, 2026

PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
0
cs.CV Shenghan Chen, Yiming Liu, Yanzhen Wang et al. · Mar 22, 2026

This paper addresses long-tailed (LT) learning by proposing that the head-tail performance trade-off stems from "tail performance degradation"—where models overfit to head classes and forget tail classes. The core idea reframes LT learning as continual learning, using a Grouped Knowledge Preservation (GKP) module to maintain class-specific optimal parameters and a Grouped Sharpness Aware (GSA) module to find flatter minima. The method operates without external data or pre-trained models, showing improvements on CIFAR-LT, ImageNet-LT, and iNaturalist benchmarks.

Balancing performance trade-off on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating the broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at:https://gkp-gsa.github.io/.
0
math.NAcs.LGcs.NA Gianluca Fabiani, Michail E. Kavousanakis, Constantinos Siettos et al. · Mar 23, 2026

This paper solves stability and bifurcation analysis for nonlinear PDEs using Physics-Informed Random Projection Neural Networks (PI-RPNNs). The core innovation is a matrix-free shift-invert Krylov-Arnoldi method operating directly in weight space to circumvent the exponential singular value decay of the random collocation matrix $\Psi$. This enables reliable computation of leading eigenpairs for detecting saddle-node, Hopf, and pitchfork bifurcations without requiring additional PDE solves beyond the initial training.

We address a numerical framework for the stability and bifurcation analysis of nonlinear partial differential equations (PDEs) in which the solution is sought in the function space spanned by physics-informed random projection neural networks (PI-RPNNs), and discretized via a collocation approach. These are single-hidden-layer networks with randomly sampled and fixed a priori hidden-layer weights; only the linear output layer weights are optimized, reducing training to a single least-squares solve. This linear output structure enables the direct and explicit formulation of the eigenvalue problem governing the linear stability of stationary solutions. This takes a generalized eigenvalue form, which naturally separates the physical domain interior dynamics from the algebraic constraints imposed by boundary conditions, at no additional training cost and without requiring additional PDE solves. However, the random projection collocation matrix is inherently numerically rank-deficient, rendering naive eigenvalue computation unreliable and contaminating the true eigenvalue spectrum with spurious near-zero modes. To overcome this limitation, we introduce a matrix-free shift-invert Krylov-Arnoldi method that operates directly in weight space, avoiding explicit inversion of the numerically rank-deficient collocation matrix and enabling the reliable computation of several leading eigenpairs of the physical Jacobian - the discretized Frechet derivative of the PDE operator with respect to the solution field, whose eigenvalue spectrum determines linear stability. We further prove that the PI-RPNN-based generalized eigenvalue problem is almost surely regular, guaranteeing solvability with standard eigensolvers, and that the singular values of the random projection collocation matrix decay exponentially for analytic activation functions.
0
cs.CV Xiaoshan Wu, Xiaoyang Lyu, Yifei Yu et al. · Mar 22, 2026

This paper introduces a new computer vision task called Anytime Interframe Semantic Segmentation: predicting dense semantic segmentation at arbitrary timestamps between low-frame-rate RGB frames using only a past frame and asynchronous event data. The core idea is feature propagation via event-driven motion fields rather than direct multi-modal fusion. The method is motivated by the perceptual gaps created by LFR cameras in high-speed autonomous driving scenarios, where critical events (e.g., pedestrians entering paths) may be missed between frames.

Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: https://candy-crusher.github.io/LiFR_Seg_Proj/#; Code: https://github.com/Candy-Crusher/LiFR-Seg.git.
0
cs.CV Bin Hu, Zipeng Qi, Guoxi Huang et al. · Mar 22, 2026

Single-view reference-to-video methods struggle to preserve identity when faces rotate through large angles. This paper proposes Mv2ID, a multi-view conditioning framework that uses region-masking and a decoupled positional encoding scheme to prevent view-dependent copy-paste artifacts without requiring expensive cross-paired training data. The work is relevant for digital character creation and visual effects where identity must remain consistent across extreme viewpoints.

Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the \textit{copy-paste} problem, particularly the \textbf{\textit{view-dependent copy-paste}} artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.