This paper analyzes temporal dynamics in Swiss digital news across French, German, and Italian language regions using a triangulated methodology that combines quantitative NLP with qualitative interpretation. The authors process 1.7 million articles to study how different event types—Brexit, Swiss Wolf, Christmas, and the British Royal Family—are covered across linguistic boundaries, introducing domestication profiles and proximity salience ratios to quantify cultural proximity effects.
4DGS360 addresses the ill-posed challenge of reconstructing dynamic objects from monocular video by tackling a critical failure mode: existing methods rely on 2D-native priors that overfit to visible surfaces and cannot reconstruct occluded regions at extreme viewpoints (>90°). The authors propose AnchorTAP3D, a hybrid 3D tracker that leverages high-confidence 2D track points as spatial-temporal anchors to stabilize long-term tracking and resolve depth ambiguity in occluded areas. Combined with a new iPhone360 benchmark featuring test cameras up to 135° from training views, the method enables coherent 360° 4D reconstruction without diffusion priors.
This paper develops a neural operator framework for approximating mappings defined on constrained Wasserstein spaces $\mathcal{M}_\lambda$, consisting of probability measures on $I \times \mathbb{R}^d$ with prescribed marginal $\lambda$ on the label space $I$. The core contribution is the DeepONetCyl architecture, which combines cylindrical moment approximations $\Phi_J(\mu) = (\langle \varphi_1, \mu \rangle, \ldots, \langle \varphi_J, \mu \rangle)$ with a DeepONet-type branch–trunk structure to preserve the marginal constraint. This enables learning of heterogeneous (non-exchangeable) mean-field control problems where agent interactions depend on labels, extending prior neural methods beyond the exchangeable case.
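For an empirical measure $\mu = \frac{1}{N}\sum_i \delta_{(u_i, x_i)}$, the cylindrical moment map $\Phi_J$ reduces to averaging each test function over the samples. The sketch below illustrates only that step, not the DeepONetCyl architecture; the test functions, the sample data, and the name `cylindrical_moments` are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A sample is a pair (label u in I, state x in R^d).
Sample = Tuple[float, Sequence[float]]
TestFn = Callable[[float, Sequence[float]], float]


def cylindrical_moments(samples: List[Sample], test_fns: List[TestFn]) -> List[float]:
    """Return Phi_J(mu) = (<phi_1, mu>, ..., <phi_J, mu>) for the empirical
    measure mu that puts mass 1/N on each sample."""
    n = len(samples)
    return [sum(phi(u, x) for u, x in samples) / n for phi in test_fns]


# Hypothetical test functions phi_j(u, x); a real model would pick its own family.
test_fns: List[TestFn] = [
    lambda u, x: 1.0,                           # total mass
    lambda u, x: u,                             # depends on the label only
    lambda u, x: x[0],                          # first state coordinate
    lambda u, x: math.cos(math.pi * u) * x[0],  # label-state interaction
]

# Three labelled particles with labels in I = [0, 1] and d = 1.
samples: List[Sample] = [(0.0, [1.0]), (0.5, [2.0]), (1.0, [3.0])]
moments = cylindrical_moments(samples, test_fns)
```

Moments of test functions that depend only on the label, such as the second one here, are determined by the marginal $\lambda$ alone, so they carry no information beyond the constraint itself.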
Omni-WorldBench addresses the gap between passive video generation metrics and active world model evaluation by focusing on interactive response—how actions causally drive state transitions across space and time. It introduces Omni-WorldSuite, a 1,068-prompt hierarchical taxonomy spanning three interaction levels (single-object to global environmental effects), and Omni-Metrics, an agent-based evaluation protocol that aggregates Interaction Effect Fidelity, Generated Video Quality, and Camera-Object Controllability into an adaptive AgenticScore.
AdaEdit tackles the injection dilemma in flow-based image editing, where source feature injection preserves backgrounds but suppresses novel content generation. The authors propose two training-free adaptations: a Progressive Injection Schedule that uses continuous decay functions (sigmoid, cosine, linear) instead of binary cutoffs, and Channel-Selective Latent Perturbation, which applies per-channel AdaIN based on distributional gaps between inverted and random latents. Experiments on PIE-Bench show that AdaEdit improves background preservation, with an 8.7% LPIPS reduction versus ProEdit, while maintaining competitive CLIP scores.
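The contrast between a binary cutoff and a continuous decay can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the normalized step `t`, the `midpoint` and `sharpness` parameters, the direction of the decay (strong injection early, weak injection late), and the linear blending rule are all assumptions.

```python
import math
from typing import List


def injection_weight(t: float, kind: str = "cosine",
                     midpoint: float = 0.5, sharpness: float = 10.0) -> float:
    """Source-injection weight in [0, 1] at normalized step t in [0, 1].

    Assumed convention: the weight starts near 1 and decays towards 0.
    """
    t = min(max(t, 0.0), 1.0)
    if kind == "linear":
        return 1.0 - t
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(sharpness * (t - midpoint)))
    raise ValueError(f"unknown schedule: {kind}")


def binary_cutoff(t: float, cutoff: float = 0.5) -> float:
    """Baseline for comparison: full injection before the cutoff, none after."""
    return 1.0 if t < cutoff else 0.0


def blend(source_feat: List[float], target_feat: List[float], w: float) -> List[float]:
    """Hypothetical blending rule: weight w on source features, 1 - w on target."""
    return [w * s + (1.0 - w) * g for s, g in zip(source_feat, target_feat)]


# Weights over an 11-step trajectory under each schedule.
steps = [i / 10 for i in range(11)]
schedules = {k: [injection_weight(t, k) for t in steps]
             for k in ("linear", "cosine", "sigmoid")}
```

A continuous schedule removes the single step at which injection switches off, so intermediate steps receive a mixture of source and newly generated features.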
This paper tackles the challenge of automating BT-RADS (Brain Tumor Reporting and Data System) classification for post-treatment glioma MRI surveillance. BT-RADS requires integrating complex information: volumetric tumor changes, medication effects (steroids, bevacizumab), and radiation timing. The authors propose an end-to-end pipeline combining CNN-based tumor segmentation with a multi-agent LLM system to extract clinical variables from unstructured notes and apply algorithmic scoring logic. This matters because manual BT-RADS scoring is error-prone, with prior studies showing substantial inter-reader variability and inconsistent application of clinical context.
This paper presents PhotoBeamSolver, a hybrid system that converts hand-drawn beam diagrams into analytical structural solutions by combining computer vision with large language models. The core idea uses a custom-trained YOLO-based detector to identify supports and loads from images, feeding a symbolic solver that computes shear, moment, and deflection diagrams. While targeted at academic and quick professional verification tasks, the work highlights the challenges of integrating deep learning into safety-critical structural engineering workflows.
Multifidelity surrogate modeling aims to leverage cheap low-fidelity simulations to improve predictions of expensive high-fidelity models when training data is scarce. This paper proposes MAGPI, a Gaussian process regression method that augments the high-fidelity input space with features derived from recursively trained low-fidelity surrogate models. The approach unifies desirable properties of cokriging and autoregressive estimators while allowing non-GP models at the low-fidelity levels, and the authors report superior accuracy and computational efficiency.
This paper challenges the long-held assumption that infrared and visible image fusion (IVIF) requires strictly paired training data. The authors propose UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP), demonstrating that pixel-level self-supervision enables training on unaligned cross-modal combinations. By reformulating the maximum likelihood objective to treat infrared and visible images as independent variables, they show that a base dataset of $N$ pairs can be expanded to $N^2$ trainable combinations, potentially reducing collection costs while improving generalization.
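The $N \to N^2$ expansion is simply the Cartesian product of the infrared and visible image sets once the pairing constraint is dropped. The sketch below shows the counting only; the file-name strings and the function name `arbitrary_pairings` are hypothetical, and no training logic is implied.

```python
from itertools import product
from typing import List, Tuple


def arbitrary_pairings(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Combine every infrared image with every visible image.

    `pairs` holds the N originally aligned (infrared, visible) items; the
    result holds all N * N cross-modal combinations, aligned or not.
    """
    infrared = [ir for ir, _ in pairs]
    visible = [vis for _, vis in pairs]
    return list(product(infrared, visible))


# Hypothetical base dataset of N = 3 aligned pairs.
base = [("ir_0", "vis_0"), ("ir_1", "vis_1"), ("ir_2", "vis_2")]
combos = arbitrary_pairings(base)
```

The original aligned pairs remain a subset of the expanded set, so the paired setting is recovered as a special case.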
Articulated object reconstruction typically requires either multi-view capture of discrete states or monocular video with a strict static-base-part assumption, limiting practical deployment. FreeArtGS introduces a "free-moving" setting where both joint angles and object poses vary arbitrarily during capture, using only a monocular RGB-D video. The method combines motion-based part segmentation via point tracking priors with joint estimation and 3D Gaussian Splatting optimization to jointly reconstruct geometry, appearance, and articulation.
Cardiac ultrasound view acquisition is notoriously operator-dependent, limiting reproducibility and access. This paper proposes an anatomical prior (AP)-driven framework that unifies cardiac structure segmentation with autonomous probe adjustment. The core innovation is a spatial-relation graph (SRG) module that injects spatial-topological constraints into YOLO-based segmentation, coupled with an RL formulation where states and rewards are built from quantifiable anatomical features drawn from Gaussian priors. The work matters because it offers an interpretable alternative to black-box end-to-end methods, potentially enabling zero-shot sim-to-real deployment for robotic echocardiography.
3D fragment reassembly becomes challenging at scale because incorrect contact adjacencies trigger cascading failures. This paper proposes SARe, a generative framework that explicitly models contact structure by jointly predicting fracture-surface tokens and inter-fragment adjacency graphs, paired with an inference-time refinement stage that anchors reliable substructures to correct uncertain regions. The work demonstrates state-of-the-art results across synthetic and real fracture datasets, with notable improvements in the regime with many fragments (large $K$).
This paper tackles unregistered hyperspectral-multispectral image fusion (HMF), where spatially misaligned images with partial overlap must be mutually super-resolved without training data or co-registration. The authors propose FRESCO, a two-stage unsupervised framework that uses coupled block-term tensor decomposition (BTD) for MSI spectral super-resolution and latent-space adversarial learning for HSI spatial super-resolution. The work is notable for offering the first theoretical recoverability guarantees in the unregistered setting, addressing a practically important gap in remote sensing.
This paper investigates whether large language models exhibit metacognitive control—specifically, whether they use internal confidence signals to guide abstention decisions (knowing when to answer versus withhold responses). The authors develop a rigorous four-phase paradigm combining behavioral analysis, activation steering, and computational modeling to demonstrate that abstention arises from a two-stage confidence-decision pathway involving confidence representation formation followed by threshold-based policy implementation. Their findings suggest that LLMs deploy native confidence signals in a structured manner paralleling biological metacognition, with substantial implications for safe AI deployment.
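A two-stage confidence-then-threshold decision can be illustrated with a toy example. This is not the paper's paradigm: the maximum softmax probability is used here only as a stand-in for an internal confidence signal, and the threshold value and function names are assumptions.

```python
import math
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over answer logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits: List[float]) -> float:
    """Stage 1 (assumed proxy): form a scalar confidence from the logits."""
    return max(softmax(logits))


def answer_or_abstain(logits: List[float], threshold: float = 0.7) -> Optional[int]:
    """Stage 2: apply a threshold policy to the confidence.

    Returns the index of the chosen answer, or None to abstain.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if confidence(logits) >= threshold else None


confident = answer_or_abstain([5.0, 0.0, 0.0])   # one option dominates
uncertain = answer_or_abstain([1.0, 0.9, 0.8])   # near-uniform options
```

Separating the two stages makes it possible to change the policy (the threshold) without touching how the confidence is formed, which is the kind of dissociation the summary describes.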
Video facial expression recognition (FER) suffers from severe subject-specific distribution shifts that degrade CLIP model performance at test time. This paper proposes TTA-CaP, a gradient-free test-time adaptation method that personalizes models using three coordinated caches—a fixed source-domain prototype cache, a dynamic positive target cache for reliable samples, and a negative cache for uncertain predictions—coupled with a tri-gate filtering mechanism to prevent error accumulation.
DepthTCM tackles depth map compression by combining physics-inspired Multiwavelength Depth (MWD) encoding—mapping depth to sinusoidal 3-channel images—with global 4-bit quantization and a Transformer-CNN mixed learned codec. The core claim is that this hybrid approach reshapes depth statistics into a form amenable to modern learned image compression, achieving 60% bitrate reduction over prior MWD methods while maintaining >99% geometric accuracy.
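The idea of mapping depth to a sinusoidal 3-channel image and then quantizing coarsely can be sketched as below. The channel layout (sine, cosine, and a normalized coarse depth), the fringe `period`, and the uniform per-channel quantizer are assumptions chosen for illustration; the paper's exact encoding and its learned codec are not reproduced here.

```python
import math
from typing import Tuple

LEVELS = 15  # 4-bit quantization: integer codes 0..15


def quantize(v: float) -> float:
    """Uniformly quantize a value in [0, 1] to 16 levels."""
    return round(min(max(v, 0.0), 1.0) * LEVELS) / LEVELS


def mwd_encode(z: float, period: float = 0.25) -> Tuple[float, float, float]:
    """Encode a normalized depth z in [0, 1] as three quantized channels."""
    phase = 2.0 * math.pi * z / period
    return (
        quantize(0.5 + 0.5 * math.sin(phase)),  # fine channel 1
        quantize(0.5 + 0.5 * math.cos(phase)),  # fine channel 2
        quantize(z),                            # coarse channel for unwrapping
    )


def mwd_decode(channels: Tuple[float, float, float], period: float = 0.25) -> float:
    """Recover depth: wrapped phase from the fine channels, period index
    from the coarse channel."""
    s, c, coarse = channels
    frac = (math.atan2(s - 0.5, c - 0.5) % (2.0 * math.pi)) / (2.0 * math.pi)
    k = round(coarse / period - frac)  # which period the depth falls in
    return (k + frac) * period


# Round-trip error over a grid of depths, and the coarse channel's own error.
grid = [i / 100 for i in range(101)]
max_err = max(abs(mwd_decode(mwd_encode(z)) - z) for z in grid)
max_coarse_err = max(abs(quantize(z) - z) for z in grid)
```

In this toy setting the coarse channel only needs to identify the correct period, so heavy quantization of it is tolerable, while the two sinusoidal channels recover the position within the period.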
The paper addresses the quadratic complexity of transformer attention and limited local detail extraction in RGB-D Salient Object Detection (SOD). It proposes STENet, which introduces superpixels as intermediate tokens to reduce computational overhead while preserving structural coherence. The core idea replaces global pixel-to-pixel attention with two modules: one for pixel-to-superpixel global enhancement and another for intra-superpixel local refinement, aiming to balance efficiency and accuracy.
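The efficiency argument can be made concrete by counting query-key pairs. The sketch below is a back-of-the-envelope count under assumed module shapes (every pixel attends to every superpixel token, then pixels attend only within their own superpixel); it is not a measurement of STENet, and the function names are hypothetical.

```python
from typing import List


def full_attention_pairs(n_pixels: int) -> int:
    """Global pixel-to-pixel attention: every pixel attends to every pixel."""
    return n_pixels * n_pixels


def superpixel_attention_pairs(superpixel_sizes: List[int]) -> int:
    """Assumed two-module cost:
    - pixel-to-superpixel: each of N pixels attends to M superpixel tokens;
    - intra-superpixel: pixels attend only to pixels in the same superpixel.
    """
    n = sum(superpixel_sizes)
    m = len(superpixel_sizes)
    global_pairs = n * m
    local_pairs = sum(s * s for s in superpixel_sizes)
    return global_pairs + local_pairs


# A 64 x 64 feature map split into 64 superpixels of 64 pixels each.
sizes = [64] * 64
dense = full_attention_pairs(sum(sizes))
sparse = superpixel_attention_pairs(sizes)
ratio = dense // sparse
```

With these assumed sizes the pair count drops by a factor of 32; the actual saving depends on the number and size distribution of superpixels.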
This paper investigates which static analysis alert removals actually reduce bug rates—a critical question since developers constantly face noisy linting warnings. The author employs three complementary methods: a randomized controlled trial with 521 manual interventions, labeling functions to identify intervention-like events in 8,245 natural commits, and supervised learning to predict beneficial removals. The core finding is that removing complexity alerts (too-many-branches, too-many-nested-blocks) via method extraction reduces bug tendency by 4.1–5.5 percentage points, offering evidence-based guidance for prioritizing refactoring efforts.
BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.
Video subtitle removal traditionally requires expensive per-frame mask annotations and external detection modules during both training and inference. CLEAR introduces a two-stage mask-free framework that decouples prior extraction (via self-supervised disentangled feature learning) from generative refinement (via LoRA-adapted diffusion with adaptive weighting). The method claims to train only 0.77% of base model parameters while achieving a +6.77 dB PSNR gain and zero-shot generalization across six languages without ground-truth masks at inference.