The paper addresses adaptive broadcast of data-intensive sensory streams (e.g., camera/LiDAR) to heterogeneous edge devices with diverse channel conditions and computational budgets. It proposes Nonlinear Transform Rateless Source-Channel Coding (NTRSCC), integrating learned nonlinear transforms with physical-layer Luby Transform (LT) codes to enable receivers to adaptively adjust the number of received symbols and belief propagation iterations. This achieves an explicit, controllable tradeoff between distortion, transmission rate, and decoding complexity—addressing key limitations of fixed-rate DeepJSCC schemes that either underserve capable devices or require costly retransmissions.
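The rateless idea behind LT codes can be illustrated with a minimal sketch. It assumes integer source blocks, a uniform degree choice, and a hard-decision peeling decoder. None of these are taken from the paper, which operates on learned latents with soft belief propagation at the physical layer, and the function names are hypothetical.

```python
import random

def lt_encode_symbol(blocks, rng):
    """Return (neighbor indices, XOR of those source blocks) for one coded symbol."""
    # Assumption: uniform degree choice for brevity; practical LT codes draw the
    # degree from a robust soliton distribution.
    degree = rng.randint(1, len(blocks))
    neighbors = sorted(rng.sample(range(len(blocks)), degree))
    value = 0
    for i in neighbors:
        value ^= blocks[i]
    return neighbors, value

def peel_decode(num_blocks, symbols):
    """Hard-decision peeling: resolve symbols that have exactly one unknown block."""
    recovered = {}
    progress = True
    while progress and len(recovered) < num_blocks:
        progress = False
        for neighbors, value in symbols:
            unknown = [i for i in neighbors if i not in recovered]
            if len(unknown) == 1:
                # Strip the already-known blocks out of the XOR.
                for i in neighbors:
                    if i in recovered:
                        value ^= recovered[i]
                recovered[unknown[0]] = value
                progress = True
    return recovered

blocks = [5, 9, 12]
symbols = [([0], 5), ([0, 1], 5 ^ 9), ([1, 2], 9 ^ 12)]
print(peel_decode(3, symbols[:2]))  # fewer symbols: only part of the message
print(peel_decode(3, symbols))      # one more symbol completes it
```

A receiver that collects more symbols or runs more decoding passes recovers more of the source, which is the knob the summary describes.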
The paper addresses the problem of identifying "Who said what and when" in multi-speaker video conversations, which current Omni-modal LLMs fail at due to sparse visual sampling (1-2 fps) and "shortcut learning" on visual biases. The authors introduce VR-SDR (Visual-Registered Speaker Diarization and Recognition), a rigorous benchmark that forces models to bind identities from natural language descriptions without visual shortcuts. They propose HumanOmni-Speaker, featuring a Visual Delta Encoder that samples video at 25 fps yet compresses inter-frame motion residuals into only 6 tokens per frame to capture fine-grained visemes while avoiding token explosion.
This paper investigates amortized Bayesian inference (ABI) for estimating coupling parameters in Kuramoto oscillator networks—a nonlinear dynamical system widely used to study synchronization. The authors apply neural posterior estimation via BayesFlow to learn an amortized approximation of the posterior distribution from simulated phase dynamics. While the method succeeds for simple single-parameter networks, the paper's central finding is that it fails for complex multi-node networks due to structural non-identifiability and data inefficiency—making the title's focus on 'limitations' well-earned.
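For readers unfamiliar with the model, the forward simulator that such an inference pipeline would learn from can be sketched as below. This assumes the simplest all-to-all network with a single global coupling `coupling` and forward-Euler integration. The function names, step size, and initial conditions are illustrative and not taken from the paper.

```python
import cmath
import math

def kuramoto_step(theta, omega, coupling, dt):
    """One Euler step of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(theta)
    updated = []
    for i in range(n):
        pull = sum(math.sin(theta[j] - theta[i]) for j in range(n))
        updated.append(theta[i] + dt * (omega[i] + coupling / n * pull))
    return updated

def order_parameter(theta):
    """Magnitude of the mean phase vector: 1 means full synchronization."""
    return abs(sum(cmath.exp(1j * t) for t in theta)) / len(theta)

def simulate(theta0, omega, coupling, dt=0.01, steps=2000):
    theta = list(theta0)
    for _ in range(steps):
        theta = kuramoto_step(theta, omega, coupling, dt)
    return theta

theta0 = [0.0, 1.0, 2.0]
omega = [1.0, 1.0, 1.0]
print(order_parameter(theta0))
print(order_parameter(simulate(theta0, omega, coupling=2.0)))
```

In the single-parameter case, the inference task is to recover `coupling` from phase trajectories like these.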
Instance segmentation in dense microscopy images requires separating tightly packed, touching cells—a task where binary masks and pixel-wise losses often merge adjacent instances. This paper proposes predicting continuous signed distance functions (SDFs) mapped to probabilistic segmentations via a learnable sigmoid, trained with a differentiable Modified Hausdorff Distance (MHD) loss. The approach eliminates the need for interactive prompting or watershed post-processing while aiming to improve boundary fidelity in high-throughput cellular imaging.
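The two ingredients named above can be sketched on toy inputs. The sketch assumes that positive SDF values lie inside a cell and uses the classical point-set form of the Modified Hausdorff Distance (the larger of the two mean nearest-neighbor distances). The paper's differentiable variant and its learnable sigmoid parameters are not reproduced here, and `sharpness` is a hypothetical stand-in for the learned slope.

```python
import math

def sdf_to_probability(sdf_value, sharpness=1.0):
    """Map a signed distance to a foreground probability with a sigmoid."""
    # Assumption: positive distances are inside the object, so 0 maps to 0.5.
    return 1.0 / (1.0 + math.exp(-sharpness * sdf_value))

def modified_hausdorff(points_a, points_b):
    """Point-set MHD: max of the two directed mean nearest-neighbor distances."""
    def directed(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

boundary_pred = [(0.0, 0.0), (2.0, 0.0)]
boundary_true = [(0.0, 0.0)]
print(sdf_to_probability(0.0))
print(modified_hausdorff(boundary_pred, boundary_true))
```

Because the directed terms are means, not maxima, a single outlier boundary point shifts the distance less than it would under the plain Hausdorff distance.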
MRI acquisition is inherently slow due to sequential k-space sampling. This paper proposes TRUST-MRI, an active sampling framework that leverages discrete anatomical tokens from the pretrained MedITok tokenizer and a latent Transformer to guide measurement selection. The core innovation uses token prediction entropy as an uncertainty signal, introducing two policies: Latent Entropy Selection (LES) projects patch-wise entropy to k-space to select lines, while Gradient-based Entropy Optimization (GEO) uses gradients of total entropy with respect to input measurements. The approach trades pixel-wise fidelity for perceptual quality and computational efficiency, achieving superior feature-based metrics while running at 0.97 fps compared to 0.01 fps for diffusion-based active methods.
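The uncertainty signal itself is simple to illustrate. The sketch below computes Shannon entropy over a per-patch token distribution and ranks patches by it. It assumes each patch comes with a normalized probability vector and omits the projection to k-space lines that LES performs. The function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain_patches(patch_probs, k):
    """Indices of the k patches whose token predictions have the highest entropy."""
    ranked = sorted(
        range(len(patch_probs)),
        key=lambda i: token_entropy(patch_probs[i]),
        reverse=True,
    )
    return ranked[:k]

patch_probs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
print(most_uncertain_patches(patch_probs, 2))
```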
Cross-tokenizer knowledge distillation faces a fundamental alignment challenge when the teacher and student models use different vocabularies. This paper analyzes DSKD-CMA, the state-of-the-art method for this setting, through manual chunk alignment probes and reveals that its cross-model attention mechanism captures coarse chunk structures but localizes noisily when tokens repeat. Building on this insight, the authors propose DSKD-CMA-GA, which uses generative adversarial key-query matching to align distributions between models, achieving modest improvements in ROUGE-L scores that narrow the gap between cross-tokenizer and same-tokenizer distillation.
RGBD SLAM with 3D Gaussian Splatting (3DGS) struggles to balance scalability against rendering fidelity: global Gaussians consume excessive GPU memory, while view-tied Gaussians (fixed at depth) suffer from limited novel-view quality. This paper proposes pixel-aligned Gaussians that can adjust their positions along viewing rays via learned depth offsets, paired with a fast geometry-similarity tracking strategy using Generalized ICP on depth distributions. The approach claims state-of-the-art rendering and tracking performance while maintaining smaller active memory footprints than prior 3DGS-based methods.
Stand-up comedy depends as much on timing and embodied presence as on verbal content, yet computational humor has largely focused on text alone. This paper introduces TIC-TALK, a multimodal corpus of 90 professionally filmed Netflix specials (2015–2024) with temporally aligned annotations for language, gesture, and audience response. The processing pipeline combines BERTopic for thematic segmentation, Whisper-AT for laughter detection, and YOLOv8 for shot classification and pose keypoint extraction, all aligned hierarchically without resampling. The authors validate the resource through corpus-level findings including a negative correlation between kinetic energy and laughter rate ($r = -0.75$), consistent with a stillness-before-punchline pattern, and through a short-horizon laughter prediction benchmark.
This paper addresses paper-code consistency detection in bioinformatics, tackling the reproducibility crisis where algorithmic descriptions in publications often diverge from software implementations. The authors introduce BioCon, a benchmark of 48 bioinformatics projects with expert-annotated sentence-code pairs, and propose a cross-modal framework using UniXcoder with weighted focal loss. While the task is important for computational biology reproducibility, claims of novelty require qualification given concurrent efforts in the broader scientific community.
This paper tackles Unsupervised Continuous Anomaly Detection (UCAD), where models must sequentially learn new product categories without forgetting previous ones or storing all raw data. The core idea is to augment visual-only approaches with learnable text prompts from CLIP, storing both modalities in a Continuous Multimodal Prompt Memory Bank (CMPMB) and fusing them via a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM). Benchmarked on MVTec AD and VisA, the authors claim state-of-the-art detection accuracy (+4.4% AUROC) and segmentation (+14.8% AUPR) over the prior UCAD baseline.
ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.
This paper establishes information-theoretic limits on LLM steganography, proving that any semantic-preserving embedding of a payload $P$ into a covertext $M_1$ to produce stegotext $M_2$ must increase Kolmogorov complexity by at least $K(P) - O(\log n)$. Since Kolmogorov complexity is uncomputable, the authors propose perplexity ratios (specifically the Binoculars score) as a practical proxy and validate the approach on a color-based encoding scheme with 300 samples.
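Written as an inequality, the stated bound would read roughly as follows. This is only a restatement of the sentence above, under the assumption that the increase is measured between the stegotext and the covertext. The quantity $n$ is not defined in this summary and is left as is.

$$
K(M_2) - K(M_1) \ge K(P) - O(\log n)
$$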
CVT-Bench evaluates whether multimodal LLMs can maintain stable spatial representations under counterfactual viewpoint transformations—such as inferring object relationships from a camera angle never shown in the image. Using 100 synthetic tabletop scenes and 6,000 relational queries across rotations from $0^{\circ}$ to $360^{\circ}$, the benchmark reveals that state-of-the-art models, despite high single-view accuracy, systematically fail at mental rotation tasks and degrade further under extended sequential context. These findings challenge the assumption that strong episodic spatial performance implies robust viewpoint-invariant representations, with critical implications for embodied AI and robotics applications requiring perspective-taking.
SparseDVFS tackles energy-efficient DNN inference on edge devices by bridging the gap between coarse model-level and prohibitive operator-level DVFS. The core insight is using operator sparsity to distinguish compute-bound and memory-bound phases, applying specialized frequency triplets via a block-level strategy. A white-box offline modeler, greedy graph partitioner with amortization constraints, and unified co-governor with look-ahead pipelining collectively achieve substantial energy savings while managing switching overheads.
Autoregressive video diffusion models struggle with minute-scale generation due to error accumulation in long-horizon rollouts. This paper challenges the assumption that more memory is better, proposing instead to decompose KV-cache conditioning into three functional roles—Sink for global anchors, Tail for recent continuity, and dynamically selected History for mid-range structure. The result is a training-free inference method that improves motion dynamics by 66.8% while cutting attention overhead by roughly 2.6×.
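The three-way split can be sketched as an index selection over cached frames. The sketch assumes a per-frame relevance score is already available and that History keeps the top-scoring mid-range entries. The summary does not state the paper's actual selection criterion, so `scores` and the function name are hypothetical.

```python
def select_cache_indices(scores, num_sink, num_tail, num_history):
    """Pick Sink (oldest), Tail (most recent), and top-scoring History entries."""
    n = len(scores)
    sink = list(range(min(num_sink, n)))
    # Tail never overlaps Sink, even when the cache is short.
    tail_start = max(len(sink), n - num_tail)
    tail = list(range(tail_start, n))
    middle = range(len(sink), tail_start)
    # Assumption: higher score means more useful mid-range context.
    best = sorted(middle, key=lambda i: scores[i], reverse=True)[:num_history]
    return sink + sorted(best) + tail

scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.5]
print(select_cache_indices(scores, num_sink=1, num_tail=2, num_history=2))
```

Attention then runs over the selected entries only, which is where the reduced overhead would come from.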
Weather captioning—generating natural language descriptions from meteorological time series—sits at the intersection of time-series analysis and domain-specific NLG. This paper proposes WeatherTGD, a training-free framework that treats caption refinement as gradient descent in text space: three specialized LLM agents (Statistical, Physics, Meteorology) output textual gradients that are fused via a consensus-aware mechanism and applied iteratively to improve an initial caption. The approach aims to bridge the gap between numerical forecasting and human-interpretable explanations without any model fine-tuning.
This paper tackles the Close Small Object Unmixing (CSOU) problem for infrared imagery, where distant clustered targets appear as overlapping mixed spots due to optical diffraction limits. The authors propose DSCSNet, a deep-unfolded network that unrolls the ADMM algorithm with learnable parameters to recover target count, sub-pixel positions, and radiant intensities from mixed spots. The core idea is to replace the traditional ℓ2-norm smoothness terms with strict ℓ1-norm sparsity constraints and add a dynamic thresholding mechanism for scene-adaptive reconstruction.
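An ℓ1 sparsity term inside ADMM typically leads to a soft-thresholding update, which is the proximal operator of the ℓ1 norm. The sketch below shows that operator with a fixed threshold. In a deep-unfolded network the threshold would instead be learned or adapted per scene, and that part is not modeled here.

```python
def soft_threshold(values, threshold):
    """Proximal operator of the l1 norm: shrink each entry toward zero."""
    result = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0.0:
            result.append(0.0)  # small entries are zeroed, giving sparsity
        else:
            result.append(magnitude if v > 0 else -magnitude)
    return result

print(soft_threshold([3.0, -0.5, -2.0, 0.2], 1.0))
```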
TAMTRL addresses the temporal credit assignment problem in multi-turn RL for long-context document processing. When LLMs process documents chunk-by-chunk with memory updates, standard outcome-only rewards cannot distinguish good from bad intermediate memory updates. The paper proposes using the model itself as a teacher: during training, the model is given filtered (relevant-only) chunks, and the normalized token probabilities of the generated memory are used as turn-level rewards. This avoids expensive rollouts or external judges while providing fine-grained supervision for each turn.
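One plausible reading of "normalized token probabilities" is length normalization, i.e. the geometric mean of the per-token probabilities. The sketch below implements that reading only. It is an assumption, since the summary does not specify the normalization, and `turn_reward` is a hypothetical name.

```python
import math

def turn_reward(token_logprobs):
    """Length-normalized sequence probability: exp of the mean token log-probability."""
    # Assumption: "normalized" means dividing the log-likelihood by the token count,
    # so long and short memory updates are scored on the same scale.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

confident_turn = [math.log(0.9), math.log(0.8)]
uncertain_turn = [math.log(0.3), math.log(0.2), math.log(0.4)]
print(turn_reward(confident_turn), turn_reward(uncertain_turn))
```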
FluidGaussian addresses a critical gap in 3D reconstruction: methods optimized solely for photometric losses often produce visually plausible but physically implausible geometries that fail in downstream simulations. The paper proposes coupling 3D Gaussian Splatting with incompressible fluid simulation (SPH/DFSPH) to define a simulation-based uncertainty metric—velocity divergence at the fluid-structure interface—and integrates it into active view selection. By reranking next-best-view candidates using this physical signal, the method improves both visual fidelity (PSNR) and physical plausibility (divergence) on synthetic and aerodynamic datasets.
Medical Vision-Language Models (Med-VLMs) for ultrasound analysis are vulnerable to subtle prompt variations that mimic real clinical communication patterns. This paper proposes a black-box attack framework using an LLM to generate minimal, clinically plausible text edits guided by Monte Carlo Tree Search (MCTS), requiring no access to the target model's weights or gradients. The study reveals that small adversarial rewrites can drastically degrade diagnostic QA accuracy—raising critical safety concerns for deploying such systems in point-of-care settings where prompt variability is inherent.