This paper evaluates three inference-time strategies—self-consistency with temperature/top-p sampling, dual-model cross-verification, and iterative self-reflection—to improve multi-step reasoning in LLMs without parameter updates. The core premise is that aggregating diverse reasoning traces or validating across models yields more reliable outputs than single-pass decoding. The work addresses a practical need for deployment scenarios where retraining is infeasible, though the experimental scope is limited by unclear model specifications and dataset choices.
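The aggregation step behind self-consistency can be sketched as a majority vote over the final answers of several sampled traces. This is a minimal illustration, not the paper's implementation: the function name `self_consistency` and the sample answers are hypothetical, and the sampling itself (temperature/top-p decoding) is assumed to have happened upstream.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) yields the (answer, vote_count) pair with the highest count.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers parsed from five independently sampled traces.
sampled = ["42", "42", "41", "42", "17"]
best, share = self_consistency(sampled)
```

The vote share gives a rough agreement signal; a low share suggests the traces disagree and the answer deserves less trust.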
Domain Elastic Transform (DET) addresses the registration of high-dimensional vector-valued functions on irregular, sparse manifolds—a critical bottleneck in spatial transcriptomics where gene expression data resides on scattered cell positions rather than regular grids. The core idea is a Bayesian framework that treats registration as elastic domain deformation guided by a joint spatial-functional likelihood, bypassing the lossy voxelization required by image-based methods while exploiting functional signals that pure geometric point-set registration ignores. This matters because it enables training-free analysis of massive atlases (e.g., MERFISH, Stereo-seq) without sacrificing single-cell resolution.
This paper tackles multimodal hate speech detection where hateful intent emerges from complex interactions between text and images—what the authors call "more than the sum of its parts." The core innovation is the Stratified Multimodal Interaction (SMI) paradigm, which categorizes eight distinct cross-modal interaction patterns into three difficulty levels (Easy, Normal, Hard), coupled with the ARCADE framework that simulates an asymmetric courtroom debate between Prosecutor, Defender, and Judge agents to decipher subtle intent shifts. This matters because current detection systems fail when hateful content is constructed implicitly through benign-seeming modalities that only become toxic in combination.
QMoP tackles the computational bottleneck in multimodal LLMs caused by excessive visual tokens, which dwarf text tokens in memory and compute costs. The paper proposes a Query Guided Mixture-of-Projector that dynamically combines three compression strategies—pooling for global semantics, resampling for high-level features, and pruning for fine-grained details—via a learned router. This adaptive approach matters because fixed compression rules inherently sacrifice different information types (global context vs. local details) depending on the task.
DeepXplain tackles the opacity of autonomous APT defense by integrating explainability signals directly into reinforcement learning rather than treating explanation as a post-hoc add-on. The framework augments provenance-graph-based DRL with an alignment loss that ties policy decisions to GNN-derived structural explanations and temporal attributions, coupled with a confidence-aware reward shaping term. The core claim is that this tight coupling improves both task performance (F1-score from 0.887 to 0.915) and explanation quality (confidence 0.86, fidelity 0.79) compared to black-box alternatives.
Host-acting agents let users state goals while the system figures out how to achieve them. This paper argues this convenience creates a novel attack surface: semantic under-specification. When users specify outcomes but not safety boundaries, agents must fill in missing semantics—and may choose security-divergent plans even when no attacker is present and the goal is benign.
Current multimodal large language models rely on expensive annotated data or teacher distillation for reasoning improvements. This paper proposes an unsupervised self-evolution framework that trains without ground-truth labels or external reward models by instantiating two roles: an Actor that generates multiple reasoning trajectories and a frozen Judge that modulates consistency-based rewards. The method employs group-wise distributional modeling using Group Relative Policy Optimization (GRPO) to convert absolute scores into relative advantages, achieving absolute accuracy gains of up to 5.9 points on MathVision while maintaining healthier training entropy than majority-voting baselines.
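The conversion from absolute scores to relative advantages can be illustrated with the group normalization commonly used in GRPO: each score is centered on the group mean and scaled by the group standard deviation. This is a sketch under assumptions. The summary does not give the paper's exact formula, so the use of the population standard deviation, the `eps` stabilizer, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-8):
    """Center scores on the group mean and scale by the group std (assumed form)."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population std; a sample std would also be plausible
    # eps avoids division by zero when every trajectory gets the same score.
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical consistency-based scores for four trajectories of one prompt.
scores = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(scores)
```

Trajectories that score above the group mean receive positive advantages and those below receive negative ones, so only the relative ordering within a group drives the update.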
This paper investigates whether AI-assisted writing improves essay quality at the cost of homogenizing student thinking. Analyzing 6,875 essays across five conditions (Human-only, AI-only, and three Human+AI prompt strategies), the authors identify a Quality-Homogenization Tradeoff whereby substantial quality gains co-occur with structural convergence. The effect is dimension-specific: cohesion architecture loses 70–78% of its variance while perspective plurality diversifies, and prompt specificity can reverse homogenization into diversification on argument depth.
Sonny tackles the compute barrier in medium-range weather forecasting by proposing a hierarchical transformer that trains on a single A40 GPU in 5.5 days. The core idea is a two-stage StepsNet pipeline: a narrow 'slow path' processes large-scale dynamics (U,V,Z,P) first, then a full-width 'fast path' integrates thermodynamics (T,Q). Combined with EMA during training, randomized dynamics forecasting, and pressure-weighted losses, Sonny aims to deliver competitive forecast skill without the TPU/GPU cluster requirements of models like Pangu-Weather or GraphCast.
Existing counterfactual image generation methods produce either global changes or require tedious user-defined masks. This paper proposes Positional Seg-CFT, which subdivides anatomical structures into regional segments (e.g., proximal, mid, distal) and derives independent measurements per region from pretrained segmentors. The extension enables spatially localized interventions for modeling regional disease progression, demonstrated on coronary CT angiography.
This paper tackles the tension between local melodic continuity and global structural coherence in symbolic music generation. It proposes a hybrid architecture fusing a Transformer encoder (for global patterns) with an LSTM decoder (for temporal precision), evaluating it against pure LSTM and Transformer baselines using 17 musical quality metrics on 1,000 generated melodies per model. The work matters because it provides systematic evidence that architectural hybridization can reconcile the complementary strengths of memory-based and attention-based models.
WARBENCH is a benchmark for evaluating LLMs in military decision-making, addressing critical gaps in current frameworks by testing International Humanitarian Law (IHL) compliance, edge deployment constraints, fog-of-war robustness, and explicit reasoning. Using 136 high-fidelity scenarios derived from real post-WWII conflicts, the authors expose severe structural flaws: state-of-the-art models collapse under complex terrain and asymmetric force distributions, while edge-optimized models exhibit legal violation rates approaching 70%.
This paper investigates the security of multi-agent LLM discussions under continuous monitoring, where anomaly detectors block suspicious inter-agent messages. The authors identify that existing attacks either exhibit detectable patterns (>93% detection rates) or become ineffective when adapted for stealth (<8% success). To address this, they develop a novel attack strategy using an adversarial-aware Friedkin-Johnsen opinion dynamics model to strategically select which agents to hijack and which targets to influence. Their findings demonstrate that even under continuous monitoring, attacks can achieve over 40% success rates, revealing that monitoring alone is insufficient to secure multi-agent systems.
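For reference, the standard Friedkin-Johnsen update is $x(t+1) = \Lambda W x(t) + (I - \Lambda)\,x(0)$, where $W$ is a row-stochastic influence matrix and $\Lambda$ is a diagonal matrix of susceptibilities. The sketch below implements only this textbook model; the paper's adversarial-aware variant and its agent-selection procedure are not specified in this summary and are not reproduced. The three-agent matrix and the choice to model a hijacked agent as fully stubborn are illustrative assumptions.

```python
def fj_step(x, x0, W, lam):
    """One Friedkin-Johnsen update: blend neighbor opinions with the initial opinion."""
    n = len(x)
    return [
        lam[i] * sum(W[i][j] * x[j] for j in range(n)) + (1 - lam[i]) * x0[i]
        for i in range(n)
    ]


def fj_run(x0, W, lam, steps=200):
    """Iterate the update from the initial opinions x0 for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, W, lam)
    return x


# Hypothetical setup: agent 0 is fully stubborn (susceptibility 0) and holds
# opinion 1.0; agents 1 and 2 start at 0.0 and are highly susceptible.
W = [
    [1.0, 0.0, 0.0],
    [0.5, 0.25, 0.25],
    [0.25, 0.25, 0.5],
]
x0 = [1.0, 0.0, 0.0]
lam = [0.0, 0.9, 0.9]
final = fj_run(x0, W, lam)
```

In this toy setup the agent that weights the stubborn agent more heavily ends up closer to its opinion, which is the kind of asymmetry a target-selection strategy could exploit.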
This paper tackles logical context poisoning—the degradation of LLM responses when flat, linear conversation structures force topically distinct threads to accumulate in a single unbounded context window. The core idea is the Conversation Tree Architecture (CTA), which models conversations as a directed rooted tree $\mathcal{T}=(V,E,r,W)$ where each node $v \in V$ maintains an isolated local context window $w_v$. Structured flow operations—downstream passing $\phi_{\downarrow}$, upstream merging $\psi_{\uparrow}$, and volatile nodes—govern how context moves between branches. This matters because current interfaces offer no middle ground between discarding context (new chat) and accumulating noise (linear threads).
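One way to picture the tree $\mathcal{T}=(V,E,r,W)$ is as nodes that each hold their own context list $w_v$, with explicit operations for moving context between them. The sketch below is an interpretation, not the paper's specification: the names `Node`, `branch`, and `merge_up` are hypothetical, downstream passing is assumed to copy a selected subset of the parent's messages, upstream merging is assumed to append a summary to the parent, and a volatile node is assumed to be discarded without merging.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    name: str
    parent: Optional["Node"] = field(default=None, repr=False)
    context: List[str] = field(default_factory=list)  # local window w_v
    children: List["Node"] = field(default_factory=list, repr=False)
    volatile: bool = False


def branch(parent, name, passed, volatile=False):
    """Downstream passing (assumed): seed the child with selected parent messages."""
    child = Node(name, parent, [m for m in parent.context if m in passed],
                 volatile=volatile)
    parent.children.append(child)
    return child


def merge_up(child, summary):
    """Upstream merging (assumed): hand a summary to the parent, then close the branch."""
    if child.parent is None:
        raise ValueError("the root has no parent to merge into")
    if not child.volatile:  # volatile branches leave no trace in the parent
        child.parent.context.append(summary)
    child.parent.children.remove(child)


root = Node("root", context=["plan trip to Kyoto", "budget is 2000 EUR"])
food = branch(root, "food", passed={"budget is 2000 EUR"})
food.context.append("shortlist: ramen, kaiseki")
scratch = branch(root, "scratch", passed=set(), volatile=True)
scratch.context.append("unrelated tangent")
merge_up(food, "food: ramen and kaiseki fit the budget")
merge_up(scratch, "tangent summary")
```

The root ends up with the food summary but nothing from the volatile branch, which illustrates the middle ground between starting a new chat and accumulating everything in one thread.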
The paper addresses federated fine-tuning of Mixture-of-Experts (MoE) based large language models under non-IID data distributions, where direct parameter aggregation causes gating preference misalignment and expert semantic blurring. The proposed FedAlign-MoE framework introduces consistency-based gating distribution alignment using routing consistency weighting ($\omega_i(e) = s_i(e)/\sum_j s_j(e)$) and semantic-aware expert aggregation via region-conditioned gated weights ($\gamma_{i,j}(e)$). This matters because MoE architectures are increasingly vital for scaling LLMs efficiently, yet data heterogeneity across federated clients undermines their specialization benefits.
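The routing consistency weighting $\omega_i(e) = s_i(e)/\sum_j s_j(e)$ normalizes each client's score for an expert by the total score that expert receives across clients. A minimal sketch follows; the function name, the nested-list layout `scores[i][e]`, the uniform fallback for an all-zero column, and the example scores are assumptions made for illustration.

```python
def consistency_weights(scores):
    """Map scores[i][e] = s_i(e) to weights[i][e] = s_i(e) / sum_j s_j(e)."""
    n_clients = len(scores)
    n_experts = len(scores[0])
    weights = [[0.0] * n_experts for _ in range(n_clients)]
    for e in range(n_experts):
        total = sum(scores[i][e] for i in range(n_clients))
        for i in range(n_clients):
            # Assumed fallback: uniform weights if no client has a positive score.
            weights[i][e] = scores[i][e] / total if total > 0 else 1.0 / n_clients
    return weights


# Hypothetical consistency scores for three clients and two experts.
scores = [
    [0.9, 0.1],
    [0.3, 0.3],
    [0.3, 0.6],
]
weights = consistency_weights(scores)
```

For each expert the weights sum to one across clients, so a client whose routing is more consistent for that expert contributes proportionally more to its aggregated gating.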
Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is challenging due to document length and the distribution of scientific arguments across sections. This paper proposes a two-stage retrieve-and-extract pipeline that first links an abstract finding to its corresponding hypothesis, then extracts the statistical evidence supporting that hypothesis. Through controlled ablations varying context quantity ($k \in \{5, 10, 20\}$), retrieval quality (standard RAG, reranking, fine-tuned retriever), and oracle paragraph settings, the authors demonstrate that hypothesis extraction is primarily bounded by retrieval quality, while evidence extraction faces persistent extractor limitations even with perfect paragraph selection.
This paper formalizes the transformer context window as an I/O page and proves that tool-augmented agents with indexed external memory achieve exponential retrieval cost savings over sequential scanning: $\mathcal{O}(\log_b N)$ versus $\Omega(N)$ page reads. The authors validate these predictions experimentally across three content types and identify "parametric memory competition" as a failure mode where models bypass retrieval protocols for familiar content.
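The gap between $\mathcal{O}(\log_b N)$ and $\Omega(N)$ is easy to see numerically. The sketch below counts the levels a $b$-ary index needs to single out one of $N$ pages and compares that with a worst-case sequential scan. It is an idealized counting model, not the paper's cost model: the function names are hypothetical, and it assumes one page read per index level and ignores the final fetch of the target page.

```python
def indexed_page_reads(n_pages, branching):
    """Index levels needed to narrow n_pages down to one, i.e. ceil(log_b N)."""
    if n_pages < 1 or branching < 2:
        raise ValueError("need n_pages >= 1 and branching >= 2")
    reads, reach = 0, 1
    # Each level multiplies the number of distinguishable pages by `branching`.
    while reach < n_pages:
        reach *= branching
        reads += 1
    return reads


def sequential_page_reads(n_pages):
    """Worst case for a scan: the target sits on the last page."""
    return n_pages


n, b = 1_000_000, 10
indexed = indexed_page_reads(n, b)
scanned = sequential_page_reads(n)
```

Integer arithmetic is used instead of `math.log` to avoid floating-point rounding at exact powers of the branching factor.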
The paper tackles the labor-intensive challenge of creating software architecture views, which are essential for documentation but often become outdated: 75% are never updated after creation. The authors conduct a large-scale empirical study evaluating whether LLMs and agentic approaches can automate view generation from source code, testing 3 LLMs across 3 prompting strategies and 2 agentic approaches on 340 repositories. This matters because as systems grow complex, automated view generation could bridge the gap between implementation and architectural documentation, potentially alleviating the manual burden that leads to outdated artifacts.
TRACE is a multi-agent LLM system designed to automate end-to-end seismological analysis, from raw waveform processing to physical mechanism inference. The framework addresses the longstanding bottleneck of expert-dependent interpretation in seismology by orchestrating modules for catalog construction, statistical analysis, and cross-perspective reasoning, demonstrated on two distinct tectonic environments: the 2019 Ridgecrest earthquake sequence and the 2025 Santorini-Kolumbo volcanic crisis.
This paper reports that an autonomous AI ecosystem (SUBSTRATE S3) independently discovered the need for Z3 SMT-based formal verification across six distinct domains—ranging from LLM code to tool APIs to hardware assembly—without being explicitly instructed to do so. The authors treat this convergence as evidence that formal verification "emerges" as a fundamental property of AI systems reasoning about safety. They then present substrate-guard, a unified Python framework implementing Z3 verification across five AI output classes. The claim matters because if true, it would suggest AI systems naturally recognize the limitations of empirical testing and converge on mathematical proof as a safety mechanism.