Through a large-scale empirical study of 13 models across 6 classical moral dilemmas, this paper investigates whether LLMs exhibit genuine moral reasoning or merely produce convincing moral rhetoric. Using Kohlberg's stages of moral development as a diagnostic framework, the authors evaluate whether model outputs track human developmental patterns or reflect alignment training artifacts. The core finding is "moral ventriloquism": the hypothesis that models acquire post-conventional moral language through RLHF without the underlying cognitive architecture, evidenced by distributional inversions (86% Stages 5-6 vs. human Stage 4 dominance), near-robotic cross-dilemma consistency (ICC > 0.90), and "moral decoupling", where stated justifications misalign with action choices.
Evaluating LLM outputs at scale remains a bottleneck for deploying safe AI systems. This paper conducts a comprehensive empirical study of 37 conversational LLMs serving as automated judges across eight security and quality assessment tasks. The work identifies viable open-source alternatives to GPT-4o for judgment tasks while demonstrating that popular techniques like second-level judging and specialized evaluator models underperform compared to well-prompted general models.
As AI agents move from human-supervised copilots to fully autonomous infrastructure, organizations face a critical observability gap: existing systems capture computational state and execution traces but lack structured records of the agent's reasoning. This paper introduces the Agent Execution Record (AER), a schema-level primitive that captures intent, observation, and inference as first-class queryable fields at execution time. The core claim is that reasoning provenance cannot be faithfully reconstructed from state checkpoints due to fundamental non-identifiability (intent multiplicity, observation ambiguity, inference volatility). If validated, AERs would enable population-level behavioral analytics—systematic comparison of reasoning patterns across thousands of investigations, confidence calibration against expert judgments, and counterfactual regression testing via mock replay—that existing tooling achieves only through fragile post-hoc extraction.
WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.
This paper proposes that AI inference tokens are evolving into a standardized commodity like electricity, and designs a complete futures market framework including the "Standard Inference Token" (SIT) contract, settlement mechanisms, and margin systems. The core motivation is hedging compute cost risk for application-layer enterprises as inference displaces training as the dominant AI cost.
This work attacks the friction between smooth GELU training (ubiquitous in Transformers) and piecewise-linear deployment pipelines (quantization, formal verification). The authors parametrize GELU as $f(x;\lambda) = x\Phi(\lambda x)$ with learnable sharpness $\lambda \geq 1$, deriving a principled annealing target from an $\ell_1$ bound on the approximation to the Heaviside step. While the hardening protocol reduces the validation drop incurred when ReLU is substituted for GELU in vision and tabular tasks, the 25% annealing switch is heuristic and actual downstream benefits in integer-only inference or SMT verification remain unevaluated.
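To make the parametrization concrete, here is a minimal Python sketch of $f(x;\lambda) = x\Phi(\lambda x)$ and of how it approaches ReLU as $\lambda$ grows. The function names and the grid-based gap probe are illustrative assumptions, not the authors' code, and the probe is not the paper's $\ell_1$ bound.

```python
import math


def phi(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sharp_gelu(x: float, lam: float = 1.0) -> float:
    """f(x; lam) = x * Phi(lam * x); lam = 1 recovers the standard GELU."""
    return x * phi(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(lam: float, lo: float = -4.0, hi: float = 4.0, steps: int = 801) -> float:
    """Largest |f(x; lam) - ReLU(x)| on a uniform grid (a numerical probe only)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(sharp_gelu(x, lam) - relu(x)) for x in grid)


if __name__ == "__main__":
    for lam in (1.0, 2.0, 5.0, 10.0):
        print(f"lambda={lam:>4}: max gap to ReLU = {max_gap_to_relu(lam):.4f}")
```

Pointwise, the gap equals $|x|\,\Phi(-\lambda|x|)$, so its maximum shrinks in proportion to $1/\lambda$; this is why a larger sharpness makes a later ReLU substitution less disruptive.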
SHAPE addresses unsupervised domain adaptation for medical image segmentation, where models trained on one imaging modality (e.g., MRI) degrade sharply when applied to another (e.g., CT). The core innovation shifts the paradigm from pixel-level correctness to global anatomical plausibility through a DINOv3 foundation model, a Hierarchical Feature Modulation (HFM) module for class-aware alignment, and a Hypergraph Plausibility Estimation (HPE) pipeline that validates pseudo-labels using higher-order anatomical relationships. This matters for deploying robust clinical segmentation models across diverse imaging environments without costly manual re-annotation.
PW-FouCast addresses the degradation of radar-only precipitation nowcasting at long lead times by proposing a frequency-domain fusion framework that integrates Pangu-Weather foundation model priors with radar observations. The core insight is that meteorological forecasts and radar reflectivity share similar phase structure despite differing amplitudes, enabling spectral alignment through phase-aware modulation and memory-based correction. The approach achieves quantitative improvements on standard benchmarks and offers a novel alternative to spatial fusion methods.
Traditional latent diffusion models require staging—first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.
This paper investigates how Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning by focusing on the *direction* of policy updates rather than their magnitude. The authors introduce $\Delta \log p$, the signed log-probability difference between base and RLVR models, and argue it better captures reasoning-critical tokens than magnitude-based metrics like entropy or KL divergence. They validate this through token-replacement interventions and propose two practical applications: a test-time extrapolation method that amplifies the learned direction without additional training, and a training-time reweighting scheme that focuses learning on low-probability tokens.
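The quantity $\Delta \log p$ and the idea of amplifying it at test time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sign convention (RLVR minus base), the linear extrapolation rule with strength `alpha`, and the toy next-token distributions are all hypothetical and are not taken from the paper.

```python
import math


def delta_log_p(base_logp: dict, rlvr_logp: dict) -> dict:
    """Signed per-token shift, assumed here to be log p_RLVR - log p_base."""
    if base_logp.keys() != rlvr_logp.keys():
        raise ValueError("both models must score the same candidate tokens")
    return {tok: rlvr_logp[tok] - base_logp[tok] for tok in rlvr_logp}


def extrapolate(base_logp: dict, rlvr_logp: dict, alpha: float) -> dict:
    """Push the RLVR distribution further along the learned direction.

    Hypothetical rule: score = log p_RLVR + alpha * delta, then renormalize.
    alpha = 0 returns the RLVR distribution unchanged.
    """
    delta = delta_log_p(base_logp, rlvr_logp)
    scores = {tok: rlvr_logp[tok] + alpha * delta[tok] for tok in rlvr_logp}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}


if __name__ == "__main__":
    # Toy next-token distributions over two candidate tokens.
    base = {"therefore": math.log(0.5), "maybe": math.log(0.5)}
    rlvr = {"therefore": math.log(0.7), "maybe": math.log(0.3)}
    print(delta_log_p(base, rlvr))
    amplified = extrapolate(base, rlvr, alpha=1.0)
    print({tok: round(math.exp(lp), 3) for tok, lp in amplified.items()})
```

Unlike entropy or KL divergence, the signed difference keeps track of whether a token became more or less likely, which is the information the extrapolation step reuses.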
Large language models often lack coverage in specialized, data-scarce domains where web text is limited. This paper proposes SPA (Scaling Prompt-engineered Augmentation), a baseline that generates large-scale synthetic corpora using just seven carefully designed prompt templates grounded in cognitive learning strategies (Concept Learning, Critical Thinking, and Generative Learning). The core finding is that this simple approach consistently outperforms complex RL-based methods like SEAL and multi-stage pipelines like EntiGraph across Wikipedia QA, long-document comprehension, and multi-hop reasoning benchmarks, suggesting that careful prompt design combined with straightforward scaling is surprisingly effective for knowledge injection.
This paper tackles the limitation that XAI systems assume static user models, ignoring diverse epistemic stances among domain experts. The authors propose agentic personas—structured representations of expert reasoning strategies derived from clustered feedback and instantiated via LLMs—to condition reinforcement learning-based explanation generation on knowledge graphs. This enables adaptive explanations that align with specific interpretive preferences (mechanistic rigor vs. focused clarity) without requiring extensive individual-level human feedback, demonstrated in drug discovery with 22 expert participants.
Personalized image generation with diffusion models relies on Low-Rank Adaptation (LoRA) to fine-tune models efficiently, but current practice uses a fixed rank across all layers regardless of subject complexity. This paper proposes LoRA2, which learns adaptive ranks per LoRA component via a variational framework that imposes an importance ordering over rank indices using a discretized exponential distribution. The method achieves better subject fidelity and prompt alignment while using significantly less memory than high-rank baselines, addressing the combinatorial explosion of searching $S K^L$ architectural configurations.
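One way to picture an importance ordering over rank indices is a discretized exponential prior whose mass decays with the index. The Python sketch below is an assumption-laden illustration: the exact form $p(k) \propto e^{-\text{rate}\cdot k}$, the `rate` parameter, and the cumulative-mass truncation rule are hypothetical and may differ from the paper's variational formulation.

```python
import math


def discretized_exponential(num_ranks: int, rate: float) -> list:
    """p(k) proportional to exp(-rate * k) for k = 0 .. num_ranks - 1."""
    weights = [math.exp(-rate * k) for k in range(num_ranks)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_rank(probs: list, mass: float = 0.95) -> int:
    """Smallest number of leading indices whose cumulative probability reaches `mass`."""
    cumulative = 0.0
    for k, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= mass:
            return k
    return len(probs)


if __name__ == "__main__":
    for rate in (0.1, 0.5, 1.0):
        probs = discretized_exponential(16, rate)
        print(f"rate={rate}: effective rank = {effective_rank(probs)}")
```

Because the mass is monotonically decreasing in the index, truncating at any point always keeps the leading components, so a single learned decay parameter per LoRA component selects a rank without enumerating configurations.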
Multi-Objective Reinforcement Learning (MORL) agents must balance competing objectives like speed versus energy consumption, yet existing Explainable RL methods fail to clarify how specific behavioral choices drive Pareto trade-offs. This paper proposes TREX, a post-hoc trajectory attribution framework that clusters agent behaviors into semantically meaningful segments and quantifies each cluster's influence on objective trade-offs by training complementary policies that exclude specific trajectory groups. The work addresses a genuine gap in explainability by moving beyond policy selection to reveal which behavioral patterns (such as "long leaps" versus "short strides") justify the agent's learned trade-off logic.
The paper addresses sample-efficient selection among multiple pretrained generative models, formulated as a diversity-aware multi-armed bandit problem where the optimal solution may be a mixture rather than a single model. The authors challenge the necessity of explicit UCB exploration bonuses, proposing that Mixture-Greedy—which directly optimizes empirical diversity objectives without optimism bonuses—can achieve sublinear regret through implicit exploration induced by the objective geometry. This matters because sampling from suboptimal generative models is computationally expensive, and their results suggest that structural properties of diversity metrics (FID, Vendi, RKE) naturally enforce sufficient exploration without costly confidence bound computations.
CICTM addresses deformable brain MRI registration by combining transformer-based global context modeling with cycle inverse-consistency constraints. The core idea uses a Swin-UNet to jointly estimate forward and backward deformation fields, penalizing inconsistencies at both image and flow levels while enforcing topology preservation via Jacobian regularization. The work matters for large-scale neuroimaging studies where deformation stability and physical plausibility are as important as alignment accuracy.
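The two constraints named above can be illustrated on a one-dimensional toy grid. This Python sketch uses textbook definitions (the residual $u(x) + v(x + u(x))$ for inverse consistency and the sign of $1 + du/dx$ for folding); the helper names are hypothetical and the paper's actual 3D losses may be formulated differently.

```python
def interp(values: list, pos: float) -> float:
    """Linearly interpolate a 1D list at a fractional position, clamping at the ends."""
    pos = min(max(pos, 0.0), len(values) - 1.0)
    i = min(int(pos), len(values) - 2)
    t = pos - i
    return (1.0 - t) * values[i] + t * values[i + 1]


def inverse_consistency_error(fwd: list, bwd: list) -> float:
    """Mean of |u(x) + v(x + u(x))|; zero when the backward field undoes the forward one."""
    residuals = [abs(u + interp(bwd, x + u)) for x, u in enumerate(fwd)]
    return sum(residuals) / len(residuals)


def folding_fraction(fwd: list) -> float:
    """Fraction of grid cells where the 1D Jacobian 1 + du/dx is non-positive."""
    jac = [1.0 + (fwd[i + 1] - fwd[i]) for i in range(len(fwd) - 1)]
    return sum(j <= 0.0 for j in jac) / len(jac)


if __name__ == "__main__":
    shift_right = [1.0] * 5   # forward displacement u(x)
    shift_left = [-1.0] * 5   # backward displacement v(y)
    print(inverse_consistency_error(shift_right, shift_left))  # consistent pair
    print(folding_fraction([0.0, -2.0, 0.0, 0.0]))             # one folded cell
```

A non-positive Jacobian means neighboring points have crossed, which is the kind of topology violation a Jacobian regularizer is meant to penalize.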
This paper identifies a critical failure mode in multimodal AI evaluation called the 'mirage effect,' where vision-language models generate confident descriptions and reasoning about images that were never provided. The authors demonstrate that frontier models (GPT-5, Gemini-3-Pro, Claude Opus 4.5) retain 70–80% of their benchmark accuracy when evaluated without any visual input, with medical benchmarks showing 60–99% susceptibility to such non-visual inference. A text-only 3B-parameter model fine-tuned on chest X-ray questions outperforms both frontier multimodal systems and human radiologists, exposing how current benchmarks fail to distinguish genuine visual understanding from sophisticated textual pattern matching. The findings challenge the validity of accuracy metrics for multimodal systems and propose B-Clean, a method to filter benchmark questions that can be answered without images.
Constraint-based causal discovery algorithms like PC require exponentially many conditional independence (CI) tests in the worst case---specifically $p^{\mathcal{O}(d)}$ where $d$ is the maximum degree. This paper establishes that the fundamental complexity parameter is actually $s$, the maximum undirected clique size in the essential graph, which can be much smaller than $d$ (e.g., $s=2$ vs $d=p-2$ in Figure 1). The authors propose Greedy Ancestral Search (GAS), which achieves $p^{\mathcal{O}(s)}$ CI tests, and prove a matching lower bound of $2^{\Omega(s)}$, establishing exponent-optimality up to a logarithmic factor.
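The "logarithmic factor" can be read off by writing both bounds with base 2. This is a short worked restatement of the bounds quoted above, under the assumption that the gap is measured in the exponent:

$$
\underbrace{p^{\mathcal{O}(s)}}_{\text{GAS upper bound}} = 2^{\mathcal{O}(s \log p)},
\qquad
\underbrace{2^{\Omega(s)}}_{\text{lower bound}}
\quad\Longrightarrow\quad
\frac{\text{upper exponent}}{\text{lower exponent}} = \mathcal{O}(\log p).
$$

For the example with $s = 2$ and $d = p - 2$, the degree-based bound $p^{\mathcal{O}(d)}$ becomes $p^{\mathcal{O}(p)}$, which is exponential in $p$, whereas the clique-based bound $p^{\mathcal{O}(s)}$ stays polynomial in $p$.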
Long-tail class incremental learning (LT-CIL) suffers from scarce tail-class data and catastrophic forgetting. This paper tackles both issues by using large language models to generate a stratified language tree (SL-Tree) that hierarchically organizes semantic information from coarse to fine granularity. Two parallel guidance mechanisms (adaptive language guidance with learnable per-class weights, and alignment language guidance using semantic space stability) dynamically supervise tail classes and constrain optimization. The authors report state-of-the-art results on the ImageNet-R, CIFAR100, and CUB200 benchmarks.
EnterpriseLab tackles the challenge of deploying AI agents in enterprise settings where data sovereignty and cost constraints make frontier models impractical. The paper introduces a full-stack platform that unifies tool integration via Model Context Protocol (MCP), automated trajectory synthesis from environment schemas, and integrated training pipelines including a novel Agentic GRPO method. The core value proposition is that small 8B models can match GPT-4o on enterprise tasks while cutting inference costs by 8–10×, enabling on-premise deployment without sacrificing operational capability.