Dyadic is a web-based platform for studying human-human and human-AI conversations through text- or voice-based interaction. It aims to close a methodological gap in conversation research by providing turnkey tools for experimental manipulation, live monitoring, and in-situ survey delivery during ongoing chats. The core value proposition is lowering the barrier to entry for researchers who study dyadic interaction processes, with no programming expertise required.
Cross-Layer Transcoders (CLTs) compress the attribution graphs used in mechanistic interpretability by sharing features across transformer layers, but their quadratic parameter scaling ($N_{\text{CLT}} \propto L^2$) makes training and analysis prohibitively expensive for most researchers. This paper introduces CLT-Forge, an open-source library that combines feature-sharded distributed training, compressed activation caching (int8/int4/int2 with zstd), automated interpretability pipelines, and integration with Circuit-Tracer to provide the first unified workflow for end-to-end CLT analysis at scale.
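The quadratic scaling mentioned above can be made concrete with a rough weight count. The sketch below assumes the common CLT layout in which every layer has one encoder and the features read at layer $l$ decode into every layer from $l$ onward. The function name `clt_param_count` and the example sizes are illustrative and are not taken from CLT-Forge.

```python
def clt_param_count(n_layers: int, d_model: int, n_features: int) -> dict:
    """Rough weight count for a cross-layer transcoder (biases ignored).

    Assumed layout: one encoder per layer, and one decoder for every
    (source layer, target layer) pair with target >= source.
    """
    # One encoder per layer, mapping d_model -> n_features.
    encoder = n_layers * d_model * n_features
    # Number of (source, target) pairs with target >= source: L * (L + 1) / 2.
    decoder_pairs = n_layers * (n_layers + 1) // 2
    decoder = decoder_pairs * n_features * d_model
    return {"encoder": encoder, "decoder": decoder, "total": encoder + decoder}


if __name__ == "__main__":
    # Illustrative sizes only; doubling the depth roughly quadruples
    # the decoder weights while the encoder weights merely double.
    for n_layers in (12, 24, 48):
        counts = clt_param_count(n_layers, d_model=768, n_features=4096)
        print(n_layers, counts["encoder"], counts["decoder"])
```

Under these assumptions the decoder term dominates as depth grows, which is the part of the model that feature sharding would need to spread across devices.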
WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.
Large language models often lack coverage in specialized, data-scarce domains where web text is limited. This paper proposes SPA (Scaling Prompt-engineered Augmentation), a baseline that generates large-scale synthetic corpora using just seven carefully designed prompt templates grounded in cognitive learning strategies (Concept Learning, Critical Thinking, and Generative Learning). The core finding is that this simple approach consistently outperforms complex RL-based methods like SEAL and multi-stage pipelines like EntiGraph across Wikipedia QA, long-document comprehension, and multi-hop reasoning benchmarks, suggesting that careful prompt design combined with straightforward scaling is surprisingly effective for knowledge injection.
Standard Transformers apply fixed-depth computation regardless of problem difficulty, limiting their ability to solve tasks requiring variable-depth reasoning like multi-hop traversal or nested logic. This paper proposes a depth-recurrent Transformer that iteratively applies a shared-weight block in latent space—enabling 'vertical Chain-of-Thought' where models trade recurrence steps for deeper reasoning without consuming context window. The work demonstrates strong compositional generalization on three synthetic tasks and offers a mechanistic alternative to horizontal token-generation paradigms.
Long-context LLM inference hits a memory wall: each decode step requires scanning the entire KV cache, incurring an $O(n)$ memory-bandwidth cost that faster arithmetic cannot relieve. PRISM proposes a thin-film lithium niobate photonic accelerator that performs the block-selection similarity search in $O(1)$ optical latency using a broadcast-and-weight architecture, eliminating the $O(n)$ scan entirely. The work claims $16\times$–$32\times$ traffic reduction at 64K–128K tokens and a four-order-of-magnitude energy advantage over GPU baselines by matching photonic hardware capabilities (passive query broadcast, quasi-static microring weights, and low-precision rank output) to the selection task.
This paper examines representation genesis—the transition from non-representational physical systems to those with content-manipulable states. It argues that major frameworks in philosophy of mind (Language of Thought, teleosemantics, predictive processing, enactivism, and genetic phenomenology) share a 'Representation Presupposition' structure that prevents them from explaining this first acquisition without circularity. With large language models now achieving high cognitive performance without clear genesis events, the absence of a satisfactory theory becomes urgent.
The paper critiques the institutionalization of LLM benchmarks as "Silicon Bureaucracy" and "AI Test-Oriented Education", arguing that high scores often conflate exam-oriented competence with genuine generalization because of data contamination. It proposes an audit framework using a router-worker setup: clean-control routers transmit full questions, while noisy routers delete, rewrite, and perturb question content before aggregation. For clean benchmarks, noisy aggregation should not systematically exceed the baseline; persistent above-baseline gains suggest contamination-related memory activation. The core finding, that 10 of 12 models exceed clean baselines under multi-router noisy conditions, challenges the interpretability of raw benchmark scores.
CAID tackles long-horizon software engineering tasks where single agents struggle with accuracy and wall-clock time. The core idea is Centralized Asynchronous Isolated Delegation: a manager decomposes tasks into dependency graphs and delegates to multiple engineer agents working in isolated git worktrees, integrating progress via branch-and-merge. The system improves accuracy by 26.7% absolute on PaperBench and 14.3% on Commit0, demonstrating that structured coordination grounded in SWE primitives outperforms simply scaling single-agent iteration budgets.
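The manager's use of a dependency graph in the CAID summary above can be illustrated with Python's standard `graphlib` module. This is a minimal sketch, not CAID's implementation: the task names are hypothetical, and grouping tasks into synchronous "waves" is a simplification of asynchronous delegation, where a manager would instead call `done()` as each engineer's branch is merged.

```python
from graphlib import TopologicalSorter


def delegation_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves of mutually independent work.

    `dependencies` maps each task to the set of tasks it depends on.
    Every task in a wave has all of its dependencies finished, so the
    tasks in one wave could be handed to separate engineer agents.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        # Stand-in for "engineer finishes in an isolated worktree and the
        # manager merges the branch": here every task simply completes.
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical decomposition of a small project.
    deps = {
        "models": set(),
        "api": {"models"},
        "cli": {"api"},
        "tests": {"api", "models"},
    }
    print(delegation_waves(deps))  # [['models'], ['api'], ['cli', 'tests']]
```

The useful property is that `get_ready()` only ever returns tasks whose prerequisites are merged, so parallel engineers never start from a base that is missing code they depend on.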
AdaRubric solves the static-rubric bottleneck in LLM-as-Judge evaluation by dynamically generating task-specific evaluation dimensions from task descriptions. It scores agent trajectories step-by-step with confidence-weighted per-dimension feedback and filters preference pairs using the DimensionAwareFilter—a provably necessary mechanism to prevent high-scoring dimensions from masking failures. The approach achieves Pearson $r=0.79$ correlation with human judgments and yields substantial downstream gains: +6.8–8.5 percentage points in DPO task success and +6.6 pp faster PPO convergence at 5K steps.
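To see why a dimension-aware filter matters in the AdaRubric summary above, consider a preference pair whose "chosen" trajectory wins on average but fails one dimension outright. The sketch below is an assumption about how such a filter could work, not the paper's DimensionAwareFilter: the function `keep_pair`, the 1-5 score scale, and the `min_gap` and `floor` thresholds are all hypothetical.

```python
def mean_score(scores: dict[str, float]) -> float:
    """Average of per-dimension scores."""
    return sum(scores.values()) / len(scores)


def keep_pair(
    chosen: dict[str, float],
    rejected: dict[str, float],
    min_gap: float = 1.0,
    floor: float = 2.0,
) -> bool:
    """Keep a (chosen, rejected) pair only if it is a clean preference.

    Two conditions, both hypothetical thresholds:
      1. the mean score gap is at least `min_gap`;
      2. no dimension of the chosen trajectory falls below `floor`,
         so a strong average cannot hide a failed dimension.
    """
    if mean_score(chosen) - mean_score(rejected) < min_gap:
        return False
    return min(chosen.values()) >= floor


if __name__ == "__main__":
    chosen = {"correctness": 1.0, "efficiency": 5.0, "safety": 5.0}
    rejected = {"correctness": 2.0, "efficiency": 2.0, "safety": 2.0}
    # The mean gap alone would accept this pair; the per-dimension floor
    # rejects it because the chosen trajectory fails on correctness.
    print(keep_pair(chosen, rejected))  # False
```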
This paper addresses computational barriers for Brazilian Portuguese question answering by systematically evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on BERTimbau models using the SQuAD-BR dataset. The authors test LoRA, DoRA, QLoRA, and QDoRA across Base (110M) and Large (335M) variants, demonstrating that LoRA achieves 95.8% of full fine-tuning performance while reducing training time by 73.5%. A key finding is that PEFT methods require substantially higher learning rates ($2\times 10^{-4}$) than standard BERT fine-tuning to achieve optimal results, with quantization resilience favoring larger models.
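The efficiency of LoRA in the summary above comes from training two low-rank factors instead of a full weight matrix. For a layer with weight $W \in \mathbb{R}^{d_{out} \times d_{in}}$, LoRA learns $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$, so it trains $r(d_{in} + d_{out})$ weights per adapted matrix. The sketch below only does this arithmetic; the rank of 8 and the choice of a 768-by-768 projection (the BERT-base hidden size) are illustrative assumptions, not the paper's configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights LoRA adds for one d_out x d_in matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA trainable weights as a fraction of the full matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # Illustrative: one 768 x 768 attention projection, rank 8.
    print(lora_trainable_params(768, 768, 8))  # 12288
    print(f"{lora_fraction(768, 768, 8):.2%}")  # 2.08%
```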
This paper tackles the problem of measuring dialectal bias in LLMs for Bengali, a low-resource language with nine major regional variants. The authors propose a two-phase framework combining RAG-based translation to create dialectal benchmarks with an RLAIF-inspired evaluation protocol that uses CoT-first reasoning and multi-judge validation. They expose the catastrophic failure of traditional metrics like BLEU and WER for agglutinative dialectal Bengali, showing that LLM-as-judge better predicts human quality assessments.
Large Language Models often inherit societal biases that manifest as stereotyped associations across demographic groups. This paper proposes CatRAG, a dual-mechanism debiasing framework that combines a category-theoretic functor-guided projection—collapsing protected-attribute directions in embedding space via spectral decomposition—with diversity-aware Retrieval-Augmented Generation to ground inference in balanced evidence. Evaluated on the BBQ benchmark across Llama-3, GPT-OSS, and Gemma-3, the method claims to reduce bias scores from ~60% to near zero while improving accuracy by up to 40% over base models.
AgentHER tackles the data waste problem in LLM agent training by adapting Hindsight Experience Replay (HER) from RL to natural-language trajectories. The core insight is that failed trajectories—typically 60–75% of collected data—often represent valid demonstrations for achievable alternative goals. The paper proposes a four-stage pipeline with multi-judge verification that converts discarded failures into SFT and DPO training data, yielding +7.1–11.7 pp gains over success-only fine-tuning across four model families on WebArena and ToolBench.
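The hindsight relabeling idea in the AgentHER summary above can be sketched in a few lines. This is a simplified illustration rather than the paper's four-stage pipeline: the `Trajectory` fields, the string template used for the hindsight goal, and the vote threshold are hypothetical, and the judges here are plain callables standing in for LLM judges.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Trajectory:
    goal: str                # instruction the agent was given
    steps: tuple[str, ...]   # actions taken
    outcome: str             # description of the state actually reached
    success: bool            # whether the original goal was met


# A judge decides whether `traj` is a valid demonstration of `goal`.
Judge = Callable[[str, Trajectory], bool]


def relabel(
    traj: Trajectory, judges: Sequence[Judge], min_votes: int
) -> Optional[Trajectory]:
    """Turn a failed trajectory into a demonstration of what it did achieve.

    Successful trajectories pass through unchanged. Failed ones get a
    hindsight goal derived from their outcome and are kept only if at
    least `min_votes` judges accept the relabeled pair.
    """
    if traj.success:
        return traj
    # Placeholder template; a real system would have a model write the goal.
    hindsight_goal = f"Reach this state: {traj.outcome}"
    votes = sum(1 for judge in judges if judge(hindsight_goal, traj))
    if votes < min_votes:
        return None
    return replace(traj, goal=hindsight_goal, success=True)
```

Relabeled trajectories could then serve as SFT examples, and pairing a relabeled trajectory with a rejected one for the same hindsight goal would give DPO-style preference data.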
This paper addresses the fundamental problem that correlational sentiment analysis cannot distinguish genuine economic associations from spurious statistical artifacts in financial markets. The core contribution is a refutation-validated framework for aspect-based sentiment analysis that combines net-ratio sentiment scoring with four robustness tests—placebo, random common cause, subset stability, and bootstrap validation—to filter false discoveries in high-dimensional sentiment-return analysis. This matters because investment strategies built on spurious correlations can lead to systematic losses, and regulators increasingly demand explainable AI systems with auditable validation.
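One of the four robustness checks named above, the placebo test, can be sketched with the standard library alone. The code assumes the usual net-ratio form $(P - N)/(P + N)$ for positive and negative mention counts and implements the placebo as a permutation test on a Pearson correlation. Both choices are assumptions for illustration, and the paper's exact scoring and test statistics may differ.

```python
import random


def net_ratio(positive: int, negative: int) -> float:
    """Net-ratio sentiment in [-1, 1]; 0.0 when there are no mentions."""
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; assumes both series have nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def placebo_p_value(
    sentiment: list[float], returns: list[float],
    n_shuffles: int = 1000, seed: int = 0,
) -> float:
    """Share of shuffled (placebo) sentiment series that correlate with
    returns at least as strongly as the real series does."""
    rng = random.Random(seed)
    observed = abs(pearson(sentiment, returns))
    shuffled = list(sentiment)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real sentiment-return link
        if abs(pearson(shuffled, returns)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_shuffles + 1)
```

A sentiment-return association that survives this check has a small placebo p-value; an association that shuffled sentiment reproduces just as often is a candidate false discovery.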
Preference alignment typically requires expensive weight-updating training like RLHF or DPO, which lacks mechanistic interpretability. This paper proposes DSPA, an inference-time method that dynamically steers sparse autoencoder (SAE) features based on prompt content without modifying base-model weights. By computing a sparse conditional-difference map $\mathbf{A}$ from preference triples that links prompt features to generation-control features, DSPA edits only token-active latents during decoding. The method achieves competitive open-ended generation quality with up to $4.47\times$ fewer alignment-stage FLOPs than training-based alternatives, while offering direct auditability of which features are modified and revealing that preference directions are dominated by discourse and stylistic signals.
Discrete diffusion models have been limited to simplistic noising schemes like uniform corruption or masking, restricting their ability to leverage semantic structure in large vocabularies. This paper introduces GDDS (Generalized Discrete Diffusion from Snapshots), a framework supporting arbitrary continuous-time Markov chain noising processes via exact uniformization-based sampling and a tractable snapshot-level ELBO. The work achieves state-of-the-art results on large-scale language modeling tasks, claiming to surpass autoregressive baselines for the first time at this scale.
KG-Hopper addresses Knowledge Base Question Answering (KBQA) by training compact 7B LLMs to perform multi-hop reasoning over Knowledge Graphs in a single inference round. Unlike sequential multi-step approaches that suffer from error cascades, it embeds the entire KG traversal process into a unified "thinking" stage using reinforcement learning. The core innovation is using GRPO (Group Relative Policy Optimization) with composite rewards to teach models to autonomously invoke retrieval tools via special tokens and reason across multiple hops without predefined pipelines.
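The GRPO step mentioned above scores each sampled response relative to the other responses drawn for the same question, which removes the need for a learned value model. The sketch below shows that normalization together with a weighted composite reward. The reward component names and weights are hypothetical, since the summary does not list them, and the use of the population standard deviation is an implementation assumption.

```python
from statistics import mean, pstdev


def composite_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of reward components; missing components count as 0."""
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are credited only for beating their own group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical components and weights for four sampled responses.
    weights = {"answer": 0.8, "format": 0.2}
    group = [
        {"answer": 1.0, "format": 1.0},
        {"answer": 0.0, "format": 1.0},
        {"answer": 1.0, "format": 0.0},
        {"answer": 0.0, "format": 0.0},
    ]
    rewards = [composite_reward(c, weights) for c in group]
    print(group_relative_advantages(rewards))
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.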
This paper tackles the lack of explicit memory mechanisms in transformers by introducing Mixture of Chapters (MoC)—a learned bank of 262K latent memory tokens accessed via cross-attention. To scale memory without prohibitive costs, the authors partition the bank into chapters and route each input sequence to a sparse subset (top-64), reducing complexity from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$. The work demonstrates that explicit associative memory can serve as a new axis of scaling, showing improved knowledge retention when transitioning from pretraining to instruction fine-tuning.
Designing high-performance system heuristics traditionally requires human experts to navigate multi-step conceptual shifts. This paper introduces Engram, an agentic architecture that sidesteps the 'coherence ceiling' of single-context LLM agents and the 'evolutionary neighborhood bias' of code-mutation systems by decoupling long-horizon exploration into sequential agent handoffs. Each agent distills findings into a persistent Research Digest, enabling cumulative progress without context degradation.