Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
10 papers in cs.CR
Trending mixes fresh papers with community signal.
0
cs.LGcs.CR Devashish Chaudhary, Sutharshan Rajasegarar, Shiva Raj Pokhrel et al. · Mar 23, 2026

This paper addresses the challenge of detecting network attacks in IoT environments while preserving data privacy and minimizing communication overhead. The authors propose a federated learning framework using lightweight autoencoders deployed directly on Raspberry Pi edge devices to detect anomalies in real-time through reconstruction error $\mathcal{E}(t)=\|x_{t}-\hat{x}_{t}\|^{2}$. A real-world testbed with ZigBee-enabled sensor nodes was constructed to evaluate the approach against redirection attacks, demonstrating that federated training can match centralized performance while significantly reducing data transmission from 4.5 MB to 378 KB.

The rapid expansion of the Internet of Things (IoT) and its integration with backbone networks have heightened the risk of security breaches. Traditional centralized approaches to anomaly detection, which require transferring large volumes of data to central servers, suffer from privacy, scalability, and latency limitations. This paper proposes a lightweight autoencoder-based anomaly detection framework designed for deployment on resource-constrained edge devices, enabling real-time detection while minimizing data transfer and preserving privacy. Federated learning is employed to train models collaboratively across distributed devices, where local training occurs on edge nodes and only model weights are aggregated at a central server. A real-world IoT testbed using Raspberry Pi sensor nodes was developed to collect normal and attack traffic data. The proposed federated anomaly detection system, implemented and evaluated on the testbed, demonstrates its effectiveness in accurately identifying network attacks. The communication overhead was reduced significantly while achieving comparable performance to the centralized method.
0
cs.CRcs.AIcs.CL Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera · Mar 23, 2026

SecureBreak introduces a response-level safety dataset designed to detect harmful LLM outputs that bypass alignment mechanisms. Unlike existing benchmarks that classify prompts, this work focuses on binary classification of generated responses (safe vs. unsafe) across 3,059 samples from multiple model families including Llama, Qwen, Gemma, and Mistral. The core value proposition is providing a 'last-line defense' layer for post-generation filtering and supervisory signals to guide security re-alignment, addressing the growing threat of jailbreak attacks.

Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an ``ultimate'' defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.
0
cs.CRcs.AIcs.MM Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee · Mar 23, 2026

This paper exposes a critical vulnerability in Multimodal Large Language Models (MLLMs): safety alignment fails when harmful intent is embedded in structured visual narratives. The authors introduce ComicJailbreak, a benchmark of 1,167 three-panel comics where panels 1–2 establish narrative context and panel 3 contains a blank speech bubble filled with a paraphrased harmful goal. The model is prompted to "complete the comic" by generating the fourth panel. Across 15 state-of-the-art MLLMs, comic-based attacks achieve ensemble success rates exceeding 90% on Gemini-family models and 85%+ on most open-source models—substantially outperforming plain-text and random-image baselines. The work also reveals that existing defenses (AdaShield, Attack as Defense) trigger severe over-refusal on benign prompts, and that automated safety judges are unreliable on sensitive-but-benign content.

Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. Then, with the existing defense methodologies, we show that these methods are effective against the harmful comics, they will induce a high refusal rate when prompted with benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.
0
cs.CRcs.AIcs.LG Tom Biskupski, Stephan Kleber · Mar 23, 2026

Evaluating LLM outputs at scale remains a bottleneck for deploying safe AI systems. This paper conducts a comprehensive empirical study of 37 conversational LLMs serving as automated judges across eight security and quality assessment tasks. The work identifies viable open-source alternatives to GPT-4o for judgment tasks while demonstrating that popular techniques like second-level judging and specialized evaluator models underperform compared to well-prompted general models.

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
0
cs.CRcs.AI Yanming Mu, Hao Hu, Feiyang Li et al. · Mar 23, 2026

Retrieval-Augmented Generation (RAG) systems mitigate large language model hallucinations by integrating external knowledge bases, yet this multi-module architecture introduces complex security vulnerabilities spanning data poisoning, membership inference, and adversarial manipulation. This survey systematically maps threats across the RAG pipeline—vector database construction, retrieval, and generation—and categorizes corresponding defenses from input-side access control to output-side privacy preservation. As a comprehensive review of 152 papers, it aims to unify the analysis of threat models, defense mechanisms, and evaluation benchmarks to foster trustworthy RAG deployments in sensitive domains.

Retrieval-Augmented Generation (RAG) significantly mitigates the hallucinations and domain knowledge deficiency in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization. To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline-providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.
0
cs.AIcs.CRcs.LG Gregory M. Ruddell · Mar 22, 2026

This paper introduces "silent commitment failure" — a phenomenon where instruction-tuned language models produce confident, incorrect outputs with no detectable pre-commitment warning signal — and proposes "governability" as a measurable property for AI agent safety. The core claim is that 2 of 3 instruction-following models evaluated exhibit zero-warning failure modes, with profound implications for autonomous agent deployment. The work distinguishes itself from hallucination studies by focusing on detectability before commitment rather than correctness of output, and presents empirical evidence that conflict-detection signals (the "authority band") are geometric properties fixed at pretraining rather than injectable through fine-tuning.

As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
0
cs.CRcs.AI Guang Yang, Ziye Geng, Yihang Chen et al. · Mar 22, 2026

Existing adversarial-example-based fingerprinting schemes rely on empirical heuristics to set the fingerprint-to-boundary distance, risking violations of either robustness or uniqueness. This paper proposes AnaFP, an analytical approach that derives theoretical lower and upper bounds $\tau_{\text{lower}} < \tau < \tau_{\text{upper}}$ on a stretch factor controlling this distance. By formalizing robustness and uniqueness constraints and employing surrogate model pools with quantile-based relaxation, AnaFP generates fingerprints with guaranteed properties, validated across CNNs, MLPs, and GNNs.

Adversarial-example-based fingerprinting approaches, which leverage the decision boundary characteristics of deep neural networks (DNNs) to craft fingerprints, have proven effective for model ownership protection. However, a fundamental challenge remains unresolved: how far a fingerprint should be placed from the decision boundary to simultaneously satisfy two essential properties, i.e., robustness and uniqueness, for effective and reliable ownership protection. Despite the importance of the fingerprint-to-boundary distance, existing works lack a theoretical solution and instead rely on empirical heuristics, which may violate either robustness or uniqueness properties. We propose AnaFP, an analytical fingerprinting scheme that constructs fingerprints under theoretical guidance. Specifically, we formulate fingerprint generation as controlling the fingerprint-to-boundary distance through a tunable stretch factor. To ensure both robustness and uniqueness, we mathematically formalize these properties that determine the lower and upper bounds of the stretch factor. These bounds jointly define an admissible interval within which the stretch factor must lie, thereby establishing a theoretical connection between the two constraints and the fingerprint-to-boundary distance. To enable practical fingerprint generation, we approximate the original (infinite) sets of pirated and independently trained models using two finite surrogate model pools and employ a quantile-based relaxation strategy to relax the derived bounds. Due to the circular dependency between the lower bound and the stretch factor, we apply grid search over the admissible interval to determine the most feasible stretch factor. Extensive experimental results show that AnaFP consistently outperforms prior methods, achieving effective ownership verification across diverse model architectures and model modification attacks.
0
cs.CRcs.AI Trung V. Phan, Thomas Bauschert · Mar 22, 2026

DeepXplain tackles the opacity of autonomous APT defense by integrating explainability signals directly into reinforcement learning rather than treating explanation as a post-hoc add-on. The framework augments provenance-graph-based DRL with an alignment loss that ties policy decisions to GNN-derived structural explanations and temporal attributions, coupled with a confidence-aware reward shaping term. The core claim is that this tight coupling improves both task performance (F1-score from 0.887 to 0.915) and explanation quality (confidence 0.86, fidelity 0.79) compared to black-box alternatives.

Advanced Persistent Threats (APTs) are stealthy, multi-stage attacks that require adaptive and timely defense. While deep reinforcement learning (DRL) enables autonomous cyber defense, its decisions are often opaque and difficult to trust in operational environments. This paper presents DeepXplain, an explainable DRL framework for stage-aware APT defense. Building on our prior DeepStage model, DeepXplain integrates provenance-based graph learning, temporal stage estimation, and a unified XAI pipeline that provides structural, temporal, and policy-level explanations. Unlike post-hoc methods, explanation signals are incorporated directly into policy optimization through evidence alignment and confidence-aware reward shaping. To the best of our knowledge, DeepXplain is the first framework to integrate explanation signals into reinforcement learning for APT defense. Experiments in a realistic enterprise testbed show improvements in stage-weighted F1-score (0.887 to 0.915) and success rate (84.7% to 89.6%), along with higher explanation confidence (0.86), improved fidelity (0.79), and more compact explanations (0.31). These results demonstrate enhanced effectiveness and trustworthiness of autonomous cyber defense.
0
cs.CRcs.AI Di Lu, Yongzhi Liao, Xutong Mu et al. · Mar 22, 2026

Host-acting agents let users state goals while the system figures out how to achieve them. This paper argues this convenience creates a novel attack surface: semantic under-specification. When users specify outcomes but not safety boundaries, agents must fill in missing semantics—and may choose security-divergent plans even when no attacker is present and the goal is benign.

Host-acting agents promise a convenient interaction model in which users specify goals and the system determines how to realize them. We argue that this convenience introduces a distinct security problem: semantic under-specification in goal specification. User instructions are typically goal-oriented, yet they often leave process constraints, safety boundaries, persistence, and exposure insufficiently specified. As a result, the agent must complete missing execution semantics before acting, and this completion can produce risky host-side plans even when the user-stated goal is benign. In this paper, we develop a semantic threat model, present a taxonomy of semantic-induced risky completion patterns, and study the phenomenon through an OpenClaw-centered case study and execution-trace analysis. We further derive defense design principles for making execution boundaries explicit and constraining risky completion. These findings suggest that securing host-acting agents requires governing not only which actions are allowed at execution time, but also how goal-only instructions are translated into executable plans.
0
cs.CRcs.AI Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani et al. · Mar 22, 2026

This paper investigates the security of multi-agent LLM discussions under continuous monitoring, where anomaly detectors block suspicious inter-agent messages. The authors identify that existing attacks either exhibit detectable patterns (>93% detection rates) or become ineffective when adapted for stealth (<8% success). To address this, they develop a novel attack strategy using an adversarial-aware Friedkin-Johnsen opinion dynamics model to strategically select which agents to hijack and which targets to influence. Their findings demonstrate that even under continuous monitoring, attacks can achieve over 40% success rates, revealing that monitoring alone is insufficient to secure multi-agent systems.

Multi-agent discussions have been widely adopted, motivating growing efforts to develop attacks that expose their vulnerabilities. In this work, we study a practical yet largely unexplored attack scenario, the discussion-monitored scenario, where anomaly detectors continuously monitor inter-agent communications and block detected adversarial messages. Although existing attacks are effective without discussion monitoring, we show that they exhibit detectable patterns and largely fail under such monitoring constraints. But does this imply that monitoring alone is sufficient to secure multi-agent discussions? To answer this question, we develop a novel attack method explicitly tailored to the discussion-monitored scenario. Extensive experiments demonstrate that effective attacks remain possible even under continuous monitoring, indicating that monitoring alone does not eliminate adversarial risks.