SecureBreak -- A dataset towards safe and secure models

cs.CR cs.AI cs.CL cs.LG Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera · Mar 23, 2026

AI Review

Plain-language introduction

SecureBreak introduces a response-level safety dataset designed to detect harmful LLM outputs that bypass alignment mechanisms. Unlike existing benchmarks that classify prompts, this work focuses on binary classification of generated responses (safe vs. unsafe) across 3,059 samples from multiple model families including Llama, Qwen, Gemma, and Mistral. The core value proposition is providing a 'last-line defense' layer for post-generation filtering and supervisory signals to guide security re-alignment, addressing the growing threat of jailbreak attacks.
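
The 'last-line defense' idea can be pictured as a thin wrapper around generation. The sketch below is illustrative only and is not the authors' pipeline: `guarded_generate`, `toy_generate`, `toy_judge`, and the refusal text are hypothetical names, and the toy functions stand in for a real LLM and for a classifier fine-tuned on response-level labels.

```python
from typing import Callable

REFUSAL = "I can't help with that request."  # hypothetical refusal text


def guarded_generate(
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    prompt: str,
    refusal: str = REFUSAL,
) -> str:
    """Return the model's response, or a refusal if the judge flags it."""
    response = generate(prompt)
    # The judge sees only the response, mirroring response-level labeling.
    if is_unsafe(response):
        return refusal
    return response


# Stand-ins for illustration only (assumptions, not real models).
def toy_generate(prompt: str) -> str:
    if "jailbreak" in prompt:
        return "[harmful instructions]"
    return "Paris is the capital of France."


def toy_judge(response: str) -> bool:
    return "harmful" in response


blocked = guarded_generate(toy_generate, toy_judge, "jailbreak attempt")
allowed = guarded_generate(toy_generate, toy_judge, "What is the capital of France?")
```

Because the judge receives only the response, it has no access to the prompt context. This matches the response-level framing described above.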

Critical review

Verdict

Bottom line

SecureBreak offers a practical contribution to response-level safety classification, but it is limited by its narrow scope and overclaimed defensive capabilities. The dataset provides manually annotated labels (Cohen's $\kappa=0.85$) on responses generated from JailbreakBench prompts across diverse model families, and the paper demonstrates that fine-tuned classifiers can achieve 76-90% accuracy. However, the work positions itself as an 'ultimate defense layer' without sufficient validation against adaptive adversaries and without demonstrating generalization beyond the specific jailbreak patterns in the source data. The experimental design in Table III mixes causal LMs (Mistral, Llama) with Qwen in a Seq2Seq setting, which makes comparisons misleading, and the small dataset size (3,059 samples) raises questions about whether the approach scales to production deployment.

“We preserved only annotations with high agreement, resulting in an average Cohen's kappa of 0.85 on the curated dataset.”

SecureBreak · Section III-A
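
For readers unfamiliar with the statistic, Cohen's kappa corrects raw agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch with made-up labels for two annotators; the data is hypothetical and unrelated to the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: ten items, one disagreement.
annotator_1 = ["safe"] * 5 + ["unsafe"] * 5
annotator_2 = ["safe"] * 4 + ["unsafe"] * 6
kappa = cohens_kappa(annotator_1, annotator_2)
```

Since the quoted value is computed after keeping only high-agreement annotations, it describes the curated set rather than the raw annotation pass.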

“The fine-tuned model achieved an accuracy of 90.14%, indicating strong alignment with human annotations.”

SecureBreak · Section IV-C

What holds up

The manual annotation approach is methodologically sound, achieving substantial inter-annotator agreement and conservative safety labeling. The dataset successfully captures the 'helpfulness trap' phenomenon where mid-sized models like Llama-8B show lower safety rates than smaller variants due to balancing instruction-following with safety constraints. The finding that expert advice categories (medical, legal, financial) pose greater safety challenges than physical harm is valuable and aligns with real-world deployment risks. The provision of specific LoRA hyperparameters (rank $r=8$, $\alpha=16$, dropout $0.05$) supports reproducibility.

“Contrary to the expectation that parameter scaling improves safety adherence, the data highlights a 'Helpfulness Trap'. The Llama family exhibits a significant inverse trend: the smaller Llama-1b achieves a top-tier safety rate, statistically tying with the much larger Mistral-7b.”

SecureBreak · Section III-C

“The category 'Expert Advice' dominates both the Safe (286) and Unsafe (198) quadrants. This indicates that this is the most frequently tested or most contentious category.”

SecureBreak · Section III-C

Main concerns

The paper suffers from three critical limitations. First, the dataset is derived exclusively from JailbreakBench prompts, so it captures only failures elicited by those specific adversarial patterns, and classifiers trained on it may not generalize to novel attack vectors. Second, the 'ultimate defense' claim is unsupported: there is no adversarial evaluation of whether the proposed judge LLMs themselves can be fooled via prompt injection or other attacks targeting the classifier. Third, the performance comparison across architectures is misleading: Table III compares causal LMs with Qwen in a Seq2Seq setting without controlling for architectural advantages or acknowledging that encoder-decoder models may inherently perform better at classification tasks. Beyond these three, the authors do not address potential data leakage risks when some models used for dataset generation (Qwen) are later evaluated as classifiers.

“To build our dataset, we started by the information available in previous works, in which different LLMs were prompted with harmful questions contained in the JailbreakBench dataset.”

SecureBreak · Section III-A

“The results indicate that the dataset is valuable not only for constructing post-generation filtering modules that act as a last-line defense, but also for building additional supervisory intelligence for alignment optimization.”

SecureBreak · Abstract

Evidence and comparison

The evidence supports the claim that fine-tuning on SecureBreak improves classifier performance over base models (accuracy increases from 57-63% to 76-83%), but the baseline comparisons are weak. The authors use unmodified base models as baselines rather than comparing against existing safety classifiers, few-shot prompted LLMs, or automated evaluation methods like those in JailbreakBench. The category-wise analysis showing that fine-tuned models outperform base models on the 'Expert Advice' and 'Fraud' categories is convincing, though the sample sizes per category remain undisclosed. The comparison with related work in Table I is fair but omits that the response-level focus sacrifices the prompt-context information that input-level classifiers can leverage.

“From Table III, we can clearly see that when base models are used in classification of the responses into safe and unsafe they do not perform up to the expectation.”

SecureBreak · Section IV-B

“Across all categories and models—with a single exception—the fine-tuned versions consistently outperform the base models, particularly in the more nuanced, moderate-risk categories.”

SecureBreak · Section IV-D
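
The concern about undisclosed per-category sample sizes can be made concrete with a confidence interval. The sketch below uses the standard Wilson score interval for a binomial proportion; the counts are hypothetical and chosen only to show how the same accuracy is far less certain on a small category.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total (z=1.96 ~ 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts: the same 76% accuracy at two category sizes.
small = wilson_interval(38, 50)
large = wilson_interval(380, 500)
small_width = small[1] - small[0]
large_width = large[1] - large[0]
```

With the smaller category the interval is several times wider, so a per-category gain over the base model is hard to interpret without knowing how many samples it rests on.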

Reproducibility

The paper provides a GitHub repository link and detailed training configurations including LoRA parameters ($r=8$, $\alpha=16$), learning rate ($0.0005$), batch size ($4$ with gradient accumulation), and 4-bit quantization settings. However, critical details are missing: the exact train/test split ratios, random seeds, the specific subset size used for training versus evaluation, and whether any validation set was used for early stopping. The manual nature of the annotation (two expert annotators) is well-documented, but the criteria for 'conservative' labeling are not operationalized, making it difficult to replicate the annotation protocol. Additionally, no code or scripts are referenced for the data gathering pipeline, leaving ambiguity about how responses were extracted and processed.

“Low-rank adaptation was applied with a rank ($r=8$) and a scaling factor ($\alpha=16$), and the LoRA layers were injected into the attention projection modules (q_proj, k_proj, v_proj, and o_proj) with a dropout probability of 0.05.”

SecureBreak · Section IV-A
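
The reported adapter settings can be written down as a small configuration object. This is a sketch under stated assumptions: `LoraSettings` and `lora_params_per_layer` are hypothetical helpers, not the authors' code, and the 4096-wide square projections are an assumed shape (models with grouped-query attention use smaller `k_proj` and `v_proj` outputs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Values as quoted from Section IV-A of the paper."""
    rank: int = 8
    alpha: int = 16
    dropout: float = 0.05
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank


def lora_params_per_layer(settings: LoraSettings, d_in: int, d_out: int) -> int:
    """Trainable adapter parameters per transformer layer.

    Each adapted projection adds A (rank x d_in) and B (d_out x rank).
    Assumes every target module has the same d_in and d_out.
    """
    per_module = settings.rank * (d_in + d_out)
    return per_module * len(settings.target_modules)


settings = LoraSettings()
# Assumed hidden size of 4096 for illustration only.
params = lora_params_per_layer(settings, d_in=4096, d_out=4096)
```

In Hugging Face PEFT these values would typically map onto the `r`, `lora_alpha`, `lora_dropout`, and `target_modules` fields of `LoraConfig`, although the paper's exact training script is not shown in the review.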

“In this curation, the responses were manually annotated by two knowledgeable annotators.”

SecureBreak · Section III-A
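
The missing split details matter because results depend on them. As an illustration of what a fully specified split looks like, the sketch below performs a seeded, label-stratified train/test split using only the standard library. The 80/20 ratio, the seed, and the label counts are assumptions for the example, not values from the paper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), stratified by label and seeded."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # local RNG so the split is reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, test = [], []
    for label in sorted(by_label):  # fixed order keeps the result deterministic
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Hypothetical label distribution for illustration.
labels = ["safe"] * 60 + ["unsafe"] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=13)
```

Reporting the ratio and seed alongside a routine like this would let others regenerate the same partitions.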

Abstract

Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an "ultimate" defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.
