SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

cs.CV cs.AI Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui · Mar 23, 2026

Plain-language introduction

SteelDefectX is a vision-language dataset for steel surface defect detection. It aggregates 7,778 images from four existing sources and adds coarse-to-fine textual annotations, ranging from class-level defect descriptions to sample-level attributes (shape, size, depth, position, contrast) generated with GPT-4o. The paper establishes a four-task benchmark showing that rich textual supervision improves cross-material transfer, but it also reports that fine-grained annotations hurt few-shot performance.

Critical review

Verdict

Bottom line

The paper makes a worthwhile contribution to industrial vision-language learning by consolidating existing steel defect datasets and adding structured textual annotations through an automated LLM pipeline. The coarse-to-fine strategy yields measurable benefits for zero-shot transfer to aluminum and seamless steel tube defects: manually verified descriptions (T3) raise aluminum defect recognition from 8.60% for the zero-shot baseline to 29.03%. However, the experimental narrative is undermined by a counter-intuitive finding, namely that fine-grained annotations degrade few-shot recognition, which warrants deeper theoretical analysis rather than brief acknowledgment.

“T3 ... 29.03 ... Zero-shot ... 8.60”

Paper · Table 5

“Long-CLIP-Adapter (ViT-L/14, T0) achieves the best results, whereas its performance under the "T3" setting is the lowest”

Paper · Section 4.3

What holds up

The automated annotation pipeline offers a reproducible framework for industrial domain description generation, using Sentence-BERT embeddings to filter GPT-4o outputs with a diversity threshold (cosine similarity $<0.9$) and enforcing coverage across five semantic dimensions via binary scoring $s(d_i)$. The zero-shot transfer experiments provide tangible evidence that textual annotations enable cross-material generalization, with consistent gains from T0 to T3 across both aluminum and seamless steel tube benchmarks. The heatmap visualizations effectively demonstrate that fine-grained prompts improve spatial localization compared to classname-only templates.

“greedy selection based on Sentence-BERT embeddings ... maximum cosine similarity to selected descriptions is below 0.9”

Paper · Section 3.1

“Compared to the classname-only description (T0), the fine-grained textual description (T3) enables the model to capture fine-grained visual cues more effectively”

Paper · Section 4.5
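
The greedy diversity filter quoted above from Section 3.1 can be illustrated with a minimal sketch. It is not the authors' code: the function name `greedy_diverse_select`, the toy 2-D vectors standing in for Sentence-BERT embeddings, and the choice to visit candidates in input order are all assumptions made for illustration. Only the 0.9 cosine-similarity threshold comes from the paper.

```python
import numpy as np


def greedy_diverse_select(embeddings, threshold=0.9):
    """Greedy filter: keep a candidate only if its maximum cosine
    similarity to every already-kept candidate is below `threshold`.

    `embeddings` is an (n, d) array; here it stands in for Sentence-BERT
    sentence embeddings (assumption: any fixed-size embedding works).
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    kept = []
    for i in range(len(unit)):
        if not kept:
            kept.append(i)  # the first candidate is always kept
            continue
        max_sim = float(np.max(unit[kept] @ unit[i]))
        if max_sim < threshold:
            kept.append(i)
    return kept


# Toy vectors (hypothetical): row 1 is nearly parallel to row 0.
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]])
selected = greedy_diverse_select(toy)
print(selected)  # [0, 2, 3]
```

Because the filter is greedy, the result depends on candidate order, which is one more detail a full reproduction of the pipeline would need to pin down.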

Main concerns

The experimental results sit uneasily with the paper's central thesis. The authors claim that coarse-to-fine annotations improve generalization, yet they acknowledge that "fine-grained descriptions introduce intra-class variance and thus hurt performance in low-shot regimes," with T3 performing worst in few-shot settings. This suggests the annotations may be overly specific or noisy for limited-data scenarios. Furthermore, the vision-only baselines reveal a possible scale issue: ViT-B/16 reaches only 44.84% accuracy against 96.34% for ShuffleNetV2, which suggests that SteelDefectX may be too small to train transformers effectively. Absolute zero-shot transfer performance remains modest (<43% accuracy on closely related materials), raising questions about practical industrial deployment. Finally, the paper aggregates four existing datasets (NEU, GC10, X-SDD, S3D) without addressing potential overlaps or leakage paths, which weakens its validity claims.

“Fine-grained descriptions introduce intra-class variance and thus hurt performance in low-shot regimes, while benefiting interpretability and cross-domain analysis”

Paper · Section 4.5

“ViT-B/16 ... 44.84 ... ShuffleNetV2 ... 96.34”

Paper · Table 2
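
One inexpensive way to probe the overlap concern raised above is an exact-duplicate check across the source datasets. The sketch below is an illustration only: the function name `find_cross_source_duplicates`, the record layout, and the demo file names and bytes are hypothetical, and nothing here is taken from the paper. It flags only byte-identical files; resized or re-encoded copies would need a perceptual hash or an embedding-based comparison instead.

```python
import hashlib
from collections import defaultdict


def find_cross_source_duplicates(records):
    """Group byte-identical images that appear in more than one source.

    `records` is an iterable of (source, name, raw_bytes) tuples.
    Returns a list of groups, each a list of (source, name) pairs.
    """
    by_digest = defaultdict(list)
    for source, name, raw in records:
        digest = hashlib.sha256(raw).hexdigest()
        by_digest[digest].append((source, name))
    # Keep only the groups that span at least two distinct sources.
    return [
        group
        for group in by_digest.values()
        if len({src for src, _ in group}) > 1
    ]


# Hypothetical records; real use would read the image files from disk.
demo = [
    ("NEU", "a.bmp", b"\x00\x01"),
    ("GC10", "b.jpg", b"\x00\x01"),
    ("X-SDD", "c.jpg", b"\x02"),
]
dups = find_cross_source_duplicates(demo)
print(dups)  # [[('NEU', 'a.bmp'), ('GC10', 'b.jpg')]]
```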

Evidence and comparison

The evidence supports the claims regarding interpretability and zero-shot transfer, but comparisons to related work are weakened by the lack of ablation studies isolating specific annotation dimensions (e.g., shape vs. contrast). The demonstration that Long-CLIP with T3 achieves 93.63% accuracy, approaching vision-only CNN performance, indicates that vision-language alignment works for industrial textures. However, the paper aggregates existing datasets rather than collecting new imagery, so any performance improvements stem from the annotations rather than from novel visual diversity. The comparison against baseline datasets (NEU, GC10) in zero-shot recognition is somewhat misleading, since SteelDefectX includes both more categories and those very source datasets, which confounds difficulty assessments.

“Long-CLIP ... ViT-L/14 ... 93.63”

Paper · Table 3

“constructed by integrating and reorganizing four publicly available steel surface defect datasets: NEU, GC10, X-SDD, and S3D”

Paper · Section 3.1

Reproducibility

The authors commit to releasing data on GitHub and document most training hyperparameters (SGD with momentum 0.9; CLIP-Adapter training for 20 epochs with the Adam optimizer at a learning rate of $10^{-4}$). However, reproducing the annotation pipeline faces significant barriers: the GPT-4o generation relies on "controlled randomness" (temperature=0.9, top_p=0.9, max_tokens=80), and the prompt templates are undocumented beyond a brief mention of "open-ended prompts" and structured prompts $P_a$/$P_b$. The manual refinement step (approximately 275 hours by two annotators) introduces subjective judgment that cannot be exactly replicated. Additionally, the paper does not specify the random splits used for the few-shot experiments or characterize potential overlaps between the aggregated source datasets, which could lead to train-test leakage if not handled carefully.

“temperature=0.9, top_p=0.9, max_tokens=80”

Paper · Section 3.1

“Two annotators conducted approximately 275 hours of manual cross-validation”

Paper · Section 3.1

“train for 20 epochs using the Adam optimizer with a learning rate of 1e-4”

Paper · Section 4.2
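
The missing split specification could be addressed by publishing seeded, deterministic few-shot splits. The following is a minimal sketch under stated assumptions: the function name `few_shot_split`, the label strings, and the seed value are hypothetical and do not describe the authors' protocol.

```python
import random
from collections import defaultdict


def few_shot_split(labels, k, seed):
    """Pick k support samples per class with a fixed seed.

    Returns (support_indices, query_indices). Using the same `labels`,
    `k`, and `seed` always yields the same split.
    """
    rng = random.Random(seed)  # local RNG, does not touch global state
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    support = []
    for label in sorted(by_class):  # sorted for a stable class order
        idxs = by_class[label]
        if len(idxs) < k:
            raise ValueError(f"class {label!r} has fewer than {k} samples")
        support.extend(rng.sample(idxs, k))

    support_set = set(support)
    query = [i for i in range(len(labels)) if i not in support_set]
    return sorted(support), query


# Hypothetical labels: two classes with five samples each.
labels = ["scratch"] * 5 + ["pit"] * 5
support, query = few_shot_split(labels, k=2, seed=0)
```

Releasing the resulting index lists alongside the data, one per shot count and seed, would let others rerun the few-shot benchmark on identical samples.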

Abstract

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.
