Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

cs.CV Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan · Mar 23, 2026
AI Review
Plain-language introduction

Zero-shot 3D anomaly detection enables industrial inspection without target-category training data, but existing methods discard geometric details by projecting point clouds to 2D images. This paper proposes BTP (Back To Point), the first framework to apply pre-trained Point-Language Models directly on 3D point clouds. By aligning multi-granularity patch features with text embeddings and incorporating geometric descriptors, BTP achieves fine-grained anomaly localization while avoiding view-dependent projection artifacts.

Critical review
Verdict
Bottom line

BTP achieves state-of-the-art point-level anomaly localization (84.5% P-AUROC on Real3D-AD, surpassing PointAD's 73.5%), yet its object-level detection (61.4% O-AUROC) trails VLM-based alternatives such as PointAD (74.8%) by 13.4 points. This trade-off suggests that multi-granularity patch alignment excels at fine-grained spatial reasoning, but that the resulting local evidence does not aggregate into an equally discriminative global representation. The claim of being the "first to employ pre-trained PLMs" for zero-shot 3D AD is consistent with the paper's own account of PLANE, which still relies on target-category data for training or adaptation.

“In contrast, BTP achieves the best point-level performance with a mean AUROC of 84.5%, surpassing the second-best CPMF (75.9%) by +8.6 points.”
paper · Section 4.3
“Although BTP achieves a lower object-level AUROC (61.4%) than PointAD (74.8%), it remains competitive in several categories”
paper · Section 4.3
“To the best of our knowledge, we are the first to employ pre-trained PLMs for zero-shot 3D anomaly detection.”
paper · Section 1
“PLM-based extensions (e.g., PLANE [38]) adapt pretrained Point-Language Models to 3D anomaly detection via learnable prompts and self-supervised training, but still rely on target-category data for category-specific training/adaptation.”
paper · Section 2.1
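
The gap between the point-level and object-level numbers is easier to reason about with the two metrics written out. The sketch below is illustrative only and is not taken from the paper: it computes AUROC with the standard rank-sum identity, pools all points for the point-level score, and reduces each sample to a single score by max-pooling. The max-pooling rule, the helper names `auroc` and `point_and_object_auroc`, and the toy data are assumptions.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity.

    A tie between a positive and a negative counts as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def point_and_object_auroc(samples):
    """samples: list of (point_scores, point_labels), one pair per scanned object.

    Point-level: pool every point from every sample.
    Object-level: one score per sample; max-pooling is an assumption here.
    """
    all_scores, all_labels, obj_scores, obj_labels = [], [], [], []
    for point_scores, point_labels in samples:
        all_scores.extend(point_scores)
        all_labels.extend(point_labels)
        obj_scores.append(max(point_scores))
        obj_labels.append(1 if any(point_labels) else 0)
    return auroc(all_scores, all_labels), auroc(obj_scores, obj_labels)


# Toy data (hypothetical): one defective sample and one normal sample
# whose single spurious peak outranks the real defect.
defective = ([0.1, 0.2, 0.6, 0.7], [0, 0, 1, 1])
normal = ([0.1, 0.1, 0.2, 0.8], [0, 0, 0, 0])
p_auroc, o_auroc = point_and_object_auroc([defective, normal])
```

In this toy case a single high-scoring normal point barely moves the pooled point-level ranking but flips the object-level ranking entirely. That shows how the two metrics can diverge in principle; it is not a diagnosis of why BTP's O-AUROC is lower.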
What holds up

The Multi-Granularity Feature Embedding Module (MGFEM) effectively exploits intermediate PointBERT features for localization, achieving 83.3% P-AUROC on Real3D-AD even without geometric augmentation. The direct 3D processing approach eliminates view-selection bias inherent in multi-view projection methods, and the ablation study confirms that removing local supervision $\mathcal{L}_{local}$ degrades point-level performance from 84.5% to 68.8% AUROC, validating the importance of patch-text alignment for fine-grained detection.

“MGFEM ... (83.3, 80.1)”
paper · Table 4
“removing the local supervision $\mathcal{L}_{local}$ degrades fine-grained correspondence and harms point-level localization”
paper · Section 4.4
“w/o $\mathcal{L}_{local}$ ... (68.8, 66.2)”
paper · Table 4
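
For readers unfamiliar with patch-text alignment, the following is a minimal sketch of the generic CLIP-style scoring idea, not the paper's MGFEM. Each patch feature is compared with a "normal" and an "anomalous" text embedding by cosine similarity, and a softmax over the two similarities gives a per-patch anomaly probability. The function names, the two-prompt setup, and the temperature of 0.07 are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(patch_feats, text_normal, text_anomalous, temperature=0.07):
    """Return P(anomalous) for each patch via a two-way softmax.

    temperature is a hypothetical value, not one reported in the paper.
    """
    scores = []
    for feat in patch_feats:
        s_normal = cosine(feat, text_normal) / temperature
        s_anom = cosine(feat, text_anomalous) / temperature
        m = max(s_normal, s_anom)  # subtract the max for numerical stability
        e_normal = math.exp(s_normal - m)
        e_anom = math.exp(s_anom - m)
        scores.append(e_anom / (e_normal + e_anom))
    return scores


# Toy 2-D embeddings (hypothetical): the first patch aligns with "normal",
# the second with "anomalous", the third sits exactly between the two prompts.
scores = patch_anomaly_scores(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    text_normal=[1.0, 0.0],
    text_anomalous=[0.0, 1.0],
)
```

A local supervision term such as $\mathcal{L}_{local}$ would train the patch features so that these per-patch probabilities match point-level labels on auxiliary data; the exact loss form is defined in the paper and is not reproduced here.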
Main concerns

The geometric supervision loss $\mathcal{L}_{geo}$ contributes minimally to final performance: Table 4 shows only a 0.1-point drop in object-level AUROC (61.4% to 61.3%) when it is removed, which raises questions about the value of the handcrafted FPFH-based Geometric Feature Creation Module (GFCM). More critically, BTP's object-level detection trails several unsupervised baselines in Table 3, including Reg3DAD (69.0%) and R3D-AD (73.4%), which undermines the claim of "superior performance" for zero-shot detection. The reliance on "auxiliary point cloud data" from source categories during joint representation learning also warrants scrutiny: it is unclear whether this constitutes true zero-shot learning or cross-category transfer.

“we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics”
paper · Section 3.6
“w/o $\mathcal{L}_{geo}$ ... (61.3, 64.8) ... Full ... (61.4, 65.1)”
paper · Table 4
“Reg3DAD ... 69.0 ... R3DAD ... 73.4 ... BTP(Ours) ... 61.4”
paper · Table 3
Evidence and comparison

The evaluation reveals a stark trade-off between localization and detection: BTP dominates point-level metrics but ranks last among zero-shot methods for object-level detection (Table 1). The comparison to PointAD is fair, but it highlights that VLM-based projection retains an advantage for global anomaly classification. The ablation study supports the conclusion that MGFEM drives the performance gains, while the Geometric Feature Creation Module alone performs poorly (52.5% O-AUROC), indicating that handcrafted geometric descriptors offer limited value without semantic features.

“PointAD ... 74.8 ... BTP(Ours) ... 61.4”
paper · Table 1
“GFCM ... (52.5, 57.4) ... (55.0, 55.0)”
paper · Table 4
“Using GFCM alone yields marginal improvement, indicating that geometric priors without semantic hierarchy are insufficient.”
paper · Section 4.4
Reproducibility

The paper provides detailed implementation specifics, including ULIP2 as the 3D and text encoder, 2,048 input points obtained via farthest point sampling, and training on a single RTX 4090 with the AdamW optimizer. However, the code is not yet available ("Code will be available at https://github.com/wistful-8029/BTP-3DAD"), which prevents independent verification. The hyperparameters $\lambda_1=0.5$ and $\lambda_2=0.1$ are reported, but their sensitivity is not analyzed. Standard deviations are provided in Table 3 but omitted from the primary Table 1, which weakens the assessment of result stability across the 10 independent runs.

“Code will be available at https://github.com/wistful-8029/BTP-3DAD.”
paper · Abstract
“$\lambda_1$ and $\lambda_2$ are balancing coefficients, set to 0.5 and 0.1 in our experiments”
paper · Section 3.6
“Input point clouds are uniformly downsampled to 2,048 points via farthest point sampling (FPS). We adopt the public ULIP2 as the 3D and text encoder.”
paper · Section 4.2
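
Since the official code is not yet available, anyone attempting to reproduce the preprocessing has to re-implement the downsampling step. The following is a minimal sketch of the standard greedy farthest point sampling algorithm in plain Python. It is not the authors' implementation; the function name and the choice of index 0 as the seed point are assumptions, because the paper excerpt does not say how the first point is chosen.

```python
def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the selected set.

    points: list of (x, y, z) tuples. Returns the indices of k selected points.
    start: index of the seed point (an assumption; often chosen at random).
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the squared distance from point i to its nearest selected point.
    min_dist = [float("inf")] * n
    for _ in range(k - 1):
        last = points[selected[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # The next sample is the point with the largest distance to the selected set.
        selected.append(max(range(n), key=min_dist.__getitem__))
    return selected


# Toy cloud (hypothetical): four collinear points, keep three.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
kept = farthest_point_sampling(cloud, k=3)
```

This version costs O(n·k) distance evaluations, which is acceptable for a target of 2,048 points but would normally be vectorized or run on the GPU in a real pipeline. Because the result depends on the seed point, a reproduction should fix or report it.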
Abstract

Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.
