AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

cs.SD cs.LG Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Hiroaki Santo, Fumio Okura · Mar 23, 2026
Plain-language introduction

AnimalCLAP addresses zero-shot species recognition from vocalizations, a critical challenge for biodiversity monitoring because training data is scarce for rare species. The core idea is to inject hierarchical taxonomic knowledge (class, order, family, genus, species) into audio-text contrastive learning via multiple prompt templates, paired with a large dataset of 4,225 hours of recordings covering 6,823 species and annotated with 22 ecological traits. This matters because it enables automated monitoring in visually occluded habitats such as dense forests, while also inferring biological traits directly from sound.
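To make the prompt idea concrete, here is a minimal sketch of turning a taxonomic lineage into a text prompt, ordered from the broadest rank to the most specific. The template wording, the function name `taxonomy_prompt`, and the example record are assumptions made for illustration; the paper's actual templates are in its Table 2, which is not reproduced in this review.

```python
from typing import Optional

# Rank order used throughout the paper: broadest to most specific.
RANKS = ("class", "order", "family", "genus", "species")


def taxonomy_prompt(taxon: dict, common_name: Optional[str] = None) -> str:
    """Build one text prompt from a lineage (hypothetical template wording)."""
    lineage = " ".join(taxon[rank] for rank in RANKS)
    prompt = f"a sound of {lineage}"
    if common_name:
        # Loosely mirrors the "Tax+Com" idea: taxonomy plus common name.
        prompt += f", commonly known as {common_name}"
    return prompt


# Example lineage for the European robin.
robin = {
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Muscicapidae",
    "genus": "Erithacus",
    "species": "rubecula",
}

print(taxonomy_prompt(robin))
print(taxonomy_prompt(robin, "European robin"))
```

Because closely related species share the leading tokens of such a prompt, their text embeddings plausibly start from a shared prefix, which is one intuition for why an unseen species might land near its seen relatives.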

Critical review
Verdict
Bottom line

The paper presents a solid extension of CLAP to bioacoustics, demonstrating that structured taxonomic prompts substantially improve zero-shot generalization to unseen species. The experiments are internally consistent: the ablation validating ordered versus randomized taxonomy (Table 4) and the error analysis showing hierarchical coherence (Figure 3) together support the central claim that biological hierarchy aids generalization. The trait prediction results (Table 5) further validate that meaningful ecological information can be extracted from vocalizations. However, the 'unseen' evaluation setup has limitations (test species share genera and families with training data), and the CLAP baseline is surprisingly weak relative to what domain-specific methods reportedly achieve. Together these raise the question of whether the gains come primarily from the taxonomy mechanism or simply from better prompt engineering and dataset construction.

“Randomizing the taxonomic order significantly reduces top-1 accuracy across all test prompts, highlighting the importance of hierarchical structure.”
this paper · Section 4.2
“AnimalCLAP achieves 27.6% [Top-1 accuracy with Tax+Com prompt] vs CLAP 1.61%”
this paper · Table 3
What holds up

The hierarchical taxonomy hypothesis is rigorously tested through ablation and visualization. The randomized taxonomy experiment (Table 4) provides causal evidence that the ordering (class → order → family → genus → species) matters, not just the raw text tokens. The t-SNE visualization (Figure 2) shows clearer taxonomic clustering than CLAP. The trait prediction experiments (Table 5) demonstrate strong biological plausibility, with behavioral traits such as activity pattern and locomotion showing the largest gains, which is consistent with known bioacoustic ecology.

“Randomizing the taxonomic order significantly reduces top-1 accuracy across all test prompts”
this paper · Section 4.2
“the AnimalCLAP model exhibits clearer embedding clusters aligned with the taxonomic hierarchy (class, order, family) compared to CLAP”
this paper · Figure 2
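The logic of the ordering ablation can be illustrated with a short sketch: the shuffled variant keeps exactly the same tokens and changes only their order, so any accuracy difference is attributable to the ordering. This is an assumption about how such a control could be built, not the paper's code; `lineage_text` and the example record are hypothetical.

```python
import random

RANKS = ("class", "order", "family", "genus", "species")


def lineage_text(taxon, shuffle=False, rng=None):
    """Return the rank names as text, optionally in a random order."""
    names = [taxon[rank] for rank in RANKS]
    if shuffle:
        rng = rng or random.Random()
        rng.shuffle(names)  # same tokens, order destroyed
    return " ".join(names)


robin = {
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Muscicapidae",
    "genus": "Erithacus",
    "species": "rubecula",
}

ordered = lineage_text(robin)
shuffled = lineage_text(robin, shuffle=True, rng=random.Random(0))
print(ordered)
print(shuffled)
```

A fixed `random.Random` seed keeps the shuffled prompts reproducible across runs, which matters if the ablation is to be repeated.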
Main concerns

First, the 'unseen species' evaluation is limited: the test set comprises only 1.2k recordings across 300 species (about 4 recordings per species on average), and critically, the paper states that 'unseen species [have] genera and families... represented in the training subset.' The setting is therefore zero-shot only at the species level: every test species has relatives from the same genus and family in training, so the evaluation does not test generalization to wholly unseen lineages, which weakens claims about handling rare, data-scarce species. Second, the baseline comparison is weak: CLAP achieves only 1.61% top-1 accuracy, yet domain-specific models such as BioLingual (cited but not compared) reportedly achieve strong results on similar tasks. Third, trait annotation relied on GPT-5 extraction with an unspecified scope of manual verification, which introduces reproducibility risks if the trait labels cannot be exactly reconstructed. Fourth, the best-performing Tax+Com prompt concatenates the full taxonomy with common names, which may not be practical for real-world deployment where only partial taxonomic information is available.

“we selected unseen species whose genera and families were represented in the training subset”
this paper · Section 2.2
“1.2k [recordings] in the test set”
this paper · Section 2.2
“Trait information was extracted from the iNaturalist website using GPT-5... Extracted trait labels were subsequently verified manually”
this paper · Section 2.1
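The first concern is easy to quantify for any candidate split. The sketch below counts how many held-out species still share a genus with the training set, and recomputes the recordings-per-species average from the figures quoted above. The function names and the toy species lists are illustrative assumptions, not the paper's data.

```python
def genus_of(binomial: str) -> str:
    """Genus is the first word of a binomial name, e.g. 'Turdus merula'."""
    return binomial.split()[0]


def split_overlap(train_species, test_species):
    """Count unseen test species and how many share a genus with training."""
    train_set = set(train_species)
    train_genera = {genus_of(s) for s in train_set}
    unseen = [s for s in test_species if s not in train_set]
    genus_seen = [s for s in unseen if genus_of(s) in train_genera]
    return {
        "unseen_species": len(unseen),
        "genus_seen": len(genus_seen),
        "genus_novel": len(unseen) - len(genus_seen),
    }


# Toy example (not the paper's split).
train = ["Turdus merula", "Turdus philomelos", "Parus major"]
test = ["Turdus pilaris", "Strix aluco"]
report = split_overlap(train, test)
print(report)

# Figures quoted in the review: 1.2k recordings over 300 species.
print(1200 / 300)  # mean recordings per test species
```

Under the paper's stated selection rule, `genus_novel` would be zero by construction; reporting results on a split where it is non-zero would address the concern directly.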
Evidence and comparison

The evidence supports the specific claim that taxonomy-aware prompts outperform single-template prompts, with AnimalCLAP achieving 27.6% top-1 accuracy versus 25.6% for the best single-prompt model (Tax-only) in Table 3. However, the comparison to prior work is incomplete. The paper cites BioLingual as having 'demonstrated the effectiveness of linking animal vocalizations to textual representations using CLAP,' yet BioLingual is not included in the quantitative comparison of Table 3. Similarly, NatureLM-Audio is mentioned as expanding task range but not evaluated. The dramatic CLAP baseline failure (1.61% accuracy) versus AnimalCLAP success suggests the baseline may have been poorly adapted rather than representing a strong domain-agnostic competitor. The BioCLIP citation is accurate regarding hierarchical embeddings in vision, but the audio domain extension remains under-compared against other bioacoustic foundation models.

“Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability”
BioCLIP paper · Abstract
“BioLingual demonstrated the effectiveness of linking animal vocalizations to textual representations using CLAP”
this paper · Section 1
“CLAP: 1.16 / 0.36 / 0.63 / 1.70 / 1.61 [Top-1 accuracies]”
this paper · Table 3
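For context on what these top-1 numbers measure, the following sketch shows the usual CLAP-style zero-shot scoring: each audio embedding is compared with the text embedding of every candidate prompt by cosine similarity, and the best match is the prediction. The toy two-dimensional embeddings are made up, and the 300-species candidate pool used for the chance level is an assumption, since the material quoted here does not state the pool size.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(audio_embs, labels, text_embs):
    """text_embs maps each candidate label to its prompt embedding."""
    correct = 0
    for emb, label in zip(audio_embs, labels):
        pred = max(text_embs, key=lambda name: cosine(emb, text_embs[name]))
        correct += pred == label
    return correct / len(labels)


# Toy embeddings (illustrative only).
text_embs = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
audio_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = ["A", "B", "B"]
acc = top1_accuracy(audio_embs, labels, text_embs)
print(acc)

# Chance level if the candidate pool were the 300 test species (assumption).
chance = 1 / 300
print(f"{chance:.2%}")
```

If the pool really is 300 species, uniform guessing would score roughly 0.33%, so a 1.61% baseline sits only a few multiples above chance. That is compatible with the reading that the baseline was not adapted to the domain.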
Reproducibility

The paper promises public release of 'dataset, code, and models,' which would enable reproduction. Implementation details are reasonably specific: HTS-AT audio encoder, RoBERTa text encoder, AdamW optimizer with $10^{-4}$ learning rate, 20 epochs, 48 kHz resampling, 10-second crops. However, critical details for reproduction are missing: the exact GPT-5 prompts used for trait extraction, the manual verification protocol and inter-annotator agreement, the random seed, and computational resources required. The trait annotation pipeline using GPT-5 is particularly concerning for reproducibility since GPT outputs are stochastic and version-dependent; without exact prompts and verification criteria, the 22 trait labels cannot be exactly reconstructed. The five prompt templates (Table 2) are provided, which helps, but the code for constructing the balanced training set (30 clips per species) is not described in sufficient detail.

“Our dataset, code, and models will be publicly available”
this paper · Abstract
“trained for 20 epochs using AdamW with a learning rate of $10^{-4}$”
this paper · Section 3.1
“Trait information was extracted from the iNaturalist website using GPT-5”
this paper · Section 2.1
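The under-described data pipeline could be pinned down with something as small as the sketch below, which fixes a seed, caps each species at 30 clips, and picks a 10-second window at 48 kHz. Everything beyond those three numbers is an assumption: whether the cap is "at most 30" or "exactly 30", how short clips are handled, and all function and constant names are hypothetical.

```python
import random
from collections import defaultdict

SAMPLE_RATE = 48_000                           # Hz, as reported
CROP_SECONDS = 10                              # as reported
CROP_SAMPLES = SAMPLE_RATE * CROP_SECONDS      # 480,000 samples per crop
CLIPS_PER_SPECIES = 30                         # as reported


def balance(clips, per_species=CLIPS_PER_SPECIES, seed=0):
    """Keep at most `per_species` clips per species (assumed policy).

    `clips` is a list of (clip_id, species) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_species = defaultdict(list)
    for clip_id, species in clips:
        by_species[species].append(clip_id)
    kept = []
    for species in sorted(by_species):
        ids = by_species[species]
        rng.shuffle(ids)
        kept.extend((clip_id, species) for clip_id in ids[:per_species])
    return kept


def crop_bounds(num_samples, rng):
    """Pick a random 10-second window; short clips are returned whole."""
    if num_samples <= CROP_SAMPLES:
        return 0, num_samples  # padding is left to the caller
    start = rng.randrange(num_samples - CROP_SAMPLES + 1)
    return start, start + CROP_SAMPLES


clips = [(i, "a") for i in range(50)] + [(100 + i, "b") for i in range(5)]
kept = balance(clips)
print(len(kept))
print(crop_bounds(1_000_000, random.Random(0)))
```

Publishing the seed and the selection policy alongside the code would let others rebuild the same training subset, which is the gap the paragraph above identifies.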
Abstract

Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
