Text-Image Conditioned 3D Generation
The paper addresses a fundamental limitation in 3D generation: image-conditioned models suffer from viewpoint bias and hallucinate unobserved regions, while text-conditioned models lack precise visual fidelity. The authors propose Text-Image Conditioned 3D Generation, a task requiring joint reasoning over visual exemplars and textual descriptions, and introduce TIGON, a minimalist dual-branch baseline that fuses separate image- and text-conditioned DiT backbones via zero-initialized cross-modal bridges and simple prediction averaging. This matters because it gives users more flexible control by combining pixel-aligned appearance cues with high-level semantic guidance.
The paper presents a well-motivated empirical study that convincingly demonstrates the complementarity of text and image conditioning for 3D generation. The diagnostic analysis and SimFusion baseline provide strong justification for the proposed task, while TIGON achieves state-of-the-art quantitative results on standard benchmarks. However, the architectural contribution is intentionally minimal—combining existing backbones with lightweight linear bridges—which limits novelty despite the solid empirical validation.
The diagnostic study in Section 4 effectively establishes the limitations of single-modality conditioning: shifting the input from a frontal view (View-0) to a low-information view (View-1) degrades TRELLIS from 56.08 to 143.58 $\text{FD}_{\text{DINOv2}}$ (lower is better), while text-only models reach only 148.21–154.88. The SimFusion experiment provides compelling evidence that even naive late fusion yields significant gains (82.40 vs. 125.93), confirming strong cross-modal complementarity. Table 3 validates that the zero-initialized cross-modal bridges are essential: with them, $\text{FD}_{\text{DINOv2}}$ improves from 66.78 to 61.59, whereas the gains without them are marginal.
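For readers unfamiliar with the metric, $\text{FD}_{\text{DINOv2}}$ is presumably the standard Fréchet distance computed on DINOv2 features in place of Inception features. This is an assumption based on the metric's name, not something stated in the review. Under that assumption, with feature means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ of the reference and generated sets:

$$\text{FD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

The distance is zero when the two feature distributions have identical mean and covariance, so a rise from 56.08 to 143.58 indicates generated assets drifting further from the reference distribution.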
The architectural contribution is minimal by design: the authors adopt an explicitly "minimalist" approach that combines two existing UniLat3D backbones with lightweight linear projections ($\mathcal{P}^{(i)}_{\texttt{txt}\rightarrow\texttt{img}}$) and simple averaging ($\mathbf{v}=\frac{1}{2}(\mathbf{v}_{\texttt{txt}}+\mathbf{v}_{\texttt{img}})$). Although the ablation shows that more sophisticated late-fusion variants (adaptive weighting, attention-based) offer only marginal improvements (60.90 vs. 61.59 $\text{FD}_{\text{DINOv2}}$), this also suggests that the model relies heavily on the pre-trained unimodal backbones rather than learning deep cross-modal interactions. Furthermore, evaluation is limited to synthetic datasets (Toys4K, UniLat1K) with canonical objects, leaving open questions about robustness to real-world imagery.
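To make the fusion described above concrete, here is a minimal NumPy sketch of a zero-initialized linear bridge and the prediction averaging $\mathbf{v}=\frac{1}{2}(\mathbf{v}_{\texttt{txt}}+\mathbf{v}_{\texttt{img}})$. It is not the authors' implementation. The class `ZeroInitBridge`, the function names, the residual way the projected text features are added to the image branch, and the matching token counts are all assumptions made for illustration.

```python
import numpy as np


class ZeroInitBridge:
    """Hypothetical linear projection whose weight and bias start at zero."""

    def __init__(self, dim_in, dim_out):
        self.weight = np.zeros((dim_in, dim_out))
        self.bias = np.zeros(dim_out)

    def __call__(self, x):
        return x @ self.weight + self.bias


def inject_text_into_image(h_img, h_txt, bridge):
    # Assumption: the bridge output is added residually to the image-branch
    # hidden states. With zero weights this leaves h_img unchanged.
    return h_img + bridge(h_txt)


def average_predictions(v_txt, v_img):
    # Simple late fusion: equal-weight mean of the two branch predictions.
    return 0.5 * (v_txt + v_img)


rng = np.random.default_rng(0)
tokens, dim = 4, 8
h_img = rng.normal(size=(tokens, dim))
h_txt = rng.normal(size=(tokens, dim))

bridge = ZeroInitBridge(dim, dim)
fused = inject_text_into_image(h_img, h_txt, bridge)

# At initialization the bridge is a no-op on the image branch.
bridge_is_noop_at_init = np.allclose(fused, h_img)

# Averaging two constant predictions (1 and 3) gives their midpoint.
v = average_predictions(np.ones((tokens, dim)), 3.0 * np.ones((tokens, dim)))
```

The point of zero initialization in a sketch like this is that joint fine-tuning starts from exactly the behavior of the pre-trained unimodal branches, and cross-modal influence only appears as the bridge weights move away from zero.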
The quantitative evidence supports the claim that joint conditioning improves over single-modality methods, with TIGON achieving 61.59 $\text{FD}_{\text{DINOv2}}$ compared to 84.62 (image-only) and 152.34 (text-only). However, comparisons to Hunyuan3D-2.1 and Step1X-3D are marked with † indicating non-public training data, potentially compromising fairness. The paper lacks comparison against multimodal fusion techniques from vision-language literature (e.g., cross-attention, gated fusion) beyond the simple ablations, and does not evaluate on real-world or ambiguous natural images where text-image conflicts might be more challenging than the synthetic test cases shown.
Training details are reasonably comprehensive: the text branch trains for 1M iterations (batch size 256, lr $1\times 10^{-4}$), followed by 50K joint fine-tuning iterations (lr $1\times 10^{-5}$) on 64 NVIDIA A800 GPUs using DeepSpeed ZeRO-2 and FlashAttention. The paper uses the public TRELLIS-500K dataset and builds upon released UniLat3D checkpoints. However, no code release is mentioned, and the computational requirements are substantial. Condition dropout (probability 0.5) and zero-initialization are specified, but implementation details for the cross-modal bridges (e.g., exact dimensionality mappings) and the full hyperparameter configuration for the late-fusion ablations are relegated to the supplementary material.
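As an illustration of the condition dropout mentioned above, the following sketch drops a conditioning tensor per sample with probability 0.5. The review does not say whether dropout is applied per modality or jointly, nor what replaces a dropped condition. Here it is assumed to act on one modality at a time and to substitute zeros, and `drop_condition` is a hypothetical helper.

```python
import numpy as np


def drop_condition(cond, p, rng, null_value=0.0):
    """Replace each sample's condition with null_value with probability p.

    cond has shape (batch, tokens, dim). Returns the processed tensor and
    a boolean array marking which samples kept their condition.
    """
    keep = rng.random(cond.shape[0]) >= p
    mask = keep[:, None, None]  # broadcast over tokens and channels
    return np.where(mask, cond, null_value), keep


rng = np.random.default_rng(0)
text_cond = rng.normal(size=(1000, 2, 4))

dropped_cond, keep = drop_condition(text_cond, p=0.5, rng=rng)
keep_rate = keep.mean()  # should be near 0.5 for a large batch
```

Training with some conditions removed is what lets a model of this kind still produce output when only one modality is supplied at inference time.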
High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page