Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

cs.CV cs.AI Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie · Mar 23, 2026
AI Review AI reviewed
Plain-language introduction

This paper tackles the visual perception gap in automated text layout generation. While existing Multimodal Large Language Models (MLLMs) generate layout code (SVG/JSON) to render text on images, they operate blind to the actual rendered output, producing layouts with overlapping text, poor contrast, or misalignment. The authors propose Visual Feedback Layout Model (VFLM), which closes the loop by rendering generated SVGs and feeding the visual results back to the model for iterative reflection and refinement. The framework uses a two-stage pipeline—cold-start supervised fine-tuning followed by reinforcement learning with GRPO—and introduces a specialized layout reward model trained on fine-grained quality hierarchies. A surprising finding is that simple outcome-based rewards outperform complex process-oriented rewards that explicitly encode step-wise incentives.
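
The render-and-reflect loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `refine_layout`, the three callables, and the `max_rounds` cap are all hypothetical names, and in VFLM the stopping decision is made by the model itself rather than by a separate checker.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, bytes]]  # (svg_code, rendered_image) per round


def refine_layout(
    generate: Callable[[str, History], str],   # hypothetical: model call
    render: Callable[[str], bytes],            # hypothetical: SVG rasterizer
    is_satisfactory: Callable[[bytes], bool],  # hypothetical: stop decision
    prompt: str,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate, render, and revise until the render is accepted."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history: History = []
    svg = ""
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        # The generator sees every earlier attempt together with its render.
        svg = generate(prompt, history)
        image = render(svg)
        history.append((svg, image))
        if is_satisfactory(image):
            break
    return svg, rounds_used


# Stand-in callables so the sketch runs without a model or a renderer.
def fake_generate(prompt: str, history: History) -> str:
    return f"<svg data-round='{len(history) + 1}'/>"


def fake_render(svg: str) -> bytes:
    return svg.encode("utf-8")


def fake_check(image: bytes) -> bool:
    return b"data-round='2'" in image


final_svg, rounds = refine_layout(fake_generate, fake_render, fake_check, "Add a title")
print(final_svg, rounds)
```

With the stand-ins above, the loop stops in the second round, because that is the first render the checker accepts.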

Critical review
Verdict
Bottom line

VFLM advances text layout generation by convincingly demonstrating that visual feedback significantly improves layout quality over code-only generation. The paper establishes that iterative refinement via RL is effective, with the model autonomously learning to correct visual artifacts like overlapping text and poor color contrast. However, the reliance on distillation from a proprietary model (Doubao-Seed-1.6) for training data and the lack of analysis regarding latency-cost tradeoffs limit its immediate practical applicability.

“we employ Doubao-Seed-1.6 [5] as a teacher model for data synthesis”
paper · Section 3.2
“4.306s, 8.283s, 3.705s”
paper · Figure 11
What holds up

The core technical contribution—the two-stage training pipeline yielding robust iterative refinement capabilities—is well-validated. The cold-start SFT effectively seeds the reflection behavior using synthetic multi-round trajectories, while the RL stage with GRPO optimization elicits genuine self-improvement. The layout reward model architecture demonstrates strong discriminative power (97.4% pairwise accuracy) across four quality tiers, and the ablation studies adequately isolate the contribution of visual feedback from mere additional compute.

“achieves a high pairwise prediction accuracy of 97.4% on the preference data test set”
paper · Section 4.3
“VFLM achieves the best performance across the majority of metrics, substantially outperforming all RL baselines”
paper · Table 3
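
For readers unfamiliar with the metric, pairwise prediction accuracy is the fraction of preference pairs in which the reward model scores the preferred layout above the rejected one. The sketch below assumes that definition; the function name and the example scores are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (preferred_score, rejected_score) pairs ranked correctly."""
    if not pairs:
        raise ValueError("at least one preference pair is required")
    # A pair counts as correct only when the preferred layout scores strictly higher.
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Made-up reward scores for four preference pairs; the third is mis-ranked.
example_pairs = [(0.9, 0.2), (0.7, 0.4), (0.3, 0.6), (0.8, 0.1)]
print(pairwise_accuracy(example_pairs))  # 3 of 4 pairs correct: 0.75
```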
Main concerns

The claim that "simple outcome-based rewards are more effective than complex process-oriented rewards," while intriguing, rests on a single comparison (Figure 4) and comes without a theoretical account of why the process reward led to reward hacking and local optima. The paper also omits failure-mode analysis: it does not report cases where VFLM fails to converge or enters oscillatory refinement loops. Furthermore, the evaluation relies heavily on GPT-4o-as-judge scores (Table 1), which may encode biases favoring MLLM-style outputs. A human study is reported in the supplementary materials, but its sample size (16 participants, 960 votes) is modest for validating subjective aesthetic metrics.

“complex process-oriented rewards may actually inhibit optimal performance, leading to the brittle local optima and 'reward hacking'”
paper · Section 4.5
“blind human study involving 16 participants... 960 votes were collected”
paper · Supplementary Section 4.4
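
To make the outcome-based setup concrete: in the standard GRPO formulation, each sampled trajectory receives one scalar reward, and its advantage is that reward normalized by the mean and standard deviation of its sampling group. The sketch below follows that standard formulation. It is an assumption that VFLM uses it unchanged, and the function name, the `eps` constant, and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize one final-outcome reward per trajectory within its group."""
    if not rewards:
        raise ValueError("the group must contain at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    # eps avoids division by zero when every trajectory earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up final-layout rewards for four trajectories sampled from one prompt.
group_rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Because only the final layout is scored, every refinement round in a trajectory shares the same advantage; no per-step credit is assigned, which is the contrast with the process-oriented rewards discussed above.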
Evidence and comparison

The evidence supports the central claim that visual feedback outperforms code-only methods. Table 1 shows VFLM achieving an OCR F1 of 0.9376 versus 0.8672 for Claude 3.7 on TextLayout, with consistent gains on the $R_{ove}$ (overlap) and $R_{com}$ (composition) metrics. The ablation in Table 3 fairly compares VFLM against "Single-Round RL" and "Direct Output" models trained on identical data volumes, confirming that the gains stem from iterative reflection rather than data scale. However, the comparisons to image editing models (GPT-4o-Image, FLUX-Kontext) are not like-for-like: these models alter backgrounds, whereas VFLM preserves them, a distinction the paper notes but does not control for in the aggregate metrics.

“VFLM 0.9376... Claude3.7 0.8672... IGD 0.8481”
paper · Table 1
“Our success highlights that our visual feedback framework is a more effective solution for layout generation”
paper · Section 4.4
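
As a rough illustration of what an OCR F1 score measures, the sketch below compares the words an OCR engine reads from a rendered layout against the words that were supposed to appear. The paper's exact matching rule is not given in this review, so the case-insensitive, word-level multiset matching used here is an assumption, and `ocr_f1` is a hypothetical helper.

```python
from collections import Counter


def ocr_f1(expected_text: str, recognized_text: str) -> float:
    """Word-level F1 between the intended text and the OCR output (assumed rule)."""
    expected = Counter(expected_text.lower().split())
    recognized = Counter(recognized_text.lower().split())
    # Multiset intersection: a repeated word only matches as often as it occurs in both.
    overlap = sum((expected & recognized).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(recognized.values())
    recall = overlap / sum(expected.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example: the OCR engine misreads the last of four words.
score = ocr_f1("Summer Sale 50% Off", "Summer Sale 50% Of")
print(score)
```

Overlapping or low-contrast text tends to lower both precision and recall under a rule like this, which is why an OCR-based score can act as a proxy for legibility.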
Reproducibility

The authors make code available and give extensive training details, including hyperparameters ($\alpha=0.25$ for reward weights, GRPO batch size 64, learning rate $10^{-6}$). The base model, Qwen2.5-VL-7B, is publicly accessible. However, reproducibility is hampered by the data sources: the TextLayout dataset comprises "free and paid data from the internet" without disclosed URLs or licenses, and the cold-start trajectories require Doubao-Seed-1.6, a closed API. The computational requirement of 16 NVIDIA H200 GPUs is substantial, though not prohibitive for well-resourced labs. Missing details include the specific prompt templates used for GPT-4o evaluation and the exact filtering criteria for the 200K training samples.

“collected approximately 200K samples, including free and paid data from the internet”
paper · Supplementary Section 1
“experiments are conducted on a cluster of 16 NVIDIA H200 GPUs”
paper · Supplementary Section 2.1
“Our code and data are available at https://github.com/FolSpark/VFLM”
paper · Abstract
Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. It is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
