Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
Large language models have historically lagged behind specialized encoder-decoder MT systems, but their superior context modeling makes them natural candidates for document-level translation. This paper tackles two key obstacles: the scarcity of high-quality document-level parallel corpora and the tendency of LLMs toward hallucinations and omissions. The authors propose a pipeline that generates synthetic document-level data from summarization corpora via LLM augmentation, filters this data using sacreBLEU, COMET, and LaBSE cosine similarity, and then fine-tunes in two stages: first on sentence-level data, then on the filtered document corpus.
The paper presents a pragmatic approach to document-level MT with LLMs and shows modest but consistent gains from multi-metric filtering and curriculum-style two-stage training. However, the evaluation is circular (the same metrics used for filtering are used for evaluation), and the experimental scale (2K training samples, 1.5K evaluation samples) is too small to establish robust conclusions. The reliance on Google Translate as a pseudo-reference introduces bias, and the absence of both human evaluation and comparisons to strong baselines such as NLLB-200 limits the practical significance of the reported gains.
The two-stage fine-tuning strategy is well-motivated by the observation that sentence-level MT datasets are abundant while document-level resources are scarce. Table 2 shows clear improvements from adding a sentence-level fine-tuning stage (sacreBLEU 11.24 → 15.07, COMET 0.618 → 0.697), and the multi-metric filtering approach (Table 5) demonstrates that aggressive filtering can yield modest gains. The use of LaBSE-CosSim as a reference-free quality signal is sensible given past work on hallucination detection.
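As a rough illustration of what a LaBSE-CosSim style signal computes, the sketch below takes the cosine similarity between a source embedding and a translation embedding. The three-dimensional vectors are toy stand-ins invented for this example; in practice they would come from a multilingual sentence encoder such as LaBSE, and the paper's exact procedure is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence embeddings (hypothetical values).
src_emb = [0.2, 0.7, 0.1]
good_hyp_emb = [0.25, 0.65, 0.12]   # translation that stays close to the source
off_topic_emb = [0.9, -0.1, 0.3]    # output that drifts away from the source

print(cosine_similarity(src_emb, good_hyp_emb))
print(cosine_similarity(src_emb, off_topic_emb))
```

Because the score needs only the source and the hypothesis, it can flag outputs that drift from the source even when no reference translation exists.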
The most serious flaw is circular evaluation: sacreBLEU, COMET, and LaBSE-CosSim are used both to filter the training data and to evaluate the final models. Models trained on data selected to score well on these metrics will unsurprisingly score well on the same metrics at test time. The geometric mean of these three metrics is also hard to interpret given their vastly different scales: sacreBLEU ranges over roughly 0-100, while COMET and LaBSE-CosSim range over roughly 0-1. The paper also lacks human evaluation and meaningful baselines: there is no comparison to state-of-the-art MT systems (NLLB, DeepL, GPT-4), no significance testing for small metric differences, and the thresholds appear to have been tuned post hoc. The Google Translate pseudo-reference introduces systematic bias toward sentence-level literalness rather than document-appropriate translation.
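One common way to test whether a small metric difference is meaningful is paired bootstrap resampling over per-segment scores. The sketch below is a minimal version under stated assumptions: the function name, the score lists, and the resample count are all hypothetical, and it applies directly only to metrics that average per-segment scores (such as COMET or LaBSE-CosSim). A corpus-level metric such as sacreBLEU would need to be recomputed on each resampled test set instead.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A's mean beats system B's."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same segment indices for both systems (the "paired" part).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Hypothetical per-segment scores for two systems; not taken from the paper.
system_a = [0.70, 0.72, 0.68, 0.75, 0.71]
system_b = [0.69, 0.70, 0.69, 0.73, 0.70]
print(paired_bootstrap(system_a, system_b))
```

A win rate close to 1.0 suggests the difference is stable across resampled test sets, while a value near 0.5 suggests it could easily be noise.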
The evidence presented is limited to a small-scale English-German experiment with only 2,000 training instances and automatic metrics. The paper fails to demonstrate that filtered synthetic data outperforms genuine document-level parallel data or established document-level MT systems. The authors acknowledge that xCOMET and Doc-COMET were considered but abandoned for practical reasons, yet these would have provided more robust hallucination detection than the chosen metrics. The claim that 'LLMs have generally underperformed compared to conventional encoder-decoder systems' goes unsupported: no empirical comparison is provided to validate this premise or to show that the proposed method closes the gap.
The experimental setup is reasonably well-documented, with model names (Llama-3.2-1B-Instruct as the base model, Llama-3.1-8B-Instruct for augmentation) and data sources (CNN/Daily Mail, News Commentary v16). However, neither code nor fine-tuning hyperparameters are provided, and the specific prompt template for LLM-based translation is not included. The random seed and exact data sampling procedure are not specified. The thresholds (sacreBLEU ≥ 35, COMET ≥ 0.75, LaBSE-CosSim ≥ 0.85) appear to have been chosen empirically, with no validation methodology described. Without access to the filtered dataset or generation code, independent reproduction would require substantial reverse-engineering.
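For concreteness, the reported thresholds could be applied as in the sketch below. It assumes the three scores are already computed for each synthetic pair and that a pair must clear all three thresholds at once; the review does not state how the paper combines them, so the conjunction is an assumption, and `ScoredPair` and `filter_pairs` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScoredPair:
    """A synthetic document pair with precomputed quality scores."""
    doc_id: str
    sacrebleu: float      # roughly 0-100, against the pseudo-reference
    comet: float          # roughly 0-1
    labse_cossim: float   # roughly 0-1, source vs. translation


def filter_pairs(
    pairs: Iterable[ScoredPair],
    min_sacrebleu: float = 35.0,
    min_comet: float = 0.75,
    min_labse: float = 0.85,
) -> List[ScoredPair]:
    """Keep pairs that clear all three thresholds (the conjunction is an assumption)."""
    return [
        p for p in pairs
        if p.sacrebleu >= min_sacrebleu
        and p.comet >= min_comet
        and p.labse_cossim >= min_labse
    ]


# Made-up scores for illustration only; these are not results from the paper.
candidates = [
    ScoredPair("doc-a", sacrebleu=41.2, comet=0.81, labse_cossim=0.90),
    ScoredPair("doc-b", sacrebleu=36.0, comet=0.70, labse_cossim=0.88),  # fails COMET
    ScoredPair("doc-c", sacrebleu=22.5, comet=0.79, labse_cossim=0.91),  # fails sacreBLEU
]
kept = filter_pairs(candidates)
print([p.doc_id for p in kept])
```

Exposing the thresholds as parameters makes it straightforward to sweep them on a held-out validation set, which is the kind of selection procedure the paper does not describe.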
In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.