Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

cs.CL cs.AI Ireh Kim, Tesia Sker, Chanwoo Kim · Mar 23, 2026
AI Review
Plain-language introduction

Large language models have historically lagged behind specialized encoder-decoder MT systems, but their stronger context modeling makes them natural candidates for document-level translation. This paper tackles two key obstacles: the scarcity of high-quality document-level parallel corpora and the tendency of LLMs to hallucinate or omit content. The authors build synthetic document-level data by converting summarization corpora into parallel data with an LLM, filter it using sacreBLEU, COMET, and LaBSE cosine similarity, and then fine-tune in two stages: first on sentence-level data, then on the filtered document-level corpus.

Critical review
Verdict
Bottom line

The paper presents a pragmatic approach to document-level MT with LLMs and shows modest but consistent gains from multi-metric filtering and curriculum-style two-stage training. However, the evaluation is circular: the same metrics used for filtering are used for evaluation. The experimental scale (2K training samples, 1.5K evaluation instances) is also too small to support robust conclusions. The reliance on Google Translate as a pseudo-reference introduces bias, and the complete absence of human evaluation or comparisons to strong baselines like NLLB-200 limits the practical significance of the reported gains.

“To handle length constraint of LaBSE model, we compute sentence-level embeddings for source and translation, average them, and calculate cosine similarity.”
paper · Section 2.1
“sacreBLEU 30 COMET 0.7 LaBSE-CosSim 0.8”
paper · Table 1
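
The LaBSE workaround quoted above (embed each sentence, average the embeddings, then take the cosine) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the toy 3-dimensional vectors stand in for real LaBSE sentence embeddings, which would come from an encoder that is not shown here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a non-empty list of equal-length sentence embeddings."""
    if not vectors:
        raise ValueError("need at least one sentence embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def doc_similarity(src_sents: List[Vector], tgt_sents: List[Vector]) -> float:
    """Document-level score: cosine of the averaged sentence embeddings."""
    return cosine(mean_vector(src_sents), mean_vector(tgt_sents))


# Toy embeddings (hypothetical values, not real LaBSE output).
SRC_DOC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
TGT_DOC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
UNRELATED_DOC = [[0.0, 0.0, 1.0]]
```

Because the averaging happens before the comparison, the source and the translation do not need the same number of sentences, as `SRC_DOC` and `TGT_DOC` show.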
What holds up

The two-stage fine-tuning strategy is well-motivated by the observation that sentence-level MT datasets are abundant while document-level resources are scarce. Table 2 shows clear improvements from adding a sentence-level fine-tuning stage (sacreBLEU 11.24 → 15.07, COMET 0.618 → 0.697), and the multi-metric filtering results (Table 5) suggest that aggressive filtering can yield modest gains. The use of LaBSE-CosSim as a reference-free quality signal is sensible given past work on hallucination detection.

“Two-stage fine-tuning: sacreBLEU 15.07, COMET 0.697; Document-level only: sacreBLEU 11.24, COMET 0.618”
paper · Table 2
“This two-stage strategy improves the fluency and consistency of the translated output.”
paper · Section 2
Main concerns

The most serious flaw is circular evaluation: sacreBLEU, COMET, and LaBSE-CosSim are used both to filter the training data and to evaluate the final models. Models trained on data selected to score well on these metrics will unsurprisingly score well on the same metrics at test time. The geometric mean of the three metrics is also hard to interpret because their scales differ: sacreBLEU ranges over roughly 0-100, while COMET and LaBSE-CosSim range over roughly 0-1. Rescaling one metric does not change how a geometric mean ranks systems, but the aggregate has no natural unit and hides which metric drives a difference. The paper lacks human evaluation and meaningful baselines: it makes no comparison to state-of-the-art MT systems (NLLB, DeepL, GPT-4), reports no significance testing for small metric differences, and uses thresholds that appear to be tuned post hoc. The Google Translate pseudo-reference introduces systematic bias toward sentence-level literalness rather than document-appropriate translation.

“evaluation is performed on 1,500 document-level instances with sacreBLEU, COMET, LaBSE-CosSim, and their geometric mean”
paper · Section 3.2
“employ three evaluation metrics—sacreBLEU, COMET, and LaBSE-CosSim—to filter the augmented data”
paper · Section 2.1
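
One way to probe the scale concern is to check how an aggregate reacts when sacreBLEU is rescaled from 0-100 to 0-1. The sketch below compares a geometric and an arithmetic mean on two systems. The system names and all scores are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Scores = Tuple[float, float, float]  # (sacreBLEU, COMET, LaBSE-CosSim)


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean; every value must be strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def rescale_bleu(scores: Scores) -> Scores:
    """Map sacreBLEU from the 0-100 range onto 0-1; leave the others alone."""
    bleu, comet, labse = scores
    return (bleu / 100.0, comet, labse)


def best_system(
    systems: Dict[str, Scores],
    aggregate: Callable[[Sequence[float]], float],
) -> str:
    """Name of the system with the highest aggregate score."""
    return max(systems, key=lambda name: aggregate(systems[name]))


# Hypothetical scores, chosen only to illustrate the effect.
SYSTEMS: Dict[str, Scores] = {
    "X": (20.0, 0.60, 0.70),
    "Y": (15.0, 0.80, 0.90),
}
RESCALED = {name: rescale_bleu(s) for name, s in SYSTEMS.items()}
```

In this toy case the geometric mean prefers the same system before and after rescaling, while the arithmetic mean switches its preference. The geometric mean therefore gives a stable ranking, although its absolute value still has no direct interpretation.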
Evidence and comparison

The evidence presented is limited to a small-scale English-German experiment with only 2,000 training instances and automatic metrics. The paper fails to demonstrate that filtered synthetic data outperforms genuine document-level parallel data or established document-level MT systems. The authors acknowledge that xCOMET and Doc-COMET were considered but abandoned for practical reasons, yet these would have provided more robust hallucination detection than the chosen metrics. The claim that 'LLMs have generally underperformed compared to conventional encoder–decoder systems' is asserted without support: no empirical comparison is provided to validate this premise or to show that the proposed method closes the gap.

“xCOMET is capable of evaluating hallucinations, it does not support document-level inputs”
paper · Section 2.1
“LLMs have generally underperformed compared to conventional encoder–decoder systems”
paper · Abstract
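
With an evaluation set of this size, small metric differences call for a significance check. Paired bootstrap resampling is one common option, sketched below. The sketch assumes per-document scores (for example, per-document COMET) are available for both systems; the function name and its defaults are hypothetical. A corpus-level metric such as sacreBLEU is not an average of per-document scores, so it would have to be recomputed on each resampled set instead.

```python
import random
from typing import Sequence


def paired_bootstrap_win_rate(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A beats system B.

    Both score lists must be aligned: index i refers to the same document.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n document indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

A win rate close to 1.0 suggests that A's advantage does not depend on which documents happened to be sampled, while a rate near 0.5 suggests the difference may be noise.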
Reproducibility

The experimental setup is reasonably well documented, with model names (Llama-3.2-1B-Instruct as the base model, Llama-3.1-8B-Instruct for augmentation) and data sources (CNN/Daily Mail, News Commentary v16). However, neither code nor fine-tuning hyperparameters are provided, and the prompt template for LLM-based translation is not included. The random seed and exact data sampling procedure are not specified. The thresholds (sacreBLEU ≥ 35, COMET ≥ 0.75, LaBSE-CosSim ≥ 0.85) appear to have been chosen empirically, with no validation methodology described. Without access to the filtered dataset or generation code, independent reproduction would require substantial reverse-engineering.

“randomly sampled 20K documents (7% of 287K instances)”
paper · Section 3.1
“randomly sample 2,000 instances from each”
paper · Section 3
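
To make the reproducibility gap concrete, the sketch below shows what a seeded sampling step and a three-metric threshold filter might look like. It is an illustration under stated assumptions rather than the paper's pipeline: the class and function names are hypothetical, the default thresholds are the ones cited in this review, and the scores are assumed to have been computed beforehand.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ScoredPair:
    """A source/translation pair with precomputed metric scores."""
    source: str
    translation: str
    sacrebleu: float
    comet: float
    labse_cossim: float


@dataclass(frozen=True)
class Thresholds:
    # Defaults follow the values cited in this review (assumed, not verified).
    sacrebleu: float = 35.0
    comet: float = 0.75
    labse_cossim: float = 0.85


def passes(pair: ScoredPair, t: Thresholds) -> bool:
    """Keep a pair only if it meets every threshold."""
    return (
        pair.sacrebleu >= t.sacrebleu
        and pair.comet >= t.comet
        and pair.labse_cossim >= t.labse_cossim
    )


def filter_pairs(
    pairs: Sequence[ScoredPair], t: Thresholds = Thresholds()
) -> List[ScoredPair]:
    return [p for p in pairs if passes(p, t)]


def seeded_sample(items: Sequence[T], k: int, seed: int) -> List[T]:
    """Draw k items without replacement; the same seed gives the same draw."""
    return random.Random(seed).sample(list(items), k)
```

Reporting the seed and the `Thresholds` values alongside the results would let others regenerate the same subset. The looser values quoted from Table 1 can be tried with `Thresholds(30.0, 0.7, 0.8)`.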
Abstract

In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics, leveraging sacreBLEU, COMET, and LaBSE-based cosine similarity, to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
