TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

cs.CL Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu · Mar 23, 2026
AI Review
Plain-language introduction

TAMTRL addresses the temporal credit assignment problem in multi-turn RL for long-context document processing. When LLMs process documents chunk-by-chunk with memory updates, standard outcome-only rewards cannot distinguish good from bad intermediate memory updates. The paper proposes using the model itself as a teacher: during training, it provides the model with filtered (relevant-only) chunks and uses the normalized token probabilities of the generated memory as turn-level rewards. This avoids expensive rollouts or external judges while providing fine-grained supervision for each turn.
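
The chunk-by-chunk setting described above can be pictured with a small sketch. This is only an illustration of the processing loop, not the paper's code: `split_into_chunks`, `run_episode`, and `toy_update` are hypothetical names, and `toy_update` is a trivial stand-in for the LLM's memory update.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def run_episode(
    query: str,
    tokens: List[str],
    chunk_size: int,
    update_memory: Callable[[str, str, List[str]], str],
) -> List[str]:
    """Read the document one chunk per turn and return the memory after each turn."""
    memory = ""
    trajectory = []
    for chunk in split_into_chunks(tokens, chunk_size):
        # One turn: the policy sees only the query, the current memory, and one chunk.
        memory = update_memory(query, memory, chunk)
        trajectory.append(memory)
    return trajectory


def toy_update(query: str, memory: str, chunk: List[str]) -> str:
    """Hypothetical stand-in for the LLM: keep chunk tokens that occur in the query."""
    query_words = set(query.split())
    kept = [tok for tok in chunk if tok in query_words]
    return (memory + " " + " ".join(kept)).strip()


if __name__ == "__main__":
    doc = "paris is in france and berlin is in germany".split()
    for turn, mem in enumerate(run_episode("where is paris", doc, 3, toy_update), 1):
        print(turn, repr(mem))
```

With an outcome-only reward, every entry in the returned trajectory would receive the same final signal, which is the credit assignment problem the paper targets.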

Critical review
Verdict
Bottom line

The paper presents a well-motivated solution to a genuine problem in long-context RL training. The POMDP formulation and CTDE-inspired teacher-student framework provide solid theoretical grounding, and the empirical results demonstrate consistent improvements over strong baselines. However, the gains over MemAgent are modest (average relative improvements of 1.87% and 2.02%), and the reliance on ground-truth annotations to filter relevant documents during training restricts the method to settings where such annotations are available. The claim of "self-supervised" learning is somewhat misleading, since the teacher requires privileged access to filtered chunks derived from ground-truth relevance labels.

“we remove irrelevant content based on the ground-truth document annotations, yielding a filtered chunk Ct that contains only relevant information”
TAMTRL, Sec. 4.3 · Section 4.3
“average relative improvements of 1.87% and 2.02% over strong baselines on the 0.6B and 1.7B backbone models, respectively”
TAMTRL, Sec. 6.2 · Table 1
What holds up

The POMDP formulation of long-document processing is rigorous and appropriate: states encode full documents, observations are local chunks, and actions are memory updates. The ablation studies robustly validate each design component—removing length normalization causes performance drops (37.57% vs 39.29% on 0.6B), while removing min-max normalization entirely collapses training (0% accuracy). The computational cost analysis is thorough; Table A2 shows TAMTRL (33.79h) is indeed more efficient than PRM-based methods (81.77h) and even slightly faster than vanilla MemAgent (37.75h) due to shorter response lengths. The theoretical decomposition in Theorem 1 provides a principled interpretation of the optimization objective, showing how the method balances success-conditional alignment with failure-conditional regularization.

“The state at time t is defined as st=(q,D,Mt)∈S, encapsulating the static global document and the LLM's internal memory... The action space A corresponds to the entire text space V*”
TAMTRL, Sec. 4.2 · Section 4.2
“w/o mm-norm... 0.00... confirming the necessity of each module in TAMTRL”
TAMTRL, Sec. 6.4 · Table 2
“TAMTRL... 33.79... PRM... 81.77... MemAgent... 37.75”
TAMTRL, Appendix B.3 · Table A2
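
To put the figures quoted above on a common scale, the following sketch computes relative changes from them. It assumes that 39.29 is the full method and 37.57 the variant without length normalization, as the review's wording suggests; `relative_change` is a helper name introduced here for illustration.

```python
def relative_change(new: float, base: float) -> float:
    """Return (new - base) / base, i.e. the change relative to the baseline."""
    if base == 0:
        raise ValueError("base must be non-zero")
    return (new - base) / base


# Ablation on the 0.6B model: accuracy without length normalization vs. full method.
length_norm_drop = relative_change(37.57, 39.29)

# Training cost in hours (Table A2 as quoted): TAMTRL vs. PRM and vs. MemAgent.
cost_vs_prm = relative_change(33.79, 81.77)
cost_vs_memagent = relative_change(33.79, 37.75)

if __name__ == "__main__":
    print(f"accuracy change without length norm: {length_norm_drop:.2%}")
    print(f"training time vs PRM:                {cost_vs_prm:.2%}")
    print(f"training time vs MemAgent:           {cost_vs_memagent:.2%}")
```

On these numbers, the length-normalization ablation costs about 4.4% relative accuracy, while TAMTRL's training time is roughly 59% lower than the PRM baseline and about 10% lower than MemAgent.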
Main concerns

Using the student model πθ as its own teacher is circular, which raises the question of what is actually being learned. The teacher sees filtered chunks Ct (relevant documents only) and the student sees full chunks Dt, but both share the same weights θ, so the teacher is not a stable expert but a moving target that evolves with training. The theoretical analysis assumes a static πteacher, which leaves a gap between theory and practice. More critically, the method requires ground-truth relevance annotations to construct Ct during training; without them, the teacher cannot distinguish relevant from irrelevant content. This limits scalability to unannotated corpora. The claim of generalizing to "unannotated documents at test time" skirts the issue that training still requires annotations. The improvements over MemAgent, while consistent, are small in magnitude and may not justify the added complexity of turn-level reward calculation.

“the student and teacher models are the same model πθ, with different input contexts”
TAMTRL, Sec. 4.3 · Section 4.3
“we remove irrelevant content based on the ground-truth document annotations”
TAMTRL, Sec. 4.3 · Section 4.3
“Given a teacher log-likelihood score pt=log πteacher(Mt+1|St)... the TAMTRL objective is defined as...”
TAMTRL, Sec. 5 · Theorem 1
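
A minimal sketch of how a teacher log-likelihood could be turned into turn-level rewards is given below. It is not the paper's implementation. It assumes the teacher's per-token log-probabilities for each generated memory are already available as plain lists, that length normalization means averaging over tokens, and that min-max normalization is applied across the turns of a single trajectory. The review does not state which set the paper normalizes over, and the 0.5 fallback for identical scores is also an assumption.

```python
from typing import List, Sequence


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability, so long memories are not penalized for length."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def min_max_normalize(scores: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Rescale scores to [0, 1]; identical scores map to 0.5 (an assumption made here)."""
    lo, hi = min(scores), max(scores)
    if hi - lo < eps:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def turn_rewards(teacher_token_logprobs: Sequence[Sequence[float]]) -> List[float]:
    """One reward per turn from the teacher's log-probs of each generated memory."""
    scores = [length_normalized_logprob(lp) for lp in teacher_token_logprobs]
    return min_max_normalize(scores)


if __name__ == "__main__":
    # Hypothetical log-probs for three turns with memories of different lengths.
    example = [[-0.1, -0.3], [-2.0, -1.0, -3.0], [-1.0]]
    print(turn_rewards(example))
```

Because the same weights θ produce both the memory and its score, these rewards shift as training proceeds, which is the moving-target concern raised above.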
Evidence and comparison

The comparisons to baselines appear fair: all RL methods use the same DAPO algorithm and hyperparameters. However, the comparison to PRM is weakened by the fact that the authors train a small BERT-based PRM (rather than using a properly scaled model like Qwen3-0.6B), which may explain its volatile performance on NIAH (85.20% vs TAMTRL's 95.04% for 0.6B). The LLM-judge baseline uses Qwen3-8B, which is much larger than the student models (0.6B/1.7B), potentially introducing noise from judge-student capability mismatch. The seven benchmarks cover diverse scenarios (HotpotQA, RULER, NIAH, etc.), though most are synthetic QA tasks with distractor documents; performance on natural long-context tasks like NarrativeQA remains low (<6%), suggesting limited generalization to truly open-ended long-context reasoning.

“PRM... 85.20... TAMTRL (ours)... 95.04... NIAH (0.6B)”
TAMTRL, Sec. 6.2 · Table 1
“train a BERT-based process reward model... final classification accuracy of 96.05%”
TAMTRL, Appendix B.2 · Appendix B.2
“Narrativeqa... TAMTRL-0.6B... 4.37... TAMTRL-1.7B... 5.50”
TAMTRL, Sec. 6.2 · Table 1
Reproducibility

The paper provides detailed hyperparameters (KL factor 1×10⁻³, learning rate 1×10⁻⁶, batch size 32, group size 8) and training infrastructure (8×A100 80GB). The code is available at an anonymous repository. However, reproduction is complicated by the requirement for ground-truth relevance annotations to filter training chunks Ct: the paper does not specify how these are obtained for arbitrary documents, and the method cannot be applied to raw unannotated text without preprocessing. The DAPO base algorithm is properly cited and described. The evaluation uses standard benchmarks (HotpotQA, RULER, NIAH) with exact match metrics, enhancing comparability. The chunk size (5000 tokens) and context window (8K) are clearly specified.

“KL factor of 1×10⁻³... AdamW optimizer with β1=0.9, β2=0.95... constant learning rate of 1×10⁻⁶”
TAMTRL, Appendix B.2 · Appendix B.2
“restrict the context window to 8K tokens... allocating 1024 tokens for the query, 5000 for the context chunk”
TAMTRL, Sec. 6.1 · Section 6.1
“Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8”
TAMTRL, Abstract · Abstract
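
For anyone attempting a reproduction, the reported settings can be gathered in one place, as in the sketch below. The field names and the `ReportedSetup` class are hypothetical and do not come from the authors' repository. Reading "8K" as 8192 tokens is an assumption, and since the quoted budget is truncated, the sketch only computes what is left after the query and chunk allocations without saying how the paper splits it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    # Values as quoted in the review; names are illustrative only.
    kl_factor: float = 1e-3
    learning_rate: float = 1e-6
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    batch_size: int = 32
    group_size: int = 8
    query_tokens: int = 1024
    chunk_tokens: int = 5000
    context_window: int = 8192  # assumption: "8K" interpreted as 8192 tokens

    def turns_for(self, document_tokens: int) -> int:
        """Number of chunk-reading turns needed for a document of the given length."""
        if document_tokens < 0:
            raise ValueError("document_tokens must be non-negative")
        return math.ceil(document_tokens / self.chunk_tokens)

    def remaining_budget(self) -> int:
        """Tokens left in the window after the query and chunk allocations."""
        return self.context_window - self.query_tokens - self.chunk_tokens


if __name__ == "__main__":
    setup = ReportedSetup()
    print("turns for a 100k-token document:", setup.turns_for(100_000))
    print("tokens left for memory and output:", setup.remaining_budget())
```

The turn count matters for the credit assignment argument: the longer the document, the more memory updates share a single outcome reward.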
Abstract

The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.
