Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies

cs.LG · Koichi Tanaka, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Yuki Sasamoto, Kei Tateno, Takuma Udagawa, Wei-Wei Du, Yuta Saito · Mar 23, 2026

Plain-language introduction

The paper tackles Off-Policy Evaluation (OPE) for ranking policies when the logging policy is deterministic—a common industrial scenario where existing estimators fail due to lack of common support. The key insight is to replace action-propensity weighting with click-propensity weighting, yielding the Click-based IPS (CIPS) estimator that leverages intrinsic user stochasticity even when the logging policy has none. This shifts the support requirement from ranking-wise or position-wise action overlap to click-wise overlap, enabling low-bias estimation in previously intractable deterministic settings.
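The contrast between the two kinds of weighting can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the position-based click model, the constants `THETA`, `REL`, and `Q_R`, the two policies, and the helper names are all hypothetical, and the click-weighted estimator is assumed to be an average of observed rewards weighted by the click-probability ratio of the clicked item. The paper's exact definition of CIPS may differ.

```python
import random

# Toy position-based click model. Every number below is an illustrative
# assumption, not a value from the paper.
THETA = [0.9, 0.4]   # probability that position k is examined
REL = [0.5, 0.8]     # probability that an examined item is clicked
Q_R = [0.2, 0.6]     # expected reward of an item once it is clicked

PI_0 = {(0, 1): 1.0}               # deterministic logging policy
PI = {(0, 1): 0.3, (1, 0): 0.7}    # stochastic new policy


def click_prob(item, policy):
    """Marginal click probability p_c(item, policy) under the toy model."""
    return sum(
        prob * THETA[ranking.index(item)] * REL[item]
        for ranking, prob in policy.items()
        if item in ranking
    )


def true_value(policy):
    """Expected total reward of `policy`: sum_a p_c(a, policy) * q_r(a)."""
    return sum(click_prob(a, policy) * Q_R[a] for a in range(len(REL)))


def simulate(n, seed=0):
    """Log n rankings under PI_0; return (ranking-wise IPS, click-weighted) estimates."""
    rng = random.Random(seed)
    rankings = list(PI_0)
    probs = [PI_0[r] for r in rankings]
    click_weight = [click_prob(a, PI) / click_prob(a, PI_0) for a in range(len(REL))]
    ips_total = cips_total = 0.0
    for _ in range(n):
        ranking = rng.choices(rankings, weights=probs)[0]
        ranking_weight = PI.get(ranking, 0.0) / PI_0[ranking]
        for pos, item in enumerate(ranking):
            clicked = rng.random() < THETA[pos] * REL[item]
            reward = 1.0 if clicked and rng.random() < Q_R[item] else 0.0
            ips_total += ranking_weight * reward       # action-propensity weight
            cips_total += click_weight[item] * reward  # click-propensity weight
    return ips_total / n, cips_total / n


ips_estimate, cips_estimate = simulate(50_000)
print(f"true value      : {true_value(PI):.3f}")
print(f"ranking-wise IPS: {ips_estimate:.3f}")
print(f"click-weighted  : {cips_estimate:.3f}")
```

In this toy setup the true value of the new policy is 0.415. The ranking-wise estimator only ever sees the single logged ranking, so its expectation is 0.3 × 0.282 = 0.0846, while the click-weighted estimator has expectation 0.415 because both items receive clicks under the logging policy.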

Critical review

Verdict

Bottom line

The paper presents a novel and well-motivated solution to a genuine practical problem. The core idea, using the click-probability ratio $p_c(x,a,\pi)/p_c(x,a,\pi_0)$ as the importance weight rather than the action-probability ratio $\pi(A|x)/\pi_0(A|x)$, is theoretically sound and empirically validated. The Click-based Doubly Robust (CDR) extension sensibly trades off bias and variance. However, practical utility depends heavily on how accurately click probabilities can be estimated and on the independence-of-potential-rewards assumption (Condition 3.2), which may not hold in all ranking applications.

“Existing estimators... require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic.”

paper · Abstract

What holds up

The theoretical framing is rigorous. The paper clearly identifies why existing IPS and IIPS estimators fail under deterministic logging (violation of ranking-wise and position-wise common support) and proves that CIPS achieves unbiasedness under the milder click-wise common support condition (Condition 3.1) plus independence of potential rewards (Condition 3.2). The empirical validation is thorough, covering synthetic experiments with controlled violations of assumptions and a real-world KuaiRec dataset, consistently showing CIPS achieves lower MSE and squared bias than IPS, IIPS, and RIPS when logging is deterministic.

“Under Conditions 3.1 and 3.2, CIPS is unbiased, i.e., $\mathbb{E}_{p(\mathcal{D})}[\hat{V}_{\mathrm{CIPS}}(\pi;\mathcal{D})]=V(\pi)$.”

paper · Theorem 3.1

“CIPS consistently outperforms the baselines across all logged data sizes by substantially reducing bias.”

paper · Section 5.1
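The unbiasedness claim quoted from Theorem 3.1 can be sketched in a few lines. The derivation below is a reading of the review, not a reproduction of the paper's proof. It assumes that the policy value has the form $V(\pi)=\mathbb{E}_{x}[\sum_{a} p_c(x,a,\pi)\,q_r(x,a)]$, that CIPS averages click-weighted rewards over logged samples, and that $R(a)$ is the reward observed once item $a$ is clicked. The click indicator $c_i(a)$ and observed reward $r_i(a)$ are notation introduced here for illustration.

```latex
\begin{align*}
\hat{V}_{\mathrm{CIPS}}(\pi;\mathcal{D})
  &= \frac{1}{n}\sum_{i=1}^{n}\sum_{a\in\mathcal{A}}
     \frac{p_c(x_i,a,\pi)}{p_c(x_i,a,\pi_0)}\, c_i(a)\, r_i(a)
  && \text{(assumed form)} \\
\mathbb{E}\!\left[c(a)\,r(a)\mid x\right]
  &= p_c(x,a,\pi_0)\, q_r(x,a)
  && \text{(Condition 3.2, logging under } \pi_0\text{)} \\
\mathbb{E}_{p(\mathcal{D})}\!\left[\hat{V}_{\mathrm{CIPS}}(\pi;\mathcal{D})\right]
  &= \mathbb{E}_{x}\!\left[\sum_{a\in\mathcal{A}}
     \frac{p_c(x,a,\pi)}{p_c(x,a,\pi_0)}\, p_c(x,a,\pi_0)\, q_r(x,a)\right] \\
  &= \mathbb{E}_{x}\!\left[\sum_{a\in\mathcal{A}} p_c(x,a,\pi)\, q_r(x,a)\right]
   = V(\pi)
  && \text{(Condition 3.1: } p_c(x,a,\pi_0)>0 \text{ where needed)}
\end{align*}
```

The cancellation in the third line is where click-wise common support enters: if $p_c(x,a,\pi_0)=0$ for an item that the new policy can get clicked, that item's term is missing from the expectation and the estimator is biased.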

Main concerns

Two assumptions warrant scrutiny. First, Independence of Potential Rewards (Condition 3.2) requires that $\mathbb{E}[R(a)|x,A]=\mathbb{E}[R(a)|x]$, meaning the expected downstream reward of an item depends only on the context and the item, not on its position or on neighboring items in the ranking. While the authors argue this is realistic for e-commerce, it may fail in settings where context effects (e.g., item comparison) strongly influence conversion. Second, the method requires estimating click probabilities $\hat{p}_c(x,a,\pi)$ from logged data. Theorem 3.2 warns that bias scales with the error in the estimated ratio, and the click-wise common support itself may fail when clicks are sparse or when both the new and logging policies are deterministic. Consistent with this, the real-world experiments show CIPS still exhibits 'some non-negligible bias' due to partial support violations.

“The expected potential rewards satisfy independence if $\mathbb{E}[R(a)\mid x,A]=\mathbb{E}[R(a)\mid x]=q_{r}(x,a)$ for all $a\in\mathcal{A}$ and $x\in\mathcal{X}$.”

paper · Condition 3.2

“This bias stems from violations of the click-wise common support, which can occur when both the new and logging policies are deterministic, an extremely challenging setting for any estimator.”

paper · Section 5.2
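Both failure modes discussed above can be made concrete with exact expectations on a toy problem. This is a sketch under stated assumptions: the position-based click model, all constants, the three policies, and the `ratio_error` parameter are hypothetical, and the estimator is assumed to weight observed rewards by the click-probability ratio of the clicked item. The multiplicative error is in the spirit of the Theorem 3.2 discussion, not a restatement of it.

```python
# Toy position-based click model; all numbers are illustrative assumptions.
THETA = [0.9, 0.4]       # examination probability per position
REL = [0.5, 0.8, 0.6]    # click probability per item once examined
Q_R = [0.2, 0.6, 0.5]    # expected reward per item once clicked

PI_0 = {(0, 1): 1.0}             # deterministic logging policy
PI_SUPPORTED = {(1, 0): 1.0}     # deterministic, reorders the logged items
PI_UNSUPPORTED = {(2, 0): 1.0}   # deterministic, shows a never-logged item


def click_prob(item, policy):
    """Marginal click probability p_c(item, policy) under the toy model."""
    return sum(
        prob * THETA[ranking.index(item)] * REL[item]
        for ranking, prob in policy.items()
        if item in ranking
    )


def true_value(policy):
    """Expected total reward: sum_a p_c(a, policy) * q_r(a)."""
    return sum(click_prob(a, policy) * Q_R[a] for a in range(len(REL)))


def expected_cips(new_policy, logging_policy, ratio_error=0.0):
    """Exact expectation of the click-weighted estimator in the toy model.

    `ratio_error` is a hypothetical uniform multiplicative error in the
    estimated click-probability ratio.
    """
    total = 0.0
    for item in range(len(REL)):
        p_log = click_prob(item, logging_policy)
        if p_log == 0.0:
            continue  # never clicked in the log, so it contributes nothing
        weight = (1.0 + ratio_error) * click_prob(item, new_policy) / p_log
        total += weight * p_log * Q_R[item]
    return total


for name, policy in [("supported", PI_SUPPORTED), ("unsupported", PI_UNSUPPORTED)]:
    bias = expected_cips(policy, PI_0) - true_value(policy)
    print(f"{name:12s} bias = {bias:+.3f}")
print(f"10% ratio error bias = "
      f"{expected_cips(PI_SUPPORTED, PI_0, 0.1) - true_value(PI_SUPPORTED):+.4f}")
```

In this toy model, two deterministic policies do not automatically violate click-wise support: reordering the logged items leaves the estimator unbiased, while showing an item that was never logged drops that item's contribution (0.27 of a true value of 0.31). A 10% multiplicative error in the ratio produces a bias equal to 10% of the true value.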

Evidence and comparison

The comparison to baselines is fair and comprehensive, covering IPS, IIPS, and RIPS under varying data sizes, ranking lengths, and degrees of logging-policy determinism. The synthetic experiments properly isolate the effect of violating the independence assumption via the $\lambda$ parameter, showing CIPS remains robust. However, the real-world evaluation is limited to a single dataset (KuaiRec), and its clicks are generated by a synthetic click model (Eq. 14) rather than observed, which weakens the claim of real-world applicability. Additional validation on purely observational industrial logs (where ground truth is unavailable) would strengthen the evidence.

“The MSE of CIPS increases only marginally as $\lambda$ grows... CIPS continues to provide accurate estimates even when the independence condition is not strictly satisfied.”

paper · Figure 3

“We use the user–item interaction matrix recorded in the original data as the potential reward function $q_{r}(x,A(k))$ and define the expected click probability as... [synthetic model].”

paper · Section 5.2
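The MSE and squared-bias figures discussed above are related by the standard decomposition MSE = squared bias + variance. As a minimal sketch, assuming an estimator has been run on many independently simulated logged datasets and the true policy value is known (as it is in synthetic experiments), the three quantities can be computed as follows. The function name and the sample numbers are hypothetical and do not come from the paper.

```python
from statistics import fmean, pvariance


def error_decomposition(estimates, true_value):
    """Split the mean squared error of repeated estimates into bias^2 and variance."""
    mean_estimate = fmean(estimates)
    squared_bias = (mean_estimate - true_value) ** 2
    variance = pvariance(estimates, mu=mean_estimate)
    mse = fmean((e - true_value) ** 2 for e in estimates)
    return {"mse": mse, "squared_bias": squared_bias, "variance": variance}


# Hypothetical estimates from three simulated runs against a true value of 0.4.
result = error_decomposition([0.4, 0.5, 0.6], true_value=0.4)
print(result)
```

With the population variance used here, `mse` equals `squared_bias + variance` up to floating-point error, which makes it easy to see whether an estimator's error is dominated by bias (as reported for the baselines under deterministic logging) or by variance.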

Reproducibility

Reproducibility is strong. The authors provide a public GitHub repository with implementation code. Experimental details in Appendix D describe the synthetic data generation (Eqs. 10-12), the 3-layer neural network used for click probability estimation, and hyperparameters such as $\epsilon=0.3$ for the new policy and $\alpha=\infty$ for fully deterministic logging. The KuaiRec dataset is publicly available. Key missing details include the specific optimizer, learning rate, and training procedure for the click probability model, as well as the exact random seeding protocol.

“The implementation code is public at this repository, and detailed experimental setups can be found in Appendix D.”

paper · Section 5

“We estimate click probabilities to implement CIPS using a 3-layer neural network and the logged data $\mathcal{D}$.”

paper · Appendix D

Abstract

Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different, logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS), exploiting the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, for a range of experimental settings with completely deterministic logging policies.
