Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies
The paper tackles Off-Policy Evaluation (OPE) for ranking policies when the logging policy is deterministic—a common industrial scenario where existing estimators fail due to lack of common support. The key insight is to replace action-propensity weighting with click-propensity weighting, yielding the Click-based IPS (CIPS) estimator that leverages intrinsic user stochasticity even when the logging policy has none. This shifts the support requirement from ranking-wise or position-wise action overlap to click-wise overlap, enabling low-bias estimation in previously intractable deterministic settings.
The paper presents a novel and well-motivated solution to a genuine practical problem. The core idea—using $p_c(x,a,\pi)/p_c(x,a,\pi_0)$ as importance weights rather than $\pi(A|x)/\pi_0(A|x)$—is theoretically sound and empirically validated. The Click-based Doubly Robust (CDR) extension sensibly trades off bias and variance. However, the practical utility depends heavily on accurately estimating click probabilities and on the independence of potential rewards assumption, which may not hold in all ranking applications.
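To make the weighting idea concrete, here is a minimal sketch of click-propensity weighting on a toy problem. Everything in it is an assumption made for illustration rather than the paper's setup: the position-based click model, the constants `EXAM`, `ATTRACT`, and `REWARD`, and the helper names. It also uses the true click probabilities, whereas in practice they would have to be estimated from logged data. The estimator form shown (a per-click reward scaled by $p_c(x,a,\pi)/p_c(x,a,\pi_0)$) follows the description above and may differ in detail from the paper's definition.

```python
import random

# Toy position-based click model (an assumption for illustration):
# P(click on item a at position k) = EXAM[k] * ATTRACT[a].
EXAM = [1.0, 0.5]            # examination probability per position
ATTRACT = {0: 0.6, 1: 0.4}   # attractiveness per item
REWARD = {0: 1.0, 1: 2.0}    # reward after a click, independent of the ranking


def click_probs(ranking):
    """Click probability of each item when `ranking` is shown."""
    return {item: EXAM[pos] * ATTRACT[item] for pos, item in enumerate(ranking)}


def policy_value(ranking):
    """Exact expected reward of deterministically showing `ranking`."""
    return sum(p * REWARD[item] for item, p in click_probs(ranking).items())


def simulate_logs(ranking, n, rng):
    """Log n impressions; each entry lists (item, reward) for the clicked items."""
    probs = click_probs(ranking)
    return [[(a, REWARD[a]) for a, p in probs.items() if rng.random() < p]
            for _ in range(n)]


def cips_estimate(logs, p_click_new, p_click_log):
    """Average of click-propensity-weighted rewards over logged impressions."""
    total = 0.0
    for clicked in logs:
        for item, reward in clicked:
            total += p_click_new[item] / p_click_log[item] * reward
    return total / len(logs)


logging_ranking = (0, 1)   # deterministic logging policy
new_ranking = (1, 0)       # deterministic new policy, never shown in the logs

logs = simulate_logs(logging_ranking, 50_000, random.Random(0))
estimate = cips_estimate(logs, click_probs(new_ranking), click_probs(logging_ranking))
print(f"true value {policy_value(new_ranking):.3f}, CIPS-style estimate {estimate:.3f}")
```

In this toy setting the new ranking never appears in the logs, so a ranking-level weight $\pi(A|x)/\pi_0(A|x)$ would be zero for every logged impression. The click-level weights remain well defined because both items are clicked with positive probability under the logging ranking.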
The theoretical framing is rigorous. The paper clearly identifies why existing IPS and IIPS estimators fail under deterministic logging (violation of ranking-wise and position-wise common support) and proves that CIPS achieves unbiasedness under the milder click-wise common support condition (Condition 3.1) plus independence of potential rewards (Condition 3.2). The empirical validation is thorough, covering synthetic experiments with controlled violations of assumptions and a real-world KuaiRec dataset, consistently showing CIPS achieves lower MSE and squared bias than IPS, IIPS, and RIPS when logging is deterministic.
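The three support requirements can be compared on a small example. The sketch below is an illustration under stated assumptions, not the paper's formal conditions: the helper names (`has_support`, `position_marginals`, `click_marginals`) are hypothetical, the click model is a simple position-based one, and "support" is checked as "everything the new policy can produce with positive probability is also possible under logging".

```python
def has_support(target_probs, logging_probs):
    """True if every outcome the target can produce is also possible under logging."""
    return all(logging_probs.get(key, 0.0) > 0.0
               for key, p in target_probs.items() if p > 0.0)


def position_marginals(ranking_probs):
    """P(item a is shown at position k) implied by a distribution over rankings."""
    marg = {}
    for ranking, p in ranking_probs.items():
        for pos, item in enumerate(ranking):
            marg[(item, pos)] = marg.get((item, pos), 0.0) + p
    return marg


def click_marginals(ranking_probs, exam, attract):
    """P(item a is clicked) under a toy position-based click model (assumed)."""
    clicks = {}
    for (item, pos), p in position_marginals(ranking_probs).items():
        clicks[item] = clicks.get(item, 0.0) + p * exam[pos] * attract[item]
    return clicks


exam = [1.0, 0.5]                              # examination probability per position
attract = {"a": 0.6, "b": 0.4}                 # attractiveness per item
logging = {("a", "b"): 1.0}                    # deterministic logging policy
target = {("a", "b"): 0.3, ("b", "a"): 0.7}    # stochastic new policy

ranking_wise = has_support(target, logging)
position_wise = has_support(position_marginals(target), position_marginals(logging))
click_wise = has_support(click_marginals(target, exam, attract),
                         click_marginals(logging, exam, attract))
print(ranking_wise, position_wise, click_wise)  # False False True
```

Here the ranking `("b", "a")` and the placement of item `"b"` in the top position are never logged, so the ranking-wise and position-wise checks fail. Both items are still clicked with positive probability under the logging ranking, so the click-wise check passes.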
Two assumptions warrant scrutiny. First, the Independence of Potential Rewards (Condition 3.2) requires that $\mathbb{E}[R(a)|x,A]=\mathbb{E}[R(a)|x]$, meaning downstream rewards depend only on the item, not on its position or neighboring items. While the authors argue this is realistic for e-commerce, it may fail in settings where context effects (e.g., item comparison) strongly influence conversion. Second, the method requires estimating click probabilities $\hat{p}_c(x,a,\pi)$ from logged data. Theorem 3.2 warns that bias scales with the ratio estimation error, and in fully deterministic settings where clicks are sparse or themselves deterministic, click-wise common support may fail as well. The real-world experiments show that CIPS still exhibits "some non-negligible bias" due to partial support violations.
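The sensitivity to weight estimation can be illustrated with an exact calculation on a toy problem. This is a sketch under assumptions, not a reproduction of Theorem 3.2: the click probabilities, rewards, and the uniform relative error applied to every weight are all hypothetical, and the estimator is taken to be a per-click reward scaled by the click-probability ratio.

```python
# Hypothetical click probabilities and post-click rewards for two items.
p_log = {"a": 0.6, "b": 0.2}    # click probability under the logging policy
p_new = {"a": 0.3, "b": 0.4}    # click probability under the new policy
reward = {"a": 1.0, "b": 2.0}   # expected reward given a click

true_value = sum(p_new[a] * reward[a] for a in p_new)
true_weights = {a: p_new[a] / p_log[a] for a in p_log}


def expected_estimate(weights):
    """Exact expectation of the click-weighted estimate under the logging policy."""
    return sum(weights[a] * p_log[a] * reward[a] for a in p_log)


def bias_with_relative_error(rel_err):
    """Bias when every estimated weight is off by the same relative error."""
    noisy = {a: w * (1.0 + rel_err) for a, w in true_weights.items()}
    return expected_estimate(noisy) - true_value


for rel_err in (0.0, 0.1, -0.2):
    print(f"relative weight error {rel_err:+.1f} -> bias {bias_with_relative_error(rel_err):+.3f}")
```

With exact weights the expectation matches the true value. In this toy case a uniform relative error in the weights produces a bias of the same relative size, which is one simple way to see why the quality of the click-probability model matters.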
The comparison to baselines is fair and comprehensive, covering IPS, IIPS, and RIPS under varying data sizes, ranking lengths, and degrees of logging-policy determinism. The synthetic experiments properly isolate the effect of violating the independence assumption via the $\lambda$ parameter, showing CIPS remains robust. However, the real-world evaluation is limited to a single dataset (KuaiRec) with a synthetic click model (Eq. 14), which somewhat undermines the claim of real-world applicability. Additional validation on purely observational industrial logs (where ground truth is unavailable) would strengthen the evidence.
Reproducibility is strong. The authors provide a public GitHub repository with implementation code. Experimental details in Appendix D describe the synthetic data generation (Eq. 10-12), neural network architecture (3-layer) for click probability estimation, and hyperparameters such as $\epsilon=0.3$ for the new policy and $\alpha=\infty$ for fully deterministic logging. The KuaiRec dataset is publicly available. Key missing details include the specific optimizer, learning rate, and training procedure for the click probability model, as well as the exact random seeding protocol.
Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different, logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS), exploiting the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, for a range of experimental settings with completely deterministic logging policies.