A Job I Like or a Job I Can Get: Designing Job Recommender Systems Using Field Experiments

econ.EM stat.ML Guillaume Bied, Philippe Caillou, Bruno Crépon, Christophe Gaillac, Elia Pérennes, Michèle Sebag · Mar 23, 2026

AI Review

Plain-language introduction

Job recommender systems deployed by public employment services are typically optimized for predictive metrics like clicks, applications, or hires rather than job seeker welfare. This paper develops a structural job-search model where vacancy value depends on utility $U$ and hiring probability $p$, deriving a welfare-optimal ranking based on an expected-surplus index $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$. Through two randomized field experiments with the French public employment service, the authors demonstrate that algorithms approximating this theoretical benchmark substantially outperform existing approaches, while formalizing the "inversion problem" where behavior-based rankings diverge from welfare-maximizing ones.
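To make the index and the inversion problem concrete, here is a minimal Python sketch that evaluates $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$ on three made-up vacancies and compares the resulting ranking with rankings by application probability and by hiring probability alone. The vacancy names, the values of $p$ and $\Delta$, the choice $\sigma = 1$, and the logit form used for the application probability are all illustrative assumptions. None of them come from the paper's data, and $\Delta$ is treated as a given number because its functional form is not reproduced in this review.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def expected_surplus(p: float, delta: float, sigma: float) -> float:
    """Index Gamma = p * sigma * log(1 + exp(delta / sigma))."""
    return p * sigma * softplus(delta / sigma)

def application_prob(delta: float, sigma: float) -> float:
    """Assumed logit application probability under logistic taste shocks."""
    return 1.0 / (1.0 + math.exp(-delta / sigma))

SIGMA = 1.0  # hypothetical scale of the taste shock

# Hypothetical vacancies: name -> (hiring probability p, surplus term delta)
vacancies = {
    "A": (0.05, -3.0),
    "B": (0.20, -4.0),
    "C": (0.02, -1.0),
}

def rank(score) -> list[str]:
    """Vacancy names sorted from highest to lowest score."""
    return sorted(vacancies, key=lambda name: score(*vacancies[name]), reverse=True)

rank_by_gamma = rank(lambda p, d: expected_surplus(p, d, SIGMA))
rank_by_pa = rank(lambda p, d: application_prob(d, SIGMA))
rank_by_p = rank(lambda p, d: p)

print("by Gamma:", rank_by_gamma)
print("by p_a:  ", rank_by_pa)
print("by p:    ", rank_by_p)
```

In this toy example the three orderings disagree, which is the qualitative point of the inversion problem: a ranking learned from application behavior or from hiring probability alone need not match the ranking by expected surplus.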

Critical review

Verdict

Bottom line

The paper combines economic theory and field experimentation convincingly, providing a coherent welfare framework for algorithmic design that moves beyond standard predictive metrics. The iterative experimental approach, which uses beta-test 1 to inform the design of beta-test 2, generates credible evidence that hybrid algorithms outperform single-objective baselines. However, the analysis relies on specific parametric assumptions (logistic taste shocks) and on the myopic job seeker benchmark, and extremely low hiring rates (below 0.05%) severely limit statistical power for ultimate employment outcomes.

“The model implies that welfare-optimal RSs rank vacancies by an expected-surplus index combining both, and shows why rankings based solely on utility, hiring probabilities, or observed application behavior are generically suboptimal”

paper · Abstract

What holds up

Random assignment of recommendation algorithms generates clean exogenous variation for comparing designs. The behavioral mechanism is supported by the finding that both utility scores and inverse hiring probabilities are highly significant predictors of application decisions across specifications (Table 3). The structural estimation supports the same mechanisms, and the finding that welfare-approximating algorithms (particularly Vadore.2) consistently outperform pure utility-based (U-rec) or pure hiring-based (Vadore.0) rankings holds robustly across both experiments, with substantial gains in click-through and application rates.

“The newly introduced algorithms, particularly the application-based algorithm and the approximation of the welfare-optimal rule, substantially outperform the initial approaches, especially in terms of clicks and applications”

paper · Section 4.3

“both the utility score and the inverse hiring probability are highly significant predictors of application decisions, with coefficients stable across specifications”

paper · Page 4

Main concerns

The primary empirical limitation is statistical power: unconditional hiring probabilities on recommended vacancies remain around 0.01-0.04%, making it impossible to detect significant differences in ultimate employment outcomes. The authors acknowledge that "detecting welfare differences through hiring rates alone would require samples several orders of magnitude larger than a beta-test." Additionally, the theoretical optimal benchmark depends on strong functional form assumptions and the myopic job seeker assumption; while extensions to forward-looking behavior are discussed, the welfare calculations and counterfactual optimal recommendations rely on the baseline specification. The analysis also explicitly abstracts from congestion effects and general equilibrium considerations, which could alter optimal recommendations when scaled to full deployment.

“Since application probabilities are small (below 1% per recommendation), the joint hiring probability $p_h = p \times p_a$ is extremely low. Detecting welfare differences through hiring rates alone would require samples several orders of magnitude larger than a beta-test”

paper · Page 33

“The model abstracts from competition among job seekers and from congestion effects.”

paper · Page 17
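The power concern can be illustrated with a back-of-the-envelope sample-size calculation. The sketch below uses the standard normal-approximation formula for a two-sided test of two proportions. The two hiring rates (0.02% and 0.03%), the 5% significance level, and the 80% power target are hypothetical values chosen to sit inside the range quoted above, not figures from the paper. The calculation also treats recommendations as independent draws, which ignores any clustering by job seeker.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Observations per arm needed to detect p1 vs p2 with a two-sided
    two-proportion z-test (normal approximation, equal arm sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical hiring rates per recommendation: 0.02% vs 0.03%
print(n_per_arm(0.0002, 0.0003))
```

Because the required sample grows with the inverse square of the difference in rates, small absolute gaps between very rare outcomes translate into very large samples, which is consistent with the authors' remark about beta-test scale.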

Evidence and comparison

The evidence strongly supports the central theoretical claims: the structural model's predictions regarding application behavior are validated experimentally, and the decomposition $\Gamma(p, U) = p \times p_a(p, U) \times m(p_a(p, U))$ clarifies why neither pure utility nor pure hiring rankings are optimal. The comparison to related work is comprehensive (Table 1 provides a clear taxonomy of RS approaches in economics and computer science). The claim that Vadore.2 performs close to the welfare-optimal benchmark is supported by Figure 3(c), though the paper fairly notes that the inversion problem has "modest quantitative implications in this setting" due to empirically low application probabilities.

“The decomposition (10) immediately implies that neither $U$, $p$, $p_a$, nor $p_h$ alone is sufficient to recover the optimal ranking: each captures only a subset of the dimensions jointly determining $\Gamma(p, U)$.”

paper · Section 2.5.1
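The link between the closed-form index and the decomposition can be worked out in a few lines, assuming that logistic taste shocks imply a logit application probability $p_a = 1/(1 + e^{-\Delta(p,U)/\sigma})$. That assumption, and the resulting expression for $m(p_a)$, are a reconstruction for illustration; the paper's own definition of $m$ is not quoted in this review and may be stated differently.

```latex
\begin{align*}
p_a &= \frac{1}{1 + e^{-\Delta(p,U)/\sigma}}
  \quad\Longrightarrow\quad
  1 - p_a = \frac{1}{1 + e^{\Delta(p,U)/\sigma}}, \\
\Gamma(p,U) &= p\,\sigma \log\!\bigl(1 + e^{\Delta(p,U)/\sigma}\bigr)
  = -p\,\sigma \log(1 - p_a)
  = p \times p_a \times \underbrace{\frac{-\sigma \log(1 - p_a)}{p_a}}_{m(p_a)}, \\
m(p_a) &= \sigma\Bigl(1 + \tfrac{p_a}{2} + \tfrac{p_a^{2}}{3} + \cdots\Bigr)
  \approx \sigma \quad \text{for small } p_a .
\end{align*}
```

Under these assumptions, when application probabilities are small, $\Gamma(p, U) \approx \sigma \, p \, p_a = \sigma \, p_h$, so ranking by the joint hiring probability nearly reproduces the optimal ranking. This would be one way to read the paper's remark that the inversion problem has modest quantitative implications in this setting.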

Reproducibility

Reproduction faces significant barriers due to reliance on proprietary administrative data from the French Public Employment Service (France Travail), including matched job seeker characteristics, vacancy postings, clicks, applications, and hiring outcomes that are not publicly available. While experimental protocols are referenced (Appendix F) and algorithmic architectures (neural network structures for Vadore.0, scoring formulas for U-rec) are detailed in Sections 3.2 and 3.3, the specific implementation code, trained model weights, and hyperparameters are not provided. The structural estimation requires the specific experimental variation generated through the institutional partnership, making independent replication impossible without analogous data access and IRB approvals.

Abstract

Recommendation systems (RSs) are increasingly used to guide job seekers on online platforms, yet the algorithms currently deployed are typically optimized for predictive objectives such as clicks, applications, or hires, rather than job seekers' welfare. We develop a job-search model with an application stage in which the value of a vacancy depends on two dimensions: the utility it delivers to the worker and the probability that an application succeeds. The model implies that welfare-optimal RSs rank vacancies by an expected-surplus index combining both, and shows why rankings based solely on utility, hiring probabilities, or observed application behavior are generically suboptimal, an instance of the inversion problem between behavior and welfare. We test these predictions and quantify their practical importance through two randomized field experiments conducted with the French public employment service. The first experiment, comparing existing algorithms and their combinations, provides behavioral evidence that both dimensions shape application decisions. Guided by the model and these results, the second experiment extends the comparison to an RS designed to approximate the welfare-optimal ranking. The experiments generate exogenous variation in the vacancies shown to job seekers, allowing us to estimate the model, validate its behavioral predictions, and construct a welfare metric. Algorithms informed by the model-implied optimal ranking substantially outperform existing approaches and perform close to the welfare-optimal benchmark. Our results show that embedding predictive tools within a simple job-search framework and combining it with experimental evidence yields recommendation rules with substantial welfare gains in practice.
