On the Number of Conditional Independence Tests in Constraint-based Causal Discovery
Constraint-based causal discovery algorithms like PC require exponentially many conditional independence (CI) tests in the worst case---specifically $p^{\mathcal{O}(d)}$ where $d$ is the maximum degree. This paper establishes that the fundamental complexity parameter is actually $s$, the maximum undirected clique size in the essential graph, which can be much smaller than $d$ (e.g., $s=2$ vs $d=p-2$ in Figure 1). The authors propose Greedy Ancestral Search (GAS), which achieves $p^{\mathcal{O}(s)}$ CI tests, and prove a matching lower bound of $2^{\Omega(s)}$, establishing exponent-optimality up to a logarithmic factor.
The paper makes a substantial theoretical contribution by resolving the query complexity of constraint-based causal discovery in terms of the clique size $s$ rather than the degree $d$. The proposed GAS algorithm integrates ancestral relationship learning with adjacency search to achieve the improved bound, and the authors rigorously prove both the upper bound (Theorem 1) and the information-theoretic lower bound (Theorem 2). The result is a nearly tight characterization of the CI test complexity for this problem.
The theoretical framework is elegant and well-supported. The key insight---that ancestral information can guide adjacency testing to avoid the exponential dependence on degree---is executed through a careful integration of prefix node set expansion and edge orientation via v-structures and Meek rules. The lower bound construction using the largest undirected clique is novel and strictly stronger than prior work, providing an "any-case" guarantee that for any collection of fewer than $2^{s}-s-1$ CI tests, an indistinguishable MEC exists.
The primary limitation is the reliance on CI oracles (infinite samples), which abstracts away the statistical complexity of testing; the finite-sample behavior and precise correctness conditions remain unclear as noted in the Discussion. Additionally, while $s \leq d+1$, the advantage of GAS depends on $s \ll d$, which though common in some domains (e.g., gene regulatory networks), is not guaranteed for all graph structures. The experiments also reveal that GAS+ (with $\mathcal{O}(p^{2})$ additional tests) outperforms vanilla GAS in practice, complicating the claim that the base algorithm is sufficient for accuracy.
The theoretical evidence is compelling: Theorem 1 establishes the $p^{\mathcal{O}(s)}$ upper bound while Theorem 2 provides the $2^{\Omega(s)}$ lower bound, together showing the complexity is exponential in $s$. Comparisons to PC, GSP, and GRaSP2 on Erdős-Rényi and SERGIO data demonstrate that GAS runs significantly faster, particularly on denser graphs (Figure 4a), though accuracy requires the GAS+ variant to match competitors in sparse settings (Figure 4b). The real-world Airfoil experiment confirms efficiency but shows limitations in identifying exogenous variables compared to PC.
The authors provide implementation code at https://github.com/uhlerlab/greedy-ancestral-search, and Algorithm 1 along with Appendix E specify the implementation details. However, full reproduction requires careful handling of finite-sample CI testing thresholds, which are not fully standardized in the paper. The theoretical results assume infinite samples and exact CI oracles, so independent verification would require implementing the oracle abstractly or using specific finite-sample approximations (e.g., partial correlation tests for Gaussian data) with appropriate alpha levels.
Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.
No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.