An Automated Approach to Causal Inference in Discrete Settings

When causal quantities cannot be point identified, researchers often pursue partial identification to quantify the range of possible values. However, the peculiarities of applied research conditions can make this analytically intractable. We present a general and automated approach to causal inference in discrete settings. We show causal questions with discrete data reduce to polynomial programming problems, and we present an algorithm to automatically bound causal effects using efficient dual relaxation and spatial branch-and-bound techniques. The user declares an estimand, states assumptions, and provides data (however incomplete or mismeasured). The algorithm then searches over admissible data-generating processes and outputs the most precise possible range consistent with available information -- i.e., sharp bounds -- including a point-identified solution if one exists. Because this search can be computationally intensive, our procedure reports and continually refines non-sharp ranges that are guaranteed to contain the truth at all times, even when the algorithm is not run to completion. Moreover, it offers an additional guarantee we refer to as $\epsilon$-sharpness, characterizing the worst-case looseness of the incomplete bounds. Analytically validated simulations show the algorithm accommodates classic obstacles, including confounding, selection, measurement error, noncompliance, and nonresponse.


Introduction
When causal quantities cannot be point identified, researchers often pursue partial identification to quantify the range of possible answers. These solutions are tailored to specific scenarios (e.g., Lee, 2009; Gabriel et al., 2020; Kennedy et al., 2019; Knox et al., 2020; Li and Pearl, 2021; Sjölander et al., 2014), but the idiosyncrasies of applied research can render prior results unusable if identifying assumptions fail or slightly differing causal structures are encountered. This case-by-case approach to deriving causal bounds presents a major obstacle to scientific progress. To increase the pace of discovery, researchers need a general approach that is robust to context-specific peculiarities.
In this paper, we present an automated approach to causal inference in discrete settings which can be applied to all graphical causal models, as well as all observed quantities and domain assumptions in standard use. With our algorithm, users declare an estimand, state assumptions, and provide available data, however incomplete or mismeasured. The algorithm then outputs sharp bounds, the most precise possible answer to the causal query given these inputs, including a point estimate if the solution is identified. This approach can accommodate scenarios involving any classic threat to inference, including but not limited to missing data, selection, measurement error, and noncompliance. Our algorithm also has the desirable property of alerting users when assumptions conflict with observed data, indicating a faulty causal theory. Finally, we develop techniques for drawing statistical inferences about estimated bounds. We demonstrate our method using a host of simulations, validating results wherever existing analytic solutions are available.
Our work advances a rich literature on partial identification in causal inference (Robins, 1989; Manski, 1990; Heckman and Vytlacil, 2001; Zhang and Rubin, 2003; Cai et al., 2008; Swanson et al., 2018; Gabriel et al., 2020; Molinari, 2020), outlined in Section 2, which has sometimes cast the task as a constrained optimization problem that can be solved computationally. In pioneering work, Balke and Pearl (1994, 1997) provided a method for calculating sharp bounds when causal queries can be expressed as linear programming problems. However, a wide range of estimands and empirical obstacles result in causal queries that are not reducible to linear programs, and a complete computational solution has remained elusive.
When feasible, sharp-bounding approaches offer a principled and transparent approach to causal inference that makes maximum use of available information while acknowledging its limitations. Claims outside the bounds can be immediately rejected, and claims inside the bounds must be explicitly justified by additional assumptions or data that enable tightening. But several obstacles still preclude widespread use of these techniques.
For one, analytic derivation remains intractable for many problems. Within the subclass of linear problems, Balke and Pearl's (1994) simplex method offers a highly efficient analytic solution, but one that fails to generalize to the many partially observed settings where nonlinearity arises. Analytic nonlinear solutions remain limited to specific results, painstakingly derived case by case (e.g., Knox et al., 2020; Gabriel et al., 2020; Li and Pearl, 2021). Though general sharp bounds can in theory be obtained by various nonlinear optimization techniques (Geiger and Meek, 1999; Zhang and Bareinboim, 2021), such approaches are often computationally infeasible. This is because without exhaustively exploring a vast model space, analysts can obtain local optima that correspond to potentially invalid bounds, i.e., ranges that may fail to contain the truth.
To address these limitations, we first show in Sections 3 and 4 that essentially all common causal queries involving discrete variables can be reduced to polynomial programs, a well-studied class of optimization tasks that nests linear programming as a special case, building on prior results from Geiger and Meek (1999) and Wolfe et al. (2019). While mature techniques have been developed for such tasks (Belotti et al., 2009; Vigerske and Gleixner, 2018; Gamrath et al., 2020), it is well known that solving polynomial programs to global optimality is in general NP-hard. The difficulty of the problem thus highlights the need for efficient algorithms and bounding techniques that remain valid even when analysts are faced with time constraints. In Sections 5-6, we develop a procedure, based on dual relaxation and spatial branch-and-bound techniques, that provides valid bounds of arbitrary sharpness, for all causal structures, under virtually any information environment and domain assumptions. We show this procedure is guaranteed to achieve complete sharpness with sufficient computation time; in smaller problems, this can occur in a matter of seconds. However, in cases where the time needed to discover sharp bounds is prohibitive, which can occur even in moderately sized problems with severe information fragmentation, our algorithm is anytime (Dean and Boddy, 1988), meaning it can be interrupted to obtain non-sharp bounds that are nonetheless guaranteed to be valid. Our technique also offers an additional guarantee we term "ε-sharpness," indicating the worst-case looseness factor of the relaxed bounds relative to the unknown, completely sharp bounds. In Section 7, we provide two approaches for characterizing uncertainty in the estimated bounds, and we demonstrate our technique in a series of simulations in Section 8.
Our simulations, validated against previously derived analytic results where possible, show the flexibility of our approach and the ease with which assumptions can be modularly imposed or relaxed. Moreover, we demonstrate how the algorithm can uncover counterintuitive results: in one case, we show a scenario that appears to be partially identified is in fact point identified, improving over widely used bounds (Manski, 1990) and recovering a recent advance in the literature on nonrandom missingness (Miao et al., 2015).
Ours is a complete and computationally feasible approach to causal inference in discrete settings. Given a well-defined causal query, valid assumptions, and data, researchers now have a general and automated process to draw causal inferences that are guaranteed to be valid and, with sufficient computation time, provably optimal. As we discuss in Sections 9-10, our approach's modular nature also allows analysts to conduct principled robustness tests and sensitivity analyses that can identify the most promising avenues for future research, promote research transparency, and accelerate scientific discovery.

Related Literature
Researchers have long sought to automate causal identification by recasting causal queries as constrained optimization problems that can be solved computationally. Our work is most closely related to Balke and Pearl (1994, 1997), which showed that certain bounding problems in discrete settings, generally corresponding to causal systems in which outcomes and manipulated variables are fully observed, could be formulated as the minimization and maximization of a linear objective function subject to linear equality and inequality constraints. In these cases, causal bounding problems can be reformulated as linear programming problems, which admit both symbolic solutions and highly efficient numerical solutions. Subsequent studies have proven that in particular settings, the bounds produced by this technique are sharp (Ramsahai, 2012; Bonet, 2001; Heckman and Vytlacil, 2001), and Sachs et al. (2020) show this approach produces sharp bounds for any such linear problem. These results were extended by Geiger and Meek (1999), which showed that a much broader class of discrete problems can be formulated in terms of polynomial relations, at least when analysts have precise information about the kinds of disturbances or confounders that may exist, expressed in terms of latent-variable cardinalities. These discrete problems include not only the bounds studied in this paper, but also the related problem of determining what constraints on the main variables are implied by a causal graph. In addition to the well-known conditional independence constraints implied by d-separation, these can include generalized equality constraints (or Verma constraints; Verma and Pearl, 1990; Tian and Pearl, 2002). Beyond these equalities, the main variables are also constrained by generalizations of the instrumental inequalities (Pearl, 1995; Bonet, 2001). Geiger and Meek (1999) note that in theory, algorithms for quantifier elimination can provide symbolic solutions for these questions.
However, the time complexity of quantifier elimination is doubly exponential, rendering it infeasible for all but the simplest cases. At the core of this issue is that symbolic methods provide a general solution, meaning that they must explore the space of all possible inputs. In contrast, numerical methods such as our approach can often eliminate portions of the space that are irrelevant, accelerating computation.
Even so, computation can be time-consuming; polynomial programming is in general NP-hard. In practice, many optimizers are able to rapidly find reasonably good values but cannot guarantee optimality without exhaustively searching the space of candidates. This poses a challenge for obtaining causal bounds, which represent minimal and maximal values of the estimand under all models that are admissible, or consistent with observed data and modeling assumptions. If a local optimizer operates on the original problem (the primal), proceeding from the interior and widening bounds as more extreme models are discovered, then failing to reach global optimality will result in invalid bounds: ranges narrower than the optimal sharp bounds, which do not contain all possible solutions.
In this paper, we detail an approach that resolves this obstacle by allowing analysts to obtain valid bounds in limited time. At a high level, our approach is to reexpress causal inference problems in terms of principal strata (Frangakis and Rubin, 2002). To do so, we first present new results on lossless reductions for latent variables of unknown cardinality. We then show that causal estimands, modeling assumptions, and observed information can all be expressed in terms of polynomial expressions, equalities, and inequalities with no loss of information. We show how these systems can be simplified for computational efficiency, then develop an iterative primal-dual algorithm that searches for admissible models from the interior of the bounds (the primal problem) while simultaneously refining a guaranteed-valid outer envelope for the sharp bounds (the dual problem). Even when exhaustive search is computationally infeasible, suboptimal primal and dual values can still be found and improved over time. We show suboptimal dual points allow analysts to report valid loose bounds-those that are wider than the unknown sharp bounds. Our method also utilizes the suboptimal primal points, allowing analysts to assess the worst-case looseness factor of the reported valid bounds, relative to the unknown sharp bounds.

Preliminaries
In this section, we define notation and discuss concepts necessary to derive our key results. We first review how arbitrary directed acyclic graphs (DAGs) can be "canonicalized" without loss of information, resulting in an equivalent form with properties amenable to analysis (Evans, 2018). We then describe how graphs in this form give rise to potential outcomes and principal strata (Frangakis and Rubin, 2002), two key building blocks in our analytic strategy.
Suppose that for each i.i.d. unit $i \in \{1, \ldots, N\}$, the main variables of interest are contained in $\boldsymbol{V}_i = \{V_{i,1}, \ldots, V_{i,J}\}$, indexed by $j$. We will suppose that the sample space of each main variable, $S(V_{i,j})$, has finite cardinality. These variables may be either observed or unobserved; as we will show, it is often useful to reason about unobserved elements of $\boldsymbol{V}_i$ in the context of missing data and measurement error. We will also consider unobserved causal ancestors of $\boldsymbol{V}_i$, collectively denoted $\boldsymbol{U}_i = \{U_{i,1}, \ldots, U_{i,K}\}$ and indexed by $k$, that represent random disturbances or confounders. Without loss of generality, these disturbances, which have unknown, possibly infinite cardinality, are assumed to subsume all phenomena that are causally relevant to $\boldsymbol{V}_i$. As we show in Section 3.3, this assumption is without consequence, because even a continuous and infinite-dimensional $\boldsymbol{U}_i$ must still map down to the same finite canonical partitions that we describe there. In addition, we will make use of counterfactual random variables, which represent hypothetical versions of random variables in $\boldsymbol{V}_i$ that would have occurred had, contrary to fact, treatment variables been exogenously set to a specified value. (A more rigorous definition is given in Section 3.2.) By convention, bold letters denote collections of variables; uppercase and lowercase letters respectively denote random variables and their realizations. We will consider population distributions until discussing inference in Section 7.

Canonical DAGs
We now discuss how DAGs can be canonicalized, or distilled to minimal form, to clarify which aspects of the structural model can be ignored, greatly simplifying the bounding task. Suppose that causal relationships between all variables in $\boldsymbol{V}_i$ and $\boldsymbol{U}_i$ are represented by a directed acyclic graph (DAG) G. The nonparametric structural equations model (NPSEM) of a DAG states that each main variable $V_{i,j} \in \boldsymbol{V}_i$ is a deterministic function of its parents in the graph G, denoted $\mathrm{pa}(V_{i,j})$. That is, all factors determining $V_{i,j}$ are contained in $\mathrm{pa}(V_{i,j})$, a subset of $\boldsymbol{U}_i$ and $\boldsymbol{V}_i$. We denote the function mapping from $\mathrm{pa}(V_{i,j})$ to $V_{i,j}$ as $f_j$, and we use $f = \{f_1, \ldots, f_J\}$ to denote the collection of these structural equations, or the structural causal model of $\boldsymbol{V}$ (Pearl, 2009; Richardson and Robins, 2013). Note that each main variable may be influenced by multiple disturbances, and a single disturbance may influence multiple main variables.
A DAG is said to be in canonical form if (i) all disturbances are exogenous, i.e., no variable in $\boldsymbol{U}_i$ has any parents in G; and (ii) there exists no pair of disturbances, $U_{i,k}$ and $U_{i,k'}$, such that $U_{i,k}$ influences a subset of the variables influenced by $U_{i,k'}$. Evans (2018) showed that for any DAG G not in canonical form, there exists a canonical-form DAG G' with an identical distribution over all factual and counterfactual versions of all variables in $\boldsymbol{V}_i$. We can therefore, without loss of generality, limit our consideration to DAGs in canonical form. An example of a DAG not in canonical form is given in panel (a) of Figure 1; panel (b) illustrates the canonicalized version of this graph. For convenience, we will refer to the joint distribution over all factual and counterfactual versions of $\boldsymbol{V}_i$ as the full data law. Moreover, any DAG over $\boldsymbol{U}_i$, $\boldsymbol{V}_i$, and unobserved ancillary variables $\boldsymbol{W}_i$ with unknown cardinality (e.g., confounders or mediators not of direct interest) also has an equivalent canonical DAG with respect to this full data law. A guide for canonicalizing arbitrary DAGs is given in Appendix A.1.
In short, representing the causal graph in canonical form distills the data-generating process (DGP) to its simplest form, eliminating potentially complex networks of disturbances. Removing variables that are irrelevant to the causal goal further simplifies the structure. Without these simplifications, it would be exceedingly difficult, if not intractable, to convert causal problems into polynomial programs that can be readily optimized, which is the essence of the approach we develop below.

Figure 1: Canonicalization of a mediation model. Mediation DAGs in non-canonical form (panel a) and canonical form (panel b) that are fully equivalent with respect to their full data law. Unit indices, $i$, are suppressed. Canonicalization proceeds as follows: (i) the dependent disturbance $U_3$ is absorbed into its parent $U_{23}$; (ii) the superfluous $U_2$ is eliminated, as it influences a subset of $U_{23}$'s children; and (iii) the irrelevant $W_{13}$ is absorbed into the $V_1 \to V_3$ path, as it is neither observed nor of interest. A complete guide to canonicalization is given in Appendix A.1.
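As a small illustration of condition (ii) alone, the sketch below (function and variable names are ours; the complete procedure is the one in Appendix A.1) flags disturbances that canonicalization would absorb, given each disturbance's set of main-variable children:

```python
def redundant_disturbances(children):
    """Flag disturbances whose set of main-variable children is contained
    in another disturbance's children (canonical-form condition (ii)).

    children: dict mapping disturbance name -> set of main variables it
    points to. Ties between identical child sets are broken by name so
    that only one member of the pair is flagged for absorption.
    """
    redundant = set()
    for u, ch_u in children.items():
        for v, ch_v in children.items():
            if u == v:
                continue
            if ch_u < ch_v or (ch_u == ch_v and u > v):
                redundant.add(u)
    return redundant

# Figure 1(a), after absorbing the dependent disturbance U3 into U23:
# U2 influences a subset of U23's children, so it is superfluous.
print(redundant_disturbances({"U1": {"V1"}, "U2": {"V2"}, "U23": {"V2", "V3"}}))
# -> {'U2'}
```

This mirrors step (ii) of the Figure 1 caption: $U_2$ is absorbed because its children are a subset of $U_{23}$'s.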

Potential Outcomes
The notation of potential outcome functions allows us to compactly express the effects of manipulating a variable's parents or other ancestors. Let $\boldsymbol{A} \subset \boldsymbol{V}$ be a subset of variables that will be intervened upon, fixing them to $\boldsymbol{A} = \boldsymbol{a}$. When $\boldsymbol{A} = \emptyset$, so that no intervention occurs, then define $V_{i,j}(\boldsymbol{a}) = V_{i,j}$, the natural value. When $\boldsymbol{A} \subseteq \mathrm{pa}(V_{i,j})$, so that only immediate parents are manipulated, then unit $i$'s potential outcome function is given by its response function evaluated at the manipulated parent values. We will now define more general potential outcome functions by recursive substitution (Richardson and Robins, 2013; Shpitser, 2018). For arbitrary interventions on $\boldsymbol{A} \subseteq \boldsymbol{V}$,
$$V_{i,j}(\boldsymbol{a}) = f_j\big(\{a_{j'} : V_{i,j'} \in \mathrm{pa}(V_{i,j}) \cap \boldsymbol{A}\},\ \{V_{i,j'}(\boldsymbol{a}) : V_{i,j'} \in \mathrm{pa}(V_{i,j}) \setminus \boldsymbol{A}\}\big),$$
where $j'$ is a generic index that sweeps over main variables in the graph, and disturbance parents of $V_{i,j}$ are left implicit. This means that if a parent of $V_{i,j}$ is directly manipulated, it is set to the corresponding value in $\boldsymbol{a}$. Otherwise, the parent takes on its potential value after intervention on any causally prior variables, or its natural value otherwise. To obtain the parent's potential value, we follow the same definition recursively. When defining potential outcomes, intervention on $V_{i,j}$ itself is ignored. To illustrate, consider the mediation graph of Figure 1(b). Potential outcomes of interest include (i) $V_{i,2}(V_{i,1}{=}v_1)$, relating to effects of treatment on the mediator; (ii) $V_{i,3}(V_{i,1}{=}v_1)$, relating to total effects; and (iii) $V_{i,3}(V_{i,1}{=}v_1, V_{i,2}{=}v_2)$, relating to controlled effects.
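The recursive-substitution definition can be sketched directly in code. In this toy version of the Figure 1(b) mediation graph, the unit's response functions (a complier-type mediator and an XOR-type outcome) are hypothetical choices for illustration, not quantities from the paper:

```python
def potential_outcome(j, a, u, f, parents):
    """V_j(a) for one unit, defined by recursive substitution.

    j: target variable index; a: dict {index: forced value} describing the
    intervention; u: this unit's realized disturbances; f: dict of response
    functions taking (dict of parent values, u); parents: dict of parent lists.
    Interventions on V_j itself are ignored, per the definition in the text.
    """
    args = {}
    for p in parents[j]:
        # A manipulated parent takes its value in a; any other parent is
        # obtained by applying the same definition recursively.
        args[p] = a[p] if p in a else potential_outcome(p, a, u, f, parents)
    return f[j](args, u)

# Hypothetical unit in the Figure 1(b) mediation graph:
parents = {1: [], 2: [1], 3: [1, 2]}
f = {1: lambda pa, u: u["u1"],                                # V1 = u1
     2: lambda pa, u: pa[1] if u["u2"] == "complier" else 0,  # complier-type V2
     3: lambda pa, u: pa[1] ^ pa[2]}                          # XOR-type V3
u = {"u1": 1, "u2": "complier"}
print(potential_outcome(3, {}, u, f, parents),      # natural value of V3
      potential_outcome(3, {1: 0}, u, f, parents),  # total-effect world, V3(V1=0)
      potential_outcome(3, {2: 0}, u, f, parents))  # controlled world, V3(V2=0)
```

For this unit, the natural $V_{i,3}$ and $V_{i,3}(V_{i,1}{=}0)$ both equal 0, while fixing the mediator yields $V_{i,3}(V_{i,2}{=}0) = 1$, showing how the three kinds of potential outcomes can differ.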

Principal Stratification
Analysts have little information about the disturbances $\boldsymbol{U}_i$, which may take on an infinite number of values. This poses an analytic challenge, as it is difficult to reason about infinite spaces. Here, we review a result that makes the general partial identification problem tractable despite this issue: broadly, when $\boldsymbol{V}_i$ are discrete, the model that a DAG encodes can be represented by a finite number of parameters without loss of generality, as long as the reduced space is sufficiently large. We then introduce the functional parameterization used for this task, discuss its relationship to principal strata, and review how any marginal of the full data law can be represented in terms of these parameters.

Finkelstein et al. (2021) show that there are finite state spaces for $\boldsymbol{U}_i$ that do not restrict the NPSEM of a DAG for $\Pr(\boldsymbol{V}_i = \boldsymbol{v})$, i.e., the model over the factual main variables. In the following proposition, we extend this result to show that there are finite state spaces for $\boldsymbol{U}_i$ that do not restrict the NPSEM of a DAG for the full data law, i.e., the full distribution over all factual and counterfactual versions of the main variables.
Proposition 1. Suppose G is a canonical DAG over discrete main variables $\boldsymbol{V}_i$ and disturbances $\boldsymbol{U}_i$ with infinite cardinality. Then the model over the full data law implied by G is unchanged by assuming that the disturbances have finite cardinalities, provided those cardinalities are sufficiently large.
A proof can be found in Appendix B, along with details on how to obtain a lower bound on non-restrictive cardinalities for the disturbances.
Further, Evans (2018) showed that for a large class of graphs called geared graphs, it is possible to develop a functional model that does not alter the causal model of a DAG. In the functional model of a graph, each main variable $V_{i,j}$ is associated with a disturbance $U_{i,j}$ that fully determines how $V_{i,j}$ responds to the values of its remaining parents. Proposition 1 enables us to develop functional models for graphs that are not geared as well. Finkelstein et al. (2021) present an algorithm for constructing a concise functional model for non-geared graphs, taking as input a disturbance cardinality that is non-restrictive of the model over factual random variables. By instead substituting the Proposition 1 disturbance cardinality, which may be larger and restricts neither the factual nor the counterfactual random variables, we obtain a functional model that is likewise non-restrictive of the full data law. Intuitively, functional models are closely related to principal stratification (Greenland and Robins, 1986; Frangakis and Rubin, 2002). For example, consider the simple DAG in which binary $V_{i,1}$ causes binary $V_{i,2}$, with disturbances $U_{i,1} \to V_{i,1}$ and $U_{i,2} \to V_{i,2}$. This relationship is governed by the structural equations $V_{i,1} = f_1(U_{i,1})$ and $V_{i,2} = f_2(V_{i,1}, U_{i,2})$, where the functions $f_1 : S(U_{i,1}) \to S(V_{i,1})$ and $f_2 : S(V_{i,1}) \times S(U_{i,2}) \to S(V_{i,2})$ are deterministic and shared across all units. Thus, the only source of randomness is in the disturbances, $U_{i,1}$ and $U_{i,2}$. Analysts generally do not have direct information about these disturbances. For example, $U_{i,1}$ could potentially take on any value in $(-\infty, \infty)$. However, Proposition 1 states that this variation is irrelevant, because $V_{i,1}$ can only take on two possible values: 0 and 1. The space of $U_{i,1}$ can therefore be divided into two canonical partitions (Balke and Pearl, 1997), those that deterministically lead to $V_{i,1} = 0$ and those that lead to $V_{i,1} = 1$, and thus there is no loss of generality in treating $U_{i,1}$ as if it were binary. The situation with $V_{i,2}$ is similar but more involved.
After the random $U_{i,2}$ is realized, it induces the partially applied response function $V_{i,2} = f_2(V_{i,1}, U_{i,2}{=}u_2) = f_2^{(u_2)}(V_{i,1})$, which deterministically governs how $V_{i,2}$ counterfactually responds to $V_{i,1}$. Regardless of how many values the disturbance can take on, this response function must fall into one of only four possible groups, or principal strata, each corresponding to a possible function of the form $f_2^{(u_2)} : S(V_{i,1}) \to S(V_{i,2})$ (Angrist et al., 1996). These groups are (i) $V_{i,2} = 1$ regardless of $V_{i,1}$, "always takers"; (ii) $V_{i,2} = 0$ regardless of $V_{i,1}$, "never takers"; (iii) $V_{i,2} = V_{i,1}$, "compliers"; and (iv) $V_{i,2} = 1 - V_{i,1}$, "defiers". Thus, from the perspective of $V_{i,2}$, any finer-grained variation in $S(U_{i,2})$ beyond the canonical partitions is irrelevant. These partitions of $\boldsymbol{U}$ are in one-to-one correspondence with principal strata, allowing causal quantities to be expressed in simple algebraic expressions; e.g., the average treatment effect (ATE) in (1) is equal to the proportion of compliers minus the proportion of defiers. 5 By writing down all information in terms of (possibly unknown) strata sizes, we can convert causal inference problems into tractable polynomial programming problems over these variables.
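The four strata and the ATE identity can be checked mechanically; the strata proportions below are hypothetical:

```python
from itertools import product

# Enumerate all response functions f: {0,1} -> {0,1}; these are exactly
# the four principal strata of a binary V2 with a binary cause V1.
LABELS = {(0, 0): "never taker", (1, 1): "always taker",
          (0, 1): "complier", (1, 0): "defier"}
strata = {LABELS[r]: {0: r[0], 1: r[1]} for r in product([0, 1], repeat=2)}

def ate(pr):
    """ATE = sum over strata of E[V2(1) - V2(0) | stratum] * Pr(stratum),
    which reduces to Pr(complier) - Pr(defier)."""
    return sum((s[1] - s[0]) * pr[name] for name, s in strata.items())

# Hypothetical strata proportions (must sum to one):
pr = {"always taker": 0.3, "never taker": 0.2, "complier": 0.4, "defier": 0.1}
print(round(ate(pr), 10))  # -> 0.3, the complier share minus the defier share
```

Only the compliers and defiers contribute, since the two "taker" strata have zero unit-level effect.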
The functional parameterization of this graph has four free parameters: one for the binary $U_{i,1}$ (or its reduced representation) and three for the quaternary $U_{i,2}$. 6 Because the distributions of disturbances are independent in canonical DAGs by virtue of their exogeneity, only their marginal distributions need be parameterized. Each of $U_{i,1}$ and $U_{i,2}$ encodes full information about how $V_{i,1}$ and $V_{i,2}$, respectively, respond to their remaining parents. In other words, each setting of $\boldsymbol{U}_i$ provides full information not only about each variable in $\boldsymbol{V}_i$, but also about each of its potential outcomes. This means that we can represent "cross-world" distributions such as $\Pr[V_{i,2}(V_{i,1}{=}0) = 0,\ V_{i,2}(V_{i,1}{=}1) = 1]$, the "complier" proportion, in terms of parameters of the marginal distribution of $U_{i,2}$ alone. As we will see below, this fact will be useful in encoding cross-world-type assumptions like monotonicity, as well as for bounding cross-world targets like the natural direct effect or the probability of causation. More generally, any marginal of the full data law may be expressed in terms of the functional parameters.
Finally, consider a more complex example, the mediation DAG of Figure 1(b). The response functions for $V_{i,1}$ and $V_{i,2}$ remain unchanged. In contrast, $V_{i,3}$ is caused by $V_{i,1}$, $V_{i,2}$, and $U_{i,23}$, so that $V_{i,3} = f_3(V_{i,1}, V_{i,2}, U_{i,23})$. Substituting in a realization of the disturbance, $U_{i,23} = u_{i,23}$, will produce one of sixteen response functions of the form $f_3^{(u_{23})} : S(V_{i,1}) \times S(V_{i,2}) \to S(V_{i,3})$. More generally, the number of unique response functions grows with (i) the cardinality of the variable, (ii) the number of causal parents it has, and (iii) the parents' cardinalities. Specifically, $V_{i,j}$ has $|S(V_{i,j})|^{|S(\mathrm{pa}(V_{i,j}))|}$ possible response functions: given a particular input from $V_{i,j}$'s parents, the number of possible outputs for $V_{i,j}$ is $|S(V_{i,j})|$; the number of possible inputs from $V_{i,j}$'s parents is the product of the parents' cardinalities.

5 To see this, note that the ATE is given by $\sum_{\text{strata}} \mathrm{E}[V_{i,2}(V_{i,1}{=}1) - V_{i,2}(V_{i,1}{=}0) \mid \text{strata}] \cdot \Pr(\text{strata}) = 0 \cdot \Pr(\text{always taker}) + 0 \cdot \Pr(\text{never taker}) + 1 \cdot \Pr(\text{complier}) - 1 \cdot \Pr(\text{defier})$.

6 These can be thought of as the probabilities of encountering a unit of the "control" type with $V_{i,1} = 0$ (for $U_{i,1}$) and of encountering units of the "always-taker," "never-taker," and "complier" types (for $U_{i,2}$). These parameters determine the probabilities of the remaining types (the "treatment" type for $U_{i,1}$ and the "defier" type for $U_{i,2}$), as principal strata probabilities must sum to unity.
In turn, this determines the minimal cardinality of $\boldsymbol{U}$ in a reduced but non-restrictive functional model; roughly speaking, the number of principal-strata combinations that exist, if we think of $\boldsymbol{U}$ as principal strata. Here, "non-restrictive" means that the simplified model is fully expressive, or that it can represent any possible full data law. For example, to capture the joint response patterns that a unit may have on $V_{i,2}$ and $V_{i,3}$, a reduced version of $U_{i,23}$ will be capable of expressing any full data law if it has a cardinality of $|S(U_{23})| = 4 \times 16$, because $V_{i,2}$ has four possible response functions and $V_{i,3}$ has sixteen.
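A minimal sketch of this counting rule (the function name is ours):

```python
from math import prod

def n_response_functions(card_v, parent_cards):
    # |S(V)| ** |S(pa(V))|: one output choice per joint parent setting.
    return card_v ** prod(parent_cards)

# Binary V2 with one binary parent: the four classic strata.
assert n_response_functions(2, [2]) == 4
# Binary V3 with two binary parents (Figure 1b): sixteen response functions.
assert n_response_functions(2, [2, 2]) == 16
# A reduced U_23 jointly driving V2 and V3 needs cardinality 4 x 16 = 64.
assert n_response_functions(2, [2]) * n_response_functions(2, [2, 2]) == 64
print("ok")
```

The counts grow very quickly with cardinalities and parent counts, which is one source of the computational difficulty discussed later.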

Formulating the Polynomial Program
We now turn to the central problem of this paper: sharply bounding causal quantities with missing data. Our approach is to (i) rewrite the causal query into a polynomial expression, (ii) rewrite modeling assumptions and empirical information into polynomial constraints, and (iii) thereby transform the task into a constrained optimization problem that can be solved computationally. Sharp bounds are the narrowest range that contains all admissible values for a target quantity, i.e., all values that are consistent with information available to the analyst: structural causal knowledge in the form of a canonical DAG, G, as well as empirical evidence, E, and modeling assumptions, A, formalized below. We also suppose that the main variables take on values in a known, discrete set, $S = S(\boldsymbol{V})$. In this section, we will demonstrate (i) that $\{G, E, A, S\}$ restricts the admissible values of the target quantity, and (ii) that this range of observationally indistinguishable values can be recovered by polynomial programming.
The causal graph and sample space, G and $S$, together imply a set of possible functional models, each fully characterizing the main variables. By Proposition 1, without loss of generality, we can consider a simple functional model in which (i) each counterfactual main variable is a deterministic function of exogenous, discrete disturbances; (ii) there are a relatively small number of such disturbances; and (iii) disturbances take on a finite number of possible values, corresponding to principal strata of the main variables. When repeatedly sampling units (along with each unit's random disturbances, $\boldsymbol{U}_i$), the $k$-th disturbance thus follows a categorical distribution with parameters $\{\Pr(U_{i,k} = u_k) : u_k \in S(U_{i,k})\}$; we collect these parameters across disturbances as $P_U$. By the properties of canonical DAGs, these disturbances are independent. It follows that the parameters $P_U$ of the joint disturbance distribution $\Pr(\boldsymbol{U}_i = \boldsymbol{u}_i) = \prod_k \Pr(U_{i,k} = u_{i,k})$ not only fully determine the distribution of each factual main variable, i.e., the potential outcome under no intervention, $V_{i,j}(\emptyset)$; they also determine the counterfactual distribution of $V_{i,j}(\boldsymbol{a})$ under any intervention $\boldsymbol{a}$, and its joint distribution with other counterfactual variables $V_{i,j'}(\boldsymbol{a}')$ under possibly different interventions $\boldsymbol{a}'$. This leads to the following proposition.
Proposition 2. Suppose G is a canonical DAG and $\mathcal{C} = \{V_{i,\ell}(\boldsymbol{a}_\ell) = v_\ell\}$ is a set of counterfactual statements, indexed by $\ell$, asserting that variable $V_{i,\ell}$ will take on value $v_\ell$ under manipulation $\boldsymbol{a}_\ell$. Let $\mathcal{U}' \subset S(\boldsymbol{U})$ indicate the subset of disturbance realizations that lead deterministically to every statement in $\mathcal{C}$ being satisfied. Then under the structural equation model of G, $\Pr(\mathcal{C}) = \sum_{\boldsymbol{u} \in \mathcal{U}'} \Pr(\boldsymbol{U} = \boldsymbol{u})$, which is a polynomial equation in $P_U$, the parameters of $\Pr(\boldsymbol{U} = \boldsymbol{u})$.
For example, in the mediation setting of Figure 1(b), Proposition 2 implies that the joint distribution of the factual variables, $V_{i,1}(\emptyset)$, $V_{i,2}(\emptyset)$, and $V_{i,3}(\emptyset)$, is given by a polynomial in $P_U$: each probability $\Pr(V_{i,1}{=}v_1, V_{i,2}{=}v_2, V_{i,3}{=}v_3)$ equals the sum of $\Pr(U_{i,1}{=}u_1) \Pr(U_{i,23}{=}u_{23})$ over all disturbance realizations whose response functions produce $(v_1, v_2, v_3)$. Alternatively, analysts may be interested in the probability that a randomly drawn unit $i$ has a positive controlled direct effect when fixing the mediator to $V_{i,2} = 0$. This is given by $\Pr[V_{i,3}(V_{i,1}{=}1, V_{i,2}{=}0) - V_{i,3}(V_{i,1}{=}0, V_{i,2}{=}0) > 0]$, again a polynomial in $P_U$ by Proposition 2. We now expand this result to include a large class of functionals of marginal probabilities and logical statements about these functionals.
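To see the polynomialization concretely, here is a brute-force sketch for the canonical mediation graph, enumerating canonical partitions directly (the encodings and names are ours, and the disturbance distributions in the usage note are placeholders):

```python
from itertools import product

# Canonical partitions for the Figure 1(b) graph: U1 fixes V1's value, and
# U23 fixes a joint response pattern (r2, r3) for V2 and V3 (4 x 16 = 64
# strata, matching the cardinality discussed in the text).
R2 = list(product([0, 1], repeat=2))   # r2[v1] gives V2
R3 = list(product([0, 1], repeat=4))   # r3[2*v1 + v2] gives V3

def factual_prob(v1, v2, v3, p_u1, p_u23):
    """Pr(V1=v1, V2=v2, V3=v3): a sum over the strata whose response
    functions produce (v1, v2, v3), i.e., a polynomial (here, degree two)
    in the disturbance parameters p_u1 and p_u23."""
    total = 0.0
    u1 = v1                      # f1(u1) = u1, so only u1 = v1 contributes
    for i2, r2 in enumerate(R2):
        if r2[v1] != v2:
            continue
        for i3, r3 in enumerate(R3):
            if r3[2 * v1 + v2] != v3:
                continue
            total += p_u1[u1] * p_u23[i2, i3]
    return total
```

Under placeholder uniform disturbance distributions, the eight factual cells each receive probability 0.125 and sum to one, a quick consistency check on the enumeration.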
Corollary 1. Suppose G is a canonical DAG. Let $P_V$ denote the full data law and $g(P_V)$ denote a functional of $P_V$ involving elementary arithmetic operations on constants and marginal probabilities of $P_V$. Then $g(P_V)$ can be expressed as a polynomial fraction in the parameters of $P_U$, $h(P_U)$, by replacing each marginal probability with its Proposition 2 polynomialization.
We say functionals of the full data law that fulfill these properties are polynomial-fractionalizable, or simply polynomializable if the result contains no fractions. The corollary has a number of implications, which we briefly discuss here. First, it demonstrates that a wide array of single-world and cross-world functionals can be expressed as polynomial fractions. These include traditional targets such as the ATE, as well as more complex targets such as the pure direct effect and the probability of causal sufficiency. It also suggests that any non-elementary functional of $P_V$ can be approximated to arbitrary precision by a polynomial fraction, provided the functional has a convergent power series. Next, observe that when (i) $g(P_V)$ is a polynomial-fractionalizable expression; (ii) $\diamond \in \{<, \leq, =, \geq, >, \neq\}$ is a binary comparison operator; and (iii) $\alpha$ is a constant, then statements of the form $g(P_V) \diamond \alpha$ can be equivalently expressed as non-fractional polynomial relations $h(P_U) \diamond 0$. Finally, by the same token, any polynomial-fractional expression $h(P_U)$ in the parameters of $P_U$ can be reexpressed with (i) a non-fractional polynomial in an expanded parameter space and (ii) a polynomial equation in the same expanded space. We will make extensive use of these properties to convert causal queries to polynomial programs.
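As a worked instance of this fraction-elimination step (our choice of functional; one of several equivalent encodings), take a conditional probability, which is a ratio of two Proposition 2 polynomials:

```latex
g(P_V) \;=\; \Pr(V_2 = 1 \mid V_1 = 1)
       \;=\; \frac{\Pr(V_1 = 1,\, V_2 = 1)}{\Pr(V_1 = 1)}
       \;=\; \frac{h_1(P_U)}{h_2(P_U)}.
% A constraint on g can then be cleared of its denominator:
g(P_V) \ge \alpha
\;\Longleftrightarrow\;
h_1(P_U) - \alpha\, h_2(P_U) \ge 0
\quad \text{(given } h_2(P_U) > 0\text{)}.
% Alternatively, introduce an auxiliary parameter t with the polynomial
% equation
t \, h_2(P_U) \;=\; h_1(P_U),
% so that t stands in for the fraction in an expanded parameter space.
```

The first route yields the non-fractional relation $h(P_U) \diamond 0$ directly; the second is the expanded-parameter-space construction referenced above.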
In Appendix A, Algorithm 1 provides a step-by-step procedure for obtaining sharp bounds. We begin by transforming a factual or counterfactual target of inference $T$ into polynomial form, possibly with the use of additional auxiliary variables to eliminate fractions. To accomplish this task, the procedure utilizes the possibly non-canonical DAG $G$ and the possible main-variable outcomes $S(V)$ to reexpress $T$ in terms of functional parameters that correspond to principal strata proportions. The result is the objective function of the polynomial program. The procedure then polynomializes the sets of constraints on polynomializable functionals resulting from empirical evidence and from modeling assumptions, respectively $E$ and $A$. For example, in the binary mediation setting of Figure 1, $G$ may be the graph depicted in either panel (a) or (b). If only observational data is available, then $E$ consists of eight pieces of evidence, each represented as a statement corresponding to a cell of the factual distribution $\Pr\{V_{i,1}(\emptyset) = v_1, V_{i,2}(\emptyset) = v_2, V_{i,3}(\emptyset) = v_3\}$. Modeling assumptions include all other information, such as monotonicity or dose-response assumptions; these can be expressed in terms of principal strata. For example, the assumed unit-level monotonicity of the $V_1 \to V_2$ relationship (e.g., the "no defiers" assumption of Angrist et al., 1996) can be written as the statement that $\Pr\{V_{i,2}(V_{i,1}=0) = 1, V_{i,2}(V_{i,1}=1) = 0\} = 0$. Finally, the statement that each disturbance $U_k$ follows a categorical probability distribution is reexpressed as the polynomial relations $\Pr(U_k = u_k) \geq 0$ for all $u_k$, and $\sum_{u_k} \Pr(U_k = u_k) = 1$.
Algorithm 1 produces an optimization problem with a polynomial objective subject to polynomial constraints. This polynomial programming problem is equivalent to the original causal bounding problem, which leads directly to the following theorem.
Theorem 1. Minimizing (maximizing) the polynomial program produced by Algorithm 1 yields the sharp lower (upper) bound on the target of inference $T$.
Once the causal problem is expressed in polynomial form, a variety of computational solvers can in principle be used to optimize it (e.g., IPOPT; Wächter and Biegler, 2006). However, local solvers cannot guarantee valid bounds without exhaustively searching the space; when time is limited, they often fail to discover global extrema for the causal estimand, producing intervals that may fail to contain the quantity of interest. Moreover, such approaches often become computationally intractable as causal problems grow complex. In the next section, we show how the polynomial program can be simplified to speed computation.
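To make the reduction concrete, the following sketch (our construction, with hypothetical numbers) encodes the simplest confounded-ATE problem as a program over principal-strata proportions $q(x, y_0, y_1)$. In this special case the program happens to be linear, so `scipy.optimize.linprog` suffices; richer queries yield genuinely polynomial programs.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# hypothetical observed joint distribution Pr(X = x, Y = y)
p_obs = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# model parameters: q(x, y0, y1) = Pr(X = x, Y(0) = y0, Y(1) = y1)
strata = list(itertools.product([0, 1], repeat=3))

# empirical-evidence constraints E: q must marginalize to each observed cell
A_eq, b_eq = [], []
for (x, y), prob in p_obs.items():
    A_eq.append([1.0 if sx == x and (y1 if x == 1 else y0) == y else 0.0
                 for (sx, y0, y1) in strata])
    b_eq.append(prob)

# objective: ATE = sum over strata of q * (y1 - y0)
c = np.array([y1 - y0 for (_, y0, y1) in strata], dtype=float)
box = [(0.0, 1.0)] * len(strata)  # simplex implied by equality constraints

lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=box).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=box).fun
```

For these numbers the solver returns the classic no-assumptions bounds $[-0.3, 0.7]$, whose width is always one in this model.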

Simplifying the Polynomial Program
Because solving polynomial programs is in general NP-hard, efficient computation requires us to fully exploit our knowledge of the problem structure. This knowledge allows analysts to reduce the complexity of the program in ways that algebraic presolvers may not necessarily detect. In this section, we discuss several ways to do this.

Eliminating Blocked Disturbances
We begin by using the following observation to limit the number of disturbance distribution parameters involved in the target and constraints.
Proposition 3. Consider the polynomialization of a probability $\Pr(C)$, where $C = \{V_{i,\ell}(a_\ell) = v_\ell : \ell\}$. We say that for intervention $a_\ell$, a disturbance $U_{i,k}$ is blocked from the corresponding counterfactual $V_{i,\ell}$ if there are no paths from $U_{i,k}$ to $V_{i,\ell}$ that do not go through the causally prior members of the intervention $a_\ell$. When $U_{i,k}$ is blocked from $V_{i,\ell}$ for every $\ell$, the corresponding parameters $P_{U_k}$ can be eliminated from the polynomialization.
In other words, Proposition 3 states that each main variable $V_{i,j}$ is only a function of its ancestors in $U$ that affect it through a variable not under intervention. For each marginal probability of an event, the disturbances that do not affect any variable in the event are irrelevant. This allows us to amend the polynomialization of Proposition 2 so that the outer sum ranges only over the possible settings of relevant disturbances, reducing the degree of each term in the polynomial. For example, in the mediation graph of Figure 1(b), consider the total effect of the treatment $V_{i,1}$ on the outcome $V_{i,3}$. Here, all probabilities are of the form $\Pr\{V_{i,3}(V_{i,1} = v_1) = v_3\}$. The disturbance $U_{i,1}$ is therefore blocked from the outcome $V_{i,3}$, because the sole path from $U_{i,1}$ to $V_{i,3}$ goes through the intervention set $V_{i,1}$. This means that whenever $\Pr(U_1 = u_1)$ appears in the polynomial, it does so in a way that ensures $\sum_{u_1} \Pr(U_1 = u_1) = 1$ can be factored out and eliminated.
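The factoring step can be checked symbolically. In this toy sketch (ours), the blocked disturbance's parameters enter only through the sum $p_0 + p_1$, which the simplex constraint sets to one:

```python
import sympy as sp

p0, p1, q = sp.symbols("p0 p1 q", nonnegative=True)

# toy polynomialization in which the blocked disturbance's parameters
# p0, p1 only ever multiply the same remaining factor q
expr = p0 * q + p1 * q
factored = sp.factor(expr)            # q*(p0 + p1)
reduced = factored.subs(p0 + p1, 1)   # apply the simplex constraint
assert reduced == q
```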

Exploiting the Nested Markov Parameterization
We now consider the common case when the empirical evidence $E$ includes single-world marginal distributions. This occurs when (i) a factual or counterfactual event $\{V_{i,\ell}(a) = v_\ell\}$ involves the same intervention, $a$, for every variable of interest; and (ii) the probability $\Pr\{V_{i,\ell}(a) = v_\ell\}$ is observed for every event in that state space, $\{v_\ell : \ell\} \in S(V_{i,\ell})$. For example, in the binary mediation setting of Figure 1(b), an observational study might obtain information about $\Pr\{V_{i,1}(\emptyset) = v_1, V_{i,2}(\emptyset) = v_2, V_{i,3}(\emptyset) = v_3\}$ for each of the eight settings of $(v_1, v_2, v_3)$.
Similarly, an experiment that randomly manipulated $V_{i,1}$ would obtain two such distributions, one for each assigned value of $V_{i,1}$. A naïve parameterization might use one parameter for the probability of each principal stratum (e.g., four parameters for the proportions of "always takers," "never takers," "compliers," and "defiers"). It is immediately apparent that one naïve parameter is redundant, as its value is already implied by the fact that each distribution must sum to unity. Hidden-variable DAG models imply additional equality constraints, each of which can likewise be used to reduce the number of parameters needed to describe the model, thereby reducing the number of polynomial constraints in the program.
These additional equality constraints, which are implied by the structural equations model of $G$, are well understood. In particular, $G$ imposes certain conditional independence and generalized equality constraints (or Verma constraints; Verma and Pearl, 1990; Tian and Pearl, 2002) on these distributions. Each equality constraint can be used to eliminate one parameter of the single-world distribution. The parameterization that takes full advantage of these structural equality constraints to reduce the number of parameters is called the nested Markov parameterization (Evans et al., 2019). This parameterization achieves the minimal number of parameters, equal to the dimension of the model of $G$.
Each nested Markov parameter is exactly equal to an identified marginal probability of a single-world event, $\Pr\{V_{i,\ell}(a_\ell) = v_\ell\}$. Because each of the nested Markov parameters is identified, whether directly or indirectly, from the initial single-world distribution, $\Pr\{V_{i,\ell}(a) = v_\ell\}$, it can be calculated directly from the empirical evidence $E$. By Proposition 2, this probability remains polynomializable in the parameters $P_U$. This allows us to add one equality constraint to the program per nested Markov parameter, based on its polynomialization, rather than one equality constraint per outcome in the state space. For hidden-variable DAGs that imply a large number of equality constraints, this can substantially reduce the number of constraints. Evans et al. (2019) offers a complete guide to obtaining nested Markov formulations of arbitrary single-world distributions.
Each parameter in the nested Markov formulation is the probability of a single-world event that involves fewer main variables and more interventions, compared to any event used by the naïve parameterization. As a result, the corresponding polynomialization will have fewer terms, of lower degree, than the polynomialization of naïve parameters. An example is provided in Appendix A.4. Using this technique, we modify Algorithm 1 by partitioning the empirical evidence into all single-world marginal distributions E M and the remaining evidence E R . The constraints in E M can then be reduced into their nested Markov form before polynomialization. Because the nested Markov parameterization allows for fewer, simpler polynomial constraints in the program, it is important to use it whenever the empirical evidence permits.
We note that when certain deterministic relationships exist between variables in V i , as in the missing-data setting of Figure 5(c-d), 10 these relationships may imply equality constraints not exploited by the nested Markov parameterization. In such cases, it may be possible to further reduce the number of constraints; we do not explore that option here.

Eliminating Additional Constraints and Parameters
Finally, we describe when constraints and parameters can be safely eliminated from a program. We say that parameters $x$ and $y$ co-occur in a polynomial system if they appear in the same constraint; they interact if there exists a sequence of parameters from $x$ to $y$ such that every adjacent pair co-occurs. 11 If a constraint's parameters do not interact with the objective's parameters, that constraint may be dropped. If a parameter appears only in constraints that have been eliminated, then the parameter can also be eliminated, simplifying the system. This may be used in conjunction with the structure of $G$ to help simplify the program, because different districts-components of $G$ connected by bidirected arcs (Tian and Pearl, 2002; Richardson, 2003)-typically do not interact. That is, likelihoods on marginal distributions of $G$ have a representation that is decomposable by districts. For example, in Figure 1(b), $V_{i,1}$ lies in one district; in contrast, $V_{i,2}$ and $V_{i,3}$ lie in another district, because they are connected by $U_{i,23}$. Because each nested Markov parameter is the probability of a single-world event involving main variables within a single district, its polynomialization will at most involve disturbance parameters in that district. This leads to the following proposition.
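The interaction relation is simply transitive co-occurrence, so droppable constraints can be found with a union-find pass; a sketch (ours), using the footnote's example plus one extra constraint:

```python
def reachable_params(constraints, objective_params):
    """Parameters that interact (transitively co-occur) with the objective."""
    # union-find over parameters, joining all parameters within a constraint
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cons in constraints:
        params = list(cons)
        for a, b in zip(params, params[1:]):
            parent[find(a)] = find(b)      # union co-occurring parameters

    roots = {find(p) for p in objective_params}
    return {p for p in parent if find(p) in roots} | set(objective_params)

# footnote-11 example: constraints x+y=a and y+z=b; objective involves x
cons = [{"x", "y"}, {"y", "z"}, {"w", "v"}]
keep = reachable_params(cons, {"x"})
assert keep == {"x", "y", "z"}               # x and z interact through y
relevant = [c for c in cons if c & keep]
assert relevant == [{"x", "y"}, {"y", "z"}]  # the w-v constraint is dropped
```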
Proposition 4. The degree of each polynomial in the nested Markov constraints is bounded from above by the number of latent variables in the corresponding district. Moreover, if two disturbances $U_{i,k}$ and $U_{i,k'}$ appear in different districts, their parameters $P_{U_k}$ and $P_{U_{k'}}$ will not interact in any nested Markov constraint.
To illustrate, consider the common scenario where an analyst observes the full joint distribution over factual variables, $\Pr\{V_{i,1}(\emptyset) = v_1, \ldots, V_{i,J}(\emptyset) = v_J\}$, and seeks to bound a functional relating a treatment $a$ to an outcome $V_{i,j}(a)$ in the same district. As an example, in Figure 1(b), the effect of the mediator $V_{i,2}$ on the outcome $V_{i,3}$ is wholly contained within a single district. We can therefore drop all constraints related to nested Markov parameters involving other districts, and thus all disturbance parameters in other districts.
10 In this graph, a latent variable Y has an observed version Y * that deterministically inherits Y * = Y when a reporting variable R = 1, but takes on the missing-value indicator Y * = NA otherwise.
11 For example, consider the constraints $x + y = a$ and $y + z = b$. Here, $x$ and $y$ co-occur; $x$ and $z$ interact.

Computing ε-sharp Bounds in Polynomial Programs
We now turn to the practical optimization of the polynomial program defined by Algorithm 1. Theorem 1 states that minimization and maximization of this original primal program are equivalent to the initial bounding problem. However, obtaining globally optimal solutions in polynomial programming can be computationally intensive. Worryingly, methods that iteratively improve suboptimal values for the primal problem may fail to produce valid bounds (i.e., bounds containing all possible values of the estimand, including global extrema) without searching the full parameter space, $P$. To address this challenge, we use dual methods that construct and iteratively refine an outer envelope around the primal function (i.e., the objective function, or causal quantity of interest). Specifically, we employ a variation of the spatial branch-and-bound method, combined with a piecewise linear envelope, implemented using optimization frameworks that include SCIP and Couenne (Vigerske and Gleixner, 2018; Gamrath et al., 2020; Belotti et al., 2009). Throughout the optimization process, current suboptimal values for the dual minimization and maximization problems are guaranteed to produce valid but loose outer bounds; current suboptimal values for the primal problem produce possibly invalid inner bounds; and the lower (upper) endpoint of the unknown sharp bounds is guaranteed to lie between the current suboptimal primal and dual minimization (maximization) values. Through simultaneous primal-dual optimization, we use these suboptimal inner bounds to precisely quantify the worst-case looseness, $\epsilon$, of the suboptimal but valid outer bounds. This allows researchers to assess how additional computation may tighten their conclusions. A step-by-step description of our optimization procedure, which we term $\epsilon$-sharp bounding, is given in Algorithm 2 of Appendix A. At a high level, it proceeds as follows.
Our procedure takes as inputs the polynomialized objective function $T(p)$ and constraint set $C(p)$, obtained from Algorithm 1. It then evaluates a range of models, or points $p$ in the model space $P$ for which $C(p)$ is satisfied, seeking to identify extreme values of $T(p)$ within this subspace. It also accepts two parameters: $\epsilon_{\text{thresh}}$, a stopping threshold for the looseness factor, and $\theta_{\text{thresh}}$, a stopping threshold for the width of the bounds. The algorithm returns two types of information: the bounds for the causal program, and the worst-case looseness factor $\epsilon$.
Primal bounds are denoted $\underline{P}$ and $\overline{P}$, adopting the convention that underlines refer to objects used for minimization and overlines to objects used for maximization. These indicate the most extreme values of the target estimand located so far in any admissible model-that is, any model satisfying $C(p)$. They are initialized at $+\infty$ and $-\infty$, respectively, indicating that no admissible models have been found yet. As optimization proceeds, the primal bounds improve as new, more extreme admissible models are found. We refer to $[\underline{P}, \overline{P}]$ as the inner bounds: the unknown sharp bounds must at least contain these points, which correspond to models that are observationally indistinguishable from the true DGP.
Dual optimization begins by partitioning the parameter space into branches, proceeding separately for the lower and upper bound and respectively producing partitions $\{\underline{B}_b\}$ and $\{\overline{B}_b\}$. At initialization, these consist of a single branch spanning the entire parameter space; each branch is then recursively divided. The lower and upper parts of the dual envelope, or outer envelope, are denoted $\underline{D}$ and $\overline{D}$. These are piecewise linear functions, with pieces corresponding to the branching partitions, that relax the true objective function $T(p)$ from below and above; the relaxations are constructed so that they always envelop the entire objective function at all points in the parameter space. Within branch $b$, the value $\min\{\underline{D}_b(p) : p \in \underline{B}_b\}$ indicates the lowest value attained by the lower envelope on that branch; thus $\underline{T} = \min_b \min\{\underline{D}_b(p) : p \in \underline{B}_b\}$ represents the lowest value attained by the lower envelope anywhere in the parameter space. Conversely, $\overline{T} = \max_b \max\{\overline{D}_b(p) : p \in \overline{B}_b\}$ represents the highest value of the upper envelope. These extreme points on the dual envelope, $[\underline{T}, \overline{T}]$, define the dual (outer) bounds. These are the reported causal bounds; whatever the true sharp bounds, they must lie inside the dual bounds, even if the algorithm has not run to completion. We let $\theta$ equal the bound width, or the difference between the upper and lower dual bounds, and we define the worst-case looseness factor $\epsilon$ as the slack (the difference between the dual and primal bound widths) divided by the primal bound width.
The algorithm heuristically selects branches in the model space that appear promising, refining primal and dual bounds in turn. It first searches within the branch for an admissible model; if one is found, and if the associated causal estimand is more extreme than those previously encountered, it is stored as a new primal bound. Whatever the true nonparametric sharp bounds, they must lie outside the primal bounds, because they must contain the extreme models that define the primal bounds. The algorithm then divides the branch into sub-branches and refines the dual envelope by tightening the piecewise linear outer approximation. Throughout, it prunes branches of $\{\underline{B}_b\}$ and $\{\overline{B}_b\}$ that wholly violate constraints, and it continues branching and refining the bounds while $\theta$ and $\epsilon$ exceed their specified thresholds.
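A stripped-down, one-dimensional sketch (ours; the paper's implementation relies on SCIP/Couenne) of the primal-dual logic: each branch carries a valid lower relaxation, the incumbent primal value prunes branches, and the primal-dual gap certifies worst-case looseness at any stopping point:

```python
import heapq

def f(p):
    """Toy primal objective (stands in for the causal estimand T(p))."""
    return p**4 - p**2

def lower_env(a, b):
    """A crude but valid lower relaxation of f on the branch [a, b]."""
    return a**4 - b**2

def branch_and_bound(a=0.0, b=1.0, tol=1e-4, max_iter=100_000):
    primal = min(f(a), f(b))            # best admissible value found so far
    heap = [(lower_env(a, b), a, b)]    # branches keyed by their dual bound
    dual = heap[0][0]
    while heap and max_iter > 0:
        max_iter -= 1
        dual, lo, hi = heapq.heappop(heap)
        if primal - dual <= tol:        # primal-dual gap certifies looseness
            break
        mid = (lo + hi) / 2.0
        primal = min(primal, f(mid))    # possibly a new, more extreme primal
        for left, right in ((lo, mid), (mid, hi)):
            d = lower_env(left, right)
            if d <= primal:             # keep only branches that could improve
                heapq.heappush(heap, (d, left, right))
    return dual, primal                 # dual <= true minimum <= primal

dual, primal = branch_and_bound()
```

The true minimum of $f$ on $[0,1]$ is $-1/4$ at $p = 1/\sqrt{2}$; the returned dual and primal values sandwich it, and the gap plays the role of the looseness certificate.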

Statistical Inference
We now turn to statistical inference for the bounds developed above. We say that the results of Algorithm 2 when applied to E, the population empirical constraints-i.e., margins of the full data law that are observed without sampling error-are population bounds. In practice, the empirical quantities used in these constraints are estimated from finite samples. Our goal in this section is to account for variation in Ê, the estimated constraints, that arises over repeated sampling. The results of Algorithm 2 when substituting Ê for E are referred to as the estimated bounds. Below, we describe how to construct confidence bounds that (i) contain the estimated bounds and (ii) contain the population bounds at a rate of at least the confidence level α over repeated samples.
Recall that each element of the empirical evidence E is a relation between (i) some population quantity that is an observable functional of the main variables' distribution, $g(P_V)$, reexpressed in terms of the disturbance distribution $P_U$; and (ii) the population value of that observable quantity. In Ê, we plug in for (ii) the estimated value of the quantity in finite data. For example, in the mediation graph of Figure 1(b), an analyst with access to a sample of observational data would replace each population cell probability with its sample analogue. We will refer to the vector of estimated quantities on the right-hand side of the Ê elements-in the above example, quantities of the form $\frac{1}{N}\sum_{i=1}^N 1(V_{i,1} = v_1, V_{i,2} = v_2, V_{i,3} = v_3)$-simply as Ê. We denote the corresponding population quantities as E, the right-hand side values in E.
To construct confidence bounds, we consider the sampling variability of these estimated quantities. We construct regions, $CR_\alpha(\hat{E})$, containing Ê and guaranteed to contain the population quantities E with probability at least α over repeated samples. These regions correspond to population distributions over observed parameters that cannot be rejected at level α. In Algorithm 2, we then replace the Ê constraints with a set of loosened confidence constraints derived from $CR_\alpha(\hat{E})$. In other words, if the population bounds are obtained by optimizing subject to an equality constraint $\{g_\ell(P_V) = E_\ell\} \in E$, and the estimated bounds are obtained with the plug-in version $\{g_\ell(P_V) = \hat{E}_\ell\} \in \hat{E}$, then the confidence bounds incorporate the interval constraint $g_\ell(P_V) \in CR_\alpha(\hat{E})_\ell$.
Because loosening Ê to $CR_\alpha(\hat{E})$ can only decrease (increase) the minimum (maximum) value obtained by the polynomial program, confidence bounds always contain the estimated bounds. Similarly, when the confidence region for the estimated quantities fully contains their population analogues, the confidence constraint is looser than the population constraint, and the resulting confidence bounds also contain the population bounds. Even when the confidence region fails to cover some population quantity due to sampling error, the confidence bounds may still contain the population bounds; this can occur if the noncovered quantity corresponds to a constraint that is irrelevant to the bounds. Therefore, if the confidence region on the observed quantities has coverage of exactly α, confidence bounds will contain the population bounds in at least a proportion α of repeated samples.
In discrete settings, the task of obtaining confidence bounds thus reduces to the problem of constructing regions $CR_\alpha(\hat{E})$ for the multinomial proportion, such that $\Pr\{E \in CR_\alpha(\hat{E})\} \geq \alpha$. We focus on two methods for doing so. Drawing on Malloy et al. (2020), we first consider a "Bernoulli-KL" approach that constructs separate confidence regions for each observable atomic event, $\Pr(V_i = v)$, treating it as a "success" in a Bernoulli distribution. The approach rotates through all possible $v$ and combines the event-specific regions using a result on the Kullback-Leibler divergence of sampling distributions from the underlying population distribution. The Bernoulli-KL method produces a confidence region for single-world distributions that is guaranteed to have conservative coverage for the multinomial proportion in finite samples. The region can be represented as a system of linear inequality constraints, then incorporated into the polynomial program. Our second approach uses an asymptotically valid confidence region based on the multivariate Gaussian limiting distribution of the multinomial proportion (Bienaymé, 1838), which can be represented as a single convex quadratic inequality constraint. Figure 2 visualizes these regions for a simple two-node graph; simulations reported in Section 8.2 evaluate coverage of the methods for various sample sizes. Appendix C provides details on the implementation of these methods. We also provide a method for polynomializing arbitrary confidence regions, allowing analysts to exploit tighter finite-sample confidence regions.

Figure 2: Polynomial confidence regions in a binary graph. Panel (a) presents a causal graph in which binary X causes binary Y, but both are confounded by an unobserved U. N = 1,000 observations are drawn from this DGP, producing an empirical distribution with proportions $\frac{1}{N}\sum_{i=1}^N 1(X_i = x, Y_i = y)$. Panels (b-c) depict confidence regions for Pr(X_i = 0, Y_i = 0), Pr(X_i = 0, Y_i = 1), and Pr(X_i = 1, Y_i = 0); the final category, Pr(X_i = 1, Y_i = 1) (not depicted), is determined by the constraint that all four probabilities sum to unity. Panel (b) shows the Bernoulli-KL confidence region, which is conservative in finite samples and can be polynomialized as a set of linear inequalities. Panel (c) shows the Gaussian confidence region, which is asymptotically valid and can be polynomialized as a single convex quadratic inequality.
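A sketch (ours) of the KL-inversion step for a single Bernoulli proportion; the threshold $\log(1/\delta)/n$ is a generic Chernoff-style choice, and the union-bound correction across cells follows the exact constants in Malloy et al. (2020), which we do not reproduce here:

```python
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def kl_upper(p_hat, n, delta):
    """Largest q >= p_hat with n * KL(p_hat, q) <= log(1/delta)."""
    thresh = math.log(1.0 / delta) / n
    lo, hi = p_hat, 1.0 - 1e-12
    if kl_bern(p_hat, hi) <= thresh:
        return 1.0
    for _ in range(80):              # bisection: KL is increasing above p_hat
        mid = (lo + hi) / 2.0
        if kl_bern(p_hat, mid) <= thresh:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lower(p_hat, n, delta):
    # by symmetry of the Bernoulli likelihood
    return 1.0 - kl_upper(1.0 - p_hat, n, delta)

lo95, up95 = kl_lower(0.3, 1000, 0.025), kl_upper(0.3, 1000, 0.025)
```

Each resulting per-cell interval becomes a pair of linear inequality constraints in the polynomial program, and the intervals shrink as the sample size grows.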

Simulated Examples
We now demonstrate our algorithm's performance via simulations. Several examples correspond to known analytic solutions, offering further validation of our approach. Section 8.1 illustrates how Algorithms 1-2 allow analysts to iteratively state possible assumptions, test their observable implications, and use them to narrow causal bounds under noncompliance. Section 8.2 evaluates our proposals for statistical inference with estimated bounds. Section 8.3 examines several challenges-selection, mismeasurement, and missingness-that pose more complex threats to statistical inference. For clarity of exposition, all simulations use binary variables; our method adapts automatically to categorical variables.

Instrumental Variables
Noncompliance, or deviation between assigned (Z_i) and realized (X_i) treatment status, is a common obstacle to causal inference in randomized trials. Balke and Pearl (1997) showed that the task of bounding the ATE on an outcome Y_i in the presence of noncompliance can be formulated as a linear programming problem, admitting a computational solution to partial identification. However, this approach cannot be extended to bound the local average treatment effect (LATE) among "compliers" who take the assigned treatment-a principal effect that has received considerable attention-because this estimand corresponds to a nonlinear objective function. Angrist et al. (1996) show the LATE can be point identified, but only if a number of conditions hold. These conditions include (i) ignorability of Z_i; (ii) a non-null effect of Z_i on X_i; (iii) an exclusion restriction, or the absence of a direct effect of Z_i on Y_i; and (iv) monotonicity, or the absence of "defiers" that behave inversely to instructions. In this section, we estimate both the ATE and LATE in settings where assumptions (i)-(ii) are satisfied, then probe the implications of assumptions (iii)-(iv). Our results show that while extant methods offer solutions for specific scenarios and estimands, even minor deviations from ideal conditions can render them inapplicable or inaccurate. Below, we show how our algorithm easily accommodates these variations and complications. In the true DGP, all conditions of Angrist et al. (1996) are satisfied except monotonicity of Z → X; the true values of the ATE and LATE are −0.25 and −0.36, respectively. In practice, analysts may proceed with an abundance of caution and make the conservative causal assumptions depicted in panel (a)-a challenging scenario in which a direct effect of the instrument on the outcome cannot be excluded and monotonicity is not assumed. Assuming model (a) and applying our algorithm yields sharp bounds of [−0.63, 0.37] and [−1, 1] for the ATE and LATE, respectively.
While these bounds are relatively wide-the ATE cannot be signed, and the bounds for the LATE are entirely uninformative-the resulting intervals do contain the true estimand values, and they represent the most precise statement possible under the assumptions the analyst is willing to defend.
If the analyst were willing to assume the exclusion restriction, per model (b)-perhaps due to domain expertise or an experimental design that ruled out direct effects-our algorithm would bound the ATE at [−0.55, −0.15], revealing a negative effect and correctly containing the true value of −0.25. However, under these circumstances, the bounds on the LATE remain entirely uninformative at [−1, 1]. This reflects the fact that without strong assumptions, it is difficult to learn about cross-world quantities such as principal effects.
Finally, panel (c) shows a DGP imagined by an overconfident analyst, in which all four identifying assumptions in Angrist et al. (1996) are embraced. Unbeknownst to the analyst, the monotonicity assumption is in fact violated. Helpfully, when asked to estimate bounds, our Algorithm 2 reports that the causal query is infeasible. Recall that the true DGP corresponds to model (b), in which defiers are present; because the algorithm fails to locate any DGPs in which the observed information is consistent with the absence of defiers, it provides a clear warning to users that the assumption cannot be defended. However, if the analyst naïvely applied the traditional instrumental variables two-stage least squares estimator, they would not be alerted to this fact. Rather, they would obtain a point estimate of −0.74, roughly twice the true LATE. Put differently, the standard IV approach ignores observable implications of underlying assumptions. In contrast, our algorithm flags faulty theory by identifying infeasible scenarios, forestalling fruitless inquiry.
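The linear-programming formulation of Balke and Pearl (1997) referenced above can be sketched directly. Here we parameterize the sixteen response types, generate observables from a hypothetical DGP that includes defiers (our numbers, not the paper's simulation), and verify that the resulting ATE bounds contain the truth:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# response types t = (x0, x1, y0, y1): X(z) = t[z], Y(x) = t[2 + x]
types = list(itertools.product([0, 1], repeat=4))

# hypothetical true response-type distribution, with 10% defiers
px = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.2}  # compliance types
py = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.3}  # outcome types
q_true = np.array([px[t[:2]] * py[t[2:]] for t in types])
ate_true = sum(q * (t[3] - t[2]) for q, t in zip(q_true, types))

# implied observables Pr(X = x, Y = y | Z = z) under randomization + exclusion
A_eq, b_eq = [], []
for z, x, y in itertools.product([0, 1], repeat=3):
    row = [1.0 if t[z] == x and t[2 + t[z]] == y else 0.0 for t in types]
    A_eq.append(row)
    b_eq.append(float(np.dot(row, q_true)))

c = np.array([t[3] - t[2] for t in types], dtype=float)  # ATE objective
box = [(0.0, 1.0)] * len(types)
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=box).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=box).fun
assert lo - 1e-9 <= ate_true <= hi + 1e-9
```

Replacing the linear ATE objective with the LATE ratio turns this into a genuinely nonlinear (polynomial-fractional) program, which is the case our algorithm is designed to handle.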

Coverage of Confidence Bounds
In applied settings, the bounds estimated by our algorithm will be subject to sampling error. We now evaluate the performance of confidence bounds that characterize this uncertainty, constructed according to Section 7, using the instrumental variable model of Figure 3(b). Specifically, we draw samples of N = 1,000, N = 10,000, or N = 100,000 observations from this DGP. For each sample, we then compute estimates of eight quantities: Pr(Z_i = z, X_i = x, Y_i = y) for all x, y, z ∈ {0, 1}. These quantities form the basis of estimated bounds, by the plug-in principle. To quantify uncertainty, we compute 95% confidence regions on the same observed quantities, then convert them to polynomial constraints for inclusion in Algorithm 2. Optimizing subject to these confidence constraints produces confidence bounds, depicted in Figure 4. For each combination of sample size and uncertainty method, we draw 1,000 simulated datasets and run Algorithm 2 once per dataset. Table 1 reports the average estimated lower (upper) confidence bounds obtained over these simulations, for varying N. At all sample sizes, estimated bounds are centered on population bounds. Figure 4 also shows confidence bounds across methods and sample sizes. The Bernoulli-KL method produces wider confidence intervals at all N; at N = 1,000, it is generally unable to reject zero, whereas the asymptotic method does so occasionally. Differences in interval width persist but shrink rapidly as sample size grows, and both methods collapse on the population bounds. As discussed in Section 7, we find more conservative coverage for confidence bounds on the ATE (100% coverage of population bounds) than for the underlying confidence regions on the observed quantities (95% joint coverage of observed population quantities for the asymptotic method).

More Complex Bounding Problems
We now examine four hypothetical DGPs, shown in Figure 5, featuring various threats to inference. Throughout, we target the ATE of X on Y. Panel (a) illustrates outcome-based selection: we observe unit i only if S_i = 1, where S_i may be affected by Y_i. Selection severity, Pr(S_i = 0), is known, but no information about Pr(X_i = x, Y_i = y | S_i = 0) is available. X_i and Y_i are also confounded by unobserved U_i. Bounding in this setting is a nonlinear program, with an analytic solution recently derived in Gabriel et al. (2020). Panel (b) illustrates measurement error: an unobserved confounder U_i jointly causes Y_i and its proxy Y*_i, but only the treatment and the proxy outcome are observed. Bounding in this setting is a linear problem. A number of results for linear measurement error were recently presented in Finkelstein et al. (2020); here, we examine the monotonic-errors case, in which misclassification can occur in only one direction. Panel (c) depicts missingness in outcomes, i.e., nonresponse or attrition. Here, X_i affects both the partially observed Y_i and the response indicator R_i; if R_i = 1, then Y*_i = Y_i, but if R_i = 0, then Y*_i takes on the missing-value indicator NA. Nonresponse on Y_i is differentially affected by both X_i and the value of Y_i itself (i.e., "missingness not at random," MNAR); Manski (1990) provides analytic bounds. Finally, panel (d) depicts joint missingness in both treatment and outcome-sometimes a challenge in longitudinal studies with dropout-with MNAR on Y_i. Figure 6(a-c) illustrates how Algorithm 2 recovers sharp bounds. Each panel shows progress over time, converging on known analytic results depicted at the right of each plot. Primal bounds (blue) widen over time as more extreme, observationally equivalent models are found. Dual bounds (red) narrow as the outer envelope is tightened. When a region cannot possibly produce a more extreme value than a previously discovered primal point, it is eliminated from consideration.
Optimization proceeds by simultaneously searching for more extreme primal points and narrowing the dual envelope. Analysts can terminate the process at any time, reporting guaranteed-valid dual bounds along with their worst-case suboptimality factor ε-or await complete sharpness, ε = 0.

Figure 5: Various threats to inference. Panels depict (a) outcome-based selection, (b) measurement error, (c) nonresponse, and (d) joint missingness. In each graph, X and Y are treatment and outcome, respectively. Dotted red regions represent observed information. In (a), the box around S indicates selection: other variables are only observed conditional on S = 1. In (b), Y* represents a mismeasured version of the unobserved true Y. In (c), R_Y indicates reporting, so that Y* = Y if R_Y = 1 and is missing otherwise. In (d), both treatment and outcome can be missing, and missingness on X can affect missingness on Y.

In Figure 6(a-c), the algorithm converges on known analytic results. Ultimately, in the selection simulation (a), Algorithm 2 achieves bounds of [−0.37, 0.68], correctly recovering the bounds of Gabriel et al. (2020); in (b), the measurement-error bounds are [−0.57, 1.00], matching Finkelstein et al. (2020); and in (c), the outcome-missingness bounds are [−0.25, 0.75], equaling the Manski (1990) bounds. Somewhat counterintuitively, Figure 6(d) shows dual bounds collapsing to a point, correctly point-identifying the ATE at −0.25 despite severe missingness. This surprising result turns out to be a special case of an approach using "shadow variables" recently developed by Miao et al. (2015). 12 This example illustrates that the algorithm is general enough to recover results even when they are not widely known for a particular model; note that the commonly used approach of Manski (1990) produces far looser bounds of [−0.72, 0.40], failing to exploit the causal structure given in Figure 5(d).
This result suggests our approach enables an empirical investigation of complex models where general identification results are not yet available. Situations where bounds converge suggest models where point identification via an explicit functional may be possible, potentially enabling new identification theory.
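As a point of comparison, the analytic Manski benchmark recovered in the outcome-missingness setting can be computed directly by imputing the missing outcomes at their extremes. The sketch below uses hypothetical observed proportions (not the simulation's actual parameters) and assumes the treatment is exogenous:

```python
def manski_ate_bounds(p_y1_r1, p_r0):
    """Worst-case (Manski-style) bounds on the ATE of a binary treatment X
    on a binary outcome Y when Y is missing not at random.

    p_y1_r1[x] = Pr(Y=1, R=1 | X=x)  -- responders observed with Y=1
    p_r0[x]    = Pr(R=0 | X=x)       -- nonresponse rate given X=x
    Missing outcomes are imputed as all-0 or all-1 to get extreme values.
    """
    lo = {x: p_y1_r1[x] for x in (0, 1)}            # impute Y=0 when missing
    hi = {x: p_y1_r1[x] + p_r0[x] for x in (0, 1)}  # impute Y=1 when missing
    return lo[1] - hi[0], hi[1] - lo[0]

# Hypothetical observed distribution (X randomized):
# Pr(Y=1, R=1 | X=1) = 0.40, Pr(R=0 | X=1) = 0.20
# Pr(Y=1, R=1 | X=0) = 0.30, Pr(R=0 | X=0) = 0.30
lb, ub = manski_ate_bounds({0: 0.30, 1: 0.40}, {0: 0.30, 1: 0.20})
print(round(lb, 3), round(ub, 3))  # -0.2 0.3
```

Note the bound width equals the total nonresponse probability, $0.2 + 0.3 = 0.5$, reflecting that nothing restricts the missing outcomes.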

Potential Critiques of the Approach
Below, we briefly discuss several potential critiques of our method.
"The user must know the true causal model." Our algorithm requires that users specify a causal graph and assumptions, but in many applications, the true DGP is unknown. This is precisely the obstacle that motivates our approach, which allows for valid inferences in the absence of complete information. Rather than assert a faulty "complete" model, the user need only input what they know or believe. The algorithm then outputs the most precise possible solution given that information; key assumptions can be relaxed further using easily incorporated sensitivity analyses, as needed. We note that the difficulty of declaring a causal theory, even a partial one, is universal: any attempt to draw causal inferences from data, even in experimental settings, is premised (often implicitly) on an underlying causal theory. Making assumptions explicit is not a tradeoff relative to other methods, but a boon for research transparency.
"The bounds may be too wide to be informative." Yes.
When a point-identified solution exists, our algorithm will discover it. As Section 8.3 shows, this can occur in surprising scenarios and may help reveal new identification theory. However, when point identification is impossible, our approach produces sharp bounds. These bounds may be insufficient for an analyst to achieve a goal such as discerning the sign of a causal effect. This is simply a fact about the limitations of the research design: as we prove, it is impossible to narrow the bounds further without additional information. Again, there is no tradeoff: incorrect point estimates based on faulty assumptions are also uninformative. When sharp bounds incorporating all defensible assumptions are wide, it means progress will require collecting more data or justifying additional assumptions.
"What about continuous variables?" Our approach applies to discrete data, but analysis of continuous variables can often still proceed with some adjustments. Discrete approximations often suffice in applied work. (Indeed, "all data as observed are discrete," Rubin, 1981, p. 133.) When continuous treatments (e.g. birth date, vehicle speed) affect discrete outcomes (school admittance, police stops) only upon crossing a threshold, discretization is lossless. Moreover, when analyzing discrete treatments and continuous outcomes, much of our theory generalizes to estimands involving expectations of the outcome. Future work may study our method's applicability to bounded continuous variables with smooth effects.
"The bounds will take too long to compute." Computation time for sharp bounds may sometimes be prohibitive, but our approach is likely still faster than manual derivation. First, the algorithm recovers several recently published analytic results in mere seconds (Miao et al., 2015; Knox et al., 2020). Second, when computation time is long, our algorithm's "anytime" guarantee ensures that premature termination will still produce valid bounds and report a worst-case looseness factor for the resulting non-sharp bounds.

Future Work with Automated Bounding
Causal inference is a central goal of science, and several established techniques can estimate causal quantities under ideal conditions. But in many applications, these conditions are simply not satisfied, and developing new analytic solutions is often intractable. For knowledge accumulation to proceed in the messy world of applied statistics, a general solution is needed. We present a tool to automatically produce sharp bounds on causal quantities in settings involving discrete data. Our approach involves a reduction of all such causal queries to polynomial programming problems, enables efficient search over observationally indistinguishable DGPs, and produces sharp bounds on arbitrary causal estimands. This approach is sufficiently general to accommodate a range of classic inferential obstacles.
Beyond providing a general tool for causal inference, our approach aligns closely with recent calls to improve research transparency by requiring the explicit declaration of estimands, identifying assumptions, and theory (Miguel et al., 2014;Lundberg et al., 2021). With a common understanding of goals and premises, scholars can have meaningful debates over the credibility of research. When aspects of a theory are contested, our approach allows for a fully modular exploration of how assumptions shape empirical conclusions. Scholars can learn whether a particular assumption is empirically consequential, and if so, craft a targeted line of inquiry to probe its validity. Our approach can also act as a safeguard for analysts, flagging assumptions as infeasible when they conflict with observed information. This means hopeless research projects can be abandoned before wasting effort or disseminating untruths. Future work should seek to reduce computation time for sharp bounds, especially when incorporating point-identified subquantities or additional semi-parametric modeling approaches. Causal inference scholars may also use this method as an exploratory tool to aid in the discovery of new identification theory. These lines of inquiry now represent the major open questions in discrete causal inference.

A.1 Canonicalization of DAGs
In this appendix, we summarize the process for obtaining a canonical hidden variable DAG, presented as Definition 4.6 in Evans (2016). Theorem 4.13 in Evans (2016) shows that the marginal model of any hidden variable DAG is the same as that of its canonical hidden variable DAG, and Proposition 7.4 of the same work shows that the same holds for the model for post-intervention distributions, when interventions are restricted to the main variables.
Given a hidden variable DAG G, its canonical form is constructed by the following procedure:
1. Add an edge $X_j \to X_k$ for any pair of variables $X_j, X_k$ such that there is a path from $X_j$ to $X_k$ along which all intermediate variables are hidden. $X_j$ and $X_k$ can each be hidden or observed.
2. Remove incoming edges to hidden variables.
3. Remove hidden variables whose children are a subset of the children of another hidden variable.
By construction, all latent variables in the canonical DAG will be exogenous.
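The three-step procedure above can be sketched directly on an edge-set representation of the graph. The sketch below is illustrative only (the variable names and graph encoding are ours, not from Evans, 2016), and step 3 uses a strict-subset test for simplicity:

```python
from itertools import product

def canonicalize(nodes, edges, hidden):
    """Sketch of the three-step canonicalization of a hidden variable DAG.
    nodes: iterable of variable names; edges: set of (parent, child)
    pairs; hidden: set of names of hidden variables."""
    edges = set(edges)

    # Step 1: add a -> b whenever there is a directed path from a to b
    # along which every intermediate variable is hidden.
    def hidden_path(a, b, visited):
        if (a, b) in edges:
            return True
        return any((a, m) in edges and m not in visited
                   and hidden_path(m, b, visited | {m}) for m in hidden)

    for a, b in product(list(nodes), repeat=2):
        if a != b and hidden_path(a, b, {a}):
            edges.add((a, b))

    # Step 2: remove incoming edges to hidden variables.
    edges = {(p, c) for p, c in edges if c not in hidden}

    # Step 3: remove hidden variables whose children are a strict subset
    # of the children of another hidden variable.
    children = {h: {c for p, c in edges if p == h} for h in hidden}
    kept = {h for h in hidden
            if not any(h2 != h and children[h] < children[h2] for h2 in hidden)}
    edges = {(p, c) for p, c in edges if p not in hidden - kept}
    return edges, kept

# X -> H -> Y with hidden mediator H, plus hidden confounder U of X and Y:
# H is projected away, leaving X -> Y confounded by the exogenous U.
edges, hidden = canonicalize(
    ["X", "Y", "H", "U"],
    {("X", "H"), ("H", "Y"), ("U", "X"), ("U", "Y")},
    {"H", "U"})
print(sorted(edges), sorted(hidden))
# [('U', 'X'), ('U', 'Y'), ('X', 'Y')] ['U']
```

In the example, step 1 adds the edge $X \to Y$ (the path $X \to H \to Y$ has only hidden intermediates), step 2 removes $X \to H$, and step 3 drops $H$ because its children $\{Y\}$ are a subset of $U$'s children $\{X, Y\}$, leaving only the exogenous latent $U$.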

A.2 Functional Models in the Context of Determinism
The general approach for obtaining functional models for discrete hidden variable DAGs (Evans, 2018;Finkelstein et al., 2021) does not take account of the kind of determinism introduced into the model by missingness indicators, and as such may yield a functional model that is over-parameterized. Due to the complexity of polynomial programming, it is beneficial to avoid excess parameters where possible. We now briefly explore this issue.
Consider the scenario depicted in Figure 7. In this graph, A * is a proxy for the unobserved variable A, which is observed with missingness as indicated by R A . When R A = 0, then A * is deterministically equal to a special value indicating missingness (usually denoted with a special value such as "?" or "NA"). This example demonstrates that incorporating known deterministic relationships can yield a non-restrictive functional parameterization with fewer parameters.

A.3 DAG Parameterization for Non-geared Graphs
Most graphs we encounter in practice are geared (Evans, 2018), meaning they have no non-trivial bi-directed cycles (Finkelstein et al., 2021). When graphs are not geared, and the target estimand as well as all empirical evidence involve only single-world probabilities, it is still possible to reduce the complexity of the system. Under these circumstances, it is preferable to obtain non-restrictive bounds on the cardinalities of latent variables according to Finkelstein et al. (2021). All single-world probabilities can be expressed in terms of the usual DAG parameters according to the g-formula, and therefore all functionals of such probabilities described in Corollary 1 can be polynomialized as well. If the target or any of the empirical evidence involves cross-world probabilities, we must revert to the functional model approach.

A.4 Example of Program Simplification
Consider the graph presented in Figure 8. We will use this graph to illustrate a number of points raised in the main body of the paper. Suppose we are interested in the ATE of E on C. First, we will explicitly construct the functional model of this graph, then use it to generate a simple polynomial program that bounds a causal target. Next, we will employ several of the strategies described in Section 5 to simplify the program, demonstrating the importance of these strategies in obtaining tractable program formulations. Finally, we will observe that a broader class of partial identification problems than previously recognized can be formulated as linear programs.
Suppose all observed variables in the graph above are binary. In constructing a functional model, we first note that $U_2$ is responsible for determining the values of A, C, and E in response to their parents. A has no parents, E has one parent, and C has two parents.
Therefore $U_2$ takes values in a state space of size $2^1 \times 2^2 \times 2^4 = 128$. Next, we suppose $U_1$ is responsible for determining the value of B in response to A, and therefore has size $2^2 = 4$. $U_3$ is left to determine the value of F in response to D, and of D in response to $U_1$ and C. It therefore takes values in a space of size $2^8 \times 2^2 = 1024$.^{13} To construct the polynomial program, we begin with the non-negativity and linear marginalization constraints on the parameters of the distributions of the disturbances (for simplicity, we abstain from eliminating one parameter per distribution using the sum-to-unity constraint). We then add constraints encoding the empirical evidence $\mathcal{E}$. For simplicity, we assume that we observe the full joint distribution over the six main variables, which is a vector of size $2^6 = 64$, corresponding to 64 equality constraints in the program.
There are three disturbance variables in this graph, so the polynomial equality constraints encoding $\mathcal{E}$ have terms of degree 3. Given the cardinalities of the disturbances, there are $2^4 \times 2^7 \times 2^{10} = 2{,}097{,}152$ possible combinations of disturbance assignments. By a simple exchangeability argument, the same number of combinations lead to each outcome in the state space. As there are $2^6$ outcomes, each of the 64 polynomial equality constraints for $\mathcal{E}$ will have $2^{21}/2^6 = 2^{15}$ terms, again each of degree 3. This is a very large program.
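The state-space arithmetic above follows a single rule: a disturbance that determines a set of variables ranges over one response function per variable, i.e. a map from each variable's parent configurations to its values. A small sketch, using the example's cardinalities as checks:

```python
def disturbance_cardinality(assignments):
    """Size of a disturbance's state space in a functional model.

    assignments: list of (n_values, n_parent_configs) pairs, one per
    variable the disturbance is responsible for determining.  The
    disturbance selects a response function for each such variable,
    giving n_values ** n_parent_configs choices per variable.
    """
    size = 1
    for n_values, n_parent_configs in assignments:
        size *= n_values ** n_parent_configs
    return size

# The example in the text (all variables binary):
# U2 determines A (no parents), E (one parent), C (two parents).
assert disturbance_cardinality([(2, 1), (2, 2), (2, 4)]) == 128
# U1 determines B in response to A.
assert disturbance_cardinality([(2, 2)]) == 4
# U3 determines F in response to D, and D in response to C and
# the 4-state U1 (8 parent configurations).
assert disturbance_cardinality([(2, 2), (2, 8)]) == 1024
```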
We now consider the strategies described in Section 5. First, observe that there are only 31 nested Markov parameters for this graph, corresponding to 31 polynomial equality constraints encoding $\mathcal{E}$: a substantial savings over the 64 parameters of the naïve parameterization. This reduced parameterization is possible because it encodes standard conditional independences, such as $F \perp A \mid D$. In addition, it encodes Verma constraints, which emerge either (i) from independences in post-intervention distributions or (ii) from the irrelevance of an intervention to a particular distribution. In this case, $A \perp \{D, F\} \mid do(C)$.

Footnote 13: It is also possible to construct a functional model by first taking $U_3$ to be responsible for determining F in response to D, and then $U_1$ to be responsible for determining B in response to A and D in response to C and $U_3$. By a simple symmetry argument, the two functional models yield the same number of parameters.
As discussed in the main text, each equality constraint can be used to reduce the number of parameters needed in a non-restrictive reduction that can express every possible distribution in the model.
Recall that each nested Markov parameter corresponds to the identified probability of a single-world event, where the event is specified in terms of variables in a single district, and the intervention is on all parents of the district relevant to those variables. For example, in this case, one of the nested Markov parameters is $\Pr(b = 1, f = 1 \mid d = 1, do(a = 1, c = 1))$. We can now make use of Proposition 3 to reason that each of these polynomial constraints must involve only disturbances from a single district. Therefore, in the equations corresponding to nested Markov parameters for the district corresponding to $U_2$, parameters of the distributions of $U_1$ and $U_3$ will all sum out, and we will be left with equations that are linear in the parameters of $U_2$. Likewise, in equations corresponding to nested Markov parameters for the district containing descendants of $U_1$ and $U_3$, parameters for the distribution of $U_2$ will factor out, and we will be left with quadratic equations.
Finally, we can make use of Proposition 4 to note that constraints involving nested Markov parameters corresponding to the $\{U_1, U_3\}$ district can be dropped from the program. This is because they involve only parameters for the distributions of $U_1$ and $U_3$, which do not appear in any constraint involving parameters for the distribution of $U_2$. The target, by contrast, involves only parameters for the distribution of $U_2$.
As a result of taking the three steps described in Section 5, we have taken this problem from a polynomial program involving 1156 parameters to a linear program involving only $2^7 = 128$ parameters and fewer constraints. This example also motivates the following corollary, which expands the class of partial identification problems that can be formulated as linear programs relative to known results (Balke and Pearl, 1997; Finkelstein et al., 2020; Wolfe et al., 2019).

Corollary. Suppose the district containing the target variables contains only a single latent variable. Then the bounding problem can be formulated as a linear program.

Proof. Because the common district of C contains only a single latent variable, by Proposition 3 the objective will be linear in the parameters of the distribution of that latent variable. By Proposition 4, the constraints will not involve parameters corresponding to other districts. By Algorithm 1, no single term in a constraint will involve multiple parameters for the same latent distribution, meaning that all constraints involving only parameters corresponding to a single-variable district will be linear. The non-negativity and sum-to-unity constraints on the parameters of the latent-variable distribution are also linear. It follows that the objective and all constraints are linear.
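The reduction to a linear program can be illustrated on the simplest confounded graph: binary $X$ and $Y$ with a single unobserved confounder $U$, so the relevant district contains one latent variable. The sketch below (using scipy, with hypothetical observed proportions) parameterizes the latent by principal strata and recovers the classic no-assumptions bounds on the ATE:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical observed joint distribution Pr(X=x, Y=y) under
# unobserved confounding of X and Y.
p_obs = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Parameters: q[x, (y0, y1)] = Pr(stratum in which X = x and the
# response function maps X=0 -> y0, X=1 -> y1).  8 parameters total.
strata = [(x, (y0, y1)) for x in (0, 1) for y0 in (0, 1) for y1 in (0, 1)]

# Linear objective: ATE = sum over strata of q * (y1 - y0).
c = np.array([y1 - y0 for _, (y0, y1) in strata], dtype=float)

# Equality constraints: strata must reproduce every observed cell,
# Pr(X=x, Y=y) = sum of q over strata with that x whose response
# function yields Y=y under the factual X=x.
A_eq, b_eq = [], []
for (x, y), prob in p_obs.items():
    A_eq.append([1.0 if sx == x and resp[x] == y else 0.0
                 for sx, resp in strata])
    b_eq.append(prob)
A_eq.append([1.0] * len(strata))  # probabilities sum to unity
b_eq.append(1.0)

lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
print(round(lo, 6), round(hi, 6))  # -0.3 0.7
```

The output matches the analytic no-assumptions bounds $[\Pr(1,1) + \Pr(0,0) - 1,\; 1 - \Pr(1,0) - \Pr(0,1)]$, whose width is always 1.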

A.5 Constructing the Polynomial Program
Algorithm 1 constructs a polynomial program to sharply bound any factual or counterfactual target of inference, T , that is a polynomial fraction or monotonic transformation thereof. In addition to T , the algorithm takes as input a possibly non-canonical DAG G, empirical evidence E, modeling assumptions A, and the sample space of possible outcomes for the main variables, S(V ). It produces an optimization problem with a polynomial objective subject to polynomial constraints. This polynomial programming problem is equivalent to the original causal bounding problem.

Algorithm 1 Constructing a Polynomial Program
Input: graph G, evidence $\mathcal{E}$, assumptions $\mathcal{A}$, sample space $S(V)$, target $T$
Output: polynomial program in the parameters $P_U$
for each evidence or assumption constraint relating some $g(P_V)$ to a value $\alpha$ do
    polynomialize $g(P_V)$, relate it to $\alpha$, and append the result to the constraint set $C$
end for
for $U_{i,k} \in U_i$ do
    append "$P_{U_k}$ is a distribution" to $C$
end for
Optimize:
return optimize $T$ subject to $C$

A.6 Optimizing the Polynomial Program
Algorithm 2 provides a step-by-step description of the ε-sharp bounding procedure. For ease of reference, we duplicate the Section 6 discussion of the algorithm's various components here.
Algorithm 2 takes as inputs the polynomialized objective function $T(p)$ and constraint set $C(p)$, obtained from Algorithm 1. It then evaluates a range of models, or points $p$ in the model space $P$ for which $C(p)$ is satisfied, seeking to identify extreme values of $T(p)$ within this subspace. It also accepts two parameters: $\epsilon_{\mathrm{thresh}}$, a stopping threshold for the looseness factor, and $\theta_{\mathrm{thresh}}$, a stopping threshold for the width of the bounds. The algorithm returns two types of information: the upper and lower bounds for the causal program, and the worst-case looseness factor $\epsilon$.
Primal bounds are denoted $\underline{P}$ and $\overline{P}$, adopting the convention that underlines refer to objects used for minimization and overlines to objects used for maximization. These indicate the most extreme values of the target estimand located so far in any admissible model, that is, any model satisfying $C(p)$. They are initialized at $+\infty$ and $-\infty$, respectively, indicating that no admissible models have been found yet. As optimization proceeds, the primal bounds improve as new, more extreme admissible models are found. We refer to $[\underline{P}, \overline{P}]$ as the inner bounds: the unknown sharp bounds must at least contain these points, which correspond to models that are observationally indistinguishable from the true DGP. Dual bounds, obtained by relaxing the program, are outer bounds: whatever the true sharp bounds, they must lie inside the dual bounds, even if the algorithm has not run to completion. We let $\theta$ equal the bound width, or the difference between the upper and lower dual bounds, and we define the worst-case looseness factor $\epsilon$ as the slack (the difference between the dual and primal bound widths) divided by the primal bound width.
The algorithm heuristically selects branches of the model space that appear promising, and refines the primal and dual bounds in turn. It first searches within the branch for an admissible model; if one is found, and if the associated causal estimand is more extreme than those previously encountered, it is stored as a new primal bound. Whatever the true nonparametric sharp bounds, they must lie outside the primal bounds, because each primal point is attained by some admissible model. Upon termination, the algorithm computes the looseness factor $\epsilon \leftarrow \theta / (\overline{P} - \underline{P}) - 1$ and returns $\underline{T}$, $\overline{T}$, and $\epsilon$.
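The primal/dual interplay can be illustrated on a toy problem, with interval arithmetic standing in for the dual relaxation. This is a simplified sketch of the general idea, not the paper's Algorithm 2: it maximizes $T(p) = p_1 p_2$ subject to $p_1 + p_2 = 1$, whose true optimum is $0.25$:

```python
import heapq

def ub_mul(a, b):
    """Upper bound of the interval product of intervals a and b."""
    return max(a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1])

def bnb_max(eps=1e-4, max_steps=10000):
    """Toy spatial branch and bound: maximize T(p) = p1*p2 subject to
    p1 + p2 = 1 over [0,1]^2.  'primal' tracks the best admissible point
    found so far; 'dual' is a guaranteed outer bound, valid at any time."""
    primal, dual = float("-inf"), 1.0
    heap = [(-1.0, ((0.0, 1.0), (0.0, 1.0)))]   # (-box upper bound, box)
    steps = 0
    while heap and steps < max_steps:
        steps += 1
        neg_ub, (b1, b2) = heapq.heappop(heap)
        dual = -neg_ub                 # best bound among remaining boxes
        if dual - primal <= eps:
            break                      # epsilon-sharp: stop early
        # primal step: project the midpoint of b1 onto the constraint
        p1 = min(max((b1[0] + b1[1]) / 2, 1 - b2[1]), 1 - b2[0])
        if b1[0] <= p1 <= b1[1]:
            primal = max(primal, p1 * (1 - p1))
        # branch: split the wider coordinate in half
        i = 0 if b1[1] - b1[0] >= b2[1] - b2[0] else 1
        lo, hi = (b1, b2)[i]
        mid = (lo + hi) / 2
        for half in ((lo, mid), (mid, hi)):
            c1, c2 = (half, b2) if i == 0 else (b1, half)
            # keep the child only if it can satisfy p1 + p2 = 1 and
            # could still beat the incumbent primal point
            if c1[0] + c2[0] <= 1.0 <= c1[1] + c2[1]:
                ub = ub_mul(c1, c2)
                if ub > primal:
                    heapq.heappush(heap, (-ub, (c1, c2)))
    if not heap:
        dual = primal                  # search exhausted: bounds are sharp
    return primal, dual

best, bound = bnb_max()
print(best, bound)  # best is 0.25; bound is a valid outer bound near it
```

Stopping early still returns a valid (primal, dual) pair; letting the loop run tightens the dual bound toward the primal value, mirroring the anytime guarantee described above.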

B Proofs
Proof of Proposition 1.
Proof. We adapt the proof of Finkelstein et al. (2021) to account for counterfactuals as follows. First, we define one-step-ahead counterfactuals, $V_{i,j}(\mathrm{pa}(V_{i,j}) = a)$, to be those in which all main parents of a variable are subject to intervention, $\mathrm{pa}(V_{i,j}) = a$. Next, we note that all other counterfactuals and factuals in the full data law are deterministic functions of one-step-ahead variables after fixing $U_i$. Therefore it is sufficient to reason about only one-step-ahead variables; intervention on other variables is irrelevant to the full data law.
Because the likelihoods of multi-district graphs factorize as the likelihoods of the districts after intervention on their parents (Richardson et al., 2017), we can consider single-district graphs without loss of generality. In multi-district graphs, the bound obtained below can be applied within each district.
Each main variable $V_{i,j}$ has $|S(\mathrm{pa}(V_{i,j}))|$ one-step-ahead counterfactuals, corresponding to the possible manipulations of its parents. Each one-step-ahead counterfactual $V_{i,j}(\mathrm{pa}(V_{i,j}) = a)$ has cardinality equal to that of the corresponding main variable, $|S(V_{i,j})|$. Therefore, the collection of a single variable's one-step-ahead counterfactuals, $\{V_{i,j}(\mathrm{pa}(V_{i,j}) = a), V_{i,j}(\mathrm{pa}(V_{i,j}) = a'), \ldots\}$, can take on $|S(V_{i,j})|^{|S(\mathrm{pa}(V_{i,j}))|}$ possible values, and there are $d \equiv \prod_j |S(V_{i,j})|^{|S(\mathrm{pa}(V_{i,j}))|}$ values that the full collection of all one-step-ahead variables can take. Any model over this full collection must be a subset of the $(d-1)$-simplex. We let $V(\mathrm{pa}(V))$ denote the collection of one-step-ahead variables.
Suppose the disturbances $U_i$ are enumerated as $\{U_{i,1}, \ldots, U_{i,K}\}$. We will now show that each $U_{i,k}$ can be assumed to be discrete without altering the model for $V(\mathrm{pa}(V))$, and therefore the full data law. First, for each value $u_k$ in the domain of $U_{i,k}$, we define the distribution $P_{u_k}(V(\mathrm{pa}(V))) = \sum_{u_{\setminus k}} P(V(\mathrm{pa}(V)) \mid u_{\setminus k}, u_k)\, P(u_{\setminus k})$, where $u_{\setminus k}$ denotes all disturbances other than $u_k$. This fixes $U_{i,k}$ at the value $u_k$, modifying the distribution over $V(\mathrm{pa}(V))$.
We now make two observations. First, the model for $V(\mathrm{pa}(V))$ contains $P_{u_k}$ for every $u_k$, because $U_{i,k}$ is not restricted by the model and is therefore permitted to have a point-mass distribution at $u_k$. Second, the expected value of $P_{u_k}$ with respect to $U_{i,k}$ recovers the original marginal distribution $P(V(\mathrm{pa}(V)))$, which is therefore in the convex hull of the set of distributions $S(P_{u_k}) \equiv \{P_{u_k} \mid u_k \in S(U_{i,k})\}$. Carathéodory's theorem (1907) states that for any point $P$ in the convex hull of a set $S$ in a space of dimension $d-1$, there exists a set of $d$ points $\{P_{u_{k_1}}, \ldots, P_{u_{k_d}}\}$ and weights $\{w_1, \ldots, w_d\}$ such that $P = \sum_{\ell=1}^{d} w_\ell P_{u_{k_\ell}}$. It then follows directly that any distribution in the marginal model over $V(\mathrm{pa}(V))$ when latent variables have unrestricted cardinality is also in the marginal model over $V(\mathrm{pa}(V))$ when latent variables have cardinality restricted to $d$.

Proof of Proposition 2
Proof. Using the approach developed in Evans (2018) and generalized to arbitrary graphs in Finkelstein et al. (2021), we can obtain a functional model that is non-restrictive of the causal model of G over observed variables. In such a model, each one-step-ahead counterfactual $V_{i,j}(a)$ is determined by the values of the disturbances $U_i$. By assumption, G is in canonical form, rendering all disturbances marginally independent. The proposition then follows from standard probability calculus.

Proof of Proposition 3
Proof. Under the conditions specified, no element in C involves a function of $U_{i,k}$. It follows that whether the disturbances lead to C is not a function of the value of $U_{i,k}$. As a result, a sum over all parameters of the distribution of $U_{i,k}$ can be factored out of the product in Equation 2. By the definition of a probability distribution, this sum equals 1, rendering these parameters irrelevant to the polynomial.

Proof of Proposition 4
Proof. Each of the nested Markov parameters corresponds to the probability that random variables in a single district take certain values after an intervention on parents of the district. It follows from Proposition 3 that no disturbances outside the district corresponding to the nested Markov parameter will appear in the polynomialization of that parameter. From this, it then follows that no disturbances in different districts will interact in constraints corresponding to nested Markov parameters. By Proposition 2, the degree of a polynomialization of the probability of the event is at most the number of relevant disturbances.

C Uncertainty
In this appendix, we provide details on our approach to quantifying the uncertainty of bounds based on estimated empirical inputs $\hat{E}$. Recall that the estimated bounds are obtained from a polynomial program using equality constraints of the form polynomialize$(g(P_V)) = \hat{E}$, which is equivalent to polynomial-fractionalize$(g(P_V)) = \hat{E}$. Here, $\hat{E}$ is the noisily estimated empirical quantity and polynomial-fractionalize$(g(P_V))$ is the re-expression of that same quantity in terms of principal-strata sizes. At a high level, we will proceed by constructing confidence regions $CR_\alpha(\hat{E})$ such that $\Pr(E \in CR_\alpha(\hat{E})) \ge \alpha$. To obtain confidence bounds, we then replace empirical equality constraints with a looser version that accounts for sampling variation, of the form polynomial-fractionalize$(g(P_V)) \in CR_\alpha(\hat{E})$.
Observe that because the main variables are discrete,Ê is a realization of a multinomial proportion. In what follows, we will assume that empirical evidence arises from a single multinomial distribution, such as a single-world marginal distribution; if multiple independent sets of empirical evidence about differing quantities are available, the procedure generalizes straightforwardly by repeating the procedure within each set and combining the results appropriately.
Based on this idea, we examine two methods for constructing CR α (Ê). Drawing on Malloy et al. (2020), we first consider a "Bernoulli-KL" approach that constructs separate confidence regions for each observable atomic event, Pr(V i = v), treating it as a "success" in a Bernoulli distribution. The approach rotates through all possible v and combines the event-specific regions using a result on the Kullback-Leibler divergence of sampling distributions to the underlying population distribution. The Bernoulli-KL method produces a confidence region for single-world distributions that is guaranteed to have conservative coverage for the multinomial proportion in finite samples.
Let $k \in \{1, \ldots, K\}$ index the possible atomic events, and denote the probability of the k-th event as $p_k = \Pr(V_i = v_k)$. Empirical frequencies are denoted $\hat{p}_k$. For the Bernoulli-KL method, we will develop a confidence region of the form $CR_\alpha(\hat{E}) = \prod_{k=1}^{K} [\underline{p}_k, \overline{p}_k]$, noting that each $p_k$ can be polynomialized. A visualization of the resulting region is given in Figure 2(b).
We now describe how $\underline{p}_k$ and $\overline{p}_k$ can be calculated to ensure that $\Pr(E \in CR_\alpha(\hat{E})) \ge \alpha$.
At a high level, we will do so by analyzing each of the K observable events as a Bernoulli distribution. Taking each estimate $\hat{p}_k$ as given, we identify regions of the unknown $p_k$ from which the observed $\hat{p}_k$ diverges substantially. Equation 11 of Malloy et al. (2020) provides bounds on the sampling probability of observing $\mathrm{KL}([1-\hat{p}_k, \hat{p}_k], [1-p_k, p_k])$ in excess of some threshold, where $\mathrm{KL}([1-\hat{p}_k, \hat{p}_k], [1-p_k, p_k]) = \hat{p}_k \log \frac{\hat{p}_k}{p_k} + (1-\hat{p}_k) \log \frac{1-\hat{p}_k}{1-p_k}$. In turn, these bounds imply regions of $p_k$ that can be conservatively rejected. Let $\underline{p}_k$ be given by the solution to $\mathrm{KL}([1-\hat{p}_k, \hat{p}_k], [1-p_k, p_k]) = \frac{1}{N} \log \frac{2K}{1-\alpha}$ subject to $p_k \in [0, \hat{p}_k]$. Similarly, let $\overline{p}_k$ be given by $\mathrm{KL}([1-\hat{p}_k, \hat{p}_k], [1-p_k, p_k]) = \frac{1}{N} \log \frac{2K}{1-\alpha}$ subject to $p_k \in [\hat{p}_k, 1]$. It can be seen from Malloy et al. (2020) that when $\underline{p}_k$ and $\overline{p}_k$ are constructed in this way, $\Pr\left(\bigcap_{k=1}^{K} \{p_k \in [\underline{p}_k, \overline{p}_k]\}\right) \ge \alpha$ over repeated samples.

Our second approach uses an asymptotic confidence region based on the multivariate Gaussian limiting distribution of the multinomial proportion, $\mathcal{N}\left(p, [\mathrm{diag}(p) - pp^\top]/N\right)$ (Bienaymé, 1838). Because the multinomial proportion must sum to unity, this distribution is degenerate, and it is often more convenient to work with its first $K-1$ elements, $p_{\setminus K}$. We construct the asymptotic confidence region as $N (\hat{p}_{\setminus K} - p_{\setminus K})^\top \left[\mathrm{diag}(\hat{p}_{\setminus K}) - \hat{p}_{\setminus K}\hat{p}_{\setminus K}^\top\right]^{-1} (\hat{p}_{\setminus K} - p_{\setminus K}) \le z$, where $z$ is an appropriate critical value of the $\chi^2$ distribution. A visualization of the resulting region is given in Figure 2(c). As before, each element of $p$ is polynomializable, leading to a single confidence constraint that can be straightforwardly incorporated into the optimization routine.

Figure 9: Polynomial confidence regions in a binary graph. Panel (a) presents a causal graph in which binary X causes binary Y, but both are confounded by an unobserved U. $N = 1{,}000$ observations are drawn from this DGP, producing an empirical distribution with proportions $\frac{1}{N}\sum_{i=1}^{N} 1(X_i = x, Y_i = y)$. Panels (b-c) depict confidence regions for $\Pr(X_i = 0, Y_i = 0)$, $\Pr(X_i = 0, Y_i = 1)$, and $\Pr(X_i = 1, Y_i = 0)$; the final category, $\Pr(X_i = 1, Y_i = 1)$ (not depicted), is determined by the constraint that the proportions sum to unity. Panel (b) shows the Bernoulli-KL confidence region, which is conservative in finite samples and can be polynomialized as a set of linear inequalities. Panel (c) shows the Gaussian confidence region, which is asymptotically valid and can be polynomialized as a single convex quadratic inequality.
For ease of reference, we duplicate Figure 2 in Figure 9, below. This figure depicts these regions visually for a simple two-node graph, shown in Figure 9(a). The resulting Bernoulli-KL and Gaussian confidence regions are depicted in Figure 9(b-c).
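The per-event Bernoulli-KL intervals can be computed by numerically inverting the KL equation. A sketch using scipy's root bracketing, with hypothetical inputs:

```python
import math
from scipy.optimize import brentq

def kl_bern(phat, p):
    """KL divergence between Bernoulli(phat) and Bernoulli(p)."""
    t1 = 0.0 if phat == 0 else phat * math.log(phat / p)
    t2 = 0.0 if phat == 1 else (1 - phat) * math.log((1 - phat) / (1 - p))
    return t1 + t2

def bernoulli_kl_interval(phat, N, K, alpha):
    """Per-event interval [p_lo, p_hi] of the Bernoulli-KL region: the set
    of p with KL([1-phat, phat], [1-p, p]) at most (1/N) log(2K/(1-alpha)).
    The KL divergence is decreasing in p below phat and increasing above it,
    so each endpoint is a one-dimensional root-finding problem."""
    thresh = math.log(2 * K / (1 - alpha)) / N
    f = lambda p: kl_bern(phat, p) - thresh
    tiny = 1e-12
    p_lo = 0.0 if phat == 0 or f(tiny) <= 0 else brentq(f, tiny, phat)
    p_hi = 1.0 if phat == 1 or f(1 - tiny) <= 0 else brentq(f, phat, 1 - tiny)
    return p_lo, p_hi

# K = 4 atomic events (binary X and Y), N = 1000, 95% joint coverage
lo, hi = bernoulli_kl_interval(phat=0.3, N=1000, K=4, alpha=0.95)
print(round(lo, 3), round(hi, 3))  # an interval strictly containing 0.3
```

Repeating this for each of the $K$ events yields the box $\prod_k [\underline{p}_k, \overline{p}_k]$, which enters the polynomial program as $2K$ linear inequality constraints.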
Finally, we describe how arbitrary confidence regions, such as the optimal level-set regions of Malloy et al. (2020) or the exact finite-sample regions of Kuchibhotla et al. (2021), can be polynomialized. At a high level, the proposed method uses a circumscribing polytope, adding faces along the region's principal axes until the desired tightness is achieved.
One possible approach is to enumerate candidate values of $p$ along a fine grid, assess each candidate for membership in the confidence region, and compute the convex hull of the non-rejected points. This procedure produces a system of linear inequalities describing the hull facets. However, it is infeasible for even moderately sized problems, as the time complexity of hull construction can grow exponentially in the dimension of the space, $K$ (Ottmann et al., 1995). Our approach builds on this basic intuition by circumscribing the complex confidence region with a larger, more tractable polytope. We compute the principal components of the non-rejected points, then identify the two extreme non-rejected points along each axis. Each principal axis is the normal vector for two boundary planes, and each extreme point along that axis defines a boundary-plane offset. By repeating this procedure along each principal axis, we obtain a circumscribing confidence region: a parallelepiped that contains the KL confidence region. The gap between the two confidence regions can be rapidly approximated using the number of grid points that lie in the circumscribing region but not in the original confidence region. By slicing the simplex along additional directions, such as convex combinations of principal axes, this gap can be tightened to arbitrary precision.
The resulting polytope defines a system of linear inequalities that can then be incorporated into the polynomial program.
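The principal-axis construction can be sketched as follows, assuming the non-rejected grid points are already in hand (the point cloud here is synthetic):

```python
import numpy as np

def circumscribing_polytope(points):
    """Build linear inequalities A @ p <= b for a polytope containing all
    non-rejected points, with faces normal to the principal axes of the
    point cloud (two parallel boundary planes per axis)."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    # principal axes = eigenvectors of the sample covariance matrix
    cov = np.cov((pts - center).T)
    _, axes = np.linalg.eigh(cov)
    A_rows, b_vals = [], []
    for axis in axes.T:
        proj = pts @ axis
        A_rows += [axis, -axis]          # two parallel boundary planes
        b_vals += [proj.max(), -proj.min()]  # offsets from extreme points
    return np.array(A_rows), np.array(b_vals)

# toy cloud standing in for the non-rejected grid points
rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.5], [0.0, 0.3]])
A, b = circumscribing_polytope(cloud)
# every point satisfies all inequalities of the circumscribing polytope
print(bool((cloud @ A.T <= b + 1e-9).all()))  # True
```

The rows of `A` and entries of `b` are exactly the linear inequalities passed to the polynomial program; additional slicing directions would simply append further rows.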

D Details of Simulated Models
In this section, we detail all models presented in Section 8. For simplicity, all main variables in these models are binary. Simulation parameters are described in terms of principal strata.
Principal strata can take one of three forms, depending on the number of parents of the relevant variable. Below, we provide compact notation for referring to these principal strata. Subsequent sections report strata probabilities for each simulation, including joint distributions over strata for multiple variables where confounding exists.
1. Variables with no parents, which have two strata. Consider a hypothetical variable $X_i$ with no parents, as in Figure 5(a). We use $x_0$ to denote units with $X_i(\emptyset) = 0$ and $x_1$ to denote units with $X_i(\emptyset) = 1$.

2. Variables with a single parent, which have four strata. Consider a hypothetical variable $Y_i$ influenced by parent $X_i$, also depicted in Figure 5(a). For compactness, we adopt the convention that counterfactual manipulations of parent variables are presented in the form $y_{Y_i(X_i=0),\, Y_i(X_i=1)}$. For example, (i) we use $y_{00}$ to denote "never takers" with $Y_i(X_i = 0) = 0$ and $Y_i(X_i = 1) = 0$. Similarly, (ii) $y_{01}$ denotes "compliers" with $Y_i(X_i = 0) = 0$ and $Y_i(X_i = 1) = 1$; (iii) $y_{10}$ denotes "defiers" with $Y_i(X_i = 0) = 1$ and $Y_i(X_i = 1) = 0$; and (iv) $y_{11}$ denotes "always takers" with $Y_i(X_i = 0) = 1$ and $Y_i(X_i = 1) = 1$.
3. Variables with two parents, which have sixteen strata. Consider a hypothetical variable $Y_i$ influenced by parents $Z_i$ and $X_i$, as in Figure 3(a). Extending the convention described above, we denote these strata in compact forms ranging from $y_{0000}$ to $y_{1111}$. Specific definitions are provided in Table 2.
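The strata counts above follow from enumerating response functions: one stratum per map from parent configurations to values. A small sketch:

```python
from itertools import product

def principal_strata(n_parents, n_values=2):
    """Enumerate principal strata of a discrete variable: one stratum per
    response function mapping each configuration of the parents to a value
    of the variable."""
    configs = list(product(range(n_values), repeat=n_parents))
    # each stratum is a tuple giving the response for every parent config
    return list(product(range(n_values), repeat=len(configs)))

assert len(principal_strata(0)) == 2    # x_0, x_1
assert len(principal_strata(1)) == 4    # y_00, y_01, y_10, y_11
assert len(principal_strata(2)) == 16   # y_0000 through y_1111
```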

D.1 Noncompliance Simulation
In this section, we describe the DGP for our noncompliance simulation analyzed in Section 8.1. The DGP follows the model of Figure 3(b), reproduced below for ease of reference.
Simulation parameters are reported in terms of the joint distribution over principal strata.

D.2 Outcome-Based Selection Simulation
In this section, we describe the DGP for our outcome-based selection simulation, analyzed in Section 8.3 and Figure 6(a). The DGP follows the model of Figure 5(a).

D.3 Measurement Error Simulation
In this section, we describe the DGP for our measurement error simulation, analyzed in Section 8.3 and Figure 6(b). The DGP follows the model of Figure 5(b).