Bayesian decision theory for tree-based adaptive screening tests with an application to youth delinquency

Crime prevention strategies based on early intervention depend on accurate risk assessment instruments for identifying high risk youth. It is important in this context that the instruments be convenient to administer, which means, in particular, that they should also be reasonably brief; adaptive screening tests are useful for this purpose. Adaptive tests constructed using classification and regression trees are becoming a popular alternative to traditional Item Response Theory (IRT) approaches for adaptive testing. However, tree-based adaptive tests lack a principled criterion for terminating the test. This paper develops a Bayesian decision theory framework for measuring the trade-off between brevity and accuracy, when considering tree-based adaptive screening tests of different lengths. We also present a novel method for designing tree-based adaptive tests, motivated by this framework. The framework and associated adaptive test method are demonstrated through an application to youth delinquency risk assessment in Honduras; it is shown that an adaptive test requiring a subject to answer fewer than 10 questions can identify high risk youth nearly as accurately as an unabridged survey containing 173 items.

Our motivating application is the design of a brief screening test to identify youth who are at high risk of falling into delinquent behavior, so that they may receive additional social support intended to mitigate that risk. Specifically, we consider data from Honduras, where decades of political, civil, and economic instability have made gang recruitment and violent crime a major concern (Meyer (2019), UNODC (2018)). Certain targeted interventions, such as family counseling and community support resources, have demonstrated significant promise in reducing risk factors of criminal behavior for at-risk youth in Honduras (Katz et al. (2021)). In order to allocate these limited resources in an effective way, a screening instrument is deployed to identify youth with the highest risk of delinquency.
2. Previous work. This paper brings together three distinct strands of research. First, we add to a well-established literature on youth risk assessment, specifically its application to crime prevention programs aimed at youth in Central America. In this context we reanalyze data from Honduras and show that an adaptive screening test consisting of only a handful of items can provide comparably accurate risk assessment to a questionnaire with over a hundred questions. Second, we build on recent work using classification trees to design abridged diagnostic tests. To this literature we add a principled approach to constructing the tree and determining its maximum depth, using ideas from Bayesian decision theory to evaluate the trade-offs between instruments of different lengths. Finally, this Bayesian decision theoretic approach is a natural extension of ideas developed in recent work on utility-based posterior summarization. Here, we apply these ideas in the novel context of adaptive screening tests.
2.1. Youth risk assessment in Honduras and elsewhere. For a broad overview of the difficulties facing youth in Honduras, please consult Berk-Seligson et al. (2014). Here we focus on risk assessment tools used in crime prevention, which has recently gained momentum as an effective alternative to more aggressive suppression strategies.
As a key component of crime prevention, so-called "secondary prevention" programs identify individuals within high risk communities who are at an especially high risk for criminal activity, and provide them with targeted interventions. To effectively execute this secondary prevention strategy, high risk youth must first be identified via a screening tool and are subsequently enrolled in the intervention.
For the model utilized between 2013 and 2015 in Honduras specifically, high risk youth were first identified using a Spanish adaptation of the Youth Services Eligibility Tool (YSET) (Hennigan et al. (2014)), and then enrolled in a seven-module family counseling program. This model represented the first time empirical data was utilized for identifying the youth with the highest risk of criminal behaviors. Following initial successes, a more locally focused risk assessment tool was created, incorporating screening tools from around the world. The data for the present work consists of responses to this revised Honduran YSET, and is described more fully in Section 4.1.
The risk assessment tools utilized in Honduras are based on a large body of research surrounding the risk factor paradigm. Risk factors are characteristics that increase the likelihood of a given problem behavior, whereas protective factors are ones that reduce this likelihood (Arthur et al. (2002)). These factors are typically categorized under domains such as community, family, school, peer, and individual (Howell and Jr. (2005)). See Table 1 in Section 2 of the Supplementary Material (Krantsevich et al. (2022)) for a list of risk and protective factors measured in our work.
The risk factor paradigm entered the youth delinquency sphere in 1992 with Hawkins, Catalano and Miller (1992), who provided a comprehensive review of the literature on risk and protective factors related to substance abuse in adolescents. In subsequent years, multiple groups developed youth risk assessment tools, including three that were used to expand the item bank for the revised Honduran YSET: the Communities That Care (CTC) Youth Survey (Arthur et al. (2002), Arthur et al. (2007)), the Eurogang Youth Survey (Weerman et al. (2009)), and the Youth Services Eligibility Tool (YSET) (Hennigan et al. (2014), Hennigan et al. (2015)), a Los Angeles-specific adaptation of the empirically developed Gang Risk of Entry Factors instrument.
While these instruments have been deployed in countries around the world, they were largely developed for use in the United States and Europe. Research on youth risk assessments for secondary prevention programs within developing countries includes Katz and Fox (2010) and Maguire, Wells and Katz (2011), focusing on the Caribbean nation of Trinidad and Tobago, and Webb, Nuño and Katz (2016), focusing on the Northern Triangle nation of El Salvador. These works include protective factors in addition to risk factors, which provide an avenue to learn about positive interventions the community can undertake. For more information on risk and protective factors in low- and middle-income countries, we refer the reader to the systematic reviews of Murray et al. (2018) on risk and protective factors for antisocial behavior, and Higginson et al. (2018) on risk and protective factors related to gang membership.

2.2. Adaptive testing.
An adaptive test is one where the next question a subject encounters depends on her answer to the previous question or questions. Adaptive tests are a powerful approach for developing shortened risk assessments, because while any given test taker may only see a small number of questions, the wide variety of available questions allows different subjects to be classified more accurately than if every subject was administered the same small number of questions. Traditionally, adaptive tests have been constructed using item-response theory (IRT), which requires estimating the latent constructs of each test taker at the time of testing (Wainer (2000)). Examples of IRT-based adaptive testing in the academic or personnel selection setting include the Graduate Management Admission Test (Rudner (2010)), the Graduate Record Examination (Almond and Mislevy (1998)), and the Armed Services Vocational Aptitude Battery (Sands, Waters and McBride (1997)).
The cornerstone of IRT is the Item Response Function (IRF), which describes, for each item in the item bank, how the participant's response to that item depends on their value of the latent trait being measured (delinquency risk). An adaptive test designed based on IRT proceeds by starting with an initial risk level, administering the most informative item (based on the participant's risk estimate and each item's IRF), updating the risk estimate based on their response, and iterating until a stopping criterion is satisfied. Deploying an IRT-based adaptive test requires several specifications: the IRF family (and calibrating individual item parameters), the algorithm for estimating the latent trait, the criterion for selecting successive items, and the stopping criterion (Chang (2004), Chang (2015), van der Linden (2008), Wainer (2000)).
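The loop just described can be sketched in a few lines. This is only an illustration under stated assumptions, not any of the cited operational systems: it assumes a two-parameter logistic (2PL) IRF, a standard normal prior on the latent trait evaluated on a grid, maximum Fisher information item selection, and a fixed-length stopping rule. The item parameters and all function names are hypothetical.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: P(endorse item | latent trait theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = irf_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def run_adaptive_test(answer_fn, a, b, max_items=5):
    """Administer items one at a time; return (trait estimate, items used).

    answer_fn(item_index) -> 0 or 1 is the subject's response.
    Stopping rule (an assumption): stop after max_items items.
    """
    grid = np.linspace(-4.0, 4.0, 161)
    posterior = np.exp(-0.5 * grid ** 2)        # N(0, 1) prior, unnormalized
    posterior /= posterior.sum()
    estimate = float(np.sum(grid * posterior))  # initial trait level
    remaining = list(range(len(a)))
    used = []
    while remaining and len(used) < max_items:
        # Select the unused item that is most informative at the estimate.
        info = [item_information(estimate, a[i], b[i]) for i in remaining]
        item = remaining.pop(int(np.argmax(info)))
        response = answer_fn(item)
        # Bayes update of the trait posterior given the response.
        p = irf_2pl(grid, a[item], b[item])
        posterior *= p if response == 1 else (1.0 - p)
        posterior /= posterior.sum()
        estimate = float(np.sum(grid * posterior))
        used.append(item)
    return estimate, used

# Hypothetical item bank: discriminations a and difficulties b.
a = np.array([1.2, 0.8, 1.5, 1.0, 2.0, 0.9])
b = np.array([-1.0, 0.0, 0.5, 1.0, 0.0, -0.5])
estimate, used = run_adaptive_test(lambda i: 1, a, b, max_items=3)
```

Note that the posterior update inside the loop must run at test time, which is the real-time computation that a precomputed tree avoids.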
For a comprehensive overview on IRT, see van der Linden and Hambleton (1997). We also refer interested readers to Hambleton, Swaminathan and Rogers (1991), Embretson and Reise (2000), and de Ayala (2009) for more accessible treatments of the subject.
There are two major downsides to IRT-based adaptive tests in our application. One, the real-time estimation of the latent trait parameter requires intense computational resources, necessitating test administration via a laptop or other computer and making the screening process more challenging. Two, as noted by  and Zheng, Cheon and Katz (2020), many constructs in practical screening or diagnostic contexts are multidimensional. While multidimensional extensions of IRT-based adaptive tests to a handful of dimensions exist (Frey and Seitz (2009), Gibbons et al. (2016), Paap et al. (2017), Chang (2011), Wang, Chang and Boughton (2012), Yao, Pommerich and Segall (2014)), these are likely unsuitable for capturing the relationship between the 38 different scales in our risk assessment.
2.2.1. Adaptive testing using classification trees. Tree-based adaptive tests, constructed entirely beforehand using classification trees, are a recently explored alternative to IRT-based tests, and have already been used for measuring a variety of medical and behavioral constructs. To our knowledge, Zheng, Cheon and Katz (2020) present the first tree-based adaptive test used for assessing youth risk of delinquency; the authors utilized item-response data from crime prevention programs in Honduras, comparing the performance of several tree-based adaptive tests fit using the CART algorithm (Breiman et al. (1984)), including one fit to synthetic data generated using the Synthetic Minority Over-sampling Technique (SMOTE, Chawla et al. (2002)).
The tree-based approach to adaptive testing involves collecting responses to a large number of items, as well as a true outcome measurement; a classification tree is then fit to this data to maximize predictive accuracy. Specifically, the Classification And Regression Trees (CART) algorithm, introduced by Breiman et al. (1984), is applied to the item response-outcome data.
Here we review the basics for reference. A modern survey of CART and other tree-growing methods can be found in Loh (2011).
A classification or regression tree $T$ partitions a covariate space $\mathcal{X}$ into $k$ disjoint hypercubes, $A_1, A_2, \ldots, A_k$, by repeatedly splitting $\mathcal{X}$ one variable at a time. Each internal node of a final fitted tree contains a splitting variable and an associated cutpoint, $x_i \le b$. The number of leaf nodes $k$ corresponds to the size of the partition, and the data stored in each leaf node of the fitted tree represents the output of the tree function. In a regression tree, values $\mu_1, \ldots, \mu_k \in \mathbb{R}$ are associated to the $k$ leaf nodes, so that $x \in A_j$ implies $T(x) = \mu_j$, $1 \le j \le k$. In a classification tree with $c$ classes, the $j$th leaf node contains a probability distribution $\{p_{1j}, p_{2j}, \ldots, p_{cj}\}$ over the classes, and for $x \in A_j$, $T(x)$ is the class with the highest probability: $T(x) = \arg\max_{i \in \{1, \ldots, c\}} p_{ij}$.
In the classic CART algorithm, the tree is fit to data with the goal of minimizing node impurity according to a given criterion (e.g. mean squared error for regression and Gini index for classification). The algorithm proceeds by first growing a very deep tree, then pruning back to the final tree. See Breiman et al. (1984) for details on the CART growing and pruning algorithms.
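To make the impurity criterion concrete, the sketch below computes the Gini index of a node and picks the cutpoint on a single item that minimizes the weighted impurity of the two child nodes. It illustrates only one greedy split, not the full CART growing and pruning procedure; the data and function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(responses, labels, cutpoint):
    """Weighted Gini impurity after splitting on response <= cutpoint."""
    left = [y for x, y in zip(responses, labels) if x <= cutpoint]
    right = [y for x, y in zip(responses, labels) if x > cutpoint]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_cutpoint(responses, labels):
    """Cutpoint among observed response values with the lowest impurity."""
    candidates = sorted(set(responses))[:-1]  # the largest value splits nothing
    return min(candidates, key=lambda c: split_impurity(responses, labels, c))

# Hypothetical responses to one 5-point item and true risk classes.
responses = [1, 2, 2, 3, 4, 4, 5, 5]
labels = [0, 0, 0, 0, 1, 1, 1, 0]
cut = best_cutpoint(responses, labels)
```

In this toy data the split at a response of 3 isolates all four low responders, who are all "not at-risk", in a pure left node.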
To use a tree as an adaptive test, items are used as splitting variables and item responses as cutpoints. After fitting the classification tree to item response-risk class data, a new subject takes the tree-based adaptive test by first answering the root node item, then moving right or left according to their response and the cutpoint. Subsequent items are administered based on the pattern of item responses. The assessment ends when the subject lands in a terminal or "leaf" node, with their predicted risk class being the one assigned to that leaf node. Alternatively, one can use the "at-risk" probability stored in the leaf node as the tree output, and assign a risk class of "at-risk" to youth above a certain probability, with this cutoff determined separately. See Figure 1 for an example of the latter approach, with two items.
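The administration procedure can be sketched as a walk down a nested data structure. The tree below is a hypothetical two-item test; its cutpoints, its leaf probabilities other than 79%, and the 0.5 cutoff are invented for illustration and are not read off the fitted tree in the paper.

```python
def administer(tree, answer_fn):
    """Walk a tree-based adaptive test.

    tree: nested dicts; internal nodes have keys "item", "cutpoint",
          "left", "right"; leaves have key "p_at_risk".
    answer_fn(item) -> numeric response to that item.
    Returns (at-risk probability, distinct items administered, in order).
    """
    node, asked, answers = tree, [], {}
    while "p_at_risk" not in node:
        item = node["item"]
        if item not in answers:  # ask each item at most once
            answers[item] = answer_fn(item)
            asked.append(item)
        # Go left when the response is at or below the cutpoint.
        node = node["left"] if answers[item] <= node["cutpoint"] else node["right"]
    return node["p_at_risk"], asked

# Hypothetical two-item test; cutpoints and probabilities are illustrative.
tree = {
    "item": "Question 1", "cutpoint": 2,
    "left": {"p_at_risk": 0.10},
    "right": {
        "item": "Question 2", "cutpoint": 3,
        "left": {"p_at_risk": 0.45},
        "right": {"p_at_risk": 0.79},
    },
}
responses = {"Question 1": 3, "Question 2": 4}
p, asked = administer(tree, responses.get)
risk_class = int(p >= 0.5)  # cutoff of 0.5 chosen only for illustration
```

Because the whole tree is fixed in advance, administration needs only a lookup at each step, so the test could in principle be delivered on paper.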
In the past several years, multiple groups have experimented with tree-based adaptive tests in various settings, including measuring quality of life in Multiple Sclerosis patients (Michel et al. (2018)), predicting risk of suicide attempt (Delgado-Gomez et al. (2016)) and reproducing a clinician's diagnosis of depression ).  depart from traditional tree-growing approaches by fitting the tree to a large amount of artificial data, generated as follows (Gibbons and Wang): first, item response vectors are created via local perturbations. Next, a Random Forest model (Breiman (2001)) is fit to the original data, and used to predict artificial outcome classes for the artificial item response vectors. A single classification tree is then fit to this large artificial dataset and the fitted tree is used as the adaptive test.  and  claim that the use of artificial data increases stability of the final classification tree, although they do not discuss details. Our method, while also utilizing artificially generated data, does so for fundamentally different reasons rooted in past work on posterior summarization for model selection (see Section 2.3). Synthetic data in our case is used to approximate a posterior distribution of a utility function, and we select the optimal decision tree according to this utility. Further details are discussed in Section 3.2.
FIG 1. A tree-based adaptive test. The splitting variables are X = {Question 1, Question 2}. A subject with responses of 3 and 4, respectively, would land in the right-most leaf node and have "at-risk" probability 79%.
While tree-based adaptive tests have several advantages over IRT, including ease of deployment and fewer modeling assumptions, there is no clear standard to determine how deep to grow the tree, or in other words, when to terminate the test; instead, this choice is made by the default regularization parameters in the tree-growing software. The exam length has important implications; in particular, shortening the exam too much can lead to unacceptable levels for instrument sensitivity and specificity. However, to the best of our knowledge there is no standard stopping criterion for a tree-based adaptive test to ensure a certain sensitivity and specificity.
2.3. Utility-based posterior summarization. A recent line of research has recast the problem of variable or model selection as one of posterior summarization. The idea is to find a single model-summarizing "action" that minimizes a penalized loss function which favors parsimony. In the present context, the idea is to find a shortened screening test that is suitably accurate relative to the non-shortened instrument. This line of work began with Hahn and Carvalho (2015) for linear regression models and has been expanded in various directions subsequently (Bashir et al., 2019;Puelz, Hahn and Carvalho, 2017;Woody, Carvalho and Murray, 2019). The technique explored in these papers is a two stage process: first, a highly flexible and accurate model is fit; then, draws from the posterior distribution are projected onto simpler structures, producing low-dimensional model summaries. In this way, an analyst may visualize how much accuracy (however that is defined) relative to an "ideal" non-simplified model. In the risk assessment setting studied here, the "ideal" non-simplified model is a non-shortened screening instrument that incorporates responses to every item in the item bank in order to predict the probability of exhibiting delinquent behavior. In this step, we may use any state-of-the-art predictive algorithm to obtain an "at-risk" probability estimate. Then, we consider the trade-off in model accuracy that is made by administering a greatly-shortened adaptive test, in which each subject sees only a small number out of the many items available. We use a Bayesian decision theory framework to formalize these trade-offs.

3. A decision theory perspective on adaptive screening. In the following sections, we recall general elements of Bayesian decision theory, then explain its particular application in risk assessment of youth delinquency.
Throughout the paper, we use calligraphic $\mathcal{X}$ and $\mathcal{Y}$ to denote the support of item response vectors and risk classes, upper-case $X$ and $Y$ to denote a random vector/variable representing an item response vector or risk class, and lower-case $x$ and $y$ to denote a single instantiation of the random vector/variable. As is standard in Bayesian statistics, we treat model parameters as random variables, rather than fixed parameters of the data generating process of $X$ and $Y$; the random vector of all unknown model parameters is represented by $\theta$, with $\Theta$ being its support and $\theta^{(j)}$ a single instantiation. When referring to observed data, we use subscripts, $x_{1:n}$ and $y_{1:n}$; synthetic data are denoted by $\tilde{x}$ and $\tilde{y}$.
3.1. Review of Bayesian decision theory. In this section we provide an introduction to Bayesian decision theory using the terminology of Parmigiani and Inoue (2010), according to which an analyst chooses from among a set of actions, Γ. Each action γ : X → {0, 1} has consequences that depend on an unknown state of the world, y ∈ Y. In order to evaluate the merits of possible actions, a quantitative value is assigned to each possible (action, state) pair, either a utility value U (γ(x), y) or a loss value L(γ(x), y). With the utility function framework, which we employ, the analyst chooses the action that maximizes (in some sense) a utility.
We adopt the expected utility principle, which implies that the chosen action maximizes expected utility over a target population with density $f(x, y)$. This expected utility is $EU(\gamma) = \int U(\gamma(x), y)\, f(x, y)\, dx\, dy$, and the optimal action is $\gamma^* = \arg\max_{\gamma \in \Gamma} EU(\gamma)$.
To summarize, our decision theory formulation consists of: 1. A utility function U. 2. A target population defined by a distribution function F X,Y . 3. A set of actions, denoted Γ.
These three elements come together in defining our expected (integrated) utility $EU(\gamma) = E[U(\gamma(X), Y)]$, where $\gamma \in \Gamma$ and $E(\cdot)$ denotes expectation with respect to $F_{X,Y}$. In our application to youth risk assessment, an action $\gamma$ is a tree-based adaptive screening test, which takes the youth's item responses $x \in \mathcal{X}$ and assigns an outcome of either "at-risk" ($\gamma(x) = 1$) or "not at-risk" ($\gamma(x) = 0$), determining enrollment into a secondary prevention program. We apply the preceding framework to the youth delinquency problem as follows:
1. Our utility function, $U$, is a weighted average of sensitivity and specificity.
2. Our target population is the group of youth to be screened for risk of problem behaviors. We let $f(x, y)$ denote the joint density function of item responses and risk status for youth in the target population.
3. Our set of actions, $\Gamma$, is a collection of candidate screening tests of varying lengths. (This action space will be populated using a tree growing algorithm, detailed later.)
Sections 3.2.1, 3.2.2 and 3.2.3, respectively, describe these steps in greater detail.
In practice, the density function $f(x, y)$ is unknown and must be estimated from available data. To do so, we will parametrize $f$ by a vector $\theta$, which we will estimate via Bayesian inference. We choose a prior $\pi(\theta)$ and, after conditioning on data $(x_{1:n}, y_{1:n})$, arrive at a posterior $\pi(\theta \mid x_{1:n}, y_{1:n})$. Rather than integrating over the estimation uncertainty in $\theta$ as would be done in traditional Bayesian decision theory, we will instead consider posterior uncertainty of the utility $EU(\gamma, \theta)$, defined as $EU(\gamma, \theta) = \int U(\gamma(x), y)\, f(x, y \mid \theta)\, dx\, dy$. As a function of $\theta$, $EU(\gamma, \theta)$ is itself a random variable, which we denote $EU_\theta(\gamma)$ for notational convenience. In this paper, we will be interested in the posterior distribution of $EU_\theta(\gamma)$ induced by the posterior distribution over $\theta$.
3.2. The adaptive screening decision problem. Here we describe how the three steps of the Bayesian decision theory framework are applied to adaptive screening tests for youth delinquency. Recall that our set of actions $\Gamma$ consists of adaptive screening tests $\gamma$ for assessing risk of youth delinquency. Each test $\gamma$ consists of two parts:
1. A binary tree $T : \mathcal{X} \to (0, 1)$ representing the screening test (see Figure 1, right). $T$ predicts the "at-risk" probability $T(x)$, given item responses $x \in \mathcal{X}$.
2. A threshold function $\mathrm{Thr}_C : (0, 1) \to \{0, 1\}$ that maps the probability $T(x)$ to a risk status prediction via a cutoff $C \in [0, 1]$: $\mathrm{Thr}_C(p) = 1$ if $p \ge C$, and $\mathrm{Thr}_C(p) = 0$ otherwise.
Put together, the adaptive test is $\gamma(\cdot) = \mathrm{Thr}_C(T(\cdot))$, where $\gamma(x) \in \{0, 1\}$ for any given set of item responses $x$. The framework described in the next three sections provides a way to compare different youth risk assessments of this form.
3.2.1. Specifying the utility function.
Step 1 of the framework is specifying a utility function U . The adaptive test γ should maximize EU θ (γ), the expectation of U with respect to the density f (x, y) (which is parameterized by θ) over our target population.
In our application, we want the utility function to carry practical significance for the adaptive test. Two important quantities are sensitivity and specificity, which measure the true positive rate and true negative rate, respectively: Sensitivity = Pr(γ(X) = 1 | Y = 1), Specificity = Pr(γ(X) = 0 | Y = 0).
As a reminder, γ is an adaptive test mapping item responses X to a risk status class Y ("at-risk" means Y = 1, "not at-risk" means Y = 0). Ideally, sensitivity and specificity would both be 1. In practice, there is a trade-off between these two quantities, based on the cutoff C. A high cutoff means that many predicted probabilities will be below the threshold and consequently labeled "not at-risk", leading to high specificity and low sensitivity. A low cutoff leads to more "at-risk" class predictions, increasing sensitivity and reducing specificity. This trade-off can be visualized in a Receiver Operating Characteristic (ROC) curve, shown in Figure 2.
To incorporate the importance of both sensitivity and specificity, our expected utility $EU_\theta(\gamma)$ is equal to a weighted average of the two, for a user-selected weight $w \in (0, 1)$:
$$EU_\theta(\gamma) = w \cdot \mathrm{Sensitivity} + (1 - w) \cdot \mathrm{Specificity}. \tag{3}$$
See Section 1 of the Supplementary Material (Krantsevich et al. (2022)) for the point-wise (individual) specification of the utility function $U$ which induces this expected (population level) utility. For $w = 0.5$, this utility function can be directly visualized within a ROC curve as shown in Figure 3.
FIG 3. The maximizer of (3) for $w = 0.5$ is the point on the ROC curve that maximizes the perimeter of the shaded rectangle. The height and width of the rectangle are sensitivity and specificity, respectively, for that cutoff.
Since the final adaptive tree and the associated sensitivity and specificity depend strongly on w, we recommend carefully selecting the value of the weight in conjunction with stakeholders who understand the implications of favoring sensitivity or specificity for the population where the test will ultimately be deployed. Multiple values of w can and should be examined via the methods presented in Section 5.
With this utility function and a value of w specified, the optimal action γ * is the tree-based adaptive test (i.e., "at-risk" probability prediction and associated cutoff) that maximizes this weighted average.
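Since the expected utility of a given action $\gamma$ is a simple expression at the population level, we can evaluate $EU_\theta(\gamma)$ over a sample from the target population by directly computing sensitivity and specificity of $\gamma$ for a particular set of item responses and true risk classes. To be more specific, after drawing a sample $\{\tilde{x}_{ij}, \tilde{y}_{ij} \mid \theta^{(j)}\}_{i=1}^{N}$ from the target population (where $\tilde{x}_{ij}$ is an item response vector, $\tilde{y}_{ij}$ is the risk class, and $\theta^{(j)}$ is a single fixed draw from the posterior $\pi(\theta \mid x_{1:n}, y_{1:n})$; see Section 3.2.2), we compute a draw $EU_{\theta^{(j)}}(\gamma)$ as
$$EU_{\theta^{(j)}}(\gamma) = w \, \frac{\sum_{i=1}^{N} \mathbb{1}\{\gamma(\tilde{x}_{ij}) = 1, \tilde{y}_{ij} = 1\}}{\sum_{i=1}^{N} \mathbb{1}\{\tilde{y}_{ij} = 1\}} + (1 - w) \, \frac{\sum_{i=1}^{N} \mathbb{1}\{\gamma(\tilde{x}_{ij}) = 0, \tilde{y}_{ij} = 0\}}{\sum_{i=1}^{N} \mathbb{1}\{\tilde{y}_{ij} = 0\}}. \tag{4}$$
In the next section, we describe how to sample from the target population in order to obtain draws of $EU_\theta(\gamma)$ for any given action $\gamma$.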
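As a concrete illustration of this computation, the sketch below evaluates the weighted average of empirical sensitivity and specificity for one sample. It assumes that the weight $w$ multiplies sensitivity and that a probability at or above the cutoff is labeled "at-risk"; the function names and the numbers are hypothetical.

```python
import numpy as np

def threshold(prob, cutoff):
    """Thr_C: label 1 ("at-risk") when the probability is at least the cutoff."""
    return (np.asarray(prob) >= cutoff).astype(int)

def expected_utility(pred, truth, w):
    """w * sensitivity + (1 - w) * specificity for 0/1 arrays.

    Assumes both classes occur in truth; w weighting sensitivity is an
    assumption of this sketch.
    """
    pred, truth = np.asarray(pred), np.asarray(truth)
    sensitivity = np.mean(pred[truth == 1] == 1)
    specificity = np.mean(pred[truth == 0] == 0)
    return w * sensitivity + (1 - w) * specificity

# Hypothetical tree outputs T(x) and synthetic risk classes for one draw.
tree_prob = [0.9, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6, 0.05]
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred = threshold(tree_prob, cutoff=0.5)
eu_equal = expected_utility(pred, truth, w=0.5)
eu_sens = expected_utility(pred, truth, w=0.8)
```

Here sensitivity is 0.5 and specificity is 0.75, so moving the weight toward sensitivity lowers the utility of this particular test.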

3.2.2. Specifying the target population.
Step 2 of the Bayesian decision theory framework is specifying a target population over which we seek to maximize EU θ (γ). In our application, that means defining a specific group of youth for which we want to optimize our adaptive test. After specifying the target population, the optimal action γ * (the "optimal" adaptive test) would maximize the weighted average of sensitivity and specificity for this group specifically. The target population can be all Honduran youth, or more specific to youth of a certain age, living in a particular neighborhood, and so forth.
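After specifying the target population, we draw synthetic samples from the joint density $f(x, y)$ of the item responses $X$ and risk class $Y$ in the target population. We use the composite model specification $f(x, y) = f(x)\, f(y \mid x)$ and specify the random variable $\theta$ parameterizing $f(x, y)$ as $\theta = (\theta_X, \theta_Y)$, with $\theta_X$ parameterizing $f(x)$ and $\theta_Y$ parameterizing $f(y \mid x)$. This specification allows for additional flexibility in modeling the relationship between the item responses $X$ and the risk of delinquency $Y$. Practically, we draw synthetic data from $f(x, y)$ as follows:
1. Fit each component of the composite form using a Bayesian model: one model for $f(x)$ with unknown parameters $\theta_X$, and one model for $f(y \mid x)$ with unknown parameters $\theta_Y$.
2. For each posterior draw $\theta^{(j)} = (\theta_X^{(j)}, \theta_Y^{(j)})$, draw synthetic item response vectors $\tilde{x}_{ij}$, $1 \le i \le N$, from $f(x \mid \theta_X^{(j)})$.
3. For each $\tilde{x}_{ij}$, compute $\Pr(Y = 1 \mid \tilde{x}_{ij}, \theta_Y^{(j)})$ and draw $\tilde{y}_{ij}$ from a Bernoulli distribution with this probability, where $\Pr(Y = 1 \mid \tilde{x}_{ij}, \theta_Y^{(j)})$ is the synthetic probability of belonging to class $Y = 1$, and $\tilde{y}_{ij}$ is the synthetic class status.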
Taken together, we will have a sample of size $N$ for each posterior draw $\theta^{(j)}$. Since we fit two models corresponding to different components of the same composite model specification, we use a single dataset for fitting the models for $f(x)$ and $f(y \mid x)$. Modeling details are provided in Section 4.2; more specifics on sampling can be found in Appendix A. As a reminder, the "synthetic data" in this setting is merely a computational approach for evaluating the integrals at the heart of the decision theory framework.
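The sampling scheme can be sketched with deliberately simple stand-in models. Everything below is an assumption made for illustration and is not the modeling choice of Section 4.2: items are treated as independent categorical variables with Dirichlet posteriors, and the posterior draws of a logistic model for the risk class are simulated rather than obtained by fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_levels = 3, 4  # hypothetical: 3 items, responses coded 0..3
J, N = 5, 1000            # posterior draws, synthetic sample size per draw

# Stand-in for observed item responses x_{1:n}.
x_obs = rng.integers(0, n_levels, size=(200, n_items))
counts = np.stack([np.bincount(x_obs[:, k], minlength=n_levels)
                   for k in range(n_items)])

samples = []
for j in range(J):
    # theta_X^(j): one Dirichlet posterior draw of response probabilities per item.
    theta_x = np.stack([rng.dirichlet(1.0 + counts[k]) for k in range(n_items)])
    # theta_Y^(j): simulated stand-in for a posterior draw of logistic coefficients.
    intercept = rng.normal(-3.0, 0.1)
    slopes = rng.normal(0.6, 0.05, size=n_items)
    # x_tilde ~ f(x | theta_X^(j)), drawn item by item.
    x_tilde = np.column_stack([rng.choice(n_levels, size=N, p=theta_x[k])
                               for k in range(n_items)])
    # Synthetic at-risk probability, then y_tilde ~ Bernoulli(p_tilde).
    p_tilde = 1.0 / (1.0 + np.exp(-(intercept + x_tilde @ slopes)))
    y_tilde = rng.binomial(1, p_tilde)
    samples.append((x_tilde, p_tilde, y_tilde))
```

Each element of `samples` is one synthetic population of size `N` tied to a single posterior draw, which is the unit over which a draw of the expected utility is computed.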

3.2.3. Populating the action space. In this section we describe Step 3 of the framework, populating the action space $\Gamma$. In our application, $\Gamma$ consists of tree-based adaptive tests; each $\gamma \in \Gamma$ is of the form $\gamma(\cdot) = \mathrm{Thr}_{C_T}(T(\cdot))$, where $T$ is a binary regression tree and $C_T$ is the cutoff for classification into the "at-risk" group. The number of possible binary trees is much too large for brute force enumeration; many possible heuristics are available, and some procedures will lead to higher-utility screening instruments than others. Here we focus on one method for populating our action space, motivated by the Bayesian decision theory context.
We first obtain a regression tree $T$ by applying a particular tree growing algorithm (described shortly) to large Monte Carlo samples from the posterior predictive distribution $f(\tilde{x}, \tilde{y} \mid x_{1:n}, y_{1:n})$. We then choose the cutoff $C_T$ that optimizes the expected utility (3) relative to $T$ over these samples.
Our proposed heuristic for obtaining the regression tree T relies on a novel stopping criterion we call maxIPP, for "maximum Items Per Path." The maxIPP criterion denotes the maximum number of unique items in each root-to-leaf path of the decision tree defining the adaptive test, and consequently, the number of items each individual will be administered during their screening test. The tree is grown using a variation of the CART algorithm; it achieves the maxIPP constraint by restricting the items available for splitting in a given path after m unique items have been used. For details, see Appendix B.
Categorizing trees by maxIPP is useful in our context of shortening lengthy instruments. While maximum depth also limits the number of items, maxIPP allows for further splitting on items already administered, without counting them against the tree "cost".
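The distinction between depth and maxIPP can be checked on a small hypothetical tree. In the sketch below, `allowed_items` reflects one reading of the constraint described above (once a path has used m distinct items, only those items remain available for further splits on that path); it is not the growing algorithm of Appendix B, and the tree and function names are invented for illustration.

```python
def max_items_per_path(node, seen=frozenset()):
    """Largest number of distinct items on any root-to-leaf path."""
    if "item" not in node:  # leaf
        return len(seen)
    seen = seen | {node["item"]}
    return max(max_items_per_path(node["left"], seen),
               max_items_per_path(node["right"], seen))

def depth(node):
    """Number of splits on the longest root-to-leaf path."""
    if "item" not in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

def allowed_items(items_on_path, item_bank, m):
    """Items that may be used for the next split on a path under maxIPP m."""
    used = set(items_on_path)
    return set(item_bank) if len(used) < m else used

leaf = {"p_at_risk": 0.5}  # placeholder leaves
tree = {"item": "Q1", "cutpoint": 2,
        "left": {"item": "Q1", "cutpoint": 1, "left": leaf, "right": leaf},
        "right": {"item": "Q2", "cutpoint": 3,
                  "left": leaf,
                  "right": {"item": "Q1", "cutpoint": 4,
                            "left": leaf, "right": leaf}}}
```

This tree has depth 3 but maxIPP 2, because the deepest path splits on Q1 twice; a subject on that path answers only two questions.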
For a given $m$, we calibrate an approximately optimal tree with maxIPP $m$ (denoted $T^*_m$) to synthetic data drawn from the posterior of $\theta_X$ and the posterior predictive of $\tilde{X}$. Specifically, our synthetic data used for calibrating $T^*_m$ are $\{\tilde{x}_k, \bar{E}(\tilde{Y} \mid \tilde{x}_k)\}$, $1 \le k \le M$, where the second element is the posterior predictive "at-risk" probability, given $\tilde{x}_k$:
$$\bar{E}(\tilde{Y} \mid \tilde{x}_k) = \int \Pr(\tilde{Y} = 1 \mid \tilde{x}_k, \theta)\, \pi(\theta \mid x_{1:n}, y_{1:n})\, d\theta. \tag{5}$$
As a reminder, $\pi(\theta \mid x_{1:n}, y_{1:n})$ is the posterior density of $\theta$, having observed data $(x_{1:n}, y_{1:n})$. We use the term "calibrate" rather than "fit" for the process of applying the maxIPP algorithm to synthetic data, in order to reserve the term "fit" for the context of fitting the Bayesian models to real data.
Having obtained $T^*_m$, the cutoff $C_{T^*_m}$ is then optimized relative to the (unconditional) posterior predictive expected utility:
$$C_{T^*_m} = \arg\max_{C \in [0, 1]} E\big[EU_\theta(\mathrm{Thr}_C(T^*_m)) \mid x_{1:n}, y_{1:n}\big], \tag{6}$$
where the inner expression on the right-hand side (i.e., the weighted average of sensitivity and specificity of $\mathrm{Thr}_C(T^*_m)$) is approximated using synthetic samples as in (4). In summary, $T^*_m$ is our final regression tree with maxIPP $m$ that predicts the probability of being "at-risk" given a set of item responses, and $\mathrm{Thr}_{C_{T^*_m}}$ maps these probabilities to a predicted class status 0 or 1 (1 = "at-risk", 0 = "not at-risk"). The threshold is chosen relative to the specific regression tree $T^*_m$, to optimize the utility function for the target population. We use $\gamma^*_m = \mathrm{Thr}_{C_{T^*_m}}(T^*_m)$ to denote our approximately optimal tree-based adaptive test of length $m$.
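The cutoff search can be sketched as a one-dimensional grid search. The sketch assumes that synthetic samples are pooled into a single array, that the weight $w$ multiplies sensitivity, and that only the distinct leaf values of the tree need to be tried as cutoffs; the data and function names are hypothetical.

```python
import numpy as np

def utility(pred, truth, w):
    """w * sensitivity + (1 - w) * specificity; both classes must be present."""
    sens = np.mean(pred[truth == 1] == 1)
    spec = np.mean(pred[truth == 0] == 0)
    return w * sens + (1 - w) * spec

def best_cutoff(tree_prob, truth, w, grid=None):
    """Grid search for the cutoff C maximizing the weighted utility."""
    tree_prob, truth = np.asarray(tree_prob), np.asarray(truth)
    if grid is None:
        grid = np.unique(tree_prob)  # only leaf values can change predictions
    utilities = [utility((tree_prob >= c).astype(int), truth, w) for c in grid]
    best = int(np.argmax(utilities))
    return float(grid[best]), float(utilities[best])

# Hypothetical leaf probabilities T(x) and synthetic classes.
tree_prob = np.array([0.1, 0.1, 0.1, 0.3, 0.3, 0.6, 0.6, 0.8, 0.8, 0.8])
truth = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
cutoff, value = best_cutoff(tree_prob, truth, w=0.5)
cutoff_spec, _ = best_cutoff(tree_prob, truth, w=0.2)
```

With equal weights the search selects a cutoff of 0.6; shifting the weight toward specificity (w = 0.2) moves the selected cutoff up to 0.8, illustrating how the threshold depends on both the tree and the utility.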
Our action space Γ consists of one adaptive test γ * m for each value of m under consideration for a given application. We emphasize this is just one proposed heuristic for obtaining an adaptive screening test that optimizes Equation (3), while administering at most m items. One can obtain adaptive tests with m items using other tree growing methods calibrated with other synthetic or real data. Each of these can be compared using the criteria described in Section 3.2.4 before choosing a final adaptive test; see Section 4 of the Supplementary Material (Krantsevich et al. (2022)) for comparisons of several methods.

3.2.4. Comparing adaptive tests of different lengths. Once we have (at least) one action $\gamma^*_m$ for each test length $m$, we need to choose the value of $m$ for the final adaptive screening test. In general, shorter screening tests can only degrade accuracy (utility), so the relevant questions are "by how much?" and "with what statistical uncertainty?"
To address these questions we define a random variable (with respect to the posterior distribution) ∆ θ,m that characterizes the utility loss due to shortening to m questions. That is, we are interested in the difference in expected utility between that of the shortened exam EU θ (γ * m ) and that of the full, non-shortened, exam EU θ (γ * ). Here, the optimal non-shortened action is γ * (·) = Thr C * (Ē(Ỹ | ·)), whereĒ(Ỹ | ·) is as in (5), and Thr C * is optimized relative to the posterior predictive expected utility; specifically, Thr C * is optimized using Equation (6), but withĒ(Ỹ |x) in place of T * m (x). We denote this difference as . To obtain Monte Carlo samples of ∆ θ,m , for each posterior draw θ (j) compute where EU θ (j) (γ) is computed using (4). Boxplots may then be plotted for each value of m. See Figure 4 for an example with boxplots of ∆ θ,m varying the number of items m and the weight w that defines the utility function U .These utility difference plots visually represent our statistical uncertainty of the trade-offs between assessment sensitivity/specificity and length.
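As a rough illustration of how the Monte Carlo draws of the utility difference might be computed, the sketch below evaluates a shortened and a full test on one synthetic population per posterior draw and records the difference in weighted utility. It is a toy under stated assumptions: the two tests are hypothetical rules on a two-item response vector, the populations are made up, and `weighted_utility` and `utility_differences` are illustrative names rather than functions from the paper.

```python
import statistics


def weighted_utility(pred, truth, w):
    """w * sensitivity + (1 - w) * specificity for 0/1 predictions (1 = at-risk)."""
    pos = [p for p, y in zip(pred, truth) if y == 1]
    neg = [p for p, y in zip(pred, truth) if y == 0]
    sens = sum(pos) / len(pos) if pos else 0.0
    spec = sum(1 - p for p in neg) / len(neg) if neg else 0.0
    return w * sens + (1 - w) * spec


def utility_differences(populations, short_test, full_test, w):
    """One difference EU(short) - EU(full) per synthetic population (posterior draw)."""
    deltas = []
    for pop in populations:
        truth = [y for _, y in pop]
        eu_short = weighted_utility([short_test(x) for x, _ in pop], truth, w)
        eu_full = weighted_utility([full_test(x) for x, _ in pop], truth, w)
        deltas.append(eu_short - eu_full)
    return deltas


# Hypothetical tests: the full test uses both items, the short test only the first.
def full_test(x):
    return int(x[0] == 1 and x[1] == 1)


def short_test(x):
    return x[0]


# Two made-up synthetic populations of (item responses, outcome) pairs.
pop_a = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]
pop_b = [((1, 1), 1), ((1, 1), 1), ((1, 0), 0), ((0, 0), 0)]

deltas = utility_differences([pop_a, pop_b], short_test, full_test, w=0.5)
median_delta = statistics.median(deltas)
```

With many posterior draws, the list `deltas` is what a boxplot for a given test length and weight would summarize.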
3.3. Comparison to existing methods for designing adaptive tests. In Section 3.2.3 we proposed a novel algorithm for obtaining adaptive tests of different lengths to populate the action space, our second main contribution. Here we compare to existing work on tree-based adaptive tests, as an IRT-based test is not appropriate for our application (see Section 2.2.1). For comparisons between tree-based adaptive tests and IRT, see  and Zheng, Cheon and Katz (2020).
To the best of our knowledge, current tree-based adaptive tests are fit using existing algorithms like CART; built-in hyperparameters decide the length of the test, and the optimization criterion (typically the Gini index) is not specific to the adaptive testing context. Typically, the decision tree is fit to real item response and outcome data. Two exceptions are , who fit the tree to locally perturbed artificial data for increased model stability, and Zheng, Cheon and Katz (2020), who utilized SMOTE to help with class imbalance.
The purpose of synthetic data in our application is to provide an MCMC approximation of the expected utility integral over the target population. We obtain these data by modeling the high-dimensional joint density of item responses and risk outcome via two sophisticated Bayesian models, and use a context-specific utility function (i.e., a weighted average of sensitivity and specificity) for tree optimization, rather than the Gini index. Finally, our novel maxIPP stopping criterion is an application-specific design choice, exploiting the fact that items can be reused for splitting.

4. Screening for youth delinquency in Honduras.
4.1. Data. The instrument used to collect data for this project was the Instrumento de Medicion de Comportamientos (IMC), a revised version of the original Honduran YSET, which was itself a Spanish adaptation of the YSET developed by Hennigan et al. (2014). Under a collaboration with the Center for Violence Prevention and Community Safety at Arizona State University, the item bank for the Honduran YSET was expanded to include protective factors and increase the number of risk factors measured, drawing on the Communities That Care survey, Eurogang Youth Survey, and others. This revised item bank was further refined to increase predictive power in the local context.
Our data consist of responses to the IMC from 3972 school-attending youth. The IMC covers basic demographics about the youth, along with 173 items measuring 38 risk and protective factors over four domains: community, family, school and peer/individual. The risk and protective factor scales are provided in Table ??. Our variable $X$ consists of responses to these 173 items.
Our data also include answers to 18 items that measure seven problem behaviors. Three items measure violent behavior, four items measure property crime, three items measure gang involvement, three items measure alcohol and drug use, two items measure drug sales, two items measure weapons carrying, and one item measures truancy. In what follows, the outcome $Y$ is a binary variable denoting whether or not the youth is at risk of violent behavior. The three items related to violent behavior are:
1. In the past 6 months, have you hit someone with the intention of hurting them?
2. In the past 6 months, have you attacked someone with a weapon?
3. In the past 6 months, have you used a weapon or force to get money or goods from someone?
Youth are deemed to be "at-risk" (Y = 1) if they answer "yes" to any of the three items above. Items measuring the other six problem behaviors are not utilized for this analysis.
Connection to previous notation. We have responses to the 173 items X and an outcome variable Y denoting whether or not the youth is in the "at-risk" group for violent behavior ("at-risk" = 1,"not-at-risk" = 0). The variable γ denotes an adaptive test which takes the youth's responses to a subset of the 173 items and predicts a risk class. For our purposes, γ is composed of two parts: a binary decision tree T that maps item responses to a risk probability, and a threshold C that determines risk class based on risk probability. We will analyze the quality of a risk assessment γ using an expected utility function EU ; EU (γ) is a weighted average of the sensitivity and specificity of the risk assessment γ.
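To make the two-part structure of an adaptive test concrete, the following sketch administers a small tree and then applies a cutoff. It is a minimal illustration only: the tree shape, cutpoints, leaf probabilities, and cutoff are hypothetical (they are not the fitted trees of Figures 7 and 8), and the nested-dictionary representation and the `administer` function are assumptions made for this example.

```python
def administer(tree, responses, cutoff):
    """Walk the tree, asking only the items on the subject's path.

    Returns (predicted class, leaf probability, items asked), with 1 = at-risk.
    """
    asked = []
    node = tree
    while "prob" not in node:          # internal node: ask its item
        item = node["item"]
        if item not in asked:          # a reused item is only asked once
            asked.append(item)
        node = node["left"] if responses[item] <= node["cut"] else node["right"]
    return int(node["prob"] >= cutoff), node["prob"], asked


# Hypothetical tree T: split on item Yh5, then on item Yd2.
tree = {
    "item": "Yh5", "cut": 1,
    "left": {"prob": 0.10},
    "right": {
        "item": "Yd2", "cut": 3,
        "left": {"prob": 0.35},
        "right": {"prob": 0.70},
    },
}

# Hypothetical cutoff C = 0.3; subjects differ in how many items they must answer.
low_risk = administer(tree, {"Yh5": 1}, cutoff=0.3)
high_risk = administer(tree, {"Yh5": 2, "Yd2": 5}, cutoff=0.3)
```

The first subject answers a single item and is classified "not-at-risk"; the second answers two items and is classified "at-risk". This is the sense in which the test is adaptive.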
We model $f(x)$ as a Gaussian copula factor model and $f(y \mid x)$ as a logistic XBART model using the bfa and xbart packages, respectively; this model specification is quite flexible. See Appendix A for details on integrating over $f(\tilde{x}, \tilde{y} \mid \theta^{(j)})$ using the fitted models.
As with any Bayesian modeling endeavor, we recommend interrogating model quality and adjusting hyperparameters accordingly via standard posterior predictive checks, including plots to avoid model misspecification; see, for example, Gelman, Meng and Stern (1996) and Gabry et al. (2019). These model checks should be performed on training data, and not adjusted after obtaining results on hold-out or validation data. This is the approach we used to determine the number of factors in the model for f (x).

4.2.1.
Item responses. The model we use for f (x) is a Gaussian copula factor model (GCFM), proposed by  and implemented in the R package bfa. Gaussian copula factor models unite Gaussian factor models with the Gaussian copula. The joint distribution of the fitted model assumes the dependence structure of the Gaussian factor model, but with marginal distributions estimated nonparametrically from the data. The joint dependence structure of the Gaussian factor model is reasonable considering the factor-based nature of the latent constructs being measured by adaptive tests. Additionally, the nonparametric estimation of the marginal distributions is an advantage over methods that assume normal marginals. See Section 3.1 of the Supplementary Material (Krantsevich et al. (2022)) for details.
Note that the GCFM was fit to an augmented vector including age: (X, Age). This allows us to condition on age in defining the posterior prediction distribution that represents our target population. While we could have accomplished this by only fitting the GCFM to data from a particular age group, fitting the model to the entire population and sampling conditionally after the fact allows for borrowing information from the larger population, and deploying it in service of a subpopulation with fewer data. We only use item responses as splitting variables (inputs) for the adaptive tests.
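The idea of sampling conditionally after the fact can be illustrated with a simple rejection scheme: draw (item responses, age) jointly and keep only the draws in the target age range. This is a toy sketch under stated assumptions, not the sampler used by the bfa package; the joint model in `draw_joint` is a made-up stand-in for one posterior predictive draw, and the age range 8 to 17 and three 5-point items are chosen only for illustration.

```python
import random


def draw_joint(rng):
    """Toy stand-in for one joint predictive draw of (item responses, age)."""
    age = rng.randint(8, 17)
    # Hypothetical dependence: older youth tend to give higher responses.
    mean = 2 + 0.1 * (age - 8)
    x = [min(5, max(1, round(rng.gauss(mean, 1)))) for _ in range(3)]
    return x, age


def draw_conditional(n, min_age, seed=0):
    """Keep jointly drawn samples whose age meets the condition (rejection sampling)."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        x, age = draw_joint(rng)
        if age >= min_age:
            kept.append((x, age))
    return kept


older_youth = draw_conditional(50, min_age=15)
```

Because the joint model is fit once to everyone, the retained draws for the older group still reflect dependence structure learned from the full sample.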
Sensitivity analysis to the number of factors via posterior predictive checks revealed that 3 or more factors yielded similar conclusions; results for the k = 3 factor specification in the Gaussian factor copula model are reported here.

4.2.2. Risk prediction. We model $f(y \mid x)$ using a log-linear Accelerated Bayesian Additive Regression Trees (XBART) model that builds on the log-linear Bayesian Additive Regression Trees (BART) model for multinomial logistic regression of Murray (2020) with a modification of the "accelerated" model fitting algorithm of He, Yalov and Hahn (2018) developed by Wang and Hahn (2021). See Section 3.2 of the Supplementary Material (Krantsevich et al. (2022)) for details. The log-linear XBART classification model provides class probability predictions $\bar{E}(\tilde{Y} \mid \tilde{x}_k)$, the probability that a youth with item responses $\tilde{x}_k$ is in the "at-risk" group. This modeling choice provides the predictive accuracy and Bayesian uncertainty quantification abilities of BART-based models, with the computational speed-up of the XBART family and the classification-specific adaptations implemented by Wang and Hahn (2021). Notably, this approach is substantially less constrained than typical IRT approaches, which require that the risk probability relates to the item response via the same low-dimensional latent factors. Here, while we assume that the item responses have a latent dimension of $k = 3$, the risk probability can depend directly on every single item individually (with no dimension reduction). However, regularization priors in the tree ensemble representation favor trees that utilize far fewer than every available item.

4.2.3. Connection to previous notation. We fit the Gaussian copula factor model to item responses and age from the IMC data, then obtain synthetic item response data $\{\tilde{x}_k\}_{k=1}^{M}$ from the target population using the predictive distribution of the fitted model.
The plots in the next section compare the utility of the shortened screening test $\gamma^*_m$ to the utility of the full-length test $\gamma^*$, which uses all 173 items on the IMC. The instrument $\gamma^*(\cdot) = \mathrm{Thr}_{C^*}(\bar{E}(\tilde{Y} \mid \cdot))$ is composed of a regression function $\bar{E}(\tilde{Y} \mid \cdot)$ predicting the probability of the Honduran youth being "at-risk", followed by a thresholding function $\mathrm{Thr}_{C^*}$ to predict risk class status. We fit the regression function as an XBART model using the IMC data, then obtain predicted "at-risk" probabilities $\bar{E}(\tilde{Y} \mid \tilde{x}_k)$ for the synthetic item responses $\tilde{x}_k$. The thresholding function is chosen to optimize the utility function for the target population, given predicted "at-risk" probability $\bar{E}(\tilde{Y} \mid \tilde{x}_k)$.
The shortened instrument with $m$ items, $\gamma^*_m(\cdot) = \mathrm{Thr}_{C_{T^*_m}}(T^*_m(\cdot))$, is composed of a binary regression tree $T^*_m$ with maxIPP $= m$, and a thresholding function $\mathrm{Thr}_{C_{T^*_m}}$. The thresholding function for $\gamma^*_m$ is computed similarly to the one for $\gamma^*$, except that it optimizes the cutoff using "at-risk" probabilities $T^*_m(\tilde{x}_k)$ rather than $\bar{E}(\tilde{Y} \mid \tilde{x}_k)$.

5. Results. Section 5.1 provides a demonstration of the method using the data for youth delinquency in Honduras. Section 5.2 provides out-of-sample validation of the method on a hold-out set, along with a subgroup analysis using the same hold-out set.

5.1. Demonstration of the method. Recall the three steps in the decision theory framework laid out above: 1) a utility function for measuring the "goodness" of the assessment; 2) a target population; 3) a method for obtaining assessments of different lengths. Sections 5.1.1, 5.1.2, and 5.1.3 demonstrate how the utility difference plots change as we vary these three choices, respectively, when applied to the Honduras youth risk assessment data.

5.1.1. Changing the utility function. First, we highlight how the plots change when we vary Step 1, the utility function. Figure 4 shows boxplots of the difference in expected utility for three different weights $w$ in the utility function
$$EU_\theta(\gamma) = w \cdot \mathrm{Sensitivity}(\gamma) + (1 - w) \cdot \mathrm{Specificity}(\gamma).$$
For each weight $w$ and each value of $m$, we compute draws of the utility difference $\Delta_{\theta^{(j)},m} = EU_{\theta^{(j)}}(\gamma^*_m) - EU_{\theta^{(j)}}(\gamma^*)$ using synthetic data from each posterior draw $j$; the posterior distribution of $\Delta_{\theta,m}$ is then visualized via a boxplot of the draws $\{\Delta_{\theta^{(j)},m}\}_{j=1}^{D}$. The distribution of $\Delta_{\theta,m}$ can vary depending on our choice of both $m$ and $w$. Figure 5 provides a visual example of how $(\mathrm{Specificity}(\gamma), \mathrm{Sensitivity}(\gamma))$ for $\gamma \in \{\gamma^*, \gamma^*_m\}$, in conjunction with $w$, lead to different draws of $\Delta_{\theta,m}$ for $m = 3$. In particular, as $w$ gets closer to 0 or 1, it is easier for the shortened test $\gamma^*_m$ to achieve a utility value closer to that of the non-shortened instrument $\gamma^*$.
Practically, a value of $w$ close to 0 or 1 amounts to strongly favoring either sensitivity or specificity, at the expense of the other; such decisions can have unintended ramifications, which are discussed further in Section 6.1.
FIG 5. The values of $j$ shown here were arbitrarily chosen for demonstration and are not inherently important. The ROC curves are computed based on the predicted "at-risk" probabilities $\bar{E}(\tilde{Y} \mid \tilde{x}_{ij})$ and $T^*_m(\tilde{x}_{ij})$ from the XBART action $\gamma^*$ and maxIPP $= 3$ action $\gamma^*_{m=3}$ (respectively) for each specific $j$th population. For each given $w$, there is exactly one cutoff $C$ which maximizes the utility function $EU_\theta(\gamma) = w \cdot \mathrm{Sensitivity}(\gamma) + (1 - w) \cdot \mathrm{Specificity}(\gamma)$ over all sample populations (all values of $j$), for $\gamma = \gamma^*_{m=3} = \mathrm{Thr}_{C_{T^*_m}}(T^*_m(\cdot))$. That cutoff $C$ corresponds to a particular (Specificity, Sensitivity) pair for each value of $j$ (for both the XBART and maxIPP $= 3$ actions), which are visualized as points on the ROC curves from those two actions for the $j$th synthetic population. Those sensitivities and specificities are used to compute the realized utility values $EU_{\theta^{(j)}}(\gamma^*_m)$ and $EU_{\theta^{(j)}}(\gamma^*)$, along with their difference, $\Delta_{\theta^{(j)},m}$, which contributes one point to the boxplots in Figure 4 for the given values of $w$ and $m = 3$. Notice that values of $w$ closer to 1 lead to differences in sensitivity between $\gamma^*$ and $\gamma^*_{m=3}$ (distance between points on the Sensitivity axis) being smaller than differences in specificity (distance between points on the Specificity axis). This can be observed in the points corresponding to $w = 0.75$ and, to a lesser extent, $w = 0.6$. The opposite behavior is observed for $w = 0.25$ and $w = 0.4$.

5.1.2. Changing the target population. Next, we vary Step 2, the target population. The boxplots in Figure 6 represent the same quantity as Figure 4 (namely, the distribution of $\Delta_{\theta,m}$). However, Figure 6 shows expected utility differences for adaptive tests calibrated using two target populations: all Honduran youth, and youth ages 15 and older. We chose to target youth ages 15 and older since age 15 marks the transition from middle school to secondary school, as well as the quinceañera ceremony. To change the target population, we used the GCFM fit to the entire dataset, but then drew samples $\{\tilde{x}_{ij}, \tilde{y}_{ij} \mid \theta^{(j)}\}$ using the conditional predictive distribution, $f(\tilde{x}, \tilde{y} \mid x_{1:n}, y_{1:n}, \mathrm{Age} \ge 15)$.
The expected utility plots are similar; however, targeting the subgroup when designing the adaptive test yields slightly less variability in the posterior estimates of the utility difference.
FIG 6. Utility difference plots when calibrating the trees to a different target population. Plots here are shown for a target population being every youth in the full IMC data ("All"), and those youth ages 15 and older ("Ages 15+"). The utility plots are quite similar, although exam truncation seems to result in a greater loss of utility (relative to the full screening instrument) for the older group. Calibrating the adaptive test to the subgroup also results in slightly more certainty compared to the full population.
FIG 7. Tree representing the adaptive test calibrated using the entire group of Honduran youth as the target population. The items and item responses corresponding to each node label and cutpoint, respectively, are found in Table 1. This figure and Figure 8 were created using the rpart.plot package (Milborrow (2021)).
Interestingly, these similar results arise based on adaptive tests that use different splitting items and cutpoints. Figures 7 and 8 show the trees with maxIPP of 3 representing the adaptive tests for these two target populations. The items corresponding to these trees are listed in Table 1, with the response options found in Table ??. Notice that because of the maxIPP criterion, these trees have a maximum depth of 5, but have only 3 unique items in each root-to-leaf path.

TABLE 1
Item | Question | Response options | Target population
 | Has anyone in your family had a severe alcohol or drug problem? | 1 = No; 2 = Yes | All Youth; Age ≥ 15
Yh5 | During the last six months, how many friends have belonged to or have joined a gang or "mara"? | 1 = None; 2 = A few; 3 = Half; 4 = Most; 5 = All | All Youth; Age ≥ 15
Yd2 | Sometimes I find it exciting to do things that could get me in trouble. | 1 = Strongly disagree; 2 = Disagree; 3 = Neither agree or disagree; 4 = Agree; 5 = Strongly agree | Age ≥ 15
Ya6 | People "blame me" for lying or cheating. | 1 = Never; 2 = Rarely; 3 = Half the time; 4 = Often; 5 = Always | Age ≥ 15

5.1.3. Changing the algorithm to populate the action space. Finally, we can consider different algorithms for populating the action space. In this paper we have focused on the composite action $\gamma = \mathrm{Thr}_C(T)$, a regression tree $T$ predicting "at-risk" probability followed by a cutoff $C$ that determines risk status. Our proposed method for populating the action space is a regression tree obtained by applying the maxIPP growing and pruning method to synthetic data obtained from the posterior predictive distribution, and a threshold optimized to the utility function for the tree $T$.
Many other methods are possible. For example, one can calibrate the regression tree using a stopping criterion like maximum depth, or apply the algorithm to different synthetic data or to real data; a classification tree can be used as the adaptive test directly instead of a regression tree followed by a cutoff. We explore these possibilities in Section 4 of the Supplementary Material (Krantsevich et al. (2022)). The main takeaway is that tree-based adaptive tests that do not optimize the utility function at all during their design are significantly worse at reproducing the utility of a full-item assessment, relative to adaptive tests that do.
FIG 9. Plots of the projected difference in expected utility produced via our method on the training data, and the actual difference in expected utility computed on the test data; these are results for $w = 0.5$.
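The distinction between tree depth and items per path, noted above for the maxIPP trees, can be checked mechanically. The sketch below reads maxIPP as a cap on the number of distinct items along any root-to-leaf path, as the text describes; the nested-dictionary tree, its cutpoints, and its leaf probabilities are hypothetical, and `max_items_per_path` and `depth` are illustrative helper names.

```python
def max_items_per_path(node, seen=frozenset()):
    """Largest number of distinct items encountered on any root-to-leaf path."""
    if "prob" in node:                     # leaf
        return len(seen)
    seen = seen | {node["item"]}           # reusing an item does not grow the set
    return max(
        max_items_per_path(node["left"], seen),
        max_items_per_path(node["right"], seen),
    )


def depth(node):
    """Number of splits on the longest root-to-leaf path."""
    if "prob" in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))


# Hypothetical tree in which item Yh5 is split on twice along one path.
tree = {
    "item": "Yh5", "cut": 1,
    "left": {"prob": 0.10},
    "right": {
        "item": "Yd2", "cut": 3,
        "left": {"prob": 0.30},
        "right": {
            "item": "Yh5", "cut": 3,       # same item, finer cutpoint
            "left": {"prob": 0.55},
            "right": {"prob": 0.80},
        },
    },
}

tree_depth = depth(tree)
tree_ipp = max_items_per_path(tree)
```

Here the tree has three levels of splits but never asks more than two distinct items, which is why a depth limit and an items-per-path limit are not interchangeable stopping rules.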

5.2. Out-of-sample corroboration. The proposed method will be empirically reliable only insofar as the posterior predictive distribution suitably reflects the distribution of future outcomes. To verify that our Gaussian copula factor and XBART models are succeeding in this regard, we perform the following hold-out experiment. Our data were collected in two different time periods, the first wave between September and November of 2017 and the second wave between January and February of 2018. The earlier-collected data are our training set and consist of 2787 youth; the later data are our testing set and consist of 1185 youth. Simply put, this experiment answers the question: how would our approach have performed if we had applied it in 2018, based on the 2017 data? Figure 9 demonstrates the expected utility difference plots we obtained by applying our method on the training data, and the actual expected utility differences on the testing data. To compute the actual expected utility, we used our proposed method on the training data to obtain a tree-based adaptive test for each value of maxIPP, along with a full-item (non-shortened) test. We also produced the boxplots representing our uncertainty around $\Delta_{\theta,m}$ using the training data alone. We then predicted risk classes on the testing set using both the tree-based adaptive test and the full-item test, and computed the difference in empirical utility over the testing set. The empirical utility difference on the testing set is always within our predicted range, and in fact lies between the 25th and 75th percentiles of the predicted distribution.
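The final check in this experiment, whether the realized test-set difference falls inside the middle half of the predicted distribution, can be sketched as follows. The numbers are made up for illustration, `within_middle_half` is a hypothetical helper, and the quartiles use the default interpolation of Python's `statistics.quantiles`, which may differ slightly from the convention used to draw the boxplots.

```python
import statistics


def within_middle_half(train_deltas, test_delta):
    """True if test_delta lies between the 25th and 75th percentiles of train_deltas."""
    q1, _median, q3 = statistics.quantiles(train_deltas, n=4)
    return q1 <= test_delta <= q3


# Made-up posterior draws of the utility difference for one test length.
train_deltas = [-0.05, -0.04, -0.03, -0.02, -0.01]

covered = within_middle_half(train_deltas, test_delta=-0.03)
not_covered = within_middle_half(train_deltas, test_delta=-0.06)
```

Repeating this comparison for each value of maxIPP gives a simple numerical counterpart to the visual comparison in Figure 9.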
Beyond utility differences relative to the full item test, practitioners are interested in the absolute sensitivity and specificity of the instrument. Table 2 provides out-of-sample sensitivity and specificity values for the adaptive tests from a subset of maxIPP values shown in Figure 9, along with adaptive tests calibrated using utility functions with w = 0.4 and w = 0.6. Increasing w results in higher sensitivity and lower specificity, as expected. For full results on maxIPP 2 to 15, along with these quantities for other types of adaptive tests, see the tables in Section 4 of the Supplementary Material (Krantsevich et al. (2022)).
Finally, we use the holdout set to show how specifying a particular target population can improve sensitivity, specificity, or overall utility when building adaptive tests. The two target populations under consideration are "All Youth" and "Ages 15+". Table 3 shows the number of participants in each of the age groups from our data in both the training and testing sets. For the adaptive test with target population "All Youth", we fit the Gaussian copula factor model and logistic XBART model to the entire training data and obtained synthetic data using these models, which was then used for calibrating the tree-based adaptive test. For the adaptive test with target population "Ages 15+", we used the same models fit to the entire population,  but drew synthetic data from the group of youth ages 15 and older using the conditional predictive distribution f (x,ỹ | x 1:n , y 1:n , Age ≥ 15). We then calibrated a tree-based adaptive test to this synthetic data. This process was repeated for maxIPP = 2 to 15, leaving a total of 28 regression trees. We computed the optimal cutoffs that maximized the utility function (3) for w = 0.6. After calibrating the trees and computing the optimal cutoffs (using the training data only) to obtain 28 tests, both sets of adaptive tests were then deployed to predict "at-risk" status on youth ages 15 and older in the testing set, and sensitivity, specificity, and utility for this group were computed for each of the 28 tests. We chose a value of 0.6 for this analysis, because we are targeting a group of older youth that have been shown to receive positive treatment effects from the secondary prevention counseling program (see Katz et al. (2021)). For this group it is more important to not miss the youth that are at the highest risk, than to prevent "not-at-risk" youth from mistakenly receiving the intervention. 
Figure 10 shows the differences in empirical out-of-sample sensitivity, specificity, and overall utility between the adaptive tests calibrated to the two different populations, for each value of maxIPP; the absolute quantities are given in Table 4. The adaptive tests optimized for "All Youth" with w = 0.6 are not appropriate for this particular subpopulation, because those questions indicate that all of the youth ages 15+ in the test set are "at-risk" (leading to 0 specificity, which clearly is unacceptably low). Trying to increase sensitivity for the entire population results in items that are uninformative for the older youth. When we calibrate the adaptive test specifically to this subgroup, we sacrifice only a small amount of sensitivity for huge gains in specificity.
The improvement by focusing the test to a specific group is an important finding related to focused deterrence and multiple gating. Focused deterrence implies introducing interventions specific to the group where they will be deployed; multiple gating means targeting youth for secondary prevention programs who are at the highest risk of the delinquent behavior and living within the highest risk neighborhoods (Katz et al. (2021)). Both of these methodologies are important aspects of successful community-based crime prevention programs (Abt and Winship (2016), Katz et al. (2021)), and using accurate screening instruments for the population where an intervention will be introduced is critical to their successful implementation.
While the lack of specificity on the older group of youth using an adaptive test calibrated to all youth is alarming, this highlights the importance of using adaptive tests designed specifically for the group on which they will be deployed. All adaptive tests that are created using a machine learning (ML) algorithm, such as CART, do so by heuristically optimizing a given criterion over a specific dataset. This may have unintended consequences when the data for which the test was optimized differ in distribution from the specific group on which the screening test will be deployed. The benefit of our proposed method for obtaining the adaptive test (chosen to optimize the criteria in our Bayesian decision theory evaluation framework) is that these choices are directly placed in front of the screening test designer when the adaptive test is created. One must think critically about the target population for which the test is optimized, and the utility function being optimized; these are decisions that are inherently made in other tree-based adaptive test procedures, but under the hood.
FIG 10. Differences in sensitivity, specificity, and utility for youth ages 15+ in the testing data, between two adaptive tests ($\gamma_{\text{All Youth}}$ and $\gamma_{\text{Ages 15+}}$) created using training data. The adaptive test $\gamma_{\text{All Youth}}$ is designed to approximately optimize expected utility for all youth, and $\gamma_{\text{Ages 15+}}$ for youth ages 15+. The bar height in the upper plot is $\mathrm{Sensitivity}(\gamma_{\text{Ages 15+}}) - \mathrm{Sensitivity}(\gamma_{\text{All Youth}})$ computed on youth ages 15+ in the testing data, and similarly for specificity and utility.

TABLE 4
Specificity, sensitivity, and utility with $w = 0.6$ on youth ages 15 and older from the testing data, for adaptive tests calibrated on two different target populations in the training data: youth ages 15 and older, and all youth. "Target Population" shows the population from which synthetic data were obtained for calibrating the test.
A further benefit is that data from a larger population can be used to adapt a screening test to a subpopulation where fewer data are available. We borrow information from the whole population when fitting the GCFM, but sample from the subpopulation of older youth using the conditional posterior predictive distribution from that fitted model. This conditional sample is then used for calibrating the adaptive test to the subgroup. This is an unusual and exciting example of transfer learning: utilizing the information that an ML algorithm obtains from larger datasets when applying the algorithm in service of a slightly different problem where fewer data are available.
6. Discussion. From a practical perspective, the upshot of our analysis is highly encouraging: a much shorter assessment can be given that will nearly match the predictive accuracy (as characterized by the weighted sensitivity and specificity) of the much longer original assessment. Specifically, we were able to design adaptive tests of varying lengths for the target population of youth ages 15 and older, living in 5 of the poorest and most violent cities in Honduras. Out-of-sample sensitivity over 0.9 and specificity over 0.4 were achieved for an adaptive test that uses only 9 items. This is an increase in specificity of 0.4 over an adaptive test optimized to youth of all ages together. If a more convenient screening tool leads to more individuals being screened, limited crime mitigation resources can be more smartly employed.
However, precisely because the stakes are so high, circumspection is in order, guided by a "first do no harm" ethos. Accordingly, we conclude with an examination of potential pitfalls of our proposed method. The importance of such considerations has recently been emphasized under the broad heading of "ethical AI" (artificial intelligence) (cf. Johndrow and Lum (2019) and Chouldechova and Lum (2020)). Two main concerns are disparate impacts on particular subpopulations, and the difficulty in interpreting or interrogating automated decisions from sophisticated data-driven algorithms.
6.1. Disparate impact. Biased training data can result in risk assessment tools that produce unethical or unfair decisions for particular groups of people, in domains such as criminal justice (Chouldechova and Lum (2020), Chouldechova (2017), Eckhouse et al. (2018)) and child welfare (Chouldechova et al. (2018)). For example, historical data may unfairly indicate that a certain racial group is at higher risk of re-arrest, simply due to more aggressive policing in their neighborhoods; a statistical model trained on this type of historically biased data will produce unethical decisions on important questions like pre-trial release. Similarly, our method is only as unbiased as the data used to train the model. In our particular case, the outcome used in the IMC data is self-reported; unlike in United States recidivism data, for which "re-arrest" is an inaccurate and racially biased proxy for "re-offense" (see Johndrow and Lum (2019)), the delinquency data on the IMC are based on the individual youth self-reporting whether they engage in the behavior, as opposed to school or law enforcement records that may be biased by historical law enforcement patterns. While our assessment would disadvantage a group of youth who were systematically dishonest in their self-reported violent behavior, and it is possible that there may be such a group, such patterns in the youth represented in the IMC have thus far not been observed; the scales used in the IMC were chosen for their efficacy, internal validity and reliability (Katz et al. (2021)).
The nature of historically advantaged or disadvantaged groups also differs: the youth for whom the current application is intended are fairly homogeneous. These youth are of the same race and ethnicity and experience similar levels of poverty, living in the poorest neighborhoods within the five most dangerous and violent cities in Honduras, which is itself one of the most violent countries in the world. While ethnic minority groups live in parts of rural Honduras, this paper has been written for the scope of application in five particular urban neighborhoods under consideration.
Although our algorithm is unlikely to result in disparate impacts among racial groups in these neighborhoods (simply due to lack of heterogeneity), there is a possibility for differential impact by age, and possibly other features like gender or religion. In Katz et al. (2021), positive treatment effects from the secondary prevention program were observed for older youth (divided at age 14 and older), whereas mixed treatment effects were observed for the younger group. This highlights the importance of careful selection of the weight w in designing the adaptive test. As a concrete example, in the randomized controlled trial (RCT) which continued after the initial IMC data collection (Katz et al. (2021)), services were given to 994 youth deemed to be "at-risk", out of 4495 screened. Supposing that 994 of the 4495 screened youth were truly "at-risk", a decrease in sensitivity of 5% would result in 50 more "at-risk" youth being denied the intervention, whereas a 5% rise in specificity would result in 175 more "not-at-risk" youth being prevented from incorrectly receiving the intervention. An adaptive test that trades this increased specificity for decreased sensitivity may be acceptable within a younger group, but not for an older one. Similarly, harmful consequences can arise from a shift in the target population between test creation and deployment. An adaptive test that optimizes utility for youth over a large age range (e.g., 8-17) may not have acceptable accuracy for youth within a more specific age group; indeed, this was the case for youth ages 15+ (see Section 5.2).
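The arithmetic behind the concrete example above can be reproduced directly. The sketch below uses the counts from the text and reads the 5% changes as shifts of 0.05 in sensitivity and specificity on the probability scale, which is an interpretive assumption; the variable names are illustrative.

```python
screened = 4495
at_risk = 994                       # supposed number of truly "at-risk" youth
not_at_risk = screened - at_risk    # remaining screened youth

# A drop of 0.05 in sensitivity misses this many additional "at-risk" youth.
extra_missed = round(0.05 * at_risk)

# A rise of 0.05 in specificity spares this many "not-at-risk" youth
# from incorrectly receiving the intervention.
extra_spared = round(0.05 * not_at_risk)
```

Because the "not-at-risk" group is roughly three and a half times larger than the "at-risk" group, the same change on the probability scale affects correspondingly more youth on the specificity side, which is part of why the choice of the weight $w$ carries practical consequences.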
To summarize, the possibility for disparate impact using our proposed method, as with most automated decision making via ML algorithms, hinges on whether or not particular subpopulations are given due consideration in the test design process. Attention and care must be given to the selection of the target population and the weight w when optimizing the adaptive test, to ensure the best outcomes for the youth being screened for risk of delinquency.
6.2. Inscrutability of automated decisions. Independent from concerns surrounding flawed training data and the differential impacts it creates, the sheer complexity of a data-driven risk assessment invites skepticism. Flaws can be hard to identify when the inputs and outputs are high-dimensional numerical vectors (Chouldechova and Lum (2020)). On this count, we consider our method to be a substantive advance over existing approaches. One, our final risk prediction assessment tool is a single decision tree, which can easily be understood and adapted as needed to reduce potential bias or problematic prediction patterns. For example, if a particular item results in lower predicted risk probability based on behavior that is believed to increase it, that item can be excluded from the item pool and the decision tree re-calibrated to the remaining items. Two, the inputs to our method are transparent: a utility function, a target population, and a set of candidate instruments generated by a heuristic. Sensitivity to these choices can and ought to be investigated; the execution of such comparisons is precisely what our novel decision theory framework facilitates. Although the process is quite involved, its transparency and flexibility should make it less prone to unanticipated flaws than ad hoc methods of abridging screening tests, whether data-driven or human-guided.
To emphasize, while particular choices for each of these steps were presented in our analysis of the Honduras data, many other choices are possible. For example, the specific value of w in the utility function can be chosen based on whether specificity or sensitivity is more important; or, another utility function involving other classification metrics can be chosen. The target population can be specified as youth of a particular age, neighborhood, gender, school, or any other subpopulation for which a specific screening instrument may be useful, as long as some data for this target population are available. And while we have focused on tree-based adaptive tests relying on the CART algorithm in this paper, one can utilize other tree-growing algorithms for populating the action space, or compare IRT-based adaptive tests as well. The framework itself is generic, in the sense that once a practitioner has chosen a utility function, a target population, and an algorithm for populating the action space, the same procedure can be applied to understand the trade-offs of shortening the exam to different lengths, or of making a different choice at one of the three steps.
These choices should be made carefully by policy-makers and local stakeholders, aided by researchers who can explain the trade-offs associated with one decision versus another. Researchers can provide insight via the utility plots, or similar plots created for uncertainty quantification of sensitivity or specificity at the relative or absolute level. Local stakeholders and policy-makers can assess which outcomes are most important for the group being screened in their specific application. These groups, working in concert, should adjust the assessment to accommodate desired levels of sensitivity and specificity for the particular population in which it will be deployed, as far as practical limitations allow (e.g., counselor availability in our application).

APPENDIX A: INTEGRATING OVER THE TARGET POPULATION
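Our process for obtaining synthetic data $\{\tilde{x}_{ij}, \tilde{p}_{ij}, \tilde{y}_{ij} \mid \theta^{(j)}\}$ from the conditional predictive distribution can be summarized as follows:
1. Fit a Gaussian copula factor model with parameters $\theta_X$ to item response data $x_{1:n}$.
2. Fit a multinomial logistic XBART model with parameters $\theta_Y$ to item response/risk status data $(x, y)_{1:n}$.
3. Fixing the $j$th posterior draw of model parameters $\theta_X^{(j)}$, sample item responses $\tilde{x}_{ij}$ from the predictive distribution given $\theta_X^{(j)}$, using the fitted Gaussian copula factor model.
4. Compute the probability $\tilde{p}_{ij} = \Pr(\tilde{Y} = 1 \mid \tilde{x}_{ij}, \theta_Y^{(j)})$ using the $j$th posterior tree ensemble from the fitted multinomial logistic XBART model.
5. Sample the class label $\tilde{y}_{ij} \sim \mathrm{Bernoulli}(\tilde{p}_{ij})$.
6. Our dataset conditioned on the $j$th posterior draw $\theta^{(j)} = \{\theta_X^{(j)}, \theta_Y^{(j)}\}$ is $\{\tilde{x}_{ij}, \tilde{p}_{ij}, \tilde{y}_{ij}\}_{i=1}^{N}$.
Additionally, during step (4), we compute the posterior predictive mean probability $\bar{E}(\tilde{Y} \mid \tilde{x}_{ij})$. By repeating this process $D$ times, $1 \le j \le D$, we obtain $D$ population-level samples from our target population. In total, the synthetic data consist of $M = N \times D$ records, indexed as $\{\tilde{x}_k, \tilde{y}_k, \bar{E}(\tilde{Y} \mid \tilde{x}_k)\}_{k=1}^{M}$.
We use synthetic data $\{\tilde{x}_k, \bar{E}(\tilde{Y} \mid \tilde{x}_k)\}_{k=1}^{M}$ for calibrating the regression tree with $m$ items, $T^*_m$. We use $\{\tilde{x}_k, \tilde{y}_k\}_{k=1}^{M}$ for choosing the optimal cutoff $C_{T^*_m}$. We also use $\{\tilde{x}_k, \tilde{y}_k\}_{k=1}^{M}$ for uncertainty quantification plotting, but broken up into $D$ sample populations as $\{\tilde{x}_{ij}, \tilde{y}_{ij}\}_{i=1}^{N}$, $1 \le j \le D$. For each value of $j$, we compute $EU_{\theta^{(j)}}(\gamma)$ for both $\gamma^*_m(\cdot) = \mathrm{Thr}_{C_{T^*_m}}(T^*_m(\cdot))$ and $\gamma^*(\cdot) = \mathrm{Thr}_{C^*}(\bar{E}(\tilde{Y} \mid \cdot))$ using Equation (4). The draws of the differences $\Delta_{\theta^{(j)},m}$ between these utilities are then used for uncertainty quantification of $\Delta_{\theta,m} = EU_\theta(\gamma^*_m) - EU_\theta(\gamma^*)$.
For this paper, we drew $N = 1000$ Monte Carlo samples of the form $\{\tilde{x}_{ij}, \tilde{p}_{ij}, \tilde{y}_{ij}\}$ for each of $D = 1000$ posterior parameter draws $\theta^{(j)}$, $1 \le j \le 1000$. We drew another 100,000 synthetic data points from the same fitted models for the pruning step from Section 3.2.3.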
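The sampling loop and the per-draw utility differences described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the fitted Gaussian copula factor model and XBART model are replaced by toy stand-in functions (`toy_sample_x`, `toy_prob_at_risk`), the posterior draws are plain pairs of floats, and plain accuracy stands in for the expected utility of Equation (4). All function names are hypothetical.

```python
import random


def simulate_populations(posterior_draws, sample_x, prob_at_risk, n_per_draw, seed=0):
    """Return one synthetic population of (x, p, y) triples per posterior draw."""
    rng = random.Random(seed)
    populations = []
    for theta_x, theta_y in posterior_draws:
        population = []
        for _ in range(n_per_draw):
            x = sample_x(theta_x, rng)        # step 3: synthetic item responses
            p = prob_at_risk(x, theta_y)      # step 4: Pr(Y = 1 | x, theta_Y)
            y = 1 if rng.random() < p else 0  # step 5: Bernoulli class label
            population.append((x, p, y))
        populations.append(population)
    return populations


def utility_differences(populations, short_rule, full_rule, utility):
    """Per-draw difference in utility between a shortened rule and a full rule."""
    diffs = []
    for population in populations:
        xs = [x for x, _, _ in population]
        ys = [y for _, _, y in population]
        u_short = utility(ys, [short_rule(x) for x in xs])
        u_full = utility(ys, [full_rule(x) for x in xs])
        diffs.append(u_short - u_full)
    return diffs


# Toy stand-ins (hypothetical): a single binary item, float-valued parameters.
def toy_sample_x(theta_x, rng):
    return 1 if rng.random() < theta_x else 0


def toy_prob_at_risk(x, theta_y):
    return theta_y if x == 1 else 0.05


def accuracy(labels, predictions):
    # Stand-in utility; the paper's utility function is not reproduced here.
    return sum(int(a == b) for a, b in zip(labels, predictions)) / len(labels)


draws = [(0.30, 0.60), (0.35, 0.70), (0.25, 0.65)]  # D = 3 toy posterior draws
pops = simulate_populations(draws, toy_sample_x, toy_prob_at_risk, n_per_draw=200)
deltas = utility_differences(pops, lambda x: 0, lambda x: x, accuracy)
print(len(pops), len(pops[0]), len(deltas))
```

The list `deltas` plays the role of the draws $\Delta_{\theta^{(j)},m}$: one value per posterior draw, which can then be summarized with a boxplot or interval.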

APPENDIX B: DETAILS OF THE MAXIPP ALGORITHM
We propose a variation on the popular CART algorithm for obtaining an approximately optimal tree-based adaptive test that contains at most m items. Presuming that item responses can be stored for future splits, the maxIPP of a tree-based adaptive test is precisely the maximum number of questions any participant will answer. The maxIPP characteristic is similar to maximum depth; to see the distinction, consider the tree on the right of Figure 1, which has a maximum depth of 3, but a maxIPP of 2.
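The distinction between maximum depth and maxIPP can be made concrete with a small sketch. The `Node` class and the example tree below are hypothetical (they are not the tree from Figure 1), but the example has the same property: a path that splits twice on the same item has depth 3 while asking only 2 distinct questions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    item: Optional[str] = None      # splitting item; None marks a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def max_depth(node):
    """Largest number of splits along any root-to-leaf path."""
    if node is None or node.item is None:
        return 0
    return 1 + max(max_depth(node.left), max_depth(node.right))


def max_ipp(node, seen=frozenset()):
    """Largest number of distinct items along any root-to-leaf path."""
    if node is None or node.item is None:
        return len(seen)
    seen = seen | {node.item}
    return max(max_ipp(node.left, seen), max_ipp(node.right, seen))


# q1 is split on twice along the left-most path, at two different strata.
tree = Node("q1",
            left=Node("q2", left=Node("q1", Node(), Node()), right=Node()),
            right=Node())
print(max_depth(tree), max_ipp(tree))  # 3 2
```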
For each value of maxIPP = m, we use an adapted version of the CART algorithm to obtain an approximately optimal tree. CART consists of a tree growing phase, followed by a tree pruning phase. Our modification uses the usual greedy algorithm (minimizing sum-of-squares) for the growing phase, with a twist: once m unique variables have been used as splitting variables in any particular path, only these same variables are considered as candidates for future splits down this path. This algorithm is implemented as a modification to the rpart package, with maxvpp (the application-agnostic term meaning "maximum variables per path") available as an option for rpart.control. For the pruning stage, we start at the root tree T 0 in the list of subtrees returned by rpart, and for each next tree in the list, compute the reduction in root mean square error (on a holdout set) relative to the previous tree. If this reduction is not above a given threshold for at least 10 consecutive subtrees in the list, we return to the last subtree that met this threshold and call this tree T * m .
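The two modifications can be sketched in a few lines. This is an illustration of one reading of the rules above, not the code in the modified rpart package: `candidate_items` restricts the split candidates once m distinct items appear on a path, and `select_subtree` walks a list of holdout RMSE values (one per nested subtree, root first) and stops after `patience` consecutive subtrees fail to improve by more than `threshold`. Both function names, and the behavior when the list ends early, are assumptions.

```python
def candidate_items(items_on_path, all_items, m):
    """Items eligible for the next split on a path, under a maxIPP of m."""
    used = set(items_on_path)
    if len(used) >= m:
        # The path already uses m distinct items: only those may be reused.
        return sorted(used)
    return sorted(all_items)


def select_subtree(holdout_rmse, threshold, patience=10):
    """Index of the last subtree whose RMSE reduction exceeded the threshold.

    Scanning stops once `patience` consecutive subtrees fail to improve on
    their predecessor by more than `threshold`.
    """
    best = 0
    misses = 0
    for k in range(1, len(holdout_rmse)):
        reduction = holdout_rmse[k - 1] - holdout_rmse[k]
        if reduction > threshold:
            best = k
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best


rmse_path = [1.0, 0.8, 0.7] + [0.699] * 12  # toy holdout RMSE per subtree
print(candidate_items(["q1", "q2"], ["q1", "q2", "q3"], m=2))
print(select_subtree(rmse_path, threshold=0.01))
```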
2. Table on IMC Data. Table 1 provides a list of risk and protective factors measured by the Instrumento de Medicion de Comportamientos (IMC). Data using this instrument were used for obtaining the results in Section 5 of the paper.

3. Model Specifications.
3.1. Item responses. As described in Carvalho (2006), in a $k$-dimensional Gaussian factor model, the $i$th observation of a $p \times 1$ random vector $z$ can be represented as
$$z_i = \Lambda f_i + \nu_i,$$
where $\Lambda$ is a $p \times k$ matrix of factor loadings, $f_i$ is a $k \times 1$ vector of factor scores with $f_i \sim N(0, I)$, and $\nu_i$ is a $p \times 1$ noise vector, with $\nu_i \sim N(0, \Psi)$, $\Psi = \mathrm{diag}(\Psi_1, \ldots, \Psi_p)$. Under these assumptions, $z \sim N(0, \Lambda\Lambda^T + \Psi)$. Intuitively, a copula is a joint distribution function that allows for separation of the marginals from the dependence structure; the copula completely describes all dependencies among variables. A joint distribution $F$ has a Gaussian copula if it can be written as
$$F(x_1, \ldots, x_p) = \Phi_p\left(\Phi^{-1}(F_1(x_1)), \ldots, \Phi^{-1}(F_p(x_p)) \mid C\right),$$
where $\Phi_p$ is the $p$-dimensional multivariate Gaussian CDF with correlation matrix $C$, and $\Phi^{-1}$ is the inverse of the univariate standard Gaussian CDF.
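The Gaussian copula factor model starts by assigning the latent variable $z$ a $k$-dimensional Gaussian factor model: $f_i \sim N(0, I)$, $z_i \mid f_i \sim N(\Lambda f_i, I)$. We then define $x$ as
$$x_{ir} = F_r^{-1}\left(\Phi\left(\frac{z_{ir}}{\sqrt{1 + \sum_{t=1}^{k} \lambda_{rt}^2}}\right)\right),$$
where $\lambda_{rt}$ is the $(r, t)$ entry of $\Lambda$ and $F_r^{-1}(u) = \inf\{x : F_r(x) \ge u,\ x \in \mathbb{R}\}$ is the pseudo-inverse of $F_r$, $1 \le r \le p$. By making this specification, $F(x)$ has a Gaussian copula with covariance matrix $\Lambda\Lambda^T + I$, and marginals $F_1, F_2, \ldots, F_p$.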
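The construction above can be sketched as a forward sampler. This is a minimal illustration with a fixed, hypothetical loading matrix and hypothetical discrete marginals. It is not the Gibbs sampler used for fitting, and none of the names below come from the bfa package. Each latent coordinate is rescaled to unit variance before applying the Gaussian CDF, so that the resulting uniform variate can be passed through the pseudo-inverse of the marginal CDF.

```python
import math
import random
from statistics import NormalDist


def pseudo_inverse(cdf_points, u):
    """Smallest x with F(x) >= u; cdf_points is a sorted list of (x, F(x))."""
    for x, cumulative in cdf_points:
        if cumulative >= u:
            return x
    return cdf_points[-1][0]


def sample_item_responses(loadings, marginals, rng):
    """Draw one response vector from a Gaussian copula factor model."""
    k = len(loadings[0])
    f = [rng.gauss(0.0, 1.0) for _ in range(k)]          # factor scores
    responses = []
    for lam_r, cdf_r in zip(loadings, marginals):
        # Latent variable z_r = lambda_r' f + noise, with unit noise variance.
        z = sum(l * fi for l, fi in zip(lam_r, f)) + rng.gauss(0.0, 1.0)
        scale = math.sqrt(1.0 + sum(l * l for l in lam_r))
        u = NormalDist().cdf(z / scale)                  # uniform on (0, 1)
        responses.append(pseudo_inverse(cdf_r, u))
    return responses


# Hypothetical example: three items, one latent factor.
loadings = [[0.8], [0.5], [-0.3]]
marginals = [
    [(0, 0.7), (1, 1.0)],              # binary item, Pr(x = 0) = 0.7
    [(1, 0.2), (2, 0.6), (3, 1.0)],    # three-level ordinal item
    [(0, 0.5), (1, 1.0)],
]
rng = random.Random(1)
samples = [sample_item_responses(loadings, marginals, rng) for _ in range(4000)]
share_zero = sum(1 for s in samples if s[0] == 0) / len(samples)
print(round(share_zero, 2))
```

In this toy setup the sampled share of zeros for the first item should land close to its specified marginal of 0.7, while the shared factor induces dependence between items.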
The bfa R package fits a Gaussian copula factor model to data using a parameter-expanded Gibbs sampling scheme. The method allows for inference on joint distributions of mixed continuous and discrete variables, which is necessary for modeling the joint distribution of the item responses and demographic variables from the IMC data. We fit the Gaussian copula factor model to item response and demographic data from the target population in the IMC data, then obtain samples $\{x_i\}_{i=1}^{k}$ using the predictive distribution of the fitted model.
3.2. Risk prediction. In the log-linear Accelerated Bayesian Additive Regression Trees (XBART) model, the probability of observing class $s$ given covariate $x_i$ in a setting with $c$ classes follows a logistic specification,
$$\Pr(y_i = s \mid x_i) = \frac{f^{(s)}(x_i)}{\sum_{l=1}^{c} f^{(l)}(x_i)}.$$
Each $\log f^{(s)}$ is given a sum-of-trees representation, $\log f^{(s)}(x_i) = \sum_{h} g(x_i, T_h^{(s)}, \mu_h^{(s)})$, where $g(x_i, T_h^{(s)}, \mu_h^{(s)})$ is the contribution of the $h$th tree $T_h^{(s)}$, with leaf parameters $\mu_h^{(s)}$, for class $s$.
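As a rough illustration of a logistic specification built on sums of trees, the sketch below assumes the log-linear form in which each class has its own ensemble, the ensemble outputs are summed on the log scale, and the results are normalized across classes. The trees here are hypothetical hand-written functions rather than a fitted XBART ensemble, and the function name is illustrative.

```python
import math


def class_probabilities(x, trees_by_class):
    """Normalize exp(sum of tree outputs) across classes."""
    log_f = [sum(tree(x) for tree in trees) for trees in trees_by_class]
    shift = max(log_f)                      # subtract the max for stability
    f = [math.exp(v - shift) for v in log_f]
    total = sum(f)
    return [v / total for v in f]


# Hypothetical two-class example; each "tree" maps item responses to a leaf value.
not_at_risk_trees = [lambda x: 0.3, lambda x: 0.2 if x["q1"] < 2 else -0.4]
at_risk_trees = [lambda x: -0.5, lambda x: 0.9 if x["q2"] == 1 else -0.1]

probs = class_probabilities({"q1": 3, "q2": 1}, [not_at_risk_trees, at_risk_trees])
print([round(p, 3) for p in probs])
```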
In our application, we utilize this multinomial logistic XBART model with c = 2 classes to predict the probability of being "at-risk", given item responses.
4. Changing the algorithm to populate the action space.
In the paper we present one method for populating the action space with a single adaptive test of length m. First, we calibrate a regression tree T * m to synthetic data sampled from the posterior predictive distribution, using the maxIPP algorithm with maxIPP = m. Then we choose a threshold which is optimized relative to T * m . This is only one possible heuristic for obtaining a tree-based adaptive test with at most m questions. We can change the tree-growing algorithm, the data the algorithm is applied to, or the way the threshold is chosen; or, we can calibrate a classification tree directly instead of using a two-stage regression tree + cutoff approach. Here we present results for several such alternatives.
First, we obtain different adaptive tests varying the parameters above, using the entire IMC data set. We compare two stopping criteria for growing the regression tree: 1) growing to a maximum depth; 2) growing very deep using maxIPP then pruning back (proposed method, described in Appendix B of the paper). We furthermore compare these tree-growing algorithms applied to synthetic data generated via two different processes:
1. Item responses generated via local perturbations ("Perturb") and "at-risk" probabilities obtained using a Random Forest model ("RF").
2. Item responses sampled from a Gaussian copula factor model ("GCFM") and "at-risk" probabilities obtained using a logistic Accelerated Bayesian Additive Regression Trees ("XBART") model. This is our proposed method, described in Sections 3.2.2 and 3.2.3 of the main paper, with data notated $\{\tilde{x}_k, \bar{E}(\tilde{Y} \mid \tilde{x}_k)\}_{k=1}^{M}$.
For each of these two regression trees, we obtain an optimal cutoff using data $\{\tilde{x}_k, \tilde{y}_k\}_{k=1}^{M}$, via the method described in Section 3.2.3.
We also consider the simpler approach of calibrating a classification tree (via maximum depth) that predicts "at-risk" status directly, rather than the regression tree + cutoff approach. We calibrate three classification trees using the following synthetic datasets:
1. Item responses generated via local perturbations ("Perturb") and risk classes ("at-risk" = 1, "not-at-risk" = 0) obtained using RF. This dataset uses the same item responses and the same fitted model as above, but extracts class predictions rather than the probability of being in the "at-risk" class.
2. Synthetic item responses and classes $\{\tilde{x}_k, \tilde{y}_k\}$, $1 \le k \le M$, described in Section 3.2.2 and Appendix A of the main paper.
3. Synthetic item responses $\tilde{x}_k$ paired with utility-based class labels $\gamma^*_k$, $1 \le k \le M$, defined below.
A reviewer suggested classification tree method (3). Unlike the first two synthetic datasets, which do not incorporate the utility function at all, this method still approximately maximizes the utility function (subject to the number of items constraint), and allows for fine-tuning w based on desired levels of sensitivity and specificity.
The synthetic class outcomes $\gamma^*_k$ are defined pointwise in terms of $U_0$ and $U_1$, which are defined in Section 1 of this Supplement. If one were not required to use a decision tree for classifying participants as "at-risk" or "not-at-risk", this class assignment would maximize the expected posterior utility point-wise. Thus, a classification tree calibrated to this third synthetic dataset will approximately optimize the expected utility function.
In summary, the settings compared for different adaptive tests are the tree type (regression or classification), the stopping criterion (maxIPP or maximum depth), and the synthetic data used for calibration (Perturb + RF, GCFM + XBART, or utility-based class labels). In total this leaves four regression tree methods and three classification tree methods. For $\gamma$ being each of the seven methods, we computed the expected utility draw $EU_{\theta^{(j)}}(\gamma)$ over the $j$th sample population $\{\tilde{x}_{ij}, \tilde{y}_{ij}\}_{i=1}^{N}$ using Equation (4). We similarly computed the $j$th draw of expected utility of the non-shortened assessment $\gamma^*(\cdot) = \mathrm{Thr}_{C^*}(\bar{E}(\tilde{Y} \mid \cdot))$, and the difference between the two expected utilities is $\Delta_{\theta^{(j)},m}$. Figure 1 shows the boxplots representing the distribution of $\Delta_{\theta,m}$ for each of these seven methods, for a utility function weight of w = 0.5. The maxIPP/maximum depth criteria are grouped as "Number of Items".
First, we emphasize that Figure 1 is not intended to demonstrate the ultimate superiority of any method, but rather to assess which method best approximates our implementation of an optimal (in terms of expected utility) screening instrument that uses all items. We make a direct comparison of these methods on out-of-sample data shortly.
The most striking result in Figure 1 is the superiority of the utility-based methods (all 4 regression-based methods, and the utility-based classification method) at reproducing the utility of the full-item instrument. This is unsurprising because the other two classification methods were not created with the utility function in mind; they were simply calibrated to synthetic data obtained from models fit to the IMC data. The IMC data is highly imbalanced, so these two classification trees tend to predict almost all youth to be "not-at-risk", and they have low expected utility compared to the non-sparse action. The maxIPP and maximum depth stopping criteria produce similar results. We expect there to be a greater difference using maxIPP for an application in which multiple strata of the same question lead to substantially different outcomes, whereas in this particular application, we feel that differences in item responses are mainly important in the "high-low" sense. However, the maxIPP process seems to do at least as well as maximum depth.
We also derived out-of-sample sensitivity and specificity for all 7 of these methods in order to make a fair comparison between them. The results are shown in the tables at the end of this Appendix, one table per utility function (parameterized by w).
Sensitivity is quite poor for the non-utility-based classification methods, presumably because the IMC data is highly imbalanced. As a consequence, synthetic data generated from models fitted to this imbalanced data also have very few "at-risk" predictions, and so these classification trees in turn make very few "at-risk" predictions. We note that the method Classification - maxDepth - Perturb + RF was the method used for designing the tree-based adaptive tests in prior work, according to our personal communication with the authors (Gibbons and Wang). While their method seems to work well for problems such as screening for Major Depressive Disorder, in our application, which has extremely imbalanced class outcomes, it performs quite poorly.
These results highlight the benefit of optimizing the adaptive test to the utility function, either using a regression tree and a separately optimized cutoff, or using a classification tree calibrated to utility-based class labels. In both of these methods, we are able to make finer adjustments to the screening procedure and balance sensitivity and specificity to be in line with desired ranges for a particular group.
For all of the utility-based methods (all regression methods and the utility-based classification method), increasing w improves the sensitivity and decreases the specificity, as expected. The regression methods using GCFM + XBART and the utility-based classification method are two different ways of obtaining an adaptive test that approximates the same utility function using the same fitted models, the only difference being whether the utility is optimized pointwise or globally.
We see three advantages of our proposed two-step (global optimization) method:
1. We believe the two-step formulation is more intuitive, because the connection to sensitivity and specificity in the quantity used to optimize the threshold (Equation (6) in the main paper) is more direct.
2. The two-step process allows one to visualize the impact of the weight w via ROC curves, as in Figure 5 from the main paper. Since the predicted probabilities do not change, and the utility function is optimized via the threshold function $\mathrm{Thr}_{C_T}$, one can plot a ROC curve of the predicted probabilities. This allows one to see directly how changing w results in different thresholds and corresponding sensitivity and specificity.
3. In terms of practical implementation, it is much more efficient to fit only one regression tree and then separately optimize the threshold for many values of w, compared to fitting a new classification tree for every new value of w one would like to examine. Since our MCMC sample contains 1 million synthetic data points, this is a substantial computational speedup when many values of w are under consideration.
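The third advantage can be illustrated with a small sketch in which the predicted probabilities are computed once and only the cutoff is re-optimized for each weight. The utility used here, w × sensitivity + (1 − w) × specificity, is an assumed stand-in for Equation (6) of the main paper, which is not reproduced in this Supplement; the toy probabilities, labels, and function names are hypothetical.

```python
def sensitivity_specificity(probs, labels, cutoff):
    """Sensitivity and specificity when predicting "at-risk" if p >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def best_cutoff(probs, labels, w):
    """Cutoff maximizing the assumed utility w * sens + (1 - w) * spec."""
    # Every distinct predicted value, plus "never flag anyone".
    candidates = sorted(set(probs)) + [float("inf")]

    def utility(cutoff):
        sens, spec = sensitivity_specificity(probs, labels, cutoff)
        return w * sens + (1 - w) * spec

    return max(candidates, key=utility)


# Toy tree predictions (fixed) and synthetic labels.
probs = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

# One fitted tree, many weights: only the cutoff search is repeated.
cutoffs = {w: best_cutoff(probs, labels, w) for w in (0.2, 0.5, 0.8)}
print(cutoffs)  # {0.2: 0.9, 0.5: 0.3, 0.8: 0.3}
```

In this toy example a larger w moves the chosen cutoff down, which raises sensitivity at the cost of specificity, matching the qualitative behavior described above.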