Advice, approach, code, and a quick reference guide for researchers attempting to estimate statistical power in experiments involving rare events.
Objectives: This study addresses the significant challenge of estimating statistical power for experiments involving rare but impactful events, where traditional power analysis models, which assume a normal distribution, inadequately estimate required sample sizes. It introduces a simulation framework leveraging Poisson regression models to better capture the dynamics of rare event occurrences and their impact on power calculations.
Methods: Using Poisson distributions, a comprehensive simulation framework systematically assesses the interplay between sample size, effect size, and event occurrence rates. This approach is calibrated against real-world criminological constraints to ensure relevance and applicability. A key innovation is the "power lift" metric, devised to provide a more stable estimation of the required sample sizes by addressing the inherent instability in power simulations for rare events.
Results: The simulation reveals intricate relationships between effect size, sample size, and event rates, underscoring the challenge of achieving sufficient statistical power in studies of rare events. The "power lift" metric emerges as a robust tool for determining sample size requirements, offering greater stability across various scenarios. Results are operationalized through user-friendly simulation code and a quick reference table, providing practical tools for researchers.
Conclusions: The study advances a methodological framework for power analysis in the context of rare events in criminology. By integrating the "power lift" metric and a Poisson regression-based simulation framework, it offers researchers a more accurate and reliable means of estimating sample sizes for statistically robust investigations into rare but socially significant events.
Keywords: power analysis, rare events, count models, simulation
Statistical power, the probability of correctly rejecting a false null hypothesis, is a cornerstone consideration in the design and interpretation of empirical studies across various disciplines (Clark, 1977), including criminology and criminal justice (Gerber & Green, 2012; Vasilaky & Brock, 2020). If, as Sherman notes, “our best chance to reduce human misery is with positive evidence of programs that work well” (2007, p. 300), then appropriate design is not merely a statistical formality, but a necessary condition for criminological research to be ethically responsible and methodologically sound. Criminological findings often have immediate implications for public policy and justice, and scholars have called attention to the importance of statistical power (Barnes et al., 2020; Niemeyer et al., 2022). A poorly powered study risks failing to detect meaningful effects of interventions or policies, which can lead to erroneous conclusions that perpetuate ineffective, or even harmful, practices (MacDonald, 2023; McCord, 2003). Similarly, statistically significant findings in an underpowered study can be misleading and overstate the effect of the intervention (Blair et al., 2019). From this perspective, power analysis becomes an indispensable tool for both the planning and the evaluation of empirical inquiries in criminology.
The path to sufficient statistical power becomes less clear when the outcome to be studied is a rare event. Rare events represent a methodological challenge that necessitates robust statistical power, particularly when the goal is to capture substantive variations between treatment and control conditions (Hodkinson & Kontopantelis, 2021; King & Zeng, 2001). Rare events, characterized by their low frequency, defy the assumptions that hold for normally distributed outcomes; such outcomes are both more commonplace and readily assessed with the conventional power analysis tools commonly deployed by criminology scholars. The rarity of low-frequency events, however, neither diminishes their societal significance nor alleviates the need for rigorous empirical scrutiny. On the contrary, rare events such as instances of police use of force are highly consequential: they merit public attention and often prompt changes in policy and practice (Alpert et al., 2017). The infrequent nature of these events means that accumulating sufficient data for robust statistical analysis is arduous (LeBeau, 2019; Schafer et al., 2022), a difficulty compounded by the inadequacy of traditional power analysis techniques in such circumstances. In combination, the nature of rare event data and the assumptions inherent in traditional power analysis techniques complicate the achievement of adequate statistical power; they also significantly hinder the development of evidence-based policies, a cornerstone of modern criminological studies (Lum & Koper, 2017).
The overarching aim of this research, therefore, is to untangle the complex nexus between statistical power, ethical responsibility, and societal impact, particularly within the field of criminology. By systematically exploring key parameters—sample size, effect size, and the mean occurrence rate in control groups—this study proposes empirical guidelines which set out to advance both the design and interpretation of research on rare events. This effort addresses a noticeable gap in the academic literature by introducing a comprehensive simulation framework explicitly designed to evaluate statistical power in the study of such events. Central to this framework is the novel concept of a "power lift metric," a dynamic criterion which mitigates the inherent volatility in power estimates by establishing a power stabilization point. This methodological innovation serves to align the rigorous standards required for well-powered experimental designs with the ethical imperatives that rightfully prevail in these high-stakes research domains.
In light of the frequently underpowered studies in this field, it is also important to engage with the attendant ethical and practical issues, given that null findings from a statistically underpowered experiment can still offer valuable contributions to the broader scientific discourse, provided, of course, that they are both meticulously obtained and interpreted with caution. Accordingly, the goal of this paper is future-oriented: it seeks to contribute to the field by enriching the quality of research, the pursuant findings, and subsequent interpretations (see Gelman, 2019; Paternoster et al., 1998).
The current effort joins a growing body of work aimed at improving how social scientists navigate conflicting advice on statistical power (Giner-Sorolla et al., 2024), critiques of off-the-shelf software estimators (Sommet et al., 2023), analyses of how “stopping rules” negatively affect power considerations (Kowialiewski, 2024), and improved simulation techniques to help guide future designs (Weiss, 2024). The current effort is closest to two of these recent works. First, Weiss (2024) shows that two-way fixed effects designs in panel settings used to study the effects of federalism (i.e., each US state becomes an observational unit at a given time) are underpowered even when measuring normally distributed events. Second, Sommet et al. (2023) show that commonly used software such as G*Power (Faul et al., 2007) and the `pwr` R package (Champely, 2020) provide poor estimates of required sample size when an interaction term is present. The current study offers a more generalized critique, suggesting that for rare events the assumptions of normality are violated so strongly that a different approach is required.
Conducting an a priori power analysis requires several key pieces of information, each of which can influence the required sample size for an experiment.1 Altering any of these parameters comes with its own set of trade-offs. The common formula used to estimate statistical power is (Blair et al., 2023; Gerber & Green, 2012):2

$$\beta = \Phi\left(\frac{|\mu_t - \mu_c|\sqrt{N}}{2\sigma} - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right)$$

Where: $\beta$ is the statistical power, $\Phi$ is the standard normal cumulative distribution function, $\mu_t$ and $\mu_c$ are the mean outcomes in the treatment and control groups, $N$ is the total number of subjects, $\sigma$ is the standard deviation of the outcome, and $\alpha$ is the significance level.

By varying the values for $N$, the effect size ($\mu_t - \mu_c$), $\sigma$, and $\alpha$, researchers can estimate the sample size required to achieve a desired level of power.
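As a concrete illustration, the closed-form formula from Gerber & Green (2012) can be computed directly. The Python sketch below uses only the standard library; all parameter values are illustrative assumptions, not values from this study.

```python
# Closed-form power calculation following Gerber & Green (2012):
# power = Phi(|mu_t - mu_c| * sqrt(N) / (2 * sigma) - Phi^{-1}(1 - alpha/2)).
# Parameter values below are hypothetical, chosen for demonstration only.
import math
from statistics import NormalDist

def analytic_power(mu_t, mu_c, sigma, n, alpha=0.05):
    """Two-sided power for a difference in means with n total subjects (50/50 split)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(abs(mu_t - mu_c) * math.sqrt(n) / (2 * sigma) - z_crit)

# Example: a 0.2-unit difference (sigma = 1) with 500 total subjects.
print(round(analytic_power(mu_t=0.2, mu_c=0.0, sigma=1.0, n=500), 3))  # ~0.61
```

Quadrupling the sample size roughly doubles the standardized effect term, which is why power rises steeply and then plateaus as $N$ grows.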
In their meta-analysis, Barnes et al. (2020) demonstrate that most criminology studies are underpowered; others use simulation techniques (similar to those employed here) to argue that many of the discipline’s findings are likely false positives (Niemeyer et al., 2022). Given the infrequency of their dependent variable of choice, researchers often face the hurdle of obtaining a sufficiently large sample size to detect meaningful differences or effects. This concern is especially salient for count models, such as Poisson or negative binomial regression. These modeling choices are common across various criminology subfields. Policing research is a useful example: some researchers investigate how a de-escalation intervention affects counts of use of force (Engel et al., 2022), while others use counts of violent crime and related arrests in hot spots (Smith et al., 2023). These models are well-suited for capturing the underlying event processes, but require a robust number of occurrences to produce reliable estimates (MacDonald & Lattimore, 2010). The challenge is compounded by the potential for over- or under-dispersion in the data, which may necessitate more complex models, such as zero-inflated Poisson or hurdle models (Wagh & Kamalja, 2018).
While count models offer a natural fit for analyzing rare count events, other methods are also available. Alternative frameworks, such as categorical models that classify events into severity levels, or continuous measures, like survival analysis, are also options for researchers. Event counts could also be transformed into rates per unit of interaction. Necessarily, each alternative approach comes with its own specific set of assumptions and limitations, each of which could differentially affect the power of the study (see Blair et al., 2023 for varied approaches through the `DeclareDesign` framework and software). The choice of model should therefore be motivated by the research question and the nature of the data, as well as considerations of statistical power.3 The simulation code provided in the appendix is designed to help researchers develop even more complex simulations.
By necessity, criminologists study statistically rare events (MacDonald & Lattimore, 2010). Examples include police use of force (Adams & Alpert, 2023; Adams et al., 1999; Garner et al., 2002; Schafer et al., 2022); homicides, sexual assaults, and other types of uncommon victimization (Schnell et al., 2017); death penalty sentences, wrongful convictions, and criminal sentencing outcomes (Koons-Witt et al., 2014); terrorist attacks and school shootings (Chermak, 1994; Gruenewald et al., 2022); and a wide variety of correctional outcomes (Logan et al., 2017). In most settings, these outcomes will not be normally distributed. Accordingly, reliance on the traditional method for sample size estimation can have profound implications. As the simulation included below indicates, a study examining rare events using the traditional method would be underpowered, thereby increasing the risk of Type II errors along with the likelihood of erroneous conclusions and misguided policy recommendations. Manifestly, an underpowered study has potentially significant ethical and practical ramifications, particularly when the results inform policing practices and public policy. Given these limitations and the high stakes involved in criminological research, there is an urgent need for a re-evaluation of existing power analysis methodologies. We--researchers and reviewers alike--must become more critical of the power analysis methods we employ, especially with regard to rare but socially significant events. Adopting more suitable methods, such as simulation-based power analyses that account for rare count events, would improve both the rigor and the ethical standing of criminology research.
To illustrate the impact of incorrect assumptions in the context of a power analysis, it may be instructive to consider a hypothetical study aiming to measure the effectiveness of a policy intervention on police use of force. Let us presume that a researcher conducts a power analysis via simulation (a scenario which will be discussed in the following section) based on a standard normal distribution of the outcome variable (here, a count). The researcher sets the Type I error rate (α) at the conventional 0.05.
As counts become relatively large, they begin to approximate the normal distribution, and scholars estimating power for these types of counts can likely use traditional methods with little risk. Such an approach is used by software such as the `powerMediation` package, through its function `powerPoisson` (Qiu, 2021). A mean count of 10, for example, generates a reasonably normal approximation. However, as I will demonstrate in the simulations that follow, these traditionally (but incorrectly) derived sample sizes are severely underestimated when compared to those given by a correctly-specified power analysis that accounts for event rarity and employs an appropriate count model, such as the Poisson distribution.
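One way to see why a mean count of 10 behaves approximately normally while a rare-event count does not is to compare Poisson skewness, which equals $1/\sqrt{\lambda}$ and therefore shrinks as the mean grows. The short Python check below is illustrative only; the lambda values are arbitrary examples.

```python
# Poisson skewness is 1/sqrt(lambda): near zero for moderate counts,
# extreme for rare events. Lambda values here are illustrative assumptions.
import math

def poisson_skewness(lam):
    return 1 / math.sqrt(lam)

print(poisson_skewness(10))    # moderate count: mild skew, near-normal shape
print(poisson_skewness(0.05))  # rare event: heavy right skew, far from normal
```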
In the realm of criminological research, the practice of power analysis has often been subject to neglect or misapplication (Barnes et al., 2020; Niemeyer et al., 2022; Giner-Sorolla et al., 2024). Traditional methodologies frequently hinge on parametric tests, presupposing a normal distribution of outcome variables (Faul et al., 2007; Vasilaky & Brock, 2020; Sommet et al., 2023). While these conventional techniques are readily accessible through statistical software and computationally less demanding, their limitations become especially apparent when confronted with the simulation of rare count events. The infrequency of these events results in outcome variables that do not have a normal distribution, thereby violating the foundational assumptions of a traditional power analysis, and so compromising the study's statistical power, validity, and reliability.
Emerging as a modern alternative, power analysis through simulation offers both flexibility and enhanced accuracy, which is particularly beneficial for complex experimental designs (Arnold et al., 2011; Bellemare et al., 2016; Blair et al., 2022; LeBeau, 2019; Luedicke, 2013). Despite these advantages, the adoption of simulation-based power analysis in criminological research remains limited, possibly due to its perceived computational complexity. However, advances in computational resources and user-friendly software have made this approach far more accessible. This study includes an appendix featuring “plug and play” R code, enabling researchers to conduct their own simulations with minimal effort (R Core Team, 2023).
Crucially, the traditional power analysis methods often rely on closed-form analytical solutions, which are derived under certain assumptions and constraints. This lack of flexibility can be problematic when researchers must make a priori guesses about effect sizes or prevalence estimates. One method to address this uncertainty is to “simply vary your assumptions and see how the conclusions on power vary” (Coppock, 2013, sec. 5). This is precisely where the utility of power simulation comes into play: it allows researchers to subject these varying assumptions to millions of simulated trials.
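Coppock's advice to "vary your assumptions and see how the conclusions on power vary" can be sketched even without full simulation by re-running a power calculation over a grid of assumed inputs. The Python sketch below uses a closed-form normal approximation purely for illustration; all effect sizes and sample sizes are assumptions, not the paper's parameters.

```python
# Sensitivity-analysis sketch: recompute power over a grid of assumed effect
# sizes and sample sizes and inspect how the conclusion changes.
# All grid values are illustrative assumptions.
import math
from statistics import NormalDist

def normal_power(effect, n, sigma=1.0, alpha=0.05):
    """Closed-form two-sided power under a normal approximation."""
    nd = NormalDist()
    z = abs(effect) * math.sqrt(n) / (2 * sigma) - nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z)

for effect in (0.1, 0.2, 0.35):
    for n in (500, 2000, 10000):
        print(f"effect={effect}, n={n}, power={normal_power(effect, n):.2f}")
```

A grid like this makes the trade-offs explicit: a study that looks well powered under an optimistic effect size can collapse below conventional thresholds under a pessimistic one.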
Simulation-based power analysis stands out for its ability to model the complexities specific to criminological data, including non-normal distributions and other idiosyncrasies (Vasilaky & Brock, 2020). It affords greater flexibility in study design, produces more accurate estimates by closely mimicking population characteristics, and ensures ethical soundness by avoiding both underpowered and overpowered studies. Moreover, it is less susceptible to the violation of assumptions, offering a robust pathway to power estimation. Taken together, these advantages indicate that simulation-based power analysis offers a path to more reliable conclusions and more defensible research designs.
In the context of criminological research, rare events refer to occurrences that are infrequent but of considerable significance. For instance, instances of police use of force, although statistically uncommon, possess profound implications for public trust and policy-making (Adams & Alpert, 2023; McLean et al., 2022; Schafer et al., 2022). Similarly, the most serious crimes, such as murder, pose significant challenges in measurement due to their rarity (Piquero et al., 2005). Regardless of the discipline or specific focus in question, rare count events must be dealt with carefully. It is particularly important to ensure adequate statistical power in order to detect meaningful differences between treatment and control groups (Hodkinson & Kontopantelis, 2021; King & Zeng, 2001).
For our current purposes, event rarity is conceptualized as the mean occurrence rate, represented by the symbol lambda (λ) of the Poisson distribution.
Effect size is a fundamental measure across various disciplines, quantifying the magnitude of the relationship between two variables independently of the sample size. This emphasis on effect size underscores the principle that mere statistical significance in a model is insufficient; rather, the substantive impact of our interventions is paramount (Appelbaum et al., 2018; Kline, 2013). It operates as a normalized metric, enabling the synthesis of findings across disparate studies or contexts, and forms an essential component of meta-analytical methodologies within criminology (Barnes et al., 2020; Pratt, 2010; Wilson, 2001). Although Cohen’s (1988) initial benchmarks (d = 0.2, 0.5, and 0.8 for small, medium, and large effects, respectively) are widely utilized, Cohen himself cautioned against their indiscriminate application, a sentiment echoed by criminologists who argue these standards do not universally apply within their field (Braga & Weisburd, 2022).
The application of Poisson-distributed data within this study, coupled with the exponential transformation used in the simulation code (exp(β)), requires translating Cohen's benchmarks into percentage changes in event rates rather than standardized mean differences.
This adaptation of Cohen's conventions to Poisson-distributed data underscores the flexibility and contextual nature of effect size interpretation. In criminology, even modest percentage increases can carry significant implications, emphasizing the necessity of a nuanced approach to evaluating the effectiveness of interventions or policy changes. For instance, even a “small” decrease of 10% in the rate of specific crimes or police use of force could prompt substantial reconsideration of existing strategies or the introduction of new measures.5
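Because the simulation operates on the log scale of the Poisson model, effect sizes translate into rate ratios via exponentiation. The snippet below applies this to the effect sizes used later in this paper (0.1, 0.2, 0.35); the translation itself is standard for log-link models.

```python
# Translating log-scale effect sizes into percentage changes in the event
# rate via the Poisson log link: rate ratio = exp(beta).
import math

for beta in (0.1, 0.2, 0.35):
    rate_ratio = math.exp(beta)
    print(f"beta={beta}: rate ratio {rate_ratio:.3f} "
          f"(~{(rate_ratio - 1) * 100:.1f}% change in event rate)")
```

This is why an effect size of 0.1 corresponds to roughly the 10% change discussed above, while 0.35 implies a change of around 42%.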
This study uses simulation-based methods to scrutinize the statistical power implications for research on rare event counts. Conventional methods of estimating required sample sizes fail to account for the structure of count data, and the rarity of events adds additional complexity. Therefore, I present two methodological innovations below. First, I demonstrate a method of power simulation for rare count events. Second, based on the results of the simulation, I develop and present the power lift metric, a method of finding the “power stabilization point” as a secondary metric in these types of analyses. Key information for the results begins in the following section.
By using a computational approach, the study systematically assesses how variations in sample size (n), effect size (β), and the mean occurrence rate (λ) influence statistical power.
This simulation model operates under several key and reasonable assumptions. First, it presumes simple random assignment to treatment and control groups, emulating a randomized controlled trial setting.6 For the purposes of this simulation, a two-arm trial is tested, though the simulation code could easily be adapted to multiple trial arms (e.g., two treatments and a control). Second, the simulation stipulates a constant given effect size across varied sample sizes and mean occurrence rates, thereby simplifying the computational environment. Third, the events being examined are assumed to follow a Poisson distribution, a common approach for modeling count data of rare events.
The model further assumes that each event is statistically independent and employs a log-link function in the Poisson regression model. It uses a conventional alpha level of 0.05 to determine statistical significance. Finally, it assumes a constant mean occurrence rate (λ) for the control group within each simulated scenario.
The cornerstone of this methodology is the use of Poisson regression models within a Monte Carlo simulation. This choice is motivated by the count nature of the data commonly under investigation in criminological settings. Poisson regression allows for the modeling of count data (Berk & MacDonald, 2008), providing a framework to assess the rate at which these rare events occur, both in treatment and control groups. It is especially suitable for this kind of study because it can handle low-frequency events and still produce robust estimates. The simulation integrates a range of sample sizes, effect sizes, and mean occurrence rates to offer a comprehensive exploration of statistical power in varying contexts.
Two key parameters in this study are the assumed mean occurrence rate, lambda (λ), and the assumed effect size.
The effect size (β) quantifies the magnitude of the treatment effect on the log scale of the Poisson model; in the simulations that follow, values of 0.1, 0.2, and 0.35 correspond to small, medium, and large effects, respectively.
The parameter assumptions for the simulation were not arbitrarily chosen, but are informed by real-world considerations drawn from policing research. For instance, the upper limit for the sample size was set at 10,000 units, reflecting the practical constraint imposed by the largest police departments. The New York Police Department (NYPD), the largest in the country, employs approximately 30,000 officers; however, only a subset of these, primarily uniformed patrol officers, are likely to be involved in use-of-force incidents. This upper boundary therefore serves as a realistic estimate of the largest sample sizes that experimental studies could potentially draw upon.7 In the simulations that follow, the main analysis constrains the upper boundary at 10,000, while a secondary analysis allows for a ceiling of up to 30,000 observations.
Similarly, the limits for effect size and event occurrence rates were selected to reflect practical and operational realities. Smaller effect sizes may correspond to interventions that are easier to implement but yield modest impacts, while larger effect sizes could represent more transformative but challenging-to-implement policy changes. In any case, the effect size parameters of small, medium, and large are common across disciplines and approaches (Ferguson, 2016; Lakens, 2013).8 These considerations ensure that the study's findings are not just methodologically rigorous, but also practically relevant and applicable to the field of criminology.
The simulation procedure is implemented through a nested loop structure coded in R. The simulation code builds upon original simulation code prepared by Coppock (2013) and formalized in the “Model, Inquiry, Data, and Analysis” (MIDA) framework of research design via Monte Carlo simulation (Blair et al., 2023).9 Initially, the mean occurrence rate (λ) is set for the control group; the code then iterates over each combination of effect size and sample size, simulating Poisson-distributed outcomes, fitting a Poisson regression, and recording whether the treatment effect is statistically significant. Power for each combination is estimated as the proportion of significant results across the simulated trials.
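The nested-loop procedure can be sketched compactly; the paper's actual implementation is the R code in the appendix. For portability, this Python sketch replaces the Poisson regression fit with a Wald test on the log rate ratio, which is asymptotically equivalent to testing the treatment coefficient in a log-link Poisson regression with a binary treatment indicator. The λ, β, trial counts, and the 0.5 continuity correction are illustrative assumptions.

```python
# Monte Carlo power simulation sketch for a two-arm trial with rare Poisson
# outcomes. All parameter values are illustrative assumptions.
import math
import random

def poisson_draw(lam, rng):
    """Knuth's algorithm; adequate for the small lambdas used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_power(lam, beta, n, trials=500, z_crit=1.96, seed=42):
    """Share of simulated trials in which the treatment/control rate ratio is
    significant (Wald test on the log rate ratio)."""
    rng = random.Random(seed)
    half = n // 2  # equal allocation to treatment and control
    hits = 0
    for _ in range(trials):
        y_c = sum(poisson_draw(lam, rng) for _ in range(half))
        y_t = sum(poisson_draw(lam * math.exp(beta), rng) for _ in range(half))
        yc, yt = y_c + 0.5, y_t + 0.5  # continuity correction for zero counts
        z = (math.log(yt / half) - math.log(yc / half)) / math.sqrt(1 / yt + 1 / yc)
        if abs(z) > z_crit:
            hits += 1
    return hits / trials

# Rare event (lambda = 0.05) with a large effect (beta = 0.35):
print(simulate_power(lam=0.05, beta=0.35, n=5000, trials=200))  # well powered
print(simulate_power(lam=0.05, beta=0.35, n=200, trials=200))   # badly underpowered
```

Wrapping `simulate_power` in loops over λ, β, and n reproduces the nested-grid structure described above; the jagged power curves discussed in the results emerge directly from the Monte Carlo noise at low λ.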
The study employed a simulation-based approach to examine statistical power across a variety of scenarios. These scenarios were characterized by unique combinations of three parameters: effect size (0.1, 0.2, 0.35), mean occurrence rate (λ), and sample size.
To extend and ground the simulation, a secondary study broadens the sample size range to encompass up to 30,000 observations. This expansion is designed to offer a holistic simulation environment that reflects the reality criminology researchers face when considering potential measures for their studies (Dilulio Jr, 1994). By including this extended range, the study accommodates the full scope of agency sizes, from smaller local departments to the largest in the U.S.—the New York Police Department (NYPD). While most empirical studies may not have access to an agency as large as the NYPD, the inclusion of such an extreme yet realistic sample size ensures that the simulation results are not merely academic exercises, but are directly applicable to the varied and complex settings that researchers in the field may encounter (Alpert & Moore, 1993). Given the computational demands of the primary analysis, this secondary analysis restricted the simulation to 100 trials at each intersection of effect size, mean occurrence, and sample size. Accordingly, the second simulation runs just 270,000 trials. The visualization for the general simulation can be found in Figure 2. The visualization for the large agency simulation can be found in the Appendix (Appendix Figure 1).
In Figure 2, the fluctuating, unstable nature of statistical power estimates in the study of rare events is immediately apparent. Unlike research assuming normally distributed outcomes (see the earlier case illustration), which tends to yield stable, mostly monotonic power curves, the use of count data introduces a different set of challenges. Count data exhibit unique distributional properties that can significantly affect the stability of power estimates. Specifically, count data are inherently discrete, as opposed to continuous, and introduce a level of granularity that complicates the estimation of effect sizes and, consequently, statistical power (MacDonald & Lattimore, 2010). Furthermore, the distribution of count data for rare events is highly skewed and may also display excess kurtosis. Such distributional characteristics deviate considerably from the assumptions of normality that underlie many conventional statistical methods. The simulation results show how this divergence can introduce volatility in the resulting power estimates, rendering them less reliable for the purposes of research design and interpretation. The instability is particularly pronounced under conditions of small effect size (0.1) combined with low mean occurrence rates.
The mean occurrence rate (λ) exerts a similar influence on the stability of power estimates: the rarer the event, the more volatile the estimates become, and the larger the sample size required before power consistently exceeds conventional thresholds.
A primary contribution of the power simulation method lies in its demonstration of how the interactions between distributional qualities, small effect size, and a low mean occurrence rate can be particularly pernicious. When both conditions (small effect size and low mean prevalence) are simultaneously present, their combined effect can dramatically amplify the instability in power estimates. Thus, the idiosyncratic attributes of count data, coupled with the challenges posed by small effect sizes and low mean occurrence rates, coalesce to produce unstable estimates of statistical power. This is particularly the case when applied to rare events. Moreover, this instability stands in contrast to the more stable power estimates typically observed in research scenarios predicated on normally distributed outcomes.
After reviewing the visualizations presented in Figure 2 and Appendix Figure 1, it becomes evident that conventional methods for determining sample size fail to provide stable estimates of statistical power, particularly in the context of rare events. Indeed, we can see this in the Case Illustration subsection (above), which is predicated on the assumption of a standard normally distributed outcome variable. This instability produces erratic fluctuations in power estimates across different sample sizes, lifting above and then dipping below the commonly accepted 80% threshold (Cohen, 1988). Such volatility in power estimates makes clear the need for a more nuanced methodological framework, one that can offer robust and reliable guidelines for achieving stable statistical power in empirical studies. It is against this unstable backdrop that I introduce the concept of a “power stabilization point” and the power lift metric.
Traditional methods, which identify the minimum sample size as the first instance where statistical power exceeds 80%, may yield unstable and potentially misleading results. This is particularly problematic for studies focusing on rare but consequential events, such as incidents of police use of force, where power estimates are unstable. The “power lift” metric offers a more robust estimate of the required sample size. This metric is flexible, transparent, and reproducible, thereby allowing researchers to establish how often the estimated power is “lifted” above the desired threshold.
The concept of a power stabilization point is predicated on two pivotal parameters:
Power Threshold: This is set at 80%, in alignment with conventional statistical guidelines for acceptable power. The threshold serves as the minimum level of power that a sample size must achieve to be considered stable.
Stability Count: This denotes the minimum number of consecutive sample sizes where the power remains consistently above the Power Threshold. In the context of this simulation, a stability count of five was tested; however, this is simply a testing threshold; researchers may choose any given stability count to test in the context of their design and research question. A very large count may represent, for example, “pure” stability where sample size estimates never again fall below the desired threshold. A small count, such as the one used here, represents a pragmatic desire to achieve relatively stable power, but perhaps in a context where there are realistic constraints on the numbers of possible observations (i.e., a police department with a fixed number of officers, or a spatial study with a fixed number of geographic points).
The procedural steps for establishing the stabilization point are as follows:
Parameter Grouping: The simulation results are grouped by unique combinations of effect size and mean occurrence (λ), so that stabilization is assessed separately for each scenario.
Iterative Assessment: Within each parameter group, the data is sorted in ascending order by sample size. This facilitates sequential evaluation, ensuring that the analysis proceeds in a methodologically rigorous manner.
Counter Initialization: Two counters—consecutive_count and stable_sample_size—are initialized. The former tracks the number of consecutive sample sizes meeting the Power Threshold, while the latter stores the smallest sample size at which the power stabilizes.
Sequential Evaluation and Counter Update: For each row within a parameter group, the algorithm evaluates whether the power exceeds the predetermined Power Threshold. If it does, then the consecutive_count is incremented. Once the consecutive_count reaches the Stability Count, the sample size at the first instance of this sequence is recorded as the stable_sample_size.
Dynamic Adaptation: If no stabilization point is identified, then the algorithm dynamically adjusts the Stability Count by decrementing it. This iterative adjustment continues until a stabilization point is identified or the Stability Count reaches zero.
If power is stable for five steps, the sample size at the first step of this stable sequence is recorded as the stable sample size. This means that the minimal sample size required for stable power is the one at the beginning of the sequence, where power is sustained above the 80% threshold for five consecutive steps. This approach is based on the premise that, once a stabilization point has been reached, subsequent sample sizes would naturally inherit this stability, at least up to the point defined by the Stability Count. Therefore, the first step in the sequence provides the minimum sample size that meets the criteria for stable power.
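The procedural steps above, including the dynamic adaptation fallback, can be condensed into a short function. The paper's own implementation is the R function in the appendix; the Python sketch below is an illustrative re-expression that assumes `results` holds (sample size, power) pairs for a single combination of effect size and λ.

```python
# Power stabilization point finder: the smallest sample size opening a run of
# `stability_count` consecutive sample sizes whose power stays above
# `threshold`. If no such run exists, the stability count is decremented
# (dynamic adaptation) until a run is found or the count reaches zero.
# The input format (list of (sample_size, power) pairs) is an assumption.

def stabilization_point(results, threshold=0.80, stability_count=5):
    results = sorted(results)  # ascending by sample size
    while stability_count > 0:
        consecutive, start = 0, None
        for n, power in results:
            if power > threshold:
                if consecutive == 0:
                    start = n  # first sample size of the candidate run
                consecutive += 1
                if consecutive >= stability_count:
                    return start, stability_count
            else:
                consecutive, start = 0, None  # run broken; reset
        stability_count -= 1  # dynamic adaptation: relax the criterion
    return None, 0

# Hypothetical power curve: dips back below 0.80 at n=600 before stabilizing.
curve = [(100, .41), (200, .62), (300, .81), (400, .83), (500, .82),
         (600, .79), (700, .84), (800, .86), (900, .87), (1000, .88), (1100, .90)]
print(stabilization_point(curve))  # -> (700, 5)
```

Note that the naive "first crossing" rule would report n = 300 for this curve, even though power falls back below 80% at n = 600; the stabilization point correctly reports n = 700.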
This methodological framework ensures that the identified sample size is not only statistically powerful, but also exhibits stability across a range of conditions. This is particularly important for research on rare events, where sample size recommendations need to be both robust and adaptable.
The R function presented in the Appendix serves to operationalize the defined algorithmic framework for identifying stabilization points in power analysis simulations. Specifically, the function conducts the following operations:
It imports simulation results from a JSON file. These results are flattened and stored in an R list, where each element corresponds to a unique combination of effect size and mean occurrence.
The function initializes an empty list, stabilization_results, to hold the computed stable sample sizes and the number of steps for each unique combination of variables.
A loop iterates over each unique combination of effect size and mean occurrence. Within this loop, the data are sorted by sample size, and then a nested loop searches for the stabilization point according to the predefined criteria.
The function employs an initial Stability Count (initial_steps), which represents the minimum number of consecutive sample sizes that must have a power greater than 80% for the sample size to be considered 'stable'.
If the Stability Count criterion is not met, the function iteratively decrements the count by one and reevaluates, continuing this process until a stabilization point is identified or the count reaches zero.
The stable sample size and the number of steps used to identify it are stored in the stabilization_results list for each unique combination of variables.
The function ultimately returns the stabilization_results list, which can then be converted into a more interpretable data frame format for further analysis or reporting (such as a .csv file).
Results demonstrate the nuanced relationships between effect size, mean occurrence rate, and the stabilization point of statistical power in the context of a rare event modeled with counts. The concept of a "power stabilization point" offers a methodologically rigorous approach to determining the sample sizes required to achieve acceptable levels of statistical power across different effect sizes and base rates. Results for the stability test in both the general simulation (up to 10,000 sample size) and the large agency simulation (up to 30,000 sample size) are reported in Table 1.
| Simulation | Effect Size | Mean Occurrence | Stable Sample Size | Stable Steps |
|---|---|---|---|---|
| General (10k) | 0.35 | 0.1 | 3,250 | 5 |
| General (10k) | 0.35 | 0.2 | 1,450 | 5 |
| General (10k) | 0.35 | 0.3 | 1,250 | 5 |
| General (10k) | 0.2 | 0.1 | 4,250 | 2 |
| General (10k) | 0.2 | 0.2 | 4,850 | 5 |
| General (10k) | 0.2 | 0.3 | 4,450 | 5 |
| General (10k) | 0.1 | 0.1 | 3,950 | 1 |
| General (10k) | 0.1 | 0.2 | 8,050 | 2 |
| General (10k) | 0.1 | 0.3 | 8,150 | 2 |
| Large Agency (30k) | 0.35 | 0.1 | 3,650 | 5 |
| Large Agency (30k) | 0.35 | 0.2 | 1,550 | 5 |
| Large Agency (30k) | 0.35 | 0.3 | 1,050 | 5 |
| Large Agency (30k) | 0.2 | 0.1 | 10,150 | 5 |
| Large Agency (30k) | 0.2 | 0.2 | 5,450 | 5 |
| Large Agency (30k) | 0.2 | 0.3 | 4,250 | 5 |
| Large Agency (30k) | 0.1 | 0.1 | 28,750 | 5 |
| Large Agency (30k) | 0.1 | 0.2 | 19,950 | 5 |
| Large Agency (30k) | 0.1 | 0.3 | 12,050 | 5 |
Table 1: Power Stabilization Points Across Two Simulations
For a high effect size of 0.35, the stable sample sizes varied considerably depending on the mean occurrence rate: 3,250 at a rate of 0.1, falling to 1,450 at 0.2 and 1,250 at 0.3, with each scenario sustaining five stabilization steps.
The required stable sample sizes were notably larger for the moderate effect size of 0.2. At a mean occurrence rate of 0.1, the stable sample size reached 4,250, but stability was less robust, with only two stabilization steps. With a mean occurrence of 0.2, the stable sample size increased to 4,850, returning to five stabilization steps. At a mean occurrence of 0.3, the required stable sample size was 4,450; this also had five stabilization steps.
The scenarios became more complex when the effect size reduced to 0.1. At a mean occurrence rate of 0.1, the stable sample size was 3,950, but with only one stabilization step, which suggests unreliable stability (i.e., no two estimates of power over 80% occurred subsequent to each other). When the mean occurrence rate increased to 0.2, the required stable sample size escalated dramatically to 8,050 with two stabilization steps. For a mean occurrence of 0.3, the stable sample size was 8,150, again with only two stabilization steps.
For a high effect size of 0.35, the stable sample size was relatively modest, even with the expanded range. Specifically, at a mean occurrence rate of 0.1, the stable sample size was 3,650, satisfying the five-step stability criterion. For mean occurrence rates of 0.2 and 0.3, the stable sample sizes were 1,550 and 1,050 respectively, with each also meeting the five-step stability criterion.
When the effect size was moderate, at 0.2, the stable sample sizes increased substantially in the large agency simulation. For instance, at a mean occurrence rate of 0.1, the stable sample size escalated to 10,150, adhering to the five-step stability criterion. As the mean occurrence rate rose to 0.2, the stable sample size was determined to be 5,450, also with five stabilization steps. At a mean occurrence rate of 0.3, the stable sample size was 4,250, which again fulfilled the five-step stability criterion.
In scenarios with a low effect size of 0.1, the required stable sample sizes were remarkably high. For example, assuming a mean occurrence rate of 0.1, a sample size as large as 28,750 was necessary to achieve five stabilization steps. When the mean occurrence rate increased to 0.2, the stable sample size required to meet the five-step criterion was 19,950. Finally, at a mean occurrence rate of 0.3, a stable sample size of 12,050 was sufficient, maintaining the same five-step stability criterion.
A null finding tells us one of two things: either there is no true effect, or the effect exists, but is too small for the current design to “see” (Blair et al., 2019). In criminology, where the stakes are high and the societal implications are profound, researchers and funders must critically reexamine the implications of null findings in the context of rare but socially significant events (Decker, 2023; Eck, 2006). Failure to publish or fund studies on these topics is especially concerning as it essentially restricts experimental work to a handful of the very largest agencies, severely limiting the generalizability and impact of research findings. In a country of 18,000 agencies, where the average agency fields just 15 or so officers (Gardner & Scott, 2022), policing scholars have an obligation to amass experimental evidence in more than just a handful of the largest agencies in the country. We should not blind ourselves to the forest with a perfect view of the largest two agencies fielding 10,000 or more trees.
Similar measurement issues arise across criminology. Scholars studying homicides, sexual assaults, or other types of uncommon victimization (Schnell et al., 2017); death penalty, wrongful conviction, and criminal sentencing scholars (Koons-Witt et al., 2014); those studying terrorist attacks and school shootings (Chermak, 1994; Gruenewald et al., 2022); and those interested in a wide variety of correctional outcomes (Logan et al., 2017): all of them must grapple with rare events. Our ability to intervene and test results in these contexts is critical, but we must also be able to retain the exactitude and statistical inferences of our scientific approach (Barnes et al., 2020; Eck, 2006). To do so effectively, we must understand the relationship between our focal outcomes, how they are measured, and the implications of those measures for adequate statistical power. Given these ethical and practical challenges, there is an imperative to conduct studies on rare events like police use of force, even when these studies are likely to be underpowered. While the limitations of low statistical power must be fully acknowledged, a null finding should not be hastily dismissed as inconclusive or irrelevant. Rather, it should be seen as a complex piece of evidence that, when carefully interpreted and situated within the broader research context, can contribute to a more nuanced and comprehensive understanding of the ability to affect socially significant issues.
The Poisson distribution's characteristics, combined with the variability inherent in rare event counts, make power estimates unstable from one sample size to the next; it is precisely this instability that the stabilization criterion is designed to absorb.
The study's findings provide a detailed examination of the complexities associated with achieving sufficient statistical power for rare events. Using Poisson regression models, the simulation uncovers the intricate relationships between sample size, effect size, and mean occurrence rate. Three findings stand out.
First, it is evident that challenges abound when dealing with small and medium effect sizes. In the general simulation, for a moderate effect size of 0.2, stable sample sizes ranged from 4,250 to 4,850; however, stability was less robust at a mean occurrence of 0.1, with only two stabilization steps. For a small effect size of 0.1, stable sample sizes exceeded 8,000 at higher mean occurrences, yet never achieved more than two stabilization steps. In the large agency simulation, the stable sample sizes were larger still, reaching 10,150 and 19,950 for effect sizes of 0.2 and 0.1, respectively, in part because the expanded range allowed the algorithm to find fully stable (five-step) estimates. This suggests that the pursuit of stable statistical power with small and medium effect sizes is often impractical, even in large policing agencies. Thus, a reevaluation of the conventional understanding of "adequate" sample size is warranted in these contexts.
Second, the study illustrates that, even for large effect sizes, achieving stable power levels is not straightforward. In both the general and large agency simulations, stable sample sizes for an effect size of 0.35 ranged from 1,050 to 3,650. While these sample sizes were lower and each maintained five stabilization steps, they still exceed the staffing of all but the largest agencies. This makes it clear that methodological robustness can remain elusive even when a policy intervention has a substantial impact on the outcome variable.
Finally, the study ventures into what can be considered methodologically uncharted territory for rare events. In both simulations, for the lowest effect size of 0.1 combined with the lowest mean occurrence rate of 0.1, stability was hardest to achieve: the general simulation identified a stable sample size of 3,950 on the strength of a single stabilization step, while the large agency simulation required 28,750 observations to sustain five steps. Sample sizes of that magnitude are available to only a handful of agencies, placing well-powered designs in this corner of the parameter space out of reach for most researchers.
Collectively, these findings emphasize the methodological and practical constraints that researchers in the field of criminology must navigate. The introduction of the "power stabilization point" serves as a methodological advancement, guiding future research efforts. In light of these not insignificant challenges, innovative approaches and collaborative research initiatives are essential for generating more robust and reliable outcomes. Indeed, the ethical imperative to conduct and publish such studies is paramount; all the more so when they may be underpowered.
Statistical power serves as a multi-faceted tool in research: it stands as a safeguard against Type II errors, enhances the credibility of null findings, and provides a methodological framework for detecting even small effect sizes (Barnes et al., 2020; Vasilaky & Brock, 2020). However, the inexorable push for high statistical power using expansive samples narrows the scope of viable research topics. This is particularly the case in criminology, where a focus on large agencies would inevitably overlook the majority of police departments and would therefore undermine the generalizability of any research findings. As Eck (2006, p. 348) asks: “Which is better, no evaluation, or an evaluation that leaves a large amount of uncertainty as to the true effectiveness of the evaluation.” As a practical matter, and given the challenge in properly powering studies of rare event counts as demonstrated here, the only viable answer to Eck’s question is to “keep experimenting.”
Researchers grappling with phenomena like police uses of force face a significant challenge in achieving adequate statistical power, especially when the expected effect size is moderate or small. The simulations showed that a sample size of approximately 4,450 officers would be necessary to attain 80% stable power if one assumed a medium effect size of 0.2 and a mean occurrence rate of 0.3. This requirement surges to around 8,000 officers for a small effect size at the same occurrence rate. What complicates matters is that these sample sizes, particularly for small effect sizes, are not always stable. Assuming the study is conducted at the individual officer level, these large samples would necessitate the involvement of one of the two or three largest police agencies in the United States (e.g., New York or Chicago police departments, see Gardner & Scott, 2022), which would necessarily raise legitimate concerns about sample representativeness, as well as concomitant doubts about both internal and external validity.
It therefore seems reasonable to argue that the focus on achieving high statistical power can be construed as a form of methodological myopia: it prioritizes rigorous methodology at the expense of rare but socially significant phenomena. The quest for high power inadvertently incentivizes research on topics that are statistically frequent enough to attain it, while excluding rare but impactful issues that merit critical examination, such as police use of force in a smaller agency. In a nation with approximately 18,000 policing agencies, this risks dangerously skewing research priorities by narrowing empirical investigations to a select few large agencies and diverting attention from issues of significant public concern (Lum, 2021; MacDonald, 2023; Sherman, 2007).
The present study is not without limitations. The primary constraint is computation time: a robust simulation of the kind conducted here can take more than 24 hours, even on relatively powerful home office hardware. This extended runtime underscores a significant barrier for researchers without access to such computational resources, and the requirement for high-performance hardware could widen the resource gap between well-funded research institutions and those with more limited financial means.
The implication of this computational demand is twofold. Firstly, it may deter individual researchers or smaller institutions from engaging with much-needed power analyses, thereby limiting the adoption and subsequent refinement of this approach across a broader spectrum of the research community. Secondly, it risks amplifying existing disparities in research capabilities, where only those with access to substantial computing power can afford to implement these advanced analytical techniques. This could inadvertently contribute to a research landscape where the ability to produce rigorous, data-driven insights is a privilege of the well-resourced. While the computational demands described here present a significant challenge, they also highlight an opportunity for the scientific community to innovate solutions that make advanced statistical analyses more accessible.
To that end, Appendix Section 9.2 of this study provides easy-to-implement simulation code for any researcher; by limiting the number of simulations (the ‘sims’ argument in the function), commonly used personal computers should solve most designs efficiently. As computational power continues to increase, current limitations will not stand in the way for long. In the meantime, as a non-computational heuristic, Appendix Section 9.4 provides Appendix Table 1, which approximates required sample sizes across a range of mean occurrence rates and effect sizes, predicated upon an adaptation of Lehr’s (1992) approximation provided by DiMaggio (2024). This table can be used by researchers as a quick first-pass estimate before committing to a full simulation.
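To give a flavor of this style of shortcut, a square-root-transform rule of thumb for comparing two Poisson rates (van Belle, 2011) yields a rough per-group sample size of n ≈ 4/(√λ₀ − √λ₁)². The sketch below is an illustration of that rule only; the exact adaptation underlying Appendix Table 1 may use different constants.

```r
# Rough per-group sample size for detecting a shift between two Poisson
# rates, via the square-root-transform rule of thumb (van Belle, 2011).
# Illustrative only: the Lehr (1992) adaptation used for Appendix Table 1
# may differ. `effect_size` is treated as a log rate ratio here.
approx_n_poisson <- function(lambda0, effect_size) {
  lambda1 <- lambda0 * exp(effect_size)          # rate under treatment
  ceiling(4 / (sqrt(lambda0) - sqrt(lambda1))^2) # per-group n for ~80% power
}
```

As expected, the approximation demands fewer observations as either the base rate or the effect size grows, mirroring the pattern in Table 1.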
A second limitation relates to selecting the target stability count for the “power lift” computation. This is undeniably a “researcher degree of freedom,” and there are likely ways a researcher could exploit this discretionary choice to justify underpowered designs. However, the limitation is mitigated, at least partially, by reviewers’ ability to replicate the analysis using different stability counts. The transparent and reproducible nature of the power lift methodology allows for an open examination of how altering the stability count affects the resultant sample size recommendations. This process not only fosters a dialogue around the most appropriate stability count for a given study but also mitigates the potential for misuse of this flexibility to support underpowered study designs. Furthermore, the methodology encourages the presentation of sensitivity analyses, in which researchers can demonstrate the impact of various stability counts on power and sample size requirements. By requiring such analyses to be included in study protocols or manuscripts, the scientific community can critically assess the robustness of the power calculations and the justification for the chosen stability count. This serves as a safeguard against arbitrary selection and provides a mechanism for consensus building around best practices in power analysis for rare events.
This paper has grappled with a methodological conundrum at the heart of empirical research on rare but impactful societal events. Traditional power analysis models, premised on a normal distribution, are not only misaligned with the distributional nature of these events, but also risk underestimating the requisite sample size, thereby potentially leading to erroneous inferences. In response, this study introduced a comprehensive simulation framework grounded in Poisson regression models, which are intrinsically suited for count data and rare events.
The proposed framework contributes to the discussion by incorporating the concept of a "power stabilization point", which is a more reliable and ethically responsible method for determining sample size. This is not a mere statistical nuance; it is a methodological pivot that aligns statistical rigor with the ethical imperatives of research in high-stakes research domains. The stabilization point ensures that the power estimates are not only statistically adequate, but also robust to the instability of rare events, thereby reducing the risk of Type II errors. Moreover, the paper provides a full suite of simulation code and a user-friendly guide, which will allow researchers to calibrate the framework according to their specific experimental designs. This serves a dual purpose: it democratizes access to robust methodological tools and fosters a culture of replication and validation in the scientific community.
Finally, the paper engages with the ethical and practical imperatives of conducting and disseminating research in areas that may often be underpowered. I argue that null findings, when rigorously obtained and cautiously interpreted, are not statistical artifacts; rather, they are meaningful contributions to the cumulative scientific discourse. They inform policy, guide resource allocation, and, most importantly, inspire questions that propel the field forward. Underpowered and small-n evaluations are “not just better than nothing…Rather, they have a positive value” (Eck, 2006, p. 349). They point to the need to better understand the critical outcomes and interventions found across criminology, however rare the events at their core may be.
As such, this paper does more than fill a methodological gap: it invites researchers to think more deeply about the interplay between statistical power, ethical responsibility, and social impact. It offers a robust, practical, and ethically sound roadmap for navigating the complexities of empirical research on rare but consequential events, thereby contributing to a more nuanced and reliable scientific understanding in these high-stakes fields.
My sincere thanks to the many rare but impactful friends I’ve made along the way, I cannot count the ways you’ve improved my life. Specific gratitude is owed to John MacDonald, Bryant Moy, Annette Gibert, Amanda Weiss, Travis Carter, Daniel Schiff, Geoff Alpert, Scott Mourtgos, Kaylyn Jackson Schiff, Brent Klein, Chris Marier, Hunter Boehme, and participants at the Interdisciplinary Policing Policy Workshop (University of Utah, 2024) for comments and feedback on earlier versions of this paper. Thanks to Josh Jackson for the initial exchanges that piqued my interest. Finally, my thanks to Carl Jenkinson for managing a helpful edit even while dealing with watching his beloved Liverpool FC lose on a devastating self-scoring point.
Adams, I. T., & Alpert, G. P. (2023). Use of Force in Policing. In Oxford Research Encyclopedia of Criminology and Criminal Justice. https://doi.org/10.1093/acrefore/9780190264079.013.845
Adams, K., Alpert, G. P., & Dunham, R. C. (1999). Use of force by police: Overview of national and local data. In K. Adams (Ed.), What we know about police use of force (pp. 1–14). US Department of Justice, Office of Justice Programs.
Alpert, G. P., McLean, K., & Wolfe, S. (2017). Consent Decrees: An Approach to Policy Accountability and Reform. Police Quarterly, 20(3), 239–249.
Alpert, G. P., & Moore, M. H. (1993). Measuring police performance in the new paradigm of policing. Performance Measures for the Criminal Justice System, 109, 111.
Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board task force report. American Psychologist, 73(1), 3.
Barnes, J. C., TenEyck, M. F., Pratt, T. C., & Cullen, F. T. (2020). How Powerful is the Evidence in Criminology? On Whether We Should Fear a Coming Crisis of Confidence. Justice Quarterly, 37(3), 383–409. https://doi.org/10.1080/07418825.2018.1495252
Belle, G. van. (2011). Sample Size. In Statistical Rules of Thumb. John Wiley & Sons.
Berk, R., & MacDonald, J. M. (2008). Overdispersion and Poisson Regression. Journal of Quantitative Criminology, 24(3), 269–284. https://doi.org/10.1007/s10940-008-9048-4
Blair, G., Cooper, J., Coppock, A., & Humphreys, M. (2019). Declaring and Diagnosing Research Designs. American Political Science Review, 113(3), 838–859. https://doi.org/10.1017/S0003055419000194
Blair, G., Coppock, A., & Humphreys, M. (2023). Research Design in the Social Sciences: Declaration, Diagnosis, and Redesign. Princeton University Press.
Braga, A. A., & Weisburd, D. L. (2022). Does Hot Spots Policing Have Meaningful Impacts on Crime? Findings from An Alternative Approach to Estimating Effect Sizes from Place-Based Program Evaluations. Journal of Quantitative Criminology, 38(1), 1–22. https://doi.org/10.1007/s10940-020-09481-7
Champely, S. (2020). pwr: Basic functions for power analysis [Manual]. https://CRAN.R-project.org/package=pwr
Chermak, S. M. (1994). Body count news: How crime is presented in the news media. Justice Quarterly, 11(4), 561–582. https://doi.org/10.1080/07418829400092431
Clark, R. C. (1977). A Note on the Power of Statistical Tests. Journal for Research in Mathematics Education, 8(5), 385–389. https://doi.org/10.2307/748412
Cohen, J. (1988). Statistical Power Analysis for the Behavioural Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Coppock, A. (2013, November 20). 10 Things to Know About Statistical Power – EGAP. EGAP. https://egap.org/resource/10-things-to-know-about-statistical-power/
Decker, S. H. (2023). Failure. Criminology & Public Policy, n/a(n/a). https://doi.org/10.1111/1745-9133.12627
Dilulio Jr, J. D. (1994). Principled agents: The cultural bases of behavior in a federal government bureaucracy. Journal of Public Administration Research and Theory, 4(3), 277–318.
DiMaggio, C. (2024, March 1). Power Tools for Epidemiologists.
Eck, J. E. (2006). When is a bologna sandwich better than sex? A defense of small-n case study evaluations. Journal of Experimental Criminology, 2(3), 345–362. https://doi.org/10.1007/s11292-006-9014-9
Engel, R. S., Corsaro, N., Isaza, G. T., & McManus, H. D. (2022). Assessing the impact of de-escalation training on police behavior: Reducing police use of force in the Louisville, KY Metro Police Department. Criminology & Public Policy, 21(2), 199–233. https://doi.org/10.1111/1745-9133.12574
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/BF03193146
Ferguson, C. J. (2016). An effect size primer: A guide for clinicians and researchers. American Psychological Association. https://doi.org/10.1037/14805-020
Gardner, A. M., & Scott, K. M. (2022). Census of State and Local Law Enforcement Agencies, 2018 (NCJ 302187). Bureau of Justice Statistics.
Garner, J. H., Maxwell, C. D., & Heraux, C. G. (2002). Characteristics associated with the prevalence and severity of force used by the police. Justice Quarterly, 19(4), 705–746.
Gelman, A. (2019). Don’t calculate post-hoc power using observed estimate of effect size. Ann. Surg, 269, e9–e10.
Gerber, A. S., & Green, D. P. (2012). Field Experiments: Design, Analysis, and Interpretation (Illustrated edition). W. W. Norton & Company.
Giner-Sorolla, R., Montoya, A. K., Reifman, A., Carpenter, T., Lewis, N. A., Aberson, C. L., Bostyn, D. H., Conrique, B. G., Ng, B. W., Schoemann, A. M., & Soderberg, C. (2024). Power to Detect What? Considerations for Planning and Evaluating Sample Size. Personality and Social Psychology Review, 10888683241228328. https://doi.org/10.1177/10888683241228328
Gruenewald, J., Klein, B. R., Hayes, B. E., Parkin, W. S., & June, T. (2022). Examining Disparities in Case Dispositions and Sentencing Outcomes for Domestic Violent Extremists in the United States. Crime & Delinquency, 00111287221109769. https://doi.org/10.1177/00111287221109769
Hodkinson, A., & Kontopantelis, E. (2021). Applications of simple and accessible methods for meta-analysis involving rare events: A simulation study. Statistical Methods in Medical Research, 30(7), 1589–1608. https://doi.org/10.1177/09622802211022385
King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163. https://doi.org/10.1093/oxfordjournals.pan.a004868
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences, 2nd ed (pp. xi, 349). American Psychological Association. https://doi.org/10.1037/14136-000
Koons-Witt, B. A., Sevigny, E. L., Burrow, J. D., & Hester, R. (2014). Gender and Sentencing Outcomes in South Carolina: Examining the Interactions With Race, Age, and Offense Type. Criminal Justice Policy Review, 25(3), 299–324. https://doi.org/10.1177/0887403412468884
Kowialiewski, B. (2024). The Power of Effect Size Stabilization. OSF. https://osf.io/xsz9k/download
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863
LeBeau, B. (2019). Power Analysis by Simulation using R and simglm—University of Iowa. Iowa Research Online. https://doi.org/10.17077/f7kk-6w7f
Lehr, R. (1992). Sixteen S-squared over D-squared: A relation for crude sample size estimates. Statistics in Medicine, 11(8), 1099–1102. https://doi.org/10.1002/sim.4780110811
Logan, M. W., Dulisse, B., Peterson, S., Morgan, M. A., Olma, T. M., & Paré, P.-P. (2017). Correctional shorthands: Focal concerns and the decision to administer solitary confinement. Journal of Criminal Justice, 52, 90–100. https://doi.org/10.1016/j.jcrimjus.2017.08.007
Lum, C. (2021). Perspectives on Policing: Cynthia Lum. Annual Review of Criminology, 4, 19–25.
Lum, C., & Koper, C. S. (2017). Evidence-based policing. Oxford Univ. Press.
MacDonald, J. (2023). Criminal justice reform guided by evidence: Social control works—The Academy of Experimental Criminology 2022 Joan McCord Lecture. Journal of Experimental Criminology. https://doi.org/10.1007/s11292-023-09558-w
MacDonald, J. M., & Lattimore, P. K. (2010). Count Models in Criminology. In A. R. Piquero & D. Weisburd (Eds.), Handbook of Quantitative Criminology (pp. 683–698). Springer. https://doi.org/10.1007/978-0-387-77650-7_32
McCord, J. (2003). Cures That Harm: Unanticipated Outcomes of Crime Prevention Programs. The ANNALS of the American Academy of Political and Social Science, 587(1), 16–30. https://doi.org/10.1177/0002716202250781
McLean, K., Stoughton, S. W., & Alpert, G. P. (2022). Police Uses of Force in the USA: A Wealth of Theories and a Lack of Evidence. Cambridge Journal of Evidence-Based Policing, 6(3), 87–108. https://doi.org/10.1007/s41887-022-00078-7
Niemeyer, R. E., Proctor, K. R., Schwartz, J. A., & Niemeyer, R. G. (2022). Are Most Published Criminological Research Findings Wrong? Taking Stock of Criminological Research Using a Bayesian Simulation Approach. International Journal of Offender Therapy and Comparative Criminology, 0306624X221132997. https://doi.org/10.1177/0306624X221132997
Paternoster, R., Brame, R., Mazerolle, P., & Piquero, A. (1998). Using the Correct Statistical Test for the Equality of Regression Coefficients. Criminology, 36(4), 859–866. https://doi.org/10.1111/j.1745-9125.1998.tb01268.x
Piquero, A. R., MacDonald, J., Dobrin, A., Daigle, L. E., & Cullen, F. T. (2005). Self-Control, Violent Offending, and Homicide Victimization: Assessing the General Theory of Crime. Journal of Quantitative Criminology, 21(1), 55–71. https://doi.org/10.1007/s10940-004-1787-2
Pratt, T. C. (2010). Meta‐Analysis in Criminal Justice and Criminology: What It is, When It’s Useful, and What to Watch Out for. Journal of Criminal Justice Education, 21(2), 152–168. https://doi.org/10.1080/10511251003693678
Qiu, W. (2021). powerMediation: Power/Sample size calculation for mediation analysis [Manual]. https://CRAN.R-project.org/package=powerMediation
Ratcliffe, J. H., Taniguchi, T., Groff, E. R., & Wood, J. D. (2011). The Philadelphia Foot Patrol Experiment: A Randomized Controlled Trial of Police Patrol Effectiveness in Violent Crime Hotspots*. Criminology, 49(3), 795–831. https://doi.org/10.1111/j.1745-9125.2011.00240.x
Schafer, J., Hibdon, J., & Kyle, M. (2022). Studying rare events in policing: The allure and limitations of using body-worn camera video. Journal of Crime and Justice, 0(0), 1–16. https://doi.org/10.1080/0735648X.2022.2062036
Schnell, C., Braga, A. A., & Piza, E. L. (2017). The Influence of Community Areas, Neighborhood Clusters, and Street Segments on the Spatial Variability of Violent Crime in Chicago. Journal of Quantitative Criminology, 33(3), 469–496. https://doi.org/10.1007/s10940-016-9313-x
Sherman, L. W. (2007). The power few: Experimental criminology and the reduction of harm. Journal of Experimental Criminology, 3(4), 299–321. https://doi.org/10.1007/s11292-007-9044-y
Smith, M. R., Tillyer, R., & Tregle, B. (2023). Hot spots policing as part of a city-wide violent crime reduction strategy: Initial evidence from Dallas. Journal of Criminal Justice, 102091. https://doi.org/10.1016/j.jcrimjus.2023.102091
Sommet, N., Weissman, D. L., Cheutin, N., & Elliot, A. J. (2023). How Many Participants Do I Need to Test an Interaction? Conducting an Appropriate Power Analysis and Achieving Sufficient Power to Detect an Interaction. Advances in Methods and Practices in Psychological Science, 6(3), 25152459231178728. https://doi.org/10.1177/25152459231178728
Vasilaky, K. N., & Brock, J. M. (2020). Power(ful) guidelines for experimental economists. Journal of the Economic Science Association, 6(2), 189–212. https://doi.org/10.1007/s40881-020-00090-5
Wagh, Y. S., & Kamalja, K. K. (2018). Zero-inflated models and estimation in zero-inflated Poisson distribution. Communications in Statistics - Simulation and Computation, 47(8), 2248–2265. https://doi.org/10.1080/03610918.2017.1341526
Weiss, A. (2024). Statistical Power and Modern Difference-in-Differences Analysis of State Policy Effects: The Case of Hate Crime Law.
Wilson, D. B. (2001). Meta-Analytic Methods for Criminology. The ANNALS of the American Academy of Political and Social Science, 578(1), 71–89. https://doi.org/10.1177/000271620157800105
The visualization for a simulation that expands to the large number of reasonably available observations within a single agency (around 30,000) can be seen below. The expanded range of potential sample sizes means that the algorithm can identify a very stable required sample size (e.g., maintains 80% power for at least five steps) under assumptions of fairly large effect sizes, but the required minimum sample size then becomes very large. Assuming small effect sizes, the simulation remains highly unstable. See stabilization point discussion in the main text for details.
Plain language description
The function `simulate_power` is designed to help researchers understand how likely they are to detect a real effect in a study involving rare count outcomes, given various conditions. It is like a virtual laboratory for testing different scenarios without actually conducting the study. The function considers different effect sizes one might expect, various frequencies of the event being studied, and different sample sizes to see how each combination would play out.
In this "virtual lab," the function runs many simulated mini-studies, or "trials," for each combination of conditions. For each of these trials, it uses statistical modeling to see if the effect—say, a change in use of force, or other rare event—is detected as statistically significant. Then, it calculates how often this happens across all the trials for a given set of conditions. That frequency serves as an estimate of "statistical power," or the likelihood of detecting an effect if it is genuinely there.
So, if you are a researcher planning to study a rare event like police use of force, this function helps you explore questions like, "How big does my sample need to be to reliably detect an effect?" or "If the effect is really small, what are my chances of detecting it?" By running this function, you get a clearer picture of what to expect under different scenarios, making it a useful tool for planning rigorous and ethical research.
Plain language use of the code
In this virtual lab created by the `simulate_power` function, researchers have the freedom to tweak various aspects to fit the specifics of their intended study. Here's how you can tailor the simulation to your needs:
Effect Sizes (effect_sizes): This is where you input the different sizes of the effect you're interested in studying. For example, if you're looking at how a new policy impacts police use of force, you might want to know what happens if the effect is large (0.35), medium (0.2), or small (0.1). You would set the effect_sizes parameter to a vector like c(0.35, 0.2, 0.1). In practice, any number of effect sizes to be tested could be entered here.
Mean Occurrences (mean_occurrences): This is the average rate at which the event happens in your control group. If police use of force is very rare (0.1), rare (0.2), or uncommon (0.3), you can set these values in the mean_occurrences parameter, like c(0.1, 0.2, 0.3). In practice, any number of mean occurrences to be tested could be entered here.
Sample Sizes (possible_n): Here, you indicate the range of sample sizes you're considering for your study. If you're unsure how many subjects you'll be able to recruit, you might input a range from 50 to 5000, increasing in increments of 50, as seq(50, 5000, by = 50).
Number of Trials (sims): This specifies how many virtual mini-studies or "trials" you want to run for each combination of effect size, mean occurrence, and sample size. The more trials you run, the more reliable your estimates will be. You might set this to 1000 for a robust simulation. The primary simulation reported in this study was set to 5000 trials per intersection.
Significance Level (alpha): This is your threshold for deciding whether an effect is statistically significant. A common choice is 0.05, meaning you are willing to accept a 5% chance of falsely declaring an effect significant when it is not. This standard choice was not varied in the simulation. If required, researchers can simply enter their own accepted alpha (e.g., 0.10).
For example, if you're a criminologist planning a study on a new police training program's effect on use of force, you might be interested in small (0.1), medium (0.2), and large (0.35) effects. You'd set effect_sizes to c(0.1, 0.2, 0.35). If you expect the frequency of use of force incidents to be rare, you might set mean_occurrences to c(0.1, 0.2, 0.3). Then you'd run this function to see what sample size you'd need to have a good chance of detecting these effects.
By tweaking these parameters in the R code, you can run a series of "what-if" scenarios to better plan your study, making your research more rigorous and reliable.
Please note that the simulation code assumes a standard two-arm experimental design. Should your design require more arms, the code would need to be modified. The strong assumption here is that all else equal, additional treatment arms will increase the required sample size to achieve a given desired level of statistical power.
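To make these mechanics concrete, here is a minimal, self-contained sketch of a single "intersection" (one effect size, one occurrence rate, one sample size), mirroring the logic of the full `simulate_power` function provided later in this appendix. All parameter values below are purely illustrative.

```r
# Minimal single-intersection sketch (illustrative values only)
set.seed(42)
N <- 2000           # total units, split roughly 50/50 across arms
lambda0 <- 0.2      # control-group mean occurrence (a rare event)
effect_size <- 0.2  # medium effect on the log scale
sims <- 200         # trials; increase for more stable estimates
alpha <- 0.05

lambda1 <- lambda0 * exp(effect_size)  # implied treatment-group rate
Y0 <- rpois(N, lambda0)                # potential outcomes under control
Y1 <- rpois(N, lambda1)                # potential outcomes under treatment

significant <- replicate(sims, {
  Z <- sample(0:1, N, replace = TRUE)  # random assignment
  Y <- ifelse(Z == 0, Y0, Y1)          # reveal outcomes by assignment
  fit <- glm(Y ~ factor(Z), family = poisson())
  summary(fit)$coefficients[2, 4] <= alpha  # treatment p-value significant?
})
power_estimate <- mean(significant)  # share of trials detecting the effect
power_estimate
```

The full function simply repeats this procedure over every combination of effect size, occurrence rate, and sample size supplied by the user.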
Coding
This R code defines a function, simulate_power, designed to perform power simulation for a Poisson regression model. The function takes in various parameters:
effect_sizes: Vector of effect sizes to consider.
mean_occurrences: Vector of mean occurrences in the control group to consider.
possible_n: Vector of possible sample sizes.
sims: Number of simulations to run for each combination of parameters, at every given sample size.
alpha: The significance level used to determine if an effect is statistically significant.
Here's how the function operates:
It initializes an empty list, results, to store the simulation outcomes for different combinations of effect sizes and mean occurrences.
It iterates over each effect size and mean occurrence rate. For each combination:
Initializes an empty numeric vector, powers, to store the power estimates for varying sample sizes.
For each sample size (N), it:
Simulates a control group (Y0) with a Poisson distribution based on the mean occurrence rate (lambda0).
Calculates the treatment effect (tau) and simulates a treatment group (Y1) based on the transformed effect size.
Initializes an empty vector significant.glm to store whether each simulated experiment yields a significant result.
Runs the specified number of trials (sims); each trial:
Randomly assigns treatment (Z.sim).
Reveals outcomes (Y.sim) based on the assignment.
Fits a Poisson regression model (fit.glm).
Extracts p-values and determines significance.
Computes the average power for that sample size and stores it in powers.
Finally, it returns results, a list containing data frames for each combination of effect size and mean occurrence, detailing the sample size and estimated power.
This function is particularly useful for understanding how statistical power in a Poisson regression model varies with sample size, effect size, and mean occurrence rate, which can be crucial in criminological research involving rare events.
Full Code is provided below. Please note that the current parameters in the code result in a computationally expensive calculation, and it is highly recommended researchers use modest parameters when beginning, such as limiting to three choices of effect size and mean occurrence, restricted ranges of potential sample sizes, and low numbers of trials per intersection.
simulate_power <- function(effect_sizes, mean_occurrences, possible_n, sims, alpha) {
  results <- list()
  for (effect_size in effect_sizes) {
    for (mean_occurrence in mean_occurrences) {
      powers <- numeric(length(possible_n))
      for (j in seq_along(possible_n)) {
        N <- possible_n[j]
        # Control group
        lambda0 <- mean_occurrence
        Y0 <- rpois(n = N, lambda = lambda0)
        # Treatment effect and treatment group
        tau <- exp(effect_size) - 1
        lambda1 <- lambda0 * (1 + tau)
        Y1 <- rpois(n = N, lambda = lambda1)
        significant.glm <- rep(NA, sims)
        for (i in 1:sims) {
          Z.sim <- sample(0:1, N, replace = TRUE, prob = c(1/2, 1/2))
          Y.sim <- ifelse(Z.sim == 0, Y0, Y1)
          fit.glm <- glm(Y.sim ~ factor(Z.sim), family = poisson())
          p.glm <- summary(fit.glm)$coefficients[2, 4]
          significant.glm[i] <- (p.glm <= alpha)
        }
        powers[j] <- mean(significant.glm)
      }
      # Store results in list
      results[[paste("Effect:", effect_size, "Mean_Occurrence:", mean_occurrence)]] <- data.frame(
        "Sample_Size" = possible_n,
        "Power" = powers,
        "Effect_Size" = effect_size,
        "Mean_Occurrence" = mean_occurrence
      )
    }
  }
  return(results)
}
# Example usage - full simulation: roughly 15 hours, 4.5 million trials
effect_sizes <- c(0.35, 0.2, 0.1)       # large, medium, small
mean_occurrences <- c(0.1, 0.2, 0.3)    # very rare, rare, uncommon
possible_n <- seq(50, 10000, by = 100)  # replace with your desired range of sample sizes
sims <- 5000                            # number of trials per intersection
alpha <- 0.05                           # significance level
simulation_results <- simulate_power(effect_sizes, mean_occurrences, possible_n, sims, alpha)
# Save out as an RDS
library(here)
saveRDS(simulation_results, here("data", "processed", "sim_results.RDS"))
# Save the entire list as a JSON file
library(jsonlite)
write(toJSON(simulation_results), here("data", "processed", "sim_results.json"))
# Example usage - large agency (NYPD-scale, ~30,000 observations); runtime not benchmarked
effect_sizes <- c(0.35, 0.2, 0.1)       # large, medium, small
mean_occurrences <- c(0.1, 0.2, 0.3)    # very rare, rare, uncommon
possible_n <- seq(50, 30000, by = 100)  # expanded range of sample sizes
sims <- 100                             # number of trials per intersection
alpha <- 0.05                           # significance level
simulation_results <- simulate_power(effect_sizes, mean_occurrences, possible_n, sims, alpha)
# Save as JSON so the stabilization step below can read it
write(toJSON(simulation_results), here("data", "processed", "sim_results_NYPD.json"))
# Required libraries
library(jsonlite)
library(dplyr)

find_stabilization_point <- function(json_file_path, initial_steps) {
  # Step 1: Import the JSON file
  sim_results <- fromJSON(json_file_path, flatten = TRUE)
  # Step 2: Initialize an empty list to store results
  stabilization_results <- list()
  # Step 3: Loop through each unique combination of effect size and mean occurrence
  for (key in names(sim_results)) {
    data <- sim_results[[key]]
    # Convert to data frame and sort by Sample_Size
    df <- as.data.frame(data) %>% arrange(Sample_Size)
    # Initialize counters
    consecutive_count <- 0
    stable_sample_size <- NA
    steps <- initial_steps
    while (is.na(stable_sample_size) && steps > 0) {
      consecutive_count <- 0
      # Loop through each row in the data frame
      for (i in 1:(nrow(df) - steps + 1)) {
        # Check if all of the next 'steps' estimates have power > 0.8
        if (all(df$Power[i:(i + steps - 1)] > 0.8)) {
          consecutive_count <- consecutive_count + 1
          if (consecutive_count >= steps) {
            stable_sample_size <- df$Sample_Size[i]
            break
          }
        } else {
          consecutive_count <- 0
        }
      }
      # Decrement 'steps' if no stabilization point is found
      if (is.na(stable_sample_size)) {
        steps <- steps - 1
      }
    }
    # Store the result in the list
    stabilization_results[[key]] <- list("Stable_Sample_Size" = stable_sample_size, "Steps" = steps)
  }
  return(stabilization_results)
}
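To illustrate the core idea behind the stabilization check, here is a simplified, self-contained version run on a toy power curve (all power values invented for illustration). It finds the smallest sample size at which estimated power exceeds 0.8 and stays above it for at least `steps` consecutive grid points; unlike the full function above, it does not retry with smaller values of `steps` when no stable point exists.

```r
# Toy power curve: power dips back below 0.8 at N = 400 before stabilizing
power_curve <- data.frame(
  Sample_Size = seq(100, 1000, by = 100),
  Power = c(0.42, 0.55, 0.84, 0.79, 0.82, 0.86, 0.88, 0.91, 0.93, 0.95)
)
steps <- 5
stable_sample_size <- NA
for (i in 1:(nrow(power_curve) - steps + 1)) {
  # First window of 'steps' consecutive estimates all above 0.8
  if (all(power_curve$Power[i:(i + steps - 1)] > 0.8)) {
    stable_sample_size <- power_curve$Sample_Size[i]
    break
  }
}
stable_sample_size  # 500: the dip at N = 400 rules out the earlier crossing at 300
```

Note how the single estimate above 0.8 at N = 300 is ignored as simulation noise; only a sustained run of high-power estimates counts as stabilization.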
# Example usage – sim general
json_file_path <- "data/processed/sim_results.json"
initial_steps <- 5
result_all <- find_stabilization_point(json_file_path, initial_steps)
# Example usage – sim large agency
json_file_path <- "data/processed/sim_results_NYPD.json"
initial_steps <- 5
result_NYPD <- find_stabilization_point(json_file_path, initial_steps)
# Function to convert 'result' list to a simplified data frame for quick reference
convert_to_table_simplified <- function(result) {
  # Initialize empty vectors to store the values
  interaction <- c()
  stable_sample_size <- c()
  stable_steps <- c()
  # Loop through each element in the result list
  for (key in names(result)) {
    # Append the interaction label
    interaction <- append(interaction, key)
    # Append the stable sample size and steps
    stable_sample_size <- append(stable_sample_size, result[[key]]$Stable_Sample_Size)
    stable_steps <- append(stable_steps, result[[key]]$Steps)
  }
  # Create a data frame
  df <- data.frame(
    "Interaction" = interaction,
    "Stable_Sample_Size" = stable_sample_size,
    "Stable_Steps" = stable_steps
  )
  return(df)
}

# Run the function on the stabilization results and save quick reference tables
stable_NYPD <- convert_to_table_simplified(result_NYPD)
write.csv(stable_NYPD, "data/processed/stable_30k.csv", row.names = FALSE)
stable_all <- convert_to_table_simplified(result_all)
write.csv(stable_all, "data/processed/stable_10k.csv", row.names = FALSE)
Lehr's (1992) approximation provides a streamlined method for estimating the required sample size in experimental studies, particularly useful when dealing with normally distributed outcomes or approximations thereof. Lehr's formula relies on the difference between two group means, the desired level of statistical power, and the significance criterion to estimate the sample size necessary for detecting a meaningful effect (see Belle, 2011, eq. 2.5, for illustration). The basic formulation for the approximation is:

n = 16 / Δ²

where n is the required sample size per group and Δ is the standardized effect size, that is, the difference between the two group means divided by their common standard deviation.

This gives Lehr's approximation for the sample size required to detect a specified effect size with 80% statistical power at a two-sided significance level of 0.05, the conventional targets to which the constant 16 corresponds.

For Poisson-distributed data, where the variance is equal to the mean, we must adjust to accommodate the distribution's unique properties. By applying a square root transformation, we adapt Lehr's equation to the Poisson context, acknowledging that the square root of Poisson-distributed data tends toward a normal distribution, thus allowing for the use of normal approximation methods in power analysis (DiMaggio, 2024). We can express the adaptation of Lehr's approximation for Poisson-distributed outcomes as:

n = 4 / (√λ₁ − √λ₀)²

where λ₀ is the mean occurrence rate in the control group and λ₁ is the target mean occurrence rate under treatment.
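This adapted approximation, n = 4/(√λ₁ − √λ₀)² per arm with λ₁ = λ₀ · e^(effect size), can be checked numerically against the quick reference table. A small sketch follows; the helper name `lehr_poisson_n` is our own.

```r
# Lehr-type approximation for Poisson outcomes (per-arm sample size),
# assuming 80% power and a two-sided alpha of 0.05:
#   n = 4 / (sqrt(lambda1) - sqrt(lambda0))^2
lehr_poisson_n <- function(lambda0, effect_size) {
  lambda1 <- lambda0 * exp(effect_size)  # target mean under treatment
  4 / (sqrt(lambda1) - sqrt(lambda0))^2
}

round(lehr_poisson_n(0.1, 0.2))  # ~3616 per arm, matching Appendix Table 1
```

Running the helper across a grid of mean occurrences and effect sizes reproduces the quick reference table up to rounding.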
Building upon Lehr's approximation, a quick reference table was developed as an accessible tool for estimating required sample sizes in studies with Poisson-distributed outcomes. Appendix Table 1 streamlines the process of sample size estimation, offering a practical solution for research planning. The table applies Lehr's equation across an expanded range of scenarios, pairing mean occurrences ranging from 0.001 through 1 with small, medium, and large effect sizes to encompass a broad range of potential designs.
A note of caution is warranted. The estimates provided in the quick reference table are just that: estimates. The approximations tend to underestimate the true sample size requirements, and it is highly recommended that researchers run a full simulation, as presented in the current study, before embarking on experimental work. The main paper discusses at length the potential instability in these estimates, especially in the context of rare events. While the table offers a valuable starting point for preliminary planning, researchers should consider utilizing the full simulation code presented earlier in this appendix for a more nuanced analysis. Furthermore, the "power lift" metric introduced within the simulation offers an advanced method for achieving best-in-class power estimates for studies concerning rare events.
Mean Occurrence | Effect Size | Target Mean Occurrence | Required Sample Size (Per arm) |
---|---|---|---|
0.001 | 0.1 | 0.0011 | 1521650 |
0.001 | 0.2 | 0.0012 | 361634 |
0.001 | 0.35 | 0.0014 | 109364 |
0.01 | 0.1 | 0.0111 | 152165 |
0.01 | 0.2 | 0.0122 | 36163 |
0.01 | 0.35 | 0.0142 | 10936 |
0.02 | 0.1 | 0.0221 | 76083 |
0.02 | 0.2 | 0.0244 | 18082 |
0.02 | 0.35 | 0.0284 | 5468 |
0.03 | 0.1 | 0.0332 | 50722 |
0.03 | 0.2 | 0.0366 | 12054 |
0.03 | 0.35 | 0.0426 | 3645 |
0.04 | 0.1 | 0.0442 | 38041 |
0.04 | 0.2 | 0.0489 | 9041 |
0.04 | 0.35 | 0.0568 | 2734 |
0.05 | 0.1 | 0.0553 | 30433 |
0.05 | 0.2 | 0.0611 | 7233 |
0.05 | 0.35 | 0.0710 | 2187 |
0.06 | 0.1 | 0.0663 | 25361 |
0.06 | 0.2 | 0.0733 | 6027 |
0.06 | 0.35 | 0.0851 | 1823 |
0.07 | 0.1 | 0.0774 | 21738 |
0.07 | 0.2 | 0.0855 | 5166 |
0.07 | 0.35 | 0.0993 | 1562 |
0.08 | 0.1 | 0.0884 | 19021 |
0.08 | 0.2 | 0.0977 | 4520 |
0.08 | 0.35 | 0.1135 | 1367 |
0.09 | 0.1 | 0.0995 | 16907 |
0.09 | 0.2 | 0.1099 | 4018 |
0.09 | 0.35 | 0.1277 | 1215 |
0.1 | 0.1 | 0.1105 | 15217 |
0.1 | 0.2 | 0.1221 | 3616 |
0.1 | 0.35 | 0.1419 | 1094 |
0.2 | 0.1 | 0.2210 | 7608 |
0.2 | 0.2 | 0.2443 | 1808 |
0.2 | 0.35 | 0.2838 | 547 |
0.3 | 0.1 | 0.3316 | 5072 |
0.3 | 0.2 | 0.3664 | 1205 |
0.3 | 0.35 | 0.4257 | 365 |
0.4 | 0.1 | 0.4421 | 3804 |
0.4 | 0.2 | 0.4886 | 904 |
0.4 | 0.35 | 0.5676 | 273 |
0.5 | 0.1 | 0.5526 | 3043 |
0.5 | 0.2 | 0.6107 | 723 |
0.5 | 0.35 | 0.7095 | 219 |
0.6 | 0.1 | 0.6631 | 2536 |
0.6 | 0.2 | 0.7328 | 603 |
0.6 | 0.35 | 0.8514 | 182 |
0.7 | 0.1 | 0.7736 | 2174 |
0.7 | 0.2 | 0.8550 | 517 |
0.7 | 0.35 | 0.9933 | 156 |
0.8 | 0.1 | 0.8841 | 1902 |
0.8 | 0.2 | 0.9771 | 452 |
0.8 | 0.35 | 1.1353 | 137 |
0.9 | 0.1 | 0.9947 | 1691 |
0.9 | 0.2 | 1.0993 | 402 |
0.9 | 0.35 | 1.2772 | 122 |
1 | 0.1 | 1.1052 | 1522 |
1 | 0.2 | 1.2214 | 362 |
1 | 0.35 | 1.4191 | 109 |
Appendix Table 1: Quick Reference Guide for Rare Event Sample Size Estimation