
Improving Police Behavior through Artificial Intelligence: Pre-Registered Experimental Results in Two Large US Agencies

Published onOct 23, 2024

Abstract

Police body-worn cameras (BWCs) create huge reservoirs of data showing police behavior, but only a small percentage of this information is ever reviewed due to the demands of human-led review and auditing of footage. One potential solution to the time demand is the use of artificial intelligence (AI) tools to automate this task, with the promise that this information can effectively shape police behavior. In two pre-registered, randomized controlled trials in two large US police agencies, we find mixed support for changes in officer behavior when using AI-led auditing. Specifically, officers in an agency under consent decree significantly reduced levels of substandard professionalism, while officers in an agency not facing those pressures significantly increased the number of highly professional public interactions. Results suggest that AI-led auditing of police BWC footage can shape officer conduct.

Introduction

Since 2015, the use of body-worn cameras (BWCs) in U.S. law enforcement has expanded rapidly, with around 60% of agencies now adopting this technology. Initially celebrated as a reformative technology, BWCs were expected to reduce police misconduct and excessive force, while also enhancing investigative quality and clearance rates. However, empirical evaluations of BWC programs present mixed outcomes, leading to debates on their long-term value. Some scholars argue that BWC efficacy may be overstated (Christodoulou et al., 2019; Lum et al., 2019), whereas others maintain that BWCs still hold promise for improving police-community interactions (White & Malm, 2020).

A key driver of these varied findings is the unreviewed volume of footage generated by BWCs. With close to one-fifth of Americans reporting contact with the police every year (Tapp & Davis, 2024), the sheer scale of footage capturing those contacts renders comprehensive manual review impossible (Adams, 2022; White & Malm, 2020). As a result, only events such as uses of force or external complaints trigger a human review of BWC evidence, and these events represent a very small proportion of overall police contacts (Garner et al., 2018). This creates a significant challenge: much of the potentially deterrable officer conduct remains buried in vast, unexamined video data (Camp et al., 2024). Consequently, automated or selective review mechanisms may be essential for realizing the full potential of BWCs in influencing officer behavior and enhancing accountability.

Without systematic video audits, police departments face substantial barriers to identifying and addressing incidents or policy violations promptly. For instance, research has demonstrated that even a small public defender's office with seven attorneys would need between eight and sixteen additional staff members to review BWC footage for charged cases (Gaub et al., 2019, 2021). Given that charged cases constitute only a fraction of police interactions, it is unrealistic to expect agencies to audit all BWC footage by human review. This dependence on manual review is not only cost-prohibitive, necessitating additional staff or overtime, but also introduces potential for human error and bias, which can compromise the consistency and reliability of the audit process. These limitations underscore the need for scalable, automated solutions to handle the growing volume of BWC footage efficiently and objectively.

Until recently, human review was the only method available for auditing BWC footage. However, advancements in artificial intelligence (AI) now offer a pathway to automate this process. Leading this development, companies like Truleo and Polis Solutions have introduced software designed to streamline BWC reviews. For example, Truleo employs a single-modal approach, “voice printing” officers to transcribe spoken words, categorizing language into “high,” “standard,” or “below standard” professionalism (Shastry, 2022; Sisson, 2024). These classifications enable departments to identify specific incidents of concern (e.g., instances of unprofessional language) or generate aggregated performance scores for officers.

Comparatively, Polis Solutions' TrustStat software takes a more comprehensive, multimodal approach, analyzing both verbal and non-verbal cues, such as facial expressions to assess officer interactions (Sisson, 2024). Polis collaborates with agencies to tailor these analyses, ensuring alignment with agency-specific performance objectives. A third automated auditing approach is reportedly in development by a research team at the University of Southern California, though specifics about the model have not been released (B. A. T. Graham et al., 2024).

As automated solutions for police accountability gain traction, it is important to recognize the absence of rigorous, peer-reviewed research substantiating the effectiveness of automated BWC review. Currently, neither Polis Solutions nor the University of Southern California team has published peer-reviewed findings to support claims of efficacy. Truleo has reported the results of an internal pilot study using a pre-post, non-randomized design. This study claims a 36% reduction in use-of-force incidents, a 30% increase in professionalism scores, and a 12% drop in subject non-compliance rates following implementation of the software (Shastry, 2022). However, significant methodological limitations hinder the reliability of these results. The study’s non-randomized design and the timing of implementation—overlapping with departmental shifts such as a new police chief and policy reforms—introduce confounding variables. Consequently, it remains unclear whether observed changes stem from the automated review system or from broader organizational reforms, underscoring the need for independent, controlled studies to credibly evaluate its effectiveness.

Pre-Registered, Multi-Site Randomized Controlled Trials

To address the gap in objective, experimental data on automated BWC review, we conducted two pre-registered, six-month randomized controlled trials within large U.S. law enforcement agencies using the Truleo software. Both trials implemented a three-arm design: in the first experimental arm (self-assessment) officers directly accessed the automated system’s feedback, while in the second arm (supervisor-mediated) officers received feedback solely through meetings with their supervisors. Officers in the control group had no access to feedback from the system, though their footage was still audited to establish baseline counterfactuals. This design aims to assess rigorously how direct versus mediated feedback from AI auditing influences officer behavior.

The design of our treatment arms—self-assessment and supervisor-mediated feedback—reflects contrasting dynamics in U.S. policing culture regarding BWC technology. While officers broadly support BWCs, this approval often hinges on the technology's capacity to substantiate officer actions, particularly in incidents involving force or public complaints (Gaub et al., 2016). However, there remains significant apprehension among officers regarding the potential for BWC footage to serve as a “fishing expedition,” uncovering minor policy breaches that would otherwise go unnoticed (Adams & Mastracci, 2019b; Watts et al., 2024). By offering direct, unsupervised feedback, which might include policy violations, we aim to make the feedback process more acceptable to officers by circumventing the traditional supervisory route.

Conversely, research suggests that supervisors may function as intermediaries, softening the impact of algorithmically generated feedback. Officers tend to respond more favorably to AI evaluations delivered through supervisors, aligning with evidence that people generally trust human-mediated feedback over purely automated assessments (Adams, 2022; Hobson et al., 2021; Schiff et al., 2023). Recent survey data from Watts et al. (Watts et al., 2024) underscores this divide: while police executives are generally optimistic about AI-led review of BWCs, frontline officers exhibit a marked reluctance, fueled by concerns over workplace surveillance. A certain level of apprehension in the face of new workplace surveillance is expected across sectoral and professional contexts (Fusi & Feeney, 2018; Ravid et al., 2020). Given that policing is already a highly monitored field (Adams & Mastracci, 2019a), AI review could be perceived as additional scrutiny, potentially exposing minor infractions or heightening disciplinary risks. This concern is particularly salient in agencies where trust between officers and leadership is already fragile, highlighting the importance of carefully considering feedback delivery mechanisms in technology-driven accountability efforts.

By incorporating a self-assessment option, we aim to empower officers with direct access to AI-generated feedback, which may help mitigate resistance to automated oversight (Alge, 2001). Simultaneously, the supervisor-mediated approach acknowledges the importance of human interaction in contexts where trust and sensitivity are paramount, blending AI insights with the nuanced, relational dynamics of officer-supervisor relationships. This two-pronged strategy enables us to compare the effectiveness of direct versus mediated feedback in fostering professionalism, while also navigating the layered cultural responses to AI within law enforcement agencies.

Agency Contexts

Our experiment took place in two dissimilar agencies, integrating insights from implementation science to explore how context shapes the effectiveness of AI-driven professionalism interventions (del Pozo et al., 2024). The contextual contrasts between them—including differing oversight structures, media exposure, leadership continuity, and union presence—provide a robust framework to assess the adaptability and outcomes of these interventions across varied policing environments.

The Aurora Police Department (APD) in Colorado and the Richland County Sheriff’s Department (RCSD) in South Carolina, both large agencies with over one hundred sworn officers, differ significantly in organizational structure and leadership. APD, a Western municipal agency, operates under the direction of an appointed Police Chief and is subject to a state-imposed consent decree following the high-profile death of Elijah McClain. This decree enforces reforms aimed at addressing systemic issues such as excessive force and racial bias within a unionized environment, which adds complexity to reform efforts (Adams et al., 2024; Falcone & Wells, 1995; Farris & Holman, 2017; Manley, 2024).

In contrast, RCSD, led by elected Sheriff Leon Lott since 1996, enjoys greater autonomy with minimal external oversight. This continuity in leadership is often linked to improved organizational stability and consistency in operations (S. R. Graham & Makowsky, 2024; Pearson-Goff & Herrington, 2014). Additionally, RCSD has cultivated a favorable public image through media programs like Live PD and Missing: Dead or Alive, avoiding the intensive scrutiny faced by APD. The differing labor contexts also distinguish these agencies: APD operates within a unionized setting, while RCSD does not. Prior research has shown that unionized environments may pose unique challenges to implementing new programs compared to their non-union counterparts, due to labor dynamics and negotiated protections (Juris & Feuille, 1973; Nicholson-Crotty et al., 2022; Rad et al., 2023). These diverse contextual factors make APD and RCSD ideal settings to examine how AI-driven feedback tools interact with various organizational environments and reform imperatives in contemporary law enforcement.

A Multisite Randomized Controlled Trial

The purpose of this study is to examine the ability of AI-led auditing of BWC footage to shape officer behavior. To do so, we use the pre-existing measure of professionalism within the Truleo system, making no claim as to its intrinsic value; rather, we rely on the simple premise that if AI-led review of BWC footage can alter officer behavior, it will do so first through the measures the system itself generates—i.e., officers will receive the scores produced by the software and adjust their behavior to move those scores in the desired direction. To evaluate this potential effect, we articulate three pre-registered hypotheses:1

H1: Compared to the control group, officers in either treatment group will have improved professionalism scores.

H2: Self-audited officers will have improved professionalism scores compared to the control group, while the supervisor-mediated treatment group will have improved professionalism scores compared to both control and self-audited officers.

H3: Pooling across treatment, the treated officers will have improved professionalism scores compared to untreated (control) officers.

Measures

Our dependent variable is the pre-existing measure of professionalism that serves as the foundation of the Truleo system. Police professionalism is operationalized through a tripartite coding scheme created by Truleo. Unlike broader academic constructs that frame police professionalism as fostering public trust (Sunshine & Tyler, 2003), this measure categorizes interactions into “high,” “standard,” or “substandard” professionalism based on specific linguistic markers from body-worn camera (BWC) audio transcripts. An interaction is coded as high professionalism when: (1) The officer abstains from using language below professional standards (e.g., profanity, insults), (2) Avoids or refrains from threatening or using force, and (3) Provides more than twenty-five words of explanatory context before executing an official action, such as an arrest, frisk, or search. Conversely, substandard professionalism denotes interactions featuring directed profanity, derogatory language, or racial slurs. Standard professionalism includes interactions that do not meet the criteria for either high or substandard professionalism, representing a baseline level of appropriate conduct.
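The tripartite coding scheme can be sketched as a simple decision rule. The sketch below is purely illustrative: the keyword lists, phrases, and function are hypothetical stand-ins, not Truleo's proprietary classifier, which operates on full BWC audio transcripts.

```python
import re

# Hypothetical placeholder lexicons for illustration only; the actual
# classifier's lexicons and logic are proprietary and not public.
PROFANITY = {"damn", "hell"}
DEROGATORY = {"idiot", "stupid"}
FORCE_PHRASES = {"i will tase you", "get on the ground or else"}

def classify_interaction(transcript: str, explanation_words_before_action: int) -> str:
    """Toy tripartite coding: 'substandard' for directed profanity or
    derogatory language; 'high' when the officer avoids unprofessional
    language, avoids threats of force, and gives more than 25 words of
    explanation before an official action; otherwise 'standard'."""
    text = transcript.lower()
    words = set(re.findall(r"[a-z']+", text))

    # Substandard: directed profanity, derogatory language, or slurs
    if words & PROFANITY or words & DEROGATORY:
        return "substandard"

    # High: no threats/force language and >25 words of explanatory context
    no_threats = not any(phrase in text for phrase in FORCE_PHRASES)
    if no_threats and explanation_words_before_action > 25:
        return "high"

    # Standard: meets neither the high nor the substandard criteria
    return "standard"
```

A rule-based sketch like this makes explicit why the measure is coarse: it keys on specific linguistic markers rather than tone, context, or situational appropriateness.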

The validity of the professionalism measure used by the system is open to debate, but this does not detract from the core focus of the study. Our objective is to evaluate whether AI-led audits of BWC footage influence officer behavior. The software's impact will logically manifest on the metric it generates. If the system fails to shift behavior on its own measure, it is unlikely to influence more comprehensive behavioral outcomes. Should we observe an effect, a deeper analysis of the measure's utility may be warranted, and future iterations could refine the metric to better align with desirable policing behaviors.

Notably, the dependent variable is entirely algorithmically derived, with classifications based solely on linguistic content from body-worn camera (BWC) footage, absent any human validation or secondary verification. Though a coarser metric than measures derived from nuanced theoretical models of police-citizen interactions (Terpstra & van Wijck, 2023), this algorithmic assessment serves as a foundational step for evaluating the potential impact of automated BWC review. As noted earlier, for the AI review to hold value, it must demonstrably influence the very indicators it generates. Additionally, using algorithmically derived measures avoids the potential influence of human bias in assessing the effectiveness of the system. Previous research supports this decision, as algorithmic measures of police verbalizations are thought to be a stronger measure compared to upstream attitudes and beliefs (Camp et al., 2024).

Our approach also provides a scalable and efficient method, processing vast amounts of interaction data that would be unmanageable with manual review, such as systematic social observation (Piza et al., 2023; Piza & Sytsma, 2022). This algorithmic method delivers consistency and scalability, essential for widespread assessment of officer behavior across large datasets of police interactions. Future studies could expand upon this by evaluating how well the measure aligns with real-world behavioral goals for policing.

Sample and Randomization

APD (n = 219) and RCSD (n = 165) front-line officers were block-randomized by gender into one of three experimental conditions: control, self-assessment, or supervisor-mediated. The gender-blocked randomization was successful across the measured characteristics of gender, race, tenure, title, and unit assignment. Table 1 reports sample characteristics and balance for APD, and Table 2 for RCSD.
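Gender-blocked randomization can be sketched as follows. This is a generic illustration, not the study's actual assignment code; the officer identifiers are hypothetical, though the block sizes mirror the APD sample (24 female and 195 male officers, yielding 8 female and 65 male officers per arm).

```python
import random

def block_randomize(officers, arms, seed=2023):
    """Gender-blocked randomization: shuffle officers within each gender
    block, then deal them to arms in round-robin order so every arm
    receives a near-equal share of each block."""
    rng = random.Random(seed)
    blocks = {}
    for officer_id, gender in officers:
        blocks.setdefault(gender, []).append(officer_id)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                      # random order within block
        for i, officer_id in enumerate(members):
            assignment[officer_id] = arms[i % len(arms)]
    return assignment

# Hypothetical roster with the APD gender split (219 officers total)
roster = [(f"F{i}", "F") for i in range(24)] + [(f"M{i}", "M") for i in range(195)]
arms = ("control", "self-assessment", "supervisor-mediated")
assignment = block_randomize(roster, arms)
```

Because both block sizes here divide evenly by three, each arm receives exactly 8 female and 65 male officers, matching the balance shown in Table 1.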

Table 1. Sample Characteristics and Balance (APD)

| Characteristic | N | Control, N = 73 | Self-assessment, N = 73 | Supervisor-mediated, N = 73 | p |
| --- | --- | --- | --- | --- | --- |
| Gender | 219 | | | | >0.9 |
| Female | | 8 / 73 (11%) | 8 / 73 (11%) | 8 / 73 (11%) | |
| Male | | 65 / 73 (89%) | 65 / 73 (89%) | 65 / 73 (89%) | |
| Race/Ethnicity | 219 | | | | 0.2 |
| American Indian | | 1 / 73 (1.4%) | 0 / 73 (0%) | 1 / 73 (1.4%) | |
| Asian | | 2 / 73 (2.7%) | 3 / 73 (4.1%) | 1 / 73 (1.4%) | |
| Black | | 0 / 73 (0%) | 6 / 73 (8.2%) | 4 / 73 (5.5%) | |
| Hispanic | | 18 / 73 (25%) | 7 / 73 (9.6%) | 12 / 73 (16%) | |
| Two or More Races | | 5 / 73 (6.8%) | 5 / 73 (6.8%) | 5 / 73 (6.8%) | |
| White | | 47 / 73 (64%) | 52 / 73 (71%) | 50 / 73 (68%) | |
| Tenure | 219 | 5.12 (6.66) | 6.88 (8.15) | 4.78 (6.36) | 0.6 |

Note: Cells report Mean (SD) or n / N (%); p using Pearson's Chi-squared test.

Table 2. Sample Characteristics and Experimental Balance (RCSD)

| Characteristic | N | Control, N = 56 | Self-assessment, N = 54 | Supervisor-mediated, N = 55 | p |
| --- | --- | --- | --- | --- | --- |
| Gender | 165 | | | | >0.9 |
| Female | | 10 / 56 (18%) | 9 / 54 (17%) | 10 / 55 (18%) | |
| Male | | 46 / 56 (82%) | 45 / 54 (83%) | 45 / 55 (82%) | |
| Race/Ethnicity | 165 | | | | 0.10 |
| Asian | | 0 / 56 (0%) | 3 / 54 (5.6%) | 0 / 55 (0%) | |
| Black | | 14 / 56 (25%) | 13 / 54 (24%) | 24 / 55 (44%) | |
| Hispanic | | 1 / 56 (1.8%) | 2 / 54 (3.7%) | 2 / 55 (3.6%) | |
| Two or More | | 1 / 56 (1.8%) | 1 / 54 (1.9%) | 0 / 55 (0%) | |
| Unknown | | 0 / 56 (0%) | 0 / 54 (0%) | 1 / 55 (1.8%) | |
| White | | 40 / 56 (71%) | 35 / 54 (65%) | 28 / 55 (51%) | |
| Tenure | 155 | 4.45 (4.00) | 4.03 (3.33) | 4.10 (5.07) | 0.4 |

Note: Cells report Mean (SD) or n / N (%); p using Pearson's Chi-squared test.

Analysis

Our analysis is conducted at the video level. In APD, a total of 124,443 body-worn camera videos were reviewed between July 1 and December 31, 2023. In RCSD, a total of 65,172 such videos were audited between January 1 and June 30, 2024.

We employ a mixed-effects linear regression model, using the lme4 package in R (Bates et al., 2015). The model accounts for the nested structure of the data, where each officer produces multiple videos. The random intercept for each officer allows the model to capture the within-officer correlation in the likelihood of producing a video that is adjudicated as either high or below-standard professionalism. The model is specified as:

professionalism_ij = β0 + β1 · Self-assessment_ij + β2 · Supervisor-mediated_ij + u_i + ε_ij  (1)

Where:

  • professionalism_ij is the binary outcome indicating whether video j from officer i was rated as highly professional (Model 1) or below-standard professional (Model 2).

  • β0 is the intercept, representing the expected probability of a highly professional (Model 1) or below-standard professional (Model 2) video for an officer in the control group.

  • β1 is the coefficient for the self-assessment treatment, indicating the change in that probability when the officer is in the self-assessment group rather than the control group.

  • β2 is the coefficient for the supervisor-mediated treatment, indicating the change in that probability when the officer is in the supervisor-mediated group rather than the control group.

  • u_i is the random intercept for officer i, capturing individual variability in the probability of producing a highly professional (Model 1) or below-standard professional (Model 2) video.

  • ε_ij is the residual error term, representing the unexplained variability in the outcome for video j from officer i.
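In R, Equation (1) corresponds to an lme4 call of the form lmer(professionalism ~ self_assessment + supervisor_mediated + (1 | officer), data = d). As a self-contained sketch of the data-generating process (in Python, with illustrative parameter values in the spirit of the APD substandard-professionalism results), the treatment coefficients can be recovered as differences in group means, which coincide with the fixed-effect estimates under balanced random assignment:

```python
import random

def simulate_and_estimate(n_officers=300, videos_per_officer=100, seed=1):
    """Simulate Equation (1) -- a linear probability model with a random
    intercept per officer -- and estimate the treatment effects as
    differences in group means.  The beta values are illustrative,
    chosen in the spirit of the Table 3 estimates."""
    rng = random.Random(seed)
    b0, b1, b2 = 0.021, -0.010, -0.014
    arms = ("control", "self", "supervisor")
    hits = {a: 0 for a in arms}
    n = {a: 0 for a in arms}
    for i in range(n_officers):
        arm = arms[i % 3]                   # balanced assignment to arms
        u_i = rng.gauss(0, 0.003)           # officer-level random intercept
        p = b0 + (b1 if arm == "self" else 0.0) \
               + (b2 if arm == "supervisor" else 0.0) + u_i
        p = min(max(p, 0.0), 1.0)           # keep probability in [0, 1]
        for _ in range(videos_per_officer):
            hits[arm] += rng.random() < p   # Bernoulli draw per video
            n[arm] += 1
    mean = {a: hits[a] / n[a] for a in arms}
    beta1_hat = mean["self"] - mean["control"]
    beta2_hat = mean["supervisor"] - mean["control"]
    return beta1_hat, beta2_hat
```

The mixed-effects model additionally partitions the residual variance into the officer-level and video-level components reported as SD (Officer) and SD (Observations) in Tables 3 and 4.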

We report on three models, drawn from our pre-registered hypotheses, and report them by agency. Model 1 reports the effects of automated review of body-worn camera footage on high professionalism, Model 2 reports on substandard professionalism, and Model 3 reports on the pooled treatment effect for either substandard professionalism (APD) or high professionalism (RCSD).

Aurora Police Department Results, A Drop in Substandard Professionalism

Analyses of over 124,000 body-worn camera videos from the Aurora Police Department show that the automated review process significantly reduced incidents of substandard professionalism among officers. The mixed-effects linear regression model for substandard professionalism (Model 2) reveals that officers in the self-assessment treatment group exhibited a significant decrease in the likelihood of producing substandard professional videos compared to the control group (β = -0.010, SE = 0.002, p < .001). Similarly, officers in the supervisor-mediated treatment group also showed a significant reduction in substandard professionalism (β = -0.014, SE = 0.002, p < .001). When pooling the treatment effects (Model 3), the combined impact of both treatment groups on substandard professionalism remained significant (β = -0.012, SE = 0.002, p < .001).

However, the effects of the automated review of body-worn camera footage on high professionalism were not statistically significant. In Model 1, neither the self-assessment (β = 0.001, SE = 0.001, p > .05) nor the supervisor-mediated treatment (β = 0.000, SE = 0.001, p > .05) resulted in a significant increase in the likelihood of producing highly professional videos compared to the control group. The marginal R² values indicate that the models explain a small proportion of the variance in professionalism outcomes, with marginal R² ranging from 0.000 to 0.003 and conditional R² ranging from 0.009 to 0.012.

Table 3. Experimental Results, APD

| | (1) High Professionalism | (2) Substandard Professionalism | (3) Substandard (Pooled Treatment) |
| --- | --- | --- | --- |
| (Intercept) | 0.005*** (0.001) | 0.021*** (0.001) | 0.021*** (0.001) |
| Self-assessment | 0.001 (0.001) | -0.010*** (0.002) | |
| Supervisor-mediated | 0.000 (0.001) | -0.014*** (0.002) | |
| Pooled Treatment | | | -0.012*** (0.002) |
| SD (Officer) | 0.007 | 0.011 | 0.011 |
| SD (Observations) | 0.072 | 0.114 | 0.114 |
| Num. Obs. | 124,443 | 124,443 | 124,443 |
| R² Marg. | 0.000 | 0.003 | 0.002 |
| R² Cond. | 0.009 | 0.012 | 0.012 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Richland County Sheriff’s Results, an Increase in High Professionalism

In the analysis of the Richland County Sheriff’s Department (RCSD), the implementation of the automated review of body-worn camera footage demonstrates a significant increase in high professionalism among officers. The mixed-effects linear regression model for high professionalism (Model 1) indicates that officers in the self-assessment treatment group showed a statistically significant increase in the likelihood of producing highly professional videos compared to the control group (β = 0.018, SE = 0.009, p < .05). Although the supervisor-mediated treatment group also showed an increase in high professionalism, this effect was not statistically significant (β = 0.009, SE = 0.009, p > .05).

For substandard professionalism (Model 2), neither the self-assessment (β = -0.002, SE = 0.002, p > .05) nor the supervisor-mediated treatment (β = -0.002, SE = 0.002, p > .05) produced a significant reduction in the likelihood of substandard professional conduct. This suggests that, unlike in the Aurora Police Department, the automated review of body-worn camera footage did not have a significant impact on reducing substandard professionalism within RCSD.

When pooling the treatment effects across both groups (Model 3), there was a marginally significant increase in high professionalism (β = 0.013, SE = 0.007, p < .10), suggesting a positive but modest overall impact of the automated review of body-worn camera footage on promoting high professionalism within RCSD. This indicates that the software may contribute to better professional conduct, particularly through self-assessment.

Table 4. Experimental Results, RCSD

| | (1) High Professionalism | (2) Substandard Professionalism | (3) High (Pooled Treatment) |
| --- | --- | --- | --- |
| (Intercept) | 0.022*** (0.006) | 0.017*** (0.002) | 0.022*** (0.006) |
| Self-assessment | 0.018* (0.009) | -0.002 (0.002) | |
| Supervisor-mediated | 0.009 (0.009) | -0.002 (0.002) | |
| Pooled Treatment | | | 0.013+ (0.007) |
| SD (Officer) | 0.043 | 0.010 | 0.043 |
| SD (Observations) | 0.171 | 0.118 | 0.171 |
| Num. Obs. | 65,172 | 65,172 | 65,172 |
| R² Marg. | 0.002 | 0.000 | 0.001 |
| R² Cond. | 0.060 | 0.007 | 0.060 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Substantive Impact

To estimate the impact of each intervention on professionalism, we calculate the change in the likelihood of high or substandard professionalism associated with that intervention, apply this change to the total number of encounters observed, and compare it to the baseline rate in the control group. In effect, we ask what outcomes would have been expected had every officer been assigned to a given treatment group over the study period.
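This calculation can be made explicit with a short sketch, using the APD coefficients, video count, and control-group baseline reported in Table 3:

```python
def substantive_impact(beta, total_encounters, baseline_rate):
    """Convert a per-encounter change in probability (the regression
    coefficient) into a count of encounters changed and a percent
    change relative to the control group's baseline rate."""
    encounters_changed = round(beta * total_encounters)
    pct_change = round(beta / baseline_rate * 100, 1)
    return encounters_changed, pct_change

# APD, self-assessment arm: beta = -0.010, 124,443 videos, baseline 0.021
print(substantive_impact(-0.010, 124443, 0.021))   # -> (-1244, -47.6)
# APD, supervisor-mediated arm: beta = -0.014
print(substantive_impact(-0.014, 124443, 0.021))   # -> (-1742, -66.7)
```

The same arithmetic applied to the RCSD high-professionalism estimates (β = 0.018, 65,172 videos, baseline 0.022) yields roughly 1,173 additional highly professional encounters, an 81.8% increase.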

For APD, the self-assessment group was 1.0 percentage point less likely to exhibit substandard professionalism, translating to an estimated reduction of approximately 1,244 substandard encounters, representing a 47.6% decrease from the baseline rate. The supervisor-mediated group saw a 1.4 percentage point decrease in substandard professionalism, leading to approximately 1,742 fewer substandard encounters, a 66.7% reduction. Finally, the pooled treatment group, combining self-assessment and supervisor-mediated interventions, resulted in a 1.2 percentage point decrease in substandard professionalism, yielding around 1,493 fewer substandard encounters, representing a 57.1% decrease from the control group's baseline.

For RCSD, the self-assessment group was 1.8 percentage points more likely to exhibit high professionalism, translating to an estimated increase of approximately 1,173 highly professional encounters, representing an 81.8% increase from the baseline rate. The supervisor-mediated group saw a 0.9 percentage point increase in high professionalism, leading to approximately 586 additional highly professional encounters, a 40.9% increase. Finally, the pooled treatment group, combining self-assessment and supervisor-mediated interventions, resulted in a 1.3 percentage point increase in high professionalism, yielding around 847 additional highly professional encounters, representing a 59.1% increase from the control group's baseline, though this effect was not significant at the traditional 0.05 alpha level.

Discussion

Our findings demonstrate the potential of AI-driven auditing tools to shape officer behavior across varied law enforcement settings. Testing in two contrasting agencies—the Aurora Police Department (APD) and Richland County Sheriff’s Department (RCSD)—yielded partial support for our hypotheses, showing that while AI auditing influenced officer behavior, its effects varied by agency context and the type of behavior targeted.

In APD, automated review of body-worn camera (BWC) footage led to a significant reduction in substandard professionalism, especially within the supervisor-mediated condition, which showed the most substantial decrease in substandard interactions. This suggests that AI feedback, when relayed through a human supervisor, may be particularly effective at reducing undesirable behaviors. However, the tool did not significantly increase high professionalism, indicating that while AI feedback can mitigate negative behaviors, it may be less effective in fostering positive conduct without additional supportive interventions.

In RCSD, the self-assessment group displayed a notable increase in high professionalism, suggesting that direct access to AI-generated feedback may encourage officers to adjust their behaviors proactively to meet algorithm-defined standards. RCSD officers appeared to align their conduct strategically with the system’s criteria for “highly professional” interactions.

Anecdotally, during debriefs following the study, officers in both agencies indicated concern that officers would “game” the system by intentionally doing things known to increase the likelihood of a “highly professional” score—for example, lengthening explanations, enhancing clarity, or consciously avoiding behaviors that could lead to lower professionalism scores. In effect, this means officers might shift their behavior to capture better algorithmically generated scores. Such "gaming" behaviors are common in algorithmically monitored environments, where continuous performance tracking creates a competitive or strategic mindset as individuals aim to maximize recognition or avoid scrutiny based on algorithmic metrics.

Similar behaviors have been documented in various settings, from gig economy workers adapting to rating systems to corporate environments where algorithmic management fosters goal-directed actions that align with performance indicators (Benlian et al., 2022). In policing contexts, officers are known to seek favorable supervisory assessments (Workman-Stark, 2021), such as in Compstat environments, which prioritize statistical and performance-based evaluations (Willis et al., 2004). This preference for high scores and recognition highlights the potential for AI-generated feedback to drive similar strategic adaptations, fostering a focus on measurable outcomes. In this manner, AI-led review of BWC footage holds considerable promise as it provides a new metric for police performance that is not linked to traditional metrics—e.g., numbers of arrests, citations, seizures, stops, etc.—that have been heavily criticized in the 21st century.

Such gamification, however, requires a deeper consideration of the behaviors that the algorithm encourages. While we were not initially concerned with the measure utilized in this study as a first test of AI-led review’s ability to shape officer behavior, its importance and quality should be considered. For example, compared to the complex theoretical constructs like procedural justice, which recent experimental findings suggest have mixed effectiveness in real-world public interactions (Terpstra & van Wijck, 2023; Trinkner, 2023), our results suggest straightforward guidelines could foster tangible improvements in behavior. The simplified measure used here provides actionable rules that can be easier for officers to internalize and apply consistently in the field, potentially leading to meaningful gains in professionalism without the need for nuanced (Healy, 2017), academically driven frameworks that may lack empirical grounding in some settings. Nevertheless, future research should examine what types of behaviors AI-led review of BWC footage can measure and how these measures can be responsive to community demands of policing.

These insights underline the importance of designing algorithmic systems thoughtfully to balance efficiency with genuine quality and ethical standards in performance assessment. While scholars have begun to address questions about how offloading police decision making might impact community perceptions (Hobson et al., 2021; Schiff et al., 2023), there has been little debate about how these algorithms might affect the officers. Mechanisms such as those studied here – algorithmically judging the levels of professionalism through transcripts of police body-worn cameras – introduce rational control distinct from traditional policing oversight by minimizing subjective judgment and implementing a consistent standard across all interactions. However, as research on algorithmic management shows, while such systems can enhance accountability and transparency, they may also narrow officers’ focus to specific behaviors favored by the algorithm, potentially at the cost of situational judgment (Rosenblat et al., 2016; Rosenblat & Stark, 2016). This underscores the need to balance algorithmic standardization with the nuanced, context-driven nature of police work (Alpert & McLean, 2018). As algorithmic control reshapes officer behavior, it raises questions about how these systems can be designed to enhance, rather than restrict, effective and ethical policing practices.

Limitations

Our reliance on automated classification introduces both strengths and constraints to the study. While automation offers a standardized approach to evaluating a vast dataset, allowing for trend analysis beyond the reach of human auditors, questions remain about the accuracy of these classifications. In particular, the algorithm may lack the contextual sensitivity needed to interpret nuances in officer language and behavior accurately. However, we are aware of no systematic variation in measurement error across treatment conditions, which is what would be required to bias the treatment effect estimates. Consequently, while measurement noise is expected, it should be uncorrelated with treatment condition, preserving the internal validity of our effect estimates.

The professionalism measure used in this study, though effective for broad categorizations, has inherent limitations due to its simplified structure. By sorting interactions into discrete categories (high, standard, substandard), the algorithm may miss important nuances that are integral to understanding the complexity of police work. This reductionist approach, while necessary for algorithmic functionality, overlooks subtleties such as tone, body language, and situational appropriateness—factors that are not captured in text transcripts and are thus excluded from our analysis.

The generalizability of our findings is also limited by the specific contexts of the two large U.S. police agencies involved and the exclusive use of Truleo’s software. While selecting two highly varied agencies provides an implementation science-driven framework for examining context-sensitive outcomes (del Pozo et al., 2024), these findings cannot be assumed to represent the diversity among the 18,000 police agencies in the United States. Small agencies, those with different organizational cultures, or those under unique legal and community constraints may respond differently to AI-generated feedback. Additionally, alternative software solutions, such as Polis Solutions' TrustStat, which uses multimodal analysis and customizable outputs, may yield different effects, especially in agencies with unique operational or cultural characteristics.

Lastly, while this study highlights the potential of AI tools to enhance police behavior, it does not address broader ethical or policy implications. Issues of transparency, accountability, and the risk of AI perpetuating existing biases are critical areas that require further study (Council on Criminal Justice, 2024). Effective deployment of AI in policing must be accompanied by robust oversight and continuous assessment to ensure it fosters positive outcomes for police-community relations (Schiff et al., 2023), aligning with ethical standards and policy safeguards.

Materials and Methods

Experimental Design – A Multisite Randomized Trial

This study evaluates whether AI-generated feedback on body-worn camera (BWC) footage can influence police professionalism in two large U.S. law enforcement agencies. Truleo’s internal pre-post analyses and preliminary officer surveys inspired our hypothesis that AI-driven feedback can improve professionalism (Shastry, 2022). We established three pre-registered hypotheses to test these claims:

H1: Officers in either treatment group will show improved professionalism scores compared to the control group.

H2: Self-audited officers will score higher than the control group, with supervisor-mediated officers outperforming both.

H3: Officers in the pooled treatment conditions will exhibit improved professionalism relative to the control group.

Data and code to replicate our results, along with the study’s pre-registration, are available at: https://osf.io/bztpd/

Sample and Randomization

Our sample comprises 219 officers from Aurora Police Department (APD) and 165 officers from Richland County Sheriff’s Department (RCSD). Officers were block-randomized by gender into three conditions: control, self-assessment, and supervisor-mediated. This block randomization ensured balanced representation across gender, race, tenure, title, and unit assignment. Tables 1 and 2 detail sample characteristics and show effective balance across these key demographics. While officers were the unit of randomization, outcomes were measured at the video level, with APD producing 124,443 videos and RCSD producing 65,172 videos for analysis during six-month observation periods.
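The block-randomization step can be sketched as follows. This is an illustrative reconstruction, not the authors’ actual assignment code: the roster, column names, and round-robin assignment rule are our assumptions, chosen only to show how blocking on gender keeps the three arms balanced within each block.

```python
import random
from collections import defaultdict

CONDITIONS = ["control", "self-assessment", "supervisor-mediated"]

def block_randomize(officers, block_key, seed=42):
    """Assign officers to conditions, balancing within blocks (e.g., gender).

    officers: list of dicts, each with an 'id' and the blocking attribute.
    block_key: attribute name used to form blocks.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for officer in officers:
        blocks[officer[block_key]].append(officer)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        # Deal shuffled members round-robin into the three conditions so
        # each block contributes near-equal counts to every arm.
        for i, officer in enumerate(members):
            assignment[officer["id"]] = CONDITIONS[i % len(CONDITIONS)]
    return assignment

# Hypothetical roster of 219 officers for illustration
roster = [{"id": n, "gender": "F" if n % 3 == 0 else "M"} for n in range(219)]
arms = block_randomize(roster, "gender")
```

Because assignment is round-robin within each block, arm sizes can differ by at most one per block, which is what produces the balance reported in Tables 1 and 2.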

Measures

The primary dependent variable, police professionalism, is operationalized through Truleo’s tripartite classification: high, standard, or substandard professionalism. Professionalism categories are determined solely by language-based markers extracted from BWC footage transcripts, with no secondary human validation. This coding scheme categorizes interactions based on the presence or absence of profane or abusive language, threats of force, and the provision of explanatory dialogue before official actions. High professionalism requires the absence of unprofessional language, no threats or use of force, and more than 25 words of contextual explanation. Standard professionalism includes interactions lacking both high and substandard indicators, serving as a baseline. This algorithm-driven classification is effective for large-scale assessments of officer conduct but does not account for subtleties like tone or situational appropriateness.
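The tripartite coding scheme described above can be made concrete with a minimal sketch. The flag names below are hypothetical; only the decision logic (substandard markers first, then the more-than-25-word explanation threshold for “high,” with “standard” as the residual baseline) follows the description in the text. This is our illustrative reconstruction, not Truleo’s proprietary classifier.

```python
def classify_professionalism(transcript_flags):
    """Classify one interaction per the tripartite scheme described above.

    transcript_flags: dict of markers extracted from a BWC transcript.
    Keys (hypothetical names):
      'unprofessional_language' - profane or abusive language present
      'threat_or_use_of_force'  - threat or use of force present
      'explanation_words'       - words of explanatory dialogue given
                                  before official actions
    """
    # Substandard indicators take precedence over everything else.
    if (transcript_flags["unprofessional_language"]
            or transcript_flags["threat_or_use_of_force"]):
        return "substandard"
    # High professionalism additionally requires more than 25 words
    # of contextual explanation.
    if transcript_flags["explanation_words"] > 25:
        return "high"
    # Standard is the baseline: neither high nor substandard indicators.
    return "standard"
```

A reviewer could apply this to each transcript’s extracted flags, e.g. `classify_professionalism({"unprofessional_language": False, "threat_or_use_of_force": False, "explanation_words": 40})` yields `"high"`.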

Treatment Design

To assess the Truleo feedback mechanism, officers were randomly assigned to one of three conditions. In the control group, officers wore BWCs as usual without feedback on Truleo-generated data. The self-assessment group received direct weekly access to their data, while the supervisor-mediated group met weekly with a supervisor to review feedback, with no self-access. This structure allows comparison between different levels and modalities of feedback. Importantly, officers in the control group could still be subject to Truleo data review in cases of serious incidents or complaints, maintaining a baseline level of accountability.

Statistical Analysis

We employed a mixed-effects linear regression model to estimate the effects of the Truleo interventions on high and substandard professionalism. The model accounts for the nested structure of the data, incorporating random intercepts for each officer to capture within-officer correlation across multiple video observations. The model specification is as follows:

professionalism_ij = β_0 + β_1 · Self-assessment_ij + β_2 · Supervisor-mediated_ij + u_i + ε_ij (1)

Where:

  • professionalism_ij is the binary outcome indicating whether video j from officer i was rated as highly professional (Model 1) or below-standard professional (Model 2).

  • β_0 is the intercept, representing the expected probability of a highly professional (Model 1) or below-standard professional (Model 2) video for an officer in the control group.

  • β_1 is the coefficient for the self-assessment treatment, indicating the change in the probability of a highly professional (Model 1) or below-standard professional (Model 2) video when the officer is in the self-assessment group compared to the control group.

  • β_2 is the coefficient for the supervisor-mediated treatment, indicating the change in the probability of a highly professional (Model 1) or below-standard professional (Model 2) video when the officer is in the supervisor-mediated group compared to the control group.

  • u_i is the random intercept for officer i, capturing the individual variability in the probability of producing a highly professional (Model 1) or below-standard professional (Model 2) video.

  • ε_ij is the residual error term, representing the unexplained variability in the outcome for video j from officer i.

We analyze outcomes separately by agency, corresponding to our three models: Model 1 for high professionalism, Model 2 for substandard professionalism, and Model 3 for pooled treatment effects. The number of observations (N), p-values, and test statistics are reported in Tables 3 and 4. This statistical approach ensures robust estimates, permitting evaluation of officer responses to AI-driven feedback mechanisms across diverse operational contexts.
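The structure of equation (1) can be illustrated by simulating its data-generating process. The coefficient values below are made up for illustration; the authors fit the actual model with mixed-effects software (the references cite lme4, Bates et al., 2015). The simulation shows how officer-level random intercepts (u_i) and video-level noise (ε_ij) combine, and why, with many videos per officer, arm-level means recover the control baseline and the two treatment shifts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameter values for illustration only
beta0, beta1, beta2 = 0.20, 0.05, 0.08   # control baseline + treatment shifts
sigma_u, sigma_e = 0.04, 0.10            # officer-level and video-level SDs

n_officers, videos_per_officer = 300, 200
arm = rng.integers(0, 3, n_officers)      # 0=control, 1=self, 2=supervisor
u = rng.normal(0, sigma_u, n_officers)    # random intercept u_i per officer

officer_means = []
for i in range(n_officers):
    # Linear predictor from equation (1) for officer i's videos
    mu = beta0 + beta1 * (arm[i] == 1) + beta2 * (arm[i] == 2) + u[i]
    y = mu + rng.normal(0, sigma_e, videos_per_officer)  # epsilon_ij noise
    officer_means.append((arm[i], y.mean()))

# Averaging officer means within each arm recovers beta0, beta0+beta1,
# and beta0+beta2 up to sampling noise.
arm_means = [np.mean([m for a, m in officer_means if a == k]) for k in range(3)]
```

A mixed-effects fit additionally partitions the variance between sigma_u and sigma_e, which is what the random intercept for each officer buys over a pooled regression on videos.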

Acknowledgments

We are grateful to the officers and command staff at the Aurora Police Department and the Richland County Sheriff’s Department. Special thanks go to Sheriff Leon Lott and Deputy Chief Harry Polis (RCSD), as well as Chief (Ret.) Art Acevedo, for their willingness to invite and host difficult experimental work in their agencies to promote professional police services.

Funding:

Laura and John Arnold Foundation grant 23-09293 (ITA, KM, GPA)

Laura and John Arnold Foundation grant 23-08506 (ITA, KM, GPA)

Author contributions:

Conceptualization: ITA, KM, GPA

Methodology: ITA, KM, GPA

Investigation: ITA, KM, GPA

Visualization: ITA, KM

Supervision: GPA

Writing—original draft: ITA, KM

Writing—review & editing: ITA, KM, GPA

Competing interests:

Authors declare that they have no competing interests.

Data and materials availability:

All data and code are publicly available at the Open Science Framework (https://osf.io/bztpd/).

References

Adams, I. T. (2022). Modeling Officer Perceptions of Body-Worn Cameras: A National Survey [Ph.D., The University of Utah]. https://www.proquest.com/docview/2697629089/abstract/55861BBD52D94FBEPQ/1

Adams, I. T., & Mastracci, S. H. (2019a). Police Body-Worn Cameras: Development of the Perceived Intensity of Monitoring Scale. Criminal Justice Review, 44(3), 386–405. https://doi.org/10.1177/0734016819846219

Adams, I. T., & Mastracci, S. H. (2019b). Police Body-Worn Cameras: Effects on Officers’ Burnout and Perceived Organizational Support. Police Quarterly, 22(1), 5–30. https://doi.org/10.1177/1098611118783987

Adams, I. T., McCrain, J., Schiff, D. S., Schiff, K. J., & Mourtgos, S. M. (2024). Police reform from the top down: Experimental evidence on police executive support for civilian oversight. Journal of Policy Analysis and Management, n/a(n/a). https://doi.org/10.1002/pam.22620

Alge, B. J. (2001). Effects of computer surveillance on perceptions of privacy and procedural justice. Journal of Applied Psychology, 86(4), 797–804. https://doi.org/10.1037/0021-9010.86.4.797

Alpert, G. P., & McLean, K. (2018). Where Is the Goal Line? A Critical Look at Police Body-Worn Camera Programs. Criminology & Public Policy, 17(3), 679–688. https://doi.org/10.1111/1745-9133.12374

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Benlian, A., Wiener, M., Cram, W. A., Krasnova, H., Maedche, A., Möhlmann, M., Recker, J., & Remus, U. (2022). Algorithmic Management. Business & Information Systems Engineering, 64(6), 825–839. https://doi.org/10.1007/s12599-022-00764-w

Camp, N. P., Voigt, R., Hamedani, M. G., Jurafsky, D., & Eberhardt, J. L. (2024). Leveraging body-worn camera footage to assess the effects of training on officer communication during traffic stops. PNAS Nexus, 3(9). https://doi.org/10.1093/pnasnexus/pgae359

Christodoulou, C., Paterson, H., & Kemp, R. (2019). Body-worn cameras: Evidence-base and implications. Current Issues in Criminal Justice, 1–12.

Council on Criminal Justice. (2024). The Implications of AI for Criminal Justice. https://counciloncj.org/the-implications-of-ai-for-criminal-justice/

del Pozo, B., Belenko, S., Pivovarova, E., Ray, B., Martins, K. F., & Taxman, F. S. (2024). Using Implementation Science to Improve Evidence-Based Policing: An Introduction for Researchers and Practitioners. Police Quarterly, 10986111241265290. https://doi.org/10.1177/10986111241265290

Falcone, D. N., & Wells, L. E. (1995). The county sheriff as a distinctive policing modality. American Journal of Police, 14(3/4), 123–149.

Farris, E. M., & Holman, M. R. (2017). All Politics Is Local? County Sheriffs and Localized Policies of Immigration Enforcement. Political Research Quarterly, 70(1), 142–154. https://doi.org/10.1177/1065912916680035

Fusi, F., & Feeney, M. K. (2018). Electronic monitoring in public organizations: Evidence from US local governments. Public Management Review, 20(10), 1465–1489. https://doi.org/10.1080/14719037.2017.1400584

Garner, J. H., Hickman, M. J., Malega, R. W., & Maxwell, C. D. (2018). Progress toward national estimates of police use of force. PloS One, 13(2), e0192932.

Gaub, J. E., Choate, D. E., Todak, N., Katz, C. M., & White, M. D. (2016). Officer perceptions of body-worn cameras before and after deployment: A study of three departments. Police Quarterly, 19(3), 275–302.

Gaub, J. E., Naoroz, C., & Malm, A. (2019). Understanding the Impact of Police Body-Worn Cameras on Virginia Public Defenders (pp. 1–17) [A report submitted to the Virginia Indigent Defense Commission]. University of North Carolina Charlotte. https://pages.uncc.edu/jgaub/wp-content/uploads/sites/1265/2019/12/BWC_PublicDefs_VA_Public.pdf

Gaub, J. E., Naoroz, C., & Malm, A. (2021). Police BWCs as ‘Neutral Observers’: Perceptions of public defenders. Policing: A Journal of Policy and Practice, 15(2), 1417–1428. https://doi.org/10.1093/police/paaa067

Graham, B. A. T., Brown, L., Chochlakis, G., Dehghani, M., Delerme, R., Friedman, B., Graeden, E., Golazizian, P., Hebbar, R., Hejabi, P., Kommineni, A., Salinas, M., Sierra-Arévalo, M., Trager, J., Weller, N., & Narayanan, S. (2024). A Multi-Perspective Machine Learning Approach to Evaluate Police-Driver Interaction in Los Angeles (arXiv:2402.01703). arXiv. https://doi.org/10.48550/arXiv.2402.01703

Graham, S. R., & Makowsky, M. D. (2024). Lame duck law enforcement. Economics Letters, 111707. https://doi.org/10.1016/j.econlet.2024.111707

Healy, K. (2017). Fuck Nuance. Sociological Theory, 35(2), 118–127. https://doi.org/10.1177/0735275117709046

Hobson, Z., Yesberg, J. A., Bradford, B., & Jackson, J. (2021). Artificial fairness? Trust in algorithmic police decision-making. Journal of Experimental Criminology. https://doi.org/10.1007/s11292-021-09484-9

Juris, H. A., & Feuille, P. (1973). Police Unionism—Power and Impact in Public-Sector Bargaining (11358; p. 242). Office of Justice Programs. https://www.ojp.gov/ncjrs/virtual-library/abstracts/police-unionism-power-and-impact-public-sector-bargaining

Lum, C., Stoltz, M., Koper, C. S., & Scherer, J. A. (2019). Research on body-worn cameras: What we know, what we need to know. Criminology & Public Policy, 18(1), 93–118.

Manley, K. (2024, January 5). Colorado Police Officer Sentenced to Jail in Elijah McClain Death. The New York Times. https://www.nytimes.com/2024/01/05/us/elijah-mcclain-randy-roedema-sentencing.html

Nicholson-Crotty, S., Nicholson-Crotty, J., & Lee, E. (2022). Police unions and use-of-force reforms in American cities. Policy Studies Journal, n/a(n/a). https://doi.org/10.1111/psj.12491

Pearson-Goff, M., & Herrington, V. (2014). Police Leadership: A Systematic Review of the Literature. Policing: A Journal of Policy and Practice, 8(1), 14–26. https://doi.org/10.1093/police/pat027

Piza, E. L., Connealy, N. T., Sytsma, V. A., & Chillar, V. F. (2023). Situational factors and police use of force across micro-time intervals: A video systematic social observation and panel regression analysis. Criminology, 61(1), 74–102. https://doi.org/10.1111/1745-9125.12323

Piza, E. L., & Sytsma, V. A. (2022). Video Data Analysis of Body-Worn Camera Footage: A Practical Methodology in Support of Police Reform. In Justice and Legitimacy in Policing. Routledge.

Rad, A. N., Kirk, D. S., & Jones, W. P. (2023). Police Unionism, Accountability, and Misconduct. Annual Review of Criminology, 6(1). https://doi.org/10.1146/annurev-criminol-030421-034244

Ravid, D. M., Tomczak, D. L., White, J. C., & Behrend, T. S. (2020). EPM 20/20: A Review, Framework, and Research Agenda for Electronic Performance Monitoring. Journal of Management, 46(1), 100–126. https://doi.org/10.1177/0149206319869435

Rosenblat, A., Levy, K., Barocas, S., & Hwang, T. (2016). Discriminating tastes: Customer ratings as vehicles for bias. Data & Society, 1–21.

Rosenblat, A., & Stark, L. (2016). Algorithmic Labor and Information Asymmetries: A Case Study of Uber’s Drivers (SSRN Scholarly Paper 2686227). Social Science Research Network. https://doi.org/10.2139/ssrn.2686227

Schiff, K. J., Schiff, D. S., Adams, I. T., McCrain, J., & Mourtgos, S. M. (2023). Institutional factors driving citizen perceptions of AI in government: Evidence from a survey experiment on policing. Public Administration Review, 0(0). https://doi.org/10.1111/puar.13754

Shastry, T. (2022). 36% Reduction in Use of Force after Implementation of Training and Body-Worn Camera Analytics [Internal pilot study]. Truleo.

Sisson, P. (2024, April 16). AI was supposed to make police bodycams better. What happened? MIT Technology Review. https://www.technologyreview.com/2024/04/16/1090846/ai-police-body-cams-cops-transparency/

Sunshine, J., & Tyler, T. R. (2003). The Role of Procedural Justice and Legitimacy in Shaping Public Support for Policing. Law & Society Review, 37(3), 513–548.

Tapp, S., & Davis, E. (2024). Contacts Between Police and the Public, 2022. Bureau of Justice Statistics.

Terpstra, B. L., & van Wijck, P. W. (2023). The Influence of Police Treatment and Decision-making on Perceptions of Procedural Justice: A Field Study. Journal of Research in Crime and Delinquency, 60(3), 344–377. https://doi.org/10.1177/00224278211030968

Trinkner, R. (2023). Toward Measuring Objective Procedural Justice: Commentary on Terpstra and van Wijck (2022). Journal of Research in Crime and Delinquency, 60(3), 378–392. https://doi.org/10.1177/00224278221135806

Watts, S., White, M. D., & Malm, A. (2024). Automating Body-Worn Camera Footage Review through AI: Preliminary Findings from a Multi-Site Randomized Control Trial. Policing: A Journal of Policy and Practice, OnlineFirst. https://doi.org/10.21428/cb6ab371.dd9645dd

White, M. D., & Malm, A. (2020). Cops, Cameras, and Crisis: The Potential and the Perils of Police Body-Worn Cameras. NYU Press.

Willis, J. J., Mastrofski, S. D., & Weisburd, D. (2004). CompStat and bureaucracy: A case study of challenges and opportunities for change. Justice Quarterly, 21(3), 463–496.

Workman-Stark, A. L. (2021). Exploring Differing Experiences of a Masculinity Contest Culture in Policing and the Impact on Individual and Organizational Outcomes. Police Quarterly, 24(3), 298–324. https://doi.org/10.1177/1098611120976090
