Introduction

Predictive policing aims to make smart and efficient policing decisions by analyzing historical crime data, predicting the risk of future occurrences, and allocating resources according to risk. However, the outcomes of predictive policing and modeling future crime risk has the potential to be inaccurate, biased, and racist due to reporting bias and policing’s systemically racist history. This analysis looks at the occurrence of domestic battery assault in the City of Chicago and builds a model to predict future locations of domestic battery reports to ultimately understand the bias that exists in crime reporting and predicting.

Set Up

We begin by importing the necessary libraries, themes, and datasets with geographies. Data on domestic battery reports is downloaded from Chicago’s Open Data Portal.

knitr::opts_chunk$set(echo = TRUE, warning=FALSE)

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt)   # for KDE and ML risk class intervals
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

mapTheme <- function(base_size = 10) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 12,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    strip.background = element_rect(fill = "white", color = "white"),
    strip.text.x = element_text(size = 8),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank()
  )
}

PlotTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "transparent", fill=NA, size=2),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

Data Wrangling

The figure shows the map of domestic battery reports in Chicago in 2017. There is some clustering in a few areas, which could be the result of higher policing and thus larger numbers of arrests related to domestic battery or higher rates of crime reporting and larger numbers of domestic battery reports. Domestic battery, like other crimes, is prone to selection bias in reporting. Victims of domestic violence are often threatened or scared out of reporting or unaware of their reporting rights. The decision to report or not report counts of domestic violence can be complex and dependent on resources and relationships available to victims.

Figure 1: Map of Dependent Variable - Domestic Battery Reports

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

domesticbattery <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    filter(Primary.Type == "BATTERY" & Description == "DOMESTIC BATTERY SIMPLE") %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  geom_sf(data = domesticbattery, colour="#3b528b", size=0.3, show.legend = "point") +
  labs(title= "Domestic Battery Reports", subtitle = "Chicago, 2017") +
  mapTheme(),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(domesticbattery)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Domestic Battery", subtitle = "Chicago, 2017") +
  mapTheme() + theme(legend.position = "none"))

Creating Fishnet

We know that crime and crime risk does not operate based on arbitrary borders like neighborhoods or police districts. To account for this, we create a fishnet in the code chunk below. A fishnet allows us to conduct analysis with spatial trends by aggregating point data into grid cells.

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

Joining Crime Reports to the Fishnet

We joined the domestic battery reports data to our fishnet grid to count the number of reports in each grid cell.

Figure 2: Count of Domestic Battery Reports by Fishnet

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(domesticbattery) %>% 
  mutate(countdomesticbattery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countdomesticbattery = replace_na(countdomesticbattery, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countdomesticbattery), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Domestic Battery Reports: Fishnet", subtitle ="Chicago, 2017") +
  mapTheme()

Wrangling Risk Factors

We also downloaded data on the selected risk factors from Chicago’s Open Data Portal. In this analysis, risk factors include abandoned cars, abandoned buildings, alley lights out requests, graffiti, liquor retail locations, sanitation requests, streetlights out requests, and support center locations. Support centers in Chicago offer resources and support for domestic violence victims, as well as youth services, senior services, and homeless services. Many of these indicators are also prone to problems related to selection and reporting bias as disused above. Additional predictors include neighborhoods, police districts, and distance to The Loop.

## only pulling a single variable for our model to keep it simple
## using Socrata again
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ == "Front" |
           where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Cur   rent-Liquor-and-Public-Places/nrmj-3kcf") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

supportCenters <-
  read.socrata("https://data.cityofchicago.org/resource/jmw7-ijg5.json") %>%
  filter(division == "Domestic Violence") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Support_Center")

Alleylights <- 
  read.socrata("https://data.cityofchicago.org/resource/up7z-t43p.json") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "AlleyLightsOut")

acs_variable_list.2017 <- load_variables(2017,"acs5",cache = TRUE)
unemployment <- 
  get_acs(geography = "tract", 
          variables = c("B23025_005"), 
          year=2017, state=17, county=031, geometry=T, output="wide") %>%
  st_transform(st_crs(fishnet)) %>%
  rename(unemployment = B23025_005E)%>%
  mutate( legend = "unemployment")%>%
  dplyr::select(-NAME,-starts_with("B"))

## Neighborhoods to use in LOOCV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

Feature Engineering

The figures below show our risk factors by fishnet.

Figure 3: Risk Factors in Fishnet

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, supportCenters,
        Alleylights) %>%
  st_join(., fishnet, join=st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit()

vars_net.long <- 
  vars_net %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i,size=10) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet"))

This analysis engineered nearest neighbor features in hopes of accounting for any spatial or scale bias. The nearest neighbor features below shows the clustering of risk factor occurrences in Chicago. We see that the south side of Chicago often has less exposure to risk factors than other areas in the city.

Figure 4: Risk Factors Nearest Neighbors in Fishnet

st_c <- st_coordinates
st_coid <- st_centroid

vars_net$Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3)
vars_net$Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3)
vars_net$Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3)
vars_net$Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3)
vars_net$Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3)
vars_net$Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3)
vars_net$Alley_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(Alleylights),3)
vars_net$Support_Centers.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(supportCenters),3)

vars_net.long.nn <- 
  vars_net %>%
  dplyr::select(ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="", option = "D") +
      labs(title=i) +
      mapTheme() + theme(legend.position="right")}

do.call(grid.arrange,c(mapList, ncol =4, top = "Nearest Neighbor Risk Factors by Fishnet"))

Measuring Distance to The Loop

This analysis also measured the distance to Chicago’s downtown area, The Loop.

Figure 5: Distance to The Loop

loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

ggplot() +
  geom_sf(data=vars_net, aes(fill=loopDistance)) +
  scale_fill_viridis(option = "D") +
  labs(title="Euclidean Distance to The Loop") +
  mapTheme()

Creating the Final Fishnet

The next step is create the final fishnet.

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
    st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

Exploring the Spatial Processes of Domestic Battery Reports

To find the existing spatial process behind domestic battery reports, we calculated Local Moran’s I. We see a few general areas that are significant hotspots for domestic battery reports.

Figure 6: Local Moran’s I

# Join risk factors
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
# Join neighborhoods and police districts
final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

## see ?localmoran
local_morans <- localmoran(final_net$countdomesticbattery, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(countdomesticbattery = countdomesticbattery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I Statistics: Domestic Battery"))

We also provide a map of distance to significant hotspots of domestic battery reports below.

Figure 7: Distance to Hotspots

final_net <- final_net %>% 
  mutate(countdomesticbattery.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(countdomesticbattery.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           countdomesticbattery.isSig == 1))), 
                       k = 1))

ggplot() +
      geom_sf(data = final_net, aes(fill=countdomesticbattery.isSig.dist), colour=NA) +
      scale_fill_viridis(name="NN Distance") +
      labs(title="Domestic Battery: NN Distance", subtitle = "Chicago, 2017") +
      mapTheme()

Model Building

This analysis developed multiple regressions to model domestic battery risk.

Correlation

First, we look at correlation, which provides important information on relationship between indicators or risk factors and our dependent variable, domestic battery reports. The plots below show these relationships with the correlation coefficient for each factor.

Figure 8: Correlation Scatterplot

plot_vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "Alleylights", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District) %>%
    gather(Variable, Value, -countdomesticbattery) %>%
    filter(Variable %in% plot_vars)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countdomesticbattery, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countdomesticbattery)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Domestic Battery: Count as a function of risk factors") + 
  PlotTheme()

Poisson Regression

A histogram of the dependent variable, domestic battery reports, is presented below. The distribution of the dependent variable is not normal, so this analysis will use Poisson Regression and Cross-Validated Poisson Regression going forward.

Figure 9: Histogram of Dependent Variable

final_net %>%
  ggplot(aes(countdomesticbattery)) + 
    geom_histogram(bins = 20, colour="black", fill="#5dc863") +
    labs(title="Domestic Battery Report Distribution",
         x="Number of Domestic Battery Reports in 1 Grid Cell", y="Count") +
  PlotTheme()

Cross-Validated Poisson Regression

Lastly, we perform a k-fold cross validation method called Leave-One-Group-Out cross-validation, or LOGO-CV. In LOGO-CV, each neighborhood is Chicago will used as the hold-out once. We created two regressions, one that includes only the risk factors and another than includes the risk factors and the local spatial process as determined through the Local Moran’s I test.

# View(crossValidate)

## define the variables we want
reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance")
reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countdomesticbattery ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)

Accuracy & Generalizability

It is critical that models are both accurate and generalizable.

Accuracy

We calculated mean absolute error (MAE) for each prediction. The map below shows the MAE across the fishnet and across Chicago’s neighborhoods. We see larger errors in the areas that align with the hotspot locations.

Figure 9: Model Errors by Regression

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countdomesticbattery,
           Regression = "Random k-fold CV: Just Risk Factors"),
    
    mutate(reg.ss.cv,        Error = Prediction - countdomesticbattery,
           Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countdomesticbattery,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    
    mutate(reg.ss.spatialCV, Error = Prediction - countdomesticbattery,
           Regression = "Spatial LOGO-CV: Spatial Process")) %>%
  st_sf()


error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countdomesticbattery, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) 

error_by_reg_and_fold %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression, ncol=4) +
    scale_fill_viridis() +
    labs(title = "Errors by Random k-fold and LOGO-CV Regressions") +
    mapTheme() + theme(plot.title=element_text(hjust=0.5),legend.position="bottom")

A table of mean and standard deviation MAE is presented below. We see that including the spatial process improves the models.

Table 1: MAE and Standard Deviation MAE by Regression

# table of MAE and sd MAE by reg
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#5dc863") %>%
    row_spec(4, color = "black", background = "#5dc863")

MAE by regression
Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	1.62	1.29
Random k-fold CV: Spatial Process	1.34	1.08
Spatial LOGO-CV: Just Risk Factors	4.85	3.88
Spatial LOGO-CV: Spatial Process	3.70	3.13

Generalizability

To better understand the model’s generalizability, we used census data to split Chicago into two groups: majority white neighborhoods and majority non-white neighborhoods. The map below shows the breakdown of race in Chicago.

Figure 10: Race Context in Chicago

tractsCookCounty <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

ggplot()+
  geom_sf(data = tractsCookCounty, aes(fill = raceContext)) +
  scale_fill_manual(values = c("#5dc863", "#440154"), name="Race Context") +
  labs(title = "Race Context") +
  mapTheme() + theme(legend.position="bottom")

The table shows the MAE by neighborhoods racial context. We see that the model often under-predicts in majority non-white neighborhoods. Even when accounting for race, the model with spatial processes performs better than just the risk factors.

Table 2: Table of Raw Errors for Race Context

#table of raw errors for race context 
reg.summary %>% 
    st_centroid() %>%
    st_join(tractsCookCounty) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "MAE by neighborhood racial context") %>%
        kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#5dc863") %>%
    row_spec(4, color = "black", background = "#5dc863")

MAE by neighborhood racial context
Regression	Majority_Non_White	Majority_White
Random k-fold CV: Just Risk Factors	-2.825322	3.168058
Random k-fold CV: Spatial Process	-1.162638	1.320920
Spatial LOGO-CV: Just Risk Factors	-2.744340	3.311502
Spatial LOGO-CV: Spatial Process	-1.164173	1.366041

Kernal Density Comparison

Below, we compare the traditional kernel density map with our prediction results. We see that the risk prediction method is able to provide more precise results.

Figure 11: Comparing Kernal Density to Risk Predictions

dom.bats18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "BATTERY" & 
         Description == "DOMESTIC BATTERY SIMPLE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

# demo of kernel width
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD.1000 <- spatstat.core::density.ppp(dom.batt_ppp, 1000)
dom.batt_KD.1500 <- spatstat.core::density.ppp(dom.batt_ppp, 1500)
dom.batt_KD.2000 <- spatstat.core::density.ppp(dom.batt_ppp, 2000)
dom.batt_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

dom.batt_KD.df$Legend <- factor(dom.batt_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

dom.batt_KDE_sum <- as.data.frame(dom.batt_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(dom.batt_KDE_sum$value, 
                             n = 5, "fisher")
dom.batt_KDE_sf <- dom.batt_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
    mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
  dplyr::select(label, Risk_Category, dom.battCount)


# calculating kernel density for dombats in 2017 and adds count for 2018
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD <- density.ppp(dom.batt_ppp, 1000)

dom.batt_KDE_sf <- as.data.frame(dom.batt_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Hotspot Mapping",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
    mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
  dplyr::select(label, Risk_Category, dom.battCount)

#repeat for risk prediction
dom.batt_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
      mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
  dplyr::select(label,Risk_Category, dom.battCount)

# map comparing kernel density to risk predictions for the 2018 crime
rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(dom.bats18, 2032), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Hotspot Mapping and Risk Predictions",
         subtitle="Chicago: 2017 Domestic Battery Report risk predictions; 2018 Domestic Battery Reports") +
    mapTheme()

The bar plot below depicts the above comparison using 2018 reports. We can see that the risk prediction model provided more accurate predictions in areas that are high risk (90-100% risk category) for domestic battery. In areas that have lower risks, the models perform similarly but our prediction model is slightly less accurate.

Figure 12: Bar Plot Comparison

rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countdomesticbattery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countdomesticbattery / sum(countdomesticbattery)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_manual(values = c("#5dc863", "#440154")) +
      labs(title = "Risk prediction vs. Kernel density", subtitle="Chicago, 2018: Domestic Battery Reports", x="Risk Category", y = "Rate of Crime (Test Set)") +
      PlotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Conclusion

While this model is somewhat accurate and generalizable, we do not recommend that this algorithm be used by the Chicago Police Department. The data on which we built this model is prone to selection bias that is racist and unjust. As previously discussed, domestic battery has a reporting bias as victims must choose to report occurrences and the self-reported nature of the data cannot be accounted for in this model. This model also does not account for racial biases in policing and resource allocation and use implementation of this model could further the historically racist policies that affect communities of color and minorities. In order to recommend the model, additional variables that account for spatial processes and racial bias should be implemented.

However, this model does a better job at predicting domestic battery reports in high-risk areas than the traditional approach using kernel density. Additionally, the predicted results from this model do align with the 2018 locations of domestic battery reports. There is a potential to use this model, and other models that are based on potentially biased data, for purposes not related to policing. Similar models could be helpful to for providing support to victims or other early intervention resources in report hotspots.

Introduction

Set Up

We begin by importing the necessary libraries, themes, and datasets with geographies. Data on domestic battery reports is downloaded from Chicago’s Open Data Portal.

knitr::opts_chunk$set(echo = TRUE, warning=FALSE)

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt)   # for KDE and ML risk class intervals
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

mapTheme <- function(base_size = 10) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 12,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    strip.background = element_rect(fill = "white", color = "white"),
    strip.text.x = element_text(size = 8),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank()
  )
}

PlotTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "transparent", fill=NA, size=2),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

Data Wrangling

Figure 1: Map of Dependent Variable - Domestic Battery Reports

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

domesticbattery <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    filter(Primary.Type == "BATTERY" & Description == "DOMESTIC BATTERY SIMPLE") %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  geom_sf(data = domesticbattery, colour="#3b528b", size=0.3, show.legend = "point") +
  labs(title= "Domestic Battery Reports", subtitle = "Chicago, 2017") +
  mapTheme(),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(domesticbattery)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Domestic Battery", subtitle = "Chicago, 2017") +
  mapTheme() + theme(legend.position = "none"))

Creating Fishnet

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

Joining Crime Reports to the Fishnet

We joined the domestic battery reports data to our fishnet grid to count the number of reports in each grid cell.

Figure 2: Count of Domestic Battery Reports by Fishnet

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(domesticbattery) %>% 
  mutate(countdomesticbattery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countdomesticbattery = replace_na(countdomesticbattery, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countdomesticbattery), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Domestic Battery Reports: Fishnet", subtitle ="Chicago, 2017") +
  mapTheme()

Wrangling Risk Factors

## only pulling a single variable for our model to keep it simple
## using Socrata again
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ == "Front" |
           where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Cur   rent-Liquor-and-Public-Places/nrmj-3kcf") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

supportCenters <-
  read.socrata("https://data.cityofchicago.org/resource/jmw7-ijg5.json") %>%
  filter(division == "Domestic Violence") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Support_Center")

Alleylights <- 
  read.socrata("https://data.cityofchicago.org/resource/up7z-t43p.json") %>%
    mutate(year = substr(creation_date,1,4)) %>%
    filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "AlleyLightsOut")

acs_variable_list.2017 <- load_variables(2017,"acs5",cache = TRUE)
unemployment <- 
  get_acs(geography = "tract", 
          variables = c("B23025_005"), 
          year=2017, state=17, county=031, geometry=T, output="wide") %>%
  st_transform(st_crs(fishnet)) %>%
  rename(unemployment = B23025_005E)%>%
  mutate( legend = "unemployment")%>%
  dplyr::select(-NAME,-starts_with("B"))

## Neighborhoods to use in LOOCV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

Feature Engineering

The figures below show our risk factors by fishnet.

Figure 3: Risk Factors in Fishnet

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, supportCenters,
        Alleylights) %>%
  st_join(., fishnet, join=st_within) %>%
  st_set_geometry(NULL) %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit()

vars_net.long <- 
  vars_net %>%
  gather(Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i,size=10) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet"))

Figure 4: Risk Factors Nearest Neighbors in Fishnet

st_c <- st_coordinates
st_coid <- st_centroid

vars_net$Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3)
vars_net$Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3)
vars_net$Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3)
vars_net$Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3)
vars_net$Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3)
vars_net$Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3)
vars_net$Alley_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(Alleylights),3)
vars_net$Support_Centers.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(supportCenters),3)

vars_net.long.nn <- 
  vars_net %>%
  dplyr::select(ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="", option = "D") +
      labs(title=i) +
      mapTheme() + theme(legend.position="right")}

do.call(grid.arrange,c(mapList, ncol =4, top = "Nearest Neighbor Risk Factors by Fishnet"))

Measuring Distance to The Loop

This analysis also measured the distance to Chicago’s downtown area, The Loop.

Figure 5: Distance to The Loop

loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

ggplot() +
  geom_sf(data=vars_net, aes(fill=loopDistance)) +
  scale_fill_viridis(option = "D") +
  labs(title="Euclidean Distance to The Loop") +
  mapTheme()

Creating the Final Fishnet

The next step is create the final fishnet.

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
    st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

Exploring the Spatial Processes of Domestic Battery Reports

To find the existing spatial process behind domestic battery reports, we calculated Local Moran’s I. We see a few general areas that are significant hotspots for domestic battery reports.

Figure 6: Local Moran’s I

# Join risk factors
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
# Join neighborhoods and police districts
final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

## see ?localmoran
local_morans <- localmoran(final_net$countdomesticbattery, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(countdomesticbattery = countdomesticbattery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I Statistics: Domestic Battery"))

We also provide a map of distance to significant hotspots of domestic battery reports below.

Figure 7: Distance to Hotspots

final_net <- final_net %>% 
  mutate(countdomesticbattery.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(countdomesticbattery.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           countdomesticbattery.isSig == 1))), 
                       k = 1))

ggplot() +
      geom_sf(data = final_net, aes(fill=countdomesticbattery.isSig.dist), colour=NA) +
      scale_fill_viridis(name="NN Distance") +
      labs(title="Domestic Battery: NN Distance", subtitle = "Chicago, 2017") +
      mapTheme()

Model Building

This analysis developed multiple regressions to model domestic battery risk.

Correlation

Figure 8: Correlation Scatterplot

plot_vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "Alleylights", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District) %>%
    gather(Variable, Value, -countdomesticbattery) %>%
    filter(Variable %in% plot_vars)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countdomesticbattery, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countdomesticbattery)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1, hjust = -1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Domestic Battery: Count as a function of risk factors") + 
  PlotTheme()

Poisson Regression

Figure 9: Histogram of Dependent Variable

final_net %>%
  ggplot(aes(countdomesticbattery)) + 
    geom_histogram(bins = 20, colour="black", fill="#5dc863") +
    labs(title="Domestic Battery Report Distribution",
         x="Number of Domestic Battery Reports in 1 Grid Cell", y="Count") +
  PlotTheme()

Cross-Validated Poisson Regression

# View(crossValidate)

## define the variables we want
reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance")
reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countdomesticbattery ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countdomesticbattery",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)

Accuracy & Generalizability

It is critical that models are both accurate and generalizable.

Accuracy

Figure 9: Model Errors by Regression

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countdomesticbattery,
           Regression = "Random k-fold CV: Just Risk Factors"),
    
    mutate(reg.ss.cv,        Error = Prediction - countdomesticbattery,
           Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countdomesticbattery,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    
    mutate(reg.ss.spatialCV, Error = Prediction - countdomesticbattery,
           Regression = "Spatial LOGO-CV: Spatial Process")) %>%
  st_sf()


error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countdomesticbattery, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) 

error_by_reg_and_fold %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression, ncol=4) +
    scale_fill_viridis() +
    labs(title = "Errors by Random k-fold and LOGO-CV Regressions") +
    mapTheme() + theme(plot.title=element_text(hjust=0.5),legend.position="bottom")

A table of mean and standard deviation MAE is presented below. We see that including the spatial process improves the models.

Table 1: MAE and Standard Deviation MAE by Regression

# table of MAE and sd MAE by reg
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#5dc863") %>%
    row_spec(4, color = "black", background = "#5dc863")

MAE by regression
Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	1.62	1.29
Random k-fold CV: Spatial Process	1.34	1.08
Spatial LOGO-CV: Just Risk Factors	4.85	3.88
Spatial LOGO-CV: Spatial Process	3.70	3.13

Generalizability

Figure 10: Race Context in Chicago

tractsCookCounty <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

ggplot()+
  geom_sf(data = tractsCookCounty, aes(fill = raceContext)) +
  scale_fill_manual(values = c("#5dc863", "#440154"), name="Race Context") +
  labs(title = "Race Context") +
  mapTheme() + theme(legend.position="bottom")

Table 2: Table of Raw Errors for Race Context

#table of raw errors for race context 
reg.summary %>% 
    st_centroid() %>%
    st_join(tractsCookCounty) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "MAE by neighborhood racial context") %>%
        kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#5dc863") %>%
    row_spec(4, color = "black", background = "#5dc863")

MAE by neighborhood racial context
Regression	Majority_Non_White	Majority_White
Random k-fold CV: Just Risk Factors	-2.825322	3.168058
Random k-fold CV: Spatial Process	-1.162638	1.320920
Spatial LOGO-CV: Just Risk Factors	-2.744340	3.311502
Spatial LOGO-CV: Spatial Process	-1.164173	1.366041

Kernal Density Comparison

Below, we compare the traditional kernel density map with our prediction results. We see that the risk prediction method is able to provide more precise results.

Figure 11: Comparing Kernal Density to Risk Predictions

dom.bats18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "BATTERY" & 
         Description == "DOMESTIC BATTERY SIMPLE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

# demo of kernel width
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD.1000 <- spatstat.core::density.ppp(dom.batt_ppp, 1000)
dom.batt_KD.1500 <- spatstat.core::density.ppp(dom.batt_ppp, 1500)
dom.batt_KD.2000 <- spatstat.core::density.ppp(dom.batt_ppp, 2000)
dom.batt_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

dom.batt_KD.df$Legend <- factor(dom.batt_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

dom.batt_KDE_sum <- as.data.frame(dom.batt_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(dom.batt_KDE_sum$value, 
                             n = 5, "fisher")
dom.batt_KDE_sf <- dom.batt_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
    mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
  dplyr::select(label, Risk_Category, dom.battCount)


# calculating kernel density for dombats in 2017 and adds count for 2018
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD <- density.ppp(dom.batt_ppp, 1000)

dom.batt_KDE_sf <- as.data.frame(dom.batt_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Hotspot Mapping",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
    mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
  dplyr::select(label, Risk_Category, dom.battCount)

#repeat for risk prediction
dom.batt_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
      mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
  dplyr::select(label,Risk_Category, dom.battCount)

# map comparing kernel density to risk predictions for the 2018 crime
rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(dom.bats18, 2032), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Hotspot Mapping and Risk Predictions",
         subtitle="Chicago: 2017 Domestic Battery Report risk predictions; 2018 Domestic Battery Reports") +
    mapTheme()

Figure 12: Bar Plot Comparison

rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countdomesticbattery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countdomesticbattery / sum(countdomesticbattery)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_manual(values = c("#5dc863", "#440154")) +
      labs(title = "Risk prediction vs. Kernel density", subtitle="Chicago, 2018: Domestic Battery Reports", x="Risk Category", y = "Rate of Crime (Test Set)") +
      PlotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Machine Learning #3: Predictive Policing in Chicago

Kate Tanabe

October 21, 2022

Introduction

Set Up

Data Wrangling

Figure 1: Map of Dependent Variable - Domestic Battery Reports

Creating Fishnet

Joining Crime Reports to the Fishnet

Figure 2: Count of Domestic Battery Reports by Fishnet

Wrangling Risk Factors

Feature Engineering

Figure 3: Risk Factors in Fishnet

Figure 4: Risk Factors Nearest Neighbors in Fishnet

Measuring Distance to The Loop

Figure 5: Distance to The Loop

Creating the Final Fishnet

Exploring the Spatial Processes of Domestic Battery Reports

Figure 6: Local Moran’s I

Figure 7: Distance to Hotspots

Model Building

Correlation

Figure 8: Correlation Scatterplot

Poisson Regression

Figure 9: Histogram of Dependent Variable

Cross-Validated Poisson Regression

Accuracy & Generalizability

Accuracy

Figure 9: Model Errors by Regression

Table 1: MAE and Standard Deviation MAE by Regression

Generalizability

Figure 10: Race Context in Chicago

Table 2: Table of Raw Errors for Race Context

Kernal Density Comparison

Figure 11: Comparing Kernal Density to Risk Predictions

Figure 12: Bar Plot Comparison

Conclusion

Machine Learning #3: Predictive Policing in Chicago

Kate Tanabe

October 21, 2022

Introduction

Set Up

Data Wrangling

Figure 1: Map of Dependent Variable - Domestic Battery Reports

Creating Fishnet

Joining Crime Reports to the Fishnet

Figure 2: Count of Domestic Battery Reports by Fishnet

Wrangling Risk Factors

Feature Engineering

Figure 3: Risk Factors in Fishnet

Figure 4: Risk Factors Nearest Neighbors in Fishnet

Measuring Distance to The Loop

Figure 5: Distance to The Loop

Creating the Final Fishnet

Exploring the Spatial Processes of Domestic Battery Reports

Figure 6: Local Moran’s I

Figure 7: Distance to Hotspots

Model Building

Correlation

Figure 8: Correlation Scatterplot

Poisson Regression

Figure 9: Histogram of Dependent Variable

Cross-Validated Poisson Regression

Accuracy & Generalizability

Accuracy

Figure 9: Model Errors by Regression

Table 1: MAE and Standard Deviation MAE by Regression

Generalizability

Figure 10: Race Context in Chicago

Table 2: Table of Raw Errors for Race Context

Kernal Density Comparison

Figure 11: Comparing Kernal Density to Risk Predictions

Figure 12: Bar Plot Comparison

Conclusion