HEAD
Predictive policing aims to make smart and efficient policing decisions by analyzing historical crime data, predicting the risk of future occurrences, and allocating resources according to risk. However, the outcomes of predictive policing and modeling future crime risk has the potential to be inaccurate, biased, and racist due to reporting bias and policing’s systemically racist history. This analysis looks at the occurrence of domestic battery assault in the City of Chicago and builds a model to predict future locations of domestic battery reports to ultimately understand the bias that exists in crime reporting and predicting.
We begin by importing the necessary libraries, themes, and datasets with geographies. Data on domestic battery reports is downloaded from Chicago’s Open Data Portal.
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt) # for KDE and ML risk class intervals
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
mapTheme <- function(base_size = 10) {
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 12,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
strip.background = element_rect(fill = "white", color = "white"),
strip.text.x = element_text(size = 8),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank()
)
}
PlotTheme <- function(base_size = 12, title_size = 16) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.border = element_rect(colour = "transparent", fill=NA, size=2),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
The figure shows the map of domestic battery reports in Chicago in 2017. There is some clustering in a few areas, which could be the result of higher policing and thus larger numbers of arrests related to domestic battery or higher rates of crime reporting and larger numbers of domestic battery reports. Domestic battery, like other crimes, is prone to selection bias in reporting. Victims of domestic violence are often threatened or scared out of reporting or unaware of their reporting rights. The decision to report or not report counts of domestic violence can be complex and dependent on resources and relationships available to victims.
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
domesticbattery <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "BATTERY" & Description == "DOMESTIC BATTERY SIMPLE") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey80") +
geom_sf(data = domesticbattery, colour="#3b528b", size=0.3, show.legend = "point") +
labs(title= "Domestic Battery Reports", subtitle = "Chicago, 2017") +
mapTheme(),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey80") +
stat_density2d(data = data.frame(st_coordinates(domesticbattery)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Domestic Battery", subtitle = "Chicago, 2017") +
mapTheme() + theme(legend.position = "none"))
We know that crime and crime risk does not operate based on arbitrary borders like neighborhoods or police districts. To account for this, we create a fishnet in the code chunk below. A fishnet allows us to conduct analysis with spatial trends by aggregating point data into grid cells.
## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # fast way to select intersecting polygons
st_sf() %>%
mutate(uniqueID = 1:n())
We joined the domestic battery reports data to our fishnet grid to count the number of reports in each grid cell.
## add a value of 1 to each crime, sum them with aggregate
crime_net <-
dplyr::select(domesticbattery) %>%
mutate(countdomesticbattery = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countdomesticbattery = replace_na(countdomesticbattery, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = countdomesticbattery), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Domestic Battery Reports: Fishnet", subtitle ="Chicago, 2017") +
mapTheme()
We also downloaded data on the selected risk factors from Chicago’s Open Data Portal. In this analysis, risk factors include abandoned cars, abandoned buildings, alley lights out requests, graffiti, liquor retail locations, sanitation requests, streetlights out requests, and support center locations. Support centers in Chicago offer resources and support for domestic violence victims, as well as youth services, senior services, and homeless services. Many of these indicators are also prone to problems related to selection and reporting bias as disused above. Additional predictors include neighborhoods, police districts, and distance to The Loop.
## only pulling a single variable for our model to keep it simple
## using Socrata again
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Cur rent-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
supportCenters <-
read.socrata("https://data.cityofchicago.org/resource/jmw7-ijg5.json") %>%
filter(division == "Domestic Violence") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Support_Center")
Alleylights <-
read.socrata("https://data.cityofchicago.org/resource/up7z-t43p.json") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "AlleyLightsOut")
acs_variable_list.2017 <- load_variables(2017,"acs5",cache = TRUE)
unemployment <-
get_acs(geography = "tract",
variables = c("B23025_005"),
year=2017, state=17, county=031, geometry=T, output="wide") %>%
st_transform(st_crs(fishnet)) %>%
rename(unemployment = B23025_005E)%>%
mutate( legend = "unemployment")%>%
dplyr::select(-NAME,-starts_with("B"))
## Neighborhoods to use in LOOCV in a bit
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
The figures below show our risk factors by fishnet.
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail, graffiti, sanitation, supportCenters,
Alleylights) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i,size=10) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet"))
This analysis engineered nearest neighbor features in hopes of accounting for any spatial or scale bias. The nearest neighbor features below shows the clustering of risk factor occurrences in Chicago. We see that the south side of Chicago often has less exposure to risk factors than other areas in the city.
st_c <- st_coordinates
st_coid <- st_centroid
vars_net$Abandoned_Buildings.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3)
vars_net$Abandoned_Cars.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3)
vars_net$Graffiti.nn =
nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3)
vars_net$Liquor_Retail.nn =
nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3)
vars_net$Street_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3)
vars_net$Sanitation.nn =
nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3)
vars_net$Alley_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(Alleylights),3)
vars_net$Support_Centers.nn =
nn_function(st_c(st_coid(vars_net)), st_c(supportCenters),3)
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="", option = "D") +
labs(title=i) +
mapTheme() + theme(legend.position="right")}
do.call(grid.arrange,c(mapList, ncol =4, top = "Nearest Neighbor Risk Factors by Fishnet"))
This analysis also measured the distance to Chicago’s downtown area, The Loop.
loopPoint <-
neighborhoods %>%
filter(name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
ggplot() +
geom_sf(data=vars_net, aes(fill=loopDistance)) +
scale_fill_viridis(option = "D") +
labs(title="Euclidean Distance to The Loop") +
mapTheme()
The next step is create the final fishnet.
## important to drop the geometry from joining features
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
To find the existing spatial process behind domestic battery reports, we calculated Local Moran’s I. We see a few general areas that are significant hotspots for domestic battery reports.
# Join risk factors
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
# Join neighborhoods and police districts
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name)) %>%
st_join(dplyr::select(policeDistricts, District)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
## see ?localmoran
local_morans <- localmoran(final_net$countdomesticbattery, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(countdomesticbattery = countdomesticbattery,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme() + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I Statistics: Domestic Battery"))
We also provide a map of distance to significant hotspots of domestic battery reports below.
final_net <- final_net %>%
mutate(countdomesticbattery.isSig =
ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
mutate(countdomesticbattery.isSig.dist =
nn_function(st_c(st_coid(final_net)),
st_c(st_coid(filter(final_net,
countdomesticbattery.isSig == 1))),
k = 1))
ggplot() +
geom_sf(data = final_net, aes(fill=countdomesticbattery.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Domestic Battery: NN Distance", subtitle = "Chicago, 2017") +
mapTheme()
This analysis developed multiple regressions to model domestic battery risk.
First, we look at correlation, which provides important information on relationship between indicators or risk factors and our dependent variable, domestic battery reports. The plots below show these relationships with the correlation coefficient for each factor.
plot_vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "Alleylights", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name, -District) %>%
gather(Variable, Value, -countdomesticbattery) %>%
filter(Variable %in% plot_vars)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countdomesticbattery, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countdomesticbattery)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Domestic Battery: Count as a function of risk factors") +
PlotTheme()
A histogram of the dependent variable, domestic battery reports, is presented below. The distribution of the dependent variable is not normal, so this analysis will use Poisson Regression and Cross-Validated Poisson Regression going forward.
final_net %>%
ggplot(aes(countdomesticbattery)) +
geom_histogram(bins = 20, colour="black", fill="#5dc863") +
labs(title="Domestic Battery Report Distribution",
x="Number of Domestic Battery Reports in 1 Grid Cell", y="Count") +
PlotTheme()
Lastly, we perform a k-fold cross validation method called Leave-One-Group-Out cross-validation, or LOGO-CV. In LOGO-CV, each neighborhood is Chicago will used as the hold-out once. We created two regressions, one that includes only the risk factors and another than includes the risk factors and the local spatial process as determined through the Local Moran’s I test.
# View(crossValidate)
## define the variables we want
reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance")
reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countdomesticbattery ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countdomesticbattery",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countdomesticbattery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countdomesticbattery",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countdomesticbattery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)
It is critical that models are both accurate and generalizable.
We calculated mean absolute error (MAE) for each prediction. The map below shows the MAE across the fishnet and across Chicago’s neighborhoods. We see larger errors in the areas that align with the hotspot locations.
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countdomesticbattery,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countdomesticbattery,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countdomesticbattery,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countdomesticbattery,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countdomesticbattery, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T))
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression, ncol=4) +
scale_fill_viridis() +
labs(title = "Errors by Random k-fold and LOGO-CV Regressions") +
mapTheme() + theme(plot.title=element_text(hjust=0.5),legend.position="bottom")
A table of mean and standard deviation MAE is presented below. We see that including the spatial process improves the models.
# table of MAE and sd MAE by reg
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#5dc863") %>%
row_spec(4, color = "black", background = "#5dc863")
Regression | Mean_MAE | SD_MAE |
---|---|---|
Random k-fold CV: Just Risk Factors | 1.62 | 1.29 |
Random k-fold CV: Spatial Process | 1.34 | 1.08 |
Spatial LOGO-CV: Just Risk Factors | 4.85 | 3.88 |
Spatial LOGO-CV: Spatial Process | 3.70 | 3.13 |
To better understand the model’s generalizability, we used census data to split Chicago into two groups: majority white neighborhoods and majority non-white neighborhoods. The map below shows the breakdown of race in Chicago.
tractsCookCounty <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2018, state=17, county=031, geometry=T) %>%
st_transform('ESRI:102271') %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
ggplot()+
geom_sf(data = tractsCookCounty, aes(fill = raceContext)) +
scale_fill_manual(values = c("#5dc863", "#440154"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom")
The table shows the MAE by neighborhoods racial context. We see that the model often under-predicts in majority non-white neighborhoods. Even when accounting for race, the model with spatial processes performs better than just the risk factors.
#table of raw errors for race context
reg.summary %>%
st_centroid() %>%
st_join(tractsCookCounty) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "MAE by neighborhood racial context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#5dc863") %>%
row_spec(4, color = "black", background = "#5dc863")
Regression | Majority_Non_White | Majority_White |
---|---|---|
Random k-fold CV: Just Risk Factors | -2.825322 | 3.168058 |
Random k-fold CV: Spatial Process | -1.162638 | 1.320920 |
Spatial LOGO-CV: Just Risk Factors | -2.744340 | 3.311502 |
Spatial LOGO-CV: Spatial Process | -1.164173 | 1.366041 |
Below, we compare the traditional kernel density map with our prediction results. We see that the risk prediction method is able to provide more precise results.
dom.bats18 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>%
filter(Primary.Type == "BATTERY" &
Description == "DOMESTIC BATTERY SIMPLE") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
# demo of kernel width
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD.1000 <- spatstat.core::density.ppp(dom.batt_ppp, 1000)
dom.batt_KD.1500 <- spatstat.core::density.ppp(dom.batt_ppp, 1500)
dom.batt_KD.2000 <- spatstat.core::density.ppp(dom.batt_ppp, 2000)
dom.batt_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
dom.batt_KD.df$Legend <- factor(dom.batt_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
dom.batt_KDE_sum <- as.data.frame(dom.batt_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean)
kde_breaks <- classIntervals(dom.batt_KDE_sum$value,
n = 5, "fisher")
dom.batt_KDE_sf <- dom.batt_KDE_sum %>%
mutate(label = "Kernel Density",
Risk_Category = classInt::findCols(kde_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
dplyr::select(label, Risk_Category, dom.battCount)
# calculating kernel density for dombats in 2017 and adds count for 2018
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD <- density.ppp(dom.batt_ppp, 1000)
dom.batt_KDE_sf <- as.data.frame(dom.batt_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Hotspot Mapping",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
dplyr::select(label, Risk_Category, dom.battCount)
#repeat for risk prediction
dom.batt_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
dplyr::select(label,Risk_Category, dom.battCount)
# map comparing kernel density to risk predictions for the 2018 crime
rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(dom.bats18, 2032), size = .5, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Hotspot Mapping and Risk Predictions",
subtitle="Chicago: 2017 Domestic Battery Report risk predictions; 2018 Domestic Battery Reports") +
mapTheme()
The bar plot below depicts the above comparison using 2018 reports. We can see that the risk prediction model provided more accurate predictions in areas that are high risk (90-100% risk category) for domestic battery. In areas that have lower risks, the models perform similarly but our prediction model is slightly less accurate.
rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countdomesticbattery = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countdomesticbattery / sum(countdomesticbattery)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_manual(values = c("#5dc863", "#440154")) +
labs(title = "Risk prediction vs. Kernel density", subtitle="Chicago, 2018: Domestic Battery Reports", x="Risk Category", y = "Rate of Crime (Test Set)") +
PlotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
While this model is somewhat accurate and generalizable, we do not recommend that this algorithm be used by the Chicago Police Department. The data on which we built this model is prone to selection bias that is racist and unjust. As previously discussed, domestic battery has a reporting bias as victims must choose to report occurrences and the self-reported nature of the data cannot be accounted for in this model. This model also does not account for racial biases in policing and resource allocation and use implementation of this model could further the historically racist policies that affect communities of color and minorities. In order to recommend the model, additional variables that account for spatial processes and racial bias should be implemented.
However, this model does a better job at predicting domestic battery reports in high-risk areas than the traditional approach using kernel density. Additionally, the predicted results from this model do align with the 2018 locations of domestic battery reports. There is a potential to use this model, and other models that are based on potentially biased data, for purposes not related to policing. Similar models could be helpful to for providing support to victims or other early intervention resources in report hotspots.
Predictive policing aims to make smart and efficient policing decisions by analyzing historical crime data, predicting the risk of future occurrences, and allocating resources according to risk. However, the outcomes of predictive policing and modeling future crime risk has the potential to be inaccurate, biased, and racist due to reporting bias and policing’s systemically racist history. This analysis looks at the occurrence of domestic battery assault in the City of Chicago and builds a model to predict future locations of domestic battery reports to ultimately understand the bias that exists in crime reporting and predicting.
We begin by importing the necessary libraries, themes, and datasets with geographies. Data on domestic battery reports is downloaded from Chicago’s Open Data Portal.
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt) # for KDE and ML risk class intervals
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
mapTheme <- function(base_size = 10) {
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 12,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
strip.background = element_rect(fill = "white", color = "white"),
strip.text.x = element_text(size = 8),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank()
)
}
PlotTheme <- function(base_size = 12, title_size = 16) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.border = element_rect(colour = "transparent", fill=NA, size=2),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
The figure shows the map of domestic battery reports in Chicago in 2017. There is some clustering in a few areas, which could be the result of higher policing and thus larger numbers of arrests related to domestic battery or higher rates of crime reporting and larger numbers of domestic battery reports. Domestic battery, like other crimes, is prone to selection bias in reporting. Victims of domestic violence are often threatened or scared out of reporting or unaware of their reporting rights. The decision to report or not report counts of domestic violence can be complex and dependent on resources and relationships available to victims.
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
domesticbattery <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "BATTERY" & Description == "DOMESTIC BATTERY SIMPLE") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey80") +
geom_sf(data = domesticbattery, colour="#3b528b", size=0.3, show.legend = "point") +
labs(title= "Domestic Battery Reports", subtitle = "Chicago, 2017") +
mapTheme(),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey80") +
stat_density2d(data = data.frame(st_coordinates(domesticbattery)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Domestic Battery", subtitle = "Chicago, 2017") +
mapTheme() + theme(legend.position = "none"))
We know that crime and crime risk does not operate based on arbitrary borders like neighborhoods or police districts. To account for this, we create a fishnet in the code chunk below. A fishnet allows us to conduct analysis with spatial trends by aggregating point data into grid cells.
## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # fast way to select intersecting polygons
st_sf() %>%
mutate(uniqueID = 1:n())
We joined the domestic battery reports data to our fishnet grid to count the number of reports in each grid cell.
## add a value of 1 to each crime, sum them with aggregate
crime_net <-
dplyr::select(domesticbattery) %>%
mutate(countdomesticbattery = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countdomesticbattery = replace_na(countdomesticbattery, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = crime_net, aes(fill = countdomesticbattery), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Domestic Battery Reports: Fishnet", subtitle ="Chicago, 2017") +
mapTheme()
We also downloaded data on the selected risk factors from Chicago’s Open Data Portal. In this analysis, risk factors include abandoned cars, abandoned buildings, alley lights out requests, graffiti, liquor retail locations, sanitation requests, streetlights out requests, and support center locations. Support centers in Chicago offer resources and support for domestic violence victims, as well as youth services, senior services, and homeless services. Many of these indicators are also prone to problems related to selection and reporting bias as disused above. Additional predictors include neighborhoods, police districts, and distance to The Loop.
## only pulling a single variable for our model to keep it simple
## using Socrata again
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Cur rent-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
supportCenters <-
read.socrata("https://data.cityofchicago.org/resource/jmw7-ijg5.json") %>%
filter(division == "Domestic Violence") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Support_Center")
Alleylights <-
read.socrata("https://data.cityofchicago.org/resource/up7z-t43p.json") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "AlleyLightsOut")
acs_variable_list.2017 <- load_variables(2017,"acs5",cache = TRUE)
unemployment <-
get_acs(geography = "tract",
variables = c("B23025_005"),
year=2017, state=17, county=031, geometry=T, output="wide") %>%
st_transform(st_crs(fishnet)) %>%
rename(unemployment = B23025_005E)%>%
mutate( legend = "unemployment")%>%
dplyr::select(-NAME,-starts_with("B"))
## Neighborhoods to use in LOOCV in a bit
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
The figures below show our risk factors by fishnet.
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail, graffiti, sanitation, supportCenters,
Alleylights) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i,size=10) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet"))
This analysis engineered nearest neighbor features in hopes of accounting for any spatial or scale bias. The nearest neighbor features below shows the clustering of risk factor occurrences in Chicago. We see that the south side of Chicago often has less exposure to risk factors than other areas in the city.
st_c <- st_coordinates
st_coid <- st_centroid
vars_net$Abandoned_Buildings.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3)
vars_net$Abandoned_Cars.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3)
vars_net$Graffiti.nn =
nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3)
vars_net$Liquor_Retail.nn =
nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3)
vars_net$Street_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3)
vars_net$Sanitation.nn =
nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3)
vars_net$Alley_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(Alleylights),3)
vars_net$Support_Centers.nn =
nn_function(st_c(st_coid(vars_net)), st_c(supportCenters),3)
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="", option = "D") +
labs(title=i) +
mapTheme() + theme(legend.position="right")}
do.call(grid.arrange,c(mapList, ncol =4, top = "Nearest Neighbor Risk Factors by Fishnet"))
This analysis also measured the distance to Chicago’s downtown area, The Loop.
loopPoint <-
neighborhoods %>%
filter(name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
ggplot() +
geom_sf(data=vars_net, aes(fill=loopDistance)) +
scale_fill_viridis(option = "D") +
labs(title="Euclidean Distance to The Loop") +
mapTheme()
The next step is create the final fishnet.
## important to drop the geometry from joining features
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weigths
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
To find the existing spatial process behind domestic battery reports, we calculated Local Moran’s I. We see a few general areas that are significant hotspots for domestic battery reports.
# Join risk factors
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
# Join neighborhoods and police districts
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name)) %>%
st_join(dplyr::select(policeDistricts, District)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
## see ?localmoran
local_morans <- localmoran(final_net$countdomesticbattery, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(countdomesticbattery = countdomesticbattery,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme() + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I Statistics: Domestic Battery"))
We also provide a map of distance to significant hotspots of domestic battery reports below.
final_net <- final_net %>%
mutate(countdomesticbattery.isSig =
ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
mutate(countdomesticbattery.isSig.dist =
nn_function(st_c(st_coid(final_net)),
st_c(st_coid(filter(final_net,
countdomesticbattery.isSig == 1))),
k = 1))
ggplot() +
geom_sf(data = final_net, aes(fill=countdomesticbattery.isSig.dist), colour=NA) +
scale_fill_viridis(name="NN Distance") +
labs(title="Domestic Battery: NN Distance", subtitle = "Chicago, 2017") +
mapTheme()
This analysis developed multiple regressions to model domestic battery risk.
First, we look at correlation, which provides important information on relationship between indicators or risk factors and our dependent variable, domestic battery reports. The plots below show these relationships with the correlation coefficient for each factor.
plot_vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "Alleylights", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name, -District) %>%
gather(Variable, Value, -countdomesticbattery) %>%
filter(Variable %in% plot_vars)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countdomesticbattery, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countdomesticbattery)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Domestic Battery: Count as a function of risk factors") +
PlotTheme()
A histogram of the dependent variable, domestic battery reports, is presented below. The distribution of the dependent variable is not normal, so this analysis will use Poisson Regression and Cross-Validated Poisson Regression going forward.
final_net %>%
ggplot(aes(countdomesticbattery)) +
geom_histogram(bins = 20, colour="black", fill="#5dc863") +
labs(title="Domestic Battery Report Distribution",
x="Number of Domestic Battery Reports in 1 Grid Cell", y="Count") +
PlotTheme()
Lastly, we perform a k-fold cross validation method called Leave-One-Group-Out cross-validation, or LOGO-CV. In LOGO-CV, each neighborhood is Chicago will used as the hold-out once. We created two regressions, one that includes only the risk factors and another than includes the risk factors and the local spatial process as determined through the Local Moran’s I test.
# View(crossValidate)
## define the variables we want
reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance")
reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out", "Sanitation", "AlleyLightsOut", "Support_Centers.nn", "loopDistance", "countdomesticbattery.isSig", "countdomesticbattery.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countdomesticbattery ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countdomesticbattery",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countdomesticbattery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countdomesticbattery, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countdomesticbattery",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countdomesticbattery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countdomesticbattery, Prediction, geometry)
It is critical that models are both accurate and generalizable.
We calculated mean absolute error (MAE) for each prediction. The map below shows the MAE across the fishnet and across Chicago’s neighborhoods. We see larger errors in the areas that align with the hotspot locations.
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countdomesticbattery,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countdomesticbattery,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countdomesticbattery,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countdomesticbattery,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countdomesticbattery, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T))
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression, ncol=4) +
scale_fill_viridis() +
labs(title = "Errors by Random k-fold and LOGO-CV Regressions") +
mapTheme() + theme(plot.title=element_text(hjust=0.5),legend.position="bottom")
A table of mean and standard deviation MAE is presented below. We see that including the spatial process improves the models.
# table of MAE and sd MAE by reg
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#5dc863") %>%
row_spec(4, color = "black", background = "#5dc863")
Regression | Mean_MAE | SD_MAE |
---|---|---|
Random k-fold CV: Just Risk Factors | 1.62 | 1.29 |
Random k-fold CV: Spatial Process | 1.34 | 1.08 |
Spatial LOGO-CV: Just Risk Factors | 4.85 | 3.88 |
Spatial LOGO-CV: Spatial Process | 3.70 | 3.13 |
To better understand the model’s generalizability, we used census data to split Chicago into two groups: majority white neighborhoods and majority non-white neighborhoods. The map below shows the breakdown of race in Chicago.
tractsCookCounty <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2018, state=17, county=031, geometry=T) %>%
st_transform('ESRI:102271') %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
ggplot()+
geom_sf(data = tractsCookCounty, aes(fill = raceContext)) +
scale_fill_manual(values = c("#5dc863", "#440154"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom")
The table shows the MAE by neighborhoods racial context. We see that the model often under-predicts in majority non-white neighborhoods. Even when accounting for race, the model with spatial processes performs better than just the risk factors.
#table of raw errors for race context
reg.summary %>%
st_centroid() %>%
st_join(tractsCookCounty) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "MAE by neighborhood racial context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#5dc863") %>%
row_spec(4, color = "black", background = "#5dc863")
Regression | Majority_Non_White | Majority_White |
---|---|---|
Random k-fold CV: Just Risk Factors | -2.825322 | 3.168058 |
Random k-fold CV: Spatial Process | -1.162638 | 1.320920 |
Spatial LOGO-CV: Just Risk Factors | -2.744340 | 3.311502 |
Spatial LOGO-CV: Spatial Process | -1.164173 | 1.366041 |
Below, we compare the traditional kernel density map with our prediction results. We see that the risk prediction method is able to provide more precise results.
dom.bats18 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>%
filter(Primary.Type == "BATTERY" &
Description == "DOMESTIC BATTERY SIMPLE") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
# demo of kernel width
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD.1000 <- spatstat.core::density.ppp(dom.batt_ppp, 1000)
dom.batt_KD.1500 <- spatstat.core::density.ppp(dom.batt_ppp, 1500)
dom.batt_KD.2000 <- spatstat.core::density.ppp(dom.batt_ppp, 2000)
dom.batt_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(dom.batt_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
dom.batt_KD.df$Legend <- factor(dom.batt_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
dom.batt_KDE_sum <- as.data.frame(dom.batt_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean)
kde_breaks <- classIntervals(dom.batt_KDE_sum$value,
n = 5, "fisher")
dom.batt_KDE_sf <- dom.batt_KDE_sum %>%
mutate(label = "Kernel Density",
Risk_Category = classInt::findCols(kde_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
dplyr::select(label, Risk_Category, dom.battCount)
# calculating kernel density for dombats in 2017 and adds count for 2018
dom.batt_ppp <- as.ppp(st_coordinates(domesticbattery), W = st_bbox(final_net))
dom.batt_KD <- density.ppp(dom.batt_ppp, 1000)
dom.batt_KDE_sf <- as.data.frame(dom.batt_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Hotspot Mapping",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
dplyr::select(label, Risk_Category, dom.battCount)
#repeat for risk prediction
dom.batt_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(dom.bats18) %>% mutate(dom.battCount = 1), ., sum) %>%
mutate(dom.battCount = replace_na(dom.battCount, 0))) %>%
dplyr::select(label,Risk_Category, dom.battCount)
# map comparing kernel density to risk predictions for the 2018 crime
rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(dom.bats18, 2032), size = .5, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Hotspot Mapping and Risk Predictions",
subtitle="Chicago: 2017 Domestic Battery Report risk predictions; 2018 Domestic Battery Reports") +
mapTheme()
The bar plot below depicts the above comparison using 2018 reports. We can see that the risk prediction model provided more accurate predictions in areas that are high risk (90-100% risk category) for domestic battery. In areas that have lower risks, the models perform similarly but our prediction model is slightly less accurate.
rbind(dom.batt_KDE_sf, dom.batt_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countdomesticbattery = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countdomesticbattery / sum(countdomesticbattery)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_manual(values = c("#5dc863", "#440154")) +
labs(title = "Risk prediction vs. Kernel density", subtitle="Chicago, 2018: Domestic Battery Reports", x="Risk Category", y = "Rate of Crime (Test Set)") +
PlotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
While this model is somewhat accurate and generalizable, we do not recommend that this algorithm be used by the Chicago Police Department. The data on which we built this model is prone to selection bias that is racist and unjust. As previously discussed, domestic battery has a reporting bias as victims must choose to report occurrences and the self-reported nature of the data cannot be accounted for in this model. This model also does not account for racial biases in policing and resource allocation and use implementation of this model could further the historically racist policies that affect communities of color and minorities. In order to recommend the model, additional variables that account for spatial processes and racial bias should be implemented.
However, this model does a better job at predicting domestic battery reports in high-risk areas than the traditional approach using kernel density. Additionally, the predicted results from this model do align with the 2018 locations of domestic battery reports. There is a potential to use this model, and other models that are based on potentially biased data, for purposes not related to policing. Similar models could be helpful to for providing support to victims or other early intervention resources in report hotspots.