Introduction

Floods are frequent natural disasters that are costly to communities. Every year floods are responsible for fatalities and serious injuries, infrastructure and property damage, and transportation disruptions, especially when communities are unprepared for floods. Flooding can also cause pollution to water systems, landslides, and mudslides. The Department of Commerce and the National Oceanic and Atmospheric Association estimates that floods cost an average of $5 billion per year and more damage than other severe weather events [1]. Flooding will continue to threaten more communities, lives, and resources as climate changes progresses.

Motivation

Planners are in a unique position to help communities become more resilient by predicting and preparing for flooding events. Planners can model flooding scenarios using techniques that simulate rainfall and flashfloods to identify the extent, direction, and accumulation points of flooding events. With these results, planners can identify areas and communities at risk of flooding and the associated impacts. Planners and community-based organizations can use this information to coordinate with local governments to communicate the technical details to communities, explain planning and redevelopment strategies, and empower communities to plan for floods. When cities understand their flood risk, they can take action to mitigate flooding impacts by understanding the current uses of areas at risk and develop plans to shift the land to uses that assist with flood mitigation in a manner that provides benefits to their community.

This exercise aims to determine the variables associated with flooding and build a linear model to identify areas at risk. We hope the application of this model provides a framework that planners can use to ensure that cities are prepared to minimize flooding impacts for the most vulnerable communities. We use a machine learning algorithm to predict which areas of a city are at risk of inundation during a flooding event.

We used historic flood data from the City of Calgary, Alberta to develop and test the accuracy of this model. We then apply this model to the City of Minneapolis, Minnesota and predict inundation risk there as a measure of generalizability.

If you are intersted in learning more about this project, check out our video here: https://www.youtube.com/watch?v=8kkovNI0Wrw

Set Up

First, we load the necessary libraries and create fishnet grids for Calgary and Minneapolis. The fishnets are created using a cell size of 300.

setwd("C:/")
unlink(".RData")

knitr::opts_chunk$set(echo = TRUE)

plotTheme1 <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
   panel.grid.major = element_line("grey60", size = 0.1),
    panel.grid.minor = element_blank(),
  # panel.border = element_rect(colour = "black", fill=NA, size=1),
    strip.background = element_rect(fill = "white", color = "NA"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
      #axis.line.y.left   = element_line(color = 'black', size =1),
   # legend.title = element_text(colour = "black", face = "italic"),
    #legend.text = element_text(colour = "black", face = "italic"),
   # strip.text.x = element_text(size = 14)
  )
}

plotTheme5 <- function(base_size = 12, title_size = 16) {
  theme(
    plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank()
  )
}

  mapTheme1 <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size,colour = "black"),
    plot.subtitle = element_text(size=12),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.line.y.left   = element_line(color = 'black', size =1),
    panel.grid.minor = element_blank(),
    strip.text.x = element_text(size = 14))
}
 

# load libraries
library(caret)
library(pscl)
library(plotROC)
library(pROC)
library(sf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(FNN)
library(scales)
library(jtools)
library(viridis)
library(gridExtra)
library(raster)
library(rgdal)
  
colors <- c("#2F4858", "#1372A4", "#86BBD8", "#A8C686","#E75E15")

Feature Enginnering

We engineered five features into our Calgary model. We leveraged both ArcGIS Pro and R for this process using the Create Fishnet and Zonal Statistics as a Table tool for several of these features.

  1. Land Cover: Land cover data was reclassified using ArcGIS into previous or impervious. Using zonal statistics as a table, we aggregated the land cover into fishnet cells, and whichever of the binary options was more predominant in that cell corresponded to the value of the cell. In Calgary, there is a bit more of a random distribution whereas in Minneapolis, there seems to be more spatial clustering of pervious land types, which could be attributed to the denser urban landscape of Minneapolis in comparison.

  2. Slope: Slope was calculated again in both R and ArcGIS, for each cell, the mean slope change was calculated. We chose 5% as a cut off value to determine whether the average slope was steep or not steep. While the value may be slightly arbitrary, it seemed to help our model.

  3. Distance to Nearest Park: Parks data was found using both cities’ open data platforms. The distance in meters was calculated, and the average value in each cell was used.

  4. Distance to Nearest Hydrological Feature: This process was similar to park features, as city hydrology was found using open data, and the distance in meters was calculated, and the average value in the cell was used.

  5. Flow Accumulation: Using a digital elevation model, we used the fill, flow direction, and flow accumulation tools to calculate the flow accumulation, where the maximum flow accumulation of each cell was used as the final value.

fishnet_calgary <- st_read("C:/fishnet_calgary_data.geojson") %>%
  st_transform('EPSG:3780')

fishnet_calgary$slope <- as.factor(fishnet_calgary$slope)
fishnet_calgary$impervious <- as.factor(fishnet_calgary$impervious)
fishnet_calgary$inundation <- as.factor(fishnet_calgary$inundation)
fishnet_calgary$log_flow_acc <- log(fishnet_calgary$flow_acc)
fishnet_calgary$log_flow_acc <- ifelse(fishnet_calgary$log_flow_acc <= 0, 0, fishnet_calgary$log_flow_acc)
fishnet_calgary$log_parks_dist <- log(fishnet_calgary$parks_dist)
fishnet_calgary$log_water_dist <- log(fishnet_calgary$water_dist)
fishnet_calgary$log_parks_dist <- ifelse(fishnet_calgary$log_parks_dist <= 0, 0, fishnet_calgary$log_parks_dist)
fishnet_calgary$log_water_dist <- ifelse(fishnet_calgary$log_water_dist <= 0, 0, fishnet_calgary$log_water_dist)
fishnet_calgary

fishnet_calgary.sf <- fishnet_calgary %>%
  st_as_sf()
minneapolis <- 
  st_read("C:/Flood/City_Boundary.shp") %>%
  st_transform('ESRI:102271') 

mpls_fn <- 
  st_make_grid(minneapolis, 
               cellsize = 300,
               square = TRUE) %>%
  .[minneapolis] %>%
  st_sf() %>%
  mutate(uniqueID = rownames(.))

mpls_fn_centroid <- mpls_fn %>%
  st_centroid()

# DEM
dem_mpls <- read_sf("C:/Flood/fishnet_to_dem/dem.shp") %>%
  dplyr::select(-c(dem_Rowid, dem_UNIQUE, dem_ZONE_C, dem_COUNT, dem_AREA, geometry)) %>%
  dplyr::rename(dem = dem_MAX) %>%
  st_drop_geometry()


mpls_fn_dem <- left_join(mpls_fn, dem_mpls, by = "uniqueID")

# Land Cover
#Land cover data was taken from Minnesota Land Cover Classification System (MLCCS) and cropped to the city of Minneapolis. Impervious versus pervious classification was completed in ArcGIS Pro.

lc_mpls <- read_sf("C:/Flood/fishnet_to_landclass/fishnet_to_landclass.shp") %>%
  dplyr::select(-c(land_class, land_cla_1, land_cla_2, land_cla_3, land_cla_4, geometry)) %>%
  dplyr::rename(pervious = land_cla_5) %>%
  st_drop_geometry()

lc_mpls$pervious <- ifelse(lc_mpls$pervious %in% c(0, 1), "pervious", "impervious")

mpls_fn_landcover <- left_join(mpls_fn, lc_mpls, by = "uniqueID")

# Distance - Parks
parks_mpls <- read_sf("C:/Flood/fishnet_to_parks/fishnet_to_parks.shp") %>%
  dplyr::select(-c(NEAR_FID)) %>%
  dplyr::rename(dist_parks = NEAR_DIST) %>%
  st_drop_geometry()

mpls_fn_parks<- left_join(mpls_fn, parks_mpls, by = "uniqueID")

# Distance - Hydrology

water_mpls <- read_sf("C:/Flood/fishnet_to_water/fishnet_to_water.shp") %>%
  dplyr::select(-c(NEAR_FID)) %>%
  dplyr::rename(dist_water = NEAR_DIST) %>%
  st_drop_geometry()

mpls_fn_water <- left_join(mpls_fn, water_mpls, by = "uniqueID")

#slope

slope_mpls <- read_sf("C:/Flood/fishnet_to_slope/slope.shp") %>%
  dplyr::select(-c(slope_Rowi, slope_UNIQ, slope_ZONE, slope_COUN, slope_AREA, geometry)) %>%
  dplyr::rename(slope = slope_MEAN) %>%
  st_drop_geometry()

mpls_fn_slope <- left_join(mpls_fn, slope_mpls, by = "uniqueID")


# Flow Accumulation 
fac_mpls <- read_sf("C:/Flood/fishnet_to_fac/fac.shp") %>%
  dplyr::select(-c(fac_Rowid, fac_UNIQUE, fac_ZONE_C, fac_COUNT, fac_AREA, geometry)) %>%
  dplyr::rename(fac = fac_MAX) %>%
  st_drop_geometry()

mpls_fn_fac <- left_join(mpls_fn, fac_mpls, by = "uniqueID")


# Merging fishnet data together
mpls_fishnet <- merge(mpls_fn, dem_mpls, by="uniqueID")
mpls_fishnet <- merge(mpls_fishnet, slope_mpls, by="uniqueID")
mpls_fishnet <- merge(mpls_fishnet, fac_mpls, by="uniqueID")
mpls_fishnet <- merge(mpls_fishnet, parks_mpls, by="uniqueID")
mpls_fishnet <- merge(mpls_fishnet, water_mpls, by="uniqueID")
mpls_fishnet <- merge(mpls_fishnet, lc_mpls, by="uniqueID")

colnames(mpls_fishnet)[4] ="flow_acc"
mpls_fishnet$impervious <- ifelse(mpls_fishnet$pervious =='impervious',1,0)
mpls_fishnet$slope1 <- ifelse(mpls_fishnet$slope >= 5, 1, 0)
fishnet_minneapolis <- mpls_fishnet %>%
  dplyr::select(uniqueID, flow_acc, dist_water, dist_parks, impervious, slope1, geometry)
colnames(fishnet_minneapolis)[6] ="slope"
colnames(fishnet_minneapolis)[3] ="water_dist"
colnames(fishnet_minneapolis)[4] ="parks_dist"
fishnet_minneapolis$slope <- as.factor(fishnet_minneapolis$slope)
fishnet_minneapolis$impervious <- as.factor(fishnet_minneapolis$impervious)
fishnet_minneapolis.sf <- fishnet_minneapolis %>%
  st_as_sf()


fishnet_minneapolis$log_flow_acc <- log(fishnet_minneapolis$flow_acc)
fishnet_minneapolis$log_flow_acc <- ifelse(fishnet_minneapolis$log_flow_acc <= 0, 0, fishnet_minneapolis$log_flow_acc)
fishnet_minneapolis$log_parks_dist <- log(fishnet_minneapolis$parks_dist)
fishnet_minneapolis$log_water_dist <- log(fishnet_minneapolis$water_dist)
fishnet_minneapolis$log_parks_dist <- ifelse(fishnet_minneapolis$log_parks_dist <= 0, 0, fishnet_minneapolis$log_parks_dist)
fishnet_minneapolis$log_water_dist <- ifelse(fishnet_minneapolis$log_water_dist <= 0, 0, fishnet_minneapolis$log_water_dist)

Exploraty Analysis & Feature Engineering

Calgary Data

We present the observed flooding data for Calgary below.

ggplot() + 
 # geom_sf(data=fishnet_calgary.sf, fill = "#86BBD8") +
  geom_sf(data = fishnet_calgary.sf, aes(fill=as.factor(inundation)), color = "transparent")+
  scale_fill_manual(values = c("#A8C686", "#1372A4"),
                    labels = c("Not Flooded","Flooded"),
                    name = "") +
  labs(title="Observed Flooding", subtitle ="Calgary, AL") +
  mapTheme1()

Variables

We present our variables for the model below.

# Calgary: Land Cover

calgA <- ggplot() +
  geom_sf(data = fishnet_calgary.sf, 
          aes(fill = as.factor(impervious)),
          color=NA) +
  scale_fill_manual(values = c("#86BBD8","#2F4858"), name = "Factor")+
  labs(title="Land Cover\n", subtitle="Calgary, AL")+
  mapTheme1()

# Calgary: Slope

calgB <- ggplot() +
  geom_sf(data = fishnet_calgary.sf, 
          aes(fill = slope),
          color=NA)+
  scale_fill_manual(values = c("#86BBD8","#2F4858"), name = "Factor")+
  labs(title="Slope\n", subtitle="Calgary, AL")+
  mapTheme1()

# Calgary: FAC

calgC <- ggplot() +
  geom_sf(data = fishnet_calgary.sf, 
          aes(fill = flow_acc),
          color=NA) +
  scale_fill_gradient(low =  "#86BBD8" , high = "#2F4858",
                      name = "Max Flow") +
labs(title="Flow Accumulation\n", subtitle="Calgary, AL")+
  mapTheme1()

# Calgary: Parks

calgD <- ggplot() +
  geom_sf(data = fishnet_calgary.sf, aes(fill = parks_dist), color=NA) +
  scale_fill_gradient(low =  "#86BBD8" , high = "#2F4858")+
  labs(title = "Distance to \nNearest Park", subtitle="Calgary, AL", fill="Distance \n(meters)") + 
  mapTheme1()

# Calgary: Hydrological features

calgE <- ggplot() +
  geom_sf(data = fishnet_calgary.sf, 
          aes(fill = water_dist),
          color=NA) +
  scale_fill_gradient(low =  "#86BBD8" , high = "#2F4858",
                      name = "Distance (Meters)") +
labs(title="Distance to Nearest\nHydro Feature", subtitle="Calgary, AL")+
  mapTheme1()

# Minneapolis: Land Cover

minA <- ggplot() +
  geom_sf(data = fishnet_minneapolis.sf, 
          aes(fill = impervious),
          color=NA)+
  scale_fill_manual(values = c("#86BBD8","#2F4858"), name = "Factor")+
  labs(title="Land Cover",subtitle="Minneapolis, MN")+
  mapTheme1()

# Minneapolis: Slope

minB <-ggplot() +
  geom_sf(data = fishnet_minneapolis.sf, 
          aes(fill = slope),
          color=NA)+
  scale_fill_manual(values = c("#86BBD8","#2F4858"), name = "Factor")+
  labs(title="Slope",subtitle="Minneapolis, MN")+
  mapTheme1()

# Minneapolis: FAC

minC <-ggplot() +
  geom_sf(data = fishnet_minneapolis.sf, 
          aes(fill = flow_acc),
          color=NA) +
  scale_fill_gradient(low =  "#86BBD8" , high = "#2F4858",
                      name = "Max Flow") +
labs(title="Flow Accumulation\n",subtitle="Minneapolis, MN")+
  mapTheme1()

# Minneapolis: Parks

minD <-ggplot() +
  geom_sf(data = fishnet_minneapolis.sf, 
          aes(fill = parks_dist),
          color=NA) +
  scale_fill_gradient(low =  "#86BBD8" , high = "#2F4858",
                      name = "Distance\n(meters)") +
labs(title="Distance to \nNearest Park",subtitle="Minneapolis, MN")+
  mapTheme1()

# Minneapolis: Hydrological features

minE <-ggplot() +
  geom_sf(data = fishnet_minneapolis.sf, 
          aes(fill = water_dist),
          color=NA) +
  scale_fill_gradient(low =  "#86BBD8" , high = "#2F4858",
                      name = "Distance\n (Meters)") + 
labs(title="Distance to Nearest\nHydro Feature",subtitle="Minneapolis, MN")+
  mapTheme1()

grid.arrange(ncol=3, calgA, calgB, calgC, calgD, calgE)

grid.arrange(ncol=3, minA, minB, minC, minD, minE)

Model Building

We used a binomial logit model for this exercise. We began the model building process by partitioning the dataset into testing and training sets. We used a 70/30 split, with 70% of the data assigned to the training set and the remaining 30% assigned to the testing set. We then built our model using the following variables: logged flow accumulation (numeric), logged distance to nearest hydrological feature (numeric), logged distance to nearest park (numeric), impervious land cover (factor), and slope (factor).

#setting up data partition 

set.seed(634)
trainIndex <- createDataPartition(fishnet_calgary$inundation, p = .70,
                                  list = FALSE,
                                  times = 1)
calgaryTrain <- fishnet_calgary[ trainIndex,]
calgaryTest  <- fishnet_calgary[-trainIndex,]

options(scipen=999)

calgaryModel <- glm(inundation ~ log_flow_acc + log_water_dist + slope + log_parks_dist + impervious,
                    family="binomial"(link="logit"), data = calgaryTrain %>%
                      as.data.frame %>%
                      dplyr::select(-objectid, -geometry, -ldscmetric))

Model Performance

Model Summary

We provide a summary of our model below. The output displays various indicators of the performance of the model, as well as the formula function. We can see that our numeric and logged variables are more significant to this model.

summary(calgaryModel)
## 
## Call:
## glm(formula = inundation ~ log_flow_acc + log_water_dist + slope + 
##     log_parks_dist + impervious, family = binomial(link = "logit"), 
##     data = calgaryTrain %>% as.data.frame %>% dplyr::select(-objectid, 
##         -geometry, -ldscmetric))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6732  -0.2629  -0.1984  -0.1544   3.0610  
## 
## Coefficients:
##                Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)     0.49229    0.33258   1.480               0.1388    
## log_flow_acc    0.06087    0.02743   2.219               0.0265 *  
## log_water_dist -0.63573    0.02698 -23.561 < 0.0000000000000002 ***
## slope1         -0.28200    0.26426  -1.067               0.2859    
## log_parks_dist -0.15708    0.02457  -6.392       0.000000000163 ***
## impervious1     0.59373    0.13538   4.386       0.000011557281 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2856.4  on 6849  degrees of freedom
## Residual deviance: 2092.2  on 6844  degrees of freedom
## AIC: 2104.2
## 
## Number of Fisher Scoring iterations: 6

Histogram of Probability Prediction Distributions

We ran the model on our test set to further examine its results and accuracy. Below we display a histogram of the predictions made on the test set. The plot provides a quick glance at the distribution of predictions of the probability for flooding in each cell. We would expect to see a larger quantity of predictions hovering around 0.0 and 0.1 for a strong model, but our model has more predictions at the lower end of the distribution and very few above 0.5 (and none at 1.0). This indicates that our model is better at predicting non-flooded areas than flooded ones. This may be because our dataset included a limited number of flooded cells with which we could train the model.

classProbs <- predict(calgaryModel, calgaryTest, type="response")

hist((classProbs), main = paste("Histogram of Predictions"), col = "#1372A4", xlab = "Inundation Probability") + plotTheme1()

Density of Probability Prediction Distribution

The density plot shown below echos the takeaways of the histogram plot above. Here, we that there is a higher density of predictions made around 0 (not flooded), while there are very few predictions for 1 (flooded). It also reinforces that our model predicts better on non-flooded cells. Based on the distribution and density plots, we decoded to use a threshold of 0.4 to designate a predicted probability as flooded.

testProbs <- data.frame(obs = as.numeric(calgaryTest$inundation),
                        pred = classProbs,
                        uniqueID = calgaryTest$uniqueID)
testProbs$obs <- ifelse(calgaryTest$inundation == 1, 1, 0)

ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) + geom_density() +
  facet_grid(obs ~ .) + xlab("Probability") + ylab("Density") + geom_vline(xintercept = .5) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_fill_manual(values = c("#A8C686", "#1372A4"),
                      labels = c("Not Flooded","Flooded"),
                                 name="") +
                      labs(title = "Distribution of Probabilities") + plotTheme1()

Confusion Matrix

This model has four possible outcomes based on the predicted and observed classification. 1. True Positive - Predicted inundation (1), Observed inundation (1) 2. True Negative - Predicted no indunation (0), Observed no inundation (0) 3. False Positive - Predicted inundation (1), Observed no inundation (0) 4. False Negative - Predicted no inundation (0), Observed inundation (1)

These are also known as a confusion matrix. We present our confusion matrix below. Our models outcomes were: 54 true positives, 2,717 true negatives, 102 false positives, and 61 false negatives. While the number of 1s overall is small compared to the number of 0s in the dataset, our model still predicts better for true negatives (a cell for which we predicted no inundation and observed no inundation). However, the model is 94% accurate.

testProbs$predClass  = ifelse(testProbs$pred > .4 ,1,0)

caret::confusionMatrix(reference = as.factor(testProbs$obs), 
                       data = as.factor(testProbs$predClass), 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2717  102
##          1   61   54
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.9355, 0.9525)
##     No Information Rate : 0.9468          
##     P-Value [Acc > NIR] : 0.73391         
##                                           
##                   Kappa : 0.3701          
##                                           
##  Mcnemar's Test P-Value : 0.00173         
##                                           
##             Sensitivity : 0.34615         
##             Specificity : 0.97804         
##          Pos Pred Value : 0.46957         
##          Neg Pred Value : 0.96382         
##              Prevalence : 0.05317         
##          Detection Rate : 0.01840         
##    Detection Prevalence : 0.03920         
##       Balanced Accuracy : 0.66210         
##                                           
##        'Positive' Class : 1               
## 

ROC Curve

We plot the true positive rate against the false positive rate in the ROC Curve plot below. The curve indicates the how well the model can distinguish the two classifications (inundation from no inundation) well. From the ROC Curve, we can find the area under the curve. Here, the area under the curve is 0.87, which is a good sign that our model is effective.

ggplot(testProbs, aes(d = obs, m = pred)) + 
  geom_roc(n.cuts = 50, labels = FALSE, color = "#1372A4") + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#2F4858") 

Area Under Curve

auc(testProbs$obs, testProbs$pred)
## Area under the curve: 0.8745

Cross Validation

We use cross validation to continue to evaluate our model. Cross validation allows us to understand the generalizability of our model. We ran a 100-fold cross validation, which split the dataset into equal sized subsets (or folds) and measured the average goodness of fit across all of the folds. Again, we see that the accuracy across folds is 94%.

#rm(calgary_slope.sf, flood.sf)
#gc()
ctrl <- trainControl(method = "cv", 
                     number = 100, 
                     savePredictions = TRUE,
                     returnData = FALSE)

cvdata <- fishnet_calgary %>% 
  as.data.frame() %>%
  dplyr::select(-objectid, -geometry, -ldscmetric, -uniqueID) %>%
  dplyr::mutate(inundation = as.factor(inundation),
                slope = as.factor(slope),
                impervious = as.factor(impervious))

cv_labels <- cvdata[,"inundation"]
cv_data <- cvdata[,-2]

cvFit <- caret::train(x=cv_data, y=cv_labels, 
               method="glm", family="binomial",
               trControl = ctrl, model = FALSE)

cvFit

We also plot the accuracy from the cross validation below.

ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) + 
  geom_histogram(fill="#1372A4") +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Accuracy",
       y="Count")+
  plotTheme5()

Predictions

We now will use our model to make predictions for Calgary. Below we plot our predicted flood probabilities and predicted flood classifications. We can see clear spatial patterns of higher predicted probabilities in the first map, while the second demonstrates our model’s ability to predict accurately on non-flooded cells.

calgary_allPredictions <- 
  predict(cvFit, fishnet_calgary, type="prob")[,2]
  
fishnet_calgary <- 
  cbind(fishnet_calgary,calgary_allPredictions) 

fishnet_calgary1 <- fishnet_calgary %>%
  mutate(PredClass = ifelse(calgary_allPredictions >= 0.4, 1, 0))

fishnet_calgary1 <- fishnet_calgary1 %>%
  mutate(Correct = ifelse(PredClass == 1, "1", "0"),
         Incorrect = ifelse(PredClass != 1, "1", "0"))

fishnet_calgary1$predprob <- fishnet_calgary1$calgary_allPredictions*100

grid.arrange(ncol = 2,
             ggplot() + 
    geom_sf(data= fishnet_calgary1, aes(fill=factor(ntile(predprob,5))), 
            colour=NA) +
    scale_fill_manual(values = c("#A8C686", "#C1DCEB","#86BBD8","#1372A4", "#2F4858"),
                      labels=c(0,20,40,60,80), 
                      name="Predicted\nProbabilities (%)") +
  mapTheme1() +
   labs(title="Predicted Flood Probabilities", subtitle = "Calgary, AL"),
  
  ggplot() + 
  geom_sf(data=fishnet_calgary1, aes(fill=as.factor(PredClass)), color = "transparent") +
  #geom_sf(data = fishnet_calgary1, fill = "transparent", color = "white")+
  scale_fill_manual(values = c("#A8C686", "#1372A4"),
                    labels = c("Not Flooded","Flooded"),
                    name = "Predicted \nClassification") +
  labs(title="Predicted Flood Outcomes", subtitle = "Calgary, AL") +
  mapTheme1())

Confusion Metrics

We also plot the confusion metrics we mentioned earlier on the map to see if there are any spatial patterns. Again, its clear that our model predicts really well on nonflooded cells, but struggles more with flooded cells. We can see that our false negatives follow the natural hydrological features in Calgary, but our model is not picking up on those cells.

fishnet_calgary1 %>%
  mutate(confResult=case_when(predprob < 40 & inundation==0 ~ "True Negative",
                              predprob >= 40 & inundation==1 ~ "True Positive",
                              predprob < 40 & inundation==1 ~ "False Negative",
                              predprob >= 40 & inundation==0 ~ "False Positive")) %>%
  ggplot()+
  geom_sf(aes(fill = confResult), color = "transparent")+
  scale_fill_manual(values = c("#1372A4","#C1DCEB","#A8C686","#2F4858"),
                    name="Outcomes")+
  labs(title="Confusion Metrics of Predicted Outcome", subtitle= "Calgary, AL") +
  mapTheme1()

Spatial Cross Validation

We also looked into the way our model performed spatially. By aggregating the prediction accuracies to the sector level, we can see that the model performs better in some sectors and worse in others. We can see that our model was more accurate for the sectors in the south of the city, as well as the factors that are located near hydrological features.

cal_sectors <- st_read("C:/Flood/geo_export_44b23231-c8ad-43eb-aef8-a969bf17457c.shp") %>% st_transform(crs = 'EPSG:3780')
cal_sectors <- st_intersection(fishnet_calgary1, cal_sectors)
cal_sectors <- cal_sectors %>% mutate(Correct = as.numeric(Correct),
                                  Incorrect = as.numeric(Incorrect))
cal_sectors <- cal_sectors %>% dplyr::select(sector, inundation, calgary_allPredictions, PredClass, Correct, Incorrect, -geometry) %>% 
  group_by(sector) %>% 
  summarise(maxPrediction = max(calgary_allPredictions),
            accuracy = ((sum(Correct) / (9784)) * 100))

ggplot() + 
    geom_sf(data= cal_sectors, aes(fill=factor(ntile(accuracy,4))), 
            colour=NA) +
    scale_fill_manual(values = c("#C1DCEB","#86BBD8","#1372A4", "#2F4858"),
                      labels=c(5,20,35,50), 
                      name="Accuracy (%)") +
  geom_sf(data=fishnet_calgary1  %>% 
               filter(inundation == 1), 
               fill="#E75E15",colour="NA") +
  mapTheme1() +
   labs(title="Prediction Accuracy by Sector", subtitle = "Observed Flooding in Orange \nCalgary, AL")

Predictions for Minneapolis, MN

To further test our model’s ability to generalize to new data, we used data from Minneapolis as a new test set and predicted flooding there. We followed a similar methodology of collecting data, creating a fishnet, feature engineering, and predicting outcomes. We used the same variables from the Calgary dataset for Minneapolis, which included logged flow accumulation (numeric), logged distance to nearest hydrological feature (numeric), logged distance to nearest park (numeric), impervious land cover (factor), and slope (factor).

We present our flood predictions for Minneapolis below. While we can not be certain that this is an accurate prediction, the predictions follow a similar spatial structure in terms of proximity to the major river, meaning our model is likely generalizable to other context areas.

fishnet_minneapolis2 <- fishnet_minneapolis %>%
  as.data.frame() %>%
  dplyr::select(-geometry, -uniqueID) %>%
  dplyr::mutate(  slope = as.factor(slope),
                impervious = as.factor(impervious))

#cv_data <- cvdata[sample(nrow(cv_data), 1761), ]
#cv_data <- cv_data %>%
 # dplyr::select(-calgary_allPredictions, -inundation)
#fishnet_minneapolis2 <- fishnet_minneapolis2 %>%
 # dplyr::select(slope, water_dist, parks_dist, flow_acc, impervious, log_flow_acc, log_parks_dist, #log_water_dist)

minneapolis_allPredictions <- 
  predict(cvFit, fishnet_minneapolis, type="prob")[,2]
  
fishnet_minneapolis <-
  cbind(fishnet_minneapolis,minneapolis_allPredictions)

fishnet_minneapolis$prediction <- (fishnet_minneapolis$minneapolis_allPredictions * 100)

fishnet_minneapolis$predclass <- ifelse(fishnet_minneapolis$prediction > 40, 1, 0)

grid.arrange(ncol = 2,
             ggplot() + 
    geom_sf(data= fishnet_minneapolis, aes(fill=factor(ntile(prediction,5))), 
            colour=NA) +
    scale_fill_manual(values = c("#A8C686", "#C1DCEB","#86BBD8","#1372A4", "#2F4858"),
                      labels=c(0,20,40,60,80), 
                      name="Predicted\nProbabilities (%)") +
  mapTheme1() +
   labs(title="Predicted Flood Probabilities", subtitle = "Minneapolis, MN"),
  
  ggplot() + 
  geom_sf(data=fishnet_minneapolis, aes(fill=as.factor(predclass)), color = "transparent") +
  #geom_sf(data = fishnet_calgary1, fill = "transparent", color = "white")+
  scale_fill_manual(values = c("#A8C686", "#1372A4"),
                    labels = c("Not Flooded","Flooded"),
                    name = "Predicted \nClassification") +
  labs(title="Predicted Flood Outcomes", subtitle = "Minneapolis, MN") +
  mapTheme1())

Discussion & Conclusion

As communities begin to plan for flooding, it is imperative that planners have accurate and generalizable models to predict areas at risk of inundation. Planners are in a unique position to use the technical knowledge that result from these models to communicate with communities about next steps. Our model attempts to provide planners with framework to ensure that cities are prepared to minimize flooding impacts for the most vulnerable communities.

This model has strong accuracy and generalizability. Our predictions for Calgary were 94% accuracy and our AUC value was 0.87, both of these factors indicate that our model is accurate. However, this model performs better on non-flooded areas better than flooded areas, which will be important to keep in mind moving forward. Our model is also generalizable as it performs well on external data, as seen in the cross validation and Minneapolis predictions.

We believe this model provides important information for communities about flood possibilities. However, there may be challenges with implementation as the model predicts better for non-flooded cells. We believe that with improved feature engineering, tuning the parameters of our fishnet, and exploring different probability thresholds, we can improve the predictive power of our model, and hopefully better prioritize at risk communities. Additionally, challenges related to executing emergency planning programs may arise due to funding issues or political and collaboration problems. We believe such models and tools can still be used to communicate important and technical stories about the risk of flooding to decision-makers and community members.

[1] https://community.fema.gov/ProtectiveActions/s/article/Flood-Impact#:~:text=Floods%20can%20cause%20power%2C%20water,problems%20including%20landslides%20and%20mudslides.