Introduction

This analysis examines Emil City’s home repair tax credit, which is offered by the city’s Department of Housing and Community Development. The tax credit is an important offering that encourages residents to repair their homes, resulting in increased home values for residents and neighbors and higher tax revenue for the city. Currently, only 11% of eligible homeowners take advantage of the $5,000 credit. The city wants to increase the number of households taking the credit by targeting its marketing to homeowners who qualify for the program. This analysis aims to minimize outreach, plus the associated costs, to homeowners who are not likely to take the credit while increasing outreach to those who are more likely to use the credit program.

To achieve these goals, the analysis uses a logistic regression to understand characteristics of homeowners who take the credit and estimate if other homeowners will take the credit. We determine the sensitivity and specificity of two models, cross validate these models, and construct a cost benefit analysis of each approach within the models.

Set up

We first perform the necessary set up here.

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

options(scipen=10000000)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(gridExtra)
library(grid)
library(ggplot2)

palette1 <- c("#4c72b0","#55a868","#dd8452","#c44c54")

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

housingsub <- read.csv("C:/Users/admin/Desktop/Public-Policy-Analytics-Landing-master/Public-Policy-Analytics-Landing-master/DATA/Chapter6/housingSubsidy.csv")

Exploratory Data Analysis

Below we present data visualizations of the continuous and categorical variables. The categorical variables are broken down by financial, personal information, and outreach details.

Key takeaways include that younger people are less likely to take the credit and that there are fluctuations in credit update based on inflation rate, unemployment rate and the amount of money spent on home repairs. For categorical features, we see that people with mortgages and full-time residence in Philadelphia are more likely to take the credit. Additionally, married people and those with high school degrees or above show higher update rates. Finally, people contacted by cellular took the credit more than people who were contacted by telephone.

housingsub %>%
    dplyr::select(y, age, spent_on_repairs, unemploy_rate, inflation_rate) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color = y), size= 1, fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = c("#55a868","#4c72b0")) +
    labs(title = "Feature Associations with the Likelihood of Credit",
         subtitle = "Continous Features")

housingsub %>%
    dplyr::select(y,taxLien, mortgage, taxbill_in_phl) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = c("#55a868","#4c72b0")) +
        labs(x="Credit", y="Value",
             title = "Feature Associations with the Likelihood of Credit",
             subtitle = "Financial Categorical Features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

housingsub %>%
    dplyr::select(y, education, marital, job) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = c("#55a868","#4c72b0")) +
        labs(x="Credit", y="Value",
             title = "Feature Associations with the Likelihood of Credit",
             subtitle = "Personal Categorical Features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

housingsub %>%
    dplyr::select(y, contact, month, day_of_week) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = c("#55a868","#4c72b0")) +
        labs(x="Credit", y="Value",
             title = "Feature Associations with the Likelihood of Credit",
             subtitle = "Contact Categorical Features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Feature Engineering

In this step, we engineer new features based on the exploratory data analysis results. We create new features that identify homeowner’s marriage and employment status. Using information gathered during the exploratory data analysis, we identified a few jobs that were more likely to take the credit. We create a new feature that identifies if a homeowner’s job falls into this category which includes administrative work, blue-collar work, technicians, service jobs, and management jobs. Similarly, we separate education levels into two categories: higher education, which includes professional courses and university degrees, and lower education, which includes basic and high school education. Finally, we look at the manners and times in which homeowners were contacts. We create new features for those contacted by cellular and those contacted by telephone. We also look at the days and month of which the homeowner was contacted.

#marriage status
housingsub <-
  housingsub %>%
  mutate(married = ifelse(str_detect(housingsub$marital, "married"), "Yes", "No"))

#job status
housingsub <- 
  housingsub %>% 
  mutate(unemployed = ifelse(str_detect(housingsub$job, "unemployed"), "Yes", "No"))
         
#jobsmorelikely
housingsub <- 
  housingsub %>% 
  mutate(creditjobs = ifelse(str_detect(housingsub$job, "admin.|blue-collar|technician|services|management"), "Yes", "No"))

#month
housingsub <- 
  housingsub %>% 
  group_by(month) %>% 
  summarize(totalSubsidies = sum(y_numeric), 
            n = n(), 
            MonthSubsidyAvg = 100*(totalSubsidies/n)) %>%
  dplyr::select(-n, -totalSubsidies) %>%
  right_join(housingsub, .)

#day of week 
housingsub <- 
  housingsub %>% 
  group_by(day_of_week) %>% 
  summarize(totalSubsidies = sum(y_numeric), 
            n = n(), 
            DaySubsidyAvg = 100*(totalSubsidies/n)) %>%
  dplyr::select(-n, -totalSubsidies) %>%
  right_join(housingsub, .)

#cellular contact
housingsub <-
  housingsub %>%
  mutate(cellular = ifelse(str_detect(housingsub$contact, "cellular"), "Yes", "No"))

#telephone contact
housingsub <-
  housingsub %>%
  mutate(telephone = ifelse(str_detect(housingsub$contact, "telephone"), "Yes", "No"))

#education - high school/under
housingsub <-
  housingsub %>%
  mutate(EduLow = ifelse(str_detect(housingsub$education, "basic.4y|basic.6y|basic.9y|high.school"), "Yes", "No"))

#education - higher ed
housingsub <-
  housingsub %>%
  mutate(EduHigh = ifelse(str_detect(housingsub$education, "professional.course|university.degree"), "Yes", "No"))

Data Training

Testing and Training Sets

The next step splits our data into a testing and training set. We separated 65% of the data into the training set and the remaining 35% of observations are put into the test set.

set.seed(3456)
trainIndex <- createDataPartition(housingsub$poutcome, p = .65,
                                  list = FALSE,
                                  times = 1)
housingTrain <- housingsub[ trainIndex,]
housingTest  <- housingsub[-trainIndex,]

Create Kitchen Sink Model

Now we create two different models. Our first model includes all of the original data and is called Kitchen Sink. We present the summary of the Kitchen Sink regression below.

#kitchen sink
kitchensink <- glm(y_numeric ~ .,
                   data= housingTrain %>% 
                     dplyr::select(-y, -job, -marital, -age, -campaign, -day_of_week, -month, -previous, -pdays, -inflation_rate, -taxLien, -education, -unemploy_rate, -poutcome, -contact, -cons.price.idx, -cons.conf.idx),
                   family="binomial" (link="logit"))

summary(kitchensink)

## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingTrain %>% dplyr::select(-y, -job, -marital, 
##         -age, -campaign, -day_of_week, -month, -previous, -pdays, 
##         -inflation_rate, -taxLien, -education, -unemploy_rate, 
##         -poutcome, -contact, -cons.price.idx, -cons.conf.idx))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6168  -0.4187  -0.3069  -0.2535   2.8418  
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)       54.29115376  5.20565644  10.429 < 0.0000000000000002 ***
## X                  0.00003924  0.00005669   0.692              0.48883    
## mortgageunknown   -0.45443757  0.49849005  -0.912              0.36196    
## mortgageyes       -0.18127728  0.13623214  -1.331              0.18330    
## taxbill_in_phlyes -0.03922259  0.18026488  -0.218              0.82775    
## spent_on_repairs  -0.01114628  0.00092171 -12.093 < 0.0000000000000002 ***
## marriedYes        -0.05354615  0.13576121  -0.394              0.69328    
## unemployedYes      0.13025185  0.39956611   0.326              0.74444    
## creditjobsYes     -0.04316569  0.17149490  -0.252              0.80127    
## MonthSubsidyAvg    0.01835215  0.00564194   3.253              0.00114 ** 
## DaySubsidyAvg      0.07049291  0.19094979   0.369              0.71200    
## cellularYes        0.63144930  0.17508938   3.606              0.00031 ***
## telephoneYes               NA          NA      NA                   NA    
## EduLowYes         -0.56559635  0.29581713  -1.912              0.05588 .  
## EduHighYes        -0.37166153  0.29572976  -1.257              0.20884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1899.3  on 2678  degrees of freedom
## Residual deviance: 1551.7  on 2665  degrees of freedom
## AIC: 1579.7
## 
## Number of Fisher Scoring iterations: 6

Create Engineered Features Model

The second model includes our new engineered features. We also present the summary of the engineered features regression below.

#features model
FeaturesModel <- glm(y_numeric ~ .,
                   data= housingTrain %>% 
                     dplyr::select(-y, -MonthSubsidyAvg, -DaySubsidyAvg, -cellular, -EduLow, -EduHigh, -married, -creditjobs, -unemployed),
                   family="binomial" (link="logit"))

summary(FeaturesModel)

## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingTrain %>% dplyr::select(-y, -MonthSubsidyAvg, 
##         -DaySubsidyAvg, -cellular, -EduLow, -EduHigh, -married, 
##         -creditjobs, -unemployed))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8313  -0.4017  -0.3130  -0.2410   2.9982  
## 
## Coefficients: (1 not defined because of singularities)
##                                   Estimate    Std. Error z value Pr(>|z|)    
## (Intercept)                  -171.92439311  133.44927787  -1.288 0.197637    
## X                               0.00002096    0.00006059   0.346 0.729389    
## age                             0.01158123    0.00851557   1.360 0.173828    
## jobblue-collar                 -0.35228663    0.28202830  -1.249 0.211622    
## jobentrepreneur                -0.50769844    0.50660091  -1.002 0.316263    
## jobhousemaid                   -0.00021812    0.49837898   0.000 0.999651    
## jobmanagement                  -0.51833359    0.30620048  -1.693 0.090495 .  
## jobretired                      0.06780271    0.37491192   0.181 0.856486    
## jobself-employed               -0.31914899    0.37483044  -0.851 0.394520    
## jobservices                    -0.10067130    0.29377045  -0.343 0.731834    
## jobstudent                     -0.00719975    0.42331289  -0.017 0.986430    
## jobtechnician                  -0.00614976    0.24137683  -0.025 0.979674    
## jobunemployed                  -0.15917603    0.43491842  -0.366 0.714372    
## jobunknown                     -0.85751158    0.88735333  -0.966 0.333859    
## maritalmarried                 -0.01608184    0.24154395  -0.067 0.946917    
## maritalsingle                   0.03863738    0.27683376   0.140 0.889001    
## maritalunknown                 -0.01807439    1.19917405  -0.015 0.987974    
## educationbasic.6y               0.11540077    0.42618484   0.271 0.786563    
## educationbasic.9y               0.14095145    0.33984037   0.415 0.678319    
## educationhigh.school            0.14174023    0.32431265   0.437 0.662076    
## educationilliterate           -12.43585266  535.41146425  -0.023 0.981469    
## educationprofessional.course   -0.02281027    0.35821969  -0.064 0.949228    
## educationuniversity.degree      0.23867481    0.32367945   0.737 0.460891    
## educationunknown                0.56104985    0.40869817   1.373 0.169823    
## taxLienunknown                  0.03359282    0.21324729   0.158 0.874827    
## taxLienyes                    -10.42974298  535.41147296  -0.019 0.984458    
## mortgageunknown                -0.71077680    0.56397045  -1.260 0.207558    
## mortgageyes                    -0.22504656    0.14576547  -1.544 0.122614    
## taxbill_in_phlyes              -0.06483539    0.19075180  -0.340 0.733936    
## contacttelephone               -1.09496479    0.29315832  -3.735 0.000188 ***
## monthaug                       -0.08550443    0.45416104  -0.188 0.850666    
## monthdec                        0.88069572    0.70827283   1.243 0.213705    
## monthjul                       -0.23175321    0.37739942  -0.614 0.539163    
## monthjun                        0.18157116    0.46762312   0.388 0.697805    
## monthmar                        1.67582108    0.55407641   3.025 0.002490 ** 
## monthmay                       -0.23594916    0.30931810  -0.763 0.445580    
## monthnov                       -0.33830812    0.43633165  -0.775 0.438135    
## monthoct                       -0.00493612    0.54775124  -0.009 0.992810    
## monthsep                       -0.10105178    0.65921330  -0.153 0.878168    
## day_of_weekmon                  0.10090424    0.23011277   0.438 0.661025    
## day_of_weekthu                  0.13989298    0.23117879   0.605 0.545093    
## day_of_weektue                  0.17862787    0.23198007   0.770 0.441292    
## day_of_weekwed                  0.12778461    0.23765605   0.538 0.590793    
## campaign                       -0.06638459    0.04223360  -1.572 0.115987    
## pdays                          -0.00012582    0.00075441  -0.167 0.867547    
## previous                        0.06764814    0.20976798   0.322 0.747081    
## poutcomenonexistent             0.45889546    0.32635216   1.406 0.159684    
## poutcomesuccess                 1.71155647    0.73397424   2.332 0.019706 *  
## unemploy_rate                  -0.76761700    0.50524113  -1.519 0.128685    
## cons.price.idx                  1.52994252    0.88122556   1.736 0.082537 .  
## cons.conf.idx                   0.05245813    0.02821061   1.860 0.062954 .  
## inflation_rate                 -0.24312752    0.44358275  -0.548 0.583624    
## spent_on_repairs                0.00564875    0.01077484   0.524 0.600102    
## telephoneYes                            NA            NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1899.3  on 2678  degrees of freedom
## Residual deviance: 1444.5  on 2626  degrees of freedom
## AIC: 1550.5
## 
## Number of Fisher Scoring iterations: 12

McFadden R-Squared

We also present each of the McFadden R-Squared values below. Higher McFadden values represent a stronger model. We can sese that the model with the engineered features is better.

pR2(kitchensink)

## fitting null model for pseudo-r2

##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -775.8415544 -949.6695202  347.6559315    0.1830405    0.1217033    0.2396438

pR2(FeaturesModel)

## fitting null model for pseudo-r2

##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -722.2540028 -949.6695202  454.8310347    0.2394681    0.1561465    0.3074654

Making Predictions

To better understand the goodness of fit of the models, we make predictions and look at the distribution of predicted probabilities. These models have four possible outcomes, which are represented by the numbers 0 and 1. Going forward, we use these numeric values to represent the possible outcomes. The possible outcomes include in the confusion matrix are: True Positive – We predicted the homeowner would take the credit and they did (Predicted 1, Observed 1) True Negative – We predicted the homeowner would not take the credit and they didn’t (Predicted 0, Observed 0) False Positive – We predicted the homeowner would take the credit, but they didn’t (Predicted 1, Observed 0) False Negative – We predicted the homeowner would not take the credit, but they did (Predicted 0, Observed 1)

Taking the credit is represented by 1 and not taking the credit is represented by 0. The plots below show the distribution of these outcomes. A strong model would have clustering of each value around its associated number (i.e., clustering of 1’s around 1). Neither of these models show clustering around 1, meaning the models are not accurate in predicting uptake of the credit.

kitchensink.testProbs <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                        Probs = predict(kitchensink, housingTest, type= "response"))

ggplot(kitchensink.testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = c("#55a868","#4c72b0")) +
  labs(x = "Took Credit", y = "Density of Probabilities",
       title = "Distribution of Predicted Probabilities by Observed Outcome", subtitle = "Kitchen Sink Model") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

features.testProbs <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                        Probs = predict(FeaturesModel, housingTest, type= "response"))


ggplot(features.testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = c("#55a868","#4c72b0")) +
  labs(x = "Took Credit", y = "Density of probabilities",
       title = "Distribution of Predicted Probabilities by Observed Outcome", subtitle = "Engineered Features Model") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

Confusion Matrix

Now we will explore the confusion matrix more. As mentioned previously, each model has four possible outcomes.

The Kitchen Sink model produces 21 true positives, 1,270 true negatives, 25 false positives, and 125 false negatives. The engineered features model produces 31 true positives, 1,270 true negatives, 24 false positives, and 115 false negatives. Neither model is particularly strong at predicting true positives, which is the city’s goal.

kitchensink.testProbs <- 
  kitchensink.testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(kitchensink.testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(kitchensink.testProbs$predOutcome, kitchensink.testProbs$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1270  125
##          1   24   21
##                                                
##                Accuracy : 0.8965               
##                  95% CI : (0.8796, 0.9118)     
##     No Information Rate : 0.8986               
##     P-Value [Acc > NIR] : 0.624                
##                                                
##                   Kappa : 0.1808               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000000002562
##                                                
##             Sensitivity : 0.14384              
##             Specificity : 0.98145              
##          Pos Pred Value : 0.46667              
##          Neg Pred Value : 0.91039              
##              Prevalence : 0.10139              
##          Detection Rate : 0.01458              
##    Detection Prevalence : 0.03125              
##       Balanced Accuracy : 0.56264              
##                                                
##        'Positive' Class : 1                    
##

features.testProbs <- 
 features.testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(features.testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(features.testProbs$predOutcome, features.testProbs$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1271  115
##          1   23   31
##                                               
##                Accuracy : 0.9042              
##                  95% CI : (0.8878, 0.9189)    
##     No Information Rate : 0.8986              
##     P-Value [Acc > NIR] : 0.2584              
##                                               
##                   Kappa : 0.27                
##                                               
##  Mcnemar's Test P-Value : 0.000000000000009451
##                                               
##             Sensitivity : 0.21233             
##             Specificity : 0.98223             
##          Pos Pred Value : 0.57407             
##          Neg Pred Value : 0.91703             
##              Prevalence : 0.10139             
##          Detection Rate : 0.02153             
##    Detection Prevalence : 0.03750             
##       Balanced Accuracy : 0.59728             
##                                               
##        'Positive' Class : 1                   
##

Cross Validation

The next step is to preform cross-validation on both models to understand the goodness of fit. We are interested in the receiver operating characteristic (or ROC), sensitivity (or true positive rate), and specificity (true negative rate).

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFitkitchensink <- train(y ~ ., data = housingsub %>% 
                    dplyr::select(-y_numeric, -job, -marital, -age, -campaign, -day_of_week, -month, -previous, -pdays, -inflation_rate, -taxLien, -education, -unemploy_rate, -poutcome, -contact, -cons.price.idx, -cons.conf.idx) %>%  
                        dplyr::mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFitkitchensink

## Generalized Linear Model 
## 
## 4119 samples
##   13 predictor
##    2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4079, 4079, 4078, 4078, 4077, 4078, ... 
## Resampling results:
## 
##   ROC        Sens    Spec     
##   0.7522928  0.1125  0.9849775

cvFitFeatures <- train(y ~ ., data = housingsub %>% 
                    dplyr::select(-y_numeric, -MonthSubsidyAvg, -DaySubsidyAvg, -cellular, -EduLow, -EduHigh, -married, -creditjobs, -unemployed, -spent_on_repairs, -inflation_rate) %>%  
                        dplyr::mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFitFeatures

## Generalized Linear Model 
## 
## 4119 samples
##   19 predictor
##    2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4078, 4078, 4078, 4077, 4077, ... 
## Resampling results:
## 
##   ROC      Sens    Spec    
##   0.76547  0.2245  0.982027

Goodness of Fit

The Kitchen Sink model has an ROC of 0.75, sensitivity of 0.11, and specificity of 0.98. The engineered features model has an ROC of 0.76, sensitivity of 0.22, and specificity of 0.98. Both models have strong ROC and specificity values, but low sensitivity values. This is problematic as the city’s goal is to increase true positives. Achieving this goal means building a model that produces a high sensitivity value. Below we plot these results.

dplyr::select(cvFitkitchensink$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFitkitchensink$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#55a868") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "black", linetype = 1, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Cross Validated Goodness of Fit Metrics: Kitchen Sink",
         subtitle = "Black line represents across-fold mean")

dplyr::select(cvFitFeatures$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFitFeatures$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#4c72b0") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "black", linetype = 1, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Cross Validated Goodness of Fit Metrics: Kitchen Sink",
         subtitle = "Black line represents across-fold mean")

ROC Curve

We will move forward with the engineered features model as it has a higher sensitivity value. In addition to the cross-validation visuals above, we plot the ROC curve below. We have also calculated the area under the curve.

auc(features.testProbs$Outcome, features.testProbs$Probs)

## Area under the curve: 0.753

The ROC curve is another indicator of goodness of fit. It visualizes the true and false positives produced by the model. The area under the curve, or AUC, tells us if the fit is useful. The AUC of this model is 75%, meaning the model is reasonability accurate.

ggplot(features.testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'black') +
  labs(title = "ROC Curve - Engineered Features")

Cost Benefit Analysis

In this section we conduct a cost benefit analysis for each outcome under the featured engineered model. The results are presented in the table below. The benefits resulting from true positives are calculated by taking the marking expenditures plus the cost of the tax credit for each homeowner that takes the credit.

The equation is (-$5,000 + $10,000 + $56,000) * (Count of Credit Update * 0.25) + (Count of Credit Update * -$2,850).

For true negatives, there is no expenditures and no benefits. The equation is (Count of Credit Update * $0).

For false negatives, there is no marketing costs even though the homework took the credit. The equation is (Count of Credit Update * $0).

For false positives, there are marketing costs but no benefits. The equation is (Count of Credit Update * -$2,850).

cost_benefit_table <-
  features.testProbs %>%
  count(predOutcome, Outcome) %>%
  summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
            True_Positive = sum(n[predOutcome==1 & Outcome==1]),
            False_Negative = sum(n[predOutcome==0 & Outcome==1]),
            False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
  gather(Variable, Count) %>%
  mutate(Benefit =
           case_when(Variable == "True_Negative"  ~ Count * 0,
                     Variable == "True_Positive"  ~ ((-5000+10000+56000) * (Count * .25)) + 
                       (-2850 * (Count)),
                     Variable == "False_Negative" ~ (-0) * Count,
                     Variable == "False_Positive" ~ (-2850) * Count)) %>%
  bind_cols(data.frame(Description = c(
    "Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.",
    "Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.",
    "Predicted homeowner would not take the credit, did not market allocate resources, they took the credit",
    "Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.")))

  kable(cost_benefit_table, caption= "Cost Benefit Analysis") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Cost Benefit Analysis
Variable	Count	Benefit	Description
True_Negative	1271	0	Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.
True_Positive	31	384400	Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.
False_Negative	115	0	Predicted homeowner would not take the credit, did not market allocate resources, they took the credit
False_Positive	23	-65550	Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.

Optimizing Thresholds

The next step is to identify the most optimal threshold for the cost benefit analysis. Previously we had used 50%. First, we calculate the benefits (i.e., revenue) associated with each threshold for each outcome. We plot the benefits for each outcome below. We see that there are higher benefits/revenue for true positive and lower benefits/revenue for false positives at the lower thresholds before the even out.

whichThreshold <- 
  iterateThresholds(
     data=features.testProbs, observedClass = Outcome, predictedProbs = Probs)

#whichThreshold[1:5,]

whichThreshold <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Revenue =
             case_when(Variable == "Count_TN"  ~ Count * 0,
                     Variable == "Count_TP"  ~ ((-5000+10000+56000) * (Count * .25)) + 
                       (-2850 * (Count)),
                     Variable == "Count_FN"  ~ (-0) * Count,
                     Variable == "Count_FP"  ~ (-2850) * Count))

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette1[c(5, 1:3)]) +    
  labs(title = "Revenue by Confusion Matrix Type and Threshold",
       y = "Revenue") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix"))

Identify Optimal Threshold

Now we will identify the most optimal threshold given the cost benefit analysis. We do this by finding the threshold at which the benefits are highest. According to the analysis, 19% is the most optimal threshold. We present the count, revenue, and losses associated with the 19% threshold below.

whichThreshold_benefits <- 
  whichThreshold %>% 
    mutate(actualCredit = ifelse(Variable == "Count_TP", (Count * .5),
                         ifelse(Variable == "Count_FN", Count, 0))) %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue),
            Actual_Credit_Count = sum(actualCredit),
            Actual_Credit_Rate = sum(actualCredit) / sum(Count) ,
            Actual_Credit_Revenue_Loss =  sum(actualCredit * -2850 + -5000))

whichThreshold_benefits[1:5,]

## # A tibble: 5 × 5
##   Threshold  Revenue Actual_Credit_Count Actual_Credit_Rate Actual_Credit_Reve…¹
##       <dbl>    <dbl>               <dbl>              <dbl>                <dbl>
## 1      0.01 -1857550                73               0.0507              -228050
## 2      0.02 -1664750                73.5             0.0510              -229475
## 3      0.03 -1364100                78.5             0.0545              -243725
## 4      0.04  -979800                83               0.0576              -256550
## 5      0.05  -594500                87               0.0604              -267950
## # … with abbreviated variable name ¹Actual_Credit_Revenue_Loss

BestThresh <- whichThreshold_benefits[which.max(whichThreshold_benefits$Revenue),]

kable(BestThresh, caption="Optimal Threshold") %>% 
   kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Optimal Threshold
Threshold	Revenue	Actual_Credit_Count	Actual_Credit_Rate	Actual_Credit_Revenue_Loss
0.18	673800	105.5	0.0732639	-320675

Revenue and Credit Counts by Threshold Plot

We plot the revenue and credit counts by threshold below. We include a black line at the optimal threshold where we can see the revennue begins to drop. This is due to the associated costs with increased marketing and credit update.

whichThreshold_benefits$totcount <- 
  whichThreshold %>% 
  group_by(Threshold) %>% na.omit() %>%
  summarize(Total_Count = sum(Count))

whichThreshold_benefits$rev  <- 
  whichThreshold %>% 
  group_by(Threshold) %>% na.omit() %>%
  summarize(Total_Revenue = sum(Revenue))

grid.arrange(ncol = 2,
  ggplot(whichThreshold_benefits)+
  geom_line(aes(x = Threshold, y = Revenue), color = "#4c72b0", size = 1)+
  geom_vline(xintercept = 0.19)+
  labs(title = "Revenue By Threshold",
       subtitle = "Black line represents optimal threshold"), 
  ggplot(whichThreshold_benefits)+
  geom_line(aes(x = Threshold, y = Actual_Credit_Count), color = "#55a868", size = 1)+
  geom_vline(xintercept = 0.19)+
  labs(title = "Number of Credits By Threshold",
       subtitle = "Black line represents optimal threshold")
)

Revenue and Credit Counts by Threshold Table

We present a table on the revenue and credit counts by the original and optimal thresholds below. We see that the revenue at 19% uptake is higher than the revenue at 50%, maximizing the increased home values and city tax revenue.

fiftythresh <- filter(whichThreshold_benefits, whichThreshold_benefits$Threshold== .50)%>%
  dplyr::select(Threshold, Revenue, Actual_Credit_Count)

optimalthresh <- filter(whichThreshold_benefits, whichThreshold_benefits$Threshold== .19)%>%
  dplyr::select(Threshold, Revenue, Actual_Credit_Count)

Final_Table <- rbind(optimalthresh, fiftythresh)

kable(Final_Table, caption="Total Revenue and Count of Credits at Thresholds") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Total Revenue and Count of Credits at Thresholds
Threshold	Revenue	Actual_Credit_Count
0.19	663250	106.5
0.50	318850	130.5

Conclusion

In this analysis, we use logistic regression models to predict the update of a housing repair tax credit with the ultimate goals of increasing the number of households taking the credit and minimizing costs to the city. We aimed to increase the sensitivity of our model by engineering features that would best predict whether a household would utilize the credit. Using these results, we produced a cost benefit analysis for each possible outcome and identified the most optimal threshold.

While the new model did produce a higher sensitivity rate, we do not recommend this model be put into production in Emil City. The data used to build the model was very limited and the engineered features did not notably increase any indicators of accuracy or goodness of fit. Overall, both models were not able to predict true positives very well. In the future we would recommend incorporating more data, especially any spatial data or information related to income, education, and job status. It would also be interesting to add variables that take energy/utility costs into account as higher energy burdens may encourage a homeowner to take advantage of the repair credit. Similarly, we recommend adding data on the occupants of the property. A property that is owner-occupied may be more likely to take the credit compared to one that is renter-occupied.

In order to ensure marketing materials result in a higher response rate, we recommend incorporating the suggested data elements above. We also suggest diversifying the marketing strategies by implementing online or social media advertising.

Introduction

Set up

We first perform the necessary set up here.

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

options(scipen=10000000)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(gridExtra)
library(grid)
library(ggplot2)

palette1 <- c("#4c72b0","#55a868","#dd8452","#c44c54")

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

housingsub <- read.csv("C:/Users/admin/Desktop/Public-Policy-Analytics-Landing-master/Public-Policy-Analytics-Landing-master/DATA/Chapter6/housingSubsidy.csv")

Exploratory Data Analysis

Below we present data visualizations of the continuous and categorical variables. The categorical variables are broken down by financial, personal information, and outreach details.

housingsub %>%
    dplyr::select(y, age, spent_on_repairs, unemploy_rate, inflation_rate) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color = y), size= 1, fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = c("#55a868","#4c72b0")) +
    labs(title = "Feature Associations with the Likelihood of Credit",
         subtitle = "Continous Features")

housingsub %>%
    dplyr::select(y,taxLien, mortgage, taxbill_in_phl) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = c("#55a868","#4c72b0")) +
        labs(x="Credit", y="Value",
             title = "Feature Associations with the Likelihood of Credit",
             subtitle = "Financial Categorical Features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

housingsub %>%
    dplyr::select(y, education, marital, job) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = c("#55a868","#4c72b0")) +
        labs(x="Credit", y="Value",
             title = "Feature Associations with the Likelihood of Credit",
             subtitle = "Personal Categorical Features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

housingsub %>%
    dplyr::select(y, contact, month, day_of_week) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = c("#55a868","#4c72b0")) +
        labs(x="Credit", y="Value",
             title = "Feature Associations with the Likelihood of Credit",
             subtitle = "Contact Categorical Features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Feature Engineering

#marriage status
housingsub <-
  housingsub %>%
  mutate(married = ifelse(str_detect(housingsub$marital, "married"), "Yes", "No"))

#job status
housingsub <- 
  housingsub %>% 
  mutate(unemployed = ifelse(str_detect(housingsub$job, "unemployed"), "Yes", "No"))
         
#jobsmorelikely
housingsub <- 
  housingsub %>% 
  mutate(creditjobs = ifelse(str_detect(housingsub$job, "admin.|blue-collar|technician|services|management"), "Yes", "No"))

#month
housingsub <- 
  housingsub %>% 
  group_by(month) %>% 
  summarize(totalSubsidies = sum(y_numeric), 
            n = n(), 
            MonthSubsidyAvg = 100*(totalSubsidies/n)) %>%
  dplyr::select(-n, -totalSubsidies) %>%
  right_join(housingsub, .)

#day of week 
housingsub <- 
  housingsub %>% 
  group_by(day_of_week) %>% 
  summarize(totalSubsidies = sum(y_numeric), 
            n = n(), 
            DaySubsidyAvg = 100*(totalSubsidies/n)) %>%
  dplyr::select(-n, -totalSubsidies) %>%
  right_join(housingsub, .)

#cellular contact
housingsub <-
  housingsub %>%
  mutate(cellular = ifelse(str_detect(housingsub$contact, "cellular"), "Yes", "No"))

#telephone contact
housingsub <-
  housingsub %>%
  mutate(telephone = ifelse(str_detect(housingsub$contact, "telephone"), "Yes", "No"))

#education - high school/under
housingsub <-
  housingsub %>%
  mutate(EduLow = ifelse(str_detect(housingsub$education, "basic.4y|basic.6y|basic.9y|high.school"), "Yes", "No"))

#education - higher ed
housingsub <-
  housingsub %>%
  mutate(EduHigh = ifelse(str_detect(housingsub$education, "professional.course|university.degree"), "Yes", "No"))

Data Training

Testing and Training Sets

The next step splits our data into a testing and training set. We separated 65% of the data into the training set and the remaining 35% of observations are put into the test set.

set.seed(3456)
trainIndex <- createDataPartition(housingsub$poutcome, p = .65,
                                  list = FALSE,
                                  times = 1)
housingTrain <- housingsub[ trainIndex,]
housingTest  <- housingsub[-trainIndex,]

Create Kitchen Sink Model

Now we create two different models. Our first model includes all of the original data and is called Kitchen Sink. We present the summary of the Kitchen Sink regression below.

#kitchen sink
kitchensink <- glm(y_numeric ~ .,
                   data= housingTrain %>% 
                     dplyr::select(-y, -job, -marital, -age, -campaign, -day_of_week, -month, -previous, -pdays, -inflation_rate, -taxLien, -education, -unemploy_rate, -poutcome, -contact, -cons.price.idx, -cons.conf.idx),
                   family="binomial" (link="logit"))

summary(kitchensink)

## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingTrain %>% dplyr::select(-y, -job, -marital, 
##         -age, -campaign, -day_of_week, -month, -previous, -pdays, 
##         -inflation_rate, -taxLien, -education, -unemploy_rate, 
##         -poutcome, -contact, -cons.price.idx, -cons.conf.idx))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6168  -0.4187  -0.3069  -0.2535   2.8418  
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)       54.29115376  5.20565644  10.429 < 0.0000000000000002 ***
## X                  0.00003924  0.00005669   0.692              0.48883    
## mortgageunknown   -0.45443757  0.49849005  -0.912              0.36196    
## mortgageyes       -0.18127728  0.13623214  -1.331              0.18330    
## taxbill_in_phlyes -0.03922259  0.18026488  -0.218              0.82775    
## spent_on_repairs  -0.01114628  0.00092171 -12.093 < 0.0000000000000002 ***
## marriedYes        -0.05354615  0.13576121  -0.394              0.69328    
## unemployedYes      0.13025185  0.39956611   0.326              0.74444    
## creditjobsYes     -0.04316569  0.17149490  -0.252              0.80127    
## MonthSubsidyAvg    0.01835215  0.00564194   3.253              0.00114 ** 
## DaySubsidyAvg      0.07049291  0.19094979   0.369              0.71200    
## cellularYes        0.63144930  0.17508938   3.606              0.00031 ***
## telephoneYes               NA          NA      NA                   NA    
## EduLowYes         -0.56559635  0.29581713  -1.912              0.05588 .  
## EduHighYes        -0.37166153  0.29572976  -1.257              0.20884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1899.3  on 2678  degrees of freedom
## Residual deviance: 1551.7  on 2665  degrees of freedom
## AIC: 1579.7
## 
## Number of Fisher Scoring iterations: 6

Create Engineered Features Model

The second model includes our new engineered features. We also present the summary of the engineered features regression below.

#features model
FeaturesModel <- glm(y_numeric ~ .,
                   data= housingTrain %>% 
                     dplyr::select(-y, -MonthSubsidyAvg, -DaySubsidyAvg, -cellular, -EduLow, -EduHigh, -married, -creditjobs, -unemployed),
                   family="binomial" (link="logit"))

summary(FeaturesModel)

## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingTrain %>% dplyr::select(-y, -MonthSubsidyAvg, 
##         -DaySubsidyAvg, -cellular, -EduLow, -EduHigh, -married, 
##         -creditjobs, -unemployed))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8313  -0.4017  -0.3130  -0.2410   2.9982  
## 
## Coefficients: (1 not defined because of singularities)
##                                   Estimate    Std. Error z value Pr(>|z|)    
## (Intercept)                  -171.92439311  133.44927787  -1.288 0.197637    
## X                               0.00002096    0.00006059   0.346 0.729389    
## age                             0.01158123    0.00851557   1.360 0.173828    
## jobblue-collar                 -0.35228663    0.28202830  -1.249 0.211622    
## jobentrepreneur                -0.50769844    0.50660091  -1.002 0.316263    
## jobhousemaid                   -0.00021812    0.49837898   0.000 0.999651    
## jobmanagement                  -0.51833359    0.30620048  -1.693 0.090495 .  
## jobretired                      0.06780271    0.37491192   0.181 0.856486    
## jobself-employed               -0.31914899    0.37483044  -0.851 0.394520    
## jobservices                    -0.10067130    0.29377045  -0.343 0.731834    
## jobstudent                     -0.00719975    0.42331289  -0.017 0.986430    
## jobtechnician                  -0.00614976    0.24137683  -0.025 0.979674    
## jobunemployed                  -0.15917603    0.43491842  -0.366 0.714372    
## jobunknown                     -0.85751158    0.88735333  -0.966 0.333859    
## maritalmarried                 -0.01608184    0.24154395  -0.067 0.946917    
## maritalsingle                   0.03863738    0.27683376   0.140 0.889001    
## maritalunknown                 -0.01807439    1.19917405  -0.015 0.987974    
## educationbasic.6y               0.11540077    0.42618484   0.271 0.786563    
## educationbasic.9y               0.14095145    0.33984037   0.415 0.678319    
## educationhigh.school            0.14174023    0.32431265   0.437 0.662076    
## educationilliterate           -12.43585266  535.41146425  -0.023 0.981469    
## educationprofessional.course   -0.02281027    0.35821969  -0.064 0.949228    
## educationuniversity.degree      0.23867481    0.32367945   0.737 0.460891    
## educationunknown                0.56104985    0.40869817   1.373 0.169823    
## taxLienunknown                  0.03359282    0.21324729   0.158 0.874827    
## taxLienyes                    -10.42974298  535.41147296  -0.019 0.984458    
## mortgageunknown                -0.71077680    0.56397045  -1.260 0.207558    
## mortgageyes                    -0.22504656    0.14576547  -1.544 0.122614    
## taxbill_in_phlyes              -0.06483539    0.19075180  -0.340 0.733936    
## contacttelephone               -1.09496479    0.29315832  -3.735 0.000188 ***
## monthaug                       -0.08550443    0.45416104  -0.188 0.850666    
## monthdec                        0.88069572    0.70827283   1.243 0.213705    
## monthjul                       -0.23175321    0.37739942  -0.614 0.539163    
## monthjun                        0.18157116    0.46762312   0.388 0.697805    
## monthmar                        1.67582108    0.55407641   3.025 0.002490 ** 
## monthmay                       -0.23594916    0.30931810  -0.763 0.445580    
## monthnov                       -0.33830812    0.43633165  -0.775 0.438135    
## monthoct                       -0.00493612    0.54775124  -0.009 0.992810    
## monthsep                       -0.10105178    0.65921330  -0.153 0.878168    
## day_of_weekmon                  0.10090424    0.23011277   0.438 0.661025    
## day_of_weekthu                  0.13989298    0.23117879   0.605 0.545093    
## day_of_weektue                  0.17862787    0.23198007   0.770 0.441292    
## day_of_weekwed                  0.12778461    0.23765605   0.538 0.590793    
## campaign                       -0.06638459    0.04223360  -1.572 0.115987    
## pdays                          -0.00012582    0.00075441  -0.167 0.867547    
## previous                        0.06764814    0.20976798   0.322 0.747081    
## poutcomenonexistent             0.45889546    0.32635216   1.406 0.159684    
## poutcomesuccess                 1.71155647    0.73397424   2.332 0.019706 *  
## unemploy_rate                  -0.76761700    0.50524113  -1.519 0.128685    
## cons.price.idx                  1.52994252    0.88122556   1.736 0.082537 .  
## cons.conf.idx                   0.05245813    0.02821061   1.860 0.062954 .  
## inflation_rate                 -0.24312752    0.44358275  -0.548 0.583624    
## spent_on_repairs                0.00564875    0.01077484   0.524 0.600102    
## telephoneYes                            NA            NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1899.3  on 2678  degrees of freedom
## Residual deviance: 1444.5  on 2626  degrees of freedom
## AIC: 1550.5
## 
## Number of Fisher Scoring iterations: 12

McFadden R-Squared

We also present each of the McFadden R-Squared values below. Higher McFadden values represent a stronger model. We can sese that the model with the engineered features is better.

pR2(kitchensink)

## fitting null model for pseudo-r2

##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -775.8415544 -949.6695202  347.6559315    0.1830405    0.1217033    0.2396438

pR2(FeaturesModel)

## fitting null model for pseudo-r2

##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -722.2540028 -949.6695202  454.8310347    0.2394681    0.1561465    0.3074654

Making Predictions

kitchensink.testProbs <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                        Probs = predict(kitchensink, housingTest, type= "response"))

ggplot(kitchensink.testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = c("#55a868","#4c72b0")) +
  labs(x = "Took Credit", y = "Density of Probabilities",
       title = "Distribution of Predicted Probabilities by Observed Outcome", subtitle = "Kitchen Sink Model") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

features.testProbs <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                        Probs = predict(FeaturesModel, housingTest, type= "response"))


ggplot(features.testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = c("#55a868","#4c72b0")) +
  labs(x = "Took Credit", y = "Density of probabilities",
       title = "Distribution of Predicted Probabilities by Observed Outcome", subtitle = "Engineered Features Model") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

Confusion Matrix

Now we will explore the confusion matrix more. As mentioned previously, each model has four possible outcomes.

kitchensink.testProbs <- 
  kitchensink.testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(kitchensink.testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(kitchensink.testProbs$predOutcome, kitchensink.testProbs$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1270  125
##          1   24   21
##                                                
##                Accuracy : 0.8965               
##                  95% CI : (0.8796, 0.9118)     
##     No Information Rate : 0.8986               
##     P-Value [Acc > NIR] : 0.624                
##                                                
##                   Kappa : 0.1808               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000000002562
##                                                
##             Sensitivity : 0.14384              
##             Specificity : 0.98145              
##          Pos Pred Value : 0.46667              
##          Neg Pred Value : 0.91039              
##              Prevalence : 0.10139              
##          Detection Rate : 0.01458              
##    Detection Prevalence : 0.03125              
##       Balanced Accuracy : 0.56264              
##                                                
##        'Positive' Class : 1                    
##

features.testProbs <- 
 features.testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(features.testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(features.testProbs$predOutcome, features.testProbs$Outcome, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1271  115
##          1   23   31
##                                               
##                Accuracy : 0.9042              
##                  95% CI : (0.8878, 0.9189)    
##     No Information Rate : 0.8986              
##     P-Value [Acc > NIR] : 0.2584              
##                                               
##                   Kappa : 0.27                
##                                               
##  Mcnemar's Test P-Value : 0.000000000000009451
##                                               
##             Sensitivity : 0.21233             
##             Specificity : 0.98223             
##          Pos Pred Value : 0.57407             
##          Neg Pred Value : 0.91703             
##              Prevalence : 0.10139             
##          Detection Rate : 0.02153             
##    Detection Prevalence : 0.03750             
##       Balanced Accuracy : 0.59728             
##                                               
##        'Positive' Class : 1                   
##

Cross Validation

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFitkitchensink <- train(y ~ ., data = housingsub %>% 
                    dplyr::select(-y_numeric, -job, -marital, -age, -campaign, -day_of_week, -month, -previous, -pdays, -inflation_rate, -taxLien, -education, -unemploy_rate, -poutcome, -contact, -cons.price.idx, -cons.conf.idx) %>%  
                        dplyr::mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFitkitchensink

## Generalized Linear Model 
## 
## 4119 samples
##   13 predictor
##    2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4079, 4079, 4078, 4078, 4077, 4078, ... 
## Resampling results:
## 
##   ROC        Sens    Spec     
##   0.7522928  0.1125  0.9849775

cvFitFeatures <- train(y ~ ., data = housingsub %>% 
                    dplyr::select(-y_numeric, -MonthSubsidyAvg, -DaySubsidyAvg, -cellular, -EduLow, -EduHigh, -married, -creditjobs, -unemployed, -spent_on_repairs, -inflation_rate) %>%  
                        dplyr::mutate(y = ifelse(y=="yes","c1.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFitFeatures

## Generalized Linear Model 
## 
## 4119 samples
##   19 predictor
##    2 classes: 'c1.yes', 'c2.no' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4078, 4078, 4078, 4077, 4077, ... 
## Resampling results:
## 
##   ROC      Sens    Spec    
##   0.76547  0.2245  0.982027

Goodness of Fit

dplyr::select(cvFitkitchensink$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFitkitchensink$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#55a868") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "black", linetype = 1, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Cross Validated Goodness of Fit Metrics: Kitchen Sink",
         subtitle = "Black line represents across-fold mean")

dplyr::select(cvFitFeatures$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFitFeatures$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#4c72b0") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "black", linetype = 1, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Cross Validated Goodness of Fit Metrics: Kitchen Sink",
         subtitle = "Black line represents across-fold mean")

ROC Curve

auc(features.testProbs$Outcome, features.testProbs$Probs)

## Area under the curve: 0.753

ggplot(features.testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'black') +
  labs(title = "ROC Curve - Engineered Features")

Cost Benefit Analysis

The equation is (-$5,000 + $10,000 + $56,000) * (Count of Credit Update * 0.25) + (Count of Credit Update * -$2,850).

For true negatives, there is no expenditures and no benefits. The equation is (Count of Credit Update * $0).

For false negatives, there is no marketing costs even though the homework took the credit. The equation is (Count of Credit Update * $0).

For false positives, there are marketing costs but no benefits. The equation is (Count of Credit Update * -$2,850).

cost_benefit_table <-
  features.testProbs %>%
  count(predOutcome, Outcome) %>%
  summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
            True_Positive = sum(n[predOutcome==1 & Outcome==1]),
            False_Negative = sum(n[predOutcome==0 & Outcome==1]),
            False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
  gather(Variable, Count) %>%
  mutate(Benefit =
           case_when(Variable == "True_Negative"  ~ Count * 0,
                     Variable == "True_Positive"  ~ ((-5000+10000+56000) * (Count * .25)) + 
                       (-2850 * (Count)),
                     Variable == "False_Negative" ~ (-0) * Count,
                     Variable == "False_Positive" ~ (-2850) * Count)) %>%
  bind_cols(data.frame(Description = c(
    "Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.",
    "Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.",
    "Predicted homeowner would not take the credit, did not market allocate resources, they took the credit",
    "Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.")))

  kable(cost_benefit_table, caption= "Cost Benefit Analysis") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Cost Benefit Analysis
Variable	Count	Benefit	Description
True_Negative	1271	0	Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.
True_Positive	31	384400	Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.
False_Negative	115	0	Predicted homeowner would not take the credit, did not market allocate resources, they took the credit
False_Positive	23	-65550	Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.

Optimizing Thresholds

whichThreshold <- 
  iterateThresholds(
     data=features.testProbs, observedClass = Outcome, predictedProbs = Probs)

#whichThreshold[1:5,]

whichThreshold <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Revenue =
             case_when(Variable == "Count_TN"  ~ Count * 0,
                     Variable == "Count_TP"  ~ ((-5000+10000+56000) * (Count * .25)) + 
                       (-2850 * (Count)),
                     Variable == "Count_FN"  ~ (-0) * Count,
                     Variable == "Count_FP"  ~ (-2850) * Count))

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette1[c(5, 1:3)]) +    
  labs(title = "Revenue by Confusion Matrix Type and Threshold",
       y = "Revenue") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix"))

Identify Optimal Threshold

whichThreshold_benefits <- 
  whichThreshold %>% 
    mutate(actualCredit = ifelse(Variable == "Count_TP", (Count * .5),
                         ifelse(Variable == "Count_FN", Count, 0))) %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue),
            Actual_Credit_Count = sum(actualCredit),
            Actual_Credit_Rate = sum(actualCredit) / sum(Count) ,
            Actual_Credit_Revenue_Loss =  sum(actualCredit * -2850 + -5000))

whichThreshold_benefits[1:5,]

## # A tibble: 5 × 5
##   Threshold  Revenue Actual_Credit_Count Actual_Credit_Rate Actual_Credit_Reve…¹
##       <dbl>    <dbl>               <dbl>              <dbl>                <dbl>
## 1      0.01 -1857550                73               0.0507              -228050
## 2      0.02 -1664750                73.5             0.0510              -229475
## 3      0.03 -1364100                78.5             0.0545              -243725
## 4      0.04  -979800                83               0.0576              -256550
## 5      0.05  -594500                87               0.0604              -267950
## # … with abbreviated variable name ¹Actual_Credit_Revenue_Loss

BestThresh <- whichThreshold_benefits[which.max(whichThreshold_benefits$Revenue),]

kable(BestThresh, caption="Optimal Threshold") %>% 
   kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Optimal Threshold
Threshold	Revenue	Actual_Credit_Count	Actual_Credit_Rate	Actual_Credit_Revenue_Loss
0.18	673800	105.5	0.0732639	-320675

Revenue and Credit Counts by Threshold Plot

whichThreshold_benefits$totcount <- 
  whichThreshold %>% 
  group_by(Threshold) %>% na.omit() %>%
  summarize(Total_Count = sum(Count))

whichThreshold_benefits$rev  <- 
  whichThreshold %>% 
  group_by(Threshold) %>% na.omit() %>%
  summarize(Total_Revenue = sum(Revenue))

grid.arrange(ncol = 2,
  ggplot(whichThreshold_benefits)+
  geom_line(aes(x = Threshold, y = Revenue), color = "#4c72b0", size = 1)+
  geom_vline(xintercept = 0.19)+
  labs(title = "Revenue By Threshold",
       subtitle = "Black line represents optimal threshold"), 
  ggplot(whichThreshold_benefits)+
  geom_line(aes(x = Threshold, y = Actual_Credit_Count), color = "#55a868", size = 1)+
  geom_vline(xintercept = 0.19)+
  labs(title = "Number of Credits By Threshold",
       subtitle = "Black line represents optimal threshold")
)

Revenue and Credit Counts by Threshold Table

fiftythresh <- filter(whichThreshold_benefits, whichThreshold_benefits$Threshold== .50)%>%
  dplyr::select(Threshold, Revenue, Actual_Credit_Count)

optimalthresh <- filter(whichThreshold_benefits, whichThreshold_benefits$Threshold== .19)%>%
  dplyr::select(Threshold, Revenue, Actual_Credit_Count)

Final_Table <- rbind(optimalthresh, fiftythresh)

kable(Final_Table, caption="Total Revenue and Count of Credits at Thresholds") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Total Revenue and Count of Credits at Thresholds
Threshold	Revenue	Actual_Credit_Count
0.19	663250	106.5
0.50	318850	130.5

Targeting A Housing Subsidy

Kate Tanabe

November 4, 2022

Introduction

Set up

Exploratory Data Analysis

Feature Engineering

Data Training

Testing and Training Sets

Create Kitchen Sink Model

Create Engineered Features Model

McFadden R-Squared

Making Predictions

Confusion Matrix

Cross Validation

Goodness of Fit

ROC Curve

Cost Benefit Analysis

Optimizing Thresholds

Identify Optimal Threshold

Revenue and Credit Counts by Threshold Plot

Revenue and Credit Counts by Threshold Table

Conclusion

Targeting A Housing Subsidy

Kate Tanabe

November 4, 2022

Introduction

Set up

Exploratory Data Analysis

Feature Engineering

Data Training

Testing and Training Sets

Create Kitchen Sink Model

Create Engineered Features Model

McFadden R-Squared

Making Predictions

Confusion Matrix

Cross Validation

Goodness of Fit

ROC Curve

Cost Benefit Analysis

Optimizing Thresholds

Identify Optimal Threshold

Revenue and Credit Counts by Threshold Plot

Revenue and Credit Counts by Threshold Table

Conclusion