Predicting Housing Prices - Mecklenburg County, NC

Introduction

This project explores the effect that various independent variables have on the sales price of a property. More specifically, this project uses a linear regression model in an attempt to predict the exact price of properties in Mecklenburg County, North Carolina based on a property’s relationship to the variables selected by our team.

This exercise is difficult because making an accurate prediction required determining which variables were most likely to have the greatest impact on a property’s value. This step required exploratory data analysis of several variables before identifying which would be most effective in the model. Once this decision was made, our team engineered several features so the data fit the needs of our model. This prepared the data for subsequent steps, including assessing the correlation between our variables and sales price and a K-Nearest Neighbor (KNN) analysis to measure a property’s exposure to several amenities. Lastly, our modeling strategy included a model estimation and validation phase to determine the model’s generalizability.

Following our Ordinary Least Squares (OLS) regression, we found that our model accounted for 68% of the variation in home prices. Following a k-fold cross-validation test, our team determined that our model was somewhat generalizable, as there was only moderate variation across folds. This suggests that our model could be applied to other data sets in the future.

Data

This project used data from Mecklenburg County to predict property values. Several variables used for this prediction were derived from the “student data set”. Variables derived from this data set included a property’s age, air conditioning type, building grade, heated area, sale year, and sale month. In addition, grocery store and capital improvement project spatial data were gathered from the Mecklenburg County Open Data portal and included in the model.

Table 1: Summary Table

Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
NCData.sf.bedrooms 46183 3.51 0.943 0 3 4 65
NCData.sf.fullbaths 46183 2.282 0.825 0 2 3 9
NCData.sf.halfbaths 46183 0.596 0.528 0 0 1 4
NCData.sf.percentheated 46183 0.219 0.13 0 0.121 0.297 1.634
NCData.sf.basement 46176 0.019 0.136 0 0 0 1
NCData.sf.ac 46151 0.967 0.18 0 1 1 1
NCData.sf.totalac 46183 1.772 106.717 0 0 0.218 9757.44
NCData.sf.year 46183
… 2020 16466 35.7%
… 2021 18601 40.3%
… 2022 11115 24.1%
… 2028 1 0%
NCData.sf.grade 46168
… 0 22 0%
… 1 670 1.5%
… 2 31211 67.6%
… 3 9604 20.8%
… 4 3513 7.6%
… 5 913 2%
… 6 235 0.5%
NCData.sf.numfirepla 46183 0.787 0.465 0 1 1 7
NCData.sf.Age 46183 28.614 31.836 0 7 43 2022
NCData.sf.units 46183 0.981 0.969 0 1 1 205
NCData.sf.month 46183
… Jan 3183 6.9%
… Feb 3156 6.8%
… Mar 4646 10.1%
… Apr 4424 9.6%
… May 4338 9.4%
… Jun 5041 10.9%
… Jul 4837 10.5%
… Aug 3843 8.3%
… Sep 3290 7.1%
… Oct 3259 7.1%
… Nov 2959 6.4%
… Dec 3207 6.9%

Once this data was gathered, our team conducted exploratory data analysis to identify the impact that our variables had on sales price. One example can be observed in Figure 1 below, which maps property sales price per square foot of heated area. Additional visualizations of the variables selected for prediction in this project are presented in the Methods and Results sections of this report.

Figure 1: Property Sales Prices Map

NC_Data$PricePerArea <- NC_Data$price / NC_Data$heatedarea

ggplot() +
  geom_sf(data = NPA.sf, fill = "white") +
  geom_sf(data = NC_Data, aes(colour = q5(PricePerArea)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels = qBr(NC_Data, "PricePerArea"),
                      name = "Quintile\nBreaks") +
  labs(title = "Price Per Area, Charlotte") +
  mapTheme()

During the early stages of our exploratory data analysis, our team created four price correlation scatterplots to visualize the relationship between our variables and property sales price. These visualizations are presented in Figures 2 through 5 below and reflect the relationship between building age, percent of heated area, property exposure to grocery stores, and property exposure to capital improvement projects. These scatterplots indicate a positive, linear relationship between the selected variables and property sales price. Figures 6 through 8 below also visualize the relationship between these variables and property sales price in a spatial format. Based on these findings, these variables were selected for further analysis in our OLS regression model discussed in the Methods section below.

Figure 2: Building Age Price Correlation Scatterplot

st_drop_geometry(NCData.sf) %>% 
  dplyr::select(price, Age) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#2c7fb8") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Building Age",
          x = "Age",
          y = "Sale Price")

Figure 3: Percent Heated Price Correlation Scatterplot

st_drop_geometry(NCData.sf) %>% 
  dplyr::select(price, percentheated) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#2c7fb8") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Heated Area",
          x = "Percent Area Heated",
          y = "Sale Price")

Figure 4: Grocery Stores Correlation Scatterplot

## Nearest Neighbor - grocery stores

NCData.sf <-
  NCData.sf %>% 
    mutate(
      grocery_nn1 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(Groceries.sf), k = 1),
      
      grocery_nn2 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(Groceries.sf), k = 2), 
      
      grocery_nn3 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(Groceries.sf), k = 3), 
      
      grocery_nn4 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(Groceries.sf), k = 4), 
      
      grocery_nn5 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(Groceries.sf), k = 5))
st_drop_geometry(NCData.sf) %>% 
  dplyr::select(price, grocery_nn1) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#2c7fb8") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Exposure to Grocery Stores",
          x = "Exposure to nearest grocery store",
          y = "Sale Price")

Figure 5: Capital Improvement Project Correlation Scatterplot

## Nearest Neighbor - capital improvement projects

NCData.sf <-
  NCData.sf %>% 
    mutate(
      proj_nn1 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(ImproveProj.sf), k = 1),
      
      proj_nn2 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(ImproveProj.sf), k = 2), 
      
      proj_nn3 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(ImproveProj.sf), k = 3), 
      
      proj_nn4 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(ImproveProj.sf), k = 4), 
      
      proj_nn5 = nn_function(st_coordinates(NCData.sf), 
                              st_coordinates(ImproveProj.sf), k = 5))

st_drop_geometry(NCData.sf) %>% 
  dplyr::select(price, proj_nn1) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#2c7fb8") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Exposure to Capital Improvement Projects",
          x = "Exposure to nearest capital improvement project",
          y = "Sale Price")

Figure 6: Building Age Map

ggplot() +
    geom_sf(data = NPA.sf, fill = "white") +
  geom_sf(data = NCData.sf, aes(colour = q5(Age)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels= qBr(NCData.sf, "Age"),
                   name="Quintile\nBreaks") +
  labs(title="Building Age") +
  mapTheme()

Figure 7: Percent Heated Area Map

ggplot() +
    geom_sf(data = NPA.sf, fill = "white") +
  geom_sf(data = NCData.sf, aes(colour = q5(percentheated)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels= qBr(NCData.sf, "percentheated"),
                   name="Quintile\nBreaks") +
  labs(title="Percent of Property Heated Area") +
  mapTheme()

Figure 8: Exposure to Grocery Stores Map

ggplot() +
  geom_sf(data = NPA.sf, fill = "white") +
  geom_sf(data = NCData.sf, aes(colour = q5(grocery_nn1)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(NCData.sf,"grocery_nn1"),
                   name="Quintile\nBreaks") +
  labs(title="Property Exposure to Grocery Stores") +
  mapTheme()

Our team also created a correlation matrix to visualize the correlation between our selected variables and property sales price. Correlation values closer to zero indicate a weaker relationship between variables, values closer to positive 1 reflect a strong positive relationship, and values closer to negative 1 reflect a strong negative relationship. For example, a strong positive correlation between building grade and sales price would indicate that higher-grade buildings tend to sell for higher prices; a strong negative correlation between the same variables would indicate the opposite. The price correlation matrix used to further quantify the relationship between our selected variables and property sales prices is reflected in Figure 9 below.

Figure 9: Price Correlation Matrix

numericVars <-
  select_if(st_drop_geometry(NCData.sf), is.numeric) %>% na.omit()


ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("#7fcdbb", "white", "#2c7fb8"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables")  

Methods

Our team began our model building process by researching Mecklenburg County to gain a sense of local expertise. This knowledge was used to identify which variables would be most effective at predicting property values in the county. We then found the required data sources and wrangled them into our workspace. Our team further cleaned the data to ensure that the required fields were present and to remove any features that were not useful for our efforts.

With the data wrangled and cleaned, the team was ready to explore the data sets. We began our exploratory data analysis by creating maps to present our variables in a spatial format. Then, we created correlation scatterplots to understand the potential relationship between our variables and property sales price. Using the scatterplots, we selected variables for further analysis based on the strength and direction of their relationship to sales price. More specifically, we selected variables that had a strong, positive relationship to sales price, indicating that as the magnitude of a variable increased, sales price tended to increase as well. The variables selected using this methodology included age, air conditioning type, building grade, heated area, sale year, sale month, exposure to grocery stores, and exposure to capital improvement projects. This variable selection process is a form of data mining.

Once our exploratory data analysis was completed, our team conducted further feature engineering to convert our raw variables into useful predictive features. A major component of this feature engineering was measuring exposure. A K-Nearest Neighbor (KNN) analysis was conducted for two of our variables: grocery stores and capital improvement projects. The analysis measures “exposure” as the average distance between each property and its k nearest grocery stores or capital improvement projects. A lower value therefore indicates that a property is in closer proximity to these amenities.
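The nn_function helper used in the code chunks above comes from our course utility code; as a rough, hedged sketch of what we assume it computes (the average Euclidean distance from each property to its k nearest amenity points), a base-R version might look like this:

```r
# Hypothetical base-R sketch of the nn_function helper (the version
# used in this report comes from course utility code; this is an
# assumption about its behavior, not the actual implementation).
nn_function <- function(measureFrom, measureTo, k) {
  # measureFrom, measureTo: two-column coordinate matrices,
  # e.g. the output of st_coordinates()
  apply(measureFrom, 1, function(pt) {
    # Euclidean distance from this property to every amenity point
    dists <- sqrt((measureTo[, 1] - pt[1])^2 + (measureTo[, 2] - pt[2])^2)
    # exposure = mean distance to the k nearest amenities
    mean(sort(dists)[1:k])
  })
}
```

Under this reading, grocery_nn1 is the distance to the single nearest grocery store, while grocery_nn5 averages the five nearest; smaller values mean greater exposure.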

## split test vs train
set.seed(825)

modelling_df <- NCData.sf[NCData.sf$toPredict == "MODELLING",]
challenge_df <- NCData.sf[NCData.sf$toPredict == "CHALLENGE",]

inTrain <- createDataPartition(
              y = paste(modelling_df$zipcode), 
              p = .60, list = FALSE)

Charlotte.training <- modelling_df[inTrain,] 
Charlotte.test <- modelling_df[-inTrain,] 
Charlotte.test2 <- Charlotte.test

Charlotte.training <- st_drop_geometry(Charlotte.training)
reg1 <- lm(price ~ ., data =  Charlotte.training %>% 
                                 dplyr::select(price, bedrooms,fullbaths, halfbaths, percentheated, basement, ac, totalac, year, storyheigh, numfirepla, heat,
                                                grade, Age, municipali, units, month, grocery_nn1, proj_nn1))

Charlotte.test1 <- st_drop_geometry(Charlotte.test)
Charlotte.test1 <- Charlotte.test1 %>%
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(reg1, Charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)%>%
  filter(price < 5000000) 

challenge_df <- st_drop_geometry(challenge_df)
challenge_df <- challenge_df %>%
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(reg1, challenge_df))

Lastly, our team began the model estimation and validation step. This step was used to determine the generalizability of our model and the effect that our feature engineering had on predicting sales prices within the county. Strong generalizability indicates that a model will remain useful when predicting values for new data. We assessed this by computing Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), running a k-fold cross-validation test, and conducting a Moran’s I test. The MAE and MAPE quantified our model’s prediction errors, the k-fold cross-validation test measured the generalizability of our model, and the Moran’s I test measured the spatial autocorrelation of our residuals. Lastly, to further test our model’s generalizability, our team applied census data to test how well our model generalizes across different neighborhood contexts.
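The MAE and MAPE metrics described above can be sketched with toy numbers (illustrative values only; the report computes them from the test set and, following the price.APE line in the code above, takes percentage error relative to the predicted price):

```r
# toy observed and predicted sale prices (illustrative values only)
observed  <- c(200000, 350000, 500000)
predicted <- c(220000, 300000, 550000)

# Mean Absolute Error: average dollar error of the predictions
mae <- mean(abs(predicted - observed))

# Mean Absolute Percentage Error, here relative to the predicted
# price to mirror the price.APE calculation used in this report
mape <- mean(abs(predicted - observed) / predicted)

mae   # 40000
mape  # ~0.116
```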

Results

Summary

We used Ordinary Least Squares (OLS) to model housing price estimates in Mecklenburg County, NC. OLS finds the linear relationship between our dependent variable and predictors. In this situation, house price is the dependent variable, and our predictors are the variables and engineered features discussed above. We aimed to maximize the accuracy and generalizability of our predictions by using different predictors in our model. In order to do this, we split our modeling set into two additional sets: a training set which includes 60% of the observations and a test set which includes the remaining 40% of observations.

The table below provides a summary of our model, including the r-squared and adjusted r-squared for the training sample. Our model explains about 68% of the variance in the dependent variable, sale price.

Table 2: OLS Summary Results

#summary table for training set
summary_table <- summary(reg1) 

stargazer(reg1, type="text", style="qje", digits=2, 
          title = "Linear Regression Summary Results",
          dep.var.labels = "Home Price",
          covariate.labels = c("Bedrooms", "Full Bathrooms", "Half Bathrooms", "Percent Area Heated", "Basement", "AC" , "Total Acres", "Sold 2021", "Sold 2022", "Sold 2028", "1 Story", "1.5 Stories", "2 Stories", "2.5 Stories", "3.0 Stories", "Bi-Level", "Cape Cod Style", "Ranch Style", "Split Level", "Fireplaces", "Heat", "Grade: Fair", "Grade: Average", "Grade: Good", "Grade: Very Good", "Grade: Excellent", "Grade: Custom", "Age", "Cornelius", "Davidson", "Huntersville", "Matthews", "Mint Hill", "Pineville", "Stallings", "Unincorporated",  "Units", "Sold Feb", "Sold Mar", "Sold Apr", "Sold May", "Sold Jun", "Sold Jul", "Sold Aug", "Sold Sept", "Sold Oct", "Sold Nov", "Sold Dec", "Grocery Store Exposure", "Improvement Project Exposure"))
## 
## Linear Regression Summary Results
## ===================================================================
##                                            Home Price              
## -------------------------------------------------------------------
## Bedrooms                                  15,593.16***             
##                                            (3,069.87)              
##                                                                    
## Full Bathrooms                           120,292.20***             
##                                            (4,388.16)              
##                                                                    
## Half Bathrooms                            66,379.28***             
##                                            (5,478.97)              
##                                                                    
## Percent Area Heated                       43,666.31**              
##                                           (22,234.39)              
##                                                                    
## Basement                                  45,746.39***             
##                                           (16,294.50)              
##                                                                    
## AC                                        48,986.58***             
##                                           (13,190.73)              
##                                                                    
## Total Acres                                  -1.93                 
##                                             (20.47)                
##                                                                    
## Sold 2021                                 75,182.80***             
##                                            (5,073.84)              
##                                                                    
## Sold 2022                                204,284.20***             
##                                            (6,257.20)              
##                                                                    
## Sold 2028                                  118,845.00              
##                                           (367,970.80)             
##                                                                    
## 1 Story                                    92,088.64               
##                                           (368,369.70)             
##                                                                    
## 1.5 Stories                                140,881.60              
##                                           (368,395.20)             
##                                                                    
## 2 Stories                                  80,206.47               
##                                           (368,309.70)             
##                                                                    
## 2.5 Stories                                90,309.26               
##                                           (368,505.20)             
##                                                                    
## 3.0 Stories                                -71,163.16              
##                                           (372,592.50)             
##                                                                    
## Bi-Level                                   18,227.58               
##                                           (370,190.80)             
##                                                                    
## Cape Cod Style                              5,740.95               
##                                           (381,263.50)             
##                                                                    
## Ranch Style                                96,861.12               
##                                           (369,072.00)             
##                                                                    
## Split Level                                46,640.77               
##                                           (368,701.70)             
##                                                                    
## Fireplaces                                14,008.51***             
##                                            (5,199.15)              
##                                                                    
## Heat                                                               
##                                                                    
##                                                                    
## Grade: Fair                                88,549.37               
##                                           (93,838.32)              
##                                                                    
## Grade: Average                             109,427.70              
##                                           (92,273.67)              
##                                                                    
## Grade: Good                              256,566.10***             
##                                           (92,439.88)              
##                                                                    
## Grade: Very Good                         508,773.30***             
##                                           (92,758.84)              
##                                                                    
## Grade: Excellent                        1,091,591.00***            
##                                           (94,019.27)              
##                                                                    
## Grade: Custom                           1,978,565.00***            
##                                           (97,989.37)              
##                                                                    
## Age                                       2,119.99***              
##                                             (130.85)               
##                                                                    
## Cornelius                                117,970.40***             
##                                           (26,654.13)              
##                                                                    
## Davidson                                   -23,158.18              
##                                           (25,066.04)              
##                                                                    
## Huntersville                              -22,174.85*              
##                                           (12,642.83)              
##                                                                    
## Matthews                                 -42,255.91***             
##                                           (14,218.35)              
##                                                                    
## Mint Hill                                  -12,334.09              
##                                           (14,434.75)              
##                                                                    
## Pineville                                  28,970.93               
##                                           (23,249.51)              
##                                                                    
## Stallings                                  -64,384.47              
##                                           (103,104.90)             
##                                                                    
## Unincorporated                              6,082.12               
##                                            (9,578.32)              
##                                                                    
## Units                                    357,561.60***             
##                                            (1,789.91)              
##                                                                    
## Sold Feb                                  26,137.03**              
##                                           (11,878.04)              
##                                                                    
## Sold Mar                                  56,448.30***             
##                                           (10,860.32)              
##                                                                    
## Sold Apr                                  40,106.51***             
##                                           (11,000.30)              
##                                                                    
## Sold May                                  54,022.62***             
##                                           (11,050.51)              
##                                                                    
## Sold Jun                                  56,688.84***             
##                                           (10,688.97)              
##                                                                    
## Sold Jul                                  39,562.94***             
##                                           (10,832.10)              
##                                                                    
## Sold Aug                                  49,230.10***             
##                                           (11,398.44)              
##                                                                    
## Sold Sept                                 55,278.51***             
##                                           (12,035.04)              
##                                                                    
## Sold Oct                                  78,234.37***             
##                                           (11,961.75)              
##                                                                    
## Sold Nov                                  81,794.09***             
##                                           (12,255.00)              
##                                                                    
## Sold Dec                                  82,506.75***             
##                                           (12,014.33)              
##                                                                    
## Grocery Store Exposure                       -3.76*                
##                                              (2.27)                
##                                                                    
## Improvement Project Exposure                5.59***                
##                                              (1.94)                
##                                                                    
## Constant                                 -806,195.90**             
##                                           (380,002.50)             
##                                                                    
## N                                            28,023                
## R2                                            0.68                 
## Adjusted R2                                   0.68                 
## Residual Std. Error                 367,837.90 (df = 27973)        
## F Statistic                       1,215.23*** (df = 49; 27973)     
## ===================================================================
## Notes:                       ***Significant at the 1 percent level.
##                               **Significant at the 5 percent level.
##                               *Significant at the 10 percent level.

After running our model on the test set, we found that the mean absolute error (MAE) of our predictions was $114,322 and the mean absolute percentage error (MAPE) of our predictions was about 32%. These figures are presented in the table below.

Table 3: MAE and MAPE

#Polished table of mean absolute error and MAPE
Charlotte.testSum <- 
  st_drop_geometry(Charlotte.test1) %>%
  summarize(MAE = mean(price.AbsError, na.rm=TRUE),
            MAPE = mean(price.APE, na.rm=TRUE))

Charlotte.testSum %>%
  rename("Mean Absolute Error" = MAE) %>%
  rename("Mean Absolute Percentage Error" = MAPE)
##   Mean Absolute Error Mean Absolute Percentage Error
## 1            114322.4                      0.3208169

kable(Charlotte.testSum, caption = "Test Set Results", align = "c") %>%
  kable_styling(full_width = F)

Test Set Results
MAE        MAPE
114322.4   0.3208169

Cross-Validation

Because we split our data into training and testing sets randomly, it is possible our model is still inaccurate or misleading. We conducted a k-fold cross-validation test to better understand the generalizability of our model. We ran a 100-fold cross-validation, which splits the dataset into equal-sized subsets and measures average goodness of fit across all folds. The summary of our cross-validation test, including the mean, maximum, minimum, and standard deviation of the MAE, is presented below.

Table 4: Cross-Validation test results

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(Charlotte.test) %>% 
                                dplyr::select(price, bedrooms,fullbaths, halfbaths, percentheated, basement, ac, totalac, year, storyheigh, numfirepla, heat,
                                                grade, Age, municipali, units, month, grocery_nn1, proj_nn1), 
     method = "lm", trControl = fitControl, na.action = na.pass)

#cross val results 
data.frame(Test = c("Cross_Validation"), 
           Mean = mean(reg.cv$resample[,3]),
           Max = max(reg.cv$resample[,3]),
           Min = min(reg.cv$resample[,3]),
           Standard_Deviation = sd(reg.cv$resample[,3]))%>%
  kable() %>%
  kable_styling() %>%
  footnote(general_title = "Summary Statistics of Cross Validation, k = 100 folds")
Test              Mean      Max       Min      Standard_Deviation
Cross_Validation  111435.4  357634.7  84760.1  30130.19

Our model was somewhat generalizable. The standard deviation of the MAE across all folds was about $30,130, meaning there is some, but not a large amount of, variation in goodness of fit across the folds. The distribution of MAE is clustered tightly between $100,000 and $150,000, but it is positively skewed by a small number of higher values, which could be due to outliers in our dataset. A plot of the distribution of MAE across all 100 folds is provided below.

Figure 10: Distribution of MAE

hist(reg.cv$resample[,3],xlab="MAE", col="#2c7fb8", breaks = 50, main = "Distribution of Mean Absolute Error")

Accuracy

The plot below compares our model’s predicted sale prices with the observed sale prices in the test set. The black line represents our model’s fit and the green line represents a perfect prediction. We can see that there is some error in our model.

Figure 11: Prediction vs observed price

ggplot(data = Charlotte.test1, aes(x=price.Predict,y=price))+
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75)+
  geom_abline(intercept = 0, slope = 1, size = 3,color = "#7fcdbb")+
  geom_smooth(method = lm, se=F,colour = "black",size=2)+
  labs(title="Observed price as a function of predicted price",
       subtitle="Black line represents model fit; Green line represents perfect prediction",
       x="Predicted Price ($)",
       y="Observed Price ($)") +
  plotTheme()+
  theme(plot.title = element_text(size=22),
        legend.title = element_text(),
        legend.position="right",
        axis.text=element_text(size=12),
        axis.title=element_text(size=12))

Generalizability

Generalizability across space is also important for creating house price prediction models. Looking at the map of our test set residuals, we can see that there is some clustering, or spatial autocorrelation, of higher residuals in certain areas.

Figure 12: Residuals

coords <- st_coordinates(NCData.sf) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

NCData.sf$lagPrice <- lag.listw(spatialWeights, NCData.sf$price)

coords.test <-  st_coordinates(Charlotte.test2) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

Charlotte.test2$lagPrice <- lag.listw(spatialWeights.test, Charlotte.test2$price)

Charlotte.test2 <- Charlotte.test2 %>%
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(reg1, Charlotte.test2),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)%>%
  filter(price < 5000000)

ggplot() +
    geom_sf(data = NPA.sf, fill = "white") +
    geom_sf(data = Charlotte.test2, aes(colour = q5(price.AbsError)),
            show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                     labels=qBr(Charlotte.test2,"price.AbsError"),
                     name="Quintile\nBreaks ($)") +
    labs(title="Test Set Residuals", 
         subtitle="Mecklenburg County") +
    mapTheme()

The plots below examine the spatial autocorrelation of home prices and errors. For each home, we calculated the average sale price of its five nearest neighbors (the spatial lag). We see that as the price of a given home increases, the price of its neighbors increases significantly as well. Home price also increases with the spatial lag of errors, though not as steeply as with the spatial lag of price. These spatial lags of price and error are presented in the two plots below.

Figures 13 and 14: Spatial Lag Plots


ggplot(Charlotte.test2, aes(x=lagPrice, y=price)) +
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  labs(title = "Price as a function of the spatial lag of price",
       x = "Spatial lag of price (Mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()

Charlotte.test2$lagPriceError <- lag.listw(spatialWeights.test, Charlotte.test2$price.AbsError, NAOK=TRUE)

ggplot(Charlotte.test2, aes(x=lagPriceError, y=price)) +
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  labs(title = "Price as a function of the spatial lag of errors",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()

Finally, we used Moran’s I to test whether spatial autocorrelation is present in our model’s errors. Our observed Moran’s I is positive, at roughly 0.25, and its p-value of 0.001 indicates statistically significant spatial autocorrelation in the residuals. We present the outcome of the permutation test below.

Figure 15: Moran’s I

#morans
moranTest <- moran.mc(Charlotte.test2$price.Error, na.action=na.omit, 
                      spatialWeights.test, nsim = 999)

ggplot(data.frame(res = moranTest$res[1:999]), aes(x = res)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#2c7fb8",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in blue",
       x="Moran's I",
       y="Count") +
  plotTheme()
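The p-value reported by moran.mc is a permutation (pseudo) p-value: with nsim = 999, the smallest attainable value is 1/1000 = 0.001. A minimal sketch of that arithmetic, assuming the moranTest object created in the chunk above:

```r
# moranTest$res holds the 999 permuted statistics followed by the
# observed statistic; the pseudo p-value counts permutations at least
# as large as the observed value (plus the observed value itself),
# divided by nsim + 1
observed <- moranTest$statistic
permuted <- moranTest$res[1:999]
p_pseudo <- (sum(permuted >= observed) + 1) / (999 + 1)
```

An observed Moran’s I larger than every permuted value therefore yields exactly the 0.001 reported above.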

We plot our predicted values below. Here we can clearly see clustering of predicted home prices, with more expensive predictions concentrated in the south-central and northern parts of the county. We also see clustering of less expensive price predictions in an arc around downtown Charlotte.

Figure 16: Predicted Prices

ggplot() +
  geom_sf(data = NPA.sf, fill = "white") +
  geom_sf(data = Charlotte.test2, aes(colour = q5(price.Predict)), 
          show.legend = "point", size = 1) +
  scale_colour_manual(values=palette5,
                      labels=qBr(Charlotte.test2,"price.Predict"),
                      name="Quintile\nBreaks") +
  labs(title="Predicted Sale Price", subtitle = "Mecklenburg County") +
  mapTheme()+ 
  theme(plot.title = element_text(size=22))

We plot the MAPE of our predictions by neighborhood below. We still see some clustering of errors around Charlotte and near the outskirts of the county.

Figure 17: MAPE by neighborhood

charlotte.test_sf = st_as_sf(Charlotte.test2)
nhoods = st_intersection(NPA.sf, charlotte.test_sf)
nhoods_MAPE <- nhoods %>%
  group_by(id) %>%
  summarise(meanMAPE = mean(price.APE, na.rm=TRUE)*100, 
            meanPrice = mean(price))


ggplot() +
  geom_sf(data = NPA.sf %>%
            left_join(st_drop_geometry(nhoods_MAPE), by = "id"),
          aes(fill = q5(meanMAPE))) +
  #geom_sf(data = Charlotte.test2, colour = "black", size = .25) +
  scale_fill_manual(values = palette5,
                    labels=qBr(nhoods_MAPE,"meanMAPE"),
                    name="Quintile\nBreaks") +
  mapTheme() +
  labs(title="Absolute sale price percent errors by Neighborhood")

Additionally, we present a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood below. We see that as the mean price of a neighborhood increases, its MAPE also increases. This plot also makes an outlier in our dataset visible.

Figure 18: Scatterplot of MAPE by neighborhood

ggplot(nhoods_MAPE, aes(x=meanPrice, y=meanMAPE))+
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "MAPE by neighborhood as a function of price by neighborhood",
       x = "Sale Price",
       y = "MAPE") +
  theme(
    legend.position = "none") +
  plotTheme()

Finally, we split our study area into two groups. In this case, we split the county first by race and then by income to test how well our model generalizes in different contexts. We present the race and income splits by neighborhood below; these plots show the segregation of race and income in Mecklenburg County.

Figure 19: Race and Income Context

NCtracts <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          year = 2017, state=37, county= 119, geometry=T, output = "wide") %>%
  st_transform('ESRI:102286')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(NCtracts), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#7fcdbb", "#2c7fb8"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(NCtracts), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#7fcdbb", "#2c7fb8"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

In the race context, our model predicts better for areas that are majority non-white. We see that the MAPE in majority white neighborhoods is higher than that in majority non-white neighborhoods.

Race Context

st_join(Charlotte.test2, NCtracts) %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood race context")%>%
  kable_styling() 
Test set MAPE by neighborhood race context
Majority Non-White Majority White <NA>
30% 33% 16%

In terms of income, our model produced higher errors in neighborhoods with lower median incomes than in neighborhoods with higher median incomes.

Income Context

st_join(Charlotte.test2, NCtracts) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context") %>%
  kable_styling() 
Test set MAPE by neighborhood income context
High Income Low Income <NA>
30% 36% 16%

Our model is somewhat generalizable, but it performs better at predicting house prices in neighborhoods with higher median incomes and in neighborhoods that are majority non-white.

Discussion

Overall, our model is relatively effective at predicting house prices in Mecklenburg County. We found the percentage of heated area in the house, as well as exposure to grocery stores and capital improvement projects, to be interesting and important variables in our model. Our model explained about 68% of the variation in home prices. The MAE of our predictions was $114,322.40, and the MAPE was about 32%. We were able to account for some spatial variation in prices, but significant spatial variation remains in our model.
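The error figures quoted here can be recomputed directly from the test-set predictions; a minimal sketch, assuming Charlotte.test2 still carries the price.AbsError and price.APE columns created in the chunks above:

```r
# MAE: average absolute dollar error on the test set
mae <- mean(Charlotte.test2$price.AbsError, na.rm = TRUE)

# MAPE: average absolute error as a percentage; note that, as in the
# chunks above, the denominator is the predicted price rather than the
# observed price
mape <- mean(Charlotte.test2$price.APE, na.rm = TRUE) * 100
```

Dividing by observed price instead of price.Predict is the more common MAPE convention and would shift the reported figure slightly; we keep the document's convention here.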

We found that our model was less accurate when predicting prices for more expensive properties and for homes located in majority-white neighborhoods. This could be due to the distribution of home prices in our training set: with fewer observations at higher prices, predictions in that range are less accurate. Conversely, our model was more accurate when predicting less expensive homes.
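One way to check this pattern is to bin the test set by observed price and compare errors across bins; a sketch, assuming Charlotte.test2 carries the price.APE column computed earlier (ntile comes from dplyr):

```r
# MAPE by observed-price quintile; rising error in the upper quintiles
# would confirm weaker performance on more expensive homes
st_drop_geometry(Charlotte.test2) %>%
  mutate(priceQuintile = ntile(price, 5)) %>%
  group_by(priceQuintile) %>%
  summarise(meanPrice = mean(price, na.rm = TRUE),
            meanMAPE = mean(price.APE, na.rm = TRUE) * 100)
```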

Conclusion

This model predicts sales price as a function of several variables, including property age, air conditioning type, building grade, heated area, sale year, sale month, exposure to grocery stores, and exposure to capital projects. Our model was relatively successful in predicting property values in Charlotte. However, given that it accounted for only 68% of the variation in price and carried a MAPE of 32%, we would most likely not recommend this model to Zillow in its current form. We believe that adding more spatial features and a more diverse set of training observations would help create a more successful model.

##                                                                    
## Grade: Very Good                         508,773.30***             
##                                           (92,758.84)              
##                                                                    
## Grade: Excellent                        1,091,591.00***            
##                                           (94,019.27)              
##                                                                    
## Grade: Custom                           1,978,565.00***            
##                                           (97,989.37)              
##                                                                    
## Age                                       2,119.99***              
##                                             (130.85)               
##                                                                    
## Cornelius                                117,970.40***             
##                                           (26,654.13)              
##                                                                    
## Davidson                                   -23,158.18              
##                                           (25,066.04)              
##                                                                    
## Huntersville                              -22,174.85*              
##                                           (12,642.83)              
##                                                                    
## Matthews                                 -42,255.91***             
##                                           (14,218.35)              
##                                                                    
## Mint Hill                                  -12,334.09              
##                                           (14,434.75)              
##                                                                    
## Pineville                                  28,970.93               
##                                           (23,249.51)              
##                                                                    
## Stallings                                  -64,384.47              
##                                           (103,104.90)             
##                                                                    
## Unincorporated                              6,082.12               
##                                            (9,578.32)              
##                                                                    
## Units                                    357,561.60***             
##                                            (1,789.91)              
##                                                                    
## Sold Feb                                  26,137.03**              
##                                           (11,878.04)              
##                                                                    
## Sold Mar                                  56,448.30***             
##                                           (10,860.32)              
##                                                                    
## Sold Apr                                  40,106.51***             
##                                           (11,000.30)              
##                                                                    
## Sold May                                  54,022.62***             
##                                           (11,050.51)              
##                                                                    
## Sold Jun                                  56,688.84***             
##                                           (10,688.97)              
##                                                                    
## Sold Jul                                  39,562.94***             
##                                           (10,832.10)              
##                                                                    
## Sold Aug                                  49,230.10***             
##                                           (11,398.44)              
##                                                                    
## Sold Sept                                 55,278.51***             
##                                           (12,035.04)              
##                                                                    
## Sold Oct                                  78,234.37***             
##                                           (11,961.75)              
##                                                                    
## Sold Nov                                  81,794.09***             
##                                           (12,255.00)              
##                                                                    
## Sold Dec                                  82,506.75***             
##                                           (12,014.33)              
##                                                                    
## Grocery Store Exposure                       -3.76*                
##                                              (2.27)                
##                                                                    
## Improvement Project Exposure                5.59***                
##                                              (1.94)                
##                                                                    
## Constant                                 -806,195.90**             
##                                           (380,002.50)             
##                                                                    
## N                                            28,023                
## R2                                            0.68                 
## Adjusted R2                                   0.68                 
## Residual Std. Error                 367,837.90 (df = 27973)        
## F Statistic                       1,215.23*** (df = 49; 27973)     
## ===================================================================
## Notes:                       ***Significant at the 1 percent level.
##                               **Significant at the 5 percent level.
##                               *Significant at the 10 percent level.

After running our model on the test set, we found a mean absolute error (MAE) of $114,322 and a mean absolute percentage error (MAPE) of about 32%. These figures are presented in the table below.

Table 3: MAE and MAPE

#Polished table of mean absolute error and MAPE
Charlotte.testSum <- 
  st_drop_geometry(Charlotte.test1) %>%
  summarize(MAE = mean(price.AbsError, na.rm = TRUE),
            MAPE = mean(price.APE, na.rm = TRUE))

Charlotte.testSum %>%
  rename("Mean Absolute Error" = MAE,
         "Mean Absolute Percentage Error" = MAPE) %>%
  kable(caption = "Test Set Results", align = "c") %>%
  kable_styling(full_width = F)
Test Set Results
Mean Absolute Error Mean Absolute Percentage Error
114322.4 0.3208169
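To make these two metrics concrete, here is a minimal sketch with toy numbers (hypothetical, not project data). Note that, mirroring the project's `price.APE` calculation, the percentage error here divides by the predicted rather than the observed price:

```r
# Hypothetical observed and predicted sale prices (toy data)
observed  <- c(200000, 350000, 500000)
predicted <- c(250000, 300000, 450000)

# Mean absolute error: the average dollar gap between prediction and truth
mae <- mean(abs(predicted - observed))

# Mean absolute percentage error, dividing by the prediction
# (as in the project's price.APE; conventional MAPE divides by observed)
mape <- mean(abs(predicted - observed) / predicted)
```

Here every prediction misses by $50,000, so the MAE is $50,000 even though the percentage error differs across the three sales.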

Cross-Validation

It is possible our model is still inaccurate or misleading because we split our data into training and testing sets randomly. To better understand the generalizability of our model, we conducted a k-fold cross-validation test. We ran a 100-fold cross-validation, which splits the dataset into 100 equal-sized subsets and measures average goodness of fit across all folds. The summary of our cross-validation test, including the mean, maximum, minimum, and standard deviation of MAE, is presented below.

Table 4: Cross-Validation test results

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(Charlotte.test) %>% 
          dplyr::select(price, bedrooms, fullbaths, halfbaths, percentheated,
                        basement, ac, totalac, year, storyheigh, numfirepla,
                        heat, grade, Age, municipali, units, month,
                        grocery_nn1, proj_nn1), 
        method = "lm", trControl = fitControl, na.action = na.pass)

#cross val results 
data.frame(Test = c("Cross_Validation"), 
           Mean = mean(reg.cv$resample[,3]),
           Max = max(reg.cv$resample[,3]),
           Min = min(reg.cv$resample[,3]),
           Standard_Deviation = sd(reg.cv$resample[,3]))%>%
  kable() %>%
  kable_styling() %>%
  footnote(general_title = "Summary Statistics of Cross Validation, k = 100 folds")
Test Mean Max Min Standard_Deviation
Cross_Validation 111435.4 357634.7 84760.1 30130.19

Our model was somewhat generalizable. The standard deviation of the MAE across all folds was about $30,130, indicating some, but not a large amount of, variation in goodness of fit across the folds. The distribution of MAE is clustered tightly between $100,000 and $150,000, but it is positively skewed by a small number of higher values, which could be due to outliers in our dataset. A plot of the distribution of MAE across all 100 folds is provided below.

Figure 10: Distribution of MAE

hist(reg.cv$resample[,3],xlab="MAE", col="#2c7fb8", breaks = 50, main = "Distribution of Mean Absolute Error")

Accuracy

The plot below compares the predicted and observed sale prices in our test set. The black line represents our model’s fit and the green line represents a perfect prediction. We can see that there is a small amount of error in our model.

Figure 11: Prediction vs observed price

ggplot(data = Charlotte.test1, aes(x=price.Predict,y=price))+
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75)+
  geom_abline(intercept = 0, slope = 1, size = 3,color = "#7fcdbb")+
  geom_smooth(method = lm, se=F,colour = "black",size=2)+
  labs(title="Prediction as a function of observed price",
       subtitle="Black line represents model prediction; Green line represents perfect prediction",
       x="Predicted Price ($)",
       y="Observed Price ($)") +
  plotTheme()+
  theme(plot.title = element_text(size=22),
        legend.title = element_text(),
        legend.position="right",
        axis.text=element_text(size=12),
        axis.title=element_text(size=12))

Generalizability

Generalizability across space is also important for creating house price prediction models. Looking at the map of our test set residuals, we can see that there is some clustering, or spatial autocorrelation, of higher residuals in certain areas.

Figure 12: Residuals

coords <- st_coordinates(NCData.sf) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

NCData.sf$lagPrice <- lag.listw(spatialWeights, NCData.sf$price)

coords.test <-  st_coordinates(Charlotte.test2) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

Charlotte.test2$lagPrice <- lag.listw(spatialWeights.test, Charlotte.test2$price)

Charlotte.test2 <- Charlotte.test2 %>%
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(reg1, Charlotte.test2),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)%>%
  filter(price < 5000000)

ggplot() +
    geom_sf(data = NPA.sf, fill = "white") +
    geom_sf(data = Charlotte.test2, aes(colour = q5(price.AbsError)),
            show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                     labels=qBr(Charlotte.test2,"price.AbsError"),
                     name="Quintile\nBreaks ($)") +
    labs(title="Test Set Residuals", 
         subtitle="Mecklenburg County") +
    mapTheme()

The plots below examine the spatial autocorrelation of home prices with nearby home prices. For each home, we calculated the spatial lag of price: the average sale price of its five nearest neighbors. We see that as the price of a given home increases, the average price of its neighbors increases significantly as well. Errors also exhibit some spatial lag, though the relationship is weaker than it is for prices. The spatial lag of prices and of errors are presented in the two plots below.
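To clarify what the spatial lag measures, here is a base-R sketch on toy coordinates and prices (hypothetical data; the report itself uses `spdep`'s `knearneigh`/`lag.listw` with k = 5, while this sketch uses k = 2 for brevity):

```r
# Toy homes on a line: three clustered cheap homes and one distant expensive one
x     <- c(0, 1, 2, 10)
y     <- c(0, 0, 0, 0)
price <- c(100, 120, 140, 500)

k <- 2
lagPrice <- sapply(seq_along(price), function(i) {
  d  <- sqrt((x - x[i])^2 + (y - y[i])^2)  # Euclidean distance to every home
  nn <- order(d)[2:(k + 1)]                # nearest neighbors, skipping self
  mean(price[nn])                          # spatial lag: mean neighbor price
})
```

The expensive outlier's lag is pulled down toward its cheaper neighbors, which is why comparing each price to its spatial lag reveals clustering.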

Figures 13 and 14: Spatial Lag Plots


ggplot(Charlotte.test2, aes(x=lagPrice, y=price)) +
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  labs(title = "Price as a function of the spatial lag of price",
       x = "Spatial lag of price (Mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()


Charlotte.test2$lagPriceError <- lag.listw(spatialWeights.test, Charlotte.test2$price.AbsError, NAOK=TRUE)

ggplot(Charlotte.test2, aes(x=lagPriceError, y=price)) +
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  labs(title = "Price as a function of the spatial lag of errors",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()

Finally, we used Moran’s I to better understand whether spatial autocorrelation is present in our model. Our model’s Moran’s I is positive, at around 0.25, and its p-value of 0.001 indicates statistically significant spatial autocorrelation in our residuals. We present the outcome of the Moran’s I test below.
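The permutation logic behind `moran.mc` can be illustrated with a toy sketch (hypothetical data; the autocorrelation statistic below is a simple lag-1 correlation stand-in, not the actual Moran's I formula): shuffle the values repeatedly, recompute the statistic, and ask how often the shuffled statistic meets or exceeds the observed one.

```r
set.seed(1)
vals <- c(1, 2, 3, 10, 11, 12)   # spatially ordered toy values

# Stand-in autocorrelation statistic: correlation of each value with the next
stat <- function(v) cor(v[-length(v)], v[-1])

observed <- stat(vals)                          # strong positive "clustering"
permuted <- replicate(999, stat(sample(vals)))  # statistic under random shuffles
pval     <- (sum(permuted >= observed) + 1) / (999 + 1)
```

`moran.mc` follows the same recipe with the true Moran's I and the spatial weights matrix; a small p-value means the observed clustering is unlikely to arise from randomly arranged values.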

Figure 15: Moran’s I

#morans
moranTest <- moran.mc(Charlotte.test2$price.Error, na.action=na.omit, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#2c7fb8",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in blue",
       x="Moran's I",
       y="Count") +
  plotTheme()

We plot our predicted values below. Here we can clearly see clustering of predicted home prices, with more expensive predictions clustered in the south central and northern parts of the county and less expensive predictions clustered in an arc around downtown Charlotte.

Figure 16: Predicted Prices

ggplot() +
  geom_sf(data = NPA.sf, fill = "white") +
  geom_sf(data = Charlotte.test2, aes(colour = q5(price.Predict)), 
          show.legend = "point", size = 1) +
  scale_colour_manual(values=palette5,
                      labels=qBr(Charlotte.test2,"price.Predict"),
                      name="Quintile\nBreaks") +
  labs(title="Predicted Sale Price", subtitle = "Mecklenburg County") +
  mapTheme()+ 
  theme(plot.title = element_text(size=22))

We plot the MAPE of our predictions by neighborhood below. We still see some clustering of errors around Charlotte and near the outskirts of the county.

Figure 17: MAPE by neighborhood

library(mapview)

charlotte.test_sf = st_as_sf(Charlotte.test2)
nhoods = st_intersection(NPA.sf, charlotte.test_sf)
nhoods_MAPE <- nhoods %>%
  group_by(id) %>%
  summarise(meanMAPE = mean(price.APE, na.rm=TRUE)*100, 
            meanPrice = mean(price))


ggplot() +
  geom_sf(data = NPA.sf %>%
            left_join(st_drop_geometry(nhoods_MAPE), by = "id"),
          aes(fill = q5(meanMAPE))) +
  #geom_sf(data = Charlotte.test2, colour = "black", size = .25) +
  scale_fill_manual(values = palette5,
                    labels=qBr(nhoods_MAPE,"meanMAPE"),
                    name="Quintile\nBreaks") +
  mapTheme() +
  labs(title="Absolute sale price percent errors by Neighborhood")

Additionally, we present a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood below. We see that as mean price by neighborhood increases, neighborhood MAPE also increases. This plot also reveals an outlier in our dataset.

Figure 18: Scatterplot of MAPE by neighborhood

ggplot(nhoods_MAPE, aes(x=meanPrice, y=meanMAPE))+
  geom_point(colour = "#2c7fb8", size = 3, alpha =0.75) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "MAPE by neighborhood as a function of price by neighborhood",
       x = "Sale Price",
       y = "MAPE") +
  theme(
    legend.position = "none") +
  plotTheme()

Finally, we split our study area into two groups: first by race and then by income, to test how well our model generalizes in different contexts. We present the race and income splits by census tract below. These plots show the segregation of race and income in Mecklenburg County.

Figure 19: Race and Income Context

NCtracts <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          year = 2017, state=37, county= 119, geometry=T, output = "wide") %>%
  st_transform('ESRI:102286')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(NCtracts), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#7fcdbb", "#2c7fb8"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(NCtracts), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#7fcdbb", "#2c7fb8"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

In the race context, our model predicts better for areas that are majority non-white: the MAPE in majority white neighborhoods is higher than in majority non-white neighborhoods.

Race Context

st_join(Charlotte.test2, NCtracts) %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood race context")%>%
  kable_styling() 
Test set MAPE by neighborhood race context
Majority Non-White Majority White <NA>
30% 33% 16%

In terms of income, our model produced higher errors in neighborhoods with lower median incomes than in neighborhoods with higher median incomes, as shown in the table below.

Income Context

st_join(Charlotte.test2, NCtracts) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context") %>%
  kable_styling() 
Test set MAPE by neighborhood income context
High Income Low Income <NA>
30% 36% 16%

Our model is somewhat generalizable: it performs better at predicting house prices in neighborhoods that are majority non-white, though its errors are somewhat higher in neighborhoods with lower median incomes.

Discussion

Overall, our model is relatively effective at predicting house prices in Mecklenburg County. We found the percentage of heated area in the house, as well as exposure to grocery stores and capital improvement projects, to be interesting and important variables in our model. Our model explained about 68% of the variation in home prices. The MAE of our predictions was $114,322.40, while the MAPE was about 32%. We were able to account for some spatial variation in prices, but significant spatial variation remains in our model.

We found that our model was less accurate when predicting prices for more expensive properties, and also less accurate for homes located in majority white neighborhoods. This could be due to the distribution of home prices in our training set: with fewer observations at higher prices, predictions for expensive homes are less accurate than those for less expensive homes.

Conclusion

This model predicts sale price as a function of several variables, including property age, air conditioning type, building grade, heated area, sale year, sale month, exposure to grocery stores, and exposure to capital improvement projects. Our model was relatively successful in predicting property values in Mecklenburg County. However, given that it accounted for only 68% of price variation with a MAPE of 32%, we would most likely not recommend this model to Zillow. We believe that adding more spatial features and more diverse training observations would help create a more successful model.
