Predicting Housing Prices with Machine Learning Models
Abstract
Housing prices are affected by many different variables and can vary drastically. To make well-informed decisions about housing, it is useful to be able to predict the sale price of a house given a set of descriptors. This project seeks to identify a model for predicting housing prices using a dataset provided by Kaggle, by comparing the accuracy and predictive power of several machine learning models.
Introduction
General Background Information
Machine learning models provide a massive opportunity for real estate investors to estimate housing prices from a set of predictors. This project uses regression prediction. Unlike classification, which attempts to predict a class label from the given predictors, regression seeks to predict a continuous value (e.g., a dollar amount, a cell count). In our case, the response variable is continuous: the sale price of a house.
Description of data
The dataset contains 1,460 records and 81 columns/variables. One column is an ID and another is the response, leaving 79 predictors that describe the house in question. These include both categorical and numerical predictors, covering areas such as basements, garages, bathrooms, location, age, and many others. As mentioned, the response variable is the sale price of the house.
Questions/Hypotheses to be addressed
The research question I plan to address with my analysis is: which features are the best predictors of housing prices? Additionally, the desired output of this analysis is a machine learning model which allows real estate investors or other interested stakeholders to better make decisions regarding future real estate purchases.
Methods
Data import and cleaning
Reading in the Data
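The original analysis read the Kaggle CSV in R. As a rough illustration only (the Python/pandas translation and the in-memory sample below are my assumptions, not the project's actual code), reading the data looks like this:

```python
import io

import pandas as pd

# In the real project this would be something like: housing = pd.read_csv("train.csv")
# Here a tiny in-memory sample with the same column style stands in for the file.
sample_csv = io.StringIO(
    "Id,MSSubClass,MSZoning,LotFrontage,LotArea,SalePrice\n"
    "1,60,RL,65,8450,208500\n"
    "2,20,RL,80,9600,181500\n"
)
housing = pd.read_csv(sample_csv)
print(housing.shape)  # rows x columns; the full Kaggle file is 1460 x 81
```

The full dataset parses the same way, yielding the 1,460 records and 81 columns described above.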
Cleaning Data
Several data cleaning methods were tried in order to find the best way to prepare the data for modeling. First, I removed all variables with more than 80% NA values. For the categorical predictors with some missing values, I replaced the NAs with a "None" level. Next, I converted all the categorical variables to factors (they were initially read in as character values). Once the variables were recognized as factors, I could evaluate which predictors suffered from class imbalance, and I removed those with roughly 80% or more of their observations in a single class. Finally, I replaced the NA values in the numeric predictors with the median of each variable. By the end of the data cleaning portion of this project, the data set had been reduced from 79 to 63 predictors: 26 categorical and 37 numeric variables.
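The cleaning steps above were implemented in R; as a hedged sketch of the same pipeline in Python/pandas (the 80% cutoffs mirror those described, while the function name and everything else are illustrative):

```python
import pandas as pd

def clean(df: pd.DataFrame, na_cutoff: float = 0.8, imbalance_cutoff: float = 0.8) -> pd.DataFrame:
    df = df.copy()
    # 1. Drop variables with more than 80% NA values.
    df = df.loc[:, df.isna().mean() <= na_cutoff]
    # 2. Replace remaining NAs in categorical predictors with a "None" level.
    cat_cols = df.select_dtypes(include="object").columns
    df[cat_cols] = df[cat_cols].fillna("None")
    # 3. Drop categorical predictors with ~80%+ of observations in one class.
    imbalanced = [c for c in cat_cols
                  if df[c].value_counts(normalize=True).iloc[0] > imbalance_cutoff]
    df = df.drop(columns=imbalanced)
    # 4. Replace NAs in numeric predictors with each variable's median.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df
```

Applied to the full dataset, steps 1 and 3 account for the drop from 79 to 63 predictors.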
Background on Chosen Models
I chose to compare the following models to determine the best prediction model for this project: K-Nearest Neighbors (KNN), linear regression, and RandomForest.
KNN:
The KNN model predicts based on the closest samples, or neighbors. To predict a value, the K training samples nearest to the point of interest (typically by Euclidean distance) are examined, and their response values are either used to classify the point or averaged to produce a regression prediction. K represents the number of neighbors used to come to this conclusion (Kuhn).
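As a minimal sketch of the mechanics (pure NumPy; the square footages and prices are made up for illustration, not drawn from the dataset):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Average the response of the k nearest training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest samples
    return y_train[nearest].mean()       # regression: mean of the neighbors' responses

# Toy data: above-ground square footage -> sale price (invented values).
X = np.array([[1400], [1600], [1700], [1875], [1100], [1550], [2350], [2450]])
y = np.array([245000., 312000., 279000., 308000., 199000., 219000., 405000., 324000.])

pred = knn_predict(X, y, np.array([2000]), k=3)
print(round(pred))  # the mean price of the 3 houses nearest 2000 sq ft
```

With K = 3, the prediction is simply the average price of the three nearest houses, which is exactly the averaging step described above.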
Linear Regression:
Linear regression finds the coefficients that minimize the SSE (sum of squared errors) between the predicted and observed response values.
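A worked sketch of the idea, using NumPy's least-squares solver on toy numbers (not the project's actual fit):

```python
import numpy as np

# Design matrix with an intercept column plus toy square footages, and toy prices.
X = np.array([[1., 1100.], [1., 1400.], [1., 1700.], [1., 2000.]])
y = np.array([199000., 245000., 279000., 330000.])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # coefficients that minimize the SSE
residuals = y - X @ beta
sse = np.sum(residuals ** 2)                   # the quantity being minimized
print(beta, sse)
```

No other choice of intercept and slope can produce a smaller sum of squared residuals on this data, which is the defining property of the least-squares fit.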
RandomForest:
RandomForest models take advantage of decision trees. In the scenario of this project (the price of a house), we could imagine a decision tree starting with, "Does having two bathrooms push a house's price over 400K?" This would be the first node on which to split the data. We could continue with questions like, "Are houses with larger basements priced above or below 400K?" Each such question is another decision node on which to split the data, bringing us closer to an accurate prediction. The RandomForest algorithm builds many largely uncorrelated decision trees from random subsets of the data and features, and averages their predictions (IBM).
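The averaging behavior can be seen directly in scikit-learn (toy data; the features, `n_estimators`, and other settings here are arbitrary illustrations, not the project's configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy features: [full bathrooms, above-ground square feet] -> sale price.
X = np.array([[1, 900], [1, 1100], [2, 1400], [2, 1700], [3, 2000], [3, 2400]])
y = np.array([150000., 199000., 245000., 279000., 330000., 405000.])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = np.array([[2, 1800]])
tree_preds = [tree.predict(x_new)[0] for tree in rf.estimators_]
# The forest's prediction is simply the mean of its individual trees' predictions.
print(rf.predict(x_new)[0], np.mean(tree_preds))
```

Because each tree sees a different bootstrap sample of the data, the trees disagree somewhat, and averaging them reduces the variance of any single tree.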
Background Regarding Performance Metrics
R-Squared (R^2):
The R^2 value explains what percent (proportion) of the total variance in the data is explained by the model.
RMSE (Root Mean Squared Error):
Represents the typical size of the difference between observed and predicted values: the square root of the mean of the squared errors.
MAE (Mean Absolute Error):
Similar to RMSE, but the absolute value of the difference between real and predicted is used.
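All three metrics can be computed directly from a vector of predictions; a small NumPy sketch with made-up values:

```python
import numpy as np

y_true = np.array([208500., 181500., 223500., 140000.])   # observed prices (toy)
y_pred = np.array([200000., 190000., 215000., 150000.])   # model predictions (toy)

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                        # proportion of variance explained
print(rmse, mae, r2)
```

Note that squaring in RMSE penalizes large errors more heavily than MAE does, which is why the two can rank models differently.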
Results
Basic statistical analysis
Before beginning the machine learning analysis, I tested whether the variables of interest had a significant effect on the response variable.
Below are the results of the linear model fit with all numeric variables as predictors.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.604292e+05 | 1.413535e+06 | 0.3257288 | 0.7446773 |
MSSubClass | -1.820243e+02 | 2.767475e+01 | -6.5772690 | 0.0000000 |
LotFrontage | -5.634362e+01 | 5.176798e+01 | -1.0883875 | 0.2766081 |
LotArea | 4.304975e-01 | 1.021082e-01 | 4.2160918 | 0.0000264 |
OverallQual | 1.733111e+04 | 1.187394e+03 | 14.5959170 | 0.0000000 |
OverallCond | 4.674676e+03 | 1.032513e+03 | 4.5274745 | 0.0000065 |
YearBuilt | 2.694912e+02 | 6.741315e+01 | 3.9976056 | 0.0000673 |
YearRemodAdd | 1.344831e+02 | 6.858803e+01 | 1.9607366 | 0.0501043 |
MasVnrArea | 3.134596e+01 | 5.932820e+00 | 5.2834831 | 0.0000001 |
BsmtFinSF1 | 1.921300e+01 | 4.666827e+00 | 4.1169307 | 0.0000406 |
BsmtFinSF2 | 8.273997e+00 | 7.057001e+00 | 1.1724524 | 0.2412114 |
BsmtUnfSF | 9.297103e+00 | 4.193927e+00 | 2.2168013 | 0.0267940 |
TotalBsmtSF | NA | NA | NA | NA |
X1stFlrSF | 4.901342e+01 | 5.809589e+00 | 8.4366426 | 0.0000000 |
X2ndFlrSF | 4.902978e+01 | 4.983306e+00 | 9.8388062 | 0.0000000 |
LowQualFinSF | 2.534285e+01 | 1.996942e+01 | 1.2690830 | 0.2046187 |
GrLivArea | NA | NA | NA | NA |
BsmtFullBath | 9.369818e+03 | 2.611716e+03 | 3.5876105 | 0.0003450 |
BsmtHalfBath | 2.051814e+03 | 4.091012e+03 | 0.5015420 | 0.6160672 |
FullBath | 3.439741e+03 | 2.836717e+03 | 1.2125780 | 0.2254922 |
HalfBath | -1.872747e+03 | 2.662817e+03 | -0.7032955 | 0.4819865 |
BedroomAbvGr | -1.008636e+04 | 1.701690e+03 | -5.9272618 | 0.0000000 |
KitchenAbvGr | -1.215840e+04 | 5.211646e+03 | -2.3329299 | 0.0197904 |
TotRmsAbvGrd | 5.044090e+03 | 1.236951e+03 | 4.0778400 | 0.0000480 |
Fireplaces | 3.984870e+03 | 1.776709e+03 | 2.2428376 | 0.0250605 |
GarageYrBlt | 1.268380e+02 | 6.897832e+01 | 1.8388100 | 0.0661511 |
GarageCars | 1.129285e+04 | 2.876386e+03 | 3.9260550 | 0.0000905 |
GarageArea | -4.382456e+00 | 9.941118e+00 | -0.4408414 | 0.6593947 |
WoodDeckSF | 2.388005e+01 | 8.011714e+00 | 2.9806414 | 0.0029252 |
OpenPorchSF | -2.872111e+00 | 1.518148e+01 | -0.1891852 | 0.8499746 |
EnclosedPorch | 1.193628e+01 | 1.686386e+01 | 0.7078021 | 0.4791839 |
X3SsnPorch | 2.038201e+01 | 3.139056e+01 | 0.6493039 | 0.5162466 |
ScreenPorch | 5.596076e+01 | 1.719053e+01 | 3.2553247 | 0.0011592 |
PoolArea | -2.908211e+01 | 2.380658e+01 | -1.2215996 | 0.2220611 |
MiscVal | -7.313279e-01 | 1.854773e+00 | -0.3942951 | 0.6934221 |
MoSold | -4.856806e+01 | 3.447724e+02 | -0.1408699 | 0.8879926 |
YrSold | -7.796538e+02 | 7.024882e+02 | -1.1098461 | 0.2672526 |
In both the numeric and categorical linear regression tests, not all variables were significant. I decided to keep all variables in order to see whether they had any effect on the final results. (The NA coefficients for TotalBsmtSF and GrLivArea arise because those totals are exact linear combinations of other predictors, so the model drops them.)
Machine Learning Modeling
Model Performance Metrics
Table 2 displays the relevant metrics for the regression model runs.
Metric | LNM | KNN | RFM |
---|---|---|---|
RMSE | 54819.808 | 45321.318 | 37742.731 |
Rsquared | 0.612 | 0.646 | 0.754 |
MAE | 20989.778 | 24633.264 | 19650.826 |
The RandomForest model performed the best of the three, producing the strongest results by a considerable margin. The KNN model was second best in RMSE and R-squared, though closer to the linear model's results than to the RandomForest model's; its MAE was actually slightly higher than the linear model's.
Figure 1 shows the distribution of predicted values for each model, overlaid with the real observed values (OBS).
Figure 2 and Figure 3 provide additional views comparing RMSE and MAE across the different models.
Variables of Importance
Finally, we can examine which predictors improved prediction the most, that is, which carried the greatest weight in the final result.
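The project itself presumably extracted importances in R (e.g., via caret's varImp); the same idea in scikit-learn terms, on synthetic data with hypothetical column names, looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
# Hypothetical columns: "quality" and "area" drive the price; "noise" does not.
X = rng.normal(size=(n, 3))
y = 50000 * X[:, 0] + 30000 * X[:, 1] + rng.normal(scale=5000.0, size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(["quality", "area", "noise"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")   # the pure-noise column should rank last
```

Importances like these are what Figures 4 through 6 display for the fitted models.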
Figure 4 shows the Predictors that had the most impact on the Linear Regression Model
X2ndFlrSF is the variable for second-floor square feet (the X prefix is added because R column names cannot begin with a digit). BsmtFinSF1 represents finished basement area in square feet, and BsmtUnfSF represents unfinished basement area in square feet.
Figure 5 shows the Predictors that had the most impact on the KNN Model
OverallQual is an ordinal variable that rates the overall material and finish of the house on a scale from "Very Poor" to "Very Excellent". GrLivArea represents the total square feet of above-ground living area. TotalBsmtSF is the total basement square footage.
Figure 6 shows the Predictors that had the most impact on the RF Model
We can observe similar variables to those highlighted by the KNN model, with the Neighborhood variable ranking higher for the RandomForest model.
Discussion
Summary and Interpretation
The results generated in this project indicate that the chosen machine learning models carry real predictive power. While the performance metrics were not as strong as desired, these results are substantial given the relatively small size of the training data. In addition, the models identified which variables contributed the most predictive weight: overall quality and above-ground living area square footage appeared among the top predictors in all three models, and total basement square footage was common to both the RandomForest and KNN models.
Conclusions
I attempted to develop different regression models that accurately predict the prices of houses with minimal error. All three models showed substantial predictive power for determining the correct sale price of a house; however, the performance metrics left room for future improvement.
The other goal of this project was to identify which variables could best aid in identifying correct housing prices. While there was some variance among the predictors in terms of importance, some predictors clearly stood out above the rest. This information could be used as items of focus in future analyses of housing prices, as well as features of importance for real estate firms to prioritize.
Further research would include utilizing different datasets to determine if these results are reproducible. Specifically, using a larger dataset with more observations could provide additional insights as well as improve predictability. Using housing data for a different general location could also be useful in determining whether the results presented here are indicative of most areas or if they are subject to localized bias.
References
IBM. (n.d.). What is random forest?
Kuhn, M., & Johnson, K. (2018). Applied predictive modeling. Springer.