EDA | Building Advanced Regression Techniques to Predict House Price on the Ames Dataset🏠

A Real Estate Problem Analysis with Real-World Data

Learner CARES
13 min read · Apr 18, 2022

Part 1 — Data exploration (EDA)

Photo by Rihards Sergis on Unsplash

Introduction

We are going to find the best-performing model for the prediction of housing prices on the Ames dataset (data source: Kaggle). The dataset describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations and a large number of explanatory variables involved in assessing home values. This dataset gives us a chance to look into the data on what really influences the value of a house.

This is going to be a series of at least 3 to 4 parts. It could be more, but not less. Part 1 covers exploratory data analysis (EDA), Part 2 covers data preparation and pre-processing, whilst Parts 3 and 4 dive into the modeling.

The main objective is to provide a step-by-step guide to completing a real data science project, ending with a best-performing model that generalizes to unseen data.

Data exploration (EDA)

Exploratory Data Analysis (EDA) is an important step in any data analysis or machine learning project. EDA is the process of investigating the dataset to discover patterns, relationships, and outliers, and it gives us a chance to see what really influences the value of a house.

Andrew Andrade concisely describes EDA as follows.

The purpose of EDA is to use summary statistics and visualizations to better understand data, and find clues about the tendencies of the data, its quality and to formulate assumptions and the hypothesis of our analysis.

Table of contents

1 | Import libraries & download datasets

2 | Data overview

  • Dimension of train and test data
  • Numerical and categorical features
  • Unique column values
  • Cardinality (labels) of categorical columns
  • Check problematic categorical columns
  • Missing value analysis
  • In summary

3 | Statistical overview

3.1 | Descriptive statistics of train and test data

  • Check a five-number summary of training and testing data
  • Comparison of five-number summary between train and test data
  • In summary

3.2 | Distributions

  • Analysis of outcome feature (‘SalePrice’)
  • Check the distribution of all numerical independent features
  • In summary

3.3 | Relationships

  • Check correlation between the numerical features
  • Check heatmap of numerical features
  • Check ANOVA for all categorical features
  • Explore boxplot for all categorical features
  • Explore pair plots between dependent and independent features
  • Explore scatter plot
  • In summary

So, let’s get started 🧑👈🙏💪

1 | Import libraries & download datasets
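The original notebook code is not reproduced in the article, so here is a minimal sketch of the typical setup. The file names `train.csv` and `test.csv` are assumptions based on the usual Kaggle competition layout, not something stated in the article.

```python
import numpy as np
import pandas as pd


def load_data(train_path="train.csv", test_path="test.csv"):
    """Load the Kaggle train/test splits of the Ames dataset."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    return train, test
```

With the competition files in the working directory, `train, test = load_data()` yields the two frames whose dimensions are printed in section 2.1.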

2 | Data overview

2.1 | Dimension of train and test data

Back to Table of Contents

Training data set dimension : (1460, 81)
Testing data set dimension : (1459, 80)

2.2 | Numerical and categorical columns in training data

Back to Table of Contents

********************************************************************
Continuous features
********************************************************************
['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
********************************************************************
count of continuous features: 36
********************************************************************
********************************************************************
categorical features
********************************************************************
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
********************************************************************
count of categorical features: 43
********************************************************************
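The split above follows the column dtypes. A minimal sketch of how it can be produced is below; the helper function and the toy frame are illustrative assumptions, not the notebook's actual code (the real call would use the full Kaggle training frame).

```python
import pandas as pd


def split_features(df, target="SalePrice", id_col="Id"):
    """Separate numerical and categorical feature names by dtype."""
    feats = df.drop(columns=[c for c in (target, id_col) if c in df.columns])
    numerical = feats.select_dtypes(include="number").columns.tolist()
    categorical = feats.select_dtypes(include="object").columns.tolist()
    return numerical, categorical


# Toy example with two Ames columns:
toy = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600],
                    "MSZoning": ["RL", "RM"], "SalePrice": [208500, 181500]})
num, cat = split_features(toy)
print(num, cat)  # ['LotArea'] ['MSZoning']
```

Note that a dtype-based split treats nominal codes stored as integers (e.g. MSSubClass) as continuous, which is why MSSubClass appears in the continuous list above.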

2.3 | Unique column values

Back to Table of Contents

Only the Id column is unique.
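One way to check this is to look for columns where every value is distinct. The helper below is a sketch of that check on a toy frame, not the article's own code.

```python
import pandas as pd


def unique_columns(df):
    """Columns whose values are all distinct (candidate identifiers)."""
    return [c for c in df.columns if df[c].nunique(dropna=False) == len(df)]


toy = pd.DataFrame({"Id": [1, 2, 3], "MSZoning": ["RL", "RL", "RM"]})
print(unique_columns(toy))  # ['Id']
```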

2.4 | Cardinality of categorical columns

Back to Table of Contents

**************************************************
Training Testing *
**************************************************
Neighborhood 25 Neighborhood 25
Exterior2nd 16 Exterior2nd 16
Exterior1st 15 Exterior1st 14
SaleType 9 SaleType 10
Condition1 9 Condition1 9
Condition2 8 Condition2 5
HouseStyle 8 HouseStyle 7
RoofMatl 8 RoofMatl 4
Functional 7 Functional 8
BsmtFinType1 7 BsmtFinType1 7
GarageType 7 GarageType 7
BsmtFinType2 7 BsmtFinType2 7
RoofStyle 6 RoofStyle 6
Heating 6 Heating 4
SaleCondition 6 SaleCondition 6
Electrical 6 Electrical 4
FireplaceQu 6 FireplaceQu 6
GarageQual 6 GarageQual 5
GarageCond 6 GarageCond 6
Foundation 6 Foundation 6
MSZoning 5 MSZoning 6
****************************************************

Only the top cardinalities of the categorical columns in the training and testing datasets are shown above.
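A table like the one above can be built with `nunique`. The sketch below shows the idea on a toy frame (the real call would be made once on the training frame and once on the testing frame).

```python
import pandas as pd


def cardinality(df, top=20):
    """Number of distinct labels per categorical column, highest first."""
    cats = df.select_dtypes(include="object")
    return cats.nunique().sort_values(ascending=False).head(top)


toy = pd.DataFrame({"Neighborhood": ["NAmes", "CollgCr", "OldTown"],
                    "Street": ["Pave", "Pave", "Grvl"]})
print(cardinality(toy))
```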

2.5 | Check problematic categorical columns

Back to Table of Contents

36 categorical columns out of 43 are good columns. The remaining 7 are bad or problematic columns:
['MSZoning',
'Functional',
'SaleType',
'Exterior1st',
'Utilities',
'Exterior2nd',
'KitchenQual']
What is the problem?

MSZoning: train - {'C (all)', 'FV', 'RH', 'RL', 'RM'}, test - {'C (all)', 'FV', 'RH', 'RL', 'RM', nan}
Functional: train - {'Maj1', 'Maj2', 'Min1', 'Min2', 'Mod', 'Sev', 'Typ'}, test - {'Maj1', 'Maj2', 'Min1', 'Min2', 'Mod', 'Sev', 'Typ', nan}
SaleType: train - {'COD', 'CWD', 'Con', 'ConLD', 'ConLI', 'ConLw', 'New', 'Oth', 'WD'}, test - {'COD', 'CWD', 'Con', 'ConLD', 'ConLI', 'ConLw', 'New', 'Oth', 'WD', nan}
Exterior1st: train - {'AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd', 'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco', 'VinylSd', 'Wd Sdng', 'WdShing'}, test - {'AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd', 'HdBoard', 'MetalSd', 'Plywood', 'Stucco', 'VinylSd', 'Wd Sdng', 'WdShing', nan}
Utilities: train - {'AllPub', 'NoSeWa'}, test - {'AllPub', nan}
Exterior2nd: train - {'AsbShng', 'AsphShn', 'Brk Cmn', 'BrkFace', 'CBlock', 'CmentBd', 'HdBoard', 'ImStucc', 'MetalSd', 'Other', 'Plywood', 'Stone', 'Stucco', 'VinylSd', 'Wd Sdng', 'Wd Shng'}, test - {'AsbShng', 'AsphShn', 'Brk Cmn', 'BrkFace', 'CBlock', 'CmentBd', 'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco', 'VinylSd', 'Wd Sdng', 'Wd Shng', nan}
KitchenQual: train - {'Ex', 'Fa', 'Gd', 'TA'}, test - {'Ex', 'Fa', 'Gd', 'TA', nan}

The problem is due to missing values (nan) in the test dataset.
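Such a comparison of label sets can be sketched as below. The helper and its toy frames are assumptions made for illustration, not the notebook's code.

```python
import pandas as pd


def problematic_categoricals(train, test, cat_cols):
    """Categorical columns whose label sets differ between train and test."""
    bad = []
    for col in cat_cols:
        if set(train[col].unique()) != set(test[col].unique()):
            bad.append(col)
    return bad


tr = pd.DataFrame({"Utilities": ["AllPub", "NoSeWa"]})
te = pd.DataFrame({"Utilities": ["AllPub", None]})
print(problematic_categoricals(tr, te, ["Utilities"]))  # ['Utilities']
```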

2.6 | Missing value analysis

Back to Table of Contents

Missing values in the training dataset


A total of 19 features have missing values. PoolQC, MiscFeature, Alley, and Fence have more than 50% missing values in the training dataset.

Missing values in the testing dataset


A total of 33 features have missing values. PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have more than 50% missing values in the testing dataset.

Check differences in missing values in the dataset.


There is a difference in missing values in the train and test datasets.
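The missing-value tables above can be reproduced with a small helper like the one below; the function and the toy frame are illustrative assumptions.

```python
import pandas as pd


def missing_report(df):
    """Count and percentage of missing values per column, worst first."""
    n = df.isnull().sum()
    report = pd.DataFrame({"missing": n, "percent": 100 * n / len(df)})
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


toy = pd.DataFrame({"PoolQC": [None, None, "Ex"],
                    "LotArea": [8450, 9600, 11250]})
print(missing_report(toy))
```

Running it on both the training and testing frames and comparing the two reports reveals the differences summarized below.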

  • In summary — Data overview

Back to Table of Contents

We can conclude that:

  • There are 1460 instances of training data and 1459 instances of testing data.
  • The training data has 81 columns in total: 36 numerical features and 43 categorical features, plus Id and SalePrice.
  • Numerical columns: 1stFlrSF, 2ndFlrSF, 3SsnPorch, BedroomAbvGr, BsmtFinSF1, BsmtFinSF2, BsmtFullBath, BsmtHalfBath, BsmtUnfSF, EnclosedPorch, Fireplaces, FullBath, GarageArea, GarageCars, GarageYrBlt, GrLivArea, HalfBath, KitchenAbvGr, LotArea, LotFrontage, LowQualFinSF, MSSubClass, MasVnrArea, MiscVal, MoSold, OpenPorchSF, OverallCond, OverallQual, PoolArea, ScreenPorch, TotRmsAbvGrd, TotalBsmtSF, WoodDeckSF, YearBuilt, YearRemodAdd, YrSold
  • Categorical columns: Alley, BldgType, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, CentralAir, Condition1, Condition2, Electrical, ExterCond, ExterQual, Exterior1st, Exterior2nd, Fence, FireplaceQu, Foundation, Functional, GarageCond, GarageFinish, GarageQual, GarageType, Heating, HeatingQC, HouseStyle, KitchenQual, LandContour, LandSlope, LotConfig, LotShape, MSZoning, MasVnrType, MiscFeature, Neighborhood, PavedDrive, PoolQC, RoofMatl, RoofStyle, SaleCondition, SaleType, Street, Utilities
  • The highest-cardinality categorical column (Neighborhood) has 25 labels, and three columns (Neighborhood - 25, Exterior2nd - 16, and Exterior1st - 15) have more than 10.
  • Cardinality (labels) of 7 features (‘Utilities’, ‘KitchenQual’, ‘Functional’, ‘MSZoning’, ‘SaleType’, ‘Exterior1st’, ‘Exterior2nd’) differs between train and test data due to missing values. These problematic categorical columns will be fixed after missing-value imputation, after which all categorical columns can be safely encoded.
  • 19 features have missing values in the training dataset, and 4 of them (PoolQC, MiscFeature, Alley, and Fence) have over 50% missing values. 33 features have missing values in the testing dataset, and 5 of them (PoolQC, MiscFeature, Alley, Fence, and FireplaceQu) have over 50% missing values. The train and test datasets differ in their missing values. Most of the time, NA means the absence of the feature described by the attribute, such as no pool, no fence, no garage, or no basement.
  • There is one unique column, ‘Id’.

3 | Statistical overview

Back to Table of Contents

3.1 | Describe train and test data (Five-number summary of training data)

Statistical information can be viewed in the below table. For numerical parameters, fields like mean, standard deviation, percentiles, and maximum have been populated. This gives us a broad idea of our dataset.
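In pandas, this table comes from `describe()`, which returns the count, mean, std, and the five-number summary (min, 25%, 50%, 75%, max) of every numerical column. The toy prices below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"SalePrice": [34900, 129975, 163000, 214000, 755000]})
summary = toy["SalePrice"].describe()
print(summary[["min", "50%", "max"]])
```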

  • Describe train and test data (Five-number summary of test data)

3.1.2 | Comparison of five-number summary between train and test data

Back to Table of Contents

  • In summary — Descriptive statistics of train and test data

We can conclude that:

  • The table shows the important summary statistics of all the numerical variables: the mean, std, min, 25%, 50%, 75%, and max values.
  • The minimum sale price is larger than zero. The minimum, maximum, and average sale prices are 34900, 755000, and 180921 respectively.
  • There are cases where the minimum and maximum values differ between the two sets, which means the feature ranges in the test data may differ from those in the training data.
  • There is high variation in LotArea values (std 9981.26).
  • Many variables have a median value of 0.
  • The representative statistics of train and test are almost similar.

3.2 | Distributions

Back to Table of Contents

3.2.1 | Analysis of outcome feature (‘SalePrice’)

We checked the distribution plot, normal probability plot, and Skewness for the analysis of ‘SalePrice’.

A distribution plot is used to check how the data is distributed, and a normal probability plot is used to check whether ‘SalePrice’ follows a normal distribution. Skewness is a measure of a dataset’s symmetry: a perfectly symmetrical dataset has a skewness of 0, as does the normal distribution. Kurtosis measures tail weight and is compared to the kurtosis of the normal distribution, which equals 3. If the kurtosis is greater than 3, the dataset has heavier tails than a normal distribution; if it is less than 3, the tails are lighter.

Skewness: 1.882876
Kurtosis: 6.536282
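Both statistics can be computed directly in pandas. Note that pandas reports Fisher (excess) kurtosis, for which the normal distribution scores 0, so 3 is added below to match the "normal kurtosis equals 3" convention used above. The price series is made up for illustration.

```python
import pandas as pd

# A small, deliberately right-skewed toy series (one large outlier):
prices = pd.Series([100, 120, 130, 150, 160, 200, 900])
print("Skewness:", round(prices.skew(), 3))       # positive: right tail
print("Kurtosis:", round(prices.kurt() + 3, 3))   # pandas kurt is excess kurtosis
```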

3.2.2 | Distribution of all numerical independent features

Back to Table of Contents

We applied the Shapiro–Wilk test to all the numerical features to check whether they follow a normal distribution or not.


The result is False for all of them, meaning none of the numerical features follow a normal distribution.
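A sketch of how such a check can be run with scipy is below; the helper name, the significance level, and the synthetic columns are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def all_normal(df, alpha=0.05):
    """True only if every numerical column passes the Shapiro-Wilk test."""
    numeric = df.select_dtypes(include="number")
    return all(stats.shapiro(numeric[c].dropna())[1] > alpha
               for c in numeric.columns)


rng = np.random.default_rng(0)
toy = pd.DataFrame({"normal": rng.normal(size=200),
                    "skewed": rng.exponential(size=200)})
print(all_normal(toy))  # False: the exponential column is rejected
```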

  • In summary — Distribution

We can conclude that:

  • The distribution plot of SalePrice shows that it deviates from the normal distribution.
  • The probability plot also shows the non-normality for SalePrice.
  • The right-hand tail is larger than the left-hand tail, i.e., SalePrice has positive skewness.
  • Kurtosis > 3 shows heavier tails than a normal distribution.
  • None of the continuous features follow a normal distribution.

3.3 | Relationships

Back to Table of Contents

3.3.1 | Correlation between SalePrice and numerical features

We have already checked that none of the continuous features follow a normal distribution, so here we applied Spearman’s rank correlation, the nonparametric counterpart of the Pearson product-moment correlation, to check the strength and direction of association between the dependent variable (SalePrice) and the independent variables (all continuous features).


The above figure shows that GarageArea, YearBuilt, GarageCars, GrLivArea, and OverallQual are highly positively correlated with SalePrice.
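In pandas, Spearman correlation is a one-argument change to `corr`. The helper and the toy frame below are illustrative assumptions (the toy columns are constructed to be perfectly monotonic, hence the ±1.0 coefficients).

```python
import pandas as pd


def spearman_with_target(df, target="SalePrice"):
    """Spearman rank correlation of each numerical feature with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="spearman")[target].drop(target)
    return corr.sort_values(ascending=False)


toy = pd.DataFrame({"OverallQual": [3, 5, 6, 8, 9],
                    "EnclosedPorch": [50, 40, 30, 20, 0],
                    "SalePrice": [90, 140, 170, 250, 400]})
print(spearman_with_target(toy))
```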

3.3.2 | Check heatmap of continuous features

Back to Table of Contents

A heatmap (or heat map) is a graphical representation of data where values are depicted by color. Here we will generate the heatmap of the correlation matrix of continuous features + SalePrice.


Check the top 10 highly positively correlated features, if any.


There are high positive correlations between independent features: (GarageCars, GarageArea — 0.88), (GrLivArea, TotRmsAbvGrd — 0.83), (TotalBsmtSF, 1stFlrSF — 0.82). We can check the performance of the model by excluding one feature from each pair.

Features (OverallQual, SalePrice — 0.79), (GrLivArea, SalePrice — 0.71), (GarageCars, SalePrice — 0.64), (GarageArea, SalePrice — 0.62), (TotalBsmtSF, SalePrice — 0.61), and (1stFlrSF, SalePrice — 0.61) are positively correlated with SalePrice.

Check the top 10 highly negatively correlated features, if any.


There are no highly negatively correlated features between independent features or with SalePrice.
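Extracting the strongest pairs from a correlation matrix can be sketched without plotting. The helper below masks the diagonal and upper triangle so each pair appears once; the function and toy frame are illustrative assumptions, not the notebook's code.

```python
import numpy as np
import pandas as pd


def top_pairs(df, n=10):
    """Strongest pairwise correlations, excluding self-correlations."""
    corr = df.select_dtypes(include="number").corr()
    mask = np.triu(np.ones(corr.shape, dtype=bool))  # hide diagonal + upper half
    flat = corr.mask(mask).stack()                   # one entry per pair
    return flat.sort_values(ascending=False).head(n)


toy = pd.DataFrame({"GarageCars": [1, 2, 2, 3],
                    "GarageArea": [280, 460, 480, 640],
                    "YrSold": [2008, 2007, 2006, 2009]})
print(top_pairs(toy))
```

Sorting ascending instead gives the most negative pairs, which is how the "top 10 highly negatively correlated features" check above can be done.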

3.3.3 | Check ANOVA for all categorical features

Back to Table of Contents

One of the biggest challenges in model building is selecting the most reliable and useful features for training a model. ANOVA helps with this selection: by keeping only influential variables, it reduces the number of inputs and the complexity of the model. ANOVA tests whether the difference in mean SalePrice across the groups of a categorical variable is statistically significant, which indirectly reveals whether that independent variable influences the target.
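A hedged sketch of how such a test can be run per categorical column with scipy is below; the helper name and the toy data are invented for illustration.

```python
import pandas as pd
from scipy import stats


def anova_pvalue(df, cat_col, target="SalePrice"):
    """One-way ANOVA: does the mean target differ across the category's groups?"""
    groups = [g[target].values for _, g in df.groupby(cat_col)]
    return stats.f_oneway(*groups).pvalue


toy = pd.DataFrame({"ExterQual": ["Ex", "Ex", "TA", "TA", "Fa", "Fa"],
                    "SalePrice": [400, 420, 180, 190, 90, 100]})
print(anova_pvalue(toy, "ExterQual"))  # tiny p-value: group means differ
```

Ranking the categorical features by p-value is one way to obtain the "top 5 significant features" referred to below.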


3.3.4 | Explore boxplot for all categorical features

Back to Table of Contents

A boxplot helps us in visualizing the data in terms of quartiles. It also identifies outliers in the dataset, if any.
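The plots themselves are not reproduced here, but the group-wise quartiles a boxplot draws can be computed directly. The toy column values below are invented for illustration.

```python
import pandas as pd

# A boxplot per category is built from these group-wise quartiles.
toy = pd.DataFrame({"KitchenQual": ["Ex", "Ex", "TA", "TA", "TA"],
                    "SalePrice": [350, 450, 120, 150, 180]})
quartiles = toy.groupby("KitchenQual")["SalePrice"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```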


Here are boxplots of the top 5 significant features from the ANOVA.


3.3.5 | Explore pairplots between dependent and independent features

Back to Table of Contents


A pair plot gives us a reasonable idea of the variables’ relationships. We can see that there are both linear and non-linear relationships with SalePrice.

3.3.6 | Explore scatter plot

Back to Table of Contents

Scatter plot of ‘SalePrice’ versus GrLivArea


There are five observations that seem to be outliers. Three of them are true outliers (partial sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). The dataset’s author (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf) recommends removing any houses with more than 4000 square feet from the dataset, which eliminates these five unusual observations.
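That recommendation is a one-line filter; the helper and the toy areas/prices below are invented for illustration.

```python
import pandas as pd


def drop_large_houses(df, col="GrLivArea", limit=4000):
    """Apply De Cock's recommendation: drop houses above 4000 sq ft."""
    return df[df[col] <= limit].reset_index(drop=True)


toy = pd.DataFrame({"GrLivArea": [1500, 2400, 4676, 5642],
                    "SalePrice": [150000, 230000, 184750, 160000]})
print(drop_large_houses(toy).shape)  # (2, 2)
```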

  • In summary — Relationship

Back to Table of Contents

We can conclude that:

  • Spearman’s rank correlation test shows that the top five features GarageArea, YearBuilt, GarageCars, GrLivArea, and OverallQual are highly positively correlated with SalePrice.
  • The correlation heatmap shows high positive correlations between independent features: (GarageCars, GarageArea — 0.88), (GrLivArea, TotRmsAbvGrd — 0.83), (TotalBsmtSF, 1stFlrSF — 0.82). We can check the performance of the model by excluding one feature from each pair.
  • The correlation heatmap also shows that features (OverallQual, SalePrice — 0.79), (GrLivArea, SalePrice — 0.71), (GarageCars, SalePrice — 0.64), (GarageArea, SalePrice — 0.62), (TotalBsmtSF, SalePrice — 0.61), and (1stFlrSF, SalePrice — 0.61) are positively correlated with SalePrice. There are no highly negatively correlated features, either between independent features or with SalePrice.
  • The ANOVA analysis shows that the top five independent variables Neighborhood, ExterQual, BsmtQual, KitchenQual, and GarageFinish influence the target variable (SalePrice).
  • Boxplot suggests a difference between groups for Neighborhood, ExterQual, BsmtQual, KitchenQual, and GarageFinish.
  • The boxplots show that the mean ‘SalePrice’ is higher for ExterQual = Ex than for ExterQual (Gd, TA, Fa), higher for BsmtQual = Ex than for BsmtQual (Gd, TA, Fa), and higher for KitchenQual = Ex than for KitchenQual (Gd, TA, Fa).
  • The pair plot suggests there are linear and non-linear relationships with SalePrice.
  • The scatter plot of ‘SalePrice’ versus GrLivArea suggests five outliers; removing any houses with more than 4000 square feet from the dataset eliminates them.

Well, part 1 ends here. In this article, we did a pretty good analysis of Ames data. We understood how to explore data and note the key things before data preparation.

In part 2, we are going to focus on data preparation and processing.

I hope you guys have enjoyed reading it. Please share your thoughts/doubts in the comment section.

Many thanks for reading my kernel!🙏

Please leave a comment if you have any suggestions for improving the analysis!🏋🥇

If you liked 😊 my kernel, give 👍 LIKE!

All the code and datasets used in this article can be accessed from my Kaggle account.

Acknowledgments

Thanks to the authors of these popular kernels:

  1. https://www.kaggle.com/code/dgawlik/house-prices-eda/notebook
  2. https://www.kaggle.com/code/skirmer/fun-with-real-estate-data/report
  3. https://www.kaggle.com/code/vbmokin/eda-for-tabular-data-advanced-techniques/notebook
  4. https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook


Learner CARES

Data Scientist, Kaggle Expert (https://www.kaggle.com/itsmohammadshahid/code?scroll=true). Focusing on only one thing — To help people learn📚 🌱🎯️🏆