Handy Python Pandas for Handling Outliers
Data Cleaning & Data Preparation Series — stats.zscore(), plt.boxplot(), np.log()
You can download the Jupyter notebook and data of this tutorial here
Table of Contents
1. Introduction
2. Detecting Outliers
3. Handling Outliers
1. Introduction
Outliers are extreme values that lie far from the other data points and can significantly affect statistical analysis and its interpretation. They can arise from measurement errors, data entry errors, or simply the natural variability of the data. In this post, we will discuss how to detect and handle outliers using the pandas library in Python, working with the Ames House Price data.
2. Detecting Outliers
Before we can handle outliers, we first need to detect them. Common approaches include z-scores, box plots, and scatter plots. In pandas, the describe() method gives a summary of the dataset, including the mean, standard deviation, minimum and maximum values, and quartiles; a maximum that sits far above the upper quartile is a first hint of outliers. The plot() method can also generate box plots and scatter plots.
Let us consider an example where we have a dataset containing the prices of different houses. We can plot the data using a box plot as shown below:
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('https://raw.githubusercontent.com/learnercares/Python-for-Data-Science/main/AMES%20Housing%20Dataset.csv')
data.describe()
# Create a box plot
plt.boxplot(data['SalePrice'])
plt.show()
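By default, matplotlib's box plot draws as individual points any values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be computed directly to get the flagged rows. The sketch below uses a small made-up price series (not the Ames data) and hypothetical variable names, so treat it as an illustration rather than part of the original notebook.

```python
import pandas as pd

# Small synthetic price series (hypothetical values, not from the Ames data)
prices = pd.Series([120_000, 135_000, 150_000, 160_000,
                    175_000, 190_000, 210_000, 755_000])

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask of values outside the fences, then the flagged values
is_outlier = (prices < lower_fence) | (prices > upper_fence)
outliers = prices[is_outlier]
print(outliers)
```

On the real data, the same steps would be applied to data['SalePrice'] instead of the synthetic series.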
3. Handling Outliers
Once we have detected the outliers, we need to handle them. There are several ways to handle outliers, including:
3.1 Removing the outliers — This approach involves removing the outliers from the dataset. However, it can lead to loss of valuable information and bias the analysis, so it is worth considering alternatives such as replacing outliers with a more representative value, like the mean or median of the data. The code below removes every row whose price lies three or more standard deviations from the mean.
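import numpy as np
from scipy import stats
# Compute the z-score of each sale price
z_scores = stats.zscore(data['SalePrice'])
abs_z_scores = np.abs(z_scores)
# Keep only the rows within three standard deviations of the mean
filtered_entries = (abs_z_scores < 3)
new_data = data[filtered_entries]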
# Create a box plot
plt.boxplot(new_data['SalePrice'])
plt.show()
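The z-score filter can also be written with pandas alone, which makes the calculation explicit. This is a minimal sketch on synthetic data (the numbers and variable names are made up for illustration). One detail to watch: scipy's stats.zscore uses the population standard deviation (ddof=0) by default, while pandas' std() defaults to ddof=1, so ddof=0 is passed explicitly here to match.

```python
import numpy as np
import pandas as pd

# Synthetic data: 50 typical prices plus one extreme value (hypothetical numbers)
rng = np.random.default_rng(0)
typical = rng.normal(180_000, 20_000, size=50)
prices = pd.Series(np.append(typical, 900_000))

# Population z-score (ddof=0), matching the default of scipy.stats.zscore
z = (prices - prices.mean()) / prices.std(ddof=0)

# Keep the values within three standard deviations of the mean
kept = prices[z.abs() < 3]
print(len(prices), "->", len(kept))
```

Comparing the length before and after filtering is a quick way to see how many rows a given cutoff discards before committing to it.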
3.2 Transforming the data — This approach involves transforming the data so that the outliers have less of an impact on the analysis. One way to do this is by applying a log transformation to the data.
The following code applies a log transformation to the data to reduce the impact of the outliers.
import numpy as np
# Store the result in a new column so the original prices stay available
data['LogSalePrice'] = data['SalePrice'].apply(lambda x: np.log(x) if x > 0 else 0)
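A vectorized alternative is NumPy's log1p, which computes log(1 + x) and is therefore defined at zero without a special case; expm1 reverses it when the values are needed back on the original scale. The sketch below uses a few made-up prices, not the Ames data, and compares skewness before and after as a rough check that the long right tail has been pulled in.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one very large value
prices = pd.Series([120_000, 150_000, 175_000, 210_000, 755_000])

# log(1 + x): works element-wise on the whole Series
log_prices = np.log1p(prices)

# exp(x) - 1: maps the transformed values back to the original scale
restored = np.expm1(log_prices)

# Skewness before and after the transformation
print(prices.skew(), log_prices.skew())
```

Whether to use log or log1p is a judgment call; for strictly positive prices the two give nearly identical results.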
3.3 Treating outliers as missing values — This approach involves treating the outliers as missing values and using imputation techniques to replace them. This can be done using methods such as mean imputation, median imputation, or k-nearest neighbor imputation.
# Replace outliers with the median value
median = data['SalePrice'].median()
data.loc[data['SalePrice'] > 500000, 'SalePrice'] = median
In the code above, we replace any sale price above 500,000 with the median of the SalePrice column.
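To follow the "missing values" idea more literally, the outliers can first be set to NaN and then imputed. A side effect is that the median is computed from the remaining values only, so the outliers do not influence it. This is a sketch on a small made-up DataFrame; the column name SalePrice_imputed and the variable names are hypothetical, and the 500,000 cutoff is simply the one used above.

```python
import pandas as pd

# Hypothetical data, not the Ames dataset
df = pd.DataFrame({"SalePrice": [120_000, 150_000, 175_000, 210_000, 755_000]})

threshold = 500_000  # same cutoff as in the text above

# Step 1: mark outliers as missing (values above the threshold become NaN)
cleaned = df["SalePrice"].mask(df["SalePrice"] > threshold)

# Step 2: impute the missing values with the median of the remaining prices
median_without_outliers = cleaned.median()
df["SalePrice_imputed"] = cleaned.fillna(median_without_outliers)
print(df)
```

Once the outliers are NaN, the fillna step could be swapped for mean imputation or a k-nearest neighbor imputer without changing the first step.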
Conclusion
Outliers can significantly affect the results of an analysis, so handling them is an essential step. In this post, we discussed how to handle outliers using pandas. We first identified outliers in the dataset using summary statistics and visualization techniques such as box plots. We then discussed three ways to handle them: removing them, transforming the data, and replacing them with a more representative value such as the median. Handling outliers carefully can help improve the accuracy of statistical analysis and machine learning models.
Many thanks for reading this post!🙏.