Handy Python Pandas for Handling Outliers

4 min read · Feb 27, 2023

Data Cleaning & Data Preparation Series — stats.zscore(), plt.boxplot(), np.log()

You can download the Jupyter notebook and data of this tutorial here

Table of Contents
1. Introduction
2. Detecting Outliers
3. Handling Outliers

1. Introduction

Outliers are extreme values that lie far from the bulk of the data and can significantly affect statistical analysis and its interpretation. They can arise from measurement errors, data entry errors, or simply the natural variability of the data. In this post, we will discuss how to detect and handle outliers using the pandas library in Python, with the Ames House Price data as the running example.

2. Detecting Outliers

Before we can handle outliers, we first need to detect them. Common approaches include the z-score, box plots, and scatter plots. In pandas, the describe() function summarizes the numeric columns of a dataset: count, mean, standard deviation, minimum and maximum values, and quartiles. A maximum that sits far above the 75% quartile is a first hint of outliers. We can also use the plot() function, for example with kind='box' or kind='scatter', to generate box plots and scatter plots.

Let us consider an example where we have a dataset containing the prices of different houses. We can plot the data using a box plot as shown below:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('https://raw.githubusercontent.com/learnercares/Python-for-Data-Science/main/AMES%20Housing%20Dataset.csv')

data.describe()


# Create a box plot
plt.boxplot(data['SalePrice'])
plt.show()
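The points that a box plot draws beyond its whiskers can also be listed numerically. Below is a minimal sketch of the 1.5 × IQR rule that matplotlib's boxplot uses by default for its whiskers. The toy `prices` series is hypothetical and only stands in for data['SalePrice'], so the example runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical toy prices standing in for data['SalePrice']
prices = pd.Series(
    [120000, 135000, 150000, 160000, 175000, 190000, 210000, 755000],
    name="SalePrice",
)

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are the ones a box plot draws as individual points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outlier_mask = (prices < lower_fence) | (prices > upper_fence)
outliers = prices[outlier_mask]
print(outliers)
```

On the real data, replacing `prices` with data['SalePrice'] should give the same points that appear above the upper whisker in the plot.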

3. Handling Outliers

Once we have detected the outliers, we need to handle them. There are several ways to handle outliers, including:

3.1 Removing the outliers: This approach drops the outlying rows from the dataset. It is simple, but it can discard valuable information and bias the analysis, so it is worth weighing against the alternatives described later in this section, such as transforming the data or replacing outliers with a more representative value like the mean or median. The code below keeps only the rows whose price lies within three standard deviations of the mean.

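import numpy as np
from scipy import stats

# Compute the z-score of every sale price
z_scores = stats.zscore(data['SalePrice'])
abs_z_scores = np.abs(z_scores)

# Keep only the rows within 3 standard deviations of the mean
filtered_entries = (abs_z_scores < 3)
new_data = data[filtered_entries]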

# Create a box plot
plt.boxplot(new_data['SalePrice'])
plt.show()
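To see the z-score filter at work without the full dataset, here is a small self-contained sketch. The `toy` DataFrame is synthetic (200 typical prices plus two deliberately extreme ones) and is an assumption for illustration only. The z-score is computed directly in pandas with ddof=0, which matches the default of stats.zscore().

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data: 200 typical prices + 2 extreme ones
rng = np.random.default_rng(0)
typical = rng.normal(loc=180_000, scale=30_000, size=200)
toy = pd.DataFrame({"SalePrice": np.append(typical, [600_000, 700_000])})

# z-score: distance from the mean measured in standard deviations
col = toy["SalePrice"]
z = (col - col.mean()) / col.std(ddof=0)  # ddof=0 matches scipy.stats.zscore

# Keep only rows within 3 standard deviations of the mean
keep = z.abs() < 3
filtered = toy[keep]
print(f"rows before: {len(toy)}, rows after: {len(filtered)}")
```

Note that the extreme values themselves inflate the mean and standard deviation, so on very small samples a |z| < 3 rule may never flag anything.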

3.2 Transforming the data: This approach rescales the data so that the outliers have less of an impact on the analysis. One common choice is a log transformation, which compresses large values much more than small ones.

The following code applies a natural-log transformation to SalePrice; any non-positive value is mapped to 0.

import numpy as np
data['SalePrice'] = data['SalePrice'].apply(lambda x: np.log(x) if x > 0 else 0)
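One way to check what the transformation buys us is to compare the skewness before and after. The sketch below uses np.log1p() (the log of 1 + x) as a swapped-in variant of the lambda above, since it handles zeros without a special case. The toy `prices` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with one very large value
prices = pd.Series([120000, 135000, 150000, 160000, 175000, 190000, 210000, 755000])

# log1p(x) = log(1 + x); defined at x = 0, unlike log(x)
log_prices = np.log1p(prices)

# Skewness shrinks because the large value is pulled towards the rest
print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# expm1 undoes log1p, so results can be reported in the original units
restored = np.expm1(log_prices)
```

Keeping the transformed values in a new column rather than overwriting SalePrice makes it easier to go back to the original scale later.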

3.3 Treating outliers as missing values: This approach treats the outliers as missing values and replaces them using imputation techniques such as mean imputation, median imputation, or k-nearest neighbor imputation.

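# Replace outliers with the median value
# (assumes SalePrice is still on its original scale, i.e. before any log transform)
median = data['SalePrice'].median()
data.loc[data['SalePrice'] > 500000, 'SalePrice'] = median

In the code above, we replace any SalePrice above 500,000 with the median value of the SalePrice column.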
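The two steps named in 3.3, marking outliers as missing and then imputing them, can also be written out explicitly. The following is a minimal sketch on a hypothetical toy DataFrame. It reuses the 500,000 cut-off from above and, as a deliberate variation, computes the median only from the values that were not flagged.

```python
import pandas as pd

# Hypothetical toy data standing in for the housing dataset
toy = pd.DataFrame(
    {"SalePrice": [120000, 135000, 150000, 160000, 175000, 190000, 210000, 755000]}
)

threshold = 500_000  # same cut-off as in the snippet above

# Step 1: treat outliers as missing (values above the threshold become NaN)
flagged = toy["SalePrice"].where(toy["SalePrice"] <= threshold)

# Step 2: impute with the median of the remaining values (NaN is skipped)
median_of_rest = flagged.median()
toy["SalePrice_imputed"] = flagged.fillna(median_of_rest)
print(toy)
```

Writing the result to a separate column keeps the original values available for comparison; swapping median() for mean() gives mean imputation.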

Conclusion

Handling outliers is an essential step in data analysis, as they can significantly impact the results. In this post, we discussed how to handle outliers using pandas. We first identified outliers in the dataset using summary statistics and visualization techniques such as box plots. We then discussed different ways to handle them: removing them with a z-score filter, transforming the data with a log transformation, and replacing them with a more representative value such as the median. Handled carefully, outliers distort statistical analyses and machine learning models far less.

Many thanks for reading this post!🙏.

If you have a moment, I encourage you to see my other posts below:

Handy Python Pandas for Data Cleaning and Preprocessing
Data Cleaning & Data Preparation Series: Commonly used methods

Handy Pandas Python Library for Handling Missing Values
Data Cleaning Series: isna(), isnull(), notnull(), dropna(), fillna(0), replace()

Handy Python Pandas for Removing Duplicates, Reformatting Data, Renaming, and Reordering Columns
Data Cleaning & Data Preparation Series: drop_duplicates, to_datetime(), strftime, apply(), rename()

Handy Python Pandas for Data Filtering
Data Cleaning & Data Preparation Series: df.query(), df.loc(row label, column label), df.iloc(integer row index…

Handy Python Pandas for Data Normalization and Scaling
Data Cleaning & Data Preparation Series: from sklearn.preprocessing, scaler=MinMaxScaler()…

Handy Python Pandas for Data Aggregation
Data Cleaning & Data Preparation Series: df.groupby(), df.pivot(), df.melt()

EDA | Building Advanced Regression Techniques to Predict House Price on the Ames Dataset🏠
A Real Estate Problem Analysis with Real-World Data

Written by Learner CARES
