Handy Python Pandas for Handling Missing Values

Learner CARES
4 min read · Feb 20, 2023


Data Cleaning & Data Preparation Series: isna(), isnull(), notnull(), dropna(), fillna(), replace()


Data is often collected from multiple sources and can contain missing (null) values. Missing values can occur in any dataset, for reasons such as data entry errors, incomplete records, or intentional omissions. They are a common problem in data analysis because they can lead to incorrect results, so it is important to handle them before any further analysis.

In this series, we will explore how to handle missing values using Pandas, a popular data manipulation library in Python. Pandas provides several methods for working with missing data: isna(), isnull(), notnull(), dropna(), fillna(), and replace(). These methods let you detect missing values, fill (impute) them with a specified value, or remove the rows or columns that contain them.

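Pandas represents missing values in numeric columns with NaN (Not a Number), a special floating-point value. None and NaT (the missing marker for datetimes) are also treated as missing. Because NaN is a float, a column of integers that contains a missing value is stored as float, which is why Age and Salary appear as 25.0 and 50000.0 in the outputs in this article. When working with data that contains missing values, it is essential to identify and handle them properly.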
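To make this representation concrete, here is a minimal sketch. The variable names ages and names are made up for this illustration and are not used elsewhere in the article.

```python
import numpy as np
import pandas as pd

# A numeric column given None: pandas stores the gap as NaN
# and the column ends up with a float dtype.
ages = pd.Series([25, 33, None, 28])
print(ages.dtype)
print(ages.isna().tolist())

# NaN never compares equal to itself, so detect it with isna(), not ==.
print(np.nan == np.nan)

# A missing entry in a text column is detected by isna() as well.
names = pd.Series(['John', None, 'Mary'])
print(names.isna().tolist())
```

The comparison on the third print is the reason the detection methods exist: an equality check against NaN cannot find missing values.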

1. Identify Missing Values

The first step in handling missing values is to find them. For this we can use the isna(), isnull(), or notnull() methods in Pandas. isnull() is an alias of isna(), so the two behave identically. Each returns a Boolean DataFrame of the same shape as the original, with True wherever a value is missing. The notnull() method returns the opposite: True wherever a value is present.

import pandas as pd

# create a DataFrame with missing values (None marks a missing entry)
data = {'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'], 'Age': [25, 33, None, 28, 29, 35], 'Salary': [50000, None, 60000, 55000, 70000, 65000]}

df = pd.DataFrame(data)

# check for missing values
print(df.isnull())

Output:
Name Age Salary
0 False False False
1 False False True
2 False True False
3 False False False
4 False False False
5 False False False

print(df.notnull())

Output:
Name Age Salary
0 True True True
1 True True False
2 True False True
3 True True True
4 True True True
5 True True True

The output shows that the Salary value in the second row (index 1) and the Age value in the third row (index 2) are missing.
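On a larger DataFrame, reading the full Boolean table is impractical. A common follow-up, sketched below with the same data as above, is to count the missing values per column and to pull out only the affected rows. The names missing_per_column and rows_with_missing are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'],
    'Age': [25, 33, None, 28, 29, 35],
    'Salary': [50000, None, 60000, 55000, 70000, 65000],
})

# True counts as 1 when summed, so this is the number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select only the rows that have at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here the count is 0 for Name and 1 each for Age and Salary, and the selected rows are those at index 1 and 2.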

2. Dropping Missing Values

One way to handle missing values is to drop the rows or columns that contain them, using the dropna() method in Pandas. By default, it drops every row that contains at least one missing value and returns a new DataFrame, leaving the original unchanged.

# drop rows with missing values
df_dropped = df.dropna() # Remove any row that contains missing values

print(df_dropped)

Output:

Name Age Salary
0 John 25.0 50000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0

The output shows that rows 1 and 2, which have a missing value in the Salary and Age columns respectively, have been dropped.

We can also drop columns that contain missing values by setting the axis parameter to 1.

# drop columns with missing values
df_dropped = df.dropna(axis=1)

print(df_dropped)

Output:
Name
0 John
1 Joseph
2 Mary
3 Mark
4 David
5 Mike

The output shows that the Age and Salary columns, which contain missing values, have been dropped.
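Dropping every row or column with any gap can discard a lot of usable data. dropna() also takes the subset, how, and thresh parameters to narrow what gets removed. The sketch below reuses the same data, and the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'],
    'Age': [25, 33, None, 28, 29, 35],
    'Salary': [50000, None, 60000, 55000, 70000, 65000],
})

# Only consider the Age column when deciding which rows to drop.
age_known = df.dropna(subset=['Age'])
print(age_known)

# Drop a row only if every value in it is missing (none are, in this data).
not_all_missing = df.dropna(how='all')
print(len(not_all_missing))

# Keep rows that have at least 3 non-missing values.
complete_rows = df.dropna(thresh=3)
print(complete_rows)
```

With subset=['Age'], Joseph's row is kept even though his Salary is missing, because only Age is checked.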

3. Filling or Replacing Missing Values

If your dataset has a large number of missing values, dropping them may discard too much data, and it can be more appropriate to replace them with other values. Pandas provides fillna() and replace() for this. The fillna() method replaces missing values with a specified value, such as a constant or a column statistic. The replace() method replaces any specified values with other values.

# Replace missing values with a specified value
df_filled = df.fillna(0) # Replace missing values with 0

print(df_filled)

Output:
Name Age Salary
0 John 25.0 50000.0
1 Joseph 33.0 0.0
2 Mary 0.0 60000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0

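# Replace missing values in each numeric column with that column's mean
# (a single number such as df['Age'].mean() would be written into every column, Salary included)
df_filled = df.fillna(df.mean(numeric_only=True))

print(df_filled)

Output:
Name Age Salary
0 John 25.0 50000.0
1 Joseph 33.0 60000.0
2 Mary 30.0 60000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0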
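fillna() also accepts a dictionary that maps column names to fill values, which helps when each column needs a different treatment. Another option is ffill(), which carries the previous row's value forward. The sketch below shows both on the same data. The choice of the median for Age and 0 for Salary is only an assumption for illustration, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'],
    'Age': [25, 33, None, 28, 29, 35],
    'Salary': [50000, None, 60000, 55000, 70000, 65000],
})

# A different fill value per column: median for Age, 0 for Salary.
filled = df.fillna({'Age': df['Age'].median(), 'Salary': 0})
print(filled)

# Forward fill: each gap takes the value from the row above it.
carried = df.ffill()
print(carried)
```

Forward filling mostly makes sense for ordered data such as time series. For unordered records like these, it simply copies a neighbouring person's value.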

# Replace specified values with other values
df_replace = df.replace({'Mary':'Johns'}) # Replace Mary with Johns

print(df_replace)

Output:
Name Age Salary
0 John 25.0 50000.0
1 Joseph 33.0 NaN
2 Johns NaN 60000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0
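Note that replace() leaves the NaN values untouched in the output above. Where it becomes useful for missing data is when a source marks gaps with placeholders such as -999 or 'N/A' instead of leaving them empty. A minimal sketch follows. The DataFrame raw, its City column, and the placeholder values are hypothetical and were invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where gaps are encoded as -999 and 'N/A'.
raw = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary'],
    'Age': [25, -999, 31],
    'City': ['Delhi', 'N/A', 'Pune'],
})

# Turn the placeholders into real missing values so that
# isna(), dropna(), and fillna() can see them.
cleaned = raw.replace({-999: np.nan, 'N/A': np.nan})
print(cleaned.isna().sum())
```

After this step the placeholders are counted as missing, and the methods from the earlier sections apply to them as usual.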

In conclusion, handling missing values is an important part of data analysis and modeling. By using appropriate techniques to handle missing values, we can obtain more accurate and reliable results from our analyses.

All the code used in this article can be accessed from my GitHub account.

Many thanks for reading my post!🙏.

If you found this content helpful😊, please LIKE 👍, SHARE, and FOLLOW to stay updated on our future posts.

Written by Learner CARES

Data Scientist, Kaggle Expert (https://www.kaggle.com/itsmohammadshahid/code?scroll=true). Focusing on only one thing — To help people learn📚 🌱🎯️🏆
