Handy Python Pandas for Handling Missing Values
Data Cleaning & Data Preparation Series: isna(), isnull(), notnull(), dropna(), fillna(), replace()
Data is often collected from multiple sources and can contain missing or null values. Missing values can occur in any dataset for reasons such as data entry errors, incomplete records, or intentional omissions. They are a common problem in data analysis because they can lead to incorrect results, so it is important to handle them before further analysis.
In this series, we will explore how to handle missing values using Pandas, a popular data manipulation library in Python. It provides several functions for working with missing data: isna(), isnull(), notnull(), dropna(), fillna(), and replace().
These functions allow you to detect missing values in your data, impute missing values with a specified value, or remove rows or columns containing missing values.
Pandas represents missing values in numeric data with NaN (Not a Number), a special floating-point value. A Python None placed in a numeric column is converted to NaN, which is why such columns are displayed as floats. When working with data that contains missing values, it is essential to identify and handle them properly.
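As a small illustration of this representation, the sketch below shows a None in a numeric Series being stored as NaN, with the column promoted to a floating-point dtype. It also shows that datetime data uses NaT as its missing marker. The variable names are only for demonstration.

```python
import numpy as np
import pandas as pd

# A numeric column with a missing entry: None is stored as NaN,
# so the integers are promoted to float64.
ages = pd.Series([25, 33, None])
print(ages.dtype)
print(ages.isna().tolist())

# NaN never compares equal to itself, so detect it with isna() instead of ==.
print(np.nan == np.nan)

# Datetime data uses NaT (Not a Time) as its missing marker.
dates = pd.to_datetime(pd.Series(['2023-01-01', None]))
print(dates.isna().tolist())
```

Because NaN is not equal to itself, a comparison such as `ages == np.nan` will not find missing entries, which is why the detection methods in the next section are needed.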
1. Identifying Missing Values
The first step in handling missing values is to find them. To do so, we can use the isna(), isnull(), or notnull() methods in Pandas. isna() and isnull() are aliases: each returns a Boolean DataFrame of the same shape as the original, with True wherever a value is missing. The notnull() method returns the opposite of isnull().
import pandas as pd
# create a DataFrame with missing values
data = {'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'], 'Age': [25, 33, None, 28, 29, 35], 'Salary': [50000, None, 60000, 55000, 70000, 65000]}
df = pd.DataFrame(data)
# check for missing values
print(df.isnull())
Output:
Name Age Salary
0 False False False
1 False False True
2 False True False
3 False False False
4 False False False
5 False False False
print(df.notnull())
Output:
Name Age Salary
0 True True True
1 True True False
2 True False True
3 True True True
4 True True True
5 True True True
The output shows that the third row (index 2) of the Age column and the second row (index 1) of the Salary column contain missing values.
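On larger datasets, printing the full Boolean table is rarely practical. A minimal sketch of two common summaries, using the same example data (the variable names missing_per_column and rows_with_missing are illustrative), might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'],
    'Age': [25, 33, None, 28, 29, 35],
    'Salary': [50000, None, 60000, 55000, 70000, 65000],
})

# Count the missing values in each column (True counts as 1 when summed)
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select only the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here the counts are 0 for Name and 1 each for Age and Salary, and the selected rows are those at index 1 and 2.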
2. Dropping Missing Values
One way to handle missing values is to drop the rows or columns that contain them, using the dropna() method in Pandas. By default, it drops every row that contains at least one missing value.
# drop rows with missing values
df_dropped = df.dropna() # Remove any row that contains missing values
print(df_dropped)
Output:
Name Age Salary
0 John 25.0 50000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0
The output shows that the rows with missing values in the Age and Salary columns (index 1 and 2) have been dropped. The remaining rows keep their original index labels.
We can also drop columns that contain missing values by setting the axis parameter to 1.
# drop columns with missing values
df_dropped = df.dropna(axis=1)
print(df_dropped)
Output:
Name
0 John
1 Joseph
2 Mary
3 Mark
4 David
5 Mike
The output shows that the Age and Salary columns, which contain missing values, have been dropped.
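Dropping every row or column with any gap can discard a lot of data. dropna() also accepts parameters that narrow what gets dropped. The sketch below rebuilds the same example DataFrame so it can run on its own, then illustrates subset, how, and thresh. The result variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'],
    'Age': [25, 33, None, 28, 29, 35],
    'Salary': [50000, None, 60000, 55000, 70000, 65000],
})

# Drop a row only when 'Age' is missing; gaps in other columns are ignored
age_known = df.dropna(subset=['Age'])
print(age_known)

# Drop a row only when every value in it is missing (none here, so nothing is dropped)
not_all_missing = df.dropna(how='all')
print(len(not_all_missing))

# Keep only rows that have at least 3 non-missing values
complete_enough = df.dropna(thresh=3)
print(complete_enough)
```

Note that dropna() returns a new DataFrame and leaves df unchanged, so the result has to be assigned to a variable as above.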
3. Filling or Replacing Missing Values
If your dataset has a large number of missing values, dropping them can discard too much data, so it may be more appropriate to replace them with other values. Pandas provides several methods for this, including fillna() and replace(). The fillna() method replaces missing values with a specified value, which can be a single value for the whole DataFrame or a separate value per column. The replace() method replaces specified values with other values.
# Replace missing values with a specified value
df_filled = df.fillna(0) # Replace missing values with 0
print(df_filled)
Output:
Name Age Salary
0 John 25.0 50000.0
1 Joseph 33.0 0.0
2 Mary 0.0 60000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0
# Replace missing values in each numeric column with that column's mean.
# Passing a single number such as df['Age'].mean() would also fill Salary with the Age mean.
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
Output:
Name Age Salary
0 John 25.0 50000.0
1 Joseph 33.0 60000.0
2 Mary 30.0 60000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0
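Besides a constant or a column statistic, missing values can be filled from neighboring rows. The following sketch uses the ffill() and bfill() methods for this, and also passes a dictionary to fillna() to choose a different fill value per column. To my knowledge, recent pandas versions deprecate the method argument of fillna() in favor of these dedicated methods. The choice of the median for Age and 0 for Salary is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Joseph', 'Mary', 'Mark', 'David', 'Mike'],
    'Age': [25, 33, None, 28, 29, 35],
    'Salary': [50000, None, 60000, 55000, 70000, 65000],
})

# Forward fill: copy the last valid value above each gap
forward = df.ffill()
print(forward)

# Backward fill: copy the next valid value below each gap
backward = df.bfill()
print(backward)

# Per-column fill values: median for Age, 0 for Salary (illustrative choices)
per_column = df.fillna({'Age': df['Age'].median(), 'Salary': 0})
print(per_column)
```

Forward and backward fill are mostly meaningful when row order carries information, for example in time series. For unordered records like this example, a column statistic is usually easier to justify.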
# Replace specified values with other values
df_replace = df.replace({'Mary':'Johns'}) # Replace Mary with Johns
print(df_replace)
Output:
Name Age Salary
0 John 25.0 50000.0
1 Joseph 33.0 NaN
2 Johns NaN 60000.0
3 Mark 28.0 55000.0
4 David 29.0 70000.0
5 Mike 35.0 65000.0
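In the context of missing data, replace() is often useful in the other direction: turning placeholder values into NaN so that the methods above can detect them. The sketch below assumes a hypothetical raw dataset in which 'unknown' and -1 were used as placeholders. Neither the data nor the placeholder values come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where 'unknown' and -1 stand in for missing entries
raw = pd.DataFrame({
    'Name': ['John', 'unknown', 'Mary'],
    'Age': [25, -1, 41],
})

# Convert the placeholders to NaN, one column at a time
cleaned = raw.copy()
cleaned['Name'] = cleaned['Name'].replace('unknown', np.nan)
cleaned['Age'] = cleaned['Age'].replace(-1, np.nan)
print(cleaned.isna().sum())

# replace() can also fill NaN itself, similar to fillna()
age_filled = cleaned['Age'].replace(np.nan, 0)
print(age_filled.tolist())
```

Once the placeholders are converted, isna(), dropna(), and fillna() treat them like any other missing value.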
In conclusion, handling missing values is an important part of data analysis and modeling. By using appropriate techniques to handle missing values, we can obtain more accurate and reliable results from our analyses.
All the code used in this article can be accessed from my GitHub account.