Handy Python Pandas for Data Aggregation
Data Cleaning & Data Preparation Series — df.groupby(), df.pivot_table(), df.melt()
You can download the Jupyter notebook and data of this tutorial here
Table of Contents
1. Introduction
2. Grouping Data using groupby() function
3. Grouping Data using pivot_table()
4. Reshaping Data using melt(), stack(), and unstack()
- Introduction
In today’s world, data is being generated at an unprecedented rate. From social media platforms to e-commerce websites, there is an abundance of data available that can help organizations make better decisions. However, with so much data, it can be challenging to make sense of it all. This is where pandas come in.
Pandas is a powerful library in Python that is designed for data manipulation and analysis. It is built on top of NumPy and provides an efficient way to work with structured data. Pandas is ideal for tasks such as cleaning, filtering, and aggregating data, and in this post, I will be using the AMES House Price data.
What is Data Aggregation?
Aggregation is the process of combining data from multiple sources into a single summary. It is a critical step in data analysis because it helps us to get a better understanding of the data we are working with. It is the process of summarizing, grouping, and condensing data.
The goal of data aggregation is to gain insights from large datasets and to make the data more manageable. There are many ways to aggregate data in Pandas, including:
2. Using the Groupby Function
Grouping is the process of dividing a dataset into smaller subsets based on one or more variables. It is a common method used to explore and analyze data.
The groupby() function in pandas is incredibly flexible and can be used to group data by one or more columns. We can use the groupby() function to group data by one or more variables, and then apply various aggregation functions to the groups.
2.1 Aggregating Data by Single Column using groupby()
First, We import the Pandas library, and read the dataset into a DataFrame.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/learnercares/Python-for-Data-Science/main/AMES%20Housing%20Dataset.csv')
df.head()
df_agg = df.groupby('YrSold')['SalePrice'].mean()
print(df_agg)
2.2 Aggregating Data by Multiple Columns
In some cases, you may want to group the data by multiple columns. This can be done by passing a list of column names to the groupby() function.
df_agg = df.groupby(['YrSold','Street','SaleCondition'])['SalePrice'].mean()
print(df_agg)
This will group the data by both YrSold and Street and then calculate the sum of the SalePrice for each combination of YrSold and Street.
2.3 Applying Multiple Functions to Grouped Data
In addition to applying a single function to each group, you can also apply multiple functions. This can be done using the agg() function. For example:
df_agg = df.groupby(['YrSold','SaleType'])['SalePrice'].agg(['min','max','mean','sum'])
print(df_agg)
This will group the data by ‘YrSold’, ’SaleType’ and then calculate the ‘min’, ‘max’, ‘mean’, ‘sum’ SaleType for each ‘YrSold’ and ‘SaleType’.
3. Pivot Tables
Pivot table is a powerful data analysis tool in Pandas that allows you to summarize and manipulate data in a tabular format. It takes a DataFrame as input and groups the data based on one or more columns, applies a set of aggregate functions to the data, and produces a new table of summary statistics.
Pivot tables help to quickly and easily perform complex data analysis tasks, such as data aggregation, filtering, and sorting. They are widely used in data analysis and data science projects.
Pandas provides a pivot_table() function that makes creating pivot tables easy. We can use the pivot_table() function to create a table that summarizes data by one or more variables, and then apply aggregation functions to the groups.
We will explore how to create pivot tables in Pandas and apply aggregation functions such as mean(), sum(), and count().
3.1 Aggregating Data by Single Column using Pivot
# creating pivot table of total Sale Price
# Year Sold-wise aggfunc = 'mean' will
# allow you to obtain the average of SalePrice for
# each YrSold
pivot = df.pivot_table(index =['YrSold'], # Categorical variable
values =['SalePrice'],
aggfunc =['mean'])
print(pivot)
3.2 Aggregating Data by Multiple Columns using pivot_table()
# creating pivot table of total Sale Price
# Year Sold-wise aggfunc = 'sum' will
# allow you to obtain the sumof SalePrice for
# each YrSold and SaleType
pivot = df.pivot_table(index =['YrSold', 'SaleType'], # Categorical variable
values =['SalePrice'],
aggfunc ='sum')
print(pivot)
3.3 Applying Multiple Functions to Grouped Data using pivot_table()
# creating pivot table of total Sale Price
# Year Sold-wise aggfunc = ['min','max','mean', 'sum'] will
# allow you to obtain the min, max, mean, sum of SalePrice for
# each YrSold, SaleType
pivot = df.pivot_table(index =['YrSold','SaleType'], # Categorical variable
values =['SalePrice'],
aggfunc =['min','max','mean', 'sum'])
print(pivot)
4. Reshaping Data
Reshaping is the process of converting data from one shape to another. It is a common method used to prepare data for analysis.
Pandas provides many functions for reshaping data, including melt(), stack(), and unstack(). These functions allow us to reshape data from a long format to a wide format or vice versa.
4.1 melt()
The melt() function takes a DataFrame and reshapes it so that one or more columns are “melted” into a single column, with the values of the original columns being stored in a new “value” column. The function can be used to unpivot a DataFrame and reshape it into a format suitable for further analysis or visualization.
# reshape data using melt function
df_melted = pd.melt(df, id_vars=['YrSold'], value_vars=['1stFlrSF', '2ndFlrSF'],
var_name='FlrSF', value_name='SF')
# print melted data
print(df_melted)
4.2 stack()
The stack function in pandas is used to transform the DataFrame from a wide format to a long format. It takes a level of column labels and pivots them to become the innermost level of row labels, resulting in a stacked DataFrame. This function is used to make the DataFrame more manageable and to perform operations like aggregating data or plotting data using seaborn or matplotlib.
Example 1
df_FlrSF = df.loc[:,['YrSold','1stFlrSF','1stFlrSF']]
# Reshaping the data using stack function
df_stack = df_FlrSF.set_index('YrSold').stack().reset_index(name='SF').rename(columns={'level_1':'FlrSF'})
#print(df)
print(df_stack)
Example 2
# create a sample dataframe
data = {'Name': ['John', 'Jane', 'Peter', 'Mary'],
'Maths': [90, 80, 70, 60],
'Science': [85, 75, 65, 55],
'English': [80, 95, 85, 75]}
df1 = pd.DataFrame(data)
# stack the dataframe
stacked = df1.set_index('Name').stack().reset_index()
stacked.columns = ['Name', 'Subject', 'Score']
# group the data by subject and calculate the mean score
grouped = stacked.groupby('Subject')['Score'].mean().reset_index()
print("\nRaw Data")
print(df)
print("\nStack Data")
print(grouped)
4.3 Unstack
The unstack function in pandas is a method used to convert a stacked DataFrame into a pivoted DataFrame. The stacked DataFrame contains data in a hierarchical index format, where one or more levels of the index are used to group the data.
It is used to reverse this process by moving the stacked levels into columns. This creates a pivoted DataFrame with a regular column index that is easier to work with and analyze.
It takes the level parameter, which specifies which level of the index to unstack. It can also be used with multiple levels of the index to create a more complex pivoted DataFrame.
Example 1
df_FlrSF = df.loc[:,['YrSold','1stFlrSF','1stFlrSF']]
# Reshaping the data using unstack function
df_unstack = df_FlrSF.set_index('YrSold').unstack().reset_index(name='FlrSF').rename(columns={'level_0': 'SF'})
print(df_unstack)
Example 2
# Unstack
# Creating a sample data frame
data = {'year': [2010, 2011, 2012],
'USA': [100, 200, 300],
'Canada': [150, 250, 350]}
df1 = pd.DataFrame(data)
# Reshaping the data using unstack function
df_unstack = df1.set_index('year').unstack().reset_index(name='sales').rename(columns={'level_0': 'country'})
print("Raw Data")
print(df)
print("\nUnstack Data")
print(df_unstack)
Conclusion
Data aggregation is a crucial step in data analysis and pandas provides a powerful tool for aggregating data in Python. With the help of pandas, we can group and summarize data based on different criteria, compute summary statistics, apply custom functions, and generate reports in various formats. Overall, pandas is a versatile and efficient library for data aggregation that can greatly improve our data analysis workflows.
These are just a few examples of commonly used methods in pandas for data aggregation. The choice of methods to be used depends on the type of data and the analysis to be performed.
Many thanks for reading this post!🙏.
If you found this content helpful😊, please LIKE 👍, SHARE, and FOLLOW to stay updated on our future posts.
If you have a moment, I encourage you to see my other kernels below: