Handy Python Pandas for Merging and Joining Data
Data Cleaning & Data Preparation Series — pd.merge(df1,df2,on='key', how='inner'), pd.merge(df1,df2,on='key', how='outer'), pd.merge(df1,df2,on='key', how='left'), pd.merge(df1,df2,on='key', how='right')
You can download the Jupyter notebook and data of this tutorial here
Table of Contents
1. Introduction
2. Types of merging — 'inner', 'outer', 'left', 'right'
3. Joining
4. Concatenation
- Introduction
In today’s world, data is often scattered across multiple sources, such as databases, CSV files, and spreadsheets. Combining data from these disparate sources can be a time-consuming and error-prone task. However, merging and joining data can provide significant benefits, such as:
- Providing a more complete and accurate view of the data by combining data from different sources.
- Simplifying data analysis by allowing data to be easily compared and aggregated across multiple sources.
- Improving data quality by identifying and resolving data inconsistencies and duplicates.
Merging: Merging is the process of combining two or more data sets based on a common key or column. This can be achieved using the 'merge’
function from the Pandas library.
Joining: Joining is similar to merging, but it is a specific type of merging that combines two or more data sets based on their index rather than a common key or column. This can be achieved using the 'join’
method from the Pandas library.
Concatenation: Combining data from multiple sources along a given axis (either vertically or horizontally). This can be achieved using the 'concat’
function from the Pandas library.
2. Types of merging — 'inner', 'outer', 'left', 'right'
There are several different types of merging operations that can be performed using Pandas, including:
- Inner join: Combining data based on matching values in specified columns, only including rows where the match exists in both data sets.
First, we import the pandas library and create the dataframes ('customers’, 'orders’, 'products’, 'sales’,
and 'costs’
) for the example:
import pandas as pd
# create customers dataset
customers = pd.DataFrame({'customer_id': ['C1', 'C2', 'C3', 'C4', 'C5'],
'customer_name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'customer_email': ['alice@gmail.com', 'bob@gmail.com', 'charlie@gmail.com', 'david@gmail.com', 'emma@gmail.com']})
# create orders dataset
orders = pd.DataFrame({'order_id': ['O1', 'O2', 'O3', 'O4', 'O5', 'O6'],
'product_id': ['P1', 'P2', 'P3', 'P4', 'P5','P6'],
'customer_id': ['C1', 'C1', 'C2', 'C3', 'C4', 'C5'],
'order_date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
'order_total': [100, 200, 150, 50, 75, 300]})
# create products dataset
products = pd.DataFrame({'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
'product_name': ['Product A', 'Product B', 'Product C', 'Product D', 'Product E'],
'cost_per_unit': [10.0, 20.0, 30.0, 40.0,50.0]})
# create sales dataset
sales = pd.DataFrame({'order_id': ['O1', 'O2', 'O3', 'O4', 'O5'],
'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
'units_sold': [100, 200, 150, 50, 75],
'revenue': [1000.0, 4000.0, 4500.0, 2000.0, 3750.0]})
# create costs dataset
costs = pd.DataFrame({'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
'cost_per_unit': [10.0, 20.0, 30.0, 40.0,50.0]})
inner join keeps only the rows where the 'customer_id’
column exists in both the 'customers’
and 'orders’
dataframes.
# merge customers and orders datasets on customer_id
customer_orders = pd.merge(customers, orders, on='customer_id', how='inner')
# print the resulting dataset
pd.DataFrame(customer_orders)
# merge customer_orders and products datasets on product_id
final_data = pd.merge(customer_orders, products, on='product_id', how='inner')
# print the resulting dataset
pd.DataFrame(final_data)
- Outer join: Combining data based on matching values in specified columns, including all rows from both data sets and filling in missing values where there is no match.
# perform an outer join on sales and costs datasets
sales_costs = pd.merge(sales, costs, on='product_id', how='outer')
# print the resulting dataset
pd.DataFrame(sales_costs)
- Left join: Combining data based on matching values in specified columns, including all rows from the left data set and filling in missing values where there is no match in the right data set.
# perform a left join on sales and costs datasets
sales_costs = pd.merge(sales, costs, on='product_id', how='left')
# print the resulting dataset
pd.DataFrame(sales_costs)
As we can see, the resulting 'sales_costs'
dataset contains all the rows from the 'sales'
dataset, with the matching rows from the 'costs'
dataset included where available.
- Right join: Combining data based on matching values in specified columns, including all rows from the right data set and filling in missing values where there is no match in the left data set.
# perform a right join on sales and costs datasets
costs_sales = pd.merge(costs, sales, on='product_id', how='right')
# print the resulting dataset
pd.DataFrame(costs_sales)
As we can see, the resulting 'costs_sales'
dataset contains all the rows from the 'costs'
dataset, with the matching rows from the 'sales'
dataset included where available.
3. Joining :
The .join() method in Pandas to join the datasets.
# join customers and orders datasets based on customer_id
customer_orders = customers.join(orders.set_index('customer_id'), on='customer_id')
# join customer_orders and sales datasets based on order_id and product_id
sales_data = customer_orders.join(sales.set_index(['order_id', 'product_id']), on=['order_id', 'product_id'])
# join sales_data and costs datasets based on product_id
sales_data = sales_data.join(costs.set_index('product_id'), on='product_id')
pd.DataFrame(sales_data)
In this example, we first used the .set_index() method to set the index of the orders and sales datasets to the customer_id and order_id and product_id columns, respectively. This allows us to join the datasets based on these columns without having to specify them explicitly.
We then used the .join() method to join the customers and orders datasets based on the customer_id column, the customer_orders and sales datasets based on the order_id and product_id columns, and so on.
Finally, we have a new dataset named sales_data that contains all the columns from the original datasets joined together. We can print the resulting dataset using pd.DataFrame(sales_data).
4. Concatenation:
Combining data from multiple sources along a given axis (either vertically or horizontally). This can be achieved using the 'concat’
function from the Pandas library.
# concatenate customers, orders, products, sales, and costs datasets
sales_data = pd.concat([customers, orders, products, sales, costs])
# reset the index of the resulting dataset
sales_data = sales_data.reset_index(drop=True)# print the resulting dataset
pd.DataFrame(sales_data)
Many thanks for reading this post!🙏.
If you found this content helpful😊, please LIKE 👍, SHARE, and FOLLOW to stay updated on our future posts.
If you have a moment, I encourage you to see my other kernels below: