Handy Python Pandas for Merging and Joining Data

6 min readApr 4, 2023

Data Cleaning & Data Preparation Series — pd.merge(df1,df2,on='key', how='inner'), pd.merge(df1,df2,on='key', how='outer'), pd.merge(df1,df2,on='key', how='left'), pd.merge(df1,df2,on='key', how='right')

You can download the Jupyter notebook and data of this tutorial here

Table of Contents
1. Introduction
2. Types of merging — 'inner', 'outer', 'left', 'right'
3. Joining
4. Concatenation

Introduction

In today’s world, data is often scattered across multiple sources, such as databases, CSV files, and spreadsheets. Combining data from these disparate sources can be a time-consuming and error-prone task. However, merging and joining data can provide significant benefits, such as:

Providing a more complete and accurate view of the data by combining data from different sources.
Simplifying data analysis by allowing data to be easily compared and aggregated across multiple sources.
Improving data quality by identifying and resolving data inconsistencies and duplicates.

Merging: Merging is the process of combining two or more data sets based on a common key or column. This can be achieved using the 'merge’function from the Pandas library.

Joining: Joining is similar to merging, but it is a specific type of merging that combines two or more data sets based on their index rather than a common key or column. This can be achieved using the 'join’ method from the Pandas library.

Concatenation: Combining data from multiple sources along a given axis (either vertically or horizontally). This can be achieved using the 'concat’function from the Pandas library.

2. Types of merging — 'inner', 'outer', 'left', 'right'

There are several different types of merging operations that can be performed using Pandas, including:

Inner join: Combining data based on matching values in specified columns, only including rows where the match exists in both data sets.

First, we import the pandas library and create the dataframes ('customers’, 'orders’, 'products’, 'sales’, and 'costs’) for the example:

import pandas as pd

# create customers dataset
customers = pd.DataFrame({'customer_id': ['C1', 'C2', 'C3', 'C4', 'C5'],
                          'customer_name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
                          'customer_email': ['alice@gmail.com', 'bob@gmail.com', 'charlie@gmail.com', 'david@gmail.com', 'emma@gmail.com']})

# create orders dataset
orders = pd.DataFrame({'order_id': ['O1', 'O2', 'O3', 'O4', 'O5', 'O6'],
                       'product_id': ['P1', 'P2', 'P3', 'P4', 'P5','P6'],
                       'customer_id': ['C1', 'C1', 'C2', 'C3', 'C4', 'C5'],
                       'order_date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
                       'order_total': [100, 200, 150, 50, 75, 300]})

# create products dataset
products = pd.DataFrame({'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
                         'product_name': ['Product A', 'Product B', 'Product C', 'Product D', 'Product E'],
                         'cost_per_unit': [10.0, 20.0, 30.0, 40.0,50.0]})

# create sales dataset
sales = pd.DataFrame({'order_id': ['O1', 'O2', 'O3', 'O4', 'O5'],
                      'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
                      'units_sold': [100, 200, 150, 50, 75],
                      'revenue': [1000.0, 4000.0, 4500.0, 2000.0, 3750.0]})

# create costs dataset
costs = pd.DataFrame({'product_id': ['P1', 'P2', 'P3', 'P4', 'P5'],
                      'cost_per_unit': [10.0, 20.0, 30.0, 40.0,50.0]})

inner join keeps only the rows where the 'customer_id’ column exists in both the 'customers’ and 'orders’ dataframes.

# merge customers and orders datasets on customer_id
customer_orders = pd.merge(customers, orders, on='customer_id', how='inner')

# print the resulting dataset
pd.DataFrame(customer_orders)

# merge customer_orders and products datasets on product_id
final_data = pd.merge(customer_orders, products, on='product_id', how='inner')

# print the resulting dataset
pd.DataFrame(final_data)

Outer join: Combining data based on matching values in specified columns, including all rows from both data sets and filling in missing values where there is no match.

# perform an outer join on sales and costs datasets
sales_costs = pd.merge(sales, costs, on='product_id', how='outer')

# print the resulting dataset
pd.DataFrame(sales_costs)

Left join: Combining data based on matching values in specified columns, including all rows from the left data set and filling in missing values where there is no match in the right data set.

# perform a left join on sales and costs datasets
sales_costs = pd.merge(sales, costs, on='product_id', how='left')

# print the resulting dataset
pd.DataFrame(sales_costs)

As we can see, the resulting 'sales_costs' dataset contains all the rows from the 'sales' dataset, with the matching rows from the 'costs' dataset included where available.

Right join: Combining data based on matching values in specified columns, including all rows from the right data set and filling in missing values where there is no match in the left data set.

# perform a right join on sales and costs datasets
costs_sales = pd.merge(costs, sales, on='product_id', how='right')

# print the resulting dataset
pd.DataFrame(costs_sales)

As we can see, the resulting 'costs_sales' dataset contains all the rows from the 'costs' dataset, with the matching rows from the 'sales' dataset included where available.

3. Joining :

The .join() method in Pandas to join the datasets.

# join customers and orders datasets based on customer_id
customer_orders = customers.join(orders.set_index('customer_id'), on='customer_id')

# join customer_orders and sales datasets based on order_id and product_id
sales_data = customer_orders.join(sales.set_index(['order_id', 'product_id']), on=['order_id', 'product_id'])

# join sales_data and costs datasets based on product_id
sales_data = sales_data.join(costs.set_index('product_id'), on='product_id')

pd.DataFrame(sales_data)

In this example, we first used the .set_index() method to set the index of the orders and sales datasets to the customer_id and order_id and product_id columns, respectively. This allows us to join the datasets based on these columns without having to specify them explicitly.

We then used the .join() method to join the customers and orders datasets based on the customer_id column, the customer_orders and sales datasets based on the order_id and product_id columns, and so on.

Finally, we have a new dataset named sales_data that contains all the columns from the original datasets joined together. We can print the resulting dataset using pd.DataFrame(sales_data).

4. Concatenation:

Combining data from multiple sources along a given axis (either vertically or horizontally). This can be achieved using the 'concat’function from the Pandas library.

# concatenate customers, orders, products, sales, and costs datasets
sales_data = pd.concat([customers, orders, products, sales, costs])

# reset the index of the resulting dataset
sales_data = sales_data.reset_index(drop=True)# print the resulting dataset
pd.DataFrame(sales_data)

Many thanks for reading this post!🙏.

If you found this content helpful😊, please LIKE 👍, SHARE, and FOLLOW to stay updated on our future posts.

If you have a moment, I encourage you to see my other kernels below:

Handy Python Pandas for Data Cleaning and Preprocessing

Data Cleaning & Data Preparation Series — Commonly used methods

learner-cares.medium.com

EDA | AMES Housing Price Prediction 🏠

Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression…

www.kaggle.com

Top 15 Free Data Sources for DATA SCIENCE Projects 💥🏆🥇

“Continuous learning is the minimum requirement for success in any field🎯.” by Dennis Waitley✨

learner-cares.medium.com

Zero-Cost Learning: Free Resources for Data Science Beginners and Professionals🧑‍🚀

“Education is the most powerful weapon which you can use to change the world.”— Nelson Mandela✍️

learner-cares.medium.com

Handy Python Pandas for Merging and Joining Data

Handy Python Pandas for Data Cleaning and Preprocessing

Data Cleaning & Data Preparation Series — Commonly used methods

EDA | AMES Housing Price Prediction 🏠

Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression…

Top 15 Free Data Sources for DATA SCIENCE Projects 💥🏆🥇

“Continuous learning is the minimum requirement for success in any field🎯.” by Dennis Waitley✨

Zero-Cost Learning: Free Resources for Data Science Beginners and Professionals🧑‍🚀

“Education is the most powerful weapon which you can use to change the world.”— Nelson Mandela✍️

Written by Learner CARES