🎖️ The Promotion Game: Predict Who Will Win 👑

The Signs You’re on Track for a Promotion 📈💯

Learner CARES
18 min read · Nov 30, 2023
Image by storyset on Freepik

Table of Contents

1 | Challenge: What problem are we aiming to solve?
2 | Abstract
3 | About Dataset, and Import Dataset
· Import Dataset
· Check categorical and numerical features
4 | Key Findings in Statistical Analysis
4.1 | Descriptive Statistics
4.2 | Check and Handle Missing Values
4.3 | Mann-Whitney U Test and Correlation Analysis
4.4 | Chi-Square Test for Independence
4.5 | Summary of Statistical Analysis
5 | Key Findings in Visualization
5.1 | Count Plots and Pie Charts for Single Categorical Feature
5.2 | Cat Plots for Multiple Features
5.3 | Summary of Visualization
6 | Data Preprocessing
6.1 | Define Pipeline for Data Preprocessing
7 | Machine Learning Models
7.1 | Logistic Regression, Random Forests, XGBoost, Neural Networks
7.2 | Summary of Machine Learning Models
8 | Final Model, and Interpretation
· Predict whether a potential promotee at checkpoint in the test set will be promoted or not

1 | Challenge: What problem are we aiming to solve?

  • One challenge is identifying suitable candidates to promote, particularly for positions below the manager level, and preparing them for promotion on time.
  • Promotions are announced only after the assessment, causing a delay in the transition to new roles.
  • The company therefore needs help finding the right people at a specific checkpoint to speed up the overall promotion process.

2 | Abstract


Purpose: Develop a predictive model for identifying suitable candidates for job promotion within the organization. Our main focus is to utilize employee data to understand factors influencing employee promotions and build a model for accurate predictions.

Methodology: Use comprehensive employee data including attributes like performance ratings, training history, demographics, and past promotions. Conduct statistical analysis and visualization to explore relationships between variables. Preprocess data to enhance predictive power. Employ machine learning techniques to build a predictive model based on factors.

Results: A robust model was developed that accurately predicts suitable candidates for promotion. XGBoost outperformed the other models with higher accuracy, the best ROC AUC score, and the highest F1 score, indicating better overall performance and class-separation ability. Influential variables such as previous performance ratings, training history, and tenure were highlighted in predicting promotions.

Conclusions: The developed model accurately identifies employees likely to be promoted, aiding in efficient talent management. Insights into key factors influencing promotions can guide strategic decisions in talent development and organizational growth.

3 | About Dataset, and Import Dataset

Dataset includes a wide range of employee-related information, including demographics, performance history, training, and promotion status.

Features:

employee_id: Unique ID for employee

department: Department of employee

region: Region of employment (unordered)

education: Education Level

gender: Gender of Employee

recruitment_channel: Channel of recruitment for employee

no_of_trainings: Number of other trainings completed in the previous year (soft skills, technical skills, etc.)

age: Age of Employee

previous_year_rating: Employee Rating for the previous year

length_of_service: Length of service in years

awards_won?: 1 if awards were won during the previous year, else 0

avg_training_score: Average score in current training evaluations

is_promoted: (Target) Recommended for promotion

Import Libraries

import os
import platform
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import itertools
import seaborn as sns
from copy import copy
import pickle
import sklearn
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
from platform import python_version

sns.set(rc={"axes.facecolor":"#f4f0bb","figure.facecolor":"#f4f0bb"})
sns.set_context("poster",font_scale = .5)

palette = ['#fd7f6f', '#7eb0d5', '#b2e061', '#bd7ebe', '#ffb55a', '#ffee65', '#beb9db', '#fdcce5', '#8bd3c7']
plt.show()

Import Dataset

# Note: GitHub "blob" URLs return an HTML page; append ?raw=true so pandas reads the raw CSV
train = pd.read_csv('https://github.com/Muhd-Shahid/ML/blob/main/train.csv?raw=true')
test = pd.read_csv('https://github.com/Muhd-Shahid/ML/blob/main/test.csv?raw=true')

Check categorical and numerical features

# For categorical
cat = [i for i in train.columns if train.dtypes[i]=='object']
print("Categorical Features:",cat)

# For Numerical
num = [i for i in train.columns if train.dtypes[i]!='object']
print("Continuous or Numerical Features:",num)

Based on the above observations, we define our categorical, numerical, and ID/target variables.

cat_columns = ["department","region","education","gender","recruitment_channel","no_of_trainings","awards_won?"]
num_columns = ["age","previous_year_rating","length_of_service","avg_training_score"]
id_dep_columns = ["employee_id","is_promoted"]
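
As a quick sanity check (a small sketch, not in the original notebook, assuming the column names above match the CSV exactly), we can confirm the three lists jointly cover every column of the training data:

# Sanity check: the categorical, numerical, and ID/target lists should jointly
# cover all training columns, with nothing duplicated or left out
assert set(cat_columns + num_columns + id_dep_columns) == set(train.columns)
assert len(cat_columns + num_columns + id_dep_columns) == train.shape[1]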

4 | Key Findings in Statistical Analysis

4.1 | Descriptive Statistics

Begin by understanding the dataset’s summary statistics (range, mean, median, and distribution) for numerical variables like age, no_of_trainings, length_of_service, and avg_training_score.

train.describe()
Image by Author

Observation on training dataset:

  • Count: There are 54,808 observations in the training dataset.
  • no_of_trainings: On average, employees have undergone around 1.25 training sessions, with a minimum of 1 and a maximum of 10.
  • age: The average age of employees is approximately 34.80 years, with a minimum age of 20 and a maximum age of 60.
  • previous_year_rating: The average rating from the previous year is 3.33, with a minimum rating of 1 and a maximum rating of 5.
  • length_of_service: The average length of service is approximately 5.87 years, with a minimum of 1 year and a maximum of 37 years.
  • awards_won?: A small percentage of employees (2.32%) have won awards (values are 0 or 1).
  • avg_training_score: The average training score is around 63.39, with a minimum score of 39 and a maximum score of 99.
  • is_promoted: Approximately 8.52% of employees have been promoted (values are 0 or 1).
test.describe()
Image by Author

Observation on testing dataset:

  • Count: There are 23,490 observations in the testing dataset.
  • no_of_trainings: On average, employees have undergone around 1.25 training sessions, with a minimum of 1 and a maximum of 9.
  • age: The average age of employees is approximately 34.78 years, with a minimum age of 20 and a maximum age of 60.
  • previous_year_rating: The average rating from the previous year is 3.34, with a minimum rating of 1 and a maximum rating of 5.
  • length_of_service: The average length of service is approximately 5.81 years, with a minimum of 1 year and a maximum of 34 years.
  • awards_won?: A small percentage of employees (2.28%) have won awards (values are 0 or 1).
  • avg_training_score: The average training score is around 63.26, with a minimum score of 39 and a maximum score of 99.

Overall observation:

  • The two datasets are similar in terms of the variables they contain.
  • Both datasets have similar average values for key features like age, previous_year_rating, length_of_service, awards_won?, and avg_training_score.
  • The length_of_service has a maximum of 37 years in the training dataset, while it is 34 years in the testing dataset.
  • 70% of the data (54,808/78,298) is used as the training dataset, and 30% (23,490/78,298) for testing.

Note: It’s crucial to ensure that the testing dataset is representative of the training dataset to build a reliable predictive model. Understanding these similarities and differences helps in preparing and preprocessing the data appropriately for model training and evaluation.
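
To make that comparison concrete, here is a small sketch (not in the original post) that places the train and test summary statistics side by side:

# Compare train/test summary statistics side by side to confirm the test set
# looks representative of the training set
num_feats = ['no_of_trainings', 'age', 'previous_year_rating',
             'length_of_service', 'avg_training_score']
comparison = pd.concat({'train': train[num_feats].describe().T,
                        'test': test[num_feats].describe().T}, axis=1)
print(comparison[[('train', 'mean'), ('test', 'mean'),
                  ('train', 'max'), ('test', 'max')]])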

4.2 | Check and Handle Missing Values

There could be several reasons for missing values in the “Education” and “Previous_year_rating” variables.

For Education:

  • Non-Reporting: Some employees might not have provided their educational details during data collection or entry.
  • Educational Background: In some cases, certain roles or positions might not require specific educational qualifications, leading to non-disclosure.

However, I did not find any specific reason for the missingness in the education variable.

Suggested Solution:

  • Creating a new label Unknown to represent missing values in the education variable.

For Previous_year_rating:

  • New Employees: Ratings are likely missing for new employees who have not yet completed a full year of service and therefore have no previous year’s rating.

I reviewed both datasets (train and test) and noticed that previous_year_rating values are missing specifically for individuals with a length_of_service of 1.

Suggested Solution:

  • Assigning a Default Value: Since new employees don’t have a previous year’s rating, we assign a default value of 0 to indicate the absence of a rating.

Check missing values in the training dataset

train.isnull().sum()
Image by Author

Check missing values in the testing dataset

test.isnull().sum()
Image by Author
train_df = train.copy()
test_df = test.copy()

Missing value imputation for training dataset

from sklearn.impute import SimpleImputer

# Impute 'Education' column with 'Unknown'
edu_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
train_df['education'] = edu_imputer.fit_transform(train_df[['education']])

# Impute 'Previous_Year_Rating' column with 0
rating_imputer = SimpleImputer(strategy='constant', fill_value=0)
train_df['previous_year_rating'] = rating_imputer.fit_transform(train_df[['previous_year_rating']])

train_df.isnull().sum()
train_df.head()

Missing value imputation for testing dataset

from sklearn.impute import SimpleImputer

# Impute 'Education' column with 'Unknown'
edu_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
test_df['education'] = edu_imputer.fit_transform(test_df[['education']])

# Impute 'Previous_Year_Rating' column with 0
rating_imputer = SimpleImputer(strategy='constant', fill_value=0)
test_df['previous_year_rating'] = rating_imputer.fit_transform(test_df[['previous_year_rating']])

test_df.isnull().sum()

4.3 | Mann-Whitney U Test and Correlation Analysis


Mann-Whitney U Test

The Mann-Whitney U-test is used to compare two independent groups when the data is either ordinal or not normally distributed.

How it Works:

  • Null Hypothesis: The two groups come from the same distribution (no difference between them).
  • Decision: If the p-value is less than the chosen significance level (commonly 0.05), you reject the null hypothesis and conclude that there is a significant difference between the two groups.

Here, we check for a significant difference between the promoted and non-promoted groups (is_promoted) for each of age, previous_year_rating, length_of_service, and avg_training_score.

train_df[num_columns].head()

First, we visually inspect the difference between the two groups (promoted vs. non-promoted) using violin plots of age, previous_year_rating, length_of_service, and avg_training_score.

df = pd.melt(train_df, id_vars='is_promoted',
             value_vars=['age', 'length_of_service', 'avg_training_score', 'previous_year_rating'],
             var_name='Features', value_name='Value')

# Define custom colors for the two promotion classes
custom_colors = {0: "Red", 1: "Blue"}

print("Let's have a look at the distribution of the numerical features by is_promoted in the training dataset:")
plt.subplots(figsize=(15, 8))
cp = sns.violinplot(x=df["Features"], y=df["Value"], hue=df['is_promoted'],
                    palette=custom_colors, saturation=1, linewidth=2)

cp.axes.set_xlabel("", fontsize=18)
cp.axes.set_ylabel("", fontsize=18)
cp.axes.set_xticklabels(cp.get_xticklabels(), rotation=0)

sns.despine(left=True, bottom=True)
plt.show()
Image by Author

Then, we test for a significant difference between the two groups using a non-parametric test for each of age, previous_year_rating, length_of_service, and avg_training_score.

Note: I had already checked the normality of the variables (age, previous_year_rating, length_of_service, and avg_training_score) and confirmed that they are non-normal; that’s why I opted for the Mann-Whitney U Test.
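
A minimal sketch of that normality check (the original post states the result but does not show the code; D’Agostino’s K² test from scipy is assumed here):

# Normality check for each numerical feature; a small p-value suggests the
# feature is not normally distributed
from scipy.stats import normaltest

for col in num_columns:
    stat, p_value = normaltest(train_df[col].dropna())
    verdict = 'non-normal' if p_value < 0.05 else 'approximately normal'
    print(f'{col}: statistic = {stat:.2f}, p-value = {p_value:.4f} -> {verdict}')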

import numpy as np
from scipy.stats import mannwhitneyu


# Perform a Mann-Whitney U Test for each numerical feature, comparing the
# promoted and non-promoted groups
def mann_whitney_tests(dataframe, columns):
    for col in columns:
        promoted = dataframe[dataframe['is_promoted'] == 1][col]
        not_promoted = dataframe[dataframe['is_promoted'] == 0][col]
        # Perform Mann-Whitney U Test
        stat, p_value = mannwhitneyu(promoted, not_promoted)
        print(f'Mann-Whitney U Test between is_promoted and {col}:')
        print(f'Statistic: {stat}')
        print(f'P-value: {p_value}')
        print('')

mann_whitney_tests(train_df, num_columns)
Image by Author

Observation:

  • Both visually (violin plots) and statistically (Mann-Whitney U Test), age, previous_year_rating, and avg_training_score show a significant difference between the promoted and non-promoted groups (p-value < 0.05), while length_of_service does not (p-value > 0.05).

Correlation Analysis


Correlation refers to a statistical measure that describes the strength and direction of a relationship between two variables. It assesses how changes in one variable are associated with changes in another variable. In data analysis and modeling, correlation helps identify which variables are strongly related, aiding in feature selection for predictive models.

How it Works:

  • Correlation Coefficient: A numerical measure between -1 and 1 that represents the strength and direction of the relationship.
  • Positive Correlation: When one variable increases, the other tends to increase as well (correlation coefficient close to +1).
  • Negative Correlation: When one variable increases, the other tends to decrease (correlation coefficient close to -1).
  • No Correlation: When there’s no discernible relationship between the variables (correlation coefficient close to 0).

Here, we will check the correlation between the numerical variables age, previous_year_rating, length_of_service, and avg_training_score to detect redundancy.

# Calculate the correlation matrix
correlation_matrix = train_df[num_columns].corr()

# Display the correlation matrix as a colour-graded table
# (Styler.set_precision is deprecated in newer pandas; format(precision=...) replaces it)
correlation_table = correlation_matrix.style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)

# Show the correlation table
correlation_table
Image by Author

Observation:

  • The variables are not highly correlated, so we assume there is no multicollinearity among them (an optional VIF check is sketched below).
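
As an optional second check on multicollinearity (not part of the original analysis; assumes statsmodels is installed), variance inflation factors can be computed:

# Variance inflation factors for the numerical features; values well below 5
# support the assumption of no multicollinearity
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(train_df[num_columns].astype(float))
for i, col in enumerate(X_vif.columns):
    if col != 'const':
        print(f'{col}: VIF = {variance_inflation_factor(X_vif.values, i):.2f}')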

4.4 | Chi-Square Test for Independence


The chi-square test is a statistical test used to determine whether there is a significant association between categorical variables. In feature selection for machine learning, chi-square tests can help identify relevant features for classification tasks.

How it Works:

  • Null Hypothesis: Assumes no association between variables (independence).
  • P-value Interpretation: If the p-value is below a chosen significance level (usually 0.05), the null hypothesis is rejected, indicating a significant association.

Here, we check for a significant association between is_promoted and each of awards_won?, gender, recruitment_channel, education, and department.

# Create a contingency table using pandas crosstab function
contingency_awards_won = pd.crosstab(train_df['is_promoted'], train_df['awards_won?'])
contingency_gender = pd.crosstab(train_df['is_promoted'], train_df['gender'])
contingency_recruitment = pd.crosstab(train_df['is_promoted'], train_df['recruitment_channel'])
contingency_education = pd.crosstab(train_df['is_promoted'], train_df['education'])
contingency_department = pd.crosstab(train_df['is_promoted'], train_df['department'])
print(contingency_awards_won)
print(contingency_gender)
print(contingency_recruitment)
print(contingency_education)
print(contingency_department)
from scipy.stats import chi2_contingency

# Perform chi-square tests
def perform_chi2_test(table):
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p, dof, expected

# Perform chi-square test for each pair
results = {}
results['recruitment_channel'] = perform_chi2_test(contingency_recruitment)
results['gender'] = perform_chi2_test(contingency_gender)
results['education'] = perform_chi2_test(contingency_education)
#results['region'] = perform_chi2_test(contingency_region)
results['department'] = perform_chi2_test(contingency_department)
results['awards_won'] = perform_chi2_test(contingency_awards_won)

# Output the results
for var, (chi2, p, dof, expected) in results.items():
    print(f"Chi-square test for {var}:")
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)
    print("\n")
Image by Author
Image by Author
Image by Author

Observation:

  • We observe a significant association between is_promoted and each of the following factors: awards_won?, gender, recruitment_channel, education, and department, with p-value < 0.05.

4.5 | Summary of Statistical Analysis


  • The training and testing datasets are similar in terms of the variables they contain, so the testing dataset can be considered representative of the training dataset.
  • Both datasets have similar average values for key features like age, previous_year_rating, length_of_service, awards_won?, and avg_training_score.
  • The training-to-testing split ratio is roughly 70%:30% (54,808 vs. 23,490 observations).
  • is_promoted: Approximately 8.52% of employees have been promoted (values are 0 or 1), so the data are imbalanced (a quick check is sketched after this list).
  • There are missing values in two variables: education (2,409 in train, 1,034 in test) and previous_year_rating (4,124 in train, 1,812 in test). We handle these by imputing "Unknown" for education and 0 for previous_year_rating.
  • The Mann-Whitney U Test and violin plots suggest age, previous_year_rating, and avg_training_score are significant features, while length_of_service is not.
  • The correlation analysis suggests there is no collinearity among the numerical variables.
  • Chi-Square Test indicates a significant association between the categorical variables is_promoted and the following factors: awards_won, gender, recruitment_channel, education, and department, with a p-value < 0.05.
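
The imbalance called out above can be quantified directly; a minimal sketch (not in the original post):

# Class distribution of the target: roughly 91.5% not promoted vs. 8.5% promoted.
# The StratifiedKFold used in section 7 keeps this ratio stable in every fold.
print(train_df['is_promoted'].value_counts(normalize=True))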

5 | Key Findings in Visualization


5.1 | Count Plots and Pie Charts for Single Categorical Feature

Count plots are a powerful tool for exploring and understanding the distribution of categorical variables, providing a straightforward and interpretable visual summary of the data.

We use them to visualize the frequency of each category, compare categories, identify imbalances in the category distribution, spot rare or infrequent categories that might need special attention during data preprocessing, and explore relationships between two categorical variables.

print("The number of categories in each categorical variable:")
train_df[cat_columns].nunique()

print("Let's have a look at the distribution of department in the training dataset:")
plt.subplots(figsize=(14, 6))
cp = sns.countplot(x=train_df["department"], palette=palette, saturation=1, edgecolor="#1c1c1c", linewidth=3)
cp.axes.set_title("\nDistribution of department in the train dataset\n", fontsize=15, fontweight='bold')
cp.axes.set_xlabel("Department", fontsize=13, fontweight='bold')
cp.axes.set_ylabel("Department Counts", fontsize=13, fontweight='bold')
cp.axes.set_xticklabels(cp.get_xticklabels(), rotation=0)
# Annotate each bar with its count
for container in cp.containers:
    cp.bar_label(container, label_type="center", padding=6, size=20, color="black", rotation=0,
                 bbox={"boxstyle": "round", "pad": 0.4, "facecolor": "orange", "edgecolor": "#1c1c1c", "linewidth": 3, "alpha": 1})

sns.despine(left=True, bottom=True)
plt.show()
Image by Author

Observation:

  • Sales & Marketing and Operations have the most substantial number of employees, indicating these departments might be pivotal in the company’s operations.
  • Analytics, Finance, HR, Legal, and R&D have smaller employee counts compared to Sales & Marketing and Operations, signifying potentially specialized or smaller-focused departments within the organization.
print("Let's have a look at the distribution of recruitment channel in the training dataset:")
plt.subplots(figsize=(6, 4))
cp = sns.countplot(x=train_df["recruitment_channel"], palette=palette, saturation=1, edgecolor="#1c1c1c", linewidth=3)
cp.axes.set_title("\nDistribution of recruitment channel in the training dataset\n", fontsize=15, fontweight='bold')
cp.axes.set_xlabel("Recruitment channel", fontsize=12, fontweight='bold')
cp.axes.set_ylabel("Counts", fontsize=12, fontweight='bold')
cp.axes.set_xticklabels(cp.get_xticklabels(), rotation=0)
# Annotate each bar with its count
for container in cp.containers:
    cp.bar_label(container, label_type="center", padding=6, size=20, color="black", rotation=0,
                 bbox={"boxstyle": "round", "pad": 0.4, "facecolor": "orange", "edgecolor": "#1c1c1c", "linewidth": 4, "alpha": 1})

sns.despine(left=True, bottom=True)
plt.show()
Image by Author

Observation:

  • The organization seems to heavily rely on the ‘Other’ and ‘Sourcing’ channels for recruitment, suggesting they might have established networks or methods for attracting a larger pool of candidates.
print("Let's have a look at the distribution of previous year rating in the training dataset:")
plt.subplots(figsize=(9, 6))
cp = sns.countplot(x=train_df["previous_year_rating"], palette=palette, saturation=1, edgecolor="#1c1c1c", linewidth=3)
cp.axes.set_title("\nDistribution of previous year rating in the training dataset\n", fontsize=15, fontweight='bold')
cp.axes.set_xlabel("Ratings", fontsize=15, fontweight='bold')
cp.axes.set_ylabel("Rating Counts", fontsize=15, fontweight='bold')
cp.axes.set_xticklabels(cp.get_xticklabels(), rotation=0)
# Annotate each bar with its count
for container in cp.containers:
    cp.bar_label(container, label_type="center", padding=6, size=20, color="black", rotation=0,
                 bbox={"boxstyle": "round", "pad": 0.4, "facecolor": "orange", "edgecolor": "#1c1c1c", "linewidth": 4, "alpha": 1})

sns.despine(left=True, bottom=True)
plt.show()
Image by Author

Observation:

  • Ratings 3.0, 4.0, and 5.0 indicate a significant portion of the employee population, suggesting a more prevalent performance distribution within this range.
print("Let's have a look at the education distribution:")
plt.subplots(figsize=(6, 6))

labels = 'Masters & above', 'Bachelors', 'Below Secondary'
sizes = [14925, 36669, 805]
size = 0.5

wedges, texts, autotexts = plt.pie(sizes, labels=labels,
                                   autopct="%.2f%%",
                                   pctdistance=0.72,
                                   radius=.9,
                                   colors=["#ef3f28", "#dddf00", "#008b99"],
                                   shadow=True,
                                   wedgeprops=dict(width=size, edgecolor="black", linewidth=3),
                                   startangle=85)

plt.legend(wedges, labels, title="Education Count", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1), edgecolor="black")
plt.show()
Image by Author

Observation:

  • A substantial majority of employees hold a Bachelor’s degree, indicating that it is the most prevalent educational qualification among the workforce.
  • Employees with education levels below secondary school are notably fewer in count, indicating that this category represents a minority within the organization.
  • The category ‘Unknown Education’ encompasses a considerable count, signifying either a lack of available data regarding education or potentially employees who haven’t provided this information.
print("Let's have a look at the gender distribution:")
plt.subplots(figsize=(6, 6))

labels = 'Female', 'Male'
sizes = [16312, 38496]
size = 0.5

wedges, texts, autotexts = plt.pie(sizes, labels=labels,
                                   autopct="%.2f%%",
                                   pctdistance=0.72,
                                   radius=.9,
                                   colors=["#ef3f28", "#dddf00"],
                                   shadow=True,
                                   wedgeprops=dict(width=size, edgecolor="black", linewidth=3),
                                   startangle=85)

plt.legend(wedges, labels, title="Gender Count", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1), edgecolor="black")
plt.show()
Image by Author

Observation:

  • There is a noticeable gender gap within the organization, with a significantly higher count of male employees compared to female employees.
  • Female employees, while fewer in count, still constitute a substantial portion of the overall workforce, contributing to the organization’s diversity.

5.2 | Cat Plots for Multiple Features


sns.catplot(x="is_promoted", y="previous_year_rating", hue="department", kind="bar", data=train_df, palette=palette, height=6, aspect=2)
Image by Author

Observation:

  • Across most departments, higher previous year ratings (e.g., 4.0 and 5.0) tend to correlate with more promotions.
sns.catplot(x="is_promoted", y="previous_year_rating", hue="education", kind="bar", data=train_df, palette=palette, height=6, aspect=2)
Image by Author

Observation:

  • Employees with higher education levels (“Master’s & above”) show a more consistent ratio of promotions across various ratings, indicating a potential correlation between higher education and higher chances of promotion.
  • Employees with “Below Secondary” education have the lowest count across all ratings and show minimal instances of promotions.
sns.catplot(x="is_promoted", y="previous_year_rating", hue="awards_won?", kind="bar", data=train_df, palette=palette, height=6, aspect=2)
Image by Author

Observation:

  • Employees without awards represent the majority and have a significant count of promotions, emphasizing that promotions are not solely reliant on awards.
sns.catplot(x="is_promoted", y="previous_year_rating", hue="recruitment_channel", kind="bar", data=train_df, palette=palette, height=6, aspect=2)
Image by Author

Observation

  • Employees recruited through these two channels (other and sourcing) show the highest count of non-promotions (is_promoted = 0) across different previous year ratings.
sns.catplot(x="is_promoted", y="age", hue="gender", kind="bar", data=train_df, palette=palette, height=6, aspect=2)
Image by Author
sns.catplot(x="is_promoted", y="previous_year_rating", hue="gender", kind="bar", data=train_df,palette=palette, height=6, aspect=2)
Image by Author

Observation:

  • The workforce comprises a higher count of male employees (gender = m) across all previous year ratings compared to female employees (gender = f).
  • The count of non-promotions (is_promoted = 0) is notably higher for male employees across different previous year ratings.
  • Employees, both male and female, with lower previous year ratings (0.0 to 2.0) exhibit fewer instances of promotions, suggesting that employees with lower ratings are less likely to be promoted, irrespective of gender.
  • Higher previous year ratings (4.0 and 5.0) consistently display a larger count of promotions for both male and female employees, indicating that higher ratings generally correlate with more promotions for both genders.

5.3 | Summary of Visualization


  • Sales & Marketing and Operations have the most substantial number of employees, indicating these departments might be pivotal in the company’s operations.
  • The organization seems to heavily rely on the ‘Other’ and ‘Sourcing’ channels for recruitment, suggesting they might have established networks or methods for attracting a larger pool of candidates.
  • Across most departments, higher previous year ratings (e.g., 4.0 and 5.0) tend to correlate with more promotions.
  • Employees with higher education levels (“Master’s & above”) show a more consistent ratio of promotions across various ratings, indicating a potential correlation between higher education and higher chances of promotion.
  • Employees without awards represent the majority and have a significant count of promotions, emphasizing that promotions are not solely reliant on awards.
  • Higher previous year ratings (4.0 and 5.0) consistently display a larger count of promotions for both male and female employees, indicating that higher ratings generally correlate with more promotions for both genders.

6 | Data Preprocessing


Data preprocessing refers to a set of techniques used to prepare raw data into a clean, understandable format before it’s utilized in machine learning or data analysis processes.

6.1 | Define Pipeline for Data Preprocessing


In the data preprocessing pipeline, we handle missing values, encode categorical data (one-hot encoding), and scale the numerical variables.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

X_train = train.drop(id_dep_columns, axis=1)
y_train = train.is_promoted
X_test = test.drop("employee_id", axis=1)

# Preprocessing for numerical data
# Note: only previous_year_rating has missing values; they are filled with 0
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
# Note: only education has missing values; they are filled with 'Unknown'
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, num_columns),
    ('cat', categorical_transformer, cat_columns)
])

# Create the full pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform the training data
X_train_processed = pipeline.fit_transform(X_train)

# Transform the test data
X_test_processed = pipeline.transform(X_test)
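
Optionally (a small sketch, assuming scikit-learn >= 1.0, where ColumnTransformer exposes get_feature_names_out), we can inspect the transformed feature space:

# Inspect the expanded feature space produced by the preprocessing pipeline
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print(f'{X_train_processed.shape[1]} features after preprocessing')
print(feature_names[:10])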

7 | Machine Learning Models


7.1 | Logistic Regression, Random Forests, XGBoost, Neural Networks


Logistic Regression:

  • Logistic Regression is like a simple, reliable tool for saying ‘yes’ or ‘no’ — great when you want to understand how one thing affects another.

Random Forests:

  • Random Forests work like a wise crowd, combining many opinions to make a strong decision, especially when things get a bit complicated.

XGBoost:

  • XGBoost is like a super-smart student who quickly figures out the best way to solve a problem, making it awesome for tricky tasks.

Neural Networks:

  • Neural Networks are like brains for computers, learning and understanding complex stuff, perfect for tasks where things get a bit tricky or messy.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

# Models
models = [
    ('Logistic Regression', LogisticRegression()),
    ('Random Forest', RandomForestClassifier()),
    ('XGBoost', XGBClassifier()),
    ('Neural Network', MLPClassifier())
]

# Scoring metrics
scoring = ['accuracy', 'roc_auc', 'f1']

results = []

# 5-fold cross-validation for each model using different scoring metrics
for name, model in models:
    model_results = []
    for score in scoring:
        kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        cv_results = cross_val_score(model, X_train_processed, y_train, cv=kfold, scoring=score)
        model_results.append((score, cv_results))
        print(f'{name} - {score}: Mean: {np.mean(cv_results):.4f}, Std Dev: {np.std(cv_results):.4f}')
    results.append((name, model_results))

# Save results to a DataFrame
results_dict = {}
for name, model_scores in results:
    for score, scores in model_scores:
        results_dict[f'{name}_{score}'] = scores

results_df = pd.DataFrame(results_dict)
results_df.to_csv('models_crossval_results_scores.csv', index=False)
results_df
Image by Author

7.2 | Summary of Machine Learning Models


Let’s analyze the performance of the models based on the performance measures (Accuracy, ROC AUC, F1 Score):

  • XGBoost seems to outperform other models with higher accuracy, the best ROC AUC score, and the highest F1 score, indicating better overall performance and class separation ability.
  • Logistic Regression, Random Forest, and Neural Network perform moderately well but show slightly lower performance compared to XGBoost in terms of AUC and F1 score.
  • The Logistic Regression model, despite having consistent results with low standard deviations, shows lower performance in terms of discrimination and overall balance between precision and recall compared to XGBoost.
  • The choice of the best model might depend on the specific requirements and trade-offs between different performance measures. For instance, if higher precision and recall are crucial, XGBoost might be preferable due to its higher F1 score.

8 | Final Model, and Interpretation


After determining the best-performing model from cross-validation, we rebuild the selected model on the entire training dataset and then predict on the unseen test dataset.

Our best-performing model: XGBoost

# Rebuilding the best model (XGBoost) on the entire training dataset
best_model = XGBClassifier() # Initialize the model with best hyperparameters found

# Train the best model on the entire training dataset
best_model.fit(X_train_processed, y_train)
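
The comment above refers to the "best hyperparameters found", but the search itself is not shown in the post. A hedged sketch of how such a search might look, using a small randomized search over common XGBoost settings (the parameter ranges here are illustrative assumptions, not the author's):

# Illustrative hyperparameter search (assumed, not from the original post)
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [200, 400, 600],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2],
    'subsample': [0.8, 1.0],
}
search = RandomizedSearchCV(XGBClassifier(), param_distributions,
                            n_iter=10, scoring='f1', cv=3, random_state=42)
search.fit(X_train_processed, y_train)
print(search.best_params_)

# The winning settings could then seed the final model, e.g.:
# best_model = XGBClassifier(**search.best_params_)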

Predict whether a potential candidate in the test set will be promoted or not


# Predictions on the test dataset
y_pred = best_model.predict(X_test_processed)
y_pred

# Build the submission file, keeping the original employee_id from the test set
# (predictions are returned in the same row order as the test dataframe)
test_pred = pd.DataFrame({'employee_id': test['employee_id'], 'is_promoted': y_pred})

test_pred.head()
test_pred.to_csv('submission.csv', index=False)
Image by Author
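
Since this section is also about interpretation, one simple way to surface the influential variables mentioned in the abstract is to rank the trained XGBoost model's feature importances against the pipeline's output feature names. A minimal sketch (assumed, not shown in the original post; requires scikit-learn >= 1.0 for get_feature_names_out):

# Rank the preprocessed features by the trained XGBoost model's importance scores
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
importances = pd.Series(best_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))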

Many thanks for reading!🙏

Please leave a comment if you have any suggestions for improving the analysis!🏋🥇

If you liked 😊, give 👍 UPVOTE!

If you have a moment, I encourage you to see my other kernels.


Learner CARES

Data Scientist, Kaggle Expert (https://www.kaggle.com/itsmohammadshahid/code?scroll=true). Focusing on only one thing — To help people learn📚 🌱🎯️🏆