The data for this analysis comes from the Kaggle Campus Recruitment dataset, which contains records for 215 students, including their education level, degree, gender, specialization, etc.
The purpose of this analysis is to explore the data and build a classifier using machine learning techniques such as decision trees, random forest and logistic regression to classify each student as placed in a job or not placed. The analysis uses different classification techniques and compares which classifier makes the best predictions for this dataset.
The second part of the analysis performs a regression using only the students who got a job, to find some of the key factors that influence the salary of an offer. The analysis is performed in Python using Jupyter Notebook. Various techniques that are used in this analysis are:
The analysis starts with loading the data and the necessary libraries.
# Load necessary library
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import matplotlib.ticker as mtick
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn')
%matplotlib inline
# set default plot size
plt.rcParams["figure.figsize"] = (15,8)
# Load and preview data
recruit = pd.read_csv("/Users/leo/Personal/GitHub/Campus Recruitment Analysis/Placement_Data_Full_Class.csv")
recruit.head()
# drop id column
recruit.drop('sl_no',axis=1,inplace=True)
recruit.shape
# Summary Statistics
recruit.describe()
# Check each column for nas
recruit.isnull().sum()
After loading the data and dropping the index column sl_no, the dataset has 215 observations and 14 columns, with a mix of categorical and numeric variables. The target variable for our classification problem is the status column, which takes the values Placed and Not Placed. Binary variables such as gender and workex will be converted to 0 and 1 later in the analysis, and categorical variables with more than two levels will be converted to binary indicator columns using One Hot Encoding.
At first glance, the dataset is fairly clean. The only missing values are in salary, for the 67 students who were not offered a job. Salary is therefore excluded from the first part of the analysis (Classification).
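The claim that salary is missing only for students who were not placed can be checked directly. The following is a minimal sketch on a made-up four-row table (the values are illustrative assumptions, not rows from the dataset); the same two masks can be built on recruit.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the placement data (values are made up for illustration).
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
})

# Rows with a missing salary, and rows where the student was not placed.
missing_salary = toy["salary"].isnull()
not_placed = toy["status"] == "Not Placed"

# True when the two masks mark exactly the same rows.
salary_missing_only_when_unplaced = bool((missing_salary == not_placed).all())
print(salary_missing_only_when_unplaced)
```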
After we get the first look at the data, we'd like to get a better understanding by performing some EDA and data visualizations.
We start with a pairplot from the seaborn library, using sns.pairplot(). This plots every pair of numeric variables against each other, with the distribution of each variable on the diagonal. We also set hue = 'status' to see how the values are distributed under each status, which helps us identify useful variables for the classifier later.
From the plot, we can see that the following variables might differ noticeably between the two student statuses
Other variables show different distributions under different statuses as well.
sns.pairplot(recruit.drop('salary',axis=1),hue = 'status')
# gender 0
# ssc_p 0
# ssc_b 0
# hsc_p 0
# hsc_b 0
# hsc_s 0
# degree_p 0
# degree_t 0
# workex 0
# etest_p 0
# specialisation 0
# mba_p 0
# status 0
# salary 67
We then look at student status by gender. From the table and the bar plot below, we can see that more male students were placed in a job than female students.
recruit.groupby(["gender","status"]).size().unstack()
recruit.groupby(["gender","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'Status')
plt.show()
# most of males are placed job than female
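Because the two groups differ in size, the placement rate within each gender is easier to compare than raw counts. As a hedged alternative to the groupby chain above, pd.crosstab with normalize="index" gives row percentages in one call. The sketch below uses a small made-up table, not the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "M", "F", "F"],
    "status": ["Placed", "Placed", "Not Placed", "Placed",
               "Not Placed", "Placed", "Not Placed", "Placed"],
})

# normalize="index" divides each row by its total, so every row sums to 1;
# multiplying by 100 turns the shares into percentages.
share = pd.crosstab(toy["gender"], toy["status"], normalize="index") * 100
print(share.round(1))
```

On the real data, the same call would be pd.crosstab(recruit["gender"], recruit["status"], normalize="index"), run before status is recoded to 0 and 1.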
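The following table shows the mean of each numeric variable under each status. Apart from mba_p, every variable shows a clear difference in mean between the two groups.
# numeric_only=True skips the string columns, which recent pandas versions refuse to average
recruit.groupby('status').mean(numeric_only=True)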
To further examine the distribution of the numeric variables in each group, boxplots are useful. First, we need to reshape the dataset so that all the numeric variables can be shown in one graph.
The first step is to extract the numeric variables and then convert the dataset from wide to long format using pd.melt().
After that, we can draw the boxplots. The boxplot below shows similar information to the table: almost all variables have higher values in the placed group than in the not placed group, while the MBA percentage seems to have the least influence on whether a student is placed.
recruit_numeric = recruit[['ssc_p','hsc_p','degree_p','etest_p','mba_p','status']]
recruit_numeric_melt = pd.melt(recruit_numeric,id_vars='status',
value_vars =['ssc_p','hsc_p','degree_p','etest_p','mba_p'])
recruit_numeric_melt.head()
sns.boxplot(x="variable", y="value",
hue="status", data=recruit_numeric_melt)
After we finished with the numeric variables, we proceeded with all the categorical variables. Applying the same approach for all the categorical variables, we used stacked barplot to see the count of observations in each group.
# then will look at all the categorical variables
# column description
# ssc_b Board of Education- Central/ Others
# hsc_b Board of Education- Central/ Others
# hsc_s Specialization in Higher Secondary Education
# degree_t Under Graduation(Degree type)- Field of degree education
# workex Work Experience
# specialisation Post Graduation(MBA)- Specialization
# status Status of placement- Placed/Not placed
# salary Salary offered by corporate to candidates
# Board of Education - 10th grade
recruit.groupby(["ssc_b","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'Board of Education')
plt.show()
# central and others almost no difference for secondary education board of education
# Board of Education - 12th grade
recruit.groupby(["hsc_b","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'Board of Education')
plt.show()
# similarly central and others almost no difference for secondary education board of education
# Specialization in Higher Secondary Education
recruit.groupby(["hsc_s","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'Higher Education Specialization')
plt.show()
# commerce and science are more likely to get placed
# Under Graduation(Degree type)- Field of degree education
recruit.groupby(["degree_t","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'Degree type')
plt.show()
# for undergraduate degrees, comm/management and sci/tech are more likely to get placed
# Work Experience
recruit.groupby(["workex","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'Work experience')
plt.show()
# having working experience is more likely to get placed and it has the most influence by comparing the graphs
# Post Graduation(MBA)- Specialization
recruit.groupby(["specialisation","status"]).size().groupby(level=0).apply(
lambda x: 100 * x / x.sum()
).unstack().plot(kind='bar',stacked=True)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.legend(loc = 'upper right',title = 'specialisation')
plt.show()
# mrkt/finance are more likely to get placed than mrkt/hr
To sum up the information found on the plots above:
This wraps up the EDA and data visualization.
The next step is to train classifiers using random forest and logistic regression.
Random forest is a flexible machine learning method: it can be used as a classifier and also as a regressor. Random Forest also tends to work well on small datasets, which made it our first choice for this classification problem.
As mentioned above, before training the model, we need to transform all of the categorical variables. Binary variables are converted to 0 and 1. Multi-class variables are converted with One Hot Encoding, using pd.get_dummies from the pandas library.
# convert categorical variables to dummy variables
recruit.loc[recruit['gender'] == 'M', 'gender'] = 1.0
recruit.loc[recruit['gender'] == 'F', 'gender'] = 0.0
recruit.loc[recruit['status'] == 'Placed', 'status'] = 1
recruit.loc[recruit['status'] == 'Not Placed', 'status'] = 0
recruit.loc[recruit['workex'] == 'Yes', 'workex'] = 1.0
recruit.loc[recruit['workex'] == 'No', 'workex'] = 0.0
categorical_var = ['ssc_b','hsc_b','hsc_s','degree_t','specialisation']
# create dummy variables for all the other categorical variables
for variable in categorical_var:
    # # fill missing data
    # recruit[variable].fillna('Missing',inplace=True)
    # create dummy variables for given columns
    dummies = pd.get_dummies(recruit[variable],prefix=variable)
    # update data and drop original columns
    recruit = pd.concat([recruit,dummies],axis=1)
    recruit.drop([variable],axis=1,inplace=True)
recruit.head()
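The same encoding can be written more compactly. The sketch below is an alternative under stated assumptions, not the code used in this analysis: it runs on a made-up three-row table, uses Series.map for the binary columns and passes the multi-class column to pd.get_dummies through its columns argument.

```python
import pandas as pd

# Made-up rows; the column names mirror the dataset, the values are illustrative.
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "workex": ["Yes", "No", "No"],
    "degree_t": ["Sci&Tech", "Comm&Mgmt", "Others"],
})

encoded = toy.copy()
# Binary variables: map each label to 0 or 1, which also gives a numeric dtype.
encoded["gender"] = encoded["gender"].map({"M": 1, "F": 0})
encoded["workex"] = encoded["workex"].map({"Yes": 1, "No": 0})
# Multi-class variables: one indicator column per level, original column dropped.
encoded = pd.get_dummies(encoded, columns=["degree_t"], dtype=int)
print(encoded)
```

Unlike assignment through .loc, map leaves the binary columns with an integer dtype, so no later astype(float) is needed.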
Also here we are creating a separate dataset for regression analysis using observations that have status Placed.
# Create separate dataset for placed status
# use this for further regression analysis
recruit_placed = recruit[recruit['status'] == 1].drop('status',axis = 1)
recruit_placed.head()
We use all the remaining columns as independent variables, and the y variable is status.
The data is split 70/30 into training and testing datasets.
x = recruit.drop(['status','salary'], axis=1)
y = recruit['status'].astype(float)
# split train and test dataset
train_x, test_x, train_y, test_y = train_test_split(x,y , test_size=0.3, random_state=42)
print(train_x.shape)
print(train_y.shape)
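With only 215 rows and unbalanced classes, a purely random split can leave the test set with a different share of placed students than the training set. One optional variation, shown here as a sketch on synthetic labels (the 70/30 class mix is an assumption for illustration), is to pass stratify=y so both parts keep roughly the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 70 labelled 1 (placed) and 30 labelled 0.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 70 + [0] * 30)

# stratify=y keeps the class proportions similar in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())
```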
We first fit a Random Forest Regressor on the training dataset with some generic parameters and print the model score. The model has a score of 0.93. For a regressor this score is the R-squared computed on the training data, so it is an optimistic measure of fit.
rf_regressor = RandomForestRegressor(100, oob_score=True,
n_jobs=-1, random_state=42)
rf_regressor.fit(train_x,train_y)
print('Score: ', rf_regressor.score(train_x,train_y))
We then used the feature_importances_ attribute to see which features matter most in the Random Forest Regressor, and visualized the importances with a bar plot.
From the plot, we can see that the top 3 important factors are:
feature_importance = pd.Series(rf_regressor.feature_importances_,index=x.columns)
feature_importance = feature_importance.sort_values()
feature_importance.plot(kind='barh')
After training the initial regressor, we proceed with parameter tuning to try to find a good value for
n_estimators: The number of trees in the forest
max_features: The number of features to consider when looking for the best split
min_samples_leaf: The minimum number of samples required to be at a leaf node
By trying different values or methods for each parameter, we look for the setting that gives the highest score.
# parameter tuning
# number of trees
results = []
n_estimator_options = [30,50,100,200,500,1000,2000]
for trees in n_estimator_options:
    model = RandomForestRegressor(trees,oob_score=True,n_jobs=-1,random_state=42)
    # fit on the training split only, so the test rows stay unseen
    model.fit(train_x,train_y)
    print(trees," trees")
    score = model.score(train_x,train_y)
    print(score)
    results.append(score)
    print("")
pd.Series(results,n_estimator_options).plot()
# max number of features
results = []
# 'auto' is no longer accepted by recent scikit-learn releases; for a regressor
# it meant "use all features", which is what None does
max_features_options = [None,'sqrt','log2',0.9,0.2]
for max_features in max_features_options:
    model = RandomForestRegressor(n_estimators=200,oob_score=True,n_jobs=-1,
                                  random_state=42,max_features=max_features)
    # fit on the training split only, so the test rows stay unseen
    model.fit(train_x,train_y)
    print(max_features," option")
    score = model.score(train_x,train_y)
    print(score)
    results.append(score)
    print("")
pd.Series(results,max_features_options).plot(kind='barh')
# min samples per leaf
results = []
min_sample_leaf_option = [1,2,3,4,5,6,7,8,9,10]
for min_sample_leaf in min_sample_leaf_option:
    model = RandomForestRegressor(n_estimators=200,oob_score=True,n_jobs=-1,
                                  random_state=42,max_features='sqrt',
                                  min_samples_leaf=min_sample_leaf)
    # fit on the training split only, so the test rows stay unseen
    model.fit(train_x,train_y)
    print(min_sample_leaf," min samples")
    score = model.score(train_x,train_y)
    print(score)
    results.append(score)
    print("")
pd.Series(results,min_sample_leaf_option).plot()
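A score computed on the same rows the forest was trained on tends to favour the most flexible setting. Since the models are already built with oob_score=True, the out-of-bag estimate stored in oob_score_ is one way to compare settings on rows each tree did not see. The sketch below illustrates the idea with a RandomForestClassifier on synthetic data from make_classification; both the data and the candidate values are assumptions for illustration, not the tuning used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training split.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=42)

oob_scores = {}
for leaf in [1, 3, 5, 10]:
    model = RandomForestClassifier(n_estimators=200, min_samples_leaf=leaf,
                                   oob_score=True, random_state=42)
    model.fit(X, y)
    # oob_score_ is the accuracy on samples left out of each tree's bootstrap.
    oob_scores[leaf] = model.oob_score_

# Setting with the highest out-of-bag accuracy.
best_leaf = max(oob_scores, key=oob_scores.get)
print(oob_scores, best_leaf)
```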
After the analysis above, the values we chose for each parameter are:
n_estimators: 200
max_features: sqrt
min_samples_leaf: 1
With these new values, we got a better model with a slightly higher model score.
rf_regressor = RandomForestRegressor(200, oob_score=True,max_features='sqrt',
                                     n_jobs=-1, random_state=42,min_samples_leaf=1)
# fit on the training split only, so the predictions on test_x below are out of sample
rf_regressor.fit(train_x,train_y)
print('Score: ', rf_regressor.score(train_x,train_y))
To see how well the model performs, we compared the first 10 values from the test dataset with the predicted probabilities from the model. Most of the predictions are accurate.
pred_y = rf_regressor.predict(test_x)
print(test_y[:10])
print(pred_y[:10])
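Because the regressor returns a score between 0 and 1 rather than a class, comparing it with the true labels by eye only goes so far. A simple way to quantify the comparison is to apply a cutoff and count the matches. The sketch below uses made-up numbers and a 0.5 cutoff, both of which are assumptions for illustration.

```python
import numpy as np

# Made-up true labels and regressor outputs for five students.
actual = np.array([1, 0, 1, 1, 0])
predicted_score = np.array([0.92, 0.35, 0.61, 0.48, 0.10])

# Scores at or above the cutoff count as "placed" (1), the rest as 0.
predicted_label = (predicted_score >= 0.5).astype(int)

# Share of students whose thresholded prediction matches the true label.
accuracy = (predicted_label == actual).mean()
print(predicted_label, accuracy)
```

On the real data, the same two lines could be applied to pred_y and test_y.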
Similarly, we then train a Random Forest Classifier to see how the data performs with a classifier, using the same parameters we got from the regressor (max_features='sqrt' and min_samples_leaf=1 are the classifier's defaults). The model score is 0.8, and from the confusion matrix we can see that the classifier does a better job of correctly predicting placed students than not placed students: only 2 were misclassified as not placed, while 10 were misclassified as placed.
rf_classifier = RandomForestClassifier(200, oob_score=True,
n_jobs=-1, random_state=42)
rf_classifier.fit(train_x, train_y)
pred_y = rf_classifier.predict(test_x)
rf_classifier.score(test_x, test_y)
mat = confusion_matrix(test_y,pred_y)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value')
print(classification_report(test_y, pred_y))
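To read the heatmap, it helps to know how scikit-learn lays out the matrix: rows are true classes, columns are predicted classes, and for labels 0 and 1 the flattened order is true negatives, false positives, false negatives, true positives. The sketch below shows this on a small made-up set of labels (not the model's actual predictions).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = placed, 0 = not placed.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0, 1, 1]

# ravel() flattens the 2x2 matrix row by row: tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# fp: not placed students predicted as placed; fn: placed predicted as not placed.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```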
We want to fit the data with different types of classifiers to see which works best for this dataset. The second classifier is Logistic Regression.
We follow the same steps of fitting the model and making predictions on the test data. Logistic regression has a model score of 0.84, which is slightly higher than the Random Forest Classifier. From the confusion matrix, it also performed better than Random Forest at classifying not placed students.
lr_model = LogisticRegression()
lr_model.fit(train_x,train_y)
lr_model.score(test_x, test_y)
pred_y = lr_model.predict(test_x)
mat = confusion_matrix(test_y,pred_y)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value')
print(classification_report(test_y, pred_y))
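Logistic regression is sensitive to feature scale: percentage columns in the 40 to 100 range and 0/1 indicators sit on very different scales, and the default solver may need more iterations to converge on unscaled inputs. One common variation, sketched here on synthetic data as an assumption rather than the setup used above, is to standardize the features inside a pipeline and raise max_iter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows as the dataset.
X, y = make_classification(n_samples=215, n_features=6, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler is fit on the training rows only, then applied to the test rows.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```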
We also print the coefficients of the model and plot them on a bar plot. The most important factor is the student's previous work experience.
lr_coef = pd.DataFrame({"Coefficients":lr_model.coef_[0]},index = x.columns.tolist())
lr_coef = lr_coef.sort_values(by = 'Coefficients')
lr_coef
lr_coef.plot(kind='barh')
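A logistic regression coefficient is a change in log-odds per unit of the feature, so exponentiating it gives an odds ratio that is often easier to read. The sketch below uses hypothetical coefficient values chosen only for illustration; they are not the fitted values from lr_model.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only.
coef = pd.Series({"workex": 0.9, "ssc_p": 0.15, "mba_p": -0.10})

# exp(coefficient) is the multiplicative change in the odds of being placed
# for a one-unit increase in the feature, holding the others fixed.
odds_ratio = np.exp(coef)
print(odds_ratio.round(3))
```

Values above 1 raise the odds and values below 1 lower them. Since the features are on different scales, the size of a coefficient alone does not rank importance across features.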
The second purpose of this analysis is to find the important factors that influence students' salary. For this analysis, we use the recruit_placed dataset.
recruit_placed.head()
We first plot the pairplot to examine the correlation between different variables and print out the correlation matrix.
sns.pairplot(recruit_placed[['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary']])
recruit_placed[['ssc_p','hsc_p','degree_p','etest_p','mba_p','salary']].corr()
We've selected the following variables for the linear regression.
var = ['ssc_p','hsc_p','degree_p','etest_p','mba_p','gender','workex']
x = recruit_placed.loc[:,var]
# x = recruit_placed.loc[:,recruit_placed.columns != 'salary']
y = recruit_placed.loc[:,recruit_placed.columns == 'salary']
x.head()
train_x, test_x, train_y, test_y = train_test_split(x,y , test_size=0.2, random_state=42)
print(train_x.shape)
print(test_x.shape)
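For linear regression, we use OLS from the statsmodels package. The model has an R-squared of 0.91. Because no constant term is added to x, the model is fit without an intercept and statsmodels reports the uncentered R-squared, which is typically much higher than the usual centered value. The variables that have the highest influence are: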
linear_model = sm.OLS(train_y,train_x.astype(float))
results = linear_model.fit()
results.params
print(results.summary())
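The gap between the two R-squared definitions can be large when the response sits far from zero, as salaries do. The sketch below is a NumPy-only illustration on synthetic data (the coefficients and noise level are assumptions): it fits a line with and without an intercept and computes the uncentered and centered R-squared respectively. In statsmodels, an intercept can be included by wrapping the predictors in sm.add_constant before calling sm.OLS.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic percentage-like predictor and salary-like response.
x = rng.uniform(50, 90, size=100)
y = 200000 + 1000 * x + rng.normal(0, 20000, size=100)

# Fit without an intercept; uncentered R^2 compares residuals with sum(y^2).
X0 = x.reshape(-1, 1)
beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
resid0 = y - X0 @ beta0
r2_uncentered = 1 - (resid0 @ resid0) / (y @ y)

# Fit with an intercept; centered R^2 compares residuals with sum((y - mean)^2).
X1 = np.column_stack([np.ones_like(x), x])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid1 = y - X1 @ beta1
y_centered = y - y.mean()
r2_centered = 1 - (resid1 @ resid1) / (y_centered @ y_centered)

print(round(r2_uncentered, 3), round(r2_centered, 3))
```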
Printing the predictions and comparing them with the test dataset, we can see that some of the predictions are close, but the model definitely needs further fine-tuning.
# cast to float to match the dtype used when fitting the model
pred_y = results.predict(test_x.astype(float))
# print(pred_y[:10])
# print(test_y[:10])
col = ['actual','prediction']
prediction = pd.concat([test_y,pred_y],axis=1)
prediction.columns = col
prediction
_, ax = plt.subplots()
ax.scatter(x = range(0, test_y.size), y=test_y, c = 'blue', label = 'Actual', alpha = 0.3)
ax.scatter(x = range(0, pred_y.size), y=pred_y, c = 'red', label = 'Predicted', alpha = 0.3)
plt.title('Actual and predicted values')
plt.xlabel('Observations')
plt.ylabel('Salary')
plt.legend()
plt.show()
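Alongside the plot, a single error figure in salary units makes it easier to judge how close the predictions are. The sketch below computes the mean absolute error and the root mean squared error on made-up salaries (illustrative values, not the model's output); the same arithmetic applies to test_y and pred_y.

```python
import numpy as np

# Made-up actual and predicted salaries.
actual = np.array([250000.0, 300000.0, 200000.0, 400000.0])
predicted = np.array([260000.0, 280000.0, 230000.0, 350000.0])

errors = predicted - actual
# MAE: average size of the miss; RMSE: penalizes large misses more heavily.
mae = np.abs(errors).mean()
rmse = np.sqrt((errors ** 2).mean())
print(mae, round(rmse, 2))
```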