from IPython.display import Image
Image(filename='/Users/vlad/Desktop/credit_score.jpg', height = 900, width = 900)
Finance in developing countries is a growing market that is quite distinct from what one finds in developed countries. According to the World Bank, an estimated 2.5 billion adults lack access to financial services. These individuals, who do not use banks or banking institutions in any capacity, are called the unbanked.
The biggest problem the unbanked face, especially in emerging markets, is the lack of formal records that can be used to build traditional credit scores. In these markets, customers cannot simply rely on banks for access to a loan or other financial services, as they usually lack a verifiable credit history. Even in developed countries, due to the risks involved in this kind of service, the microfinance institution loan process tends to be slow and cumbersome.
Customers are usually asked to provide a long list of documents, which means that even in countries with publicly run credit registries or privately run credit agencies, a large portion of the population is not covered.
A simple solution is staring us in the face: the mobile phone. The rapid adoption of cell phones brings a new dynamic to financial markets. It creates an opportunity for start-up firms to emerge, disrupt existing industry structures and deliver a viable solution to the problem.
Ericsson’s Mobility Report (https://www.mobileworldlive.com/featured-content/home-banner/90-global-population-will-mobile-phone-ericsson-report/) predicted in 2014 that 90 percent of people around the globe above the age of six will have a phone by 2020. Cisco, more conservatively, estimates that 70 percent of people around the globe, or 5.4 billion people, will have mobile phones by 2020. This essentially means that while half of the world may lack formal credit records, almost all of them will have a history of financial and behavioural interactions with a mobile service by 2020. With 90 percent of the world’s data created in the last two years, primarily from mobile phones, there is a lot of data that can shed light on the financial inclusion of the unbanked, and mobile phone data is considered a good type of alternative data for credit scoring.

Both mobile and social media data can be used for alternative risk scoring, independent of financial data. As we leave our digital footprint across the social media landscape, we create valuable alternative data for credit lenders to analyse when determining our creditworthiness. The number of friends we have on a social account, how engaged we are with our community and how much our friends engage with us by liking or commenting on our posts are all taken into consideration when underwriting a loan. Even the device we use to log in to a social media platform, whether a smartphone or a computer at an Internet café, provides lenders with clues about our affordability and creditworthiness.
By analysing social connections along with individual online behaviour and lifestyle choices, a clearer picture of the persona behind the online identity emerges, and certain inferences about their creditworthiness can be made.
Thanks to data generated by social media behaviour, credit lenders can now make more informed decisions and offer credit to borrowers who might otherwise have been denied a loan, or been regarded as high-risk and granted a loan only on high-interest terms.
Another data source widely used by lenders as an indicator of customer affordability is mobile phone airtime purchases. An applicant's airtime purchase history is now considered a good indicator of affordability and creditworthiness by some lending institutions looking to fill the gaps in their information. Some mobile network operators consider airtime purchase data so reliable that it is even driving them to enter the micro-loans space to increase customer loyalty, drive mobile money usage and grow their revenue.
Once the data is gathered from the different sources, mobile and social, the most important and difficult task is feature selection. We essentially want to narrow down to a small subset of features that we will later use in the predictive modelling process. The last step involves applying a variety of machine learning techniques and classification algorithms, such as Naïve Bayes, Logistic Regression, Support Vector Machines and ensemble methods, to predict and evaluate the creditworthiness of a potential customer.
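To make this pipeline concrete, here is a minimal sketch using scikit-learn. The data is a synthetic stand-in (via make_classification) for real mobile and social features, so the numbers are purely illustrative:
#a minimal sketch of the feature selection + classification pipeline described above
#the data is synthetic and stands in for engineered mobile/social features
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, random_state = 0)
pipeline = Pipeline([
    ('select', SelectKBest(score_func = f_classif, k = 10)), #keep the 10 most informative features
    ('clf', LogisticRegression(solver = 'liblinear'))        #simple, interpretable baseline
])
scores = cross_val_score(pipeline, X, y, cv = 5, scoring = 'roc_auc')
print('Mean CV ROC AUC: {0:.3f}'.format(scores.mean()))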
In this notebook we will use a variety of alternative data, including telco and transactional information, to solve a real-world problem: predicting a client's repayment ability. The data was provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population, as part of a Kaggle competition to see what kind of models the machine learning community can develop to help them in their task.
The data can be found on the Kaggle website (https://www.kaggle.com/c/home-credit-default-risk/data) and consists of 7 sources/CSV tables, as can be seen in the image below.
from IPython.display import Image
Image(filename='/Users/vlad/Desktop/home_credit_default_risk.png', height = 900, width = 900)
To keep this notebook manageable and easy to understand, we will focus mainly on the application_train table, which consists of 122 columns, including the TARGET variable: 0 indicates that the loan was repaid, 1 that it was not.
#import the libraries used throughout this notebook
import os
import time
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
import sklearn
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
#define the path of the application_train csv file
application_train_path = '/Users/vlad/Desktop/Home Credit Default Risk/all/application_train.csv'
#load the application_train file into a pandas dataframe
start = time.time()
application_train = pd.read_csv(filepath_or_buffer = application_train_path, sep = ',')
end = time.time()
print('DataFrame loaded in {0:.3f} sec'.format(end-start))
Let's have a look at the first and last ten rows of the application_train dataframe:
#get the first 10 rows
application_train.head(n = 10)
#get the last 10 rows
application_train.tail(n = 10)
print('The dataframe consists of {} variables.'.format(len(application_train.columns)))
The next thing to do is look at the data type of each column:
#check the data type of each column
data_types = application_train.dtypes.to_frame(name = 'column_data_type')
#display only the first 10 columns
data_types.head(n = 10)
#get the number of missing values in each column
missing_values = application_train.isnull().sum().to_frame(name = 'missing_values')
#sort the missing values in a descending order
missing_values_sorted = missing_values.sort_values(by = ['missing_values'], ascending = False)
#find the percentage of missing values in each column
#get the number of rows in the dataframe
total_rows = application_train.shape[0]
#find the percentage of missing values in each column
missing_values_percentage = (missing_values_sorted/total_rows)*100
#get the first two decimals from each entry and display the first 10 rows
missing_values_percentage.applymap(lambda e: '{0:.2f}%'.format(e)).head(n = 10)
Now, we will visualize, using Plotly, the percentage of missing values for the first 50 columns:
#get the first 50 rows and round each entry to 2 decimals
frame = missing_values_percentage[0:50].applymap(lambda e: np.around(e,2))
init_notebook_mode(connected = True)
#define data
data = [go.Bar(x = frame.index,
y = frame['missing_values'].values,
opacity = 0.6)]
#define layout
layout = go.Layout(
title = 'Missing Values (%)-First 50 Columns',
barmode = 'group',
titlefont=dict(
family='Courier New, monospace',
size=25,
color='#7f7f7f'),
margin = dict(b = 210),
yaxis=dict(
title='Percentage (%)',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)' )),
xaxis=dict(
title="Column Name",
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'),
tickangle = -50)
)
#display the barplot
fig = go.Figure(data = data , layout = layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/1.png', height = 1000, width = 10000)
Next, we will check if the target variable is balanced or not.
#count the number of ones and zeros in the TARGET column
target_variable_count = application_train['TARGET'].\
value_counts(ascending = True).\
to_frame(name = "ones_zeros_count")
#total number of ones and zeros
total_count = target_variable_count.sum(axis = 0)
#percentage of ones and zeros
ones_zeros_percentage = ((target_variable_count/total_count)*100).\
applymap(lambda e: np.around(e,2)).\
rename(columns = {'ones_zeros_count':'ones_zeros_percentage'})
#concatenate count and percentage dataframes
final = pd.concat([target_variable_count,ones_zeros_percentage], axis = 1)
#get a nice representation
finalCopy = final.copy()
finalCopy['ones_zeros_count'] = finalCopy['ones_zeros_count'].apply(lambda e: '{:,}'.format(e))
finalCopy['ones_zeros_percentage'] = finalCopy['ones_zeros_percentage'].apply(lambda e: '{}%'.format(e))
finalCopy
%matplotlib inline
colors = ['#66b3ff','green']
fig, ax = plt.subplots(figsize = (8,9))
#class counts and percentages hardcoded from the finalCopy dataframe above
labels = ['Class 1 \n (24,825)','Class 0 \n(282,686)']
sizes = [8.07,91.93]
explode = [0.1,0]
patches, texts, autotexts = ax.pie(sizes,
labels = labels,
explode = explode,
shadow = True,
autopct='%1.1f%%',
startangle = 180,
colors = colors)
ax.axis('equal')
for i in range(2):
    texts[i].set_fontsize(18)
    autotexts[i].set_fontsize(18)
ax.set_title('Distribution of the TARGET variable', size = 20, y = 1.009)
From the pie chart above we can see that the TARGET variable is highly imbalanced: only about 8% of the loans were not repaid.
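One common remedy for such imbalance, sketched here as an aside rather than a step of the original analysis, is to reweight the classes so that errors on the rare class cost more. scikit-learn can compute suitable weights directly:
#compute balanced class weights for the imbalanced TARGET variable
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight(class_weight = 'balanced',
                               classes = np.array([0, 1]),
                               y = application_train['TARGET'].values)
print('Class weights (0, 1): {}'.format(weights))
#the same effect is obtained by passing class_weight='balanced' to classifiers
#such as LogisticRegression or RandomForestClassifier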
#get the unique entries in the NAME_CONTRACT_TYPE column
loan_types = application_train['NAME_CONTRACT_TYPE'].unique()
print("Two types of loans: \n\n\n '{0}' and '{1}'".format(loan_types[0], loan_types[1]))
#find the number of loans issued
loans = application_train.groupby(['NAME_CONTRACT_TYPE'])['TARGET'].count().to_frame(name = 'number_of_loans')
#display the dataframe
loans.applymap(lambda e: '{:,}'.format(e))
Let's plot the number of each loan type:
#plot the distribution of issued loans
fig = {
"data": [
{
"values": loans['number_of_loans'].values,
"labels": loans.index,
"hoverinfo":"label+percent+name",
"hole": .5,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"domain": {"x": [0.22, .99]},#<--------prosoxh!!!!
"marker":{"line":{'color': '#000000', 'width':2}}
},
],
"layout": {
"title":"Number of Issued Loans",
"font":{"size":18},
"annotations": [
{
"font": {
"size": 20
},
"showarrow": False,
"text": "#Loans",
"x": 0.60,
"y": 0.5
}
]
}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/2.png', height = 1000, width = 10000)
print('Income sources in the dataset: \n\n\n {}'.format(application_train['NAME_INCOME_TYPE'].unique()))
#find the income source of each applicant
income_source = application_train.groupby(['NAME_INCOME_TYPE'])['TARGET']\
.count()\
.to_frame(name = 'number_of_customers')
#display the new dataframe
income_source.sort_values(by = 'number_of_customers', ascending = False)\
.applymap(lambda e: '{:,}'.format(e))
#plot the number of customers per income type
fig = {
"data": [
{
"values": income_source.loc[:,'number_of_customers'].values,
"labels": income_source.index,
"hoverinfo":"label+percent+name",
"hole": .5,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"domain": {"x": [0.29, .99]},
"marker":{"line":{'color': '#000000', 'width':2}}
}],
"layout": {
"title":"Applicant Income Source ($)",
"font":{"size":18},
"annotations": [
{
"font": {
"size": 15},
"showarrow": False,
"text": "Income Source ($)",
"x": 0.76,
"y": 0.5
}
]
}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/3.png', height = 1000, width = 10000)
Next, we will plot the applicant income source for each target type:
#applicants who repaid their loan (TARGET == 0)
mask_repaid = application_train['TARGET'] == 0
repaid = application_train[mask_repaid]
#applicants who did not repay their loan (TARGET == 1)
mask_not_repaid = application_train['TARGET'] == 1
not_repaid = application_train[mask_not_repaid]
#income source for applicants who repaid their loan
repaid_df = repaid.groupby(['NAME_INCOME_TYPE'])['TARGET'].count()
#income source for applicants who did not repay their loan
not_repaid_df = not_repaid.groupby(['NAME_INCOME_TYPE'])['TARGET'].count()
#create the plot
trace_repaid = go.Bar(
x = repaid_df.index,
y = repaid_df.values,
name='Repaid',
marker=dict(line = dict(width = 2.)))
trace_not_repaid = go.Bar(
x = not_repaid_df.index,
y = not_repaid_df.values,
name='Not Repaid',
marker=dict(
color='rgb(111, 235, 146)',
line = dict(width = 2.)),
opacity=0.6)
#get the data
data = [trace_repaid, trace_not_repaid]
#define layouts
layout = go.Layout(
title = "Income Status ($) per Repaid-Not Repaid Loan",
margin = dict(b = 200),
titlefont = dict(size = 21),
width = 1000,
xaxis=dict(
title="Applicant\'s Income Status",
titlefont=dict(size=16,color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'),
tickangle = -50,
),
yaxis=dict(title='Number of Applicants',
titlefont=dict(size=16,color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/4.png', height = 1000, width = 10000)
print('Personal status of the applicant in the dataset: \n\n\n {}'.format(application_train.loc[:,'NAME_FAMILY_STATUS'].unique()))
#find the number of applicants per personal status
family_status = application_train['NAME_FAMILY_STATUS'].value_counts().to_frame(name = 'number_of_customers')
#display the dataframe
family_status.applymap(lambda e: '{:,}'.format(e))
#plot the applicant family status
fig = {
"data": [
{
"values": family_status.loc[:,'number_of_customers'].values,
"labels": family_status.index,
"hoverinfo":"label+percent+name",
"hole": .4,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"domain": {"x": [0.25, .99]},
"marker":{"line":{'color': '#000000', 'width':2}}
},],
"layout": {
"title":"Applicant Personal Status",
"font":{"size":18},
"annotations": [{
"font": {
"size": 15},
"showarrow": False,
"text": "Personal Status",
"x": 0.72,
"y": 0.5
}]
}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/5.png', height = 1000, width = 10000)
Similarly to the plots above, we break the family status down by target value:
#family status for applicants who repaid their loan
repaid_df = repaid.groupby(['NAME_FAMILY_STATUS'])['TARGET'].count().to_frame(name = 'number_of_customers')
#family status for applicants who did NOT repay their loan
not_repaid_df = not_repaid.groupby(['NAME_FAMILY_STATUS'])['TARGET'].count().to_frame(name = 'number_of_customers')
#make the plot
trace_repaid = go.Bar(x = repaid_df.index,
y = repaid_df['number_of_customers'].values,
name='Repaid',
marker=dict(line = dict(width = 2.)),
opacity=0.6)
trace_not_repaid = go.Bar(x = not_repaid_df.index,
y = not_repaid_df['number_of_customers'].values,
name='Not Repaid',
marker=dict(line = dict(width = 2.)),
opacity=0.6)
#define the data
data = [trace_repaid, trace_not_repaid]
#define layouts
layout = go.Layout(title = "Family Status of the Applican (Repaid/Not Repaid Loan)",
titlefont = dict(size = 23),
width = 1000,
xaxis=dict(
title="Applicant\'s Marriege Status",
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')),
yaxis=dict(
title='Number of Applicants',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/6.png', height = 1000, width = 10000)
print('Two variables that indicate whether the applicant owns assets: \n\n\n "FLAG_OWN_CAR" and "FLAG_OWN_REALTY"')
print('Each variable takes two possible values: \n\n\n "{0}" and "{1}"'.format(application_train['FLAG_OWN_REALTY'].unique()[0],
application_train['FLAG_OWN_REALTY'].unique()[1]))
#count the applicants who own a car
own_car = application_train["FLAG_OWN_CAR"].value_counts()
#count the applicants who own real estate
own_real_estate = application_train["FLAG_OWN_REALTY"].value_counts()
#define the two plots
fig = {
"data": [{
"values": own_car.values,
"labels": own_car.index,
"hoverinfo":"label+percent+name",
"domain": {"x": [0, .48]},
"hole": .4,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"marker":{"line":{'color': '#000000', 'width':2}}},
{
"values": own_real_estate.values,
"labels": own_real_estate.index,
"hoverinfo":"label+percent+name",
"domain": {"x": [.58, 1]},
"hole": .4,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"marker":{"line":{'color': '#000000', 'width':2}}
}],
"layout": {
"title":"Purpose of the Loan",
"font":{"size":20},
"annotations": [
{
"font": {"size": 20},
"showarrow": False,
"text": "Own Car",
"x": 0.18,
"y": 0.5
},
{
"font": {"size": 15},
"showarrow": False,
"text": "Own Real Estate",
"x": 0.87,
"y": 0.5
}]}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/7.png', height = 1000, width = 10000)
print('Unique entries in the "OCCUPATION_TYPE" column: \n\n\n {}'.format(application_train['OCCUPATION_TYPE'].unique()))
#number of applicants per occupation
occupation = application_train.groupby(['OCCUPATION_TYPE'])['TARGET'].count().to_frame(name = 'number_of_customers')
#display the number of applicants per occupation in descending order
occupation_sorted = occupation.sort_values(by = 'number_of_customers', ascending = False)
#print the dataframe
occupation_sorted.applymap(lambda e: '{:,}'.format(e))
#define data
data = [go.Bar(x = occupation_sorted.index,
y = occupation_sorted['number_of_customers'].values,
opacity=0.6)]
#define layout
layout = go.Layout(
title = 'Occupation Type',
barmode = 'group',
titlefont=dict(
family='Courier New, monospace',
size=25,
color='#7f7f7f'),
margin = dict(b = 210),
xaxis=dict(title="Applicant\'s Occupation", titlefont=dict(size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')
,tickangle = -50),
yaxis=dict(title='Number of Applicants',
titlefont=dict(size=16,
color='rgb(107, 107, 107)' ),
tickfont=dict(size=14,
color='rgb(107, 107, 107)')))
#display the barplot
fig = go.Figure(data = data , layout = layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/8.png', height = 1000, width = 10000)
#occupation type for applicants who repaid their loan
occupation_repaid = repaid.groupby(['OCCUPATION_TYPE'])['TARGET'].count().to_frame(name = 'number_of_customers')
#occupation type for applicants who did not repay their loan
occupation_not_repaid = not_repaid.groupby(['OCCUPATION_TYPE'])['TARGET'].count().to_frame(name = 'number_of_customers')
#plot the occupation type per target value
trace_good = go.Bar(x = occupation_repaid.index,
y = occupation_repaid['number_of_customers'].values,
name='Repaid',
marker=dict(line = dict(width = 2.)))
trace_bad = go.Bar(x = occupation_not_repaid.index,
y = occupation_not_repaid['number_of_customers'].values,
name='Not Repaid',
marker=dict(
color='rgb(111, 235, 146)',
line = dict(width = 2.)),
opacity=0.6
)
#define the data
data = [trace_good, trace_bad]
#define the layout
layout = go.Layout( title = "Occupation Type of the Applicant (Repaid vs Not Repaid Loan)",
titlefont = dict(size = 22),
width = 1000,
xaxis=dict(
title="Applicant\'s Occupation",
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'),
tickangle = -50),
yaxis=dict(
title='Number of Applicants',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')),
margin = dict(b = 200))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/9.png', height = 1000, width = 10000)
#transform each entry to a positive value
application_train.loc[:,['DAYS_BIRTH']] = -application_train.loc[:,['DAYS_BIRTH']]
#find the years since birth for each applicant
application_train['YEARS_BIRTH'] = application_train.loc[:,['DAYS_BIRTH']]/365
#bin years
application_train['YEARS_BIRTH_BINNED'] = pd.cut(application_train['YEARS_BIRTH'],
bins = np.linspace(20, 70, 11))
#keep only the binned years and the target
years_binned = application_train.loc[:,['YEARS_BIRTH_BINNED','TARGET']]
#mean of TARGET per age bin = the failure-to-repay rate, expressed as a percentage
final_years = years_binned.groupby(['YEARS_BIRTH_BINNED']).mean().applymap(lambda e: np.around(e, 3)*100)
#rename the column to reflect its meaning and display the dataframe
final_years = final_years.rename(columns = {'TARGET':'default_rate'})
final_years
# create a dataframe which contains the bin labels (pd.cut produces right-closed intervals)
bins = pd.DataFrame(data = ['(20-25]',
'(25-30]',
'(30-35]',
'(35-40]',
'(40-45]',
'(45-50]',
'(50-55]',
'(55-60]',
'(60-65]',
'(65-70]'] ,columns = ['binned_years'])
#reset and then drop the index
final_years = final_years.reset_index(drop = True)
#concatenate the bins dataframe with the final_years dataframe
final_df = pd.concat([final_years, bins], axis = 1)
#display the final dataframe
final_df
#define data
data = [go.Bar(x = final_df['binned_years'].values,
y = final_df['default_rate'].values,
marker=dict( color='rgb(111, 235, 146)',
line = dict(width = 2.)),
opacity=0.6)]
#define layout
layout = go.Layout( title = 'Applicant Age (Binned)',
barmode = 'group',
titlefont=dict(
family='Courier New, monospace',
size=25,
color='#7f7f7f'),
margin = dict(b = 210),
xaxis=dict(title="Applicant Age",
titlefont=dict(size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(size=14,
color='rgb(107, 107, 107)')
,tickangle = 0),
yaxis=dict(
title='Failure to Repay (%)',
titlefont=dict(size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')))
#display the barplot
fig = go.Figure(data = data , layout = layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/10.png', height = 1000, width = 10000)
#create pair plots for the YEARS_BIRTH and EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3
#check the EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 variables missing values
application_train.loc[:,['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']]\
.isnull().sum(axis = 0)\
.to_frame(name = 'number_of_missing_values')\
.sort_values(by = 'number_of_missing_values', ascending = False)\
.applymap(lambda e: '{:,}'.format(e))
#get the variables of interest
correlation = application_train.loc[:,['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3', 'TARGET','YEARS_BIRTH']]
#drop missing values (corr_df is used by the pair grid below)
corr_df = correlation.dropna()
#print info
print('The dataframe after removing missing values contains: {0} rows and {1} columns.'\
.format(corr_df.shape[0],corr_df.shape[1]))
# ignore warning
import warnings
warnings.filterwarnings("ignore")
# calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
r = np.corrcoef(x, y)[0][1]
ax = plt.gca()
ax.annotate("r = {:.2f}".format(r),
xy=(.2, .8),
xycoords=ax.transAxes,
size = 20)
# Create the pairgrid object
grid = sns.PairGrid(data = corr_df,
size = 3,
diag_sharey=False,
hue = 'TARGET',
vars = [x for x in list(corr_df.columns) if x != 'TARGET'])
# Upper is a scatter plot, annotated with the correlation coefficient
grid.map_upper(plt.scatter, alpha = 0.2)
grid.map_upper(corr_func)
# Diagonal is a KDE plot
grid.map_diag(sns.kdeplot)
# Lower is a 2D density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);
plt.suptitle('EXT_SOURCE and YEARS Pair Plots', size = 20, y = 1.02);
#plot the heatmap
#define the data
data = [go.Heatmap(
z= application_train.corr().values,
x=application_train.columns.values,
y=application_train.columns.values,
colorscale='Viridis',
reversescale = False,
opacity = 0.9)]
#define the layout
layout = go.Layout(
title='Pearson Correlation for Features',
titlefont = dict(size = 25),
xaxis = dict(ticks='', nticks=36),
yaxis = dict(ticks='' ),
width = 900, height = 900,
margin=dict(
l=300,b = 300))
#plot figure
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/11.png', height = 1000, width = 10000)
Feature engineering and feature selection are the most critical steps in developing a successful machine learning model. Good feature selection helps us maximize relevancy while reducing redundancy, which raises the probability of building a good model while shrinking its overall size.
In our binary classification problem, the application_train table alone contains 122 columns. It is highly unlikely that we need all of them to get the best results from our model. We would like to find an integer K such that keeping the K best features works best, but the number of possible feature subsets is enormous (2^122 ≈ 5 × 10^36), so exhaustive search is out of the question. Our goal is to reduce the number of dimensions without losing predictive power.
To find the K most important features, we will fit a Random Forest classifier and plot the resulting feature importances in descending order, as shown in the barplot below.
#get all the categorical columns in the dataframe
categorical_features = [ft for ft in application_train.columns if application_train[ft].dtype == 'object']
print('The categorical variables are: \n\n\n {}'.format(categorical_features))
#preprocess each categorical variable using scikit-learn LabelEncoder
for col in categorical_features:
lb = preprocessing.LabelEncoder()
application_train[col] = lb.fit_transform(application_train[col].values.astype(dtype = 'str'))
# to avoid the error: ValueError: Input contains NaN, infinity or a value too large for dtype('float32') we fill
# in the missing values with a large negative integer
application_train.fillna(-999, inplace = True)
#use a random forest classifier
rf = RandomForestClassifier(n_estimators=100,
max_depth=16,
min_samples_leaf=8,
max_features=0.5,
random_state=1000)
start = time.time()
#fit the model
rf.fit(application_train.drop(['TARGET', 'SK_ID_CURR'],axis=1), application_train['TARGET'])
end = time.time()
print('The RandomForestClassifier ran in: {0:.3f}s'.format(end - start))
#get the features
features = application_train.drop(['TARGET', 'SK_ID_CURR'],axis=1)
#get the feature importances from the fitted model
feature_importance = rf.feature_importances_
#pair each importance with its column name (iterating a dataframe yields its column names)
#and sort the pairs in ascending order of importance
x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features), reverse = False)))
#plot the horizontal plot
trace = go.Bar( x=x ,
y=y,
marker=dict(
color=x,
colorscale = 'Viridis',
reversescale = True),
name='Feature importance Random Forest',
orientation='h')
layout = dict(title='Feature importance',
width = 900,
height = 2000,
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=True),
margin=dict(l=300))
fig = go.Figure(data=[trace])
fig['layout'].update(layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/12.png', height = 1000, width = 10000)
From the barplot above we can see that the most important features are those related to EXT_SOURCE and DAYS_BIRTH. Only a handful of features carry significant importance for the model, which suggests we may be able to drop many of the others without a decrease in performance. By keeping, say, the top 10 features, we narrow our initial set of over a hundred columns down to the few that matter for our model, as sketched below.
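The following is a minimal sketch of that pruning step, reusing the fitted rf model and the features dataframe from above (the choice of 10 features is arbitrary, for illustration):
#a minimal sketch: keep only the top 10 features ranked by the random forest
importances = pd.Series(rf.feature_importances_, index = features.columns)
top_10 = importances.sort_values(ascending = False).head(10).index.tolist()
print('Top 10 features: \n\n {}'.format(top_10))
#restrict the data to the selected features
reduced_features = features[top_10]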
Once we have found the K most important features, we can use them to develop our machine learning model. We can start with a simple approach, like Logistic Regression or a Naive Bayes classifier, and gradually move to more complex models. We will not dive into this here and leave the step to the reader as an exercise, but a minimal starting point is sketched below.
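The sketch assumes the reduced_features dataframe built above; it holds out a stratified test set and reports the ROC AUC, the evaluation metric used in the Kaggle competition (the hyperparameters are illustrative only):
#a minimal baseline sketch: logistic regression on the selected features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(reduced_features,
                                                    application_train['TARGET'],
                                                    test_size = 0.2,
                                                    random_state = 1000,
                                                    stratify = application_train['TARGET'])
baseline = LogisticRegression(solver = 'liblinear', class_weight = 'balanced')
baseline.fit(X_train, y_train)
#probability of the positive class (loan not repaid)
probabilities = baseline.predict_proba(X_test)[:, 1]
print('Baseline ROC AUC: {0:.3f}'.format(roc_auc_score(y_test, probabilities)))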
In this notebook, we used telco and transactional information from the Kaggle Home Credit Default Risk competition to predict a client's ability to repay a loan. We cleaned the data, performed basic exploratory data analysis and used a Random Forest classifier as a baseline model for the task of feature selection.
A very interesting perspective was offered by consumer lending veteran Matt Harris in a recent article in Forbes magazine:
“Most start-up originators focus on the opportunity to innovate in credit decisioning. The headline appeal of new data sources and new data science seem too good to be true as a way to compete with stodgy old banks. And, in fact, they are indeed too good to be true…
… The recipe for success here starts with picking a truly under-served segment. Then figure out some new methods for sifting the gold from what everyone else sees as sand; this will end up being a combination of data sources, data science and hard won credit observations. …
… progressive and thoughtful traditional lenders like Capital One have mined most of the segments large enough to build a business around. The only way to build a durable competitive moat based on Customer Acquisition is to become integrated into a channel that is proprietary, contextual and data-rich.”
If Harris is right, the key success criterion for new Credit Scoring and Classification tools will not be their speed and accuracy in large-scale applications, for which tools like FICO already exist, but in their flexibility and ease of integration into specialized applications.
As data continues to traverse our “always on” world, the aforementioned methods will surely be fine-tuned to gather the right metrics and reduce risk for all stakeholders. But this doesn't mean that traditional credit scoring methods will go the way of the dinosaur. Instead, they will simply be complemented by a bevy of new data sets that aid in more accurate, comprehensive and profitable risk analyses. And with a new generation of consumers expecting fast, personal and secure credit, the real risk might just be in ignoring these new avenues.