from IPython.display import Image
Image(filename='/Users/vlad/Desktop/credit_score.jpg', height = 900, width = 900)
Finance in developing countries is a growing market that is quite distinct from what one finds in developed countries. According to the World Bank, an estimated 2.5 billion adults lack access to financial services. These individuals, who do not use banks or banking institutions in any capacity, are called the unbanked.
The biggest problem the unbanked face, especially in emerging markets, is the lack of formal records that can be used to build traditional credit scores. In these markets, customers cannot simply rely on banks for access to a loan or other financial services, as they usually lack a verifiable credit history. Even in developed countries, due to the risks involved in this kind of service, the microfinance institution loan process tends to be slow and cumbersome.
Customers are usually asked to provide a long list of documents, which means that even in countries with publicly run credit registries or privately run credit agencies, a large portion of the population is not covered.
A simple solution is staring us in the face: the mobile phone. The rapid adoption of cell phones brings a new dynamic to financial markets. It creates an opportunity for start-up firms to emerge, disrupt existing industry structures and deliver a viable solution to the problem.
Ericsson’s Mobility Report (https://www.mobileworldlive.com/featured-content/home-banner/90-global-population-will-mobile-phone-ericsson-report/) predicted in 2014 that 90 percent of people around the globe above the age of six will have a phone by 2020. Cisco, more conservatively, estimates that 70 percent of people around the globe, or 5.4 billion people, will have mobile phones by 2020. This essentially means that while half of the world may lack formal credit records, almost all of them will have a history of financial and behavioural interactions with a mobile service by 2020. With 90 percent of the world’s data created in the last two years, primarily from mobile phones, there is a lot of data that can shed light on the financial inclusion of the unbanked, and mobile phone data is considered a good type of alternative data for credit scoring.

Both mobile and social media data can be used for alternative risk scoring, independent of financial data. As we leave our digital footprint across the social media landscape, we create valuable alternative data for credit lenders to analyse when determining our creditworthiness. The number of friends we have on a social account, how engaged we are with our community and how much our friends engage with us by liking or commenting on our posts are all taken into consideration when underwriting a loan. Even the device we use to log in to a social media platform, whether a smartphone or a computer at an Internet café, provides lenders with clues about our affordability and creditworthiness.
By analysing social connections along with individual online behaviour and lifestyle choices, a clearer picture of the persona behind the online identity emerges, and certain inferences about their creditworthiness can be made.
Thanks to data generated by social media behaviour, credit lenders can now make more informed decisions and offer credit to borrowers who might otherwise have been denied a loan, or been regarded as high-risk and granted a loan only on high-interest terms.
Another data source widely used by lenders as an indicator of customer affordability is mobile phone airtime purchases. An applicant's airtime purchase history is now considered a good indicator of affordability and creditworthiness by some lending institutions looking to fill the gaps in their information. Some mobile network operators consider airtime purchase data so reliable that it is even driving them to enter the micro-loans space to increase customer loyalty, drive mobile money usage and grow their revenue.
Once the data is gathered from the different sources, mobile and social, the most important and difficult task is feature selection. We essentially want to narrow down to a small subset of features that we will later use in the predictive modelling process. The last step involves applying a variety of machine learning techniques and classification algorithms, such as Naïve Bayes, Logistic Regression, Support Vector Machines and ensemble methods, to predict and evaluate the creditworthiness of a potential customer.
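To make this pipeline concrete, here is a minimal sketch using scikit-learn. The data is a synthetic stand-in (via make_classification) for real mobile and social features, so the numbers are purely illustrative:
#a minimal sketch of the feature selection + classification pipeline described above
#the data is synthetic and stands in for engineered mobile/social features
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples = 1000, n_features = 50, n_informative = 10, random_state = 0)
pipeline = Pipeline([
    ('select', SelectKBest(score_func = f_classif, k = 10)), #keep the 10 most informative features
    ('clf', LogisticRegression(solver = 'liblinear'))        #simple, interpretable baseline
])
scores = cross_val_score(pipeline, X, y, cv = 5, scoring = 'roc_auc')
print('Mean CV ROC AUC: {0:.3f}'.format(scores.mean()))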
In this notebook we will use a variety of alternative data, including telco and transactional information, to solve a real-world problem: predicting a client's repayment ability. The data was provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population, as part of a Kaggle competition to see what kind of models the machine learning community can develop to help them in their task.
The data can be found on the Kaggle website (https://www.kaggle.com/c/home-credit-default-risk/data) and consists of 7 sources/CSV tables, as can be seen in the image below.
from IPython.display import Image
Image(filename='/Users/vlad/Desktop/home_credit_default_risk.png', height = 900, width = 900)
To keep this notebook manageable and easy to understand, we will focus mainly on the application_train table, which consists of 122 columns, including the TARGET variable: 0 indicates that the loan was repaid, 1 that it was not.
#import the libraries used throughout this notebook
import os
import time
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
import sklearn
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
#define the path of the application_train csv file
application_train_path = '/Users/vlad/Desktop/Home Credit Default Risk/all/application_train.csv'
#load the application_train file into a pandas dataframe
start = time.time()
application_train = pd.read_csv(filepath_or_buffer = application_train_path, sep = ',')
end = time.time()
print('DataFrame loaded in {0:.3f} sec'.format(end-start))
Let's have a look at the first and last ten rows of the application_train dataframe:
#get the first 10 rows
application_train.head(n = 10)
#get the last 10 rows
application_train.tail(n = 10)
print('The dataframe consists of {} variables.'.format(len(application_train.columns)))
The next thing to do is look at the data type of each column:
#check the data type of each column
data_types = application_train.dtypes.to_frame(name = 'column_data_type')
#display only the first 10 columns
data_types.head(n = 10)
#get the number of missing values in each column
missing_values = application_train.isnull().sum().to_frame(name = 'missing_values')
#sort the missing values in a descending order
missing_values_sorted = missing_values.sort_values(by = ['missing_values'], ascending = False)
#find the percentage of missing values in each column
#get the number of rows in the dataframe
total_rows = application_train.shape[0]
#find the percentage of missing values in each column
missing_values_percentage = (missing_values_sorted/total_rows)*100
#get the first two decimals from each entry and display the first 10 rows
missing_values_percentage.applymap(lambda e: '{0:.2f}%'.format(e)).head(n = 10)
Now, we will visualize, using Plotly, the percentage of missing values for the first 50 columns:
#get the first 50 rows and round each entry to 2 decimals
frame = missing_values_percentage[0:50].applymap(lambda e: np.around(e,2))
init_notebook_mode(connected = True)
#define data
data = [go.Bar(x = frame.index,
y = frame['missing_values'].values,
opacity = 0.6)]
#define layout
layout = go.Layout(
title = 'Missing Values (%)-First 50 Columns',
barmode = 'group',
titlefont=dict(
family='Courier New, monospace',
size=25,
color='#7f7f7f'),
margin = dict(b = 210),
yaxis=dict(
title='Percentage (%)',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)' )),
xaxis=dict(
title="Column Name",
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'),
tickangle = -50)
)
#display the barplot
fig = go.Figure(data = data , layout = layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/1.png', height = 1000, width = 10000)
Next, we will check if the target variable is balanced or not.
#count the number of ones and zeros in the TARGET column
target_variable_count = application_train['TARGET'].\
value_counts(ascending = True).\
to_frame(name = "ones_zeros_count")
#total number of ones and zeros
total_count = target_variable_count.sum(axis = 0)
#percentage of ones and zeros
ones_zeros_percentage = ((target_variable_count/total_count)*100).\
applymap(lambda e: np.around(e,2)).\
rename(columns = {'ones_zeros_count':'ones_zeros_percentage'})
#concatenate count and percentage dataframes
final = pd.concat([target_variable_count,ones_zeros_percentage], axis = 1)
#get a nice representation
finalCopy = final.copy()
finalCopy['ones_zeros_count'] = finalCopy['ones_zeros_count'].apply(lambda e: '{:,}'.format(e))
finalCopy['ones_zeros_percentage'] = finalCopy['ones_zeros_percentage'].apply(lambda e: '{}%'.format(e))
finalCopy
%matplotlib inline
colors = ['#66b3ff','green']
fig, ax = plt.subplots(figsize = (8,9))
#class counts and percentages hardcoded from the finalCopy dataframe above
labels = ['Class 1 \n (24,825)','Class 0 \n(282,686)']
sizes = [8.07,91.93]
explode = [0.1,0]
patches, texts, autotexts = ax.pie(sizes,
labels = labels,
explode = explode,
shadow = True,
autopct='%1.1f%%',
startangle = 180,
colors = colors)
ax.axis('equal')
for i in range(2):
    texts[i].set_fontsize(18)
    autotexts[i].set_fontsize(18)
ax.set_title('Distribution of the TARGET variable', size = 20, y = 1.009)
From the pie chart above we can see that the TARGET variable is highly imbalanced: only about 8% of the loans were not repaid.
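One common remedy for such imbalance, sketched here as an aside rather than a step of the original analysis, is to reweight the classes so that errors on the rare class cost more. scikit-learn can compute suitable weights directly:
#compute balanced class weights for the imbalanced TARGET variable
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight(class_weight = 'balanced',
                               classes = np.array([0, 1]),
                               y = application_train['TARGET'].values)
print('Class weights (0, 1): {}'.format(weights))
#the same effect is obtained by passing class_weight='balanced' to classifiers
#such as LogisticRegression or RandomForestClassifier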
#get the unique entries in the NAME_CONTRACT_TYPE column
loan_types = application_train['NAME_CONTRACT_TYPE'].unique()
print("Two types of loans: \n\n\n '{0}' and '{1}'".format(loan_types[0], loan_types[1]))
#find the number of loans issued
loans = application_train.groupby(['NAME_CONTRACT_TYPE'])['TARGET'].count().to_frame(name = 'number_of_loans')
#display the dataframe
loans.applymap(lambda e: '{:,}'.format(e))
Let's plot the number of each loan type:
#plot the distribution of issued loans
fig = {
"data": [
{
"values": loans['number_of_loans'].values,
"labels": loans.index,
"hoverinfo":"label+percent+name",
"hole": .5,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"domain": {"x": [0.22, .99]},#<--------prosoxh!!!!
"marker":{"line":{'color': '#000000', 'width':2}}
},
],
"layout": {
"title":"Number of Issued Loans",
"font":{"size":18},
"annotations": [
{
"font": {
"size": 20
},
"showarrow": False,
"text": "#Loans",
"x": 0.60,
"y": 0.5
}
]
}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/2.png', height = 1000, width = 10000)
print('Income sources in the dataset: \n\n\n {}'.format(application_train['NAME_INCOME_TYPE'].unique()))
#find the income source of each applicant
income_source = application_train.groupby(['NAME_INCOME_TYPE'])['TARGET']\
.count()\
.to_frame(name = 'number_of_customers')
#display the new dataframe
income_source.sort_values(by = 'number_of_customers', ascending = False)\
.applymap(lambda e: '{:,}'.format(e))
#plot the number of customers per income type
fig = {
"data": [
{
"values": income_source.loc[:,'number_of_customers'].values,
"labels": income_source.index,
"hoverinfo":"label+percent+name",
"hole": .5,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"domain": {"x": [0.29, .99]},
"marker":{"line":{'color': '#000000', 'width':2}}
}],
"layout": {
"title":"Applicant Income Source ($)",
"font":{"size":18},
"annotations": [
{
"font": {
"size": 15},
"showarrow": False,
"text": "Income Source ($)",
"x": 0.76,
"y": 0.5
}
]
}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/3.png', height = 1000, width = 10000)
Next, we will plot the applicant income source for each target type:
#applicants who repaid their loan (TARGET == 0)
mask_repaid = application_train['TARGET'] == 0
repaid = application_train[mask_repaid]
#applicants who did not repay their loan (TARGET == 1)
mask_not_repaid = application_train['TARGET'] == 1
not_repaid = application_train[mask_not_repaid]
#income source for applicants who repaid their loan
repaid_df = repaid.groupby(['NAME_INCOME_TYPE'])['TARGET'].count()
#income source for applicants who did not repay their loan
not_repaid_df = not_repaid.groupby(['NAME_INCOME_TYPE'])['TARGET'].count()
#create the plot
trace_repaid = go.Bar(
x = repaid_df.index,
y = repaid_df.values,
name='Repaid',
marker=dict(line = dict(width = 2.)))
trace_not_repaid = go.Bar(
x = not_repaid_df.index,
y = not_repaid_df.values,
name='Not Repaid',
marker=dict(
color='rgb(111, 235, 146)',
line = dict(width = 2.)),
opacity=0.6)
#get the data
data = [trace_repaid, trace_not_repaid]
#define layouts
layout = go.Layout(
title = "Income Status ($) per Repaid-Not Repaid Loan",
margin = dict(b = 200),
titlefont = dict(size = 21),
width = 1000,
xaxis=dict(
title="Applicant\'s Income Status",
titlefont=dict(size=16,color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'),
tickangle = -50,
),
yaxis=dict(title='Number of Applicants',
titlefont=dict(size=16,color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/4.png', height = 1000, width = 10000)
print('Personal status of the applicant in the dataset: \n\n\n {}'.format(application_train.loc[:,'NAME_FAMILY_STATUS'].unique()))
#find the number of applicants per personal status
family_status = application_train['NAME_FAMILY_STATUS'].value_counts().to_frame(name = 'number_of_customers')
#display the dataframe
family_status.applymap(lambda e: '{:,}'.format(e))
#plot the applicant family status
fig = {
"data": [
{
"values": family_status.loc[:,'number_of_customers'].values,
"labels": family_status.index,
"hoverinfo":"label+percent+name",
"hole": .4,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"domain": {"x": [0.25, .99]},
"marker":{"line":{'color': '#000000', 'width':2}}
},],
"layout": {
"title":"Applicant Personal Status",
"font":{"size":18},
"annotations": [{
"font": {
"size": 15},
"showarrow": False,
"text": "Personal Status",
"x": 0.72,
"y": 0.5
}]
}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/5.png', height = 1000, width = 10000)
Similarly to the plots above, we break the family status down by target value:
#family status for applicants who repaid their loan
repaid_df = repaid.groupby(['NAME_FAMILY_STATUS'])['TARGET'].count().to_frame(name = 'number_of_customers')
#family status for applicants who did NOT repay their loan
not_repaid_df = not_repaid.groupby(['NAME_FAMILY_STATUS'])['TARGET'].count().to_frame(name = 'number_of_customers')
#make the plot
trace_repaid = go.Bar(x = repaid_df.index,
y = repaid_df['number_of_customers'].values,
name='Repaid',
marker=dict(line = dict(width = 2.)),
opacity=0.6)
trace_not_repaid = go.Bar(x = not_repaid_df.index,
y = not_repaid_df['number_of_customers'].values,
name='Not Repaid',
marker=dict(line = dict(width = 2.)),
opacity=0.6)
#define the data
data = [trace_repaid, trace_not_repaid]
#define layouts
layout = go.Layout(title = "Family Status of the Applican (Repaid/Not Repaid Loan)",
titlefont = dict(size = 23),
width = 1000,
xaxis=dict(
title="Applicant\'s Marriege Status",
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')),
yaxis=dict(
title='Number of Applicants',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/6.png', height = 1000, width = 10000)
print('Two variables that indicate whether the applicant owns assets: \n\n\n "FLAG_OWN_CAR" and "FLAG_OWN_REALTY"')
print('Each variable takes two possible values: \n\n\n "{0}" and "{1}"'.format(application_train['FLAG_OWN_REALTY'].unique()[0],
application_train['FLAG_OWN_REALTY'].unique()[1]))
#count the applicants who own a car
own_car = application_train["FLAG_OWN_CAR"].value_counts()
#count the applicants who own real estate
own_real_estate = application_train["FLAG_OWN_REALTY"].value_counts()
#define the two plots
fig = {
"data": [{
"values": own_car.values,
"labels": own_car.index,
"hoverinfo":"label+percent+name",
"domain": {"x": [0, .48]},
"hole": .4,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"marker":{"line":{'color': '#000000', 'width':2}}},
{
"values": own_real_estate.values,
"labels": own_real_estate.index,
"hoverinfo":"label+percent+name",
"domain": {"x": [.58, 1]},
"hole": .4,
"type": "pie",
"textposition":"inside",
"hoverinfo":"label+text+value",
"marker":{"line":{'color': '#000000', 'width':2}}
}],
"layout": {
"title":"Purpose of the Loan",
"font":{"size":20},
"annotations": [
{
"font": {"size": 20},
"showarrow": False,
"text": "Own Car",
"x": 0.18,
"y": 0.5
},
{
"font": {"size": 15},
"showarrow": False,
"text": "Own Real Estate",
"x": 0.87,
"y": 0.5
}]}
}
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/7.png', height = 1000, width = 10000)
print('Unique entries in the "OCCUPATION_TYPE" column: \n\n\n {}'.format(application_train['OCCUPATION_TYPE'].unique()))
#number of applicants per occupation
occupation = application_train.groupby(['OCCUPATION_TYPE'])['TARGET'].count().to_frame(name = 'number_of_customers')
#display the number of applicants per occupation in descending order
occupation_sorted = occupation.sort_values(by = 'number_of_customers', ascending = False)
#print the dataframe
occupation_sorted.applymap(lambda e: '{:,}'.format(e))
#define data
data = [go.Bar(x = occupation_sorted.index,
y = occupation_sorted['number_of_customers'].values,
opacity=0.6)]
#define layout
layout = go.Layout(
title = 'Occupation Type',
barmode = 'group',
titlefont=dict(
family='Courier New, monospace',
size=25,
color='#7f7f7f'),
margin = dict(b = 210),
xaxis=dict(title="Applicant\'s Occupation", titlefont=dict(size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')
,tickangle = -50),
yaxis=dict(title='Number of Applicants',
titlefont=dict(size=16,
color='rgb(107, 107, 107)' ),
tickfont=dict(size=14,
color='rgb(107, 107, 107)')))
#display the barplot
fig = go.Figure(data = data , layout = layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/8.png', height = 1000, width = 10000)
#occupation type for applicants who repaid their loan
occupation_repaid = repaid.groupby(['OCCUPATION_TYPE'])['TARGET'].count().to_frame(name = 'number_of_customers')
#occupation type for applicants who did not repay their loan
occupation_not_repaid = not_repaid.groupby(['OCCUPATION_TYPE'])['TARGET'].count().to_frame(name = 'number_of_customers')
#plot the occupation type per target value
trace_good = go.Bar(x = occupation_repaid.index,
y = occupation_repaid['number_of_customers'].values,
name='Repaid',
marker=dict(line = dict(width = 2.)))
trace_bad = go.Bar(x = occupation_not_repaid.index,
y = occupation_not_repaid['number_of_customers'].values,
name='Not Repaid',
marker=dict(
color='rgb(111, 235, 146)',
line = dict(width = 2.)),
opacity=0.6
)
#define the data
data = [trace_good, trace_bad]
#define the layout
layout = go.Layout( title = "Occupation Type of the Applicant (Repaid vs Not Repaid Loan)",
titlefont = dict(size = 22),
width = 1000,
xaxis=dict(
title="Applicant\'s Occupation",
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)'),
tickangle = -50),
yaxis=dict(
title='Number of Applicants',
titlefont=dict(
size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')),
margin = dict(b = 200))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/9.png', height = 1000, width = 10000)
#transform each entry to a positive value
application_train.loc[:,['DAYS_BIRTH']] = -application_train.loc[:,['DAYS_BIRTH']]
#find the years since birth for each applicant
application_train['YEARS_BIRTH'] = application_train.loc[:,['DAYS_BIRTH']]/365
#bin years
application_train['YEARS_BIRTH_BINNED'] = pd.cut(application_train['YEARS_BIRTH'],
bins = np.linspace(20, 70, 11))
#keep only the binned years and the target
years_binned = application_train.loc[:,['YEARS_BIRTH_BINNED','TARGET']]
#mean of TARGET per age bin = the failure-to-repay rate, expressed as a percentage
final_years = years_binned.groupby(['YEARS_BIRTH_BINNED']).mean().applymap(lambda e: np.around(e, 3)*100)
#rename the column to reflect its meaning and display the dataframe
final_years = final_years.rename(columns = {'TARGET':'default_rate'})
final_years
# create a dataframe which contains the bin labels (pd.cut produces right-closed intervals)
bins = pd.DataFrame(data = ['(20-25]',
'(25-30]',
'(30-35]',
'(35-40]',
'(40-45]',
'(45-50]',
'(50-55]',
'(55-60]',
'(60-65]',
'(65-70]'] ,columns = ['binned_years'])
#reset and then drop the index
final_years = final_years.reset_index(drop = True)
#concatenate the bins dataframe with the final_years dataframe
final_df = pd.concat([final_years, bins], axis = 1)
#display the final dataframe
final_df
#define data
data = [go.Bar(x = final_df['binned_years'].values,
y = final_df['default_rate'].values,
marker=dict( color='rgb(111, 235, 146)',
line = dict(width = 2.)),
opacity=0.6)]
#define layout
layout = go.Layout( title = 'Applicant Age (Binned)',
barmode = 'group',
titlefont=dict(
family='Courier New, monospace',
size=25,
color='#7f7f7f'),
margin = dict(b = 210),
xaxis=dict(title="Applicant Age",
titlefont=dict(size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(size=14,
color='rgb(107, 107, 107)')
,tickangle = 0),
yaxis=dict(
title='Failure to Repay (%)',
titlefont=dict(size=16,
color='rgb(107, 107, 107)'),
tickfont=dict(
size=14,
color='rgb(107, 107, 107)')))
#display the barplot
fig = go.Figure(data = data , layout = layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/10.png', height = 1000, width = 10000)
#create pair plots for the YEARS_BIRTH and EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3
#check the EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 variables missing values
application_train.loc[:,['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']]\
.isnull().sum(axis = 0)\
.to_frame(name = 'number_of_missing_values')\
.sort_values(by = 'number_of_missing_values', ascending = False)\
.applymap(lambda e: '{:,}'.format(e))
#get the variables of interest
correlation = application_train.loc[:,['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3', 'TARGET','YEARS_BIRTH']]
#drop missing values (corr_df is used by the pair grid below)
corr_df = correlation.dropna()
#print info
print('The dataframe after removing missing values contains: {0} rows and {1} columns.'\
.format(corr_df.shape[0],corr_df.shape[1]))
# ignore warning
import warnings
warnings.filterwarnings("ignore")
# calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
r = np.corrcoef(x, y)[0][1]
ax = plt.gca()
ax.annotate("r = {:.2f}".format(r),
xy=(.2, .8),
xycoords=ax.transAxes,
size = 20)
# Create the pairgrid object
grid = sns.PairGrid(data = corr_df,
size = 3,
diag_sharey=False,
hue = 'TARGET',
vars = [x for x in list(corr_df.columns) if x != 'TARGET'])
# Upper is a scatter plot, annotated with the correlation coefficient
grid.map_upper(plt.scatter, alpha = 0.2)
grid.map_upper(corr_func)
# Diagonal is a KDE plot
grid.map_diag(sns.kdeplot)
# Lower is a 2D density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);
plt.suptitle('EXT_SOURCE and YEARS Pair Plots', size = 20, y = 1.02);
#plot the heatmap
#define the data
data = [go.Heatmap(
z= application_train.corr().values,
x=application_train.columns.values,
y=application_train.columns.values,
colorscale='Viridis',
reversescale = False,
opacity = 0.9)]
#define the layout
layout = go.Layout(
title='Pearson Correlation for Features',
titlefont = dict(size = 25),
xaxis = dict(ticks='', nticks=36),
yaxis = dict(ticks='' ),
width = 900, height = 900,
margin=dict(
l=300,b = 300))
#plot figure
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/11.png', height = 1000, width = 10000)
Feature engineering and feature selection are the most critical steps in developing a successful machine learning model. Good feature selection helps us maximize relevancy while reducing redundancy, which raises the probability of building a good model while shrinking its overall size.
In our binary classification problem, the application_train table alone contains 122 columns. It is highly unlikely that we need all of them to get the best results from our model. We would like to find an integer K such that keeping the K best features works best, but the number of possible feature subsets is enormous (2^122 ≈ 5 × 10^36), so exhaustive search is out of the question. Our goal is to reduce the number of dimensions without losing predictive power.
To find the K most important features, we will fit a Random Forest classifier and plot the resulting feature importances in descending order, as shown in the barplot below.
#get all the categorical columns in the dataframe
categorical_features = [ft for ft in application_train.columns if application_train[ft].dtype == 'object']
print('The categorical variables are: \n\n\n {}'.format(categorical_features))
#preprocess each categorical variable using scikit-learn LabelEncoder
for col in categorical_features:
lb = preprocessing.LabelEncoder()
application_train[col] = lb.fit_transform(application_train[col].values.astype(dtype = 'str'))
# to avoid the error: ValueError: Input contains NaN, infinity or a value too large for dtype('float32') we fill
# in the missing values with a large negative integer
application_train.fillna(-999, inplace = True)
#use a random forest classifier
rf = RandomForestClassifier(n_estimators=100,
max_depth=16,
min_samples_leaf=8,
max_features=0.5,
random_state=1000)
start = time.time()
#fit the model
rf.fit(application_train.drop(['TARGET', 'SK_ID_CURR'],axis=1), application_train['TARGET'])
end = time.time()
print('The RandomForestClassifier ran in: {0:.3f}s'.format(end - start))
#get the features
features = application_train.drop(['TARGET', 'SK_ID_CURR'],axis=1)
#get the feature importances from the fitted model
feature_importance = rf.feature_importances_
#pair each importance with its column name (iterating a dataframe yields its column names)
#and sort the pairs in ascending order of importance
x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features), reverse = False)))
#plot the horizontal plot
trace = go.Bar( x=x ,
y=y,
marker=dict(
color=x,
colorscale = 'Viridis',
reversescale = True),
name='Feature importance Random Forest',
orientation='h')
layout = dict(title='Feature importance',
width = 900,
height = 2000,
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=True),
margin=dict(l=300))
fig = go.Figure(data=[trace])
fig['layout'].update(layout)
iplot(fig)
Image(filename='/Users/vlad/Desktop/Home Credit Default Risk/blog_pictures/pics_cropped/12.png', height = 1000, width = 10000)
From the barplot above we can see that the most important features are those related to EXT_SOURCE and DAYS_BIRTH. Only a handful of features carry significant importance for the model, which suggests we may be able to drop many of the others without a decrease in performance. By keeping, say, the top 10 features, we narrow our initial set of over a hundred columns down to the few that matter for our model, as sketched below.
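The following is a minimal sketch of that pruning step, reusing the fitted rf model and the features dataframe from above (the choice of 10 features is arbitrary, for illustration):
#a minimal sketch: keep only the top 10 features ranked by the random forest
importances = pd.Series(rf.feature_importances_, index = features.columns)
top_10 = importances.sort_values(ascending = False).head(10).index.tolist()
print('Top 10 features: \n\n {}'.format(top_10))
#restrict the data to the selected features
reduced_features = features[top_10]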
Once we have found the K most important features, we can use them to develop our machine learning model. We can start with a simple approach, like Logistic Regression or a Naive Bayes classifier, and gradually move to more complex models. We will not dive into this here and leave the step to the reader as an exercise, but a minimal starting point is sketched below.
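The sketch assumes the reduced_features dataframe built above; it holds out a stratified test set and reports the ROC AUC, the evaluation metric used in the Kaggle competition (the hyperparameters are illustrative only):
#a minimal baseline sketch: logistic regression on the selected features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(reduced_features,
                                                    application_train['TARGET'],
                                                    test_size = 0.2,
                                                    random_state = 1000,
                                                    stratify = application_train['TARGET'])
baseline = LogisticRegression(solver = 'liblinear', class_weight = 'balanced')
baseline.fit(X_train, y_train)
#probability of the positive class (loan not repaid)
probabilities = baseline.predict_proba(X_test)[:, 1]
print('Baseline ROC AUC: {0:.3f}'.format(roc_auc_score(y_test, probabilities)))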
In this notebook, we used telco and transactional information from the Kaggle Home Credit Default Risk competition to predict a client's ability to repay a loan. We cleaned the data, performed basic exploratory data analysis and used a Random Forest classifier as a baseline model for the task of feature selection.
A very interesting perspective was offered by consumer lending veteran Matt Harris in a recent article in Forbes magazine:
“Most start-up originators focus on the opportunity to innovate in credit decisioning. The headline appeal of new data sources and new data science seem too good to be true as a way to compete with stodgy old banks. And, in fact, they are indeed too good to be true…
… The recipe for success here starts with picking a truly under-served segment. Then figure out some new methods for sifting the gold from what everyone else sees as sand; this will end up being a combination of data sources, data science and hard won credit observations. …
… progressive and thoughtful traditional lenders like Capital One have mined most of the segments large enough to build a business around. The only way to build a durable competitive moat based on Customer Acquisition is to become integrated into a channel that is proprietary, contextual and data-rich.”
If Harris is right, the key success criterion for new Credit Scoring and Classification tools will not be their speed and accuracy in large-scale applications, for which tools like FICO already exist, but in their flexibility and ease of integration into specialized applications.
As data continues to traverse our “always on” world, the aforementioned methods will surely be fine-tuned to gather the right metrics and reduce risk for all stakeholders. But this doesn't mean that traditional credit scoring methods will go the way of the dinosaur. Instead, they will simply be complemented by a bevy of new data sets that aid in more accurate, comprehensive and profitable risk analyses. And with a new generation of consumers expecting fast, personal and secure credit, the real risk might just be in ignoring these new avenues.