Predict Loan Defaulter Using Machine Learning

Business Problem

Build a machine learning model that predicts if someone might be a defaulter or a non-defaulter.

Data source: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

Abstract: This dataset classifies people described by a set of attributes as good or bad credit risks. It comes in two formats (one all numeric) and also comes with a cost matrix.

Data Set Characteristics: Multivariate
Number of Instances: 1000
Area: Financial
Attribute Characteristics: Categorical, Integer
Number of Attributes: 20
Date Donated: 1994-11-17
Associated Tasks: Classification
Missing Values?: N/A
Number of Web Hits: 677271

Independent variables

checking_balance
months_loan_duration
credit_history
purpose
existing_loans_count
phone
amount
savings_balance
employment_duration
percent_of_income
job
years_at_residence
age
other_credit
housing
dependents

Data Science Libraries

pandas
numpy
sklearn
matplotlib
seaborn

Data Science Concepts

Exploratory Data Analysis
Regularization / Reducing Overfitting
Bagging & Boosting
Random Forest Classifier
Decision Tree Classifier

What I did

Looked at the data to check whether this is a problem that machine learning can answer
Explored the nature of the data through exploratory data analysis
Created a decision tree model and visualized the tree at http://webgraphviz.com
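The decision tree step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with make_classification, and the 70/30 split, the feature names and the class names are assumptions. export_graphviz returns DOT text when out_file is None, and that text can be pasted into http://webgraphviz.com to render the tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the credit data (1000 rows); NOT the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Assumed 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A decision tree with default settings grows until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Hypothetical feature and class names, used only to label the plot.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
dot_text = export_graphviz(
    tree,
    out_file=None,  # return the DOT source as a string
    feature_names=feature_names,
    class_names=["non_defaulter", "defaulter"],
    filled=True,
)

print(dot_text[:200])  # paste the full text into webgraphviz.com
```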

As fully grown decision trees tend to do, the model overfit. The accuracy was:

Train data accuracy: 100%
Test data accuracy: 69.33%
So I tried regularization to reduce the overfitting. The accuracy became:

Train data accuracy: 75.28%
Test data accuracy: 74.33%
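A minimal sketch of what regularizing a decision tree can look like in sklearn, assuming the regularization was done by limiting tree growth. The max_depth and min_samples_leaf values here are illustrative guesses, not the settings used in this project, and the data is again a synthetic stand-in, so the printed numbers will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; NOT the real credit dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unconstrained tree: memorizes the training data.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Regularized tree: depth and leaf size are capped (illustrative values).
pruned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

for name, model in [("full", full_tree), ("regularized", pruned_tree)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # A smaller train/test gap indicates less overfitting.
    print(f"{name}: train={train_acc:.2%} test={test_acc:.2%} "
          f"gap={train_acc - test_acc:.2%}")
```

The point to compare is the gap between train and test accuracy rather than either number alone.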

Let's see the top five most important features

Feature	Importance
checking_balance	0.492510
months_loan_duration	0.169806
credit_history	0.166109
savings_balance	0.064467
purpose_business	0.051129
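A ranking like this can be read from a fitted tree's feature_importances_ attribute. The sketch below is only an illustration under stated assumptions: the rows, the category values and the target are made up, checking_balance is assumed to be encoded as an ordinal integer, and purpose is assumed to be one-hot encoded (which is how a column such as purpose_business would arise).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Made-up rows that reuse three column names from the variable list.
df = pd.DataFrame({
    "months_loan_duration": rng.integers(6, 60, size=n),
    # Hypothetical ordinal encoding: 0 = low, 1 = medium, 2 = high.
    "checking_balance": rng.integers(0, 3, size=n),
    "purpose": rng.choice(["business", "car", "education"], size=n),
})

# Made-up target: long loans with a low checking balance count as defaults.
y = ((df["months_loan_duration"] > 30) & (df["checking_balance"] == 0)).astype(int)

# One-hot encode the nominal column, producing purpose_business and so on.
X = pd.get_dummies(df, columns=["purpose"], dtype=int)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importances sum to 1; sort them to get the top features.
importance = pd.DataFrame(
    {"importance": model.feature_importances_}, index=X.columns
).sort_values("importance", ascending=False)
top_five = importance.head(5)
print(top_five)
```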

Decision Tree Classifier confusion matrix heat map, model score: 74.33%
Random Forest Classifier (with bagging) confusion matrix heat map, model score: 77.33%
Random Forest Classifier (with AdaBoost) confusion matrix heat map, model score: 74.00%
Random Forest Classifier (with XGBoost) confusion matrix heat map, model score: 74.00%
Random Forest Classifier confusion matrix heat map, model score: 77.66%
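A comparison of this kind could be sketched as below. It is a rough illustration rather than the project's code: the data is synthetic, the hyperparameters are arbitrary, each ensemble uses sklearn's default base estimator instead of a random forest, and sklearn's GradientBoostingClassifier stands in for XGBoost so that the example needs no extra package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; NOT the real credit dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Arbitrary settings; gradient_boost is a stand-in for XGBoost.
models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "ada_boost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient_boost": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "score": model.score(X_test, y_test),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_test, predictions),
    }
    print(f"{name}: model score {results[name]['score']:.2%}")
```

Each stored matrix can then be drawn as a heat map, for example with seaborn.heatmap(results["random_forest"]["confusion"], annot=True, fmt="d").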

Let me see code