Predicting Creditworthiness of Loan Applicants

Olumide Savage
November 9, 2021


Analytics provides a platform for efficient and effective solving of problems.

Providing solutions quickly and accurately is now a norm in this digital age - reducing downtime and lead time.

Cascade is a no-code analytics tool that lets me solve, in a few hours, analytics problems that could otherwise have taken weeks - gaining quick insights for fast and accurate decision-making.


There is a need to come up with an efficient solution to determine the creditworthiness of customers - to classify customers on whether they can be approved for a loan or not.

It is important to automate and reduce the lead time from receiving a loan application to approving the loan. However, the business objective of reducing the risks of bad loans is still very important.

Cascade provides a platform that makes advanced analytics easy to apply, which makes it well suited to building an efficient and effective loan classification system.

This project starts with a data quality assessment to produce robust data for model building. The cleaned data is then used to train a series of loan classification models, and the best of these is selected to create a list of creditworthy loan applicants.


Model Training Dataset - This data contains information on credit approvals from past loan applicants.

Dataset to be Predicted - This data contains information on the new set of customers that need to be scored with the classification model that is created.

These datasets contain variables such as Duration of Credit Month, Credit Amount, Value Saving Stocks, Account Balance, Type of Apartment, Credit Application Result, Age, Payment Status of Previous Credit, etc. The variables used for this analysis are both numerical and categorical.

Data Quality

Low-Variability: The following fields were removed from the data fed into the model because of their low variability (some are uniform, some are highly skewed toward one value, and one contains just one unique value): Concurrent-Credits, Guarantors, Occupation, No-of-dependents, and Foreign-worker.
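
The low-variability check described above can be sketched in plain Python. This is only an illustration, not what Cascade runs internally: the `dominant_share` and `low_variability_fields` helpers, the 0.8 cutoff, and the sample records are all assumptions made for this example.

```python
from collections import Counter

def dominant_share(values):
    """Return the share of records taken by the most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

def low_variability_fields(records, threshold=0.8):
    """List fields whose most common value covers at least `threshold` of records.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    fields = records[0].keys()
    return [
        name for name in fields
        if dominant_share([row[name] for row in records]) >= threshold
    ]

# Made-up sample records, for illustration only.
sample = [
    {"Occupation": 1, "Foreign-worker": 1, "Purpose": "New car"},
    {"Occupation": 1, "Foreign-worker": 1, "Purpose": "Used car"},
    {"Occupation": 1, "Foreign-worker": 1, "Purpose": "Home Related"},
    {"Occupation": 1, "Foreign-worker": 2, "Purpose": "New car"},
    {"Occupation": 1, "Foreign-worker": 1, "Purpose": "Other"},
]

flagged = low_variability_fields(sample)
print(flagged)  # ['Occupation', 'Foreign-worker']
```

A field with a single unique value has a dominant share of 1.0, so it is always flagged. Highly skewed fields are flagged only when the skew reaches the chosen cutoff.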

Missing Values: The Duration-in-Current-Address field was removed because approximately 69% of its values are null. With more than 50% of the data missing, cleaning up the nulls would affect most of the field, so the entire field was removed instead. The Age-years field, with approximately 2% null records, was imputed with its median of 33 (computed ignoring nulls) to retain the inherent distribution of the field. This is shown in the Box Plot below, where the Age Years field and the Age Years field (Median Imputed) have similar distributions.
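
A minimal sketch of the two missing-value rules described above (drop a field that is more than 50% null, otherwise impute with the median of the non-null values). The `null_share` and `clean_field` helpers and the sample values are hypothetical, and nulls are assumed to be represented as `None`.

```python
from statistics import median

def null_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def clean_field(values, max_null_share=0.5):
    """Drop a field that is mostly null, otherwise median-impute its nulls.

    Returns None when the field should be removed.
    """
    if null_share(values) > max_null_share:
        return None
    # The median is computed over non-null values only.
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

# Made-up values, for illustration only.
duration_in_current_address = [None, 2, None, None, 1, None, None, 4, None, None]
age_years = [25, 33, None, 41, 29, 33, 52, 38, 30, 33]

print(clean_field(duration_in_current_address))  # None (70% null, field removed)
print(clean_field(age_years))  # the single null is replaced by the median, 33
```

Median imputation is a reasonable default for a skewed numeric field such as age, because the median is less sensitive to extreme values than the mean.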

Irrelevant Field: The Telephone field was removed because it cannot be used to build the classification model.

(Charts and tables are live embeds of assets produced in Cascade)


Two predictive models were built with different predictor variables to achieve unique but comparable models for accurately predicting the creditworthiness of loan applicants.

Decision Tree Modeling

Predictor variables: Account-Balance (Encoded), Value-Savings-Stock (Encoded), Duration of Credit Month.

Target variable: Credit-Application-Result

Training Score: 0.8125

Recall Score: 0.7400

Logistic Regression Modeling

Predictor variables: Account-Balance (Encoded), Payment-Status-of-Previous-Credit (Encoded), Credit-Amount, Length of Current Employment (Encoded), Instalment-per-cent, Purpose (Encoded), Most Valuable Available Asset.

Target variable: Credit-Application-Result

Training Score: 0.7925

Recall Score: 0.7800
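
The write-up does not say how Cascade computes these two scores. Assuming the training score is plain accuracy and recall is measured for one chosen class, both can be sketched as follows. The `accuracy` and `recall` helpers, the label names, and the sample labels are assumptions made for this example.

```python
def accuracy(actual, predicted):
    """Share of records where the predicted label matches the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def recall(actual, predicted, positive):
    """Share of actual `positive` records that the model also labels `positive`."""
    true_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    actual_pos = sum(a == positive for a in actual)
    return true_pos / actual_pos

# Made-up labels, for illustration only.
actual = ["Creditworthy", "Creditworthy", "Non-Creditworthy",
          "Creditworthy", "Non-Creditworthy"]
predicted = ["Creditworthy", "Non-Creditworthy", "Non-Creditworthy",
             "Creditworthy", "Creditworthy"]

print(accuracy(actual, predicted))                    # 0.6
print(recall(actual, predicted, "Creditworthy"))      # 2 of 3 found
print(recall(actual, predicted, "Non-Creditworthy"))  # 0.5
```

Recall depends on which class is treated as positive, so the same predictions give different recall values for the two classes.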

Model Deployed

The logistic regression model was selected because its training score (0.7925) is above 0.7 and its recall (0.7800) is higher than the decision tree's (0.7400). Recall is the deciding metric for this model: a higher recall better serves the business objective of reducing the risk of a bad loan, that is, of classifying a non-creditworthy applicant as creditworthy.
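
The selection rule can be written out as a short sketch using the scores reported above. The `select_model` helper is hypothetical, and treating 0.7 as a hard floor on the training score is an assumption read from the text.

```python
# Scores reported above for the two models.
models = {
    "Decision Tree": {"training_score": 0.8125, "recall": 0.7400},
    "Logistic Regression": {"training_score": 0.7925, "recall": 0.7800},
}

def select_model(models, min_training_score=0.7):
    """Keep models above the training-score floor, then pick the highest recall."""
    eligible = {
        name: scores for name, scores in models.items()
        if scores["training_score"] > min_training_score
    }
    return max(eligible, key=lambda name: eligible[name]["recall"])

selected = select_model(models)
print(selected)  # Logistic Regression
```

Both models clear the training-score floor, so recall alone decides the choice, even though the decision tree has the higher training score.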

Creditworthiness List

The deployed model classifies 409 of the 500 loan applicants (81.8%) as creditworthy.

