Can financial analysts optimize a loan portfolio based on data-centric strategies & quantitative algorithms?
Published in · 13 min read · Jul 19, 2020
--
Authors: Noah Mukhtar, Shaan Kohli, Shaher yar Jahangir and Ramy Hammam
While banks have traditionally acted as the main provider of loans, the needs of smaller businesses are evolving and now come with more concerns.
2020 has brought along spikes in operating costs and increasing fluctuations in demand.
These are some of the extreme challenges SMEs face when it comes to raising capital — a crucial element of their growth. Banks often overlook these, and as a result, this has opened up the door for the P2P Lending Market to grow.
This is why we are going to create a data-driven investment strategy for Prosper Marketplace, a P2P lending firm, that will take a holistic overview from processing the raw data to inferring business outcomes.
Overall, our predictive model will act as a prescriptive tool that directly portrays the impact on the investor & business.
Two fundamental steps are required for an investor to gain a better understanding of which loan to invest in:
- Prosper’s investors will have to decide on how much money to invest (considering potential return) and allocate to other options for investment.
- Prosper’s investors will need to decide on picking “good” loans (i.e.: loans which will end up being fully paid off) in which to invest their money, and this is what our loan classification model will focus on.
However, it is important to note that there is no one ideal investment, as it is highly contingent on the risk appetite of the investor.
The datasets to be utilized in this analysis are publicly available online for download using a registered Prosper account.
Predictors
The dataset contains comprehensive information on all loans posted annually and a set of 22 variables that cover the different attributes of an individual loan.
These include but are not limited to the loan amount, the amount of interest & fees paid, loan duration, a Prosper credit rating score, and loan status with reason if completed or defaulted.
Completed vs. Defaulted Loan: 0s vs 1s
There are 4 outcomes for a loan in Prosper’s data:
- Completed (A)
- Charged Off (B)
- Current (C)
- Defaulted (D)
To simplify our Prosper data analysis, we decided to assign a value of 0 to completed loans (A) and 1 to the remaining 3 types of loan statuses (B, C, & D) that refer to loans that have not been completed.
How Can We Identify a “Good” Loan?
In order to acquire an intuition of identifying good loans, it is important to realize the correlation & interaction amongst all the variables.
However, the definition of a good loan is still quite ambiguous so it is vital that the prediction should be based on one of these goals: whether a loan will default, paid back early, time of default, or time taken to pay back if paid back early.
Analyzing Loan Terms
The plot indicates some interesting insights for the 60-month term (the longer-term) such as the highest loan amount being approximately $15,000 for both completed loans as well as uncompleted loans. Unsurprisingly, borrowers tend to take higher loan amounts to pay back over the longer-term loans.
Whereas loans in the 36-month terms (the shorter term) tend to be of much smaller amounts and are more consistently distributed for both completed and uncompleted loans.
Analyzing Interest Rate Distribution Per Loan Status
Another violin plot was built that shows the different levels of interest rates for each of the loan status per grade type (ranging from AA — HR: highest grade to lowest).
Higher-grade loans for both completed (0) and uncompleted (1) have lower interest rates, and this trend is consistent for both loan statuses.
Data Leakage
This phenomenon occurs when a model is built using predictors that will not be available at the time of future predictions. Therefore, we need to disregard these predictors in order to prevent a bias in our model.
There are two main cases on how leakage can be identified:
1. A predictor that is highly correlated with the target variable. We will check this using a heatmap or correlation matrix of all 22 predictors.
2. The predictor information is not available during the prediction of total return, for example, the late_fees_paid variable.
As suspected, we noticed that variables that are associated with data leakage such as the interest paid and principal paid stood out to have a high correlation with loan status.
Based on this definition of leakage and the fact that most of these won’t be available when predicting whether a loan will default or not, we disregard the following predictors:
loan_number, late_fees, age_in_months, days_past_due, origination_date, principal_balance, principal_paid, interest_paid, late_fees_paid, debt_sales_proceeds_received, loan_default_reason, loan_default_reason_description, next_payment_due_date, next_payment_due_amount, co_borrower_application
Calculation of Investor’s Returns
The most significant data required in determining the potential return from a given loan is the calculation of the total payments that were received on each of the loans.
To build an effective investment strategy, we need to build a strong indicator variable on the return amount of each loan. It is vital that the return should consider both partially paid off defaulted loans and loans that have been paid off earlier than the due date.
In general, there are three effective return measures that can be created using the following variables:
1. Total payment account denoted by variable p
2. Total amount invested in the loan (amount borrowed by borrower) denoted by variable f
3. Nominal length of the loan in months (term of loan) denoted by the variable t
4. Actual length of the loan in months (loan length) denoted by the variable m
We are going to calculate the expected returns by scenario testing under 3 different strategies:
1) Pessimistic (M1)
The pessimistic approach states that when the loan is paid back, the investor still cannot re-invest it until the term of the loan expires.
Using equation 1, our team was able to create a new variable called: “ret_PESS”.
It was worth noting that this approach favors long-term loans for loans that default early due to the loss spreading over a greater span.
2) Optimistic (M2)
The optimistic approach entails that once the loan is paid back, the investor’s money is returned, and the investor can invest immediately in another loan with the same return.
Using equation 2, our team was able to create a new variable called: “ret_OPT”.
This method can be rephrased as the annual monthly return of the loan over the time it was active. However, it is plagued by the assumption that money can be reinvested with the same rate and that if the loan defaults early, the loss can be a drastic overestimate of negative returns. On the plus side, short and long loans are treated equally using this approach.
3) Fixed-horizon (M3)
The fixed-horizon approach involves calculating fixed-time returns for 3 different interest rates: 1,3 & 6 %.
A function was built, called ret_method_3(), to replicate equation 3 and calculate the 3 new variables for each loan in the dataset.
This approach can be the most accurate by equalizing the differences between loan length and defaulted loans. However, it tends to disregard the depreciation value of the money over time.
Prosper Rating (Grade) Breakdown
A deeper dive into the prosper credit rating (“grade”) variable was performed to explore any significant trends with each of the type of return (M1-M3).
The higher the grade, the smaller percentage of defaulting, and the lower the interest rates.
The mean return % is very identical with minor increases across the different grades. In addition, there are no negative mean return values across any of the grades.
A quick count of the number of loan status for each type shows a significantly higher number of completed loans when compared to uncompleted loans, indicating a class imbalance in the dataset.
Therefore, the most probable option is to try to rebalance the dataset to consist of 50% of completed loans and 50% of uncompleted loans. However, doing this could prove to be detrimental and cause the average return to be negative, fully going against the point of enticing potential investors to sign up to Prosper.
Thus, to capture actual real-life trends and true proportions, physically balancing the dataset would hinder a realistic understanding of the analysis.
Instead, a calibration curve measure will be used in the model-building phase to check if a constructed model is biased by the data imbalance or not.
The predictive modeling will cover a two-stage approach:
First, we will be constructing a binary classification model to predict the loan default probability using a variety of industry-standard algorithms and evaluating the performance for the optimal and most accurate model.
The second stage involves constructing a regression model using a variety of industry-standard regressors to predict the amount of return a loan may generate to an investor for each of the return approaches (M1-M3).
The models in both phases will be optimized for higher accuracy by hyperparameter tuning through cross-validation.
Stage 1: Classifying Loan Outcomes
The first model was to determine the predictive power of the grade and interest rate of a loan, which is usually the industry’s standard practice.
Predictors: Grade & Interest Rate
Model #1 — Accuracy Score: 85%
This implies that these two features individually contribute to successfully determining the loan status as much as when all the predictors are used.
However, to build a more robust model and avoid underfitting, we have to drop these two features entirely.
Predictors: All Excluding Grade & Interest Rate
Model #2 — Accuracy Score: ~85%
Using all the predictors apart from grade and interest rate, model accuracies were maximized at roughly 85%. Although Naïve Bayes provided a very high accuracy, it was unable to account for the biases presented due to the data imbalance.
The calibration curves and AUC scores for all the above models are high, indicating strong performance. The calibration code shows a measure to see if the model is affected by the imbalance. If the calibration is perfect, it means the model is not biased by the imbalanced dataset. In other words, it will not always try to predict the outcome to be the grand majority, which is the completed loans (0) case.
Since Random forest was the best-selected model, we decided to explore the best in class for tree algorithms, also known as the Light GBM classifier developed by Microsoft. It is considered the latest industry trend and winner of multiple Kaggle competitions, so we were curious to see how well it would perform in the Prosper case. As presumed, all the model metrics were very high for the Light GBM classifier.
Feature Importance
The feature importance plots for most of the models were also generated in the classification stage.
Our model determined “Service fees paid”, “Propser fees paid” and “amount borrowed” as the most useful at predicting our target variable: “loan_status” which matches well with investing intuition.
Learning Curve
A learning curve was also plotted to assess the learning rate of the optimal random forest models when compared to a sample L2 logistic regression model.
When the training size is increased by intervals of 25 data points until a very large amount, the random forest model AUC increases, meaning that the model will continue to learn and give a much higher percentage of accurate return predictions relative to the logistic model.
Stage 2: Predicting Expected Return
A variety of models were tested using a different set of features. The R2 values of various regression models were calculated for the different return methods.
The Random forest regressor gave the highest score across all the different types of returns when compared to the rest of the models built.
Investment Strategies
By leveraging both the classification and regression models to predict whether a loan will default and the estimated return, an investment strategy can be formulated to maximize an investor’s average return based on the different methodologies used to calculate the return.
Four different investment strategies were tested:
- Random Strategy
- Default-based
- Simple return-based
- Default and return-based strategy.
In order to depict real-life scenarios, the M3 return approach seems to be a more accurate way to measure the % return. Thus, the baseline for us to compare the different investment strategies would be under the M3 method. Nonetheless, the returns seem to be very close to one and another regardless of which investment strategy is being used when looking at the M3 return approach.
The magnitude of the % return is also highly dependent on the reinvestment interest rate (M3).
Sensitivity Test of Portfolio Size
Next, a sensitivity test was run with regards to the portfolio size versus the investment return percentage
As the number of loans invested in by a potential investor increases, the percentage of investment return decreases. This is also very intuitive due to the risk factor of a loan getting defaulted increases as the number of loans invested in increases.
In other words, an investor may have a similar chance of getting a high return and then losing most of it due to a defaulted loan and this cycle can keep repeating causing the overall percentage of return to decrease as the number of loans invested in increases.
In this section, we implement three different optimization models to improve an investment strategy using Prosper.
The three different optimization methods are:
1) Directly maximize total profit
A binary variable is set-up for every loan in our data set. The number of loans constraint is added (as per the maximum number of loans in our dataset) and the objective function is defined (maximize total profit).
2) Maximize profit with budget constraint
The second optimization model considers the budget constraint of a potential investor. Like the first model, a new budget limiting constraint is added (testing different budgets), and the optimization problem is resolved.
3) Maximize profit with risk-return tradeoff
The third optimization strategy involves incorporating the portfolio risk factor by considering the variance of the returns.
To achieve this, first, a clustering model is trained with an adjustable k parameter by the investor. Second, the standard deviation is computed for each of the clusters. Third, each loan can be assigned to a specific cluster based on the Euclidean distance from each cluster. Fourth, the standard deviation of the return for each loan can be estimated using the standard deviation of the cluster.
The model is then built to maximize profit with the inclusion of the risk-return tradeoff constraint. A sensitivity/penalty factor can be set by the investor to account for the risk tolerance of the investor.
Results show that the optimal investment strategy for Prosper is maximizing profit using budget constraints with an expected return on 10.5%.
In conclusion, we saw how information from different attributes can be utilized to create new predictors that may end up boosting the statistical significance of our model as opposed to the original counterparts. It was important to note how a model’s performance over a single partition of the data into cross-validation folds can be misleading and running our model several times to develop different iterations of the train/test splits was essential. Another important part was calibration, and how it measured whether the probabilities produced by the model were correct.
It would be interesting to see how our model would perform if tied to macroeconomic external data that would represent the underlying economy’s performance at the time such as oil prices or The World Bank’s interest rates in order to further extend the pessimistic measure’s performance.
Additionally, it would be interesting to conduct sentiment analysis over the investor to understand what their risk appetite exactly is as opposed to asking for the numerical input as they may lack the understanding of how the P2P lending market operates.
Our Inspiration — Original Study
I am an experienced professional in the field of data-driven strategies and quantitative algorithms, with a focus on financial analysis and portfolio optimization. My expertise lies in understanding the intricacies of loan portfolios, utilizing data-centric approaches, and implementing predictive models to enhance investment strategies. I have a deep understanding of the concepts discussed in the article you provided.
Now, let's delve into the key concepts outlined in the article:
-
Evolution of Lending Landscape:
- Banks traditionally provided loans, but smaller businesses face challenges like increased operating costs and demand fluctuations.
- The rise of Peer-to-Peer (P2P) Lending Market addresses the gaps overlooked by traditional banks.
-
Data-Driven Investment Strategy for Prosper Marketplace:
- The goal is to create a predictive model for Prosper Marketplace, a P2P lending firm.
- The model aims to provide a holistic overview from processing raw data to inferring business outcomes.
-
Loan Classification Model:
- Investors need to decide on the amount to invest and which loans to choose.
- The article discusses the classification of loans into "Completed" and "Not Completed" categories (Charged Off, Current, Defaulted).
-
Dataset and Predictors:
- Publicly available dataset from Prosper includes comprehensive information on loans, with 22 variables covering loan attributes.
- Predictors include loan amount, interest & fees paid, loan duration, credit rating score, and loan status.
-
Analyzing Loan Terms and Interest Rates:
- Insights from analyzing loan terms, such as higher loan amounts for longer-term loans.
- Interest rate distribution analysis per loan status, with higher-grade loans having lower interest rates.
-
Data Leakage Identification:
- Addressing potential data leakage by identifying predictors highly correlated with the target variable.
-
Calculation of Investor’s Returns:
- Developing three return measures (Pessimistic, Optimistic, Fixed-horizon) to calculate expected returns.
- Factors considered include total payment, amount invested, nominal length, and actual length of the loan.
-
Prosper Rating (Grade) Breakdown:
- Exploring credit rating trends and their correlation with default rates and interest rates.
-
Predictive Modeling Stages:
- Two-stage approach involving binary classification for loan outcomes and regression for expected returns.
- Model optimization for accuracy using industry-standard algorithms.
-
Investment Strategies:
- Testing four investment strategies based on classification and regression models.
- Sensitivity tests for portfolio size versus investment return percentage.
-
Optimization Models:
- Three optimization methods to improve investment strategy, including profit maximization, budget constraints, and risk-return tradeoff.
-
Results and Conclusion:
- Optimal investment strategy involves maximizing profit with budget constraints, yielding an expected return of 10.5%.
The article showcases a comprehensive analysis of optimizing a loan portfolio using data-driven strategies, quantitative algorithms, and predictive modeling techniques. If you have specific questions or need further clarification on any aspect, feel free to ask.