1. Introduction
Companies acquire funds not only from specialized financial intermediaries but also from their respective suppliers (). This practice is known as trade credit and frequently occurs in the B2B market when buyers delay payments to suppliers for merchandise and/or services. If credit is approved for a certain client, there is always the possibility that this client will not honor the agreement to repay the amount in question. On the other hand, if credit is denied, a potentially profitable client may be lost to rival companies. Therefore, both issues must be taken into consideration when deciding whether to extend credit to any applicant. Credit risk, in general, is a topic of the utmost importance in financial risk management, being a major source of concern for financial and banking institutions (). In recent decades, quantitative methods to manage credit risk have grown in sophistication. The end goal is to separate good credit applicants from bad ones. The criterion used in this classification is the ability of the applicants to repay the full amount of the loan. Usually, this is achieved by feeding a predictive model with past customer data, thus finding the relationships between the clients’ characteristics and the potential for default (Huang, Liu, & Ren, 2018). There is substantial research on this topic, as even a small improvement in prediction accuracy may lead to large gains in profitability (Kvamme, Sellereite, Aas, & Sjursen, 2018).
Until recently, the sole option for building these credit scoring models was to employ statistical models. Linear discriminant analysis and logistic regression are among the statistical techniques widely used for this purpose (Baesens, Setiono, Mues, & Vanthienen, 2003). However, the emergence of artificial intelligence methods has provided new opportunities for credit risk professionals. Numerous studies show that machine learning tools such as artificial neural networks, decision trees (DTs), and support vector machines can improve upon the prediction accuracy of statistical models with regard to credit risk (Vellido, Lisboa, & Vaughan, 1999; Huang, Chen, Hsu, Chen, & Wu, 2004; Ong, Huang, & Tzeng, 2005). Despite significant developments in terms of newer classifiers, the credit risk literature has not kept pace with the breakthroughs in predictive learning (Lessmann, Baesens, Seow, & Thomas, 2015; Jones, Johnstone, & Wilson, 2015). Indeed, more recent techniques, such as random forests and generalized boosting, have been explored by only a limited number of studies, although some sources report them as superior to previous methods ().
The main purpose of this work is to add to the existing body of research by further studying these new AI techniques, allowing for a better understanding of how these compare to older and more established methods of credit scoring with respect to performance and applicability. This research offers a comprehensive view of how diverse statistical and artificial intelligence predictors compare on credit scoring. More specifically, this study focuses on the discriminant analysis, logistic regression, artificial neural networks, and random forest (RF) methods. It significantly adds to the existing literature by assessing the robustness of these various techniques in the estimation of the default risk of a set of companies engaging in trade credit. This experiment uses a novel sample that has not been explored in the literature before.
2. Theoretical Framework
2.1. Linear Discriminant Analysis
The linear discriminant analysis (LDA) may be defined as a statistical technique utilized to classify an observation into one of several a priori groupings depending on the observation’s characteristics (). There are some limitations regarding the validity of this method. It is dependent on stringent assumptions, namely that all variables must present a normal distribution and be mutually independent (Šušteršič, Mramor, & Zupan, 2009).
Considering a certain feature vector $\mathbf{x}$ with $n$ dimensions, it is important to know which linear function of these values best separates the groups in question. This function corresponds to the following expression:

$$Z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n \qquad (1)$$

In this formula, $w_i$ and $x_i$ represent the discriminant coefficient for explanatory variable $i$ and the value of indicator $i$, respectively. In the LDA, the goal is to find the values for these coefficients that maximize the differences between the groups as measured by a given objective function. The original method proposed by Fisher in 1936 sought the coefficients that maximized the ratio of the explained variance to the unexplained variance. This corresponds to the F-ratio, which may be computed with the following expression:

$$F = \frac{\sum_{i=1}^{g} n_i\,(\bar{Z}_i - \bar{Z})^2}{\sum_{i=1}^{g}\sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_i)^2} \qquad (2)$$

This formulation considers a total of $g$ groups in a dataset, with $i$ and $j$ being the index for the groups and for the observations of group $i$, respectively. Additionally, $n_i$ represents the number of cases in each group, while $\bar{Z}_i$ is the mean for group $i$ and $\bar{Z}$ is the overall sample mean. Analyzing this expression, one can observe that its numerator corresponds to the between-groups sums of squares and its denominator to the within-groups sums of squares ().
Once the coefficients have been computed to maximize the discriminant power of the function, it is possible to calculate the score for each observation in the sample and assign it to a certain group accordingly.
The LDA technique was first applied to credit scoring by Edward Altman in 1968. This approach, designated Altman’s Z-score, served as the basis for future applications of discriminant analysis in credit scoring. Altman’s method implies assigning each instance to the group it resembles the most. The comparisons are measured by a chi-square value, and classifications are made based on the relative proximity of the instance’s score to the various group centroids (Altman, 1968).
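As a toy illustration of how a discriminant function of the type in expression (1) can be estimated and used to score applicants, the following Python sketch uses scikit-learn on hypothetical data; it is not the SPSS procedure employed later in this study.

```python
# Illustrative sketch only: fits an LDA classifier on hypothetical credit data.
# scikit-learn stands in here for convenience; the study itself relied on SPSS.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_good = rng.normal(loc=1.0, scale=1.0, size=(100, 4))   # hypothetical ratios of "good" firms
X_bad = rng.normal(loc=-1.0, scale=1.0, size=(100, 4))   # hypothetical ratios of "bad" firms
X = np.vstack([X_good, X_bad])
y = np.array([0] * 100 + [1] * 100)                      # 0 = good, 1 = bad

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print("discriminant coefficients:", lda.coef_)           # the w_i of expression (1)
print("predicted class for a new firm:", lda.predict([[0.2, -0.5, 1.1, 0.0]]))
```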
2.2. Logistic Regression
Logistic regression is one of the most widespread statistical tools for classification problems in general (). Much like the LDA, it is a technique used in problems with categorical dependent variables displaying linear relationships with the explanatory variables. Despite the similarities, it should be stressed that the logistic regression model does not assume the populations in classification problems to be normally distributed. Unlike the LDA, the logistic regression can deal with various distribution functions (; ), and is thus, arguably, a better option for credit scoring tasks.
Assuming the case of a binary logistic regression used to determine whether an event $Y$ will happen (e.g., company bankruptcy), then $p(\mathbf{x})$ may be defined as the probability of $Y$ occurring given the n-dimensional input vector $\mathbf{x}$. As there are only two possible outcomes, $1 - p(\mathbf{x})$ is equal to the probability of the event not happening. The linear form of the LR model may be obtained by applying the natural logarithm to the odds ratio, which is equivalent to the logit of $p(\mathbf{x})$. This leads to the following mathematical formulation:

$$\ln\!\left(\frac{p(\mathbf{x})}{1 - p(\mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \qquad (3)$$

A different formulation of the logistic regression is usually obtained by relating the probability of a given event, $Y$, happening, conditional on the vector of observed explanatory variables $\mathbf{x}$, to the vector of coefficients $\boldsymbol{\beta}$ (). This corresponds to expression (4), which may also be obtained by manipulating the former formula:

$$p(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} \qquad (4)$$

The output of this expression describes a sigmoid curve, taking values between zero and one. After the intercept $\beta_0$ and the vector of coefficients $\boldsymbol{\beta}$ are estimated, the model may be used as a predictor. The maximum likelihood method commonly used in statistics can be applied to estimate these parameters.
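As an illustration of the maximum likelihood estimation mentioned above, the following Python sketch fits a binary logistic regression with statsmodels on hypothetical data and recovers the coefficients of expression (3); it is not the SPSS implementation used in this study.

```python
# Illustrative sketch only: maximum likelihood fit of a binary logistic regression
# on hypothetical data. statsmodels stands in for the SPSS procedure used later.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                        # hypothetical financial ratios
logits = 0.5 - 1.2 * X[:, 0] + 0.8 * X[:, 2]         # hypothetical true relationship
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))   # 1 = default event

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)  # maximum likelihood estimation
print(model.params)                                  # beta_0 ... beta_n of expression (3)
print(model.predict(sm.add_constant(X))[:5])         # probabilities, as in expression (4)
```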
2.3. Artificial Neural Networks
Artificial neural networks started being studied as a possible credit risk predictor in the nineties () and, since then, have become a mainstream tool utilized by several financial institutions and other companies. The potential of this technique is confirmed by comparative studies either showing this tool outperforming discriminant analysis (Khemakhem & Boujelbène, 2015; Wójcicka-Wójtowicz & Piasecki, 2017) or suggesting the use of a hybrid model as the best alternative (Lee, Chiu, Lu, & Chen, 2002; Lai, Yu, Wang, & Zhou, 2006).
Neural networks are composed of several artificial neurons, which can be regarded as processing units. These elements are interconnected via synapses that convey values, with each one of these connections having an assigned weight. When a neuron performs a computation, the first step is to do a weighted sum of the inputs; afterwards, the result is used in the transfer function to calculate the neuron’s output. Sigmoid, linear, and step functions are common transfer functions (Angelini, di Tollo, & Roli, 2008).
All neural network build-ups require the partitioning of the input data into training, validation, and testing subsets, which have distinct purposes. The training subset is used in the learning stage of the models, while the validation subset assures that every change in the models’ parameters truly reduces the overall error. In the absence of validation, the models may overfit by modeling noise in the training data. Finally, the testing subset provides an independent way to assess the predictive ability of the models.
The first artificial neural network considered in this research is the multilayer perceptron (MLP), which is the most frequently used type of neural network in credit risk assessment (), having been tested in various studies. The backpropagation rule is a widely used technique to update the weights of these networks (). Backpropagation algorithms are supervised learning tools. These techniques begin by initializing the weights with small random values (). Subsequently, the gradient of the error with respect to the weights is computed, and the weights are modified in the direction that reduces the overall error of the network.
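For concreteness, the sketch below trains a small MLP with a backpropagation-based gradient descent optimizer on hypothetical data; scikit-learn is used here as a stand-in for the SPSS neural networks module, and all settings are illustrative.

```python
# Illustrative sketch: a small multilayer perceptron trained with stochastic
# gradient descent (weights updated via backpropagated gradients).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))                        # hypothetical financial ratios
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)   # hypothetical default flag

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",  # sigmoid transfer function
                    solver="sgd", learning_rate_init=0.01, max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```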
The other artificial neural network tested in this research is a radial basis function (RBF) neural network. The first layers of these models just carry the data directly to the ensuing layers. A fundamental aspect of these networks is that the hidden layers are entirely composed of neurons with radial basis transfer functions, such as Gaussian functions (). The outcome of a radial basis function depends on three parameters: the received input vector $\mathbf{x}$, the center of the respective neuron $\mathbf{c}$, and the spread $\sigma$. The training that RBF networks undergo allows for the determination of the appropriate number of hidden layers and also the best centers and widths for each hidden neuron (Chen, Wang, Liu, & Wu, 2018). These parameters will be the ones that allow for a minimization of the network’s overall error.
The estimation of the centers can be done via a clustering algorithm. The k-means clustering technique, for example, is one of the common and intuitive methods of this type. This algorithm considers a set of initial centers and then iteratively changes the centers to minimize the total within-cluster variance (Hastie, Tibshirani, & Friedman, 2008). First, all the input data points are attributed to the closest center, which effectively corresponds to dividing the data into separate subsets. Afterward, each center is recalculated to correspond to the vector of the means for the features of the data points composing the respective subset.
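The following sketch illustrates this idea: k-means estimates the hidden-unit centers, and Gaussian radial basis functions of the form $\phi(\mathbf{x}) = \exp(-\lVert \mathbf{x} - \mathbf{c} \rVert^2 / (2\sigma^2))$ produce the hidden-layer activations. The spread value and data are hypothetical, and this is not the SPSS RBF routine used later.

```python
# Illustrative sketch: estimating RBF centers with k-means and building the
# hidden-layer activations with Gaussian radial basis functions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))                 # hypothetical input features

k = 10                                        # hypothetical number of hidden units
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_             # centers c_j of the hidden neurons
sigma = 1.0                                   # hypothetical common spread (width)

# Gaussian RBF activations: phi_j(x) = exp(-||x - c_j||^2 / (2 * sigma^2))
dists_sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
hidden_activations = np.exp(-dists_sq / (2.0 * sigma ** 2))
print(hidden_activations.shape)               # (300, 10): one activation per hidden unit
```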
Despite the great promise of ANNs in general, there is a major disadvantage that should be noted. Neural networks work as black boxes, which basically means that it is very difficult to interpret the weights and how the results are achieved (). This may severely restrict the use of such techniques.
2.4. Random Forest
This research also tests the random forest, a much newer artificial intelligence technique. A random forest is a homogeneous ensemble predictor. Its predictions depend on the individual outputs of various decision trees. The aggregation of the many outputs into a single outcome may be done by averaging over all the output values when predicting a numerical outcome or by performing a vote when predicting a class (). There is evidence that this procedure of model combination can lead to increased accuracy (Paleologo, Elisseeff, & Antonini, 2010; Dumitrescu, Hué, Hurlin, & Tokpavi, 2022).
Assuming it is used for classification purposes, a random forest (RF) is analogous to a voting committee. Each decision tree reaches a prediction or classification, and then the results of all trees are checked to determine the majority output. Implicit in this logic is that the decision trees reach different results and consequently display distinct structures. A fundamental challenge when building an RF is therefore to ensure decision tree diversity. The diversification of decision trees is achieved via two mechanisms, bootstrap aggregating (bagging) and random feature selection.
Bootstrap aggregating is a procedure that allows each tree to use a different sample as input without partitioning the data. These replicate datasets, each consisting of a given number of cases, are drawn at random, but with replacement, from the original dataset ().
In contrast, the random feature selection mechanism dictates that each node is assigned a random subset of variables that it may use in the node-splitting procedure. This random selection of features at each node decreases the correlation between the decision trees, causing a reduction in the random forest error rate (Bryll, Gutierrez-Osuna, & Quek, 2003).
Random feature selection has been demonstrated to perform better than bagging alone (), namely in problems with several redundant features (). This strategy has also been proven to help prevent the overfitting phenomenon.
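A minimal sketch of these two mechanisms, using scikit-learn's RandomForestClassifier on hypothetical data (the study's own RF model, described later, was built with MATLAB's TreeBagger):

```python
# Illustrative sketch: a random forest combining bagging with random feature
# selection at each split. scikit-learn is used here as a stand-in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 12))                          # hypothetical predictors
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=50,          # number of bootstrapped decision trees
    max_features="sqrt",      # random subset of features considered at each split
    bootstrap=True,           # bagging: each tree sees a resampled training set
    oob_score=True,           # out-of-bag estimate of generalization accuracy
    random_state=0,
)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)
print("class vote probabilities for one firm:", rf.predict_proba(X[:1]))
```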
However, after the random forest is applied, its results are not easily interpretable, which is inconvenient when it is critical to understand the interactions between the variables of the problem ().
2.5. Comparison of Statistical and Artificial Intelligence Approaches in Credit Scoring
Although artificial intelligence and statistical methods are widely used in credit scoring, each approach has its own strengths and weaknesses. AI techniques can handle complex data structures and non-linear relationships among variables. The adoption of these methods becomes very attractive due to their high performance and ability to process large datasets (Baser, Koc, & Selcuk-Kestel, 2023). Some of these models also display an increased resilience to outliers (). AI-based credit scoring models can outperform traditional regression models in terms of accuracy and robustness.
However, these can be less transparent and more difficult to interpret, acting like black boxes, making it challenging to explain their decisions. When algorithms make decisions based on hidden patterns or factors that are difficult to understand or explain, it may raise concerns about fairness and potential bias in the models.
Statistical methods, on the other hand, have been widely used in credit scoring for many years. They are well-established, transparent, and easy to interpret, making it easier for lenders to explain their decisions to borrowers. However, statistical models are less flexible and may not be able to capture the full range of relationships among variables in credit data. Moreover, they are sensitive to outliers and multicollinearity, which can negatively affect their accuracy and robustness (Dawoud, Awwad, Eldin, & Abonazel, 2022).
3. Input Data Collection, Analysis and Treatment
3.1. Input Data Collection Process
The data used in the models were obtained from the Orbis financial database. Bureau van Dijk (BvD), a Moody’s Analytics Company, is responsible for the capture, treatment, and analytical structuring of the data present in this database. Access to the database is provided in exchange for a subscription fee, albeit there is a free trial version available online at BvD’s website.
The financial information used in this research was extracted from a list of Galp’s B2B clients and concerns the fiscal year of 2016. Additionally, the information regarding the clients’ financial status in the fiscal year of 2017 was retrieved from the internal data kept by Galp.
3.2. Description of the Input Variables
In order to obtain the most explanatory input variables, several financial and non-financial indicators were extracted from the Orbis database or computed from the exported information. This data includes raw financials, equity ratios, growth tendencies, operational ratios, the maturity of the companies, profitability ratios, sectors of activity, and structural ratios. The following table details the variables that were tested.
The final indicator included, the company status in 2017, corresponds to the dependent variable for all the models. For this variable, each company is assigned to one of the following mutually exclusive categories:
Active: The company remains in operation.
Insolvent: The company has filed for bankruptcy.
Undergoing a Special Revitalization Process (SPR): The company has been granted protection from creditors, preventing an imminent insolvency.
Non-compliant: The company has failed to pay for the products and/or services provided by Galp.
3.3. Aggregation of Company Outcomes
The company status variable poses a challenge, as it must be decided whether to aggregate the negative categories under a broader class of bad companies, merge just some of these, or keep all of them separate.
Although there are several possible groupings for the distinct strategies, a preliminary analysis is enough to understand that some seem counterintuitive. The discriminant analysis, as well as the artificial neural networks and other predictive models, offer similar predictions for close inputs; as such, it is detrimental to merge classes that are characterized by very dissimilar inputs. Therefore, one must take this factor into consideration when deciding on the best course of action regarding the aggregation of classes.
Both insolvent and SPR companies display similar and very poor financial indicators. Hence, this pair of classes is the most logical choice for merging. Non-compliant companies display better financial indicators than the other two negative categories, although these indicators remain deteriorated relative to those of active companies.
After experimenting with the aggregation strategies, it became evident that it is beneficial to keep only two possible outcomes. This is due to the similarity of the inputs obtained for the insolvent, SPR, and non-compliant classes. Furthermore, the main goal of any creditor is to understand whether there is a significant risk of default for any given potential debtor, and it is evident that the applicants included in these three classes present such a risk. Considering this, it was ultimately decided to pursue a two-outcome aggregation strategy, merging the insolvent, SPR, and non-compliant categories into a broader class of bad companies. The active companies remain in a separate class of good companies.
3.4. Sampling Procedure
Although the majority of credit scoring research has not focused on the characteristics of the input samples, the size and balance of such datasets have a tremendous potential to affect the performance of the predictive models. The latter characteristic refers to the proportion of the groups in the sample. Ideally, in a binary outcome scenario, half the instances would belong to one group and the remaining half to the other. Some methods are more sensitive than others to changes in the input data’s size and structure, but both statistical and AI techniques are affected by these features to varying degrees.
There are two options to manipulate the balance of a sample: under-sampling, by reducing the number of instances of the majority class, or over-sampling, by increasing the number of cases in the minority class. In this research, it was decided to under-sample the majority class, which encompasses the cases of good companies. Although, according to some authors (), over-sampling may produce better results, this dataset proved extremely unbalanced due to a pronounced scarcity of bad companies, making that technique difficult to employ. Considering that the minority class (which encompasses the cases of bad companies) is much smaller, over-sampling would cause certain cases in this class to be repeated several times. This repetition may lead the models to overfit, thus degrading the results.
After selecting a subset of instances from the good companies’ class, the nearly perfectly balanced dataset described in was obtained.
The slightly larger number of good companies relative to bad companies is due to a few detected cases of duplicated corporations in the data. This issue was solved by studying the cause of each repetition and assigning these cases to a single category.
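For illustration, a minimal pandas sketch of the under-sampling step described above, with hypothetical data and column names:

```python
# Illustrative sketch: random under-sampling of the majority (good) class so
# that both classes end up with a similar number of companies.
import pandas as pd

df = pd.DataFrame({
    "equity_ratio": [0.4, 0.1, 0.6, -0.2, 0.5, 0.3, 0.7, -0.1],
    "status":       ["good", "bad", "good", "bad", "good", "good", "good", "bad"],
})

bad = df[df["status"] == "bad"]
good = df[df["status"] == "good"].sample(n=len(bad), random_state=0)  # keep as many good as bad
balanced = pd.concat([good, bad]).sample(frac=1, random_state=0)      # shuffle the rows
print(balanced["status"].value_counts())
```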
3.5. Missing and Invalid Data
Another important aspect to be addressed relates to the presence of missing values in the dataset. The usual reasons for missing values in credit scoring problems are that those values were already missing in the source data or were outside the theoretically allowed range. The latter situation is quite common due to typos or transcription errors (). These lapses may also be due to computational errors. After analyzing the dataset, two main types of missing data were detected: NA and NS lapses. The first corresponds to data that is truly missing, NA being an acronym for not available in the database. NS, on the other hand, stands for not significant and is used when indicators expressed as percentages take values near zero. As NS cases do not truly represent missing data, these were replaced by null values in the sample. This approximation allows for the use of such instances.
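A small pandas sketch of this treatment, assuming a hypothetical column name, could look as follows:

```python
# Illustrative sketch: replacing "NS" (not significant) markers with zeros while
# keeping genuinely missing "NA" entries as missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"roa_pct": ["3.2", "NS", "NA", "1.1"]})
df["roa_pct"] = df["roa_pct"].replace({"NS": "0", "NA": np.nan}).astype(float)
print(df)   # NS became 0.0; NA remains missing
```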
3.6. Correlation Analysis
The multicollinearity problem refers to the existence of strong correlations between independent variables in a dataset. Many authors have stated before that the logistic model becomes unstable in the eventuality of a strong dependence among predictors, as it seems that no single variable is important when all the others are in the model (e.g., Aguilera, Escabias, & Valderrama, 2006). This weakness is shared with the LDA method.
Multicollinearity can cause slope parameter estimates to have magnitudes or signs that are not consistent with expectations and, in some situations, lead independent variables in a regression model not to demonstrate statistical significance, despite large individual predictor-outcome correlations and a large coefficient of determination, R² (Thompson, Kim, Aloe, & Becker, 2017).
A common technique used in the detection of multicollinearity involves the computation of the variance inflation factor (VIF). VIF values over 10 are usually considered indicative of multicollinearity. However, certain authors point out that this threshold is very lenient. Indeed, a VIF of 10 for a given independent variable implies that 90% of its variability is explained by the remaining indicators. Another typical threshold is a maximum VIF of 5 (). This more conservative approach was deemed adequate, as certain variables displayed VIF values nearing 10 and would not have been excluded under the former criterion.
The correlation analysis indicated clear signs of multicollinearity in the original data, with several VIF values exceeding the defined threshold. In order to solve this problem, variables were removed iteratively, starting with the most strongly correlated ones, until no VIF values were above 5. The final dataset displays no indications of multicollinearity.
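The following Python sketch reproduces the logic of this procedure on hypothetical data, using statsmodels to compute the VIFs and dropping the worst predictor until all values fall below 5; it is an illustration, not the exact routine used in the study.

```python
# Illustrative sketch: iterative removal of predictors with high VIF values.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["roa", "equity_ratio", "gearing", "liquidity"])
X["roa_copy"] = X["roa"] * 0.9 + rng.normal(scale=0.1, size=200)  # deliberately collinear column

def vifs(frame):
    return pd.Series(
        [variance_inflation_factor(frame.values, i) for i in range(frame.shape[1])],
        index=frame.columns,
    )

while vifs(X).max() > 5:
    worst = vifs(X).idxmax()          # predictor with the highest VIF
    X = X.drop(columns=worst)
print(vifs(X))                        # all remaining VIFs are below 5
```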
3.7. Outlier Analysis
Although there is no universally accepted definition, several authors refer to outliers as observations that appear to deviate markedly from the other members of the sample in which they occur (e.g., Hodge & Austin, 2004). When fitting a model to the data, outliers need to be identified and eliminated or, alternatively, examined closely if these cases are the focus of the analysis (Beliakov, Kelarev, & Yearwood, 2011). In credit scoring, these instances are of limited interest, but their potential to negatively affect the results of the models must be eliminated.
According to , the basis for multivariate outlier detection is the Mahalanobis distance (MD). This metric measures the distance of each instance in the data to a central point in multivariate space. The key feature of this measure is that it considers the correlations between variables, as well as the respective scales (Brereton & Lloyd, 2016). The Mahalanobis distances may be computed with the following expression:

$$MD(\mathbf{x}) = \sqrt{(\mathbf{x} - \bar{\mathbf{x}})^{\top} S^{-1} (\mathbf{x} - \bar{\mathbf{x}})}$$

In this formula, $\mathbf{x}$ is the vector for a given data instance, while $\bar{\mathbf{x}}$ is the arithmetic mean of the dataset and $S$ represents the sample covariance matrix.
However, outliers are known to distort the observed mean. A small cluster of outliers may impact the mean in such a way that these are no longer detected as aberrant instances. Additionally, the distortion brought on by the outliers may be so high that normal instances are wrongly labeled as outliers. These occurrences are commonly referred to as masking and swamping, respectively.
Some studies in various research fields have proposed alternative procedures for outlier detection that seek to minimize the masking and swamping effects. Some examples of these are techniques with the computation of Mahalanobis distances (MDs) with robust indicators, such as the method proposed by Leys, Klein, Dominicy & Ley (2018) with a minimum covariance determinant approach, or entirely distinct approaches using projection pursuit strategies.
To prevent the masking and swamping phenomena, it was decided to examine the presence of outliers by computing MDs with geometric medians (GMs). This indicator is one of the most common robust estimators of centrality in Euclidean spaces (Fletcher, Venkatasubramanian, & Joshi, 2008). To compute this parameter, the Weiszfeld algorithm is employed. This is an iterative procedure that, given appropriate initialization values, converges to the point with the lowest sum of Euclidean distances to all the sample instances.
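A compact Python sketch of this approach, assuming hypothetical data and tolerance settings (not the exact implementation used in the study):

```python
# Illustrative sketch: geometric median via the Weiszfeld algorithm, then
# Mahalanobis distances computed around that robust center.
import numpy as np

def geometric_median(X, tol=1e-7, max_iter=500):
    """Weiszfeld iterations: repeatedly re-weight points by inverse distance."""
    m = X.mean(axis=0)                       # initialization at the arithmetic mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - m, axis=1)
        d = np.where(d < tol, tol, d)        # avoid division by zero at data points
        w = 1.0 / d
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
X[:3] += 15                                   # a few artificial outliers

gm = geometric_median(X)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - gm
robust_md = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))
print("largest robust Mahalanobis distances:", np.sort(robust_md)[-5:])
```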
The computation of the GMs does not tolerate missing values. As such, it is necessary to replace these lapses with usable data. The techniques used for this purpose are called imputation procedures. After analyzing the sample’s pattern of missing data and assessing if monotonicity is present, it was decided to proceed with a fully conditional specification imputation method.
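As an illustration of a fully conditional specification style imputation, the sketch below uses scikit-learn's IterativeImputer, which approximates FCS/MICE; it is not necessarily the exact procedure applied in this study.

```python
# Illustrative sketch: FCS-style imputation, where each feature with missing
# values is modeled conditionally on the remaining features.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the API)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan         # introduce ~10% missing values

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)          # each feature imputed conditional on the others
print(np.isnan(X_imputed).sum())              # 0: no missing values remain
```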
The outlier analysis warrants the separation of the sample into two groups, containing exclusively good and bad companies, respectively. Since the whole sample contains two distinct populations with very different characteristics, this splitting is fundamental to ensure that the MDs are computed with the GMs of the class (good or bad) to which each instance belongs.
As normality tests showed that various indicators do not follow normal distributions, an exclusion criterion alternative to comparing the MDs with a quantile of the chi-squared distribution was adopted; without multivariate normality, there is no guarantee that the MDs follow this specific distribution. By building scatter plots of the sample ID numbers against the robust MDs, it is possible to detect potential outliers via visual inspection. These plots are displayed in and .
Some instances in these scatter plots stand out for being clearly anomalous. It was decided to label as potential outliers all the cases with robust Mahalanobis distances above 1000. These points are marked in red for easier identification. In order to comprehend to what extent these flagged instances are aberrant, there was an analysis of the indicators presented by these companies. This study reinforced the idea that such corporations display altered values for several indicators.
Considering that the results of the robust MDs analysis were confirmed for good and bad companies by the subsequent findings of extreme values for several indicators in the flagged cases, the decision was taken to label these nine instances as outliers and remove them from the sample. The outlier detection technique implemented in this section was partially based on the work of Semechko (2019). Further details are provided in the reference section.
4. Model Development and Performance
4.1. Key Performance Indicators Used
The models in the following sub-sections are evaluated in terms of several key performance indicators. These include the percentage of correctly classified (PCC) instances, which measures the accuracy of the techniques. The sensitivity and specificity are also presented, which measure the imperviousness of the models to type I and type II errors, respectively. Assuming a null hypothesis that the company applying for credit will not default next year, the sensitivity is equal to the true positive rate, and the specificity is equal to the true negative rate.
The area under the curve (AUC) is also computed for all models. The AUC corresponds to the area under the receiver operating characteristic (ROC) curve. The ROC curve is obtained by plotting, for each classification threshold, the rate of true positives against the rate of false positives (Swets, Dawes, & Monahan, 2000). Finally, the Gini Index is included. This coefficient is a chance-standardized alternative to the AUC (equal to 2 × AUC − 1) that measures how well the models separate the existing groups. Greater values for the AUC and Gini Index are desirable, as these indicate a higher discriminatory ability. It should be noted that, in cases of conflicting performance ranks, these last two measures are prioritized in this work.
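For reference, the KPIs described above can be computed as in the following Python sketch, with hypothetical labels, predictions, and scores:

```python
# Illustrative sketch: computing PCC, sensitivity, specificity, AUC, and Gini
# from hypothetical predictions and scores.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])           # 1 = bad company
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 1])           # class predicted at a 0.5 cutoff
y_score = np.array([0.1, 0.3, 0.8, 0.4, 0.2, 0.9, 0.6, 0.7])  # predicted probability of "bad"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
pcc = accuracy_score(y_true, y_pred)                   # percentage of correctly classified
sensitivity = tp / (tp + fn)                           # true positive rate
specificity = tn / (tn + fp)                           # true negative rate
auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1                                     # chance-standardized version of the AUC
print(pcc, sensitivity, specificity, auc, gini)
```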
4.2. Development of the Linear Discriminant Analysis Model
The discriminant analysis model was applied to the data with IBM SPSS Statistics 25. Considering the capabilities of the software, alternative discriminant analysis models were computed using different combinations of stepwise techniques and entry/removal criteria.
After experimenting with various selection rules, it was found that the best results were obtained by including in the model any explanatory variables with a minimum F value of 3.00 and excluding those with F values inferior to 1.00.
Following the computation of the discriminant coefficients, it was possible to assess the relative importance of the independent variables included in the model. The standardized coefficients are particularly important to assess the discriminating ability of the explanatory variables, as the standardization allows for the comparison of variables expressed in distinct scales. The five variables with the most predictive potential were found to be the shareholder equity ratio, Cash flow / Total assets, return on assets using net income, credit period, and the major sector of activity, in descending order of discriminating ability.
The key performance indicators were then computed for the best discriminant analysis model obtained. These are listed in .
4.3. Development of the Logistic Regression Model
The logistic regression model was also applied with IBM SPSS Statistics 25. There is no need to use a multinomial logistic regression, as the considered output is dichotomous. Therefore, a binary logistic regression model was implemented.
The first step in the development of this model is choosing the input variable selection procedure. There are a variety of stepwise techniques available in this software, namely forward selection and backward elimination procedures.
After careful experimentation, the best results were obtained using the forward selection stepwise technique. The maximum number of iterations before model termination was kept at 20, the default setting; overriding this configuration did not improve the results. In terms of the thresholds used in the stepwise methods, the best results were obtained when the probability of the score statistic was required to be less than 0.01 for entry, with removal at probabilities over 0.03. The option to include a constant in the LR model remained selected ().
Furthermore, the user interface allows for the definition of the classification cutoff directly, which was kept at 0.5. Although it is relevant to study the model’s performance under different thresholds, this will be addressed with the computation of the remaining KPIs, namely the AUC. presents all these performance metrics, which are relative to the most robust logistic regression model achieved.
4.4. Development of the Multilayer Perceptron Model
The multilayer perceptron model was applied in the neural networks’ module of IBM SPSS Statistics 25, which offers various options regarding how the ANNs are structured and the methods through which the results are computed.
First, the partitioning of the data may be set. This involves specifying the fractions of the sample that are allocated to the training, validation, and testing datasets. Secondly, the structure of the MLP network may be stipulated in terms of the number of hidden layers, the activation function to be used in these layers, and the transfer function of the output layer.
Finally, there are different options for the learning algorithm to be employed in the networks’ development. Considering these possibilities, four different MLP neural networks are proposed, which are detailed in the following table.
Regarding the partitioning of the data, several combinations were selected in accordance with best practices in the literature. The first training-testing-validation ratio, 700:300:0, is the most popular partition, used by numerous authors (e.g., Pacelli & Azzollini, 2010), and is also the default setting in SPSS. The second option, 600:150:250, is used by Lai, Yu, Wang, and Zhou (2006). Lastly, the third partitioning, 600:200:200, which varies only slightly from the second alternative, is based on the work of .
For the comparison between methods to be fair, one must be careful when setting the partitioning strategy in SPSS. The percentage of cases attributed to each set may be defined directly in the software’s user interface for a given network. However, this introduces the potential for chance to influence the results. As the cases are randomly sampled from the dataset to build the training, testing, and validation sets, the results obtained will be strongly influenced by this arbitrary selection. Without guaranteeing the replicability of the partition, the comparison between the different architectures cannot yield meaningful results.
This issue essentially arises because some companies are more difficult to classify than others. Not all instances present overwhelmingly positive or negative indicators. These cases are the ones that contribute the most to the errors committed by the models. If a given partition randomly samples more of these instances than the others, the models using it would tend to display poorer results, although this partitioning strategy is not necessarily inferior to the others. The same reasoning applies to comparisons between different models that use the same partitioning strategy. A given model may perform better solely because it was evaluated with a test set containing a higher percentage of instances that are easier to sort.
In order to overcome this flaw, a strategy was employed that mitigates the potential for the models’ results to be influenced by chance. Firstly, three partitioning variables were defined beforehand. These variables contain values that determine the placement of each instance (i.e., whether it is used for training, testing, or validation). The variables’ values are generated in accordance with the partitioning strategy desired and used for the testing of all the MLP models for that given strategy. This essentially assures that the networks are comparable if the results were obtained for the same partitioning option, as all of these models were developed with similar initial conditions.
However, when comparing networks that used different partitioning strategies, which correspond to different auxiliary partitioning variables, there is still the potential for chance to affect the analysis. Therefore, it was deemed necessary to do multiple runs of the algorithm that generates these variables and then compute the average values for the KPIs.
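The sketch below illustrates the logic of this strategy in Python: a partition variable is generated in advance with a fixed seed so that every network sees the same split, and a stand-in KPI is averaged over repeated partition draws. The model evaluation is a hypothetical placeholder for the SPSS networks.

```python
# Illustrative sketch: pre-generated partition variables and KPI averaging over
# repeated partition draws. Proportions follow the 60/15/25 option from the text.
import numpy as np

def make_partition(n, seed, proportions=(0.60, 0.15, 0.25)):
    """Return an array of 'train'/'test'/'valid' labels, one per instance."""
    rng = np.random.default_rng(seed)
    n_train = int(proportions[0] * n)
    n_test = int(proportions[1] * n)
    n_valid = n - n_train - n_test            # remainder goes to validation
    labels = np.array(["train"] * n_train + ["test"] * n_test + ["valid"] * n_valid)
    rng.shuffle(labels)
    return labels

def run_model(partition, seed):
    """Hypothetical stand-in: a real run would train a network on the 'train' rows
    of this partition and return the KPI measured on the 'test' rows."""
    rng = np.random.default_rng(seed)
    return 0.90 + rng.normal(scale=0.02)      # stand-in AUC value

aucs = [run_model(make_partition(n=400, seed=s), seed=s) for s in range(10)]
print("average AUC over 10 partition draws:", round(float(np.mean(aucs)), 4))
```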
By averaging out all the performance metrics across the iterations (according to the MLP model and partitioning option considered in each iteration), it was possible to compute the results that are presented in .
Analyzing the values of the KPIs displayed in this table (), which are all relative to the testing set, it can be understood how each MLP network performs for all the partitioning strategies considered. After comparing the models, it was considered that the most robust network is MLP 4 trained with a 600:150:250 training-testing-validation ratio. This artificial neural network displays the best value for the AUC, as well as the greatest Gini Index.
Finally, a sensitivity analysis is performed that computes importance estimates for each independent variable in the model. These results imply that the most important indicator is the shareholder equity ratio, followed by the Cash flow / Total assets. The variations of the cash flow and equity are considered the third and fourth most relevant variables, respectively.
4.5. Development of the Radial Basis Function Neural Network Model
The RBF neural network model was applied in the neural networks’ module of SPSS Statistics 25. In the same way as the MLP models, there is the option to define the percentages that are assigned to the training, validation, and testing sets. Additionally, there are two alternatives for the activation function used in the hidden layers, which are ordinary and normalized radial basis functions. The remaining customizable settings are the number of elements in the hidden layers and the overlap among hidden units. The overlapping factor is a multiplier applied to the width of the radial basis functions.
As SPSS offers algorithms that define the optimal number of units in the hidden layers and the best values for the overlapping factors, these features were not set manually. Thus, the software automatically defined the most advantageous architecture regarding these characteristics. Considering that there is no mechanism in place to select the transfer function in the hidden layers that achieves the best results, two alternative RBF networks are studied that differ solely in this aspect.
The partitioning schemes defined in the previous section were also considered for the development of the RBF models. Similarly to what was done before, in order to mitigate the variability in the results caused by the random sampling procedure used to build the various sets in SPSS, two partitioning variables were computed and used iteratively to build the networks and collect the KPIs. By averaging out all the performance metrics across the iterations, it was possible to obtain the results presented in .
It may be concluded from the results displayed in that RBF 2 under the second partitioning option (60% Training, 15% Testing, and 25% Validation) outperforms the remaining alternatives.
4.6. Development of the Random Forest Model
The random forest method was applied in MATLAB R2018b. This model can be obtained by using the TreeBagger function available in the software, which builds an ensemble of bootstrapped decision trees for either classification or regression purposes. This function also selects a random subset of predictors to use at each decision split, as described in the original random forest algorithm ().
In terms of the settings used, the model is set for classification purposes, as the outcome considered is categorical. The surrogate splits option is activated to handle cases of missing data. If the value for the best split is missing, this technique assesses to what extent alternate splits resemble the best split. Afterward, the most similar split is used instead of the original optimal division.
Additionally, optional arguments are included in the function to allow for the assessment of the variables’ explanatory power and the computation of the predicted class probabilities. The probabilities are especially important, as these are used in the latter plotting of the ROC curve and subsequent computation of the AUC.
The TreeBagger function offers two possibilities for the algorithm that selects the best split at each node, a curvature test (CT) and an interaction-curvature test (ICT). In order to understand which of these algorithms would provide the best results, two distinct random forest models were applied, differing in the splitting techniques. The relevant KPIs obtained for both models are displayed in .
The random forest using the curvature tests provided the best predictions in terms of AUC, Gini Index, and sensitivity. Although the percentage of correctly classified cases is slightly inferior to the one presented by the model trained with the interaction-curvature tests, a higher AUC is prioritized.

A critical parameter that must also be defined is the number of decision trees contained in the ensembles. The results displayed so far were obtained with models composed of 50 decision trees, which is a common setting for random forest models. However, it must be analyzed whether there are gains to be had by adding more trees or whether, on the other hand, there is an excess of DTs that does not translate into a reduction of the prediction error and increases the computation time unnecessarily. To do this, the out-of-bag prediction error is plotted for a variable number of decision trees in the graph presented in .
Analyzing , one can observe that when the total number of grown trees is small, there is a rapid decrease in the out-of-bag prediction error with additional DTs in the ensemble. However, these gains in accuracy are progressively smaller, which causes the out-of-bag prediction error to stabilize around an ensemble of 50 trees. The error rate observed for an RF containing 50 decision trees is 0.1319, whereas an ensemble of 60 DTs displays a rate of 0.1314. As this reduction is hardly significant, it was opted to keep the number of decision trees at 50.
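As a rough illustration of this analysis, the following sketch tracks the out-of-bag error while trees are added to a scikit-learn random forest; it approximates, but is not, the MATLAB TreeBagger analysis performed in the study, and the data are hypothetical.

```python
# Illustrative sketch: out-of-bag prediction error as a function of the number of
# trees, using warm_start to grow the ensemble incrementally.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

rf = RandomForestClassifier(warm_start=True, oob_score=True, bootstrap=True, random_state=0)
oob_errors = []
for n_trees in range(20, 101, 20):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)                              # adds trees to the existing ensemble
    oob_errors.append((n_trees, 1 - rf.oob_score_))
print(oob_errors)                             # the error typically flattens after a few dozen trees
```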
Analyzing the estimates of the predictors’ importance, it is relevant to point out that the variable with the most explanatory power is the shareholder equity ratio, which displays a remarkable score in comparison with the other indicators. The credit period and the Cash flow / Total assets indicators display the second and third highest importance estimates, respectively. Certain measures, namely the profit per employee and gearing, are also important to the robustness of the model.
5. Benchmarking the Models
By compiling the results obtained so far in terms of the relevant KPIs, it is now possible to compare the credit scoring approaches. For each category of predictive methods, the best model in the developmental stage was considered for benchmarking purposes. exhibits the values for the performance metrics, as well as a ranking based on the AUC and Gini Index displayed.
Analyzing , it can be observed that the random forest model is ranked as the best credit scoring model, displaying the highest AUC and Gini Index while also presenting a remarkable overall accuracy. Over 95% of all instances are assigned correct predictions, with 98.59% of all future good companies being classified as such. In second place, the MLP neural network displayed impressive KPIs, although not up to par with the ones obtained with the random forest. On the other hand, the RBF neural network was the overall worst AI model considered, being even outranked by the logistic regression model.
Regarding the statistical methods, the results fall in line with what was observed in other benchmarking studies. The discriminant analysis proved to be the least predictive of all the credit scoring methods tested, which may be a result of the violation of this model’s assumptions of normality and mutual independence of the explanatory variables. The logistic regression is ranked as the third best predictor, behind the MLP neural network and the random forest. This model provides accurate predictions in almost 90% of the cases and demonstrates good sensitivity and specificity, which translate into low rates of type I and type II errors. Despite this, the LR model fell short on the more robust KPIs, namely the AUC and Gini Index, which caused it to be ranked behind some of the AI models.
Considering these results, it can be concluded that the MLP neural network and the random forest outperformed the statistical approaches in the credit scoring experiment. However, the logistic regression proved to be a robust predictor, displaying a high level of accuracy and presenting values for other performance measures that come close to the results of the AI alternatives. This is coherent with the recent rise in popularity of the LR method, which is a solid compromise in terms of prediction performance and ease of implementation. Furthermore, the logistic regression also permits an intuitive interpretation of the model’s parameters, overcoming the black-box syndrome of AI predictors.
6. Conclusions and Further Work
6.1. Further Work
Regarding the pre-processing of the input dataset, several measures were taken to assure the quality of the data, which necessarily impacts the performance of the predictive models. However, posterior studies may adopt distinct methodological approaches to address some limitations of the current research. Specifically, the detection of the multivariate outliers could be improved in terms of the rule utilized in the labeling of these instances.
As the input data in the sample failed the normality tests, it was not possible to proceed with the typical criterion of labeling as outliers any observations with robust Mahalanobis distances beyond a given quantile of the chi-squared distribution. The detection of the multivariate outliers thus relied upon the visual examination of the scatter plots of the robust MDs for each observation in the dataset. Consequently, the labeling process lacks objectivity. Therefore, it would be beneficial to develop a more sophisticated outlier labeling rule that is applicable to multivariate non-normal data.
Further research could also attempt to mitigate the detrimental effects of the missing values in the dataset. Some of the predictor methods applied in this study simply discard such cases, which reduces the size of the sample utilized. In order to deal with this situation in the context of the computation of the robust MDs, a fully conditional specification imputation procedure was put in place. However, the imputed dataset could not be used in the development of some of the models, which limited the applicability of this sample to the pre-processing stage of this project. Thus, additional studies could attempt to employ multiple imputation procedures that are compatible with the implementation of the credit scoring methods.
6.2. Conclusions
Credit risk remains one of the biggest risks for financial institutions and corporations alike. The methods utilized in this field have increased in sophistication considerably throughout the years. Nevertheless, any improvements in the accuracy rates of the current models are extremely important, as even small advances can mean significant savings for creditors by preventing defaults. This served as the main motivation for the current study.
This research allowed for the comparison of statistical and AI predictors, adding significantly to the academic literature by designing a credit scoring experiment that compares various distinct types of models using a novel dataset with financial and other relevant data for a selection of Portuguese companies. Credit scoring methods were successfully implemented based on this information and used to distinguish between good and bad applicants in the timespan of a year.
As the predictive methods employed may be susceptible to multicollinearity and the presence of outlier instances, there was a thorough pre-processing of the dataset prior to the implementation of the models. By experimenting with different settings and architectures, it was then possible to select the most robust models for each category of predictors. This allowed for the comparison of the KPIs computed for statistical and AI alternatives.
The benchmarking study completed found that the artificial intelligence methods clearly outperformed the more conventional statistical approaches, which was visible across the various performance metrics considered. The random forest model demonstrated the most potential, followed by the MLP neural network. Both these models achieved an AUC above 0.95, exhibiting an outstanding discriminatory power in the credit scoring exercise conducted. The RBF neural network and the logistic regression were the fourth and third most robust models, respectively. Finally, the discriminant analysis was the worst performing model overall.
Regarding the statistical approaches, the results are coherent with the findings of previously published benchmarking research articles. The discriminant analysis is dependent on strict assumptions of normality and mutual independence of the input variables, which was a contributing factor to its disuse among credit risk professionals and may explain the poor performance obtained in this experiment. The logistic regression proved to be a robust predictor, displaying a high level of accuracy and presenting values for other performance measures that come close to the results of the best AI alternatives. It even outperformed the RBF neural network. This is consistent with the recent rise in popularity of the LR method, which has been demonstrated to offer a solid combination of prediction performance and ease of implementation.
The random forest models, along with the MLP artificial neural networks, display tremendous potential in the credit scoring field. In contrast with the statistical techniques, these methods can model hidden non-linear relationships between the explanatory variables and the dependent variable, being also more robust to multicollinearity and the presence of outliers. Besides these advantages, these methods do not make assumptions regarding the probability distributions of the input data. These factors may have contributed to the observed superiority of the AI approaches. The major drawback of these alternatives continues to be the black-box syndrome, which makes the interpretation of the results almost impossible. This may restrict the use of such models in certain settings due to regulatory requirements.