Balancing and variable reduction of firm bankruptcy data
Financial stress experienced by one element of a supply chain causes stress for all of its members. Predictive data mining is a common tool for predicting bankruptcy. Bankruptcy prediction often involves highly imbalanced datasets with a large number of potential variables, with bankrupt firms being by far the minority class. This study uses data from four studies of firm bankruptcy and examines the impact of data balancing and variable selection on model accuracy. The models used are random forest and gradient boosting (both based on decision trees), logistic regression, neural networks, and support vector machines. Two machine learning methods, stepwise regression and entropy from decision trees, are used to generate reduced variable sets. With the entropy (decision tree) option, the complexity parameter was used to set the number of variables retained. The impact of reducing variables is examined. The error metrics used were type I and type II error (sensitivity and specificity), overall average error (accuracy), and area under the recall curve (AUC). The average error of the extreme gradient boosting and random forest models was found to be better than that of support vector machines, which had a slight advantage over logistic regression and neural networks. Variable reduction led to mixed results with respect to relative accuracy: overall accuracy increased with a slight reduction in the number of variables (using stepwise regression), but deteriorated as the number of variables was reduced further. The balancing experiments found that unbalanced data had high error rates, which dropped a great deal with even 10 percent balancing, while balancing beyond 10 percent provided little additional accuracy.
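To make the balancing and error-metric ideas concrete, here is a minimal Python sketch. It is not the study's procedure. It assumes that "10 percent balancing" means resampling until bankrupt firms make up about 10 percent of the training rows, and it uses random undersampling of the majority class as one possible way to get there. The function names, the label coding (1 = bankrupt, 0 = non-bankrupt), and the toy data are all hypothetical.

```python
import random


def undersample_to_fraction(labels, target_fraction, seed=0):
    """Return sorted row indices to keep so that positives (label 1) make up
    about target_fraction of the kept rows, dropping negatives at random.
    Assumption: this is one simple balancing scheme, not the paper's."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives that leaves the positives at the target share.
    wanted = round(len(pos) * (1 - target_fraction) / target_fraction)
    n_neg_keep = min(len(neg), wanted)
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_neg_keep))


def error_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy data: 20 bankrupt firms among 1000 (2 percent minority class).
labels = [1] * 20 + [0] * 980

# A trivial model that never predicts bankruptcy looks accurate
# on unbalanced data while missing every bankrupt firm.
naive = error_metrics(labels, [0] * len(labels))

# Undersample the majority class so bankrupt firms are 10 percent of rows.
keep = undersample_to_fraction(labels, 0.10)
balanced_labels = [labels[i] for i in keep]
minority_share = sum(balanced_labels) / len(balanced_labels)
```

On the toy data, the never-bankrupt model has high accuracy but zero sensitivity, which is why accuracy alone can be misleading on unbalanced data and why the per-class error rates are reported alongside it.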
Copyright (c) 2022 Journal of Supply Chain Management Science
This work is licensed under a Creative Commons Attribution 4.0 International License.