Balancing and variable reduction of firm bankruptcy data


  • David L. Olson, University of Nebraska-Lincoln
  • Bongsug Chae, Kansas State University



Financial stress experienced by any element of a supply chain puts stress on all of its members. Predictive data mining is a common tool for predicting bankruptcy. Bankruptcy data are often highly imbalanced, with bankrupt firms being by far the minority case, and include a large number of potential variables. This study uses data from four studies of firm bankruptcy to examine the impact of data balancing and variable selection on model accuracy. The models used are random forest and gradient boosting (both based on decision trees), logistic regression, neural networks, and support vector machines. Two machine learning methods, stepwise regression and entropy from decision trees, are used to generate reduced variable sets. For the entropy (decision tree) option, the complexity parameter was used to set the number of variables retained. The error metrics used were type I and type II error (sensitivity and specificity), overall average error (accuracy), and area under the curve (AuC). Extreme gradient boosting and random forest models had lower average error than support vector machines, which in turn had a slight advantage over logistic regression and neural networks. Variable reduction gave mixed results with respect to relative accuracy: overall accuracy increased with a slight reduction in the number of variables (using stepwise regression), but deteriorated as the variable set was reduced further. The balancing experiments found that unbalanced data had high error rates, which dropped a great deal with even 10 percent balancing, while balancing beyond 10 percent provided little additional accuracy.
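
As a rough illustration of the balancing step and the error metrics named in the abstract, the sketch below oversamples the minority (bankrupt) class at random until it reaches a target share of the data, then computes sensitivity, specificity, and accuracy from predicted labels. It is not the authors' procedure: the function names `oversample_to_share` and `error_rates` are hypothetical, random oversampling is only one possible balancing technique, and reading "10 percent balancing" as a 10 percent minority share is an assumption.

```python
import random


def oversample_to_share(labels, target_share, seed=0):
    """Return row indices in which class 1 (bankrupt) makes up roughly
    target_share of all rows, by duplicating minority rows at random.

    Assumption: labels are 0 (non-bankrupt) or 1 (bankrupt).
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority:
        raise ValueError("no minority rows to oversample")
    # Solve n_min / (n_min + n_maj) = target_share for n_min.
    needed = round(target_share * len(majority) / (1 - target_share))
    extra = max(0, needed - len(minority))
    duplicates = [rng.choice(minority) for _ in range(extra)]
    return majority + minority + duplicates


def error_rates(actual, predicted):
    """Sensitivity, specificity, and accuracy with class 1 as the positive class.

    Assumes both classes occur in `actual`, so no denominator is zero.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),  # share of bankrupt firms detected
        "specificity": tn / (tn + fp),  # share of healthy firms cleared
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy data: 2 bankrupt firms among 100, rebalanced to about a 10% share.
labels = [1] * 2 + [0] * 98
balanced_idx = oversample_to_share(labels, target_share=0.10)
balanced_labels = [labels[i] for i in balanced_idx]
minority_share = sum(balanced_labels) / len(balanced_labels)

# Toy predictions, only to show how the metrics are computed.
rates = error_rates([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

In practice the balancing would be applied to the training partition only, so that the error metrics are measured on data with the original class proportions.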