TY - JOUR
T1 - Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms
AU - Shin, Jihoon
AU - Yoon, Seonghyeon
AU - Kim, Young Woo
AU - Kim, Taeho
AU - Go, Byeong Geon
AU - Cha, Yoon Kyung
N1 - Publisher Copyright: © 2020
PY - 2021/3
Y1 - 2021/3
N2 - This study aimed to explicitly explore the effects of the degree of class imbalance on predicting infrequently occurring events, i.e., cyanobacteria blooms. Although class imbalance poses a major issue in binary classification schemes, few efforts have been made to relate model performance to real-life applications. The data utilized herein were collected from 2013 to 2019 at 13 sites within three major rivers in South Korea; a variety of physicochemical and hydrometeorological factors were obtained as input variables, and the occurrence of cyanobacteria blooms (indicated by a cell count ≥ 1000 cells/mL) was included as a response variable. The imbalance ratio (IR) for cyanobacteria blooms differed significantly by site, ranging widely from 0.93 to 9.32. The study results suggested that class imbalance negatively affected model performance, with an increase in the IR significantly increasing the false negative (FN) rate. The application of resampling decreased the FN rate while simultaneously increasing the true positive (TP) rate, which yielded improvements that tended to increase with increasing IRs. Ensemble classifiers, which combine multiple single classifiers into an integrated classifier, alone could not successfully address the class imbalance problem; however, in combination with resampling, they consistently outperformed single classifiers. Among the ensemble classifiers, AdaBoost yielded the most stable performance across a range of IRs, irrespective of the resampling application. A variable importance analysis indicated that temperature was usually the primary influencing factor of cyanobacteria blooms. These findings highlight the effectiveness of resampling applications for addressing class imbalance, while providing useful guidelines for learning from imbalanced data, including the selection of classification algorithms and model evaluation metrics.
KW - Class imbalance
KW - Cyanobacteria blooms
KW - Ensemble classifier
KW - Imbalance ratio
KW - Resampling
KW - SMOTE
UR - http://www.scopus.com/inward/record.url?scp=85096659820&partnerID=8YFLogxK
U2 - 10.1016/j.ecoinf.2020.101202
DO - 10.1016/j.ecoinf.2020.101202
M3 - Article
AN - SCOPUS:85096659820
SN - 1574-9541
VL - 61
JO - Ecological Informatics
JF - Ecological Informatics
M1 - 101202
ER -