Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms

Jihoon Shin, Seonghyeon Yoon, Young Woo Kim, Taeho Kim, Byeong Geon Go, Yoon Kyung Cha

Research output: Contribution to journal, Article, peer-reviewed

38 Scopus citations

Abstract

This study explored how the degree of class imbalance affects the prediction of infrequently occurring events, specifically cyanobacteria blooms. Although class imbalance poses a major issue in binary classification schemes, few efforts have been made to relate model performance to real-life applications. The data utilized herein were collected from 2013 to 2019 at 13 sites within three major rivers in South Korea; a variety of physicochemical and hydrometeorological factors were obtained as input variables, and the occurrence of cyanobacteria blooms (indicated by a cell count ≥ 1000 cells/mL) was included as the response variable. The imbalance ratio (IR) for cyanobacteria blooms differed significantly by site, ranging widely from 0.93 to 9.32. The results suggested that class imbalance negatively affected model performance, with an increase in the IR significantly increasing the false negative (FN) rate. Applying resampling decreased the FN rate while simultaneously increasing the true positive (TP) rate, and these improvements tended to grow with increasing IR. Ensemble classifiers, which combine multiple single classifiers into an integrated classifier, could not successfully address the class imbalance problem on their own; however, in combination with resampling, they consistently outperformed single classifiers. Among the ensemble classifiers, AdaBoost yielded the most stable performance across a range of IRs, irrespective of whether resampling was applied. A variable importance analysis indicated that temperature was usually the primary factor influencing cyanobacteria blooms. These findings highlight the effectiveness of resampling for addressing class imbalance and provide useful guidelines for learning from imbalanced data, including the selection of classification algorithms and model evaluation metrics.
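The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not define the IR, so the sketch assumes it is the number of non-bloom samples divided by the number of bloom samples (an assumption consistent with reported values below 1, where blooms would be the majority class). The function names and the sample data are hypothetical; only the 1000 cells/mL threshold comes from the abstract.

```python
from typing import List, Sequence, Tuple

# Bloom threshold stated in the abstract (cells/mL).
BLOOM_THRESHOLD = 1000


def label_blooms(cell_counts: Sequence[float]) -> List[int]:
    """Label each sample 1 (bloom) if its cell count is >= the threshold, else 0."""
    return [1 if count >= BLOOM_THRESHOLD else 0 for count in cell_counts]


def imbalance_ratio(labels: Sequence[int]) -> float:
    """Assumed definition: non-bloom count divided by bloom count."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("IR is undefined when there are no bloom samples")
    return negatives / positives


def tp_fn_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (TP rate, FN rate), both computed over the actual bloom samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("rates are undefined when there are no bloom samples")
    return tp / (tp + fn), fn / (tp + fn)


# Hypothetical cell counts for one site (illustrative values, not study data).
cell_counts = [200, 1500, 50, 0, 3000, 800, 120, 999, 1000, 10]
y_true = label_blooms(cell_counts)

# Hypothetical classifier output that misses one of the three blooms.
y_pred = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

ir = imbalance_ratio(y_true)
tp_rate, fn_rate = tp_fn_rates(y_true, y_pred)
print(f"IR = {ir:.2f}, TP rate = {tp_rate:.2f}, FN rate = {fn_rate:.2f}")
```

Because the TP and FN rates share the same denominator (the actual bloom samples), they sum to 1, which is why a drop in the FN rate after resampling appears together with a rise in the TP rate.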

Original language: English
Article number: 101202
Journal: Ecological Informatics
Volume: 61
State: Published - Mar 2021

Keywords

  • Class imbalance
  • Cyanobacteria blooms
  • Ensemble classifier
  • Imbalance ratio
  • Resampling
  • SMOTE
