TY - JOUR
T1 - Prediction of cyanobacteria blooms in the lower Han River (South Korea) using ensemble learning algorithms
AU - Shin, Jihoon
AU - Yoon, Seonghyeon
AU - Cha, Yoonkyung
N1 - Publisher Copyright:
© 2017 Desalination Publications. All rights reserved.
PY - 2017/7
Y1 - 2017/7
N2 - We developed a prediction model for cyanobacterial blooms in the lower Han River, South Korea, using decision tree algorithms. A decision tree is a type of machine learning method that can handle missing values and outliers. Despite its simplicity, it can accurately predict complex natural phenomena. To improve the robustness of the model, we used ensemble methods such as Bagging, AdaBoost, and Random Forest, and the performance of each method was compared against that of a single decision tree. The indicators of cyanobacterial blooms, namely chlorophyll-a concentration and cyanobacteria cell count, were classified into either the non-exceedance or the exceedance class according to administrative guidelines or criteria, and used as the response variables. Since the number of cyanobacteria cell count observations in the exceedance class was much smaller than that in the non-exceedance class, the synthetic minority over-sampling technique (SMOTE) was used to mitigate the imbalance between classes. The prediction abilities for chlorophyll-a and cyanobacteria were evaluated based on multiple indices, including the area under the curve (AUC). The results showed that the performance of the ensemble models improved by 1.7%–11.1% and 1.5%–4.9% compared with that of the single model for chlorophyll-a and cyanobacteria, respectively. Applying SMOTE to the imbalanced cyanobacteria cell count data enhanced AUC by 4.3%–6.7%. The variable importance analysis indicated that water temperature, flow, and month were essential factors for the prediction of the cyanobacteria classes.
AB - We developed a prediction model for cyanobacterial blooms in the lower Han River, South Korea, using decision tree algorithms. A decision tree is a type of machine learning method that can handle missing values and outliers. Despite its simplicity, it can accurately predict complex natural phenomena. To improve the robustness of the model, we used ensemble methods such as Bagging, AdaBoost, and Random Forest, and the performance of each method was compared against that of a single decision tree. The indicators of cyanobacterial blooms, namely chlorophyll-a concentration and cyanobacteria cell count, were classified into either the non-exceedance or the exceedance class according to administrative guidelines or criteria, and used as the response variables. Since the number of cyanobacteria cell count observations in the exceedance class was much smaller than that in the non-exceedance class, the synthetic minority over-sampling technique (SMOTE) was used to mitigate the imbalance between classes. The prediction abilities for chlorophyll-a and cyanobacteria were evaluated based on multiple indices, including the area under the curve (AUC). The results showed that the performance of the ensemble models improved by 1.7%–11.1% and 1.5%–4.9% compared with that of the single model for chlorophyll-a and cyanobacteria, respectively. Applying SMOTE to the imbalanced cyanobacteria cell count data enhanced AUC by 4.3%–6.7%. The variable importance analysis indicated that water temperature, flow, and month were essential factors for the prediction of the cyanobacteria classes.
KW - Classification tree
KW - Cyanobacteria bloom
KW - Data imbalance
KW - Ensemble
KW - Lower Han River
UR - http://www.scopus.com/inward/record.url?scp=85031278703&partnerID=8YFLogxK
U2 - 10.5004/dwt.2017.20986
DO - 10.5004/dwt.2017.20986
M3 - Article
AN - SCOPUS:85031278703
SN - 1944-3994
VL - 84
SP - 31
EP - 39
JO - Desalination and Water Treatment
JF - Desalination and Water Treatment
ER -