Abstract
Harmful algal blooms (HABs) pose a potential risk to human and ecosystem health. HAB occurrences are influenced by numerous environmental factors; thus, accurate predictions of HABs and explanations of those predictions are required to implement preventive water quality management. In this study, machine learning (ML) algorithms, i.e., random forest (RF) and extreme gradient boosting (XGB), were employed to predict HABs in eight water supply reservoirs in South Korea. Using the synthetic minority oversampling technique (SMOTE) to address the imbalance in HAB occurrences improved the classification performance of the ML algorithms. Although RF and XGB differed only marginally in performance, XGB exhibited more stable performance in the presence of data imbalance. Furthermore, a post hoc explanation technique, Shapley additive explanations (SHAP), was employed to estimate relative feature importance. Among the input features, water temperature and the concentrations of total nitrogen and total phosphorus appeared important in predicting HAB occurrences. The results suggest that the use of ML algorithms along with explanation methods increases the usefulness of predictive models as a decision-making tool for water quality management.
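To illustrate the oversampling step mentioned in the abstract, the following is a minimal sketch of the general SMOTE idea in plain Python: each synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is not the authors' implementation; the function name `smote`, its parameters, and the toy values (water temperature and total phosphorus) are assumptions made purely for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the basic SMOTE idea; illustrative sketch only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data, NOT from the study:
# (water temperature in deg C, total phosphorus in mg/L)
no_bloom = [
    (12.0, 0.02), (15.5, 0.03), (18.0, 0.02), (20.0, 0.04),
    (14.0, 0.03), (16.5, 0.02), (19.0, 0.03),
]
bloom = [(26.0, 0.09), (27.5, 0.11), (25.0, 0.08)]

# Oversample the bloom class until both classes have the same size.
new_bloom = smote(bloom, n_new=len(no_bloom) - len(bloom), k=2)
balanced_bloom = bloom + new_bloom
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.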
Original language | English |
---|---|
Pages (from-to) | 304-318 |
Number of pages | 15 |
Journal | Water Quality Research Journal |
Volume | 57 |
Issue number | 4 |
State | Published - 1 Nov 2022 |
Keywords
- SHAP
- cyanobacteria bloom
- data imbalance
- feature importance
- machine learning
- water quality management