TY - JOUR
T1 - Effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints
AU - Bae, Su Yong
AU - Lee, Jonga
AU - Jeong, Jaeseong
AU - Lim, Changwon
AU - Choi, Jinhee
N1 - Publisher Copyright: © 2021
PY - 2021/11
Y1 - 2021/11
AB - Machine learning and deep learning approaches have been increasingly used in the field of toxicology through prediction models developed using various toxicity data. However, toxicity data are often class-imbalanced, which hinders the development of machine learning models with good performance. Therefore, in this study, we identified effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints. Data-balancing methods, such as random undersampling (RUS), sample weight (SW), synthetic minority oversampling technique (SMOTE), and random oversampling (ROS) were applied to the datasets. Model performance was evaluated using the F1 score on five machine learning algorithms: gradient boosting tree (GBT), random forest (RF), support vector machine (SVM), multi-layer perceptron (MLP) network, and k-nearest neighbors (kNN) in combination with five molecular fingerprints (Morgan, MACCS, RDKit, Pattern, and Layered). The performance was evaluated for each combination of molecular fingerprints, machine learning algorithms, and data-balancing methods. The MACCS-GBT-SMOTE combination model achieved the best F1 score, followed by RDKit-GBT-SW. Thus, this study demonstrated that data balancing conducted using oversampling methods improved the performance of models. The systematic approach used in this study can also be applied to other toxicity datasets, which may facilitate the development of an improved classification model for toxicity screening.
KW - Class imbalance
KW - Data balancing
KW - Genotoxicity
KW - Machine learning
KW - Toxicity prediction
UR - http://www.scopus.com/inward/record.url?scp=85111565061&partnerID=8YFLogxK
U2 - 10.1016/j.comtox.2021.100178
DO - 10.1016/j.comtox.2021.100178
M3 - Article
AN - SCOPUS:85111565061
SN - 2468-1113
VL - 20
JO - Computational Toxicology
JF - Computational Toxicology
M1 - 100178
ER -