TY - JOUR
T1 - Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
AU - An, Seunghwan
AU - Woo, Gyeongdong
AU - Lim, Jaesung
AU - Kim, Chang Hyun
AU - Hong, Sungchul
AU - Jeon, Jong June
N1 - Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
N2 - In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Since the MLu performance depends on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We introduce MaCoDE by redefining the consecutive multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our approach enables the estimation of conditional densities across arbitrary combinations of target and conditional variables. We bridge the theoretical gap between distributional learning and MLM by demonstrating that minimizing the orderless multi-class classification loss leads to minimizing the total variation distance between conditional distributions. To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without retraining. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries.
AB - In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Since the MLu performance depends on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We introduce MaCoDE by redefining the consecutive multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our approach enables the estimation of conditional densities across arbitrary combinations of target and conditional variables. We bridge the theoretical gap between distributional learning and MLM by demonstrating that minimizing the orderless multi-class classification loss leads to minimizing the total variation distance between conditional distributions. To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without retraining. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries.
UR - https://www.scopus.com/pages/publications/105003996973
U2 - 10.1609/aaai.v39i15.33685
DO - 10.1609/aaai.v39i15.33685
M3 - Conference article
AN - SCOPUS:105003996973
SN - 2159-5399
VL - 39
SP - 15356
EP - 15364
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 15
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Y2 - 25 February 2025 through 4 March 2025
ER -