Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis

  • Seunghwan An
  • Gyeongdong Woo
  • Jaesung Lim
  • Chang Hyun Kim
  • Sungchul Hong
  • Jong June Jeon

Research output: Contribution to journal › Conference article › peer-review

Abstract

In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Since the MLu performance depends on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We introduce MaCoDE by redefining the consecutive multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our approach enables the estimation of conditional densities across arbitrary combinations of target and conditional variables. We bridge the theoretical gap between distributional learning and MLM by demonstrating that minimizing the orderless multi-class classification loss leads to minimizing the total variation distance between conditional distributions. To validate our proposed model, we evaluate its performance in synthetic data generation across 10 real-world datasets, demonstrating its ability to adjust data privacy levels easily without retraining. Additionally, since masked input tokens in MLM are analogous to missing data, we further assess its effectiveness in handling training datasets with missing values, including multiple imputations of the missing entries.
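The histogram-based view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of quantile bin edges, and the use of marginal bin frequencies as a stand-in for a trained model's predicted class probabilities are all assumptions made for illustration. The idea shown is that a continuous column is discretized into bins, each bin index is treated as a class label, and a vector of class probabilities over bins is read as a piecewise-constant density from which values can be sampled.

```python
import numpy as np


def quantile_bin_edges(x, n_bins):
    """Bin edges at empirical quantiles, so each bin holds roughly equal mass."""
    return np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))


def discretize(x, edges):
    """Map each value to a bin index in [0, n_bins - 1]; this is the class label."""
    idx = np.searchsorted(edges, x, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)


def histogram_density(probs, edges):
    """Turn per-bin class probabilities into piecewise-constant density heights."""
    return probs / np.diff(edges)


def sample_value(probs, edges, rng):
    """Draw a bin from the class probabilities, then a uniform point inside it."""
    k = rng.choice(len(probs), p=probs)
    return rng.uniform(edges[k], edges[k + 1])


# Hypothetical demo data: one continuous column.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
n_bins = 10

edges = quantile_bin_edges(x, n_bins)
labels = discretize(x, edges)

# Assumption: marginal bin frequencies stand in for the class probabilities
# that a trained masked-token classifier would output for a masked cell.
probs = np.bincount(labels, minlength=n_bins) / len(x)

density = histogram_density(probs, edges)
draw = sample_value(probs, edges, rng)
```

In a conditional setting, `probs` would instead depend on the unmasked columns of the same row, so the same sampling step would draw from an estimated conditional distribution rather than the marginal one.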

Original language: English
Pages (from-to): 15356-15364
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 15
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 to 4 Mar 2025
