TY - JOUR
T1 - Automatic Seed Word Selection for Topic Modeling
AU - Jeong, Dahyun
AU - Hwang, Jeongin
AU - Choi, Yunjin
AU - Kim, Yoon Yeong
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2025
Y1 - 2025
N2 - Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in multiple contexts, making topic assignment difficult. Seed-guided topic modeling addresses these issues by incorporating prior knowledge through "seed words". Existing approaches, however, primarily rely on supervised selection using label-dependent metrics or manual selection. Both are limited by scalability and susceptible to human bias, particularly when dealing with unstructured real-world data. As a result, the selection of seed words in unsupervised settings remains underexplored. To address these challenges, we propose an automated seed word selection process that identifies diverse and cohesive word sets based on inter-word relationships. We instantiate this process with SeedCapture, an algorithm that utilizes co-occurrence to capture meaningful word associations. Unlike prior methods, SeedCapture operates in a fully unsupervised manner, requiring no predefined labels or human intervention. SeedCapture requires minimal parameter tuning and is highly adaptable, enabling seamless integration into existing seed-guided topic models. Through extensive quantitative and qualitative evaluations across multiple datasets and topic models, we demonstrate that SeedCapture achieves results comparable to those obtained through supervised seed word selection.
AB - Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in multiple contexts, making topic assignment difficult. Seed-guided topic modeling addresses these issues by incorporating prior knowledge through "seed words". Existing approaches, however, primarily rely on supervised selection using label-dependent metrics or manual selection. Both are limited by scalability and susceptible to human bias, particularly when dealing with unstructured real-world data. As a result, the selection of seed words in unsupervised settings remains underexplored. To address these challenges, we propose an automated seed word selection process that identifies diverse and cohesive word sets based on inter-word relationships. We instantiate this process with SeedCapture, an algorithm that utilizes co-occurrence to capture meaningful word associations. Unlike prior methods, SeedCapture operates in a fully unsupervised manner, requiring no predefined labels or human intervention. SeedCapture requires minimal parameter tuning and is highly adaptable, enabling seamless integration into existing seed-guided topic models. Through extensive quantitative and qualitative evaluations across multiple datasets and topic models, we demonstrate that SeedCapture achieves results comparable to those obtained through supervised seed word selection.
KW - Seed-guided topic modeling
KW - automatic seed word selection
KW - seed words
UR - https://www.scopus.com/pages/publications/85217907846
U2 - 10.1109/ACCESS.2025.3540410
DO - 10.1109/ACCESS.2025.3540410
M3 - Article
AN - SCOPUS:85217907846
SN - 2169-3536
VL - 13
SP - 31269
EP - 31285
JO - IEEE Access
JF - IEEE Access
ER -