Automatic Seed Word Selection for Topic Modeling

Research output: Contribution to journal › Article › peer-review

Abstract

Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in multiple contexts, making topic assignment difficult. Seed-guided topic modeling addresses these issues by incorporating prior knowledge through "seed words". Existing approaches, however, primarily rely on supervised selection using label-dependent metrics or manual selection. Both are limited by scalability and susceptible to human bias, particularly when dealing with unstructured real-world data. As a result, the selection of seed words in unsupervised settings remains underexplored. To address these challenges, we propose an automated seed word selection process that identifies diverse and cohesive word sets based on inter-word relationships. We instantiate this process with SeedCapture, an algorithm that utilizes co-occurrence to capture meaningful word associations. Unlike prior methods, SeedCapture operates in a fully unsupervised manner, requiring no predefined labels or human intervention. SeedCapture requires minimal parameter tuning and is highly adaptable, enabling seamless integration into existing seed-guided topic models. Through extensive quantitative and qualitative evaluations across multiple datasets and topic models, we demonstrate that SeedCapture achieves results comparable to those obtained through supervised seed word selection.
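The abstract does not spell out how SeedCapture works, so the following is only a minimal sketch of the general idea it names: using co-occurrence to pick seed word sets that are diverse across topics and cohesive within each topic. It is not the published algorithm. The function name `cooccurrence_seed_sets`, its parameters, the use of normalized PMI as the association score, the greedy selection strategy, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_seed_sets(docs, n_topics=2, seeds_per_topic=3, min_df=2):
    """Illustrative sketch only; NOT the published SeedCapture algorithm.

    Assumption: association is measured with document-level normalized PMI,
    anchors are chosen greedily for diversity, then expanded for cohesion.
    """
    doc_sets = [set(d.lower().split()) for d in docs]
    n_docs = len(doc_sets)

    # Document frequency of each word; rare words are dropped.
    df = Counter(w for s in doc_sets for w in s)
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    vocab_set = set(vocab)

    # Count documents in which each (alphabetically ordered) word pair co-occurs.
    co = Counter()
    for s in doc_sets:
        for a, b in combinations(sorted(s & vocab_set), 2):
            co[(a, b)] += 1

    def npmi(a, b):
        """Normalized PMI in [-1, 1]; -1 means the words never co-occur."""
        if a == b:
            return 1.0
        key = (a, b) if a < b else (b, a)
        joint = co.get(key, 0) / n_docs
        if joint == 0:
            return -1.0
        if joint == 1.0:  # both words in every document; avoid dividing by zero
            return 1.0
        pmi = math.log(joint / ((df[a] / n_docs) * (df[b] / n_docs)))
        return pmi / -math.log(joint)

    # Diversity: start from the most frequent word, then repeatedly add the
    # word whose strongest association with the existing anchors is weakest.
    candidates = sorted(vocab, key=lambda w: (-df[w], w))
    anchors = []
    while candidates and len(anchors) < n_topics:
        if not anchors:
            best = candidates[0]
        else:
            best = min(
                candidates,
                key=lambda w: (max(npmi(w, a) for a in anchors), -df[w], w),
            )
        anchors.append(best)
        candidates.remove(best)

    # Cohesion: expand each anchor with its most strongly associated unused words.
    used = set(anchors)
    seed_sets = []
    for a in anchors:
        neighbours = sorted(
            (w for w in vocab if w not in used),
            key=lambda w: (-npmi(a, w), w),
        )
        chosen = neighbours[: seeds_per_topic - 1]
        used.update(chosen)
        seed_sets.append([a] + chosen)
    return seed_sets


# Toy corpus (hypothetical) with two clearly separated themes.
docs = [
    "goal match team",
    "team goal score",
    "match score team",
    "stock market price",
    "market price trade",
    "stock trade market",
]
seed_sets = cooccurrence_seed_sets(docs, n_topics=2, seeds_per_topic=3)
for i, words in enumerate(seed_sets):
    print(f"topic {i}: {words}")
```

On this toy corpus, each resulting seed set should stay within one theme, since words from the two themes never share a document. Seed sets of this kind could then be passed to any seed-guided topic model that accepts one word list per topic. A real corpus would additionally need tokenization, stop-word handling, and a more careful treatment of polysemous and general terms, which the abstract identifies as a core difficulty.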

Original language: English
Pages (from-to): 31269-31285
Number of pages: 17
Journal: IEEE Access
Volume: 13
State: Published - 2025

Keywords

  • Seed-guided topic modeling
  • automatic seed word selection
  • seed words
