A Wikipedia-based semantic tensor space model for text analytics

Han Joon Kim, Jae Young Chang

Research output: Contribution to journal › Article › peer-review

Abstract

This paper proposes a third-order tensor space model for representing textual documents, in which a 'concept' space exists independently of the 'document' and 'term' spaces. In the vector space model (VSM), a document is represented as a vector in which each dimension corresponds to a term. In contrast, the model described here represents a document as a matrix. Most current text mining algorithms take only vectors as input, but they suffer from the 'term independence' and 'loss of term senses' problems. To overcome these problems, we incorporate the 'concept' as a distinct space in the VSM. This requires producing a concept vector for each term that occurs in a given document, a task related to word sense disambiguation. As an external knowledge source for concept weighting, we use Wikipedia, which has been recognized as a source of world knowledge and has been used to improve many text mining algorithms. Through experiments on two popular document corpora, we demonstrate the superiority of the model in text clustering and text classification.
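
To make the document-as-matrix idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy vocabulary, two made-up concepts, and hand-picked term-to-concept weights (`TERM_CONCEPT`) standing in for weights that would be derived from Wikipedia. It also uses one fixed concept vector per term, whereas the abstract describes producing the concept vector per term within a given document. All names in the code are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: the concepts and weights below are invented for
# illustration; they are NOT taken from the paper or from Wikipedia.
CONCEPTS = ["Finance", "River"]
TERMS = ["bank", "loan", "water"]
TERM_CONCEPT = {
    "bank": [0.6, 0.4],   # ambiguous term spreads over both concepts
    "loan": [1.0, 0.0],
    "water": [0.0, 1.0],
}


def document_matrix(tokens):
    """Return a term-by-concept matrix: term frequency times concept weight."""
    tf = Counter(tokens)
    return [[tf[t] * w for w in TERM_CONCEPT[t]] for t in TERMS]


def corpus_tensor(docs):
    """Stack document matrices into a document x term x concept tensor."""
    return [document_matrix(d) for d in docs]


def cosine(a, b):
    """Cosine similarity of two matrices, treating each as a flat vector."""
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    na = sqrt(sum(x * x for row in a for x in row))
    nb = sqrt(sum(x * x for row in b for x in row))
    return dot / (na * nb) if na and nb else 0.0


docs = [["bank", "loan", "loan"], ["bank", "water"], ["loan"]]
tensor = corpus_tensor(docs)

# Shape of the third-order tensor: documents x terms x concepts.
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
print(cosine(tensor[0], tensor[1]))
```

In this sketch each slice `tensor[i]` is one document's matrix, so a matrix-level similarity such as `cosine` can compare documents by both their terms and the concepts those terms carry.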

Original language: English
Pages (from-to): 264-278
Number of pages: 15
Journal: International Journal of Computational Vision and Robotics
Volume: 11
Issue number: 3
State: Published - 2021

Keywords

  • Classification
  • Clustering
  • Concepts
  • Document representation
  • Machine learning
  • Similarity
  • Tensor space model
  • Text mining
  • VSM
  • Vector space model
  • Wikipedia
