Efficient filtering techniques for cosine similarity joins

Dongjoo Lee, Jaehui Park, Junho Shim, Sang Goo Lee

Research output: Contribution to journalArticlepeer-review

11 Scopus citations

Abstract

Similarity join, an operation that finds all pairs of similar objects in a large collection, is widely used to solve various problems in many application domains. Existing similarity join algorithms use filtering techniques to avoid unnecessary similarity computation based on inverted index. However, they are inefficient in filtering out dissimilar pairs when an aggregate weighted similarity function, such as cosine similarity, is used to quantify similarity values between objects. This is mainly because of loose filtering conditions the existing algorithms adopt. In this paper, we formalize filtering conditions adopted by the previous algorithms and contrive new similarity upper bounds that can be used to make tighter filtering conditions for cosine similarity joins over weight vectors. Our algorithm efficiently filters out dissimilar pairs by exploiting the new similarity upper bounds. We demonstrate that our algorithm outperforms a state-of-the-art algorithm by performing empirical evaluations on large- scale datasets. In addition, we present that our algorithm can be extended to Dice and Tanimito similarity joins over weight vectors.

Original languageEnglish
Pages (from-to)1265-1289
Number of pages25
JournalInformation
Volume14
Issue number4
StatePublished - Apr 2011

Keywords

  • Cosine similarity join
  • Inverted index
  • Length filtering
  • Prefix filtering
  • Similarity join

Fingerprint

Dive into the research topics of 'Efficient filtering techniques for cosine similarity joins'. Together they form a unique fingerprint.

Cite this