Approximate consistent weighted sampling for efficient top-k search

Yunna Kim, Heasoo Hwang

Research output: Contribution to journalArticlepeer-review

1 Scopus citations


Top-k search on weighted sets can be very slow since the computation cost of generalized Jaccard similarity is proportional to the dimensionality of sets. ICWS generates samples of high quality, but its hashing cost is too high to generate samples from high-dimensional weighted sets. We propose simple hashing methods, ICWS P and its variants, that approximate ICWS very efficiently in O(D. K/B). Extensive experiments show that hashing cost is reduced significantly while top-k precision and classification accuracy with estimated set similarity are almost as high as those of ICWS. Query time can also be improved since less than K samples are compared for sets of low similarity.

Original languageEnglish
Pages (from-to)1125-1132
Number of pages8
JournalInternational Journal of Innovative Computing, Information and Control
Issue number3
StatePublished - 2020


  • Generalized jaccard similarity
  • Minwise hashing
  • Weighted sampling


Dive into the research topics of 'Approximate consistent weighted sampling for efficient top-k search'. Together they form a unique fingerprint.

Cite this