Abstract
Top-k search on weighted sets can be very slow since the computation cost of generalized Jaccard similarity is proportional to the dimensionality of sets. ICWS generates samples of high quality, but its hashing cost is too high to generate samples from high-dimensional weighted sets. We propose simple hashing methods, ICWS P and its variants, that approximate ICWS very efficiently in O(D. K/B). Extensive experiments show that hashing cost is reduced significantly while top-k precision and classification accuracy with estimated set similarity are almost as high as those of ICWS. Query time can also be improved since less than K samples are compared for sets of low similarity.
Original language | English |
---|---|
Pages (from-to) | 1125-1132 |
Number of pages | 8 |
Journal | International Journal of Innovative Computing, Information and Control |
Volume | 16 |
Issue number | 3 |
DOIs | |
State | Published - 2020 |
Keywords
- Generalized jaccard similarity
- Minwise hashing
- Weighted sampling