TY - GEN
T1 - An efficient similarity join algorithm with cosine similarity predicate
AU - Lee, Dongjoo
AU - Park, Jaehui
AU - Shim, Junho
AU - Lee, Sang Goo
PY - 2010
Y1 - 2010
N2 - Given a large collection of objects, finding all pairs of similar objects, namely similarity join, is widely used to solve various problems in many application domains.Computation time of similarity join is critical issue, since similarity join requires computing similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similarity computation; however, existing algorithms implementing the prefix filtering have inefficiency in filtering out object pairs, in particular, when aggregate weighted similarity function, such as cosine similarity, is used to quantify similarity values between objects. This is mostly caused by large prefixes the algorithms select. In this paper, we propose an alternative method to select small prefixes by exploiting the relationship between arithmetic mean and geometric mean of elements' weights. A new algorithm, MMJoin, implementing the proposed methods dramatically reduces the average size of prefixes without much overhead. Finally, it saves much computation time. We demonstrate that our algorithm outperforms a state-of-the-art one with empirical evaluation on large-scale real world datasets.
AB - Given a large collection of objects, finding all pairs of similar objects, namely similarity join, is widely used to solve various problems in many application domains.Computation time of similarity join is critical issue, since similarity join requires computing similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similarity computation; however, existing algorithms implementing the prefix filtering have inefficiency in filtering out object pairs, in particular, when aggregate weighted similarity function, such as cosine similarity, is used to quantify similarity values between objects. This is mostly caused by large prefixes the algorithms select. In this paper, we propose an alternative method to select small prefixes by exploiting the relationship between arithmetic mean and geometric mean of elements' weights. A new algorithm, MMJoin, implementing the proposed methods dramatically reduces the average size of prefixes without much overhead. Finally, it saves much computation time. We demonstrate that our algorithm outperforms a state-of-the-art one with empirical evaluation on large-scale real world datasets.
UR - http://www.scopus.com/inward/record.url?scp=78049390973&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-15251-1_33
DO - 10.1007/978-3-642-15251-1_33
M3 - Conference contribution
AN - SCOPUS:78049390973
SN - 3642152503
SN - 9783642152504
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 422
EP - 436
BT - Database and Expert Systems Applications - 21st International Conference, DEXA 2010, Proceedings
T2 - 21st International Conference on Database and Expert Systems Applications, DEXA 2010
Y2 - 30 August 2010 through 3 September 2010
ER -