TY - JOUR
T1 - A novel algorithm for scalable k -nearest neighbour graph construction
AU - Park, Youngki
AU - Hwang, Heasoo
AU - Lee, Sang Goo
N1 - Publisher Copyright:
© Chartered Institute of Library and Information Professionals.
PY - 2016/4/1
Y1 - 2016/4/1
N2 - Finding the k-nearest neighbours of every node in a dataset is one of the most important data operations with wide application in various areas such as recommendation and information retrieval. However, a major challenge is that the execution time of existing approaches grows rapidly as the number of nodes or dimensions increases. In this paper, we present greedy filtering, an efficient and scalable algorithm for finding an approximate k-nearest neighbour graph. It selects a fixed number of nodes as candidates for every node by filtering out node pairs that do not have any matching dimensions with large values. Greedy filtering achieves consistent approximation accuracy across nodes in linear execution time. We also present a faster version of greedy filtering that uses inverted indices on the node prefixes. Through theoretical analysis, we show that greedy filtering is effective for datasets whose features have Zipfian distribution, a characteristic observed in majority of large datasets. We also conduct extensive comparative experiments against (a) three state-of-the-art algorithms, and (b) three algorithms in related research domains. Our experimental results show that greedy filtering consistently outperforms other algorithms in various types of high-dimensional datasets.
AB - Finding the k-nearest neighbours of every node in a dataset is one of the most important data operations with wide application in various areas such as recommendation and information retrieval. However, a major challenge is that the execution time of existing approaches grows rapidly as the number of nodes or dimensions increases. In this paper, we present greedy filtering, an efficient and scalable algorithm for finding an approximate k-nearest neighbour graph. It selects a fixed number of nodes as candidates for every node by filtering out node pairs that do not have any matching dimensions with large values. Greedy filtering achieves consistent approximation accuracy across nodes in linear execution time. We also present a faster version of greedy filtering that uses inverted indices on the node prefixes. Through theoretical analysis, we show that greedy filtering is effective for datasets whose features have Zipfian distribution, a characteristic observed in majority of large datasets. We also conduct extensive comparative experiments against (a) three state-of-the-art algorithms, and (b) three algorithms in related research domains. Our experimental results show that greedy filtering consistently outperforms other algorithms in various types of high-dimensional datasets.
KW - collaborative filtering
KW - k-nearest neighbour graph construction
KW - k-nearest neighbour search
UR - http://www.scopus.com/inward/record.url?scp=84959420775&partnerID=8YFLogxK
U2 - 10.1177/0165551515594728
DO - 10.1177/0165551515594728
M3 - Article
AN - SCOPUS:84959420775
SN - 0165-5515
VL - 42
SP - 274
EP - 288
JO - Journal of Information Science
JF - Journal of Information Science
IS - 2
ER -