TY - JOUR
T1 - Pragmatic correlation analysis for probabilistic ranking over relational data
AU - Park, Jaehui
AU - Lee, Sang Goo
PY - 2013/6/1
Y1 - 2013/6/1
AB - It is widely recognized that effective ranking methods for relational data (e.g., tuples) enable users to overcome the limitations of the traditional Boolean retrieval model and the difficulty of writing structured queries. To determine the rank of a tuple, term frequency-based methods, such as tf × idf (term frequency × inverse document frequency) schemes, have been commonly adopted in the literature by simply treating a tuple as a single document. However, we have noted that in many cases tf × idf schemes may not produce effective rankings or specific orderings for relational data with categorical attributes, which are pervasive today. To support fundamental aspects of relational data, we apply notions of correlation analysis to estimate the extent of relationships between queries and data. This paper proposes a probabilistic ranking model that exploits the statistical relationships in relational data with categorical attributes. Given a set of query terms, information on attribute values correlated with the query terms is used to estimate the relevance of a tuple to the query. To quantify this information, we compute the extent of the dependency between correlated attribute values on a Bayesian network. Moreover, we avoid the prohibitive cost of computing insignificant ranking features by adopting a limited assumption of node independence. Our probabilistic ranking model is domain-independent and leverages only data statistics, without any prior knowledge such as user query logs. Experimental results show that our work improves the effectiveness of rankings for real-world datasets and has reasonable query processing efficiency compared to related work.
KW - Correlation analysis
KW - Probabilistic ranking model
KW - Ranking for structured data
UR - http://www.scopus.com/inward/record.url?scp=84873182749&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2012.11.010
DO - 10.1016/j.eswa.2012.11.010
M3 - Article
AN - SCOPUS:84873182749
SN - 0957-4174
VL - 40
SP - 2649
EP - 2658
JO - Expert Systems with Applications
JF - Expert Systems with Applications
IS - 7
ER -