Efficient processing of k-nearest neighbor queries over relational databases: A cost-based optimization

Date of Completion

January 2004


Business Administration, General|Information Science|Computer Science




Nearest neighbor querying has been applied most widely in document and multimedia retrieval systems, owing to the nature of the information stored and the intuitive appeal of requesting “approximate matches.” In contrast, research on the efficient processing of these queries over relational database management systems (RDBMSs) is limited. Despite the increasing importance of these queries in applications such as price comparison services and product recommender systems, current RDBMSs do not natively support them.

This dissertation proposes a Query-Level Optimal Cost Strategy (QLOCS) for estimating an optimal range query for efficient k-nearest neighbor (k-NN) retrieval over a relational database. We develop an analytical model that systematically exploits the histogram information available in RDBMSs and incorporates the relevant processing costs and their trade-offs in estimating a cost-optimal range query for nearest neighbor retrieval. Experimental and computational analyses using real and synthetic databases show that the technique effectively trades off competing cost factors and achieves greater efficiency than existing approaches.

Unlike conventional querying, k-NN querying shifts the burden of constructing the relevant range query from the user to the system. This spares the user an iterative refinement of the query conditions in search of the relevant information. To this end, this research has the following distinguishing features.

First, the method systematically considers the trade-offs between the important cost factors in processing top-k queries. Second, unlike existing approaches, the method is designed to deal with individual queries rather than a pre-specified workload of queries, by leveraging the statistical information (histograms) available in relational databases and undertaking cost optimization at the query level. This is particularly significant because existing approaches attempt to provide a solution at an aggregate level, which requires expensive calibration against the actual dataset for a specific context. Finally, the strategy clearly separates the techniques of histogram construction from the retrieval algorithm and provides robust performance across the histogram construction techniques commonly used in RDBMSs.
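
The general idea of turning a k-NN request into a range query by consulting a histogram can be illustrated with a minimal sketch. This is not the QLOCS model described above: it handles a single numeric attribute, assumes values are uniformly spread within each bucket, and ignores processing costs entirely. The function names, the bucket layout, and the example data are all hypothetical.

```python
def estimated_count(edges, counts, lo, hi):
    """Estimate how many tuples fall in [lo, hi] from a bucket histogram.

    edges  -- sorted bucket boundaries, len(counts) + 1 values
    counts -- number of tuples in each bucket
    Assumes a uniform distribution inside each bucket (an illustrative
    simplification, not taken from the dissertation).
    """
    total = 0.0
    for left, right, count in zip(edges, edges[1:], counts):
        overlap = min(hi, right) - max(lo, left)
        if overlap > 0 and right > left:
            total += count * overlap / (right - left)
    return total


def estimate_knn_radius(edges, counts, q, k, tol=1e-6):
    """Find, by bisection, a radius r such that the histogram estimates
    at least k tuples in the range [q - r, q + r]."""
    if k > sum(counts):
        raise ValueError("k exceeds the number of tuples in the histogram")
    # A radius reaching the farthest histogram boundary covers every tuple.
    low, high = 0.0, max(abs(q - edges[0]), abs(edges[-1] - q))
    while high - low > tol:
        mid = (low + high) / 2
        if estimated_count(edges, counts, q - mid, q + mid) >= k:
            high = mid  # range is large enough, try a smaller one
        else:
            low = mid   # range is too small, widen it
    return high


# Hypothetical histogram: three buckets of width 10 with 10 tuples each.
edges = [0, 10, 20, 30]
counts = [10, 10, 10]
r = estimate_knn_radius(edges, counts, q=15, k=5)
range_query = (15 - r, 15 + r)  # candidate range to submit to the database
```

The resulting range would then be issued as an ordinary range predicate, with the k closest tuples selected from the rows returned. Because the histogram is only an estimate, the range may return fewer than k rows, in which case it has to be widened and the query reissued. A range that is too wide wastes retrieval effort, while one that is too narrow forces a restart, and this is the kind of trade-off a cost-based strategy has to weigh.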