Index-Based Similarity Joins

Pearson, Spencer Scott

Similarity Joins are among the most useful and powerful data processing techniques. A Similarity Join retrieves all pairs of data points from different datasets that are considered similar within a given threshold. The operation is useful in many applications, such as record linkage and data cleaning. Of the many techniques proposed for performing Similarity Joins, one of the most useful is to build an indexing structure over the data. After a pre-processing step that constructs an index over a given dataset, queries over that dataset can be performed significantly faster. Index-based techniques are therefore beneficial when multiple Similarity Join queries will be performed over the same dataset, because the construction cost is paid only once. We present an extension to a previously proposed index structure, the eD-Index, which provides support for Similarity Join operators. We evaluate the performance of the algorithms and investigate the configuration of parameters that maximizes the performance of the index structure. We also propose an algorithm for Multi-Way Similarity Joins using this index, which allows Similarity Join queries over more than two datasets at a time.
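
To make the operation concrete, the following is a minimal brute-force sketch of a Similarity Join in Python. It is not the eD-Index or any algorithm from this work; it only illustrates the definition by comparing every pair, which is the baseline cost an index is meant to reduce. The function name similarity_join, the parameter eps, and the choice of Euclidean distance are assumptions made for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_join(r, s, eps, metric=dist):
    """Return every pair (x, y) with x in r, y in s, and metric(x, y) <= eps.

    Illustrative baseline only: it evaluates the metric on all
    len(r) * len(s) pairs and uses no index.
    """
    return [(x, y) for x, y in product(r, s) if metric(x, y) <= eps]


# Two small, made-up datasets of 2-D points.
r = [(0.0, 0.0), (5.0, 5.0)]
s = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]

pairs = similarity_join(r, s, eps=1.0)
print(pairs)
```

With the threshold set to 1.0, only the pairs whose points lie within distance 1.0 of each other are returned; the point (9.0, 9.0) has no match in r and appears in no pair.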

Copyright Statement