Date of Completion
10-28-2019
Embargo Period
10-27-2020
Keywords
Incremental Record Linkage, Edit Distance, Blocking, K-mers, Parallel Computing, Hierarchical Clustering.
Major Advisor
Prof. Sanguthevar Rajasekaran.
Co-Major Advisor
Prof. Reda Ammar.
Associate Advisor
Prof. Song Han.
Associate Advisor
Prof. Sheida Nabavi.
Field of Study
Computer Science and Engineering
Degree
Doctor of Philosophy
Open Access
Campus Access
Abstract
In the biomedical domain, record linkage is considered a crucial problem. When the number of records is very large, existing algorithms for record linkage take too much time. Often, we have to link a small set of new records with a large set of old records. This can be done by putting together the old and new records and performing linkage on all the records. Clearly, this calls for an enormous amount of time. An alternative is to develop algorithms that perform linkage in an incremental manner. We refer to any such algorithm as an Incremental Record Linkage (IRL) algorithm.
In this thesis, we present an efficient IRL algorithm. In addition to taking large amounts of time, existing algorithms might also suffer from a chaining problem and hence introduce some errors in linking. As has been observed in the literature, this chaining problem can be solved by performing clustering under complete linkage.
This thesis makes two main contributions. First, we offer novel sequential and parallel algorithms for the critical incremental record linkage problem using single linkage. Second, we present novel sequential and parallel algorithms for incremental record linkage using complete linkage to overcome the chaining problem.
Our algorithms can handle any number of datasets. In contrast, many of the existing algorithms can only link two datasets at a time. Our algorithms outperform previous algorithms and offer state-of-the-art solutions to the IRL problem. We have tested our algorithms on synthetic and real datasets containing millions of records and shown that they outperform the best-known RLA algorithms when the number of new records is up to around 20% of the total number of old records. Our algorithms achieve nearly linear speedup when run in parallel.
Recommended Citation
Baihan, Abdullah, "Efficient Sequential and Parallel Algorithms for Incremental Record Linkage" (2019). Doctoral Dissertations. 2335.
https://digitalcommons.lib.uconn.edu/dissertations/2335