Advanced Algorithms for Information Extraction

Date of Completion

January 2008


Computer Science




With advances in information technology, voluminous data are constantly generated in domains such as the Internet, commerce, biology, and medical imaging. Extracting useful information from these massive data sets efficiently is a vital problem. This research studies five information extraction problems: association rule mining, parallel computation of the singular value decomposition (SVD), gene selection for classification, unknown category classification, and k-means clustering.

In association rule mining, we present a novel Transaction Mapping (TM) algorithm for frequent itemset mining, in which the transaction ids of each itemset are mapped and compressed into continuous transaction intervals, and itemset counting is performed by intersecting these interval lists in a depth-first order along the lexicographic tree.

In the area of SVD, we introduce a novel scheme for the parallel computation of SVDs. The algorithm is a specific "relaxation" of the Jacobi iteration and is highly parallelizable; for example, it enables all the rotations of a sweep to be computed in parallel while keeping the number of sweeps reasonable.

For gene selection for classification, we develop a new greedy algorithm that incorporates correlations between genes, based on the learned weights of a support vector machine (SVM). It obtains higher classification accuracy using fewer selected genes than the well-known algorithms in the literature.

For unknown category classification, we propose a novel method capable of identifying the presence of a new class, for example when a new object lies sufficiently far from all of the known classes. In this algorithm an SVM performs the first-stage classification, after which the squared Mahalanobis distance is used to identify the unknown class (if any). The algorithm accurately detects the presence of the unknown class and classifies the objects that belong to it.
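As an illustration of the second stage of this method, a nearest-class squared Mahalanobis test against a chi-square threshold can be sketched as follows. The class statistics and threshold here are hypothetical examples, not the dissertation's experimental setup:

```python
def squared_mahalanobis(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mean)^T Cov^{-1} (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    tmp = [sum(cov_inv[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return sum(di * ti for di, ti in zip(d, tmp))

def classify_or_unknown(x, class_stats, threshold):
    """Assign x to the closest known class, or to 'unknown' if it lies
    farther than `threshold` (squared Mahalanobis) from every known class."""
    best_label, best_dist = None, float("inf")
    for label, (mean, cov_inv) in class_stats.items():
        dist = squared_mahalanobis(x, mean, cov_inv)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else "unknown"

# Two hypothetical known classes in 2-D with identity (inverse) covariance.
stats = {
    "A": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "B": ([5.0, 5.0], [[1.0, 0.0], [0.0, 1.0]]),
}
# 5.991: chi-square critical value at 95% confidence, 2 degrees of freedom.
THRESHOLD = 5.991

print(classify_or_unknown([0.5, -0.3], stats, THRESHOLD))    # near class A
print(classify_or_unknown([20.0, -15.0], stats, THRESHOLD))  # far from both -> unknown
```

Raising or lowering the chi-square confidence level directly widens or tightens the region each known class claims, which is how the user controls what counts as "unknown".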
The proposed unknown-category classification method is flexible in that the unknown class can be controlled through a confidence level set by the user.

In k-means clustering, we propose three constant-factor approximation algorithms. The first runs in time O((k/e)^k nd), where k is the number of clusters, n is the number of input points, d is the dimension of the attributes, and e is the approximation parameter. The second runs in time O(k^3 n^2 log n); this is the first algorithm for k-means clustering whose running time is polynomial in n, k, and d simultaneously. The third runs in time O(k^5 log^3(k) d), which is independent of n. Although an algorithm whose running time is independent of n is known for the k-median problem, ours is the first such algorithm for the k-means problem.
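Returning to the TM algorithm described earlier, the interval-list intersection that underlies its itemset counting can be sketched as follows. This is a minimal illustration with hypothetical interval lists; the actual algorithm applies it while traversing the lexicographic tree depth-first:

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of disjoint transaction-id intervals.

    Each interval (lo, hi) covers transaction ids lo..hi inclusive.
    Standard two-pointer merge: advance whichever list ends first.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def support(intervals):
    """Support count = total number of transaction ids the intervals cover."""
    return sum(hi - lo + 1 for lo, hi in intervals)

# Hypothetical interval lists for two itemsets, e.g. {A} and {B}.
a = [(1, 5), (9, 12)]
b = [(3, 10)]
common = intersect_intervals(a, b)
print(common, support(common))  # intervals shared by both, and their support
```

Because each interval list is sorted and disjoint, the intersection costs time linear in the number of intervals, which is what makes counting by interval lists cheaper than scanning raw transaction-id lists when the mapping compresses ids into few long runs.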