Date of Completion
1-17-2014
Embargo Period
1-17-2014
Keywords
transcription factor, machine learning, web tool
Major Advisor
Chun-Hsi Huang
Associate Advisor
Jinbo Bi
Associate Advisor
Sanguthevar Rajasekaran
Associate Advisor
Daniel Schwartz
Associate Advisor
Dong-Guk Shin
Field of Study
Computer Science and Engineering
Degree
Doctor of Philosophy
Open Access
Open Access
Abstract
A transcription factor (TF) is a protein or protein complex that regulates the expression of its target genes by physically binding to their regulatory regions. The binding sites of a TF naturally share a common pattern, or motif. Given the known binding sites of a TF, a TF model can be built and used to scan sequences for putative binding sites. This is known as the transcription factor binding site (TFBS) search problem. In this dissertation, we investigate the TFBS search problem using machine learning approaches.
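As an illustration of what building a TF model and scanning a sequence can look like, the following minimal sketch uses a log-odds position weight matrix (PWM), a common textbook model. It is not the dissertation's method. The example sites, the pseudocount, the uniform background frequency, and the function names `build_pwm` and `scan` are all assumptions made for this illustration, and the sites are assumed to be already aligned and of equal length.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length sites (illustrative only)."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against a uniform background.
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in BASES})
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (start, window, score) tuples."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        hits.append((start, window, score))
    return hits

# Made-up example sites, not data from the dissertation.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
hits = scan("GGCTATAATGCC", pwm)
best = max(hits, key=lambda hit: hit[2])
print(best[0], best[1])
```

In practice a score threshold would decide which windows are reported as putative binding sites; the sketch simply picks the highest-scoring window.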
In general, the known binding sites of a TF are of variable lengths and have to be aligned before a model can be built. Transcription factor binding site alignment is considered an unsupervised learning problem since no other information about the unaligned binding sites is given. We propose an algorithm that considers both the lengths of TFBSs and the dependencies between nucleotide positions in a binding site. The novel method is named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association).
Studies often use TFBS search tools to predict the binding sites of a TF in a DNA sequence when binding sites found by assays are not available. The analysis often involves TF model collection, promoter sequence retrieval, and visualization, which together require several separate tools. To accelerate TFBS analyses, we developed a novel integrated web tool named LASAGNA-Search. This user-friendly tool allows users to perform the entire analysis without leaving the site.
TFBS search methods are considered supervised learning algorithms since they learn from example binding sites of a TF. Most TFBS search methods consider only the known binding sites of a TF and hence deal with one-class classification problems. However, non-binding sites contain information about the TF as well. When non-binding sites are available, searching for TFBSs becomes a two-class classification problem. We propose two novel methods, the negative-to-positive vector method and the optimal discriminating vector method, both of which use binding sites as well as non-binding sites.
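To make the two-class setting concrete, the sketch below scores a candidate site against a weight vector learned from both binding and non-binding examples. It is a generic centroid-difference baseline chosen for illustration and does not claim to reproduce either of the dissertation's methods, whose definitions are not given in this abstract. The one-hot encoding, the example sequences, and the function names `one_hot`, `centroid`, `fit_direction`, and `score` are assumptions.

```python
BASES = "ACGT"

def one_hot(site):
    """Encode a DNA string as a flat 4-per-position indicator vector."""
    vector = []
    for base in site:
        vector.extend(1.0 if base == b else 0.0 for b in BASES)
    return vector

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def fit_direction(positives, negatives):
    """Weight vector pointing from the non-binding centroid to the binding centroid."""
    pos_center = centroid([one_hot(site) for site in positives])
    neg_center = centroid([one_hot(site) for site in negatives])
    return [p - n for p, n in zip(pos_center, neg_center)]

def score(site, weights):
    """Dot product of the encoded site with the weight vector."""
    return sum(w * x for w, x in zip(weights, one_hot(site)))

# Made-up example sequences, not data from the dissertation.
positives = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
negatives = ["GGCCGC", "TGCAGC", "GATCGG", "CACAGT"]

weights = fit_direction(positives, negatives)
pos_score = score("TATAAT", weights)
neg_score = score("GGCCGC", weights)
print(pos_score > neg_score)
```

A one-class model would use only `positives`; here the non-binding examples also shape the weights, so bases that are frequent in non-binding sites are penalized.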
Recommended Citation
Lee, Chih, "Machine Learning Approaches to Transcription Factor Binding Site Search and Visualization" (2014). Doctoral Dissertations. 304.
https://digitalcommons.lib.uconn.edu/dissertations/304