Date of Completion
5-11-2013
Embargo Period
5-9-2013
Advisors
Sanguthevar Rajasekaran; Daniel Schwartz
Field of Study
Computer Science and Engineering
Degree
Master of Science
Open Access
Open Access
Abstract
Background: Many consensus-based and Position Weight Matrix-based methods for recognizing transcription factor binding sites (TFBSs) are not well suited to the variability in the lengths of binding sites. In addition, many methods discard known binding sites while building the model. Moreover, the impact of Information Content (IC) and of the positional dependence of nucleotides within an aligned set of TFBSs has not been well researched for modeling variable-length binding sites. In this paper, we propose ML-Consensus, a consensus model for variable-length binding sites which does not exclude any input binding sites. We consider Pairwise Score (PS) as a measure of positional dependence of nucleotides within an alignment of binding sites. We investigate how the prediction accuracy of ML-Consensus is affected by using IC, PS, and any particular binding site alignment strategy. We perform leave-one-out cross-validations on datasets of six species from the TRANSFAC public database, and analyze the results using ROC curves and the Wilcoxon matched-pairs signed-ranks test.
Results: We observed that the incorporation of IC and PS in ML-Consensus results in a statistically significant improvement in prediction accuracy. Moreover, any two positions in the multiple sequence alignment of the binding sites were found to be interdependent only when the distance between them was below a certain value. Lastly, configurations with state-of-the-art alignment strategies did not perform significantly better than configurations with a naive alignment strategy.
Conclusions: There exists a core region within a set of known binding sites, and positions in that core region are interdependent. Additionally, it is possible to improve existing state-of-the-art multiple sequence alignment algorithms by using this information about the core region among the binding sites.
Availability: All source code (C#), results, supporting evidence, supplementary data and figures are available from http://biogrid.engr.uconn.edu/mlconsensus/ .
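Illustration: The Information Content (IC) measure referred to in the abstract can be sketched with the standard per-column definition for DNA alignments, IC = log2(4) minus the Shannon entropy of the column. The sketch below is written in Python for brevity and is not taken from the thesis's C# source. The function name, the example sites, the uniform background, and the choice to skip gap characters ("-") when counting are all assumptions made for illustration; the exact IC formula and gap handling used by ML-Consensus are not stated in this abstract.

```python
import math
from collections import Counter


def column_information_content(aligned_sites, alphabet="ACGT"):
    """Per-column information content (bits) of equal-length aligned sites.

    Assumptions (not from the thesis): uniform background, no small-sample
    correction, and any character outside `alphabet` (e.g. a gap '-') is
    ignored when estimating column frequencies.
    """
    if not aligned_sites:
        raise ValueError("need at least one aligned site")
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits for DNA
    ic = []
    for j in range(width):
        counts = Counter(s[j] for s in aligned_sites if s[j] in alphabet)
        total = sum(counts.values())
        if total == 0:
            ic.append(0.0)  # column contains only gaps
            continue
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        ic.append(max_bits - entropy)
    return ic


# Hypothetical gapped alignment of four short binding sites.
sites = ["ACGT", "ACGA", "ACTT", "AC-T"]
ic = column_information_content(sites)
for position, bits in enumerate(ic):
    print(f"column {position}: {bits:.3f} bits")
```

Fully conserved columns reach the 2-bit maximum, while columns with mixed nucleotides score lower, which is why IC is a natural weight for how much each aligned position should contribute to a consensus model.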
Recommended Citation
Quader, Saad A., "Effect of Positional Dependence in Recognizing Transcription Factor Binding Sites" (2013). Master's Theses. 448.
https://digitalcommons.lib.uconn.edu/gs_theses/448
Major Advisor
Chun-Hsi Huang