Date of Completion


Embargo Period



Sanguthevar Rajasekaran, Yufeng Wu

Field of Study

Computer Science and Engineering


Master of Science

Open Access


Massively parallel transcriptome sequencing is quickly replacing microarrays as the technology of choice for gene expression profiling because of its wider dynamic range and digital quantitation capabilities. However, accurate estimation of expression levels from sequencing data remains challenging because of the short read lengths delivered by current sequencing technologies and because protocol- and technology-specific biases are still poorly understood.

To date, two main transcriptome sequencing protocols have been proposed in the literature. The most commonly used one, referred to as RNA-Seq, generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments. An alternative protocol, referred to as 3’-tag Digital Gene Expression (DGE) or high-throughput sequencing-based Serial Analysis of Gene Expression (SAGE-Seq), generates single cDNA tags. Its main steps are transcript capture and cDNA synthesis using oligo(dT) beads, cDNA cleavage with an anchoring restriction enzyme, and release of cDNA tags using a tagging restriction enzyme whose recognition site is ligated upstream of the recognition site of the anchoring enzyme.

In this thesis we present two novel expectation-maximization algorithms for inference of isoform- and/or gene-specific expression levels from RNA-Seq and DGE data and a comparison of estimation performance of the two transcriptome sequencing protocols.

The first algorithm, IsoEM, works on RNA-Seq data. It exploits the disambiguation information provided by the distribution of insert sizes generated during sequencing library preparation, and it takes advantage of base quality scores, strand information, and read pairing information when available. Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation.
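The general expectation-maximization idea behind this kind of estimator can be illustrated with a minimal sketch. This is not the IsoEM implementation: the function name `em_read_fractions` and the input layout are hypothetical, and the per-read compatibility weights are simply given as numbers here, whereas in the setting described above they would be derived from insert sizes, base quality scores, strand, and pairing information. The sketch estimates only the fraction of reads originating from each isoform and omits any length normalization.

```python
def em_read_fractions(read_weights, n_iter=200, tol=1e-9):
    """Estimate the fraction of reads coming from each isoform.

    read_weights: list with one dict per read, mapping an isoform name to a
    non-negative compatibility weight (hypothetical input format).
    """
    isoforms = sorted({j for weights in read_weights for j in weights})
    # Start from a uniform guess.
    freq = {j: 1.0 / len(isoforms) for j in isoforms}

    for _ in range(n_iter):
        # E-step: split each read among its compatible isoforms in
        # proportion to (current frequency * compatibility weight).
        counts = {j: 0.0 for j in isoforms}
        for weights in read_weights:
            total = sum(freq[j] * w for j, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for j, w in weights.items():
                counts[j] += freq[j] * w / total

        # M-step: new frequencies are the normalized expected counts.
        n = sum(counts.values())
        if n == 0.0:
            break
        new_freq = {j: counts[j] / n for j in isoforms}

        delta = max(abs(new_freq[j] - freq[j]) for j in isoforms)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Two reads unique to isoform A, one unique to B, one compatible with both.
reads = [{"A": 1.0}, {"A": 1.0}, {"B": 1.0}, {"A": 1.0, "B": 1.0}]
fractions = em_read_fractions(reads)
```

In this toy input the ambiguous read is shared between A and B in proportion to the current estimates, so the estimates settle near 2/3 for A and 1/3 for B.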

The second algorithm, DGE-EM, infers gene and isoform expression levels from DGE tags. Unlike previous methods, our algorithm takes into account alternative splicing isoforms and tags that map to multiple locations in the genome, and it corrects for incomplete digestion and sequencing errors. Experimental results show that DGE-EM outperforms methods based on unique tag counting on a multi-library DGE dataset consisting of 20 bp tags generated from two commercially available reference RNA samples that have been well characterized by quantitative real-time PCR as part of the MicroArray Quality Control Consortium (MAQC).
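One simple way to think about incomplete digestion is sketched below. The model is an assumption made for illustration, not a statement of what DGE-EM does: each recognition site of the anchoring enzyme on a transcript is taken to be cleaved independently with the same probability `p`, and the tag is taken to come from the cleaved site closest to the 3’ end. The function name `site_tag_probabilities` is hypothetical.

```python
def site_tag_probabilities(num_sites, p):
    """Probability that the tag originates from each site of one transcript.

    Sites are indexed from the 3' end (index 0 is the most 3' site).
    Assumption: every site is cleaved independently with probability p,
    and the tag is produced at the cleaved site nearest the 3' end.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Site i yields the tag when it is cut and all i sites closer to the
    # 3' end were left uncut.
    return [p * (1.0 - p) ** i for i in range(num_sites)]


probs = site_tag_probabilities(3, 0.5)
# Probability that no site is cut, so the transcript yields no tag.
no_tag = 1.0 - sum(probs)
```

Under this assumed model, complete digestion (`p = 1`) puts all the probability on the site nearest the 3’ end, and lower values of `p` spread it over sites further upstream.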

We also take advantage of the availability of RNA-Seq data generated from the same MAQC samples to directly compare estimation performance of the two transcriptome sequencing protocols.

Major Advisor

Ion Mandoiu