Files in this item

Andrew_Magis.pdf (application/pdf, 7MB)


Title: Next-generation sequencing analysis and RNA editing in human brain and glioma
Author(s): Magis, Andrew
Director of Research: Price, Nathan D.
Doctoral Committee Chair(s): Price, Nathan D.
Doctoral Committee Member(s): Ceman, Stephanie S.; Sinha, Saurabh; Ma, Jian
Department / Program: School of Molecular & Cell Bio
Discipline: Biophysics & Computnl Biology
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): RNA sequencing
next-generation sequencing
single end
paired end
top-scoring pair
top-scoring triplet
relative expression
RNA editing
adenosine deaminase, RNA-specific (ADAR)
adenosine deaminase
Abstract: RNA sequencing (RNA-seq) is one of the most common technologies in use today for the analysis of gene expression and transcriptome variation in biological samples (Zhong Wang et al. 2009). As of 2011, the NCBI Sequence Read Archive (SRA) surpassed 100 terabases of sequence data, comprising nearly 40,000 RNA and 260,000 DNA sequencing projects. In 2013 the SRA comprises over 500 terabases, with a projected doubling time of 22.3 months. The explosive growth of next-generation sequence data now exceeds the growth rate of storage capacity (Kodama et al. 2012). Researchers’ ability to process and analyze this data depends upon bioinformatics tools that are accurate, easy to use, and fast, especially when multiple data sets are available for processing and additional analysis beyond standard mapping is required. To address these issues I have developed SNAP-RNA, a new RNA-seq alignment and analysis pipeline designed for datasets involving hundreds or thousands of RNA-seq libraries, while maintaining high alignment accuracy. SNAP-RNA is capable of natively reading from FASTQ, gzipped FASTQ, SAM, and BAM formats, and directly writing to BAM and sorted BAM formats for immediate visualization, without any need for external software packages such as SAMtools (Heng Li et al. 2009). Quality filtering of input reads is incorporated directly into the alignment process. SNAP-RNA can automatically identify and report contaminants or viral/bacterial infections in samples. Gene read counts suitable for downstream analysis with popular statistical programs such as DESeq (Anders and Huber 2010), edgeR (Robinson et al. 2010), or baySeq (Hardcastle and Kelly 2010) are generated automatically with no running time penalty. Finally, SNAP-RNA identifies and reports intra- and inter-chromosomal gene fusions with high accuracy, while maintaining speeds competitive with the fastest available aligners.
I demonstrate the capabilities of SNAP-RNA through the analysis of nearly 1300 high-quality RNA-seq samples from several different cancer types derived from The Cancer Genome Atlas, using recently published studies as my benchmarks.
Recent developments in next-generation sequencing have revealed unprecedented levels of RNA editing of expressed transcripts, the majority of which occur in the brain. Alterations in transcript editing levels are increasingly being linked to pathology in human cancer. Using a novel RNA editing analysis pipeline enabled by SNAP-RNA I have characterized changes in A-to-I editing percentages in nearly 400 glioblastoma and astrocytoma primary tumor, normal brain, and cell line samples. This study represents the first global view of differential editing across multiple brain regions and low- and high-grade astrocytoma. I identify relationships between expression of the editing enzymes ADAR, ADARB1, and ADARB2 and editing profiles in both healthy and diseased states of the brain. Furthermore, I identify many differentially edited bases across normal brain and gliomas that have not previously been characterized. My results highlight biologically relevant editing events that may contribute to astrocytoma pathology.
As next-generation RNA sequencing technology has become ubiquitous, researchers have sought to use this expression data to identify distinct gene relationships that classify disease states, allowing for accurate diagnosis of diseases given the expression patterns of a few genes. Such methods include support vector machines (Brown et al. 2000), decision trees (Zhang et al. 2003), and neural networks (Khan et al. 2001). The top-scoring pair (TSP) and the top-scoring triplet (TST) algorithms have demonstrated similar accuracies to these methods while remaining relatively simple, resistant to overfitting, and consistent across data normalization methods (Geman et al. 2004; Tan et al. 2005; Price et al. 2007; Lin et al. 2009).
Despite these advantages, the TSP and especially the TST algorithm are computationally intensive and therefore slow. The graphics processing unit (GPU) is increasingly applied to computationally challenging scientific problems including molecular dynamics simulations (John E Stone et al. 2010) and medical imaging (S S Stone et al. 2008). The GPU is designed for massive parallelism involving thousands of simultaneously executing threads, but requires code written differently from code that runs on CPUs. I have implemented both the TSP and the TST algorithm on the GPU, resulting in a dramatic speedup of two orders of magnitude, greatly increasing the searchable combinations and accelerating the pace of discovery.
In addition to acceleration of existing relative expression classifiers, I have developed a new classifier called the top-scoring ‘N’ algorithm (TSN). TSN is a generalized form of relative expression algorithm that uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification. TSN performs competitively against a wide variety of different classification methods while exhibiting low levels of overfitting on training data compared to other methods, giving confidence that results obtained during cross-validation will be more generally applicable to external validation sets. TSN preserves the strengths of other relative expression algorithms while allowing a much larger permutation and combination space to be explored, potentially improving classification accuracies when fewer measured features are available.
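
Illustrative note (not part of the thesis): the top-scoring pair idea named in the abstract can be sketched in a few lines. The sketch below assumes the commonly described formulation, in which each feature pair (i, j) is scored by the absolute difference between the two classes in the fraction of samples where feature i is lower than feature j. The function names train_tsp and predict_tsp and the toy data are hypothetical, tie-breaking between equally scored pairs is omitted, and this is a plain CPU version, not the GPU implementation the abstract describes. The exhaustive loop over all pairs also shows why the search becomes expensive as the number of features grows.

```python
from itertools import combinations


def train_tsp(samples, labels):
    """Exhaustively search all feature pairs for the top-scoring pair.

    samples: list of equal-length expression vectors, one per sample
    labels:  list of 0/1 class labels, one per sample
    Returns (i, j, class_if_i_less_than_j, score).
    """
    class0 = [s for s, y in zip(samples, labels) if y == 0]
    class1 = [s for s, y in zip(samples, labels) if y == 1]
    if not class0 or not class1:
        raise ValueError("both classes need at least one sample")

    n_features = len(samples[0])
    best = None
    # The number of pairs grows quadratically with the number of features.
    for i, j in combinations(range(n_features), 2):
        # Fraction of samples in each class where feature i < feature j.
        p0 = sum(s[i] < s[j] for s in class0) / len(class0)
        p1 = sum(s[i] < s[j] for s in class1) / len(class1)
        score = abs(p0 - p1)
        # Keep the first pair with the highest score (no tie-breaking rule).
        if best is None or score > best[3]:
            best = (i, j, 0 if p0 > p1 else 1, score)
    return best


def predict_tsp(model, sample):
    """Classify one sample using only the ordering of the chosen pair."""
    i, j, class_if_less, _ = model
    return class_if_less if sample[i] < sample[j] else 1 - class_if_less


# Toy data (hypothetical): feature 0 < feature 1 in class 0, reversed in class 1.
samples = [
    [1.0, 5.0, 0.5],
    [2.0, 6.0, 9.0],
    [7.0, 3.0, 5.0],
    [8.0, 2.0, 1.0],
]
labels = [0, 0, 1, 1]

model = train_tsp(samples, labels)
print(model)
print(predict_tsp(model, [1.5, 4.0, 2.0]))
print(predict_tsp(model, [9.0, 1.0, 3.0]))
```

Because the decision depends only on which of two values is larger within a single sample, any normalization that preserves within-sample ordering leaves the prediction unchanged, which is consistent with the robustness to normalization mentioned in the abstract.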
Issue Date: 2014-01-16
Rights Information: Copyright 2013, Andrew T. Magis
Date Available in IDEALS: 2014-01-16
Date Deposited: 2013-12
