Title: Computational approaches for analyzing regulatory regions in the human genome
Author(s): Zhang, Yang
Director of Research: Ma, Jian
Doctoral Committee Chair(s): Ma, Jian
Doctoral Committee Member(s): Belmont, Andrew; Stubbs, Lisa; Sinha, Saurabh; Warnow, Tandy
Department / Program: Bioengineering
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Regulatory elements; Somatic mutations; Chromatin organization
Abstract: The cis-regulatory elements (CRE) in the human genome play a critical role in transcriptional regulation. Alterations of CRE have long been considered the driving force of human evolution. Recent studies also suggest that somatic mutations within CRE can act as drivers in cancer. As the quantity and variety of genomic data grow, more biological functions of CRE remain to be determined. However, the bottleneck in analyzing human CRE is the lack of holistic approaches for integrating multi-dimensional genomic data and exploring their roles in human biology. The primary motivation of this dissertation is to provide novel insights into human CRE by developing integrative computational approaches that consolidate diverse genomic data. Specifically, we developed three computational algorithms/frameworks to decipher the complex functions of human CRE: 1) in the evolutionary context of comparative genomics, 2) in the context of cancer somatic mutations, and 3) in the context of 3D chromatin organization. In the first part of this dissertation, we explored the functions of lineage-specific transcription factor binding sites (TFBS) in the human genome using ANTICE, a novel probabilistic algorithm that we developed. Compared to previous methods, ANTICE can predict lineage-specific TFBS under a phylogenetic model while also accounting for the uncertainty of the multiple sequence alignment (MSA) and the high turnover rate of TFBS. Using ANTICE, we generated by far the largest genome-wide landscape of lineage-specific TFBS, based on 680 human ChIP-seq datasets from the ENCODE project. We then integrated the lineage-specific human TFBS with public genomic data. We discovered that a substantial fraction of human TFBS have emerged after the human-mouse divergence. 
Younger TFBS, compared to ancestral TFBS conserved between human and mouse, tend to be located farther from gene promoters, are more likely to be involved in tissue-specific open chromatin regions, and are enriched for common SNPs and germline mutations. Our study provides the first genome-wide resource of the locations of lineage-specific TFBS in the human genome, which can help to explain how humans became human in the course of evolution. In the second part of this dissertation, we developed an integrated analysis framework to explore how the genomic context of CRE influences UV-induced mutagenesis at CRE in melanoma patients. We discovered that the rate of C-to-T mutations at tumor-specific DNase I hypersensitive sites (DHS) is significantly associated with genomic features of the DHS: the distance of the DHS to the transcription start site (TSS), the DNA sequence composition of the DHS, and the H3K4me3 signal of the DHS. We also found that these genomic features often jointly determine the landscape of the mutation rate at DHS. Within DHS regions, somatic mutations are enriched at binding sites with CGGAAT or CTCF motifs, which suggests potential positive selection at CRE in melanoma. Finally, we showed that the profiles of the mutation rate at CRE can be accurately predicted from genomic data alone using a random forest model. Our study provides a generic computational approach to prioritize the top players behind the heterogeneity of the cancer mutation rate at CRE, which can not only explain the biological basis of the mutation variations but also provide a baseline for identifying driver mutations at CRE. In the last part of this dissertation, we developed a software package named Norma for processing NGS datasets generated by TSA-seq, a novel technique to measure the three-dimensional cytological distance of chromatin to a specific nuclear structure. 
Norma is an all-in-one software package for TSA-seq data processing, covering the mapping of raw NGS reads, the calculation of enrichment scores, and integrated analyses of the enrichment scores with functional genomic data. Leveraging Norma, we determined the spatial organization of the chromatin in K562 cells relative to the nuclear speckle and the nuclear lamina. We found that genomic features such as histone modifications, highly expressed genes, and many sequence features such as GC content are well organized along the axis from the nuclear speckle to the nuclear lamina. The cytological distances predicted computationally by Norma are also consistent with microscopy validations. We believe Norma will become a useful tool for analyzing TSA-seq data. In conclusion, in this dissertation, we have provided three computational approaches for integrating diverse genomic data on CRE and gained novel insights into the functions of human CRE.
Issue Date: 2017-12-08
Rights Information: Copyright 2017 Yang Zhang
Date Available in IDEALS: 2018-03-13
Date Deposited: 2017-12