Reconstruction of demographic history from complete genome sequences.

Several years ago, we developed a statistical method based on the theoretical framework of the coalescent for reconstructing the demographic history of complex, structured populations from DNA sequence data. Our method, called G-PhoCS (for Generalized Phylogenetic Coalescent Sampler), uses Markov chain Monte Carlo (MCMC) techniques to explore coalescent genealogies consistent with a particular population phylogeny, allowing for gene flow between designated populations. It produces Bayesian estimates of the key parameters that define these population phylogenies, such as the divergence times between populations and the effective sizes of ancestral populations. More recently, we developed another method, ARGweaver, that generalizes G-PhoCS by capturing the manner in which recombination alters genealogies along the genome sequence. ARGweaver samples complete “ancestral recombination graphs,” or ARGs, within an approximate framework known as the sequentially Markov coalescent (SMC), using techniques based on hidden Markov models for efficient MCMC sampling that scales to complete mammalian genomes. Together, G-PhoCS and ARGweaver allowed us to detect significant evidence of gene flow from modern humans into eastern Neandertals, which suggests the existence of an early migration of modern humans out of Africa (more than 100,000 years ago). Melissa Hubisz, a PhD student in the group, has recently implemented a demography-aware version of ARGweaver and is using this method to analyze gene flow among ancient and modern hominins. Additionally, Hussein Hijazi, a postdoc in the group, is applying ARGweaver to study introgression and selective sweeps in bird populations.
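For readers unfamiliar with the coalescent, the following is a minimal illustrative sketch in Python of the basic process that underlies methods of this kind. It is not the G-PhoCS or ARGweaver implementation. It assumes the simplest textbook setting (a single constant-size population with no gene flow or recombination), and the function names are hypothetical. While k lineages remain, the waiting time to the next coalescent event is exponentially distributed with rate k(k-1)/2, with time measured in units of 2N generations.

```python
import random


def simulate_coalescent_times(n_samples, rng):
    """Return the waiting times between successive coalescent events.

    Times are in units of 2N generations (standard Kingman coalescent,
    constant population size; an assumption of this sketch).
    """
    times = []
    k = n_samples
    while k > 1:
        rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
        times.append(rng.expovariate(rate))
        k -= 1  # each event merges two lineages into one
    return times


def mean_tmrca(n_samples, n_reps, seed=0):
    """Average time to the most recent common ancestor over many genealogies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        total += sum(simulate_coalescent_times(n_samples, rng))
    return total / n_reps


if __name__ == "__main__":
    # Theory gives an expected TMRCA of 2 * (1 - 1/n) in these units.
    print(mean_tmrca(n_samples=10, n_reps=20000))
```

Samplers such as those described above go well beyond this sketch: they explore genealogies conditional on observed sequence data and a population phylogeny, rather than simulating them without data.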

Related links:

  • Our paper on the G-PhoCS method and its application to complete human genome sequences
  • Our paper describing the ARGweaver method and its application to complete human genome sequences
  • Our paper on interbreeding between modern and archaic hominins
  • Blog post by Adam Siepel on ARGs and ARGweaver
  • News article on our work studying human-Neandertal interbreeding. See the “Press” tab for additional news coverage.

Analysis of natural selection on regulatory sequences in the human genome.

We have a longstanding interest in characterizing the influence of natural selection on DNA sequences, particularly in noncoding regions of the genome. Building on earlier work, we recently developed a method, called LINSIGHT, that combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution to estimate the fitness effects of mutations in noncoding regions of the human genome. LINSIGHT is fast and scalable, enabling it to exploit the “Big Data” available in modern genomics. Yifei Huang, a former postdoc in the lab, showed that LINSIGHT is highly predictive of variants associated with inherited diseases. In addition, he applied LINSIGHT to an atlas of human enhancers and showed that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters. In parallel, Brad Gulko, a PhD student in the group, devised a related algorithm, called FitCons2, that builds a decision tree by repeatedly splitting classes of genomic sites in a manner that is guaranteed to increase a global measure of the “information” associated with natural selection. Brad applied FitCons2 to all the data from the Roadmap Epigenomics project to produce 115 maps of cell-type-specific “fitness consequence” (fitCons) scores. His analysis suggests that around 8% of nucleotide sites are constrained by natural selection. We are also collaborating with Michael Purugganan at NYU and Ed Buckler at Cornell to predict the fitness consequences of mutations in rice, maize, and other crops.
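To give a flavor of the tree-building idea, here is a minimal Python sketch of a single greedy split chosen by information gain. It is a generic decision-tree step using Shannon entropy, not the FitCons2 algorithm or its information measure. The feature names, the binary "polymorphic" label, and the function names are all hypothetical and used only for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a binary outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def fraction_polymorphic(sites):
    """Fraction of sites in a class that carry the (hypothetical) label."""
    return sum(1 for _, is_poly in sites if is_poly) / len(sites)


def information_gain(sites, feature):
    """Entropy reduction from splitting a class of sites on one binary feature.

    `sites` is a list of (features_dict, is_polymorphic) pairs.
    """
    on = [s for s in sites if s[0][feature]]
    off = [s for s in sites if not s[0][feature]]
    if not on or not off:
        return 0.0  # the feature does not actually divide this class
    parent = entropy(fraction_polymorphic(sites))
    children = (
        len(on) * entropy(fraction_polymorphic(on))
        + len(off) * entropy(fraction_polymorphic(off))
    ) / len(sites)
    return parent - children


def best_split(sites, features):
    """Pick the feature whose split yields the largest information gain."""
    return max(features, key=lambda f: information_gain(sites, f))


if __name__ == "__main__":
    # Toy data: hypothetical features "open_chromatin" and "transcribed".
    toy_sites = [
        ({"open_chromatin": True, "transcribed": True}, False),
        ({"open_chromatin": True, "transcribed": False}, False),
        ({"open_chromatin": False, "transcribed": True}, True),
        ({"open_chromatin": False, "transcribed": False}, True),
    ]
    print(best_split(toy_sites, ["open_chromatin", "transcribed"]))
```

Applying such a step recursively to each resulting class yields a tree whose leaves group sites with similar signatures. Because the gain computed this way is never negative, each split can only maintain or increase the total information captured, which loosely mirrors the guarantee described above.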

Transcriptional regulation and its evolution in primates.

For several years, our research program in transcriptional regulation has focused on developing new methods for interpreting the rich nascent RNA sequencing data generated using the powerful GRO-seq (Global Run-On and sequencing) protocol or its higher-resolution successor, PRO-seq. In collaboration with Charles Danko at Cornell, we recently made use of PRO-seq to carry out the first comparative study of nascent transcription in primates. This approach allowed us to directly measure active transcription separately from post-transcriptional processes. Our overall findings suggest a pervasive role for evolutionary compensation across ensembles of enhancers that jointly regulate target genes. Additionally, Noah Dukler, a PhD student in the group, made use of PRO-seq data to study the dynamics of transcriptional activation following treatment with the natural medicinal compound celastrol. Now a postdoc in the group, Noah is working on a project with our Cornell collaborators to develop new probabilistic methods for characterizing the gain and loss of regulatory elements along the branches of a phylogeny.

Related links:

  • Our paper on the comparative study of nascent transcription in primates
  • Our paper on transcriptional activation following treatment with celastrol

Most of the research described on this page has been supported by NIH grants R01-GM102192, R01-HG007070, and R35-GM127070. In addition, several of these projects were initially launched with the help of early career awards from the David & Lucile Packard Foundation, Microsoft Research, the Alfred P. Sloan Foundation, and the National Science Foundation (NSF).