Reconstruction of demographic history from complete genome sequences.

Several years ago, we developed a statistical method based on coalescent theory for reconstructing the demographic history of complex, structured populations from DNA sequence data.  Our method, called G-PhoCS (for Generalized Phylogenetic Coalescent Sampler), uses Markov chain Monte Carlo techniques to explore coalescent genealogies consistent with a particular population phylogeny, allowing for gene flow between designated populations.  It produces Bayesian estimates of the key parameters that define these population phylogenies, such as the divergence times between populations and the effective sizes of ancestral populations.  The author of G-PhoCS, former postdoc Ilan Gronau, originally used the method to estimate the date of origin of one of the earliest-branching extant human populations, the San hunter-gatherers of Southern Africa.  Ilan has continued to refine the G-PhoCS method and, in collaboration with other research groups, has now used it to shed light on the demographic histories of various other organisms, including dogs and wild canids, birds from the genus Sporophila, and archaic hominins (Neanderthals and Denisovans).
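
To give a feel for the coalescent framework mentioned above, here is a minimal Python sketch of the standard single-population (Kingman) coalescent, which is the kind of generative model a coalescent sampler explores.  It is not G-PhoCS and does not model population structure, divergence, or gene flow; the function name and the sample size are illustrative assumptions.

```python
import random

def simulate_coalescence_times(n_samples, rng):
    """Waiting times between successive coalescent events for n_samples
    lineages in one constant-size population (Kingman's coalescent).
    Times are in units of 2N generations, N being the diploid effective size."""
    times = []
    k = n_samples
    while k > 1:
        # With k lineages, any of the k*(k-1)/2 pairs may coalesce next.
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
        k -= 1
    return times

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10                  # hypothetical sample size
tmrca = [sum(simulate_coalescence_times(n, rng)) for _ in range(20000)]
mean_tmrca = sum(tmrca) / len(tmrca)
expected_tmrca = 2 * (1 - 1 / n)  # standard theoretical expectation
print(mean_tmrca, expected_tmrca)
```

The simulated mean time to the most recent common ancestor should fall close to the theoretical value 2(1 - 1/n).  Inference runs in the opposite direction: given sequence data, a sampler searches for genealogies and population parameters that make the data probable.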

Related links:

  • Our paper on the G-PhoCS method and its application to complete human genome sequences
  • News article at Cornell on this study
  • News and Views article by Jonathan Pritchard covering our paper and a related one by Li and Durbin
  • Collaborative paper on demography inference in dogs and wild canids, in which G-PhoCS played a central role
  • Blog post by Ilan Gronau on the dog paper

Inference of ancestral recombination graphs on a genome-wide scale.

G-PhoCS and methods like it “cheat” by considering only short, widely spaced genomic sequences and ignoring the difficult problem of modeling recombination.  Ideally, however, one would consider not only the process of coalescence (finding common ancestry) at each locus in a genome, but also the manner in which historical recombination events alter these genealogies along the genome sequence.  This combined history of coalescence and recombination can be explicitly represented by a structure known as an “ancestral recombination graph,” or ARG.  Unfortunately, reconstructing an ARG from sequence data is notoriously difficult, and ARG inference has not been widely used in applied population genomics.  Recently, Matt Rasmussen in the group developed an algorithm for sampling ARGs within an approximate framework known as the sequentially Markov coalescent (SMC).  Matt’s method, called ARGweaver, uses techniques from hidden Markov models to repeatedly “thread” individual sequences through an ARG, leading to a Gibbs sampler over the space of ARGs.  ARGweaver is the first ARG inference method efficient enough to apply to complete mammalian genomes.  Matt has shown that it works remarkably well on simulated data and that it reveals clear signatures of natural selection in real human genome sequences.  Melissa Hubisz, a PhD student in the group, is now working on extending ARGweaver into a full method for demography inference.

Related links:

  • Our paper describing the ARGweaver method and its application to complete human genome sequences
  • Blog post by Adam Siepel on ARGs and ARGweaver

Analysis of natural selection on regulatory sequences in the human genome.

We have a longstanding interest in characterizing the influence of natural selection on DNA sequences, particularly in noncoding regions of the genome.  Most of our work in this area has involved comparisons of complete mammalian genomes and, hence, has considered evolutionary processes spanning tens to hundreds of millions of years.  Recently, however, we have become interested in integrating this phylogenetic information with data on human polymorphism, to gain insight into more recent evolutionary events.  Toward this end, we have developed a probabilistic model and inference method, called INSIGHT, that makes use of joint patterns of divergence and polymorphism to shed light on recent natural selection.  INSIGHT focuses in particular on disentangling the influence of positive and negative selection on collections of short, interspersed genomic elements.  Two postdocs in the group, Leo Arbiza and Ilan Gronau, have used INSIGHT to show that natural selection has profoundly influenced transcription factor binding sites across the genome during the past five million years of evolution.  Binding sites are enriched for both adaptive substitutions and weakly deleterious polymorphisms compared with protein coding sequences, and appear to dominate the genetic load associated with deleterious polymorphisms.

Calculation of probabilities of fitness consequences for mutations across the human genome.

The INSIGHT method (above) provides an estimate of the fraction of nucleotides under natural selection in any given collection of genomic elements.  These same estimates can alternatively be interpreted as probabilities that mutations falling in the given elements will have fitness consequences.  A few years ago, we realized that this property could be used to produce “fitness consequences” (fitCons) scores across the entire human genome.  Using high-throughput data from the ENCODE project, we first partition the genome into classes of sites having characteristic functional genomic “fingerprints” in a given cell type.  We then use INSIGHT to calculate a fitCons score for each fingerprint.  Finally, we plot these scores along the genome sequence.  Brad Gulko, a PhD student in the group, has implemented this approach and produced fitCons scores for three human cell types.  These scores turn out to be remarkably powerful for identifying cis-regulatory elements, and they are highly complementary to standard evolutionary conservation scores in revealing hidden functional elements.  We have also used fitCons scores to obtain estimates of the fraction of nucleotides in the genome that influence fitness.  Brad is currently working on an improved version of this scoring system that will accommodate larger and more diverse collections of functional genomic data.
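
The three steps described above (partition sites by fingerprint, score each fingerprint class, map scores back along the genome) can be sketched in Python as follows.  This is only an illustration of the bookkeeping, not the fitCons implementation: the fingerprint labels, the toy data, and the function `toy_class_score` are hypothetical, and `toy_class_score` is a crude stand-in for INSIGHT, which is a probabilistic model of divergence and polymorphism.

```python
from collections import defaultdict

def partition_by_fingerprint(fingerprints):
    """Step 1: group genomic positions by their functional-genomic fingerprint."""
    classes = defaultdict(list)
    for pos, fp in enumerate(fingerprints):
        classes[fp].append(pos)
    return dict(classes)

def toy_class_score(positions, constrained):
    """Step 2 (placeholder, NOT INSIGHT): fraction of positions in the class
    that are flagged as constrained in the toy data."""
    return sum(constrained[p] for p in positions) / len(positions)

def score_genome(fingerprints, constrained):
    """Step 3: give every position the score of its fingerprint class."""
    classes = partition_by_fingerprint(fingerprints)
    per_class = {fp: toy_class_score(ps, constrained) for fp, ps in classes.items()}
    return [per_class[fp] for fp in fingerprints]

# Hypothetical toy input: one fingerprint label and one 0/1 flag per position.
fingerprints = ["open", "open", "closed", "open", "closed", "closed"]
constrained = [1, 0, 0, 1, 0, 1]
scores = score_genome(fingerprints, constrained)
print(scores)
```

Because the score is estimated once per class and then shared by every site in that class, all positions with the same fingerprint receive the same value.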

Related links:

  • Our paper introducing the fitCons approach and applying it in an analysis of three human cell types
  • Blog post by Adam Siepel on the story behind this project
  • News article at CSHL on our paper in Nature Genetics

Identification and characterization of enhancers using GRO-seq.

For several years, we have been working closely with John Lis’s group on methods for interpreting data generated using their powerful GRO-seq (Global Run-On and sequencing) technology, which maps the positions of engaged RNA polymerases across the genome.  It has gradually become clear that an unanticipated benefit of GRO-seq and derived technologies is that they are uniquely well suited for detecting so-called enhancer RNAs (or eRNAs) and, consequently, for identifying active enhancers and other regulatory elements in mammalian cells.  Recently, Andre Martins and Leighton Core led a major project in which we systematically compared patterns in DNA sequences and chromatin near the sites of transcription initiation in both annotated genes and regulatory elements.  We found that the architecture of transcription initiation was remarkably similar at these regions and proposed a unified model for enhancers and promoters.  Under this model, the key distinctions between these regions occur in downstream steps, which cause protein-coding mRNAs to become stable, while other RNAs are rapidly degraded by the cell.

A former postdoc, Charles Danko, has taken this work a step further by developing a new machine-learning method, called dREG, for detecting regulatory elements from standard GRO-seq or PRO-seq data, based on their characteristic patterns of transcription initiation.  As a result, it is now possible to detect regulatory elements and measure transcriptional activity using a single experimental assay.

Additional analyses of GRO-seq data.

Our work on the analysis and interpretation of GRO-seq data has taken us down several other paths and led to a substantial body of work that is not yet published.  For example, former PhD student Andre Martins has developed new hidden Markov model-based methods for transcription unit calling from GRO-seq data, which he is currently preparing for publication.  In his new laboratory at Cornell, former postdoc Charles Danko is completing an ambitious comparative analysis of regulatory elements in humans, chimpanzees, and rhesus macaques based on PRO-seq data for CD4+ T-cells collected from three individuals of each species.  And Noah Dukler, a new PhD student in the lab, is examining the dynamics of regulatory cascades in K562 cells based on time courses of PRO-seq data after induction with a small molecule called celastrol.

Most of the research described on this page has been supported by NIH grants GM102192 and HG007070.  In addition, several of these projects were initially launched with the help of early career awards from the David & Lucile Packard Foundation, Microsoft Research, the Alfred P. Sloan Foundation, and the National Science Foundation (NSF).