Research

Our group develops and applies machine learning tools that uncover the genotype-phenotype relationship, that is, the relationship between the sequence of DNA base pairs that makes up an individual’s genome and that individual’s traits and diseases. We do so by drawing from and contributing to methods for probabilistic graphical models, deep neural networks, and optimization.


  • Review: Maxwell W. Libbrecht, William S. Noble. “Machine learning applications in genetics and genomics.” Nature Reviews Genetics, 16: 321-332, 2015. https://doi.org/10.1038/nrg3920

Annotating the epigenome through unsupervised probabilistic graphical models

Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with human diseases and traits. However, the vast majority of these associations are not backed by a hypothesized mechanism. In addition, genetic variants identified by GWAS, known as tag variants, are usually correlated with disease rather than causal for it. Thus, an important step in understanding the genotype-phenotype association is to identify the genomic elements (that is, the DNA words and sentences that make up the book that is our genome) that drive disease association. However, the annotation of genomic elements remains incomplete, hampering our ability to understand disease-associated variants. We are interested in developing integrative machine learning methods that use data from genomics assays to discover and catalog functional activity in the human genome.


Relevant papers:

  • Review: Maxwell W Libbrecht*, Rachel CW Chan*, Michael M Hoffman. “Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns”. PLoS Computational Biology. 17.10 (2021). https://doi.org/10.1371/journal.pcbi.1009423

  • Maxwell W. Libbrecht*, Oscar L. Rodriguez*, Zhiping Weng, Jeffrey A. Bilmes, Michael M. Hoffman, and William Stafford Noble. “A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types.” Genome Biology 20, no. 1 (2019): 1-14. https://dx.doi.org/10.1186%2Fs13059-019-1784-2

  • Maxwell W. Libbrecht, Michael Hoffman, Jeff Bilmes, William Stafford Noble. “Entropic graph-based posterior regularization.” Proceedings of the International Conference on Machine Learning (ICML) 2015. https://proceedings.mlr.press/v37/libbrecht15.html

  • Maxwell W. Libbrecht, Ferhat Ay, Michael M. Hoffman, David M. Gilbert, Jeffrey A. Bilmes, and William S. Noble. “Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression.” Genome Research, 25: 544-557, 2015. Named one of ISCB’s Top 10 Regulatory and Systems Genomics papers of 2015. https://dx.doi.org/10.1101%2Fgr.184341.114

Learning representations of the genome to enable perturbation and interpretation

Representation learning is a branch of machine learning that aims to summarize high-dimensional datasets into a low-dimensional representation that can be used for many downstream tasks. We are interested in applying these approaches to enable new solutions to genomic problems.


Relevant papers:

  • Kevin B. Dsouza, Alexandra Maslova, Ediem Al-Jibury, Matthias Merkenschlager, Vijay K. Bhargava, Maxwell W. Libbrecht. “Hi-C-LSTM: Learning representations of chromatin contacts using a recurrent neural network identifies genomic drivers of conformation”. https://www.biorxiv.org/content/10.1101/2021.08.26.457856v1.abstract

  • M. Sadegh Saberian, Kathleen P. Moriarty, Andrea D. Olmstead, Ivan R. Nabi, François Jean, Maxwell W. Libbrecht*, Ghassan Hamarneh*. “DEEMD: Drug Efficacy Estimation against SARS-CoV-2 based on cell Morphology with Deep multiple instance learning”. arXiv preprint: https://arxiv.org/abs/2105.05758

  • Habib Daneshpajouh*, Bowen Chen*, Neda Shokraneh Kenari, Shohre Masoumi, Kay C. Wiese, and Maxwell W Libbrecht. “Continuous chromatin state feature annotation of the human epigenome.” Biorxiv preprint: https://doi.org/10.1101/473017

  • Shohre Masoumi, Maxwell Libbrecht*, Kay Wiese*. “SigTools: Exploratory Visualization for Genomic Signals”. Bioinformatics, 2021. https://doi.org/10.1093/bioinformatics/btab742

  • Kevin Bradley Dsouza, Adam Yifan Li, Vijay K Bhargava, Maxwell W Libbrecht. “Latent representation of the human pan-celltype epigenome through a deep recurrent neural network”. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2021. https://doi.org/10.1109/TCBB.2021.3084147

Predicting drug resistance in pathogenic bacteria

Resistance of pathogenic bacteria to antibiotic drugs is a key global health risk. It is estimated that the total mortality due to drug resistance could exceed 10 million people a year by 2050. This risk can be mitigated by using the bacterial genome sequence to identify which drugs a given bacterial infection is resistant to. However, existing approaches for doing so have poor predictive accuracy, especially in situations involving novel drug resistance mechanisms or rarely used drugs. We are interested in developing machine learning approaches that remedy these limitations by predicting drug resistance from bacterial genome sequence for a variety of antibiotic drugs.


Relevant papers:

  • Einar Gabbassov, Miguel Moreno-Molina, Inaki Comas, Maxwell Libbrecht, Leonid Chindelevitch. “SplitStrains, a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGS data”. Microbial Genomics 7(6), 2021. https://doi.org/10.1099/mgen.0.000607

  • Amir Hosein Safari, Nafiseh Sedaghat, Hooman Zabeti, Alpha Forna, Leonid Chindelevitch*, and Maxwell Libbrecht*. “Predicting drug resistance in M. tuberculosis using a long-term recurrent convolutional network”. Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’21). 29, 1-10. 2021. https://doi.org/10.1145/3459930.3469534

  • Hooman Zabeti, Nick Dexter, Amir Hosein Safari, Nafiseh Sedaghat, Maxwell Libbrecht, and Leonid Chindelevitch. “An Interpretable Classification Method for Predicting Drug Resistance in M. Tuberculosis.” Proceedings of International Workshop on Algorithms in Bioinformatics (WABI). 2020. https://doi.org/10.4230/LIPIcs.WABI.2020.2

Principled normalization and selection of genomic data

Effective data analysis depends on having clean, well-behaved data sets. We are interested in developing computational methods that transform genomic data sets to facilitate downstream analyses.


Relevant papers:

  • Alice Yue, Cedric Chauve*, Maxwell Libbrecht*, Ryan Brinkman*. “Automated identification of maximal differential cell populations in flow cytometry data.” Cytometry. 2021; 1-8. https://doi.org/10.1002/cyto.a.24503

  • Faezeh Bayat, Maxwell W. Libbrecht. “Variance-stabilized units for sequencing-based genomic signals.” Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab457

  • Maxwell W. Libbrecht, Jeffrey A. Bilmes, William S. Noble. “Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.” Proteins: Structure, Function and Bioinformatics, 86(4):454-466, 2018. https://doi.org/10.1002/prot.25461

  • Kai Wei*, Maxwell W. Libbrecht*, Jeffrey A Bilmes, William S. Noble. “Choosing panels of genomics assays using submodular optimization.” Genome Biology, 2016, 17:229. https://doi.org/10.1186/s13059-016-1089-7