Web servers: Puffin | Puffin-D | Orca | Sei | Multiplexer
Puffin: Deep-learning-inspired explainable sequence model of transcription Initiation
Puffin is an interpretation-focused sequence model for transcription initiation in the human genome which is also applicable to other mammalian species. Puffin can predict basepair-resolution transcription initiation signals using only genomic sequence as input, and more importantly, analyze the sequence basis of any transcription start site (TSS) at motif and basepair levels.
Puffin-D is a prediction-focused deep learning sequence model for basepair-resolution transcription initiation signals. Puffin-D takes a 100 kb input sequence and accurately predicts the transcriptional signal strength.
The GitHub repository is here. To reproduce the analyses in our manuscript or to find the training code, please visit our manuscript repo. The manuscript is available here. For most use cases, we highly recommend our user-friendly web server puffin.zhoulab.io (or tss.zhoulab.io, whichever is easier to remember), which runs Puffin in your browser with interactive visualization. The prediction-focused deep learning model Puffin-D is available at puffind.zhoulab.io.
DDSM: Dirichlet Diffusion Score Model for biological sequence generation
Dirichlet Diffusion Score Model (DDSM) is a continuous-time diffusion framework designed specifically for modeling discrete data such as biological sequences. We introduce a diffusion process defined in probability simplex space whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We demonstrated applications to Sudoku solving and promoter sequence design. The GitHub repository is here. The manuscript is here.
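As a rough illustration of the state space described above, the sketch below shows that one-hot encoded nucleotides are vertices of the probability simplex and that a Dirichlet sample (the stationary distribution named above) lies in the same simplex. It does not implement the diffusion process or the score model. The flat concentration parameters and all function names are assumptions made for this example and are not taken from the DDSM code.

```python
import random

NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Return the simplex vertex (one-hot vector) for a nucleotide."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def on_simplex(p, tol=1e-9):
    """Check that p is a probability vector: non-negative entries summing to 1."""
    return all(x >= 0.0 for x in p) and abs(sum(p) - 1.0) < tol


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sequence = "ACGT"

# Clean data: each position is a vertex of the 4-category simplex.
vertices = [one_hot(b) for b in sequence]

# Fully noised data: each position is an independent Dirichlet draw.
# Assumption: flat concentration (1, 1, 1, 1), chosen only for illustration.
noise = [sample_dirichlet([1.0] * len(NUCLEOTIDES), rng) for _ in sequence]

print(all(on_simplex(p) for p in vertices + noise))
```

Both the clean one-hot vectors and the noise samples pass the same simplex check, which is the sense in which a diffusion defined on the simplex can connect discrete data to a continuous distribution.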
Orca: predict genome 3D interactions from kilobase to whole-chromosome scale from sequence
Orca is a deep learning sequence modeling framework for multiscale genome interaction prediction from kilobase to beyond whole-chromosome scale. Orca allows predicting genome structural impacts of any genomic variants, including very large structural variants, and designing virtual genetic screens to probe the sequence basis of genome 3D organization. The GitHub repository is here. The webserver is available here. The publication is here.
Orca-leukemia update (2023): a version of Orca for leukemia-related cell lines is now also available in the webserver, and a GitHub repository for scoring the regulatory impact of structural variants in leukemia subtypes is available here.
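Predicting the impact of a structural variant from sequence starts with constructing the altered sequence. The sketch below shows one simple way to apply a deletion or an inversion to a reference sequence in silico. It is not the Orca API: the function names and the 0-based, end-exclusive coordinates are assumptions made for this example, and the resulting sequences would still need to be passed to a trained model.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_deletion(seq, start, end):
    """Remove seq[start:end] (0-based, end-exclusive)."""
    return seq[:start] + seq[end:]


def apply_inversion(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    segment = seq[start:end]
    inverted = segment.translate(COMPLEMENT)[::-1]
    return seq[:start] + inverted + seq[end:]


reference = "AAACCGTTT"
deleted = apply_deletion(reference, 3, 6)
inverted = apply_inversion(reference, 3, 6)

print(deleted)
print(inverted)
```

Comparing model predictions for the reference sequence and for the altered sequence is then one way to estimate the structural impact of the variant.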
Sei: A sequence-based global map of regulatory activity for deciphering human genetics
Sei is a framework for systematically predicting sequence regulatory activities and applying sequence information to human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes, and each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone mark, and chromatin accessibility profiles across a wide range of cell types). Sei has been used for interpreting tissue-specific regulatory signals from GWAS and predicting the effects of individual pathogenic variants. The GitHub repository, the webserver, and the publication are available from the URLs above.
Quasildr: An analytical framework for interpretable and generalizable single-cell omics data analysis
GraphDR is a method for single-cell data visualization and representation; StructDR is a method for generalized trajectory inference. The unique features of these methods are 1. linear interpretability, which is similar to PCA and eases comparisons across datasets, 2. statistical inference of confidence sets for the trajectories, and 3. unified inference of cluster (0-dimensional), trajectory (1-dimensional), and surface (2-dimensional) structures. GraphDR is also scalable to more than 10 million cells. The Python package for interpretable exploratory analysis of single-cell omics data is available here. The publication is here.
ExPecto: tissue-specific gene expression effect prediction for human mutations
ExPecto is a framework for ab initio sequence-based prediction of the gene expression effects and disease risks of mutations. The GitHub repository is here. Precomputed variant effects can be browsed and downloaded here. You can access deep learning sequence model-based variant chromatin effect prediction from the webserver here (thanks to the Flatiron Institute Genomics team!).
Selene: a PyTorch-based deep learning library for sequence data
Selene is a Python library and command-line interface for developing deep neural networks for biological sequence data. Our aim for Selene is to accelerate the development and application of deep learning sequence models in biology. The development of Selene is led by Kathy Chen. The GitHub repository is here.
ASD: Deep learning sequence model-based mutation regulatory effects prediction for ASD mutations
To analyze the impact of regulatory mutations on disease, we developed deep learning sequence models for molecular-level effects at the chromatin and RNA-binding protein levels, together with Disease Impact Scores that summarize these molecular-level effects. The code is available here. Precomputed ASD mutation effects can be browsed and downloaded here.
DeepSEA: Deep learning-based algorithmic framework for predicting chromatin effects
DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single-nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivity, and histone marks in multiple cell types. It can further be used to predict the chromatin effects of sequence variants and to prioritize regulatory variants. The code is available here. The DeepSEA webserver is available here.
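The general idea behind sequence-based variant effect prediction can be sketched as comparing a model's output for the reference sequence with its output for the sequence carrying the alternative allele. The example below is a minimal sketch of that comparison only. The scoring function is a hypothetical stand-in (a logistic function of GC content), not the DeepSEA network, and the exact effect score DeepSEA reports may differ from the simple difference used here.

```python
import math


def toy_profile_score(seq):
    """Hypothetical stand-in for a trained model: a score in (0, 1).

    Assumption: a logistic function of GC fraction, used only so the
    example runs without a trained network.
    """
    gc_fraction = sum(1 for b in seq if b in "GC") / len(seq)
    return 1.0 / (1.0 + math.exp(-(4.0 * gc_fraction - 2.0)))


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("variant position is outside the sequence")
    if alt not in "ACGT":
        raise ValueError("alternative allele must be one of A, C, G, T")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(seq, pos, alt, model=toy_profile_score):
    """Predicted effect: model(alternative sequence) - model(reference sequence)."""
    ref_score = model(seq)
    alt_score = model(apply_snv(seq, pos, alt))
    return alt_score - ref_score


reference = "ATATATATAT"
effect = variant_effect(reference, 0, "G")
print(effect)
```

Swapping `toy_profile_score` for a real model's prediction function keeps the same structure: one forward pass per allele, followed by a comparison of the two outputs.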
FIND: Drosophila embryonic development tissue-stage-specific gene expression predictions
We developed a machine learning method to provide genome-wide, quantitative spatiotemporal gene expression predictions. The method, structured in silico nanodissection, is a lineage-aware probabilistic graphical model that predicts gene expression in more than 200 tissue-developmental stages. FIND (Fly in silico nanodissection) is a webserver for exploring the gene expression predictions. The code is available here.