Research Software from the Bioinformatics Core

Next Generation Sequencing Tools

RNA-seq

InPAS (2024)

» A Bioconductor package for identifying novel polyadenylation sites (PAS) from RNA-seq data

Description: Alternative cleavage and polyadenylation (APA) is a crucial post-transcriptional gene regulation mechanism that regulates gene expression in eukaryotes by increasing the diversity and complexity of both the transcriptome and proteome. Despite the development of more than a dozen experimental methods over the last decade to identify and quantify APA events, widespread adoption of these methods has been limited by technical, financial, and time constraints. Consequently, APA remains poorly understood in most eukaryotes. However, RNA sequencing (RNA-seq) technology has revolutionized transcriptome profiling and recent studies have shown that RNA-seq data can be leveraged to identify and quantify APA events. To fully capitalize on the exponentially growing RNA-seq data, we developed InPAS (Identification of Novel alternative PolyAdenylation Sites), an R/Bioconductor package for accurate identification of novel and known cleavage and polyadenylation sites (CPSs), as well as quantification of APA from RNA-seq data of various experimental designs.

Publication: Ou J, Liu H, Park S, Green MR, Zhu LJ. InPAS: An R/Bioconductor package for identifying novel polyadenylation sites and alternative polyadenylation from bulk RNA-seq data. Front Biosci (Schol Ed). 2024, 16(4):21.

CleanUpRNAseq (2024)

» A Bioconductor package for detecting and correcting DNA contamination in RNA-seq data

Description: RNA sequencing (RNA-seq) has become a standard method for profiling gene expression, yet genomic DNA (gDNA) contamination carried over to the sequencing library poses a significant challenge to data integrity. Detecting and correcting this contamination is vital for accurate downstream analyses. Particularly, when RNA samples are scarce and invaluable, it becomes essential not only to identify but also to correct gDNA contamination to maximize the data's utility. However, existing tools capable of correcting gDNA contamination are limited and lack thorough evaluation. To fill the gap, we developed CleanUpRNAseq, which offers a comprehensive set of functionalities for identifying and correcting gDNA-contaminated RNA-seq data.

Publication: Liu H, Hu K, O'Connor K, Kelliher MA, Zhu LJ. CleanUpRNAseq: An R/Bioconductor Package for Detecting and Correcting DNA Contamination in RNA-Seq Data. BioTech (Basel). 2024, 13(3):30.

OneStopRNAseq (2020)

» A web application for comprehensive and efficient analyses of RNA-seq data

Description: OneStopRNAseq has user-friendly interfaces and offers workflows for common types of RNA-seq data analyses, such as comprehensive data-quality control, differential analysis of gene expression, exon usage, alternative splicing, transposable element expression, allele-specific gene expression quantification, and gene set enrichment analysis.

Publication: Li, R.; Hu, K.; Liu, H.; Green, M.R.; Zhu, L.J. OneStopRNAseq: A Web Application for Comprehensive and Efficient Analyses of RNA-Seq Data. Genes 2020, 11, 1165.

ChIP-seq

ChIPpeakAnno (2010)

» A Bioconductor package for batch annotation of peaks identified from ChIP-seq, ChIP-chip, or any experiment that yields a large number of genomic intervals

Description: Batch annotation of peaks identified from ChIP-seq or ChIP-chip experiments. The package includes functions to retrieve the sequences around each peak, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA, or user-supplied custom feature, such as the most conserved elements or other transcription factor binding sites. This package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stats packages.
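
As a rough illustration of the nearest-feature idea described above, the following Python sketch assigns a peak to the closest annotated position. It is not ChIPpeakAnno's API or implementation; the function name `nearest_feature`, the use of peak midpoints, and the toy gene coordinates are all assumptions made for this example.

```python
from bisect import bisect_left


def nearest_feature(peak_mid, features):
    """Return (name, distance) of the feature closest to a peak midpoint.

    `features` is a non-empty list of (position, name) tuples sorted by
    position, e.g. transcription start sites on one chromosome.
    """
    positions = [pos for pos, _ in features]
    i = bisect_left(positions, peak_mid)
    # Only the features immediately left and right of the peak can be nearest.
    candidates = features[max(i - 1, 0):i + 1]
    pos, name = min(candidates, key=lambda f: abs(f[0] - peak_mid))
    return name, abs(pos - peak_mid)


# Hypothetical TSS coordinates, for illustration only.
tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
print(nearest_feature(4200, tss))  # ('geneB', 800)
```

A binary search over sorted positions keeps each lookup logarithmic in the number of features, which matters when many peaks are annotated in one batch. A real annotation would also need to handle chromosomes, strand, and feature widths.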

Publications:
Zhu L (2013). "Integrative analysis of ChIP-chip and ChIP-seq dataset." In Lee T, Luk ACS (eds.), Tiling Arrays, volume 1067, chapter 4, -19. Humana Press.

Zhu LJ*, Gazin C, Lawson ND, Pages H, Lin SM, Lapointe DS and Green MR. (2010) [* denotes corresponding author] ChIPpeakAnno: A Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics. 2010, 11:237.

ATAC-seq

scATACpipe (2022)

» A bioinformatic pipeline for single-cell ATAC-seq (scATAC-seq) data analysis

Description: Powered by Nextflow, scATACpipe enables users to perform end-to-end analysis of scATAC-seq data with three sub-workflow options for preprocessing that leverage 10x Genomics Cell Ranger ATAC software, the ultra-fast Chromap procedures, and a set of custom scripts implementing current best practices for scATAC-seq data preprocessing. The pipeline extends the R package ArchR for downstream analysis with added support for any eukaryotic species with an annotated reference genome.

Publication: Hu K, Liu H, Lawson ND, Zhu LJ. scATACpipe: A nextflow pipeline for comprehensive and reproducible analyses of single cell ATAC-seq data. Front Cell Dev Biol. 2022, 10:981859.

ATACseqQC (2018)

» A Bioconductor package for quality assessment of ATAC-seq data

Description: ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has a higher signal-to-noise ratio, and can be performed on small numbers of cells. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. The ATACseqQC package generates various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, the package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling.

Publication: Ou J, Liu H, Yu J, Kelliher MA, Castilla LH, Lawson ND, Zhu LJ. ATACseqQC: A Bioconductor package for post-alignment quality assessment of ATAC-seq data. BMC Genomics. 2018, 19(1):169.

ChIPpeakAnno (2010)

Publications:
Zhu L (2013). "Integrative analysis of ChIP-chip and ChIP-seq dataset." In Lee T, Luk ACS (eds.), Tiling Arrays, volume 1067, chapter 4, -19. Humana Press.

RED-seq

REDseq (2014)

» A Bioconductor package for analyzing high-throughput sequencing data processed by restriction enzyme digestion

Description: The package includes functions to build restriction enzyme cut site (RECS) map, distribute mapped sequences on the map with five different approaches, find enriched/depleted RECSs for a sample, and identify differentially enriched/depleted RECSs between samples.
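
To make the idea of a restriction enzyme cut site map concrete, here is a minimal Python sketch that lists cut positions for one recognition sequence in one DNA string. It is not REDseq's implementation; the function name `cut_site_map`, the 0-based coordinates, and the toy sequence are assumptions. EcoRI (G^AATTC) is used as the example enzyme, and because its site is palindromic, scanning a single strand is enough here.

```python
def cut_site_map(sequence, recognition_site, cut_offset=0):
    """Return 0-based cut positions for every occurrence of a recognition site.

    `cut_offset` is the number of bases between the start of the site and
    the cut, e.g. 1 for EcoRI, which cuts G^AATTC after the first base.
    """
    seq = sequence.upper()
    site = recognition_site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Advance by one base so overlapping sites are also reported.
        start = seq.find(site, start + 1)
    return cuts


# Toy sequence containing two EcoRI sites, for illustration only.
print(cut_site_map("AAGAATTCGGGAATTCTT", "GAATTC", cut_offset=1))  # [3, 11]
```

Mapped reads could then be assigned to entries of this list, for example by nearest position; the specific assignment approaches offered by the package are not reproduced here.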

Publication: Chen PB, Zhu LJ, Hainer SJ, McCannell KN, Fazzio TG. Unbiased chromatin accessibility profiling by RED-seq uncovers unique features of nucleosome variants in vivo. BMC Genomics. 2014, 15:1104.

NAD-seq

NADfinder (2019)

» A Bioconductor package for the bioinformatic analysis of NAD-seq data

Description: The nucleolus is an important structure inside the nucleus of eukaryotic cells. It is the site for transcribing rDNA into rRNA and for assembling ribosomes, a process known as ribosome biogenesis. In addition, nucleoli are dynamic hubs through which numerous proteins shuttle and contact specific non-rDNA genomic loci. Deep sequencing analyses of DNA associated with isolated nucleoli (NAD-seq) have shown that specific loci, termed nucleolus-associated domains (NADs), form frequent three-dimensional associations with nucleoli. NAD-seq has been used to study the biological functions of NADs and the dynamics of NAD distribution during embryonic stem cell (ESC) differentiation. NADfinder is the first software designed specifically for the bioinformatic analysis of NAD-seq data, including baseline correction, smoothing, normalization, peak calling, and annotation.

Publication: Vertii A, Ou J, Yu J, Yan A, Liu H, Zhu LJ, Kaufman PD (2019). Two contrasting classes of nucleolus-associated domains in mouse fibroblast heterochromatin. Genome Research. 2019, 29:1235.

CRISPR Tools

GS-Preprocess (2021)

» A bioinformatic pipeline that generates input data for the GUIDEseq Bioconductor package

Description: GUIDEseq (GS)-Preprocess is a simple, 5-argument pipeline that generates input data for the GUIDEseq Bioconductor package from raw Illumina sequencer output. For off-target profiling, Bioconductor GUIDEseq requires only a 2-line guide RNA FASTA file, demultiplexed BAM files of "plus"- and "minus"-strands, and Unique Molecular Index (UMI) references for each read. The latter two are produced by GS-Preprocess.

Publication: Rodríguez TC, Dadafarin S, Pratt HE, Liu P, Amrani N, Zhu LJ. Genome-wide detection and analysis of CRISPR-Cas off-targets. Prog Mol Biol Transl Sci. 2021; 181:31-43.

GUIDEseq (2017)

» A Bioconductor package for identifying off-targets with GUIDE-seq data

Description: The package implements the GUIDE-seq analysis workflow in a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to nuclease platforms that differ in the length and complexity of their guide and PAM recognition sequences or in their DNA cleavage position. They also enable users to customize sequence aggregation criteria and to vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions.

Publication: Zhu LJ, Lawrence M, Gupta A, Pages H, Kucukural A, Garber M, Wolfe SA. GUIDEseq: A Bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases. BMC Genomics. 2017, 18(1).

CRISPRseek (2014)

» A Bioconductor package for designing target-specific guide RNAs in CRISPR-Cas9 genome-editing systems

Description: The package includes functions to find potential guide RNAs (gRNAs) for input target sequences; optionally filter out gRNAs without a restriction enzyme cut site or without a paired gRNA; search genome-wide for off-targets; score and rank them; fetch flanking sequences; and indicate whether the target and off-targets are located in an exon region. Potential gRNAs are annotated with the total score of the top 5 and top N off-targets, detailed top N mismatch sites, restriction enzyme cut sites, and paired gRNAs. If GeneRfold is installed, the minimum free energy and bracket notation of the secondary structure of the gRNA and the gRNA backbone constant region are included in the summary file. This package leverages the Biostrings and BSgenome packages.
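
The first step described above, finding potential guide RNAs in a target sequence, can be sketched in a few lines of Python. This is not CRISPRseek's code or API. The function name `find_guides` is hypothetical, and the sketch assumes a 20-nt protospacer followed by an NGG PAM (the common SpCas9 configuration) and scans the forward strand only.

```python
import re


def find_guides(sequence, guide_len=20):
    """Return (start, protospacer, pam) for forward-strand NGG candidates."""
    seq = sequence.upper()
    # A lookahead lets overlapping candidates all be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Toy target: a 20-nt protospacer followed by the PAM "AGG".
target = "TT" + "ACGTACGTACGTACGTACGT" + "AGG" + "TT"
for start, guide, pam in find_guides(target):
    print(start, guide, pam)  # 2 ACGTACGTACGTACGTACGT AGG
```

A fuller search would also scan the reverse complement and then score each candidate against genome-wide off-target matches, which is where most of the work described above takes place.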

Publications:
Zhu LJ. Overview of guide RNA design tools for CRISPR-Cas9 genome editing technology. Front. Biol. 2015, 10(4).

Zhu LJ, Holmes BR, Aronin N, Brodsky MH. CRISPRseek: A Bioconductor package to identify target-specific guide RNAs for CRISPR-Cas9 genome-editing systems. PLoS One. 2014, 9(9).

Transcription Factor Binding / Motif Analysis Tools

motifStack (2018)

» A Bioconductor package for visualizing motif alignment and analyzing transcription factor binding site evolution

Description: This package is for the visualization of the alignment of motifs as a phylogenetic tree in different layout types. This tool facilitates the analysis of binding site diversity and conservation within families of TFs and the evolution of TFs among different species. motifStack can align DNA motifs; generate motif signatures for closely related motifs; and plot aligned motifs as a stack, a linear or a radial tree, or a word cloud of sequence logos. Different parameter settings can be used to generate diverse types of plots with color schema highlighting important data features.

This package is part of a pipeline for finding candidate binding sites for known transcription factors via sequence matching.

Publication: Ou J, Wolfe SA, Brodsky MH, Zhu LJ. motifStack for the analysis of transcription factor binding site evolution. Nature Methods. 2018, 15, 8-9.

Fly Factor Survey (2010)

» A database of Drosophila transcription factor DNA-binding specificities

Description: The FlyFactorSurvey database summarizes a project using the bacterial one-hybrid method to systematically describe the binding site preferences of transcription factors in Drosophila melanogaster.

Publication: Zhu LJ, Christensen RG, Kazemian M, Hull CJ, Enuameh MS, Basciotta MD, Brasefield JA, Zhu C, Asriyan Y, Lapointe DS, Sinha S, Wolfe SA, Brodsky MH. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2010, 39(Database issue): D111-D117.

GeneNetworkBuilder

» A Bioconductor package for building a regulatory network from ChIP-chip/ChIP-seq and expression data

Description: GeneNetworkBuilder (GNB) is a web application for discovering the transcriptional regulatory network of a given transcription factor in species such as Caenorhabditis elegans and Homo sapiens, using ChIP-chip or ChIP-seq data combined with gene expression profiles from either RNA-seq or expression microarray experiments.

EZ_weblogo

» A website to create a motif logo for a transcription factor

Proteomics Tools

dagLogo (2020)

» A Bioconductor package to find and visualize significantly enriched or depleted amino acid sequence patterns in a proteome dataset

Description: dagLogo visualizes significant conserved amino acid sequence patterns in groups based on probability theory. In addition to implementing iceLogo in R to visualize differential amino acid sequence patterns, dagLogo can test and visualize significant amino acid group patterns by classifying the amino acids into groups according to charge, chemistry, hydrophobicity, and other properties.

Publication: Ou J, Liu H, Nirala NK, Stukalov A, Acharya U, Green MR, Zhu LJ. dagLogo: An R/Bioconductor package for identifying and visualizing differential amino acid group usage in proteomics data. PLoS One. 2020, 15(11): e0242030.

PolyA Site Identification Tools

InPAS (2024)

» A Bioconductor package for identifying novel polyadenylation sites (PAS) from RNA-seq data

cleanUpdTSeq (2013)

» A Bioconductor package to classify putative polyA sites as true or false (internally oligo(dT)-primed)

Description: cleanUpdTSeq cleans up artifacts from polyadenylation sites identified in oligo(dT)-mediated 3' end RNA sequencing data. This package uses a naïve Bayes classifier (from e1071) to assign probability values to putative polyadenylation sites (pA sites) based on training data from zebrafish. This allows the user to separate true, biologically relevant pA sites from false, oligo(dT)-primed pA sites.
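
The classification idea above can be illustrated with a small, self-contained naïve Bayes sketch in Python. It is not cleanUpdTSeq's code and does not use e1071. The feature set (the nucleotide at each position of a short flanking window), the function names `train` and `posterior`, and the six toy training sequences are assumptions chosen only to show how per-class probabilities are obtained.

```python
import math

ALPHABET = "ACGT"


def train(seqs, labels):
    """Fit position-specific nucleotide frequencies per class (add-one smoothing)."""
    length = len(seqs[0])
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: [dict.fromkeys(ALPHABET, 1) for _ in range(length)] for c in classes}
    for seq, label in zip(seqs, labels):
        for i, nt in enumerate(seq):
            counts[label][i][nt] += 1
    return prior, counts


def posterior(seq, model):
    """Return the posterior probability of each class for one sequence."""
    prior, counts = model
    log_score = {}
    for c in prior:
        total = math.log(prior[c])
        for i, nt in enumerate(seq):
            column = counts[c][i]
            total += math.log(column[nt] / sum(column.values()))
        log_score[c] = total
    # Normalize in log space to avoid underflow on longer windows.
    top = max(log_score.values())
    norm = sum(math.exp(v - top) for v in log_score.values())
    return {c: math.exp(v - top) / norm for c, v in log_score.items()}


# Toy windows: A-rich flanks stand in for internally primed (false) sites.
seqs = ["GTGTCT", "TGTGTC", "CTGTGT", "AAAAAA", "AAAGAA", "AAAAGA"]
labels = ["true", "true", "true", "false", "false", "false"]
model = train(seqs, labels)
print(posterior("AAAAAA", model))
```

The returned probabilities can be thresholded to keep only sites that are confidently classified as true; the threshold and the window definition would need to be chosen for real data.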

Publication: Sheppard S, Lawson ND* and Zhu LJ*. [* denotes co-corresponding authors] Accurate identification of polyadenylation sites from 3' end deep sequencing using a naïve Bayes classifier. Bioinformatics. 2013, 29(20):2564.

Data Visualization Tools

dagLogo (2020)

trackViewer (2019)

» A Bioconductor package with minimalist design for plotting elegant track layers

Description: This package visualizes multi-omics data and can be integrated into any analysis pipeline in R. trackViewer can be used not only to visualize coverage and annotation tracks, but also to generate lollipop and dandelion plots that depict sparse and dense methylation/mutation/variant data to facilitate an integrative analysis of diverse datasets. The updated trackViewer (versions 1.19.27 and higher) also has a web interface in addition to the R programming interface. Furthermore, with the ‘browseTracks’ function, users can generate interactive figures whose features can be easily customized by clicking, dragging, and typing.

Publication: Ou J, Zhu LJ. trackViewer: A Bioconductor package for interactive and integrative visualization of multi-omics data. Nature Methods. 2019, 16:453–454.

motifStack (2018)

» A Bioconductor package for visualizing motif alignment and analyzing transcription factor binding site evolution

This package is part of a pipeline for finding candidate binding sites for known transcription factors via sequence matching.

Publication: Ou J, Wolfe SA, Brodsky MH, Zhu LJ. motifStack for the analysis of transcription factor binding site evolution. Nature Methods. 2018, 15, 8-9.

Machine Learning Tools

cleanUpdTSeq (2013)

Statistical Analysis Tools

StepReg

» An R package designed to streamline stepwise regression analysis while promoting best practices

Description: Stepwise regression is commonly used for model selection, but the lack of a unified tool supporting various regression types, selection strategies, and metrics complicates its proper use. This study introduces StepReg, an R package designed to streamline stepwise regression analysis while promoting best practices. StepReg supports multiple regression types, integrates popular selection methods, and includes key selection metrics. Additionally, StepReg allows users to select multiple strategies and metrics for efficient model selection, visualize variable selection, and export results in various formats. However, StepReg should not be used for statistical inference unless the variable selection process is explicitly accounted for, because inference that ignores the selection step can be invalid. This issue does not arise when StepReg is used for prediction. StepReg was validated using public datasets in SAS to ensure accuracy and features an interactive Shiny application to enhance its accessibility.
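
For readers unfamiliar with stepwise selection, the following Python sketch shows one common variant: forward selection for ordinary least squares, scored by AIC. It is not StepReg's implementation or interface. The function names, the choice of AIC computed as n·ln(RSS/n) + 2k, and the toy data are assumptions, and the solver assumes predictors are not collinear and that no model fits the data exactly.

```python
import math


def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit with intercept (normal equations)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))


def aic(rss, n, k):
    """AIC for a Gaussian linear model with k coefficients, up to a constant."""
    return n * math.log(rss / n) + 2 * k


def forward_select(predictors, y):
    """Add the predictor that lowers AIC most until no addition improves it."""
    n = len(y)
    selected = []
    best = aic(ols_rss([], y), n, 1)
    while True:
        trials = []
        for name in predictors:
            if name not in selected:
                cols = [predictors[s] for s in selected + [name]]
                trials.append((aic(ols_rss(cols, y), n, len(cols) + 1), name))
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected


# Toy data: y depends on x1 only; x2 is an unrelated alternating pattern.
predictors = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [1, -1, 1, -1, 1, -1, 1, -1]}
y = [3.5, 5.5, 8.5, 12.5, 15.5, 17.5, 20.5, 24.5]
print(forward_select(predictors, y))  # ['x1']
```

Because the same data are used both to choose variables and to fit the final model, p-values from that final fit are optimistic, which is the inference caveat noted in the description above.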