Workflows for genomics topic with tag ENCODE
ENCODE Transcription Factor and Histone ChIP-Seq processing pipeline. This ChIP-Seq pipeline is based on the ENCODE (phase-3) transcription factor and histone ChIP-seq pipeline specifications (by Anshul Kundaje) in this [Google Doc](https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit#).
This is the ENCODE-DCC RNA-sequencing pipeline. The scope of the pipeline is to align reads, generate signal tracks, and quantify genes and isoforms. RNA-seq data is valuable because it allows measurement of RNA expression levels as a transcriptional readout and the study of RNA structures, which helps to explain how RNA-based mechanisms affect gene regulation and thus disease and phenotypic variation. Since RNA populations are diverse, the ENCODE Consortium has developed the following RNA-seq pipelines:
This is the ENCODE-DCC long-read RNA-seq pipeline. It can handle data from both PacBio and Oxford Nanopore platforms. The pipeline aligns reads, corrects mismatches, microindels and non-canonical splice junctions, and then provides quantifications and QC metrics.
This is the ENCODE-DCC microRNA-sequencing pipeline. The scope of the pipeline is to trim adapters, align reads, generate signal tracks, and quantify genes. MicroRNA-seq allows researchers to characterize and quantify the expression and prevalence of the small non-coding RNA molecules known as microRNAs. These molecules may play an important role in disease, and significant effort is underway to understand their effects across a variety of tissue and cell types. For effective processing, the average insert size must be no more than 30 bases.
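The 30-base limit on average insert size can be checked before submitting data. The following is a minimal sketch, not part of the pipeline: it assumes the FASTQ records have already been adapter-trimmed, so that read length approximates insert size, and the function names `read_lengths` and `insert_size_ok` are hypothetical.

```python
from statistics import mean

# Threshold taken from the pipeline description: average insert size <= 30 bases.
MAX_MEAN_INSERT = 30


def read_lengths(fastq_lines):
    """Yield the sequence length of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of every record holds the sequence
            yield len(line.strip())


def insert_size_ok(fastq_lines, limit=MAX_MEAN_INSERT):
    """Return True when the mean read length does not exceed the limit.

    Assumption: reads are adapter-trimmed, so their length stands in
    for insert size.
    """
    lengths = list(read_lengths(fastq_lines))
    if not lengths:
        raise ValueError("no reads found")
    return mean(lengths) <= limit


# Two made-up 22-base records, used only for illustration.
example = [
    "@read1", "TGAGGTAGTAGGTTGTATAGTT", "+", "I" * 22,
    "@read2", "TAGCTTATCAGACTGATGTTGA", "+", "I" * 22,
]
print(insert_size_ok(example))
```

For a real file, the same functions can be fed an open text handle (for example from `gzip.open(path, "rt")`), since they only iterate over lines.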
This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq and DNase-seq data. The pipeline can be run on compute clusters with job submission engines as well as on standalone machines. It inherently makes use of parallelized/distributed computing. Installation is straightforward because most dependencies are installed automatically. The pipeline can be run end-to-end, starting from raw FASTQ files all the way to peak calling and signal track generation, using a single caper submit command. One can also start the pipeline from intermediate stages (for example, using alignment files as input). The pipeline supports both single-end and paired-end data as well as replicated or non-replicated datasets. The outputs produced by the pipeline include 1) formatted HTML reports that include quality control measures specifically designed for ATAC-seq and DNase-seq data, 2) analysis of reproducibility, 3) stringent and relaxed thresholding of peaks, 4) fold-enrichment and p-value signal tracks. The pipeline also supports detailed error reporting and allows for easy resumption of interrupted runs. It has been tested on some human, mouse and yeast ATAC-seq datasets as well as on human and mouse DNase-seq datasets. The ATAC-seq pipeline protocol specification is here (https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit?usp=sharing). Some parts of the ATAC-seq pipeline were developed in collaboration with Jason Buenrostro, Alicia Schep and Will Greenleaf at Stanford.
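As a rough illustration of the single-command workflow described above, the sketch below assembles an input description for a paired-end or single-end run and builds the corresponding submit command without executing it. The input key names (`atac.title`, `atac.genome_tsv`, `atac.paired_end`, `atac.fastqs_repN_R1`), the `-i` flag, and all file names are assumptions for illustration; check them against the pipeline's own documentation before use. The helper names `build_inputs` and `submit_command` are hypothetical.

```python
import json
import shlex


def build_inputs(title, genome_tsv, fastqs_r1, fastqs_r2=None):
    """Assemble an input dictionary for one or more replicates.

    fastqs_r1 / fastqs_r2 are lists of per-replicate FASTQ lists.
    Key names are assumed, not taken from the source text.
    """
    paired = fastqs_r2 is not None
    inputs = {
        "atac.title": title,
        "atac.genome_tsv": genome_tsv,
        "atac.paired_end": paired,
    }
    for rep, files in enumerate(fastqs_r1, start=1):
        inputs[f"atac.fastqs_rep{rep}_R1"] = files
    if paired:
        for rep, files in enumerate(fastqs_r2, start=1):
            inputs[f"atac.fastqs_rep{rep}_R2"] = files
    return inputs


def submit_command(wdl, inputs_json):
    """Return the argument list for a submit call (the -i flag is an assumption)."""
    return ["caper", "submit", wdl, "-i", inputs_json]


# Placeholder file names; nothing is read from or written to disk here.
inputs = build_inputs(
    "example run",
    "genome.tsv",
    [["rep1_R1.fastq.gz"]],
    [["rep1_R2.fastq.gz"]],
)
print(json.dumps(inputs, indent=2))
print(shlex.join(submit_command("atac.wdl", "input.json")))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` later, should one choose to execute it.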