Install rMATS:
- Download and install pysam (rMATS was tested with v0.9.1.4)
- Download and install samtools (version 1.2 or later)
- Download and install STAR (version 2.5 or later)
- Add the Python directory to the $PATH environment variable
- Add the STAR directories to the $PATH environment variable
- Add the samtools directory to the $PATH environment variable
- Obtain a STAR genome index for your genome in either of the following two ways:
  - Download pre-built STAR indexes if using Human (hg38, hg19) or Mouse (mm10). STAR indexes are large (>70 GB zipped), so downloading could take a while.
  - Build your own STAR index from the genome FASTA sequence, following the STAR manual.
- Untar rMATS and the STAR indexes. For example, assuming you use pre-built STAR genome indexes, unpack rMATS.3.2.5.tgz and STARindex.tgz in your home directory:
cd
tar -xzf rMATS.3.2.5.tgz
cd
tar -xzf STARindex.tgz
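Before running the test, it can help to confirm that the tools added to $PATH above are actually visible. The following is a minimal sketch, not part of rMATS: the helper `find_missing` and the executable names in `REQUIRED_TOOLS` are assumptions made for illustration.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["python", "samtools", "STAR"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be located on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on $PATH.")
```

The `which` argument exists only so the lookup can be replaced in a test; in normal use the default `shutil.which` searches $PATH.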
Test rMATS:
Run testRun.sh as shown below to test that rMATS runs properly.
cd ~/rMATS.3.2.5
./testRun.sh ~/STARindex/hg19
The two outputs can be found in the fastqTest and bamTest directories. The test run output should match the rMATS output description.
Trim Fastq (Optional):
To trim the poor quality 3' end of reads, use the trimFastq.py script found in the bin directory.
python trimFastq.py input.fastq trimmed.fastq desired_length
cd ~/rMATS.3.2.5/
python bin/trimFastq.py testData/231EV.25K.rep-1.R1.fastq testData/trimmed.fastq 32
The above command trims the reads in 231EV.25K.rep-1.R1.fastq to 32 bp by removing sequence from the 3' end of each read, then saves the result to trimmed.fastq.
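The trimming step described above can be sketched as follows. This is an illustration of the idea only, not the code of trimFastq.py: the function `trim_fastq` is hypothetical and assumes the simple four-lines-per-record FASTQ layout.

```python
def trim_fastq(lines, length):
    """Yield FASTQ lines with sequence and quality cut to `length` characters.

    Keeping the first `length` characters removes sequence from the 3' end.
    Assumes each record occupies exactly four lines.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Lines 2 and 4 of each record hold the sequence and the quality string.
        if i % 4 in (1, 3):
            line = line[:length]
        yield line


record = ["@read1", "ACGTACGTAC", "+", "IIIIIIIIII"]
for out_line in trim_fastq(record, 4):
    print(out_line)
```

Reads that are already shorter than the desired length pass through unchanged in this sketch.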
Alternative Splicing Events
rMATS analyzes skipped exon (SE), alternative 5' splice site (A5SS), alternative 3' splice site (A3SS), mutually exclusive exons (MXE), and retained intron (RI) events. Possible alternative splicing events are identified from the RNA-Seq data and annotation of transcripts in GTF format. The following is a list of provided GTF files found in the gtf directory:
- Human, Homo sapiens (Ensembl or UCSC Known Genes)
- Mouse, Mus musculus (Ensembl or UCSC Known Genes)
- Drosophila, Drosophila melanogaster (FlyBase)
- C. elegans, Caenorhabditis elegans (RefSeq)
- Zebrafish, Danio rerio (RefSeq)
Alternatively, you can download your own transcript annotation in GTF format. However, the first column (chromosome/contig name) in the GTF must match the sequence names in your STARindex.
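A quick way to check the naming requirement is to compare the first column of the GTF against the sequence names of the index. The sketch below is hypothetical and not part of rMATS; it assumes the index sequence names are available as a list of strings (for example, read from a name list such as chrName.txt in the STAR index folder, which is an assumption about your index layout).

```python
def gtf_sequence_names(gtf_lines):
    """Collect the first-column (chromosome/contig) names from GTF lines."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_names(gtf_lines, index_names):
    """Return GTF sequence names that are absent from the index name list."""
    known = {name.strip() for name in index_names if name.strip()}
    return sorted(gtf_sequence_names(gtf_lines) - known)


gtf = [
    "#example annotation",
    "chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";",
    "2\tsrc\texon\t100\t200\t.\t+\t.\tgene_id \"g2\";",
]
print(unmatched_names(gtf, ["chr1", "chr2"]))
```

A non-empty result, such as a GTF that uses `2` where the index uses `chr2`, indicates names that would need to be reconciled before running rMATS.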
Using rMATS:
The following is a detailed description of the options used with rMATS.
Usage:
Running with fastq:
python RNASeq-MATS.py -s1 rep1_1[:rep1_2][,rep2_1[:rep2_2]]* -s2 rep1_1[:rep1_2][,rep2_1[:rep2_2]]* -gtf gtfFile -bi STARindexFolder -o outDir -t readType -len readLength [options]*
Running with bam:
python RNASeq-MATS.py -b1 s1_rep1.bam[,s1_rep2.bam]* -b2 s2.rep1.bam[,s2.rep2.bam]* -gtf gtfFile -o outDir -t readType -len readLength [options]*
Required Parameters:
-s1 rep1_1[:rep1_2][,rep2_1[:rep2_2]]* | FASTQ file(s) for the sample_1. For the paired-end data, two files must be colon separated and replicates must be in a comma separated list (Only if using fastq) |
-s2 rep1_1[:rep1_2][,rep2_1[:rep2_2]]* | FASTQ file(s) for the sample_2. For the paired-end data, two files must be colon separated and replicates must be in a comma separated list (Only if using fastq) |
-b1 s1_rep1.bam[,s1_rep2.bam] | Mapping results for the sample_1 in bam format. Replicates must be in a comma separated list (Only if using bam) |
-b2 s2.rep1.bam[,s2.rep2.bam] | Mapping results for the sample_2 in bam format. Replicates must be in a comma separated list (Only if using bam) |
-t readType | Type of read used in the analysis. readType is either 'paired' or 'single'. 'paired' is for paired-end data and 'single' is for single-end data |
-len <int> | The length of each read |
-gtf gtfFile | An annotation of genes and transcripts in GTF format |
-bi STARIndexFolder | The folder name of the STAR binary indexes (i.e., the name of the folder that contains SA file). For example, use ~/STARindex/hg19 for hg19. (Only if using fastq) |
-o outDir | The output directory |
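The colon and comma conventions for -s1/-s2 and -b1/-b2 described above can be sketched as follows. The helper names `format_fastq_arg` and `format_bam_arg` are hypothetical and the file names are placeholders; the sketch only builds the argument strings and does not run rMATS.

```python
def format_fastq_arg(replicates):
    """Build a -s1/-s2 value.

    Each replicate is a tuple of one file (single-end) or two files
    (paired-end). Mates are joined with ':' and replicates with ','.
    """
    return ",".join(":".join(files) for files in replicates)


def format_bam_arg(bam_files):
    """Build a -b1/-b2 value: replicates joined with ','."""
    return ",".join(bam_files)


paired = [("rep1_1.fastq", "rep1_2.fastq"), ("rep2_1.fastq", "rep2_2.fastq")]
print(format_fastq_arg(paired))
print(format_bam_arg(["s1_rep1.bam", "s1_rep2.bam"]))
```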
Optional:
-a <int> | The "anchor length" or "overhang length" used in the aligner. At least "anchor length" nucleotides (nt) must be mapped to each end of a given junction. The default is 1. (This parameter applies only if using fastq) |
-c <float> | The cutoff splicing difference. The cutoff used in the null hypothesis test for differential splicing. The default is 0.0001 for 0.01% difference. Valid: 0 ≤ cutoff < 1 |
-analysis analysisType | Type of analysis to perform. analysisType is either 'P' or 'U'. 'P' is for paired analysis and 'U' is for unpaired analysis. The default is 'U' |
-libType libraryType | Library type. Default is unstranded (fr-unstranded). Use fr-firststrand or fr-secondstrand for strand-specific data. |
-novelSS <0 or 1> | Detection of novel (unannotated) splice sites. 0 disables detection of novel splice sites and 1 enables it. The default is 0 (no detection). |
-keepTemp | Enables rMATS to keep its temporary files generated during the run. Temporary files are generally for debugging. The default is to delete all temporary files. |
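The value constraints listed above can be checked before launching a run. The following is a minimal sketch under the assumption that you collect the option values yourself; `check_options` is a hypothetical helper and does not reflect how rMATS validates its own arguments.

```python
VALID_ANALYSIS = {"P", "U"}
VALID_LIB_TYPES = {"fr-unstranded", "fr-firststrand", "fr-secondstrand"}


def check_options(cutoff=0.0001, analysis="U", lib_type="fr-unstranded", novel_ss=0):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if not (0 <= cutoff < 1):
        problems.append("-c must satisfy 0 <= cutoff < 1")
    if analysis not in VALID_ANALYSIS:
        problems.append("-analysis must be 'P' or 'U'")
    if lib_type not in VALID_LIB_TYPES:
        problems.append("-libType must be one of " + ", ".join(sorted(VALID_LIB_TYPES)))
    if novel_ss not in (0, 1):
        problems.append("-novelSS must be 0 or 1")
    return problems


# The defaults mirror the documented defaults, so this reports no problems.
print(check_options())
```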
Examples:
Example using fastq, performing paired analysis:
python RNASeq-MATS.py -s1 testData/231ESRP.25K.rep-1.R1.fastq:testData/231ESRP.25K.rep-1.R2.fastq,testData/231ESRP.25K.rep-2.R1.fastq:testData/231ESRP.25K.rep-2.R2.fastq -s2 testData/231EV.25K.rep-1.R1.fastq:testData/231EV.25K.rep-1.R2.fastq,testData/231EV.25K.rep-2.R1.fastq:testData/231EV.25K.rep-2.R2.fastq -gtf gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf -bi ~/STARindex/hg19 -o out_test -t paired -len 50 -a 8 -c 0.0001
Example using bam, performing unpaired analysis:
python RNASeq-MATS.py -b1 testData/231ESRP.25K.rep-1.bam,testData/231ESRP.25K.rep-2.bam -b2 testData/231EV.25K.rep-1.bam,testData/231EV.25K.rep-2.bam -gtf gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf -o bam_test -t paired -len 50 -c 0.0001 -analysis U -libType fr-firststrand
Output:
All output files are written to the output directory given by -o (outDir):
- MATS_output: A folder that contains rMATS output of AS events. Each output file is sorted by P-values in ascending order.
  - AS_Event.MATS.JunctionCountOnly.txt evaluates splicing with only reads that span splicing junctions
    - IJC_SAMPLE_1: inclusion junction counts for SAMPLE_1, replicates are separated by comma
    - SJC_SAMPLE_1: skipping junction counts for SAMPLE_1, replicates are separated by comma
    - IJC_SAMPLE_2: inclusion junction counts for SAMPLE_2, replicates are separated by comma
    - SJC_SAMPLE_2: skipping junction counts for SAMPLE_2, replicates are separated by comma
  - AS_Event.MATS.ReadsOnTargetAndJunctionCounts.txt evaluates splicing with reads that span splicing junctions and reads on target (striped regions on home page figure)
    - IC_SAMPLE_1: inclusion counts for SAMPLE_1, replicates are separated by comma
    - SC_SAMPLE_1: skipping counts for SAMPLE_1, replicates are separated by comma
    - IC_SAMPLE_2: inclusion counts for SAMPLE_2, replicates are separated by comma
    - SC_SAMPLE_2: skipping counts for SAMPLE_2, replicates are separated by comma
  - Important columns contained in both types of output files
    - IncFormLen: length of inclusion form, used for normalization
    - SkipFormLen: length of skipping form, used for normalization
    - IncLevel1: inclusion level for SAMPLE_1 replicates (comma separated) calculated from normalized counts
    - IncLevel2: inclusion level for SAMPLE_2 replicates (comma separated) calculated from normalized counts
    - IncLevelDifference: average(IncLevel1) - average(IncLevel2)
- summary.txt: A file that contains summary of statistically significant AS events and the identity of each replicate
- ASEvents: A folder that contains all possible alternative splicing (AS) events derived from GTF and RNA
- SAMPLE_1/REP_N: A folder that contains mapping results of sample_1, replicate N
  - accepted_hits.bam is the original tophat output containing both multi-mapped and uniquely mappable reads.
  - unique.S1.sam contains uniquely mappable reads only. rMATS uses uniquely mappable reads.
- SAMPLE_2/REP_N: A folder that contains mapping results of sample_2, replicate N
  - accepted_hits.bam is the original tophat output containing both multi-mapped and uniquely mappable reads.
  - unique.S2.sam contains uniquely mappable reads only. rMATS uses uniquely mappable reads.
- commands.txt: A list of key commands executed
- log.RNASeq-MATS: Log file for running rMATS pipeline
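To make the column definitions above concrete, the sketch below recomputes IncLevelDifference from the comma-separated IncLevel1 and IncLevel2 values, and shows one way an inclusion level could be derived from length-normalized counts. The function names are hypothetical. The formula in `inclusion_level` is an assumption about what "calculated from normalized counts" means (each count divided by its form length), and skipping non-numeric entries such as `NA` is also an assumption.

```python
def parse_levels(field):
    """Parse a comma-separated column value into floats."""
    values = []
    for item in field.split(","):
        try:
            values.append(float(item))
        except ValueError:
            continue  # assumption: placeholders such as "NA" are ignored
    return values


def inc_level_difference(inc_level1, inc_level2):
    """average(IncLevel1) - average(IncLevel2), or None if a sample has no values."""
    a, b = parse_levels(inc_level1), parse_levels(inc_level2)
    if not a or not b:
        return None
    return sum(a) / len(a) - sum(b) / len(b)


def inclusion_level(inc_count, skip_count, inc_form_len, skip_form_len):
    """Assumed length-normalized inclusion level for one replicate."""
    inc = inc_count / inc_form_len
    skip = skip_count / skip_form_len
    if inc + skip == 0:
        return None  # no supporting reads for either form
    return inc / (inc + skip)


print(inc_level_difference("0.8,0.6", "0.3,0.1"))
print(inclusion_level(10, 5, 100, 50))
```

Applied to the IncLevel1 and IncLevel2 columns of a row in either output file, `inc_level_difference` should reproduce that row's IncLevelDifference up to rounding, provided the assumption about non-numeric entries holds for your data.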