class: center, middle # RNASeq analyses --- # RNASeq procedure Wang et al. Nat Rev Genetics. 2009. doi:[10.1038/nrg2484](http://dx.doi.org/10.1038/nrg2484) ![RNAseq_schema](images/RNASeq_schema.png "RNA Seq") --- # Multiple approaches 1. Genome sequenced, align RNAseq reads to genome 2. de novo Assembly of mRNA into transcripts 3. Quantify gene expression from reads aligned to genome or transcripts --- # Reads to Genome mapping ![SpliceAlign](images/dsv03901.jpeg "Spliced alignment") Tarraga et al 2017. DNA Research.[10.1093/dnares/dsv039](https://dx.doi.org/10.1093/dnares/dsv039) --- #Reads to Genome mapping Challenges: mRNA is spliced, genome contains introns Splice-aware short read aligners. Speed and accuracy tradeoffs * Tophat + Bowtie * HISAT/HISAT2 * GMAP/GSNAP * STAR --- # Quantify expression * Count reads overlapping exons * Table of total read counts per gene * Normalize counts for gene length and sequencing library depth * Gene expression then is FPKM - Fragments per Kilobase per Millions of reads * Tools: htseq-count, stringtie * BEDtools * R tools with iRanges --- # Evaluating expression differences Statistical tools for evaluating gene expression differences * Ballgown [bioconductor package](https://bioconductor.org/packages/release/bioc/html/ballgown.html) * DESeq [bioconductor package](https://bioconductor.org/packages/release/bioc/html/DESeq.html) * edgeR [bioconductor package](https://bioconductor.org/packages/release/bioc/html/edgeR.html) --- # Alternative approach for Quantifying Compare reads to __Transcripts__ instead of Genome * Kalisto and Sailfish are common tools * Bray et al 2016 "Near-optimal probabilistic RNA-seq quantification" doi:[10.1038/nbt.3519](http://dx.doi.org/10.1038/nbt.3519) * Patro et al 2014 "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms" doi:[10.1038/nbt.2862](http://dx.doi.org/10.1038/nbt.2862) --- # Alignment free quantification ```bash module load kallisto TRANSCRIPTS=Organism.transcripts.fasta CPU=2 for SAMPLE in SAMPLE1 SAMPLE2 SAMPLE3; do kallisto quant -i $TRANSCRIPTS -o kallito/sample1 -t $CPU --bias --single \ -l 300 --sd 20 $INDIR/${SAMPLE}.fastq.gz done ``` --- # Example code and pipelines https://github.com/stajichlab/C_lusitaniae_DHED1_RNAseq/tree/master/jobs --- # Denovo assembly [Trinity Assembler](http://trinityrnaseq.github.io/) for RNASeq ```bash $ module load trinity-rnaseq $ module switch perl/5.22.0 $ Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 8 --max_memory 20G ``` --- #ORF identification [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) * Finds Open Reading Frames in mRNA transcripts ```bash $ module load transdecoder $ TransDecoder.LongOrfs -t target_transcripts.fasta ``` --- #RNAseq read mapping Using HISAT2 ```bash # srun --ntasks 8 --pty bash -l $ mkdir rnaseq; cd rnaseq $ cp -r /bigdata/gen220/shared/projects/RNAseq ./ $ module load hisat2 $ cd genome $ ls -l $ hisat2-build yeast_genome.fasta yeast $ cd .. ``` --- # RNAseq read mapping ```bash $ hisat2 -x genome/yeast -1 fastq/yeast_RNASeq_1.fq -2 fastq/yeast_RNASeq_2.fq \ -S RNASeq_aln.sam -p 16 $ module load samtools $ samtools view -b RNASeq_aln.sam > RNASeq_aln.bam $ samtools sort RNASeq_aln.bam > RNASeq_aln.sort.bam $ samtools index RNASeq_aln.sort.bam $ samtools flagstat RNASeq_aln.sort.bam ``` --- # Get counts Subread - http://subread.sourceforge.net/ ```bash module load subread/1.6.0 CPUS=2 GENOME=genomefile.fasta GFF=genes.gff OUTFILE=read_counts.tab INFILE=RNASeq_aln.sort.bam featureCounts -g gene_id -T $CPUS -G $GENOME -s 0 -a $GFF -o $OUTFILE \ -F GTF $INFILE ```