- Using Python, write a tool that randomly subsamples sequences. Given a FASTA sequence database file containing 100,000 sequences, generate a new file that is a random subset of these sequences, selecting only 10% of them. Make the 10% a program option so it is easy to change to 20%, etc.
See the script ‘rand_shuffle_seqs.py’ in the homework template
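As a starting point, the subsampling task can be sketched as below. This is a minimal sketch and not the contents of ‘rand_shuffle_seqs.py’; the function names, the `--fraction` and `--seed` options, and the choice to hold all records in memory (reasonable for 100,000 sequences) are assumptions made for illustration.

```python
import argparse
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample(records, fraction=0.10, seed=None):
    """Return a random subset holding `fraction` of the records, in input order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the subset reproducible
    k = round(len(records) * fraction)
    picked = sorted(rng.sample(range(len(records)), k))
    return [records[i] for i in picked]


def main(argv=None):
    """Hypothetical command-line entry point: input FASTA, output FASTA, fraction."""
    parser = argparse.ArgumentParser(description="Randomly subsample a FASTA file")
    parser.add_argument("infile")
    parser.add_argument("outfile")
    parser.add_argument("--fraction", type=float, default=0.10)
    parser.add_argument("--seed", type=int, default=None)
    args = parser.parse_args(argv)
    with open(args.infile) as handle:
        records = list(read_fasta(handle))
    with open(args.outfile, "w") as out:
        for header, seq in subsample(records, args.fraction, args.seed):
            out.write(f">{header}\n{seq}\n")
```

To use it as a script, call `main()` at the bottom of the file and run it with, for example, `--fraction 0.2` to keep 20% of the sequences.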
- Run an RNA-seq analysis to compute gene expression for two experiments. We will use data from a paper on light induction in the bacterium Prochlorococcus.
See this paper:
Thompson et al. PLoS One 2016; 11(10): e0165375 doi: 10.1371/journal.pone.0165375
Here is a view of the sequencing data. All data are on the cluster in this folder: /bigdata/gen220/shared/projects/HW4/hw4_template
You can also get the data directly from the web yourself: https://www.ebi.ac.uk/ena/data/view/PRJNA315575
The data for one timepoint: 0hrs light
Go and get the fastq files to process - see the download script
Light 0hr
The data for another timepoint: 4hrs light
The fastq files to process
Here is the Prochlorococcus marinus subsp. pastoris str. CCMP1986 genome: GCF_000011465.1. The RefSeq for this genome is at that link.
The genome can be obtained from this link
And the GFF annotation
- Use HISAT2, samtools, and StringTie to generate a table of expression values for these two experiments
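One possible way to organize the alignment and quantification steps is to build the command lines in Python and run them with `subprocess.run(cmd, check=True)`. The sketch below only builds the commands. It assumes single-end reads (HISAT2 `-U`; paired-end data would use `-1` and `-2` instead), and the function names, file names, and thread count are hypothetical.

```python
import shlex


def index_command(genome_fasta, index_base="genome_index"):
    """Command that builds the HISAT2 index once for the reference genome."""
    return ["hisat2-build", genome_fasta, index_base]


def pipeline_commands(sample, fastq, annotation_gff,
                      index_base="genome_index", threads=4):
    """Return the align, sort, and quantify commands for one sample."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # Align single-end reads to the indexed genome (assumption: single-end).
        ["hisat2", "-p", str(threads), "-x", index_base, "-U", fastq, "-S", sam],
        # Coordinate-sort the alignments; StringTie expects sorted input.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Estimate expression for annotated genes only (-e) and write a
        # per-gene abundance table (-A) that includes FPKM values.
        ["stringtie", bam, "-p", str(threads), "-G", annotation_gff, "-e",
         "-A", f"{sample}.gene_abund.tab", "-o", f"{sample}.gtf"],
    ]


# Print the commands for two hypothetical samples so they can be reviewed
# or pasted into a job script before anything is executed.
print(shlex.join(index_command("genome.fna")))
for name in ("light_0hr", "light_4hr"):
    for cmd in pipeline_commands(name, f"{name}.fastq.gz", "genome.gff"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check paths against the files in the homework folder before submitting the job to the cluster.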
- Write a Python script that generates a report with 8 columns of data: Gene Name, Chromosome, Start, End, Strand, Gene Length, FPKM exp1, FPKM exp2
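The report step could look roughly like the following sketch. It assumes the two inputs are StringTie gene abundance tables (the `-A` output), tab-delimited with the header columns Gene ID, Gene Name, Reference, Strand, Start, End, Coverage, FPKM, and TPM. It also assumes gene length is computed as End - Start + 1 from 1-based inclusive coordinates. The function names are hypothetical.

```python
import csv


def read_abundance(lines):
    """Map Gene ID -> row dict from the lines of a gene abundance table."""
    return {row["Gene ID"]: row for row in csv.DictReader(lines, delimiter="\t")}


def build_report(exp1, exp2):
    """Combine two abundance tables into 8-column report rows."""
    rows = []
    for gene_id in sorted(exp1.keys() & exp2.keys()):
        g1, g2 = exp1[gene_id], exp2[gene_id]
        start, end = int(g1["Start"]), int(g1["End"])
        # Fall back to the gene ID when no gene name is recorded.
        name = g1["Gene Name"] if g1["Gene Name"] not in ("", "-") else gene_id
        rows.append([name, g1["Reference"], start, end, g1["Strand"],
                     end - start + 1, float(g1["FPKM"]), float(g2["FPKM"])])
    return rows


def write_report(rows, handle):
    """Write the report as a tab-delimited table with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene Name", "Chromosome", "Start", "End", "Strand",
                     "Gene Length", "FPKM exp1", "FPKM exp2"])
    writer.writerows(rows)
```

A typical use would open the two abundance files, pass each file handle to `read_abundance`, and write the combined rows to a `.tsv` file that R or another plotting tool can read.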
- Using R or other tools, make these plots
- Plot gene expression from exp1 vs gene expression from exp2
- Plot gene length vs exp1 expression (FPKM). Is there a relationship?
- Plot gene location vs exp2 expression (FPKM). Is there any relationship?
- Print a list of genes that are more than 2-fold higher in the 4hr time point.
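For the final item, a minimal sketch of the fold-change filter is shown below. It assumes exp1 is the 0hr sample and exp2 is the 4hr sample, and it takes (gene name, FPKM exp1, FPKM exp2) tuples as input. The function name, the `min_fpkm` option, and the example gene names are hypothetical. The comparison is written as a multiplication so that genes with zero expression at 0hr do not cause a division by zero.

```python
def induced_genes(rows, fold=2.0, min_fpkm=0.0):
    """Return genes whose 4hr FPKM exceeds `fold` times their 0hr FPKM.

    rows: iterable of (gene_name, fpkm_0hr, fpkm_4hr) tuples.
    min_fpkm: optional floor on the 4hr FPKM to skip very weakly expressed genes.
    """
    hits = []
    for name, fpkm_0hr, fpkm_4hr in rows:
        if fpkm_4hr >= min_fpkm and fpkm_4hr > fold * fpkm_0hr:
            hits.append((name, fpkm_0hr, fpkm_4hr))
    # Strongest 4hr expression first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up values for illustration only; these are not results from the data.
example = [("geneA", 10.0, 25.0), ("geneB", 10.0, 20.0),
           ("geneC", 0.0, 5.0), ("geneD", 0.0, 0.0)]
for name, fpkm_0hr, fpkm_4hr in induced_genes(example):
    print(f"{name}\t{fpkm_0hr}\t{fpkm_4hr}")
```

Genes that go from zero to any nonzero FPKM pass the filter here; raising `min_fpkm` is one way to exclude them if they turn out to be noise.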