1. Using Python, write a tool that randomly subsamples sequences. Given a FASTA sequence database file containing 100,000 sequences, generate a new file containing a random subset of these sequences, selecting only 10% of them. Make the 10% an option in the program so it is easy to change to 20%, etc.

See the script ‘rand_shuffle_seqs.py’ in the homework template
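
As a starting point, the subsampling step can be sketched with the standard library alone. This is a minimal sketch and not the contents of the template script: the function names (`read_fasta`, `subsample`, `main`) and the `--percent` and `--seed` options are hypothetical choices made for illustration. It assumes the whole file fits in memory, which is reasonable for 100,000 sequences.

```python
"""Sketch: write a random subset of the records in a FASTA file."""
import argparse
import random


def read_fasta(handle):
    """Yield (header, sequence_lines) pairs from an open FASTA file."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def subsample(records, fraction, seed=None):
    """Return a random subset with `fraction` of the records, in input order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the subset reproducible
    n_keep = round(len(records) * fraction)
    keep = sorted(rng.sample(range(len(records)), n_keep))
    return [records[i] for i in keep]


def main(argv=None):
    """Hypothetical command-line entry point; option names are assumptions."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("infile", help="input FASTA file")
    parser.add_argument("outfile", help="output FASTA file")
    parser.add_argument("-p", "--percent", type=float, default=10.0,
                        help="percentage of sequences to keep (default: 10)")
    parser.add_argument("--seed", type=int, default=None,
                        help="random seed for a reproducible subset")
    args = parser.parse_args(argv)

    with open(args.infile) as fh:
        records = list(read_fasta(fh))
    chosen = subsample(records, args.percent / 100.0, args.seed)
    with open(args.outfile, "w") as out:
        for header, lines in chosen:
            out.write(header + "\n")
            for seq_line in lines:
                out.write(seq_line + "\n")
```

To run it as a script, add the usual `if __name__ == "__main__": main()` guard at the bottom. Calling `main(["seqs.fasta", "subset.fasta", "--percent", "20"])` would then keep 20% of the records; both file names here are placeholders. Sampling indices without replacement and sorting them keeps each sequence at most once and preserves the original file order.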

2. Run an RNA-seq analysis to compute the gene expression for two experiments. We will use data from this paper on light induction in the bacterium Prochlorococcus.

See this paper:

You can also get the data directly from the web yourself: https://www.ebi.ac.uk/ena/data/view/PRJNA315575

The data for one timepoint: 0hrs light

Download the FASTQ files to process; see the download script.

Light 0hr

The data for another timepoint: 4hrs light

The fastq files to process

Here is the Prochlorococcus marinus subsp. pastoris str. CCMP1986 genome: GCF_000011465.1. The RefSeq for this genome is at that link.

Can be obtained from this link

And the GFF