GEN220 2021 edition
This assignment calls for two scripts. Starter code for both is in the template.
Write a program called squared_cubed.py that prints out three columns of data, ideally separated by tabs.
The first line written should be a header that labels the columns:
N Squared Cubed
Column 1: numbers 0 -> 30
Column 2: Square (x^2) of column 1
Column 3: Cube (x^3) of column 1
Output should look like this (but going up to at least 30 in the N column):
N Squared Cubed
0 0 0
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125
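One possible shape for this script is sketched below. It is only a starting point, not the template's code: the function name power_table and its max_n parameter are made up for illustration.

```python
def power_table(max_n=30):
    """Return the table as a list of tab-separated lines, header first."""
    lines = ["N\tSquared\tCubed"]
    # range() stops before its end value, so add 1 to include max_n itself
    for n in range(max_n + 1):
        lines.append(f"{n}\t{n**2}\t{n**3}")
    return lines


if __name__ == "__main__":
    print("\n".join(power_table()))
```

Building the lines in a function and printing them at the end keeps the arithmetic easy to check separately from the output formatting.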
We will compute some statistics for a GFF file, which lists the locations of genes and exons in a genome annotation. Remember that GFF is a structured, tab-delimited format that describes the locations of features in a genome.
Recall that eukaryotic genes are made up of features: exons, introns, and untranslated regions (UTRs). Some exons are coded as ‘CDS’ for CoDing Sequences, e.g. the ones that code for proteins.
See the Wikipedia gene page, and the view of gene structure in particular.
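As a rough illustration of how the feature types can be pulled out of a GFF file, the sketch below splits each line on tabs and groups feature lengths by the type column. It assumes the standard nine-column GFF layout (sequence ID, source, type, start, end, score, strand, phase, attributes). The function name feature_lengths and the example records are invented for this sketch and are not taken from the real file.

```python
def feature_lengths(gff_lines):
    """Map each feature type (column 3) to a list of feature lengths."""
    lengths = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature row
        feature_type = cols[2]
        start, end = int(cols[3]), int(cols[4])
        # GFF coordinates are 1-based and inclusive, hence the +1
        lengths.setdefault(feature_type, []).append(end - start + 1)
    return lengths


# Made-up records for illustration only
example_lines = [
    "##gff-version 3",
    "contig_1\tdemo\tgene\t100\t500\t.\t+\t.\tID=gene1",
    "contig_1\tdemo\texon\t100\t250\t.\t+\t.\tID=exon1;Parent=gene1",
    "contig_1\tdemo\texon\t300\t500\t.\t+\t.\tID=exon2;Parent=gene1",
    "contig_1\tdemo\tCDS\t120\t250\t.\t+\t0\tID=cds1;Parent=gene1",
]

lengths = feature_lengths(example_lines)
for feature_type in sorted(lengths):
    values = lengths[feature_type]
    mean = sum(values) / len(values)
    print(f"{feature_type}\t{len(values)}\t{sum(values)}\t{mean:.1f}")
```

To run the same function on a real file, pass an open file handle in place of the example list, since iterating over a file yields its lines.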
Here is a GFF file for the genome of Penicillium chrysosporium, the fungus that gave us one of the first antibiotics. The FungiDB database hosts genome sequences and data files for a collection of fungi.
The GFF file is available here: FungiDB-54_PchrysosporiumRP-78.gff, and the FASTA-format genome assembly is FungiDB-54_PchrysosporiumRP-78_Genome.fasta. These two files hold the gene locations and the sequence data, respectively.
Write a script called genome_stats.py to:
The template uses a curl command from within your script. If this doesn’t make sense to you, you can remove it.
Hints:
The aspairs() function returns a dictionary where the keys are sequence IDs and the values are the DNA sequences of the contigs.
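To show how a dictionary of this shape can be used, here is a minimal sketch. The read_fasta helper is a hypothetical stand-in written for this example and is not the template's aspairs(). It is assumed only to produce the same kind of result, with sequence IDs as keys and sequences as values. The two short contigs are made up.

```python
def read_fasta(lines):
    """Hypothetical helper: return {sequence ID: sequence} from FASTA lines."""
    sequences = {}
    seq_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if seq_id is not None:
                sequences[seq_id] = "".join(chunks)
            header = line[1:].split()
            # The ID is the first whitespace-delimited word after '>'
            seq_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        sequences[seq_id] = "".join(chunks)
    return sequences


# Made-up contigs for illustration only
demo = [">contig_1 demo record", "ACGT", "GGCC", ">contig_2", "ATAT"]
seqs = read_fasta(demo)

total_length = sum(len(seq) for seq in seqs.values())
gc_count = sum(seq.upper().count("G") + seq.upper().count("C")
               for seq in seqs.values())

print(f"Contigs: {len(seqs)}")
print(f"Total length: {total_length}")
print(f"GC content: {100 * gc_count / total_length:.1f}%")
```

Once the sequences are in a dictionary, whole-genome numbers such as total length or GC content reduce to a loop over its values.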