class: center, middle

# Phylogenetics and data processing

---

# Phylogenetic Trees

* Representation of relationships of species, genes, groups
* Entities on adjacent branches are more closely related than those further apart
* Generate trees based on computed distances or more complex likelihood models

.center[![Darwintree](https://evolution-institute.org/wp-content/uploads/2016/02/figure1-300x252.jpg "Darwintree")]

---

# Compact Tree Representation

Newick format

```
(A,B,(C,D),(E,F,(G,H)));

  ________________________ A
 |
 |________________________ B
 |
 |                         ________________________ C
_|________________________|
 |                        |________________________ D
 |
 |                         ________________________ E
 |                        |
 |________________________|________________________ F
                          |
                          |                         ________________________ G
                          |________________________|
                                                   |________________________ H
```

---

# Molecular Phylogenetics

* Identify homologous sequences (e.g. genes)
* Align the sequences with multiple alignment software
* Optionally trim the alignment to remove poorly aligned regions
* Construct phylogenetic tree

---

# Identify homologous sequences

* Reciprocal BLAST
* Orthology Clustering
* Syntenic information (shared flanking regions)
* Molecular approaches (PCR, cloning, sequencing)
* DNA and Protein considerations
* Store in FASTA file

---

# Extract those sequences

```bash
$ module load cdbfasta
$ cdbfasta DATABASE
# make a file which has the sequence names, one per line
$ cdbyank DATABASE.cidx < listseqids > sequencefile.fa
```

---

# Multiple Sequence Alignment

* Construct MSAs: MUSCLE

```bash
$ module load muscle
$ muscle -in VMS1.aa.fasta -out VMS1.aa.fasaln
$ muscle -clw -in VMS1.aa.fasta -out VMS1.aa.aln
```

* Construct MSA of coding sequence: [T-Coffee](https://tcoffee.readthedocs.io)
* Align the sequence on codon boundaries (e.g.
with knowledge of the encoded amino acids)

```bash
$ module load tcoffee
$ t_coffee VMS1.cds.fasta -method cdna_fast_pair
```

---

# Trimming alignments

* [Trimal](http://trimal.cgenomics.org/) or Gblocks to remove poorly aligning regions

```bash
$ module load trimal
$ trimal -in VMS1.aa.fasaln -out VMS1.aa.trim -automated1
```

* Additional [options for trimming](http://trimal.cgenomics.org/use_of_the_command_line_trimal_v1.2)

```bash
# remove spurious sequences: a residue is 'good' when at least 75% of the
# sequences share that column; a sequence is kept only if at least 80% of
# its residues are good
$ trimal -in inputfile -out outputfile -resoverlap 0.75 -seqoverlap 80
```

* Trimal can also convert between file formats (FASTA to PHYLIP, NEXUS, etc.)

---

# Phylogenetic analyses

* Neighbor-Joining: rapid, distance-based tree building
* FastTree builds an initial neighbor-joining-style tree, then refines it into an approximately maximum-likelihood tree

```bash
#!/usr/bin/bash
#SBATCH --nodes 1 --ntasks 1
module load fasttree
FastTree VMS1.aa.trim > VMS1.aa.nj.tre
```

* FastTree automatically reports SH-like local support values from 1,000 resamples (not a full bootstrap)
* FastTreeMP will run on multiple processors
* Many parameters to specify: nucleotide vs. protein input, models of among-site rate variation, etc.

---

# Maximum Likelihood

* [RAxML](https://sco.h-its.org/exelixis/web/software/raxml/index.html) or [IQTree](http://www.iqtree.org/) on the cluster

```bash
#!/usr/bin/bash
#SBATCH --ntasks 2 --nodes 1
module load IQ-TREE
iqtree-omp -s VMS1.aa.trim -nt 2
```

```bash
#!/usr/bin/bash
#SBATCH --ntasks 4 --nodes 1
module load RAxML
raxmlHPC-PTHREADS-SSE3 -m PROTGAMMAAUTO -T 4 -s VMS1.aa.trim -n VMS1_Run1 -p 123 -x 123 -f a -N autoMRE
```

---

# Viewing trees

* iTOL - https://itol.embl.de/
* Figtree - http://tree.bio.ed.ac.uk/software/figtree/ and https://github.com/rambaut/figtree/

```bash
# with X11 enabled on your laptop
$ module load figtree
$ figtree file.tre
```

---

# Advanced Models

* Model fit: identify the best-fitting model of sequence evolution with ModelTest, jModelTest, or ProtTest
* Concatenated gene sets: partition the data with PartitionFinder

---

# More you can do in BioPython

BioPython tutorial http://biopython.org/wiki/Phylo and Cookbook http://biopython.org/wiki/Phylo_cookbook

* Read / Write Trees
* Construct trees from MSA (distance and parsimony)
* Bootstraps

---

# Traverse a tree

```python
#!/usr/bin/env python3
from Bio import Phylo

trees = Phylo.parse("treefile1.tre", "newick")

def all_parents(tree):
    # map every clade to its parent clade
    parents = {}
    for clade in tree.find_clades(order='level'):
        for child in clade:
            parents[child] = clade
    return parents

for tree in trees:
    print(tree)
    Phylo.draw_ascii(tree)
    for tip in tree.get_terminals():
        print("terminal tip", tip)
    term_names = [term.name for term in tree.get_terminals()]
    parents = all_parents(tree)
    clades = tree.find_clades("E")
    for myclade in clades:  # each clade named 'E' (only one here)
        parent_of_myclade = parents[myclade]
        print(parent_of_myclade.get_terminals())
```

Output (the `print(tree)` listing is omitted):

```
  ________________________ A
 |
 |________________________ B
 |
 |                         ________________________ C
_|________________________|
 |                        |________________________ D
 |
 |                         ________________________ E
 |                        |
 |________________________|________________________ F
                          |
                          |                         ________________________ G
                          |________________________|
                                                   |________________________ H

terminal tip A
terminal tip B
terminal tip C
terminal tip D
terminal tip E
terminal tip F
terminal tip G
terminal tip H
[Clade(name='E'), Clade(name='F'), Clade(name='G'), Clade(name='H')]
```
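
---

# Newick by hand (sketch)

To make the Newick format from the Compact Tree Representation slide concrete, here is a minimal sketch in plain Python, with no BioPython required. It assumes names-only Newick: no branch lengths, support values, or quoted labels. The helpers `parse_newick`, `tips`, and `parent_of` are illustrative names made up for this slide, not part of any library; for real data, `Bio.Phylo` is the better choice.

```python
#!/usr/bin/env python3

def parse_newick(text):
    """Parse a names-only Newick string into nested lists of tip names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                       # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                   # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                       # consume ')'
            return children
        # a tip: read the name up to the next delimiter
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree

def tips(node):
    """Return all tip names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in tips(child)]

def parent_of(node, name):
    """Return the internal node that has tip `name` as a direct child, or None."""
    if isinstance(node, str):
        return None
    for child in node:
        if child == name:
            return node
        found = parent_of(child, name)
        if found is not None:
            return found
    return None

tree = parse_newick("(A,B,(C,D),(E,F,(G,H)));")
print(tree)
print(tips(tree))
print(tips(parent_of(tree, "E")))
```

Each pair of parentheses becomes one nested list, so the nesting depth of a name mirrors its depth in the drawn tree. The last line lists the tips under the parent of `E`, which are the same four tips (E, F, G, H) that the BioPython traversal reports.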