It is necessary to index the genome in preparation of alignemnt.
#SBATCH -p short -N 1 -n 2 --mem 2gb
module load bwa
GENOME=S_enterica_CT18.fasta
bwa index $GENOME
#SBATCH -p short -N 1 -n 16 --mem 4gb
module load bwa
module load samtools
CPU=16
mkdir -p fastq
ln -s /bigdata/gen220/shared/data/S_enterica/*.fastq.gz fastq
ln -s /bigdata/gen220/shared/data/S_enterica/S_enterica_CT18.fasta
ln -s /bigdata/gen220/shared/data/S_enterica/acc.txt
GENOME=S_enterica_CT18.fasta
if [ ! -f $GENOME.sa ]; then
bwa index $GENOME
fi
for acc in $(cat acc.txt)
do
FWDREAD=fastq/${acc}_1.fastq.gz
REVREAD=fastq/${acc}_2.fastq.gz
bwa mem -t $CPU $GENOME $FWDREAD $REVREAD > ${acc}.sam
samtools fixmate -O bam ${acc}.sam ${acc}_fixmate.bam
samtools sort --threads $CPU -O BAM -o ${acc}.bam ${acc}_fixmate.bam
samtools index ${acc}.bam
done
Interactively - you can use samtools
module load samtools
samtools tview SRR10574912.bam
There are many standardized SNP calling pipelines. GATK provides a robust pipeline that can be used.
Samtools/BCFTools are also useful and straight forward.
freebayes is another very useful pipeline for non-model systems.
#SBATCH -p batch -N 1 -n 4 --mem 16gb
module unload perl
module load samtools
module load bcftools
GENOME=S_enterica_CT18.fasta
# need to make a string which is all the bam files you want to process
# but if we do *.bam it will catch the intermediate bam files that are in the folder
for a in $(cat acc.txt)
do
m="$a.bam $m"
done
VCF=Salmonella.vcf.gz
VCFFILTER=Salmonella.filtered.vcf.gz
bcftools mpileup -Ou -f $GENOME $m | bcftools call -vmO z -o $VCF
tabix -p vcf $VCF
bcftools stats -F $GENOME -s - $VCF $VCF.stats
mkdir -p plots
plot-vcfstats -p plots/ $VCF.stats
bcftools filter -O z -o $FILTERED -s LOWQUAL -i'%QUAL>10' $VCF
IGV - High-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
Many browsers allow upload of aligned data (bam files) to integrate local data with public genome resources.