Mitochondrial genome assembly

Abstract

De novo assembly of 49 mustelid whole mitochondrial genomes

Steps

Calculate read length of fastq files for each sample and run MitoZ

R1=$1
R2=$2

fq1="$(realpath $R1)"
fq2="$(realpath $R2)"

dir="$(dirname $R1)"
fpath="$(basename $R1)"
#get the sample name and the prefix for the output R1 and R2 reads
samplename="$(cut -d'_' -f1 <<< $fpath)"
#get the length of the 11th read (in case the first few are a bit  short). Method will vary depending if the reads are gzipped or not

extension="${fpath##*.}"
fastq_length=150
if [ $extension == "gz" ]; then
         fastq_length="$(zcat $fq1 | head -n 42 | sed -n '42p'  | wc -c)"
 else
         fastq_length="$(sed -n '42p' $fq1 | wc -c)"
 fi

source samtools-1.10
source mitoz-2.3
source ncbiblast-2.2.27

REF=NC_020638.1_mitochondrial.fasta
#run mitzo all
srun mitoz all --fastq1 $fq1 --fastq2 $fq2 --fastq_read_length $fastq_length  --outprefix $samplename --thread_number 16 --clade  Chordata --genetic_code 2 --filter_taxa_method 1
#re-order assembly so all are anchored to a common reference.
srun python  /ei/software/testing/mitoz/2.3/src/release_MitoZ_v2.3/useful_scripts/Mitogenome_reorder.py -f $samplename.result/work71.mitogenome.fa -r $REF

Genome alignment. Concatenate the genomes and use ClustalW to align them.

2.1.

Rename both the accession name and file name of the genome assemblies, as they'll all have the same name (work71.mitogenome.fa.reorder)

SAMPLE=$1 #the sample name
FASTA=$2 #the path to the assembly

dir="$(dirname $FASTA)"

#change any number of upper and lowercase characters, numbers, spaces and = sign to the sampleID
sed -i "s/>[A-Za-z0-9 =]*/>${SAMPLE}/g" "$FASTA"

#rename the fasta file from 'work71.mitogenome.fa.reorder' to e.g. 'euro_S01_mitogenome.fasta'
mv $FASTA $dir/$SAMPLE\_mitogenome.fasta

2.2.

Concatenate all of the assemblies

find . -name "*_mitogenome.fasta" -exec cat {} \; -printf "\n" > all_mtdna_genomes.fasta
```You may need to visualise the genomes to make sure they're all the same complement. Reverse complement any genomes as required.

2.3.

Align the assemblies and change format to FASTA

source clustalw-2.1
source emboss-6.6.0

#align the seqences
srun clustalw -ALIGN -INFILE=all_mtdna_genomes.fasta -TYPE=DNA -OUTFILE=all_mtdna_genomes_aligned.aln
#reformat to fasta
srun seqret -sequence all_mtdna_genomes_aligned.aln -outseq all_mtdna_genomes_aligned.fasta

Abstract

Steps

推荐阅读