Plant contig clustering based assembly (PLCL) pipeline: an efficient long-read assembly tool for plant chloroplast and mitochondrial genomes
Kanae Nishii
blastn
contig coverage
mitochondrial genome assembly
Oxford Nanopore Technologies long-read sequencing
plant chloroplast genome assembly
samtools
Abstract
Third generation NGS long-read sequences become popular for draft genome assembly. Here, we developed an efficient plant chloroplast assembly tool for whole genome assemblies from long-read datasets. This pipeline could be used also for mitochondrial genome assembly if the coverage is sufficiently high. The PLCL pipeline works on an assembled draft genome. Summary of the PLCL pipeline
Step 1 : Creating a blastn database of the draft genome assembly and search for target sequences (chloroplast or mitochondria) as the queries. Step 2 : Checking the coverage (mean depth) of each contig in the draft genome using minimap2 and samtools . Step 3 : Combining the results from step 1 (“blast_out.tsv”) and step 2 (“coverage.tsv”). Step 4 : Optional : k -means analyses for clustering the draft genome contigs with the Perl Algorithm::KMeans package. Step 5 : Optional : Visualizing blastn bitscore and samtools coverage data by R ggplot2 . Step 6 : Selecting chloroplast contigs from the results of blastn , samtools coverage, and output of k-means analyses. Step 7 : Extracting the chloroplast / mitochondrial reads from the bam file created in step 2. Step 8 : Assembling reads into chloroplast / mitochondrial genomes.
Steps
Pre-starting PLCL pipeline
Download data for Arabidopsis thaliana from NCBI and check their read quality.
Download ONT reads
sratoolkit.2.11.0-ubuntu64/bin/fastq-dump ERR2173373 –gzip
Download reference chloroplast sequence
Click “FASTA” on the browser
From the drop down menu “FASTA”, select “FASTA(text)”
Copy and paste the sequence to a text editor
Save as “AtChloRef.fasta”
Quality checking ONT reads with NanoPlot and visualize read length distribution using R ggplot2
Checking the long-read quality with NanoPlot
a: NanoPlot standard output and automatic visualization
NanoPlot \
-t 8 \
--fastq $HOME/PLCL_At / ERR2173373.fastq.gz \
-o OUTPUT_dir \
--loglength
Checking the long-read quality with NanoPlot
b: NanoPlot raw output and manual visualization with R ggplot2
b-1: Store the raw data from NanoPlot
NanoPlot \
-t 8 \
--fastq $HOME/ONT_genome_raw.fastq.gz \
-o OUTPUT_dir_raw \
--raw \
--store
b-2: R visualization of read length distribution as histogram
library(ggplot2)
df <- read.table("NanoPlot-data.tsv", header = T)
ONT <- ggplot(df, aes(x=lengths)) + geom_histogram(bins=1000) +
scale_x_log10(limits=c(1,1000000),labels=scales::comma) +
ylim(0,10000) +
xlab("Read_length") +
ylab("Number_of_reads")+
theme(text=element_text(size=20))+
geom_vline(xintercept = c(3500), linetype="dashed", color="yellow")
ggsave("ONT.emf",plot=ONT)
Assembling a draft genome for Arabidopsis thaliana from ONT reads using e.g. wtdbg2 or canu
This example script uses wtdbg2.
Creating a lay file from raw reads
wtdbg2 \
-x ont \
-g 135m \
-i $HOME/PLCL_At / ERR2173373.fastq.gz \
-t 32 \
-fo Atont
Generation of a draft genome assembly fasta file from a lay file
wtpoa-cns \
-t 32 \
-i ont.ctg.lay.gz \
-fo Atont.ctg.fasta
Optional: Assessing the draft genome assembly using Quast
quast \
--large \
-o Atont_wtdbg2_quast \
-t 8 \
Atont.ctg.fasta
PLCL pipeline: Step 1
Creating a blastn database of the draft genome assembly and search for target sequences (chloroplast or mitochondria) as the queries.
Generating a blastn database of draft genome assembly
makeblastdb \
-in Atont.ctg.fasta \
-out Atont.ctg_db \
-dbtype nucl \
-parse_seqids
Searching for chloroplast contigs using blastn
blastn \
-db Atont.ctg_db \
-query AtChloRef.fasta \
-evalue 0.1 \
-outfmt "6 sacc evalue bitscore" \
-out blast_out.tsv
PLCL pipeline: Step 2
Checking the coverage (mean depth) of each contig in the draft genome using minimap2 and samtools
Aligning raw ONT reads to the draft genome assembly using minimap2
minimap2 -ax map-ont Atont.ctg.fasta ERR2173373.fastq.gz > ontasm.sam
Calculating coverage with samtools
samtools sort -o ontasm.bam -@ 8 ontasm.sam && \
samtools index ontasm.bam && \
samtools coverage -o coverage.tsv ontasm.bam
PLCL pipeline: Step 3
Combining the results from step 1 (“blast_out.tsv”) and step 2 (“coverage.tsv”)
In console, locate the “coverage.tsv” and “blast_out.tsv” files, and the “PLCL1.pl” script in the same folder and run:
Perl PLCL1.pl
#PLCL1.pl
#!/usr/bin/env perl -w
use strict;
use warnings;
my $coverage="coverage.tsv";
my $blast="blast_out.tsv";
my $outfile="blast_cov.tsv";
my $outfile2="hitcontig_cov.tsv";
my $outfile3="hitcontig_cov.u.csv";
#step1 coverage data save as hash, key is rname
my %cov_data = ();
open COV,"<coverage.tsv" or die "!";
while(<COV>){
chomp;
my($rname,$start,$end,$numreads,$covbases,$coverage,$meandepth,$meanbaseq,$meanmapq) = split (/\t/, $_, 9);
$cov_data{$rname} = [$start,$end,$numreads,$covbases,$coverage,$meandepth,$meanbaseq,$meanmapq];
}
close(COV);
#step2 merge blast data and meandepth
open OUT, '>', $outfile or die "!";
open OUT2, '>', $outfile2 or die "!";
open BLAST, '<',$blast or die "!";
while (my $line = <BLAST>){
chomp($line);
#$line =~ /^#/ and next;
my ($rname, $evalue, $bitscore, $length) = split(/\t/, $line, 4);
#search coverage ids
my $ref_data = $cov_data{$rname};
#add meandepth 1st place of cov_data
my $depth = $ref_data->[5];
#output
print OUT join ("\t",map $_ // "", $rname,$evalue,$bitscore,$length, $depth), "\n";
print OUT2 join (",", map $_ // "", $rname,$depth), "\n";
}
close(OUT);
close(OUT2);
close(BLAST);
#step3 create table with only unique value
open IN, '<'. $outfile2 or die "!";
open OUT3, '>', $outfile3 or die "!";
my @array = <IN>;
my %count;
@array=grep{!$count{$_}++}@array;
print OUT3 @array, "\n";
close(IN);
close(OUT3);
PLCL pipeline: Step 4
Optional: k -means analyses for clustering the draft genome contigs with the Perl Algorithm::KMeans package
Locate “hitcontig_cov.u.csv” and “PLCL2.auto.pl” in the same folder and run:
Perl PLCL2.auto.pl 2>&1 > PLCL2.auto.log
#PLCL2.auto.pl
#!/usr/bin/env perl -w
use strict;
use warnings;
use Algorithm::KMeans;
#perlbrew switch perl-5.34.0
#perlbrew switch-off
#export LD_LIBRARY_PATH="/ gsl-2.6/lib"
#export LD_LIBRARY_PATH="/gsl-2.6/"
#/usr/sbin/setenforce 0
my $datafile="hitcontig_cov.u.csv";
my $mask="N1";
my $clusterer = Algorithm::KMeans -> new(datafile => $datafile,
mask => $mask,
K => 0,
cluster_seeding => 'smart',
terminal_output => 1,
write_clusters_to_files => 1,
);
$clusterer->read_data_from_file();
my ($clusters_hash,$cluster_centers_hash)=$clusterer->kmeans();
my $K_best=$clusterer->get_K_best();
my $QoCval=$clusterer->show_QoC_values();
my $outfile4="kmeans.out";
open OUT4, '>', $outfile4 or die "!";
print OUT4 "\ncluster\tcluster_center_value\tcontig\n";
foreach my $cluster_id (sort keys %{$clusters_hash}) {
print OUT4 "\n$cluster_id\t@{$cluster_centers_hash->{$cluster_id}}\t@{$clusters_hash->{$cluster_id}}\n";
}
close (OUT4);
Locate “hitcontig_cov.u.csv” and “PLCL2.manual.pl” in the same folder and run:
Perl PLCL2.manual.pl 2>&1 > PLCL2.manual.log
#PLCL2.manual.pl: Change [k] to desirable number of clusters
#!/usr/bin/env perl -w
use strict;
use warnings;
use Algorithm::KMeans;
my $datafile="hitcontig_cov.u.csv";
my $mask="N1";
my $clusterer = Algorithm::KMeans -> new(datafile => $datafile,
mask => $mask,
K => [k],
cluster_seeding => 'smart',
terminal_output => 1,
write_clusters_to_files => 1,
);
$clusterer->read_data_from_file();
my ($clusters_hash,$cluster_centers_hash)=$clusterer->kmeans();
my $K_best=$clusterer->get_K_best();
my $QoCval=$clusterer->show_QoC_values();
my $outfile4="kmeans.k[k].out";
open OUT4, '>', $outfile4 or die "!";
print OUT4 "\ncluster\tcluster_center_value\tcontig\n";
foreach my $cluster_id (sort keys %{$clusters_hash}) {
print OUT4 "\n$cluster_id\t@{$cluster_centers_hash->{$cluster_id}}\t@{$clusters_hash->{$cluster_id}}\n";
}
close (OUT4);
Combine the results of k means analyses to the “hitcontig_cov.u.csv” and save as “chlo_u.km.csv”
<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.bx5dpq26/g4vfbcmf71_update.jpg" alt="Example result file: "chlo_u.km.csv"" loading="lazy" title="Example result file: "chlo_u.km.csv""/>
PLCL pipeline: Step 5 (optional)
Visualizing blastn bitscore and samtools coverage data by R ggplot2
Plot the data "chlo_u.km.csv" from step 4 in R.
#load libraries
library(ggplot2)
library(dplyr)
library(tibble)
library(ggrepel)
#load chloroplast data
chlo <- read.csv("chlo_u.km.csv")
colnames(chlo)[5] <- "kmeans_k0"
colnames(chlo)[6] <- "kmeans_k5"
chlo$coverage <- as.numeric(chlo$coverage)
colnames(chlo)
#display k=0 plot without labels
ggplot(chlo) +
aes(x=bitscore,y=coverage)+
geom_point(mapping = aes(color=kmeans_k0),size=4) +
geom_point(colour="gray90",size=0.01)+
scale_x_log10(limits=c(1,1000000),labels=scales::comma)+
scale_y_log10(limits=c(0.1,1e4))+
theme(text=element_text(size=20))
#display k=0 plot with labels > 50 coverage
ggplot(chlo) +
aes(x=bitscore,y=coverage)+
geom_point(mapping = aes(color=kmeans_k0),size=4) +
geom_point(colour="gray90",size=0.01)+
scale_x_log10(limits=c(1,1000000),labels=scales::comma)+
scale_y_log10(limits=c(0.1,1e4))+
theme(text=element_text(size=20))+
geom_label_repel(data=chlo %>% filter(coverage>50),aes(label=tig),nudge_x=0.1,nudge_y=0.2)
#save k = 0 plot with labels > 50 coverage
c0 <- ggplot(chlo) +
aes(x=bitscore,y=coverage)+
geom_point(mapping = aes(color=kmeans_k0),size=4) +
geom_point(colour="gray90",size=0.01)+
scale_x_log10(limits=c(1,1000000),labels=scales::comma)+
scale_y_log10(limits=c(0.1,1e4))+
theme(text=element_text(size=20))+
geom_label_repel(data=chlo %>% filter(coverage>50),aes(label=tig),nudge_x=0.1,nudge_y=0.2)
ggsave("At_chlo_k0.svg",c0)
#save k = 5 plot with labels > 50 coverage
c5 <- ggplot(chlo) +
aes(x=bitscore,y=coverage)+
geom_point(mapping = aes(color=kmeans_k5),size=4) +
geom_point(colour="gray90",size=0.01)+
scale_x_log10(limits=c(1,1000000),labels=scales::comma)+
scale_y_log10(limits=c(0.1,1e4))+
theme(text=element_text(size=20))+
geom_label_repel(data=chlo %>% filter(coverage>50),aes(label=tig),nudge_x=0.1,nudge_y=0.2)
ggsave("At_chlo_k5.svg",c5)
PLCL pipeline: Step 6
Select chloroplast contigs from the results of blastn , samtools coverage, and k -means analyses.
Create a txt file of the list of contig name ("list.txt")
<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.bx5dpq26/g4vjbcmf70_update.jpg" alt="Example "list.txt" file" loading="lazy" title="Example "list.txt" file"/>
PLCL pipeline: Step 7
Extracting the chloroplast / mitochondrial reads from the bam file created in step 2
Locate "list.txt" in the working directory ($PWD) and run the commands "read_extract.sh".
#read_extract.sh
mkdir $PWD/tempbam
for i in $(cat $PWD/list.txt);
do
samtools view \
-F 16 \
-b \
-o $PWD/tempbam/"$i".bam \
$PWD/ontasm.bam \
"$i"
done &&\
samtools merge $PWD/ontasm_filt.bam \
-f \
$PWD/tempbam/* && \
samtools fasta \
$PWD/ontasm_filt.bam \
-F 4 > \
ontasm_filt.reads.fa
PLCL pipeline: Step 8
Assembling reads into chloroplast / mitochondrial genomes
canu \
-p Atont.chlo -d Atont_chloro_canu \
usegrid=false \
genomeSize=0.15m \
minReadLength=12000 \
minOverlapLength=10000 \
corMinCoverage=8 \
trimReadsOverlap=100 \
trimReadsCoverage=100 \
rawErrorRate=0.500 \
correctedErrorRate=0.144 \
corOutCoverage=40 \
minInputCoverage=8 \
stopOnLowCoverage=8 \
-nanopore-raw ontasm_filt.reads.fa