Consensus Sequence Generation Protocol: Leveraging Variant Analysis in Mulberry Genotypes
Vidya Niranjan, Lavanya C, Spoorthi R Kulkarni, Shri Ganapathi
Abstract
This research investigates four distinct Mulberry genotypes, each offering valuable traits essential for sericulture and broader agricultural sustainability. Notably, Thailand Male is examined for genetic variations associated with superior leaf yield, potentially informing selective breeding efforts for cultivars providing optimal nutrition to silkworms. Assam Bol's resistance to root rot diseases is explored to elucidate genetic factors, promising the development of disease-resistant Mulberry varieties and reducing reliance on chemical interventions. The drought-resistant S1 genotype's genetic resilience is scrutinized to develop Mulberry cultivars capable of withstanding water scarcity amid climate change challenges. Furthermore, the study assesses Punjab Local's nitrogen use efficiency, crucial for enhancing agricultural productivity while mitigating environmental impact. Employing advanced genetic analysis techniques, including SNP and SSR markers, this research aims to revolutionize Mulberry agriculture by identifying genetic markers, establishing linkage maps, and creating diagnostic tools. Ultimately, this endeavor seeks to enhance economic viability and environmental sustainability within the sericulture industry.
The current protocol is designed for the development of a consensus sequence for the mulberry genome, but it can also be applied to other species.
Before start
Familiarity with scripting and command-line operation is required
A server with a high-end configuration (sufficient CPU, RAM, and storage) should be used for this protocol
GATK uses hard-filter thresholds that vary from sample to sample; the user has to set these values accordingly
Java has to be installed properly (Picard and GATK require a Java runtime); a quick check is shown below
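A minimal verification sketch, assuming GATK 4, SAMtools, and BCFtools are already installed and on the PATH:
#Checking tool installation Ubuntu 20.04
java -version
gatk --version
samtools --version
bcftools --version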
Steps
SAMPLE COLLECTION AND REFERENCE GENOME
Collection of the whole-genome sequences and the reference genome for Morus indica (mulberry)
Multiple mulberry accessions were sequenced at high depth on the Illumina platform, and the FASTQ files were retrieved from MindGP, an Indian repository of mulberry data. The project involves the generation of a high-density SNP map for various genotyping applications.
#Downloading the Genome sequences from FTP link in Ubuntu 20.04
wget <FTP Link>
Reference genome
The Morus indica assembly comprises 365 sequenced contigs, which were concatenated to obtain a single whole-genome reference.
The concatenated file, named mulberry_reference.fasta, was used for alignment and mapping of the FASTQ files.
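A minimal sketch of the concatenation step, assuming the individual contig FASTA files are in the working directory (the contig_*.fasta pattern is an illustrative placeholder for the actual filenames):
#Concatenating contigs into a single reference Ubuntu 20.04
cat contig_*.fasta > mulberry_reference.fasta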
ALIGNMENT AND MAPPING
The alignment and mapping process begins by indexing the reference genome, after which the reads are aligned with BWA-MEM. The aligned sequences are stored in SAM format. Next, SAMtools is employed to convert the SAM file to the binary BAM format and to sort the aligned reads by genomic coordinate. This sorting step optimizes downstream analyses such as variant calling and consensus sequence generation.
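A minimal sketch of the reference indexing, assuming BWA, SAMtools, and Picard are installed; the FASTA index and sequence dictionary produced here are also needed later by GATK:
#Indexing the reference genome Ubuntu 20.04
bwa index mulberry_reference.fasta
samtools faidx mulberry_reference.fasta
java -jar picard.jar CreateSequenceDictionary R=mulberry_reference.fasta O=mulberry_reference.dict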
#Alignment UBUNTU 20.04
bwa mem mulberry_reference.fasta <SRRID>_1.fastq.gz <SRRID>_2.fastq.gz > <SRRID>.sam
#SAM to BAM conversion Ubuntu 20.04
samtools view -bS <SRRID>.sam > <SRRID>.bam
#Sorting BAM
samtools sort <SRRID>.bam -o <SRRID>_sorted.bam
VARIANT CALLING USING GATK PIPELINE
The GATK pipeline processes raw sequencing data by first aligning reads to a reference genome and then recalibrating base quality scores to improve accuracy. Variant-calling algorithms identify potential variants, including SNPs, which are subsequently filtered for quality using hard-filter thresholds. Filtered variants are annotated with functional information before a consensus sequence is generated by incorporating the most common variant at each genomic position across all samples. This approach ensures reliable SNP identification, filtering, and consensus sequence generation.
The following command uses Picard, specifically the MarkDuplicates tool, in a Java environment. Let's break down the command:
java -jar picard.jar: This part of the command indicates that you are running a Java application and specifying the JAR (Java Archive) file picard.jar, which contains the Picard tools.
MarkDuplicates: This is the specific Picard tool being invoked. It is used to identify and mark duplicate reads in a BAM (Binary Alignment/Map) file.
INPUT=: Specifies the input BAM file, here the coordinate-sorted BAM produced in the previous step.
OUTPUT=: Specifies the output BAM file in which duplicate reads are marked.
METRICS_FILE=: Specifies the text file to which the duplication metrics are written.
java -jar picard.jar MarkDuplicates INPUT=<SRRID>_sorted.bam OUTPUT=<SRRID>_dedup_reads.bam METRICS_FILE=<SRRID>_metrics.txt
The next command uses the CollectAlignmentSummaryMetrics tool to collect alignment summary metrics from a BAM (Binary Alignment/Map) file. Let's break down the command:
java -jar picard.jar: This part of the command indicates that you are running a Java application and specifying the JAR (Java Archive) file picard.jar, which contains the Picard tools.
CollectAlignmentSummaryMetrics: This is the specific Picard tool being invoked. It is used to collect summary metrics about the alignment from a BAM file.
R=mulberry_reference.fasta: Specifies the reference genome in FASTA format against which the reads were aligned. mulberry_reference.fasta is a placeholder, and you would replace it with the actual filename of your reference genome.
I=: Specifies the input BAM file, here the deduplicated BAM produced by MarkDuplicates.
O=: Specifies the output text file to which the alignment summary metrics are written.
java -jar picard.jar CollectAlignmentSummaryMetrics R=mulberry_reference.fasta I=<SRRID>_dedup_reads.bam O=<SRRID>_alignment_metrics.txt
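The CollectInsertSizeMetrics tool computes the insert-size distribution of the paired-end library, writing a metrics file (-O) and a histogram PDF (-H).
#Collecting insert size metrics Ubuntu 20.04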
java -jar picard.jar CollectInsertSizeMetrics -I <SRRID>_dedup_reads.bam -O <SRRID>_insert_metrics.txt -H <SRRID>_insert_size_histogram.pdf
AddOrReplaceReadGroups adds or replaces read groups in a BAM (Binary Alignment/Map) file. Read groups are metadata associated with sequencing data and are important for downstream analysis. Let's break down the command:
java -jar picard.jar: This part of the command indicates that you are running a Java application and specifying the JAR (Java Archive) file picard.jar, which contains the Picard tools.
AddOrReplaceReadGroups: This is the specific Picard tool being invoked. It is used to add or replace read group information in a BAM file.
I=: Specifies the input BAM file, here the deduplicated BAM.
O=: Specifies the output BAM file with the read group information added.
RGID=4: Specifies the Read Group ID. It is a unique identifier for the read group. In this case, it’s set to 4.
RGLB=lib1: Specifies the Read Group Library. It represents the library from which the sequencing data originates.
RGPL=ILLUMINA: Specifies the Read Group Platform. It indicates the platform used for sequencing, and here it’s set to ILLUMINA.
RGPU=unit1: Specifies the Read Group Platform Unit. It provides additional information about the flowcell or lane, and here it’s set to unit1.
RGSM=20: Specifies the Read Group Sample Name. It represents the sample name associated with the read group. Here, it’s set to 20.
java -jar picard.jar AddOrReplaceReadGroups I=<SRRID>_dedup_reads.bam O=<SRRID>_sorted_dedupreads.bam RGID=4 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=20
#INDEXING THE BAM FILE UBUNTU 20.04
samtools index <SRRID>_sorted_dedupreads.bam
Now that preprocessing has been performed, the GATK pipeline is run to obtain the indels and SNPs.
#GATK PHASE 1 UBUNTU 20.04
gatk HaplotypeCaller -R mulberry_reference.fasta -I <SRRID>_sorted_dedupreads.bam -O <SRRID>_raw_variants.vcf
gatk SelectVariants -R mulberry_reference.fasta -V <SRRID>_raw_variants.vcf --select-type-to-include SNP -O <SRRID>_raw_snps.vcf
gatk SelectVariants -R mulberry_reference.fasta -V <SRRID>_raw_variants.vcf --select-type-to-include INDEL -O <SRRID>_raw_indels.vcf
gatk VariantFiltration -R mulberry_reference.fasta -V <SRRID>_raw_snps.vcf -O <SRRID>_filtered_snps.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 60.0" -filter-name "MQ_filter" -filter "MQ < 40.0" -filter-name "SOR_filter" -filter "SOR > 4.0" -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
gatk VariantFiltration -R mulberry_reference.fasta -V <SRRID>_raw_indels.vcf -O <SRRID>_filtered_indels.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 200.0" -filter-name "SOR_filter" -filter "SOR > 10.0"
gatk SelectVariants --exclude-filtered -V <SRRID>_filtered_snps.vcf -O <SRRID>_bqsr_snps.vcf
gatk SelectVariants --exclude-filtered -V <SRRID>_filtered_indels.vcf -O <SRRID>_bqsr_indels.vcf
#GATK PHASE 2 UBUNTU 20.04
gatk BaseRecalibrator -R mulberry_reference.fasta -I <SRRID>_sorted_dedupreads.bam --known-sites <SRRID>_bqsr_snps.vcf --known-sites <SRRID>_bqsr_indels.vcf -O <SRRID>_recal_data.table
gatk ApplyBQSR -R mulberry_reference.fasta -I <SRRID>_sorted_dedupreads.bam -bqsr <SRRID>_recal_data.table -O <SRRID>_recal_reads.bam
gatk BaseRecalibrator -R mulberry_reference.fasta -I <SRRID>_recal_reads.bam --known-sites <SRRID>_bqsr_snps.vcf --known-sites <SRRID>_bqsr_indels.vcf -O <SRRID>_post_recal_data.table
gatk AnalyzeCovariates -before <SRRID>_recal_data.table -after <SRRID>_post_recal_data.table -plots <SRRID>_recalibration_plots.pdf
#GATK PHASE 3 UBUNTU 20.04
gatk HaplotypeCaller -R mulberry_reference.fasta -I <SRRID>_recal_reads.bam -O <SRRID>_raw_variants_recal.vcf
gatk SelectVariants -R mulberry_reference.fasta -V <SRRID>_raw_variants_recal.vcf --select-type-to-include SNP -O <SRRID>_raw_snps_recal.vcf
gatk VariantFiltration -R mulberry_reference.fasta -V <SRRID>_raw_snps_recal.vcf -O <SRRID>_filtered_snps_final.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 60.0" -filter-name "MQ_filter" -filter "MQ < 40.0" -filter-name "SOR_filter" -filter "SOR > 4.0" -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
CONSENSUS GENERATION
The process combines information from multiple variants to generate a single consensus sequence. The consensus sequence provides a comprehensive summary of the genetic variation present in the samples and serves as a reference for downstream analysis, such as functional annotation or comparative genomics.
#bgzip Ubuntu 20.04
bgzip <SRRID>_filtered_snps_final.vcf
#Tabix Ubuntu 20.04
tabix -p vcf <SRRID>_filtered_snps_final.vcf.gz
#Samtools faidx and Bcftools consensus Ubuntu 20.04
samtools faidx mulberry_reference.fasta Super-Scaffold_<ID> | bcftools consensus <SRRID>_filtered_snps_final.vcf.gz > Contig_<ID>.fa
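The command above reconstructs the consensus for a single scaffold. If a consensus for every scaffold is needed, the same step can be looped over the scaffold names listed in the reference .fai index and the outputs concatenated; a minimal sketch, assuming the .fai index from samtools faidx exists and using an illustrative output name <SRRID>_consensus.fasta:
#Looping consensus generation over all scaffolds Ubuntu 20.04
cut -f1 mulberry_reference.fasta.fai | while read ID; do
samtools faidx mulberry_reference.fasta "$ID" | bcftools consensus <SRRID>_filtered_snps_final.vcf.gz > Contig_"$ID".fa
done
cat Contig_*.fa > <SRRID>_consensus.fasta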