Consensus Sequence Generation Protocol: Leveraging Variant Analysis in Mulberry Genotypes
Vidya Niranjan, Lavanya C, Spoorthi R Kulkarni, Shri Ganapathi
Abstract
This research investigates four distinct Mulberry genotypes, each offering valuable traits essential for sericulture and broader agricultural sustainability. Notably, Thailand Male is examined for genetic variations associated with superior leaf yield, potentially informing selective breeding efforts for cultivars providing optimal nutrition to silkworms. Assam Bol's resistance to root rot diseases is explored to elucidate genetic factors, promising the development of disease-resistant Mulberry varieties and reducing reliance on chemical interventions. The drought-resistant S1 genotype's genetic resilience is scrutinized to develop Mulberry cultivars capable of withstanding water scarcity amid climate change challenges. Furthermore, the study assesses Punjab Local's nitrogen use efficiency, crucial for enhancing agricultural productivity while mitigating environmental impact. Employing advanced genetic analysis techniques, including SNP and SSR markers, this research aims to revolutionize Mulberry agriculture by identifying genetic markers, establishing linkage maps, and creating diagnostic tools. Ultimately, this endeavor seeks to enhance economic viability and environmental sustainability within the sericulture industry.
The current protocol is designed for the development of a consensus sequence for the mulberry genome, but it can also be applied to other species.
Before start
Familiarity with scripting and command-line operation is required
A server with a high-end configuration (sufficient CPU, RAM, and storage) should be used for this protocol
GATK uses hard-filter thresholds that vary from sample to sample; the user has to set these values accordingly
Java has to be installed properly (Picard and GATK require a Java runtime); a quick check is shown below
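A minimal verification sketch, assuming GATK 4, SAMtools, and BCFtools are already installed and on the PATH:
#Checking tool installation Ubuntu 20.04
java -version
gatk --version
samtools --version
bcftools --version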
Steps
SAMPLE COLLECTION AND REFERENCE GENOME
Collection of the whole-genome sequences and the reference genome for Morus indica (mulberry)
Multiple mulberry accessions were sequenced at high depth on the Illumina platform, and the FASTQ files were retrieved from MindGP, an Indian repository of mulberry data. The project involves the generation of a high-density SNP map for various genotyping applications.
#Downloading the Genome sequences from FTP link in Ubuntu 20.04
wget <FTP Link>
Reference genome
The Morus indica assembly comprises 365 sequenced contigs, which were concatenated to obtain a single whole-genome reference.
The concatenated file, named mulberry_reference.fasta, was used for alignment and mapping of the FASTQ files.
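A minimal sketch of the concatenation step, assuming the individual contig FASTA files are in the working directory (the contig_*.fasta pattern is an illustrative placeholder for the actual filenames):
#Concatenating contigs into a single reference Ubuntu 20.04
cat contig_*.fasta > mulberry_reference.fasta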
ALIGNMENT AND MAPPING
The alignment and mapping process begins by indexing the reference genome, after which the reads are aligned with BWA-MEM. The aligned sequences are stored in SAM format. Next, SAMtools is employed to convert the SAM file to the binary BAM format and to sort the aligned reads by genomic coordinate. This sorting step optimizes downstream analyses such as variant calling and consensus sequence generation.
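A minimal sketch of the reference indexing, assuming BWA, SAMtools, and Picard are installed; the FASTA index and sequence dictionary produced here are also needed later by GATK:
#Indexing the reference genome Ubuntu 20.04
bwa index mulberry_reference.fasta
samtools faidx mulberry_reference.fasta
java -jar picard.jar CreateSequenceDictionary R=mulberry_reference.fasta O=mulberry_reference.dict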
#Alignment UBUNTU 20.04
bwa mem mulberry_reference.fasta <SRRID>_1.fastq.gz <SRRID>_2.fastq.gz > <SRRID>.sam
#SAM to BAM conversion Ubuntu 20.04
samtools view -bS <SRRID>.sam > <SRRID>.bam
#Sorting BAM
samtools sort <SRRID>.bam -o <SRRID>_sorted.bam
VARIANT CALLING USING GATK PIPELINE
The GATK pipeline processes raw sequencing data by first aligning reads to a reference genome and then recalibrating base quality scores to improve accuracy. Variant-calling algorithms identify potential variants, including SNPs, which are subsequently filtered for quality using hard-filter thresholds. Filtered variants are annotated with functional information before a consensus sequence is generated by incorporating the most common variant at each genomic position across all samples. This approach ensures reliable SNP identification, filtering, and consensus sequence generation.
The following command uses Picard, specifically the MarkDuplicates tool, in a Java environment. Let's break down the command:
java -jar picard.jar: This part of the command indicates that you are running a Java application and specifying the JAR (Java Archive) file picard.jar, which contains the Picard tools.
MarkDuplicates: This is the specific Picard tool being invoked. It is used to identify and mark duplicate reads in a BAM (Binary Alignment/Map) file.
INPUT=: Specifies the input BAM file, here the coordinate-sorted BAM produced in the previous step.
OUTPUT=: Specifies the output BAM file in which duplicate reads are marked.
METRICS_FILE=: Specifies the text file to which the duplication metrics are written.
java -jar picard.jar MarkDuplicates INPUT=<SRRID>_sorted.bam OUTPUT=<SRRID>_dedup_reads.bam METRICS_FILE=<SRRID>_metrics.txt
The next command uses the CollectAlignmentSummaryMetrics tool to collect alignment summary metrics from a BAM (Binary Alignment/Map) file. Let's break down the command:
java -jar picard.jar: This part of the command indicates that you are running a Java application and specifying the JAR (Java Archive) file picard.jar, which contains the Picard tools.
CollectAlignmentSummaryMetrics: This is the specific Picard tool being invoked. It is used to collect summary metrics about the alignment from a BAM file.
R=mulberry_reference.fasta: Specifies the reference genome in FASTA format against which the reads were aligned. mulberry_reference.fasta is a placeholder, and you would replace it with the actual filename of your reference genome.
I=: Specifies the input BAM file, here the deduplicated BAM produced by MarkDuplicates.
O=: Specifies the output text file to which the alignment summary metrics are written.
java -jar picard.jar CollectAlignmentSummaryMetrics R=mulberry_reference.fasta I=<SRRID>_dedup_reads.bam O=<SRRID>_alignment_metrics.txt
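The CollectInsertSizeMetrics tool computes the insert-size distribution of the paired-end library, writing a metrics file (-O) and a histogram PDF (-H).
#Collecting insert size metrics Ubuntu 20.04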
java -jar picard.jar CollectInsertSizeMetrics -I <SRRID>_dedup_reads.bam -O <SRRID>_insert_metrics.txt -H <SRRID>_insert_size_histogram.pdf
AddOrReplaceReadGroups adds or replaces read groups in a BAM (Binary Alignment/Map) file. Read groups are metadata associated with sequencing data and are important for downstream analysis. Let's break down the command:
java -jar picard.jar: This part of the command indicates that you are running a Java application and specifying the JAR (Java Archive) file picard.jar, which contains the Picard tools.
AddOrReplaceReadGroups: This is the specific Picard tool being invoked. It is used to add or replace read group information in a BAM file.
I=: Specifies the input BAM file, here the deduplicated BAM.
O=: Specifies the output BAM file with the read group information added.
RGID=4: Specifies the Read Group ID. It is a unique identifier for the read group. In this case, it’s set to 4.
RGLB=lib1: Specifies the Read Group Library. It represents the library from which the sequencing data originates.
RGPL=ILLUMINA: Specifies the Read Group Platform. It indicates the platform used for sequencing, and here it’s set to ILLUMINA.
RGPU=unit1: Specifies the Read Group Platform Unit. It provides additional information about the flowcell or lane, and here it’s set to unit1.
RGSM=20: Specifies the Read Group Sample Name. It represents the sample name associated with the read group. Here, it’s set to 20.
java -jar picard.jar AddOrReplaceReadGroups I=<SRRID>_dedup_reads.bam O=<SRRID>_sorted_dedupreads.bam RGID=4 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=20
#INDEXING THE BAM FILE UBUNTU 20.04
samtools index <SRRID>_sorted_dedupreads.bam
Now that preprocessing has been performed, the GATK pipeline is run to obtain the indels and SNPs.
#GATK PHASE 1 UBUNTU 20.04
gatk HaplotypeCaller -R mulberry_reference.fasta -I <SRRID>_sorted_dedupreads.bam -O <SRRID>_raw_variants.vcf
gatk SelectVariants -R mulberry_reference.fasta -V <SRRID>_raw_variants.vcf --select-type-to-include SNP -O <SRRID>_raw_snps.vcf
gatk SelectVariants -R mulberry_reference.fasta -V <SRRID>_raw_variants.vcf --select-type-to-include INDEL -O <SRRID>_raw_indels.vcf
gatk VariantFiltration -R mulberry_reference.fasta -V <SRRID>_raw_snps.vcf -O <SRRID>_filtered_snps.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 60.0" -filter-name "MQ_filter" -filter "MQ < 40.0" -filter-name "SOR_filter" -filter "SOR > 4.0" -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
gatk VariantFiltration -R mulberry_reference.fasta -V <SRRID>_raw_indels.vcf -O <SRRID>_filtered_indels.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 200.0" -filter-name "SOR_filter" -filter "SOR > 10.0"
gatk SelectVariants --exclude-filtered -V <SRRID>_filtered_snps.vcf -O <SRRID>_bqsr_snps.vcf
gatk SelectVariants --exclude-filtered -V <SRRID>_filtered_indels.vcf -O <SRRID>_bqsr_indels.vcf
#GATK PHASE 2 UBUNTU 20.04
gatk BaseRecalibrator -R mulberry_reference.fasta -I <SRRID>_sorted_dedupreads.bam --known-sites <SRRID>_bqsr_snps.vcf --known-sites <SRRID>_bqsr_indels.vcf -O <SRRID>_recal_data.table
gatk ApplyBQSR -R mulberry_reference.fasta -I <SRRID>_sorted_dedupreads.bam -bqsr <SRRID>_recal_data.table -O <SRRID>_recal_reads.bam
gatk BaseRecalibrator -R mulberry_reference.fasta -I <SRRID>_recal_reads.bam --known-sites <SRRID>_bqsr_snps.vcf --known-sites <SRRID>_bqsr_indels.vcf -O <SRRID>_post_recal_data.table
gatk AnalyzeCovariates -before <SRRID>_recal_data.table -after <SRRID>_post_recal_data.table -plots <SRRID>_recalibration_plots.pdf
#GATK PHASE 3 UBUNTU 20.04
gatk HaplotypeCaller -R mulberry_reference.fasta -I <SRRID>_recal_reads.bam -O <SRRID>_raw_variants_recal.vcf
gatk SelectVariants -R mulberry_reference.fasta -V <SRRID>_raw_variants_recal.vcf --select-type-to-include SNP -O <SRRID>_raw_snps_recal.vcf
gatk VariantFiltration -R mulberry_reference.fasta -V <SRRID>_raw_snps_recal.vcf -O <SRRID>_filtered_snps_final.vcf -filter-name "QD_filter" -filter "QD < 2.0" -filter-name "FS_filter" -filter "FS > 60.0" -filter-name "MQ_filter" -filter "MQ < 40.0" -filter-name "SOR_filter" -filter "SOR > 4.0" -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
CONSENSUS GENERATION
The process combines information from multiple variants to generate a single consensus sequence. The consensus sequence provides a comprehensive summary of the genetic variation present in the samples and serves as a reference for downstream analysis, such as functional annotation or comparative genomics.
#bgzip Ubuntu 20.04
bgzip <SRRID>_filtered_snps_final.vcf
#Tabix Ubuntu 20.04
tabix -p vcf <SRRID>_filtered_snps_final.vcf.gz
#Samtools faidx and Bcftools consensus Ubuntu 20.04
samtools faidx mulberry_reference.fasta Super-Scaffold_<ID> | bcftools consensus <SRRID>_filtered_snps_final.vcf.gz > Contig_<ID>.fa
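The command above reconstructs the consensus for a single scaffold. If a consensus for every scaffold is needed, the same step can be looped over the scaffold names listed in the reference .fai index and the outputs concatenated; a minimal sketch, assuming the .fai index from samtools faidx exists and using an illustrative output name <SRRID>_consensus.fasta:
#Looping consensus generation over all scaffolds Ubuntu 20.04
cut -f1 mulberry_reference.fasta.fai | while read ID; do
samtools faidx mulberry_reference.fasta "$ID" | bcftools consensus <SRRID>_filtered_snps_final.vcf.gz > Contig_"$ID".fa
done
cat Contig_*.fa > <SRRID>_consensus.fasta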