ICGRC Portal Tripal Data Generation and Setup
Locedie Mansueto, ramil.mauleon, Tobias Kretzschmar, Graham King
Abstract
The data provided by the International Cannabis Genomics Research Consortium (ICGRC, https://icgrc.info) web portal consists of published results from past analyses, and results generated using the steps described in this protocol. We used public datasets, and open software commonly used for the specific tasks, suggested from best practices, or from other Tripal sites. The sections in this protocol can be grouped into Data Generation (steps 1-8), Setup Tripal and Modules (steps 9-15), Multi-omics API Integration (step 16,17), and Setup of non-Tripal pages (step 18).
Steps
Prepare RNA-Seq Sequences
Next Generation Sequencing (NGS) RNA-Seq sequences are downloaded from NCBI SRA, and trimmed to remove adapters. Three sets of RNA-Seqs are prepared for: 1) Purple Kush gene prediction, 2) Finola gene prediction, 3) transcript assembly, expression level, and variant discovery for trichomes from 21 samples
Download and extract sequences from NCBI-SRA. The SRR fastq sequences can be paired-end (in SRRLIST_PE) or single end (in SRRLIST_SE)
purple_kush_srr_SE.txt purple_kush_srr_PE.txt finola_srr_PE.txt finola_srr_SE.txt 21trichs_srr_PE.txt
Software
Value | Label |
---|---|
SRA-Toolkit | NAME |
https://hpc.nih.gov/apps/sratoolkit.html | REPOSITORY |
NCBI | DEVELOPER |
https://hpc.nih.gov/apps/sratoolkit.html | LINK |
SRATOOLKIT_DIR=/path/to/sratoolkit
FASTQDOWNLOAD_DIR=/path/to/fastq_downloads
# for Purple kush
SRRLIST_PE=purple_kush_srr_PE.txt
SRRLIST_SE=purple_kush_srr_SE.txt
# for Finola
#SRRLIST_PE=finola_srr_PE.txt
#SRRLIST_SE=finola_SE.txt
# for 21 trichomes
#SRRLIST_PE=21trichs_srr_PE.txt
cat $SRRLIST_SE $SRRLIST_PE > $SRRLIST
cat $SRRLIST | ${SRATOOLKIT_DIR}/fastq-dump --split-files --gzip --outdir $FASTQDOWNLOAD_DIR -
Trim using trimmomatic. The paired-end and single-end fastq are iterated separately since they require different command arguments
Software
Value | Label |
---|---|
Trimmomatic | NAME |
http://www.usadellab.org/cms/index.php?page=trimmomatic | REPOSITORY |
http://www.usadellab.org/cms/index.php?page=trimmomatic | LINK |
0.39 | VERSION |
TRIMMOMATIC_DIR=/path/to/trimmomatic
FASTQTRIM_DIR=/path/to/trimmedfastq
THREADS=10
lines=$(cat $SRRLIST_PE)
for line in $lines
do
java -jar $TRIMMOMATIC_DIR/trimmomatic.jar PE -threads $THREADS $FASTQDOWNLOAD_DIR/${line}_1.fastq.gz $FASTQDOWNLOAD_DIR/${line}_2.fastq.gz $FASTQTRIM_DIR/${line}_1.fastq.gz $FASTQTRIM_DIR/${line}_1.U.qtrim.fastq.gz $FASTQTRIM_DIR/${line}_2.fastq.gz $FASTQTRIM_DIR/${line}_2.U.qtrim.fastq.gz ILLUMINACLIP:${TRIMMOMATIC_DIR}/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25
done
lines=$(cat $SRRLIST_SE)
for line in $lines
do
java -jar $TRIMMOMATIC_DIR/trimmomatic.jar SE -threads $THREADS $FASTQDOWNLOAD_DIR/${line}_1.fastq.gz $FASTQTRIM_DIR/${line}.fastq.gz ILLUMINACLIP:${TRIMMOMATIC_DIR}/adapters/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25
done
(Optional) Concatenate multi-fastq samples. BioSample SAMN with multiple SRR fastq sequences
lines=$(cat booth2020_srr.txt)
cd $FASTQTRIM_DIR
for line in $lines
do
a=($(echo "$line" | tr '\t' '\n'))
cat "${a[1]}_1.fastq.gz" "${a[2]}_1.fastq.gz" "${a[3]}_1.fastq.gz" "${a[4]}_1.fastq.gz" > "${a[0]}_1.fastq.gz"
cat "${a[1]}_2.fastq.gz" "${a[2]}_2.fastq.gz" "${a[3]}_2.fastq.gz" "${a[4]}_2.fastq.gz" > "${a[0]}_2.fastq.gz"
done
cd ..
Gene models prediction
Generation of gff and fasta files. Unless gene models and annotation are available in RefSeq for a genome assembly like cs10, local gene prediction and annotation are preformed. Cultivar-specific RNA-Seq datasets are used whenever available.
Software
Value | Label |
---|---|
STAR | NAME |
https://github.com/alexdobin/STAR | REPOSITORY |
https://github.com/alexdobin/STAR | LINK |
2.7.6a | VERSION |
Software
Value | Label |
---|---|
FINDER | NAME |
https://github.com/sagnikbanerjee15/Finder | REPOSITORY |
https://github.com/sagnikbanerjee15/Finder | LINK |
1.6 | VERSION |
Generate metadata csv listing the RNA-Seq sequences to use. Examples for Purple Kush and Finola
finola_metadata.csv purple_kush_metadata.csv
Prepare the STAR and olego genome index. FINDER wrongly parse a dot/period (.) in the contig IDs. As most contig IDs from NCBI ReqSeq/GeneBank have dots, rename the IDs by replacing dots with underscore
METADATAFILE=purple_kush_metadata.csv
#METADATAFILE=finola_metadata.csv
REF_FASTA=pkv5.fna
#REF_FASTA=fnv2.fna
FINDER_WORKDIR=finder_pkv5
#FINDER_WORKDIR=finder_fnv2
FINDER_OUTDIR=finder_out_pkv5
#FINDER_OUTDIR=finder_out_fnv2
FINDER=/path/to/finder
CPU=12
GENOME_FASTA=${REF_FASTA}.renamed.fna
GENOME_DIR_STAR=star_index_without_transcriptome
GENOME_DIR_OLEGO=olego_index
STAR=/path/to/star
cd $FINDER_WORKDIR
ln -s $FASTQTRIM_DIR trimfastq_rnaseq
sed 's/\./_/g' $REF_FASTA > $GENOME_FASTA
$STAR --runMode genomeGenerate --runThreadN $CPU \
--genomeDir $GENOME_DIR_STAR --genomeSAindexNbases 12 \
--genomeFastaFiles $GENOME_FASTA
../dep/olego/olegoindex -p $GENOME_DIR_OLEGO --output_directory $FINDER_OUTDIR
Run FINDER.
The final outputs are generated in the directory
Cluster gene models using gffread.
It is observed that multiple gene models are overlapping and fragmented in certain regions. These models are clustered in this step from the FINDER output gtf file.
Software
Value | Label |
---|---|
gffread | NAME |
https://github.com/gpertea/gffread | REPOSITORY |
Pertea and Pertea | DEVELOPER |
http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread | LINK |
0.11.7 | VERSION |
FINDER_GTF=${FINDER_WORKDIR}/${FINDER_OUTDIR}/final_GTF_files/combined_with_CDS_high_conf.gtf
CLUSTERED_GFF=${FINDER_GTF}.clustered.gff
CLUSTERED_PEP=${FINDER_GTF}.clustered.pep.fasta
CLUSTERED_CDS=${FINDER_GTF}.clustered.cds.fasta
CLUSTERED_SPLICEDEXON=${FINDER_GTF}.clustered.exon.fasta
/path/to/gffread -g $GENOME_FASTA --merge -d ${FINDER_GTF}.dupinfo \
-K -Q -Y -x $CLUSTERED_CDS -y $CLUSTERED_PEP \
-w $CLUSTERED_SPLICEDEXON -o $CLUSTERED_GFF $FINDER_OUTGTF
Merged gene models are listed in ${FINDER_GTF}.dupinfo. The clustered gene models in CLUSTERED_GFF is in gff format. The remaining files are sequences for CDS, exons and proteins in fasta format.
Revert the contig names in CLUSTERED_GFF back to the original names used in REF_FASTA. As an option, the clustered output gene model IDs (gene ids, cds id, protein ids, exon ids) may be renamed using your own nomenclature.
Gene functional annotation
Functional annotation by sequence homology using mmseqs2, is a fast sequence alignment alternative to BLAST.
Software
Value | Label |
---|---|
mmseqs2 | NAME |
https://github.com/soedinglab/mmseqs2 | REPOSITORY |
https://github.com/soedinglab/mmseqs2 | LINK |
Download the latest Uniprot viridiplantae proteins sequence and generate mmseqs2 database
MMSEQDB_UNIPROT=sptr_plants
MMSEQ_DIR=/path/to/mmseqs
$MMSEQ_DIR/mmseqs createdb sptr_plants.fasta.gz $MMSEQDB_UNIPROT
Get best hit alignment between predicted proteins with Uniprot plants
MM_RESULTDB=pkv5_sptr_plants_resultDB
MM_BESTRESULTDB=pkv5_sptr_plants_bestresultDB
$MMSEQ_DIR/mmseqs search $CLUSTERED_PEP ${MMSEQDB_UNIPROT} \
$MM_RESULTDB tmp --start-sens 1 --sens-steps 3 -s 7
$MMSEQ_DIR/mmseqs filterdb $MM_RESULTDB $MM_BESTRESULTDB \
--extract-lines 1
$MMSEQ_DIR/mmseqs convertalis $CLUSTERED_PEP \
${MMSEQDB_UNIPROT} $MM_BESTRESULTDB ${MM_BESTRESULTDB}.txt
Assign the homologs functional annotation and scores (target ID, e-value) to the Description and/or Note attribute in the GFF.
Functional annotation by Interpro. Submit CLUSTERED_PEP to InterproScan to any of the public Galaxy servers (EU,US or AU). Include Gene Onlogy ID (GO_ID) assignment, and set outputs to xml and tsv formats.
Transcript assembly and alignments
Four sources of Cannabis transcripts use are: 1) cs10 mRNA from NCBI RefSeq, 2) transcript assembled by publications, 3) gene prediction using cultivar-specific RNA-Seq sequences in step 2., 4) assembled transcripts using trichome RNA-Seq of 21 cultivars from multiple sources. The transcripts in this step are loaded and visualized in the JBrowse genome browser.
(Optional) Assemble RNA-Seq sequences into transcripts. rnaspades is used for its speed.
Software
Value | Label |
---|---|
SPADES | NAME |
https://github.com/ablab/spades | REPOSITORY |
https://github.com/ablab/spades | LINK |
v3.14.1 rnaSPAde | VERSION |
SPADES_DIR=/path/to/spades
SPADES_OUTDIR=outdir-21trichomes
# concatenate fastq files
cat $FASTQTRIM_DIR/SRR*_1.fastq.gz > $FASTQTRIM_DIR/allsrr_1.fastq.gz
cat $FASTQTRIM_DIR/SRR*_2.fastq.gz > $FASTQTRIM_DIR/allsrr_2.fastq.gz
$SPADES_DIR/rnaspades.py -o $SPADES_OUTDIR \
--pe1-1 $FASTQTRIM_DIR/allsrr_1.fastq.gz --pe1-2 $FASTQTRIM_DIR/allsrr_2.fastq.gz \
-t 10 -m 300
TRANSCRIPT_SPADES_FASTA=$SPADES_OUTDIR/hard_filtered_transcripts.fasta
The output assembled transcript sequences are in file $TRANSCRIPT_SPADES_FASTA
Align transcripts to genome into bam file. The results are visualized in the JBrowse genome browser.
Software
Value | Label |
---|---|
minimap2 | NAME |
Linux | OS_NAME |
https://github.com/lh3/minimap2 | REPOSITORY |
Heng Li | DEVELOPER |
https://github.com/lh3/minimap2/releases | LINK |
2.16 | VERSION |
Software
Value | Label |
---|---|
samtools | NAME |
GitHub | REPOSITORY |
http://htslib.org/ | DEVELOPER |
https://github.com/samtools/samtools/releases/tag/1.9 | LINK |
1.9 | VERSION |
MINIMAP2_DIR=/path/to/minimap2
SAMTOOLS_DIR=/path/to/samtools
TRANSCRIPT_FASTA=$TRANSCRIPT_SPADES_FASTA
TRANSCRIPT_BAM=${TRANSCRIPT_FASTA}.bam
$MINIMAP2_DIR/minimap2 -ax splice $REF_FASTA $TRANSCRIPT_FASTA | $SAMTOOLS_DIR/samtools view -@ $THREADS -u - | $SAMTOOLS_DIR/samtools sort -m 4G -@ THREADS -o $TRANSCRIPT_BAM -
Find open reading frames (ORF) of assembled transcripts to within the reference genomes. Transdecoder is used following https://github.com/TransDecoder/TransDecoder/wiki
The transcripts are aligned to genome using GMAP https://github.com/juliangehring/GMAP-GSNAP
Build the GMAP genome database
Software
Value | Label |
---|---|
GMAP | NAME |
https://github.com/juliangehring/GMAP-GSNAP | REPOSITORY |
Genentech, Inc | DEVELOPER |
https://github.com/juliangehring/GMAP-GSNAP | LINK |
GMAP_DIR=/path/to/gmap
REF=/path/to/reference_genome.fasta
refgenome_name=cs10
TRANSCIPT_FASTA=$TRANSCRIPT_SPADES_FASTA
$GMAP_GFF=${refgenome_name }-${TRANSCIPT_FASTA}-gmap-f3n1.gff
${GMAP_DIR}/gmap_build -D $refdb_name -d $refgenome_name -t 10 -s none $REF
Map assembled transcripts to find ORFs
${GMAP_DIR}/gmap -D $refdb_name -d $refgenome_name -f 3 -t 12 -n 1 > $GMAP_GFF
Search for long ORFs and peptides from assembled transcripts
TRANSDECODER_DIR=/path/to/transdecoder
${TRANSDECODER_DIR}/TransDecoder.LongOrfs -t $TRANSCIPT_FASTA
Longest ORF peptides are aligned to UniProt plants, using mmseqs similar to step 3.2. Best hit results is in file in LONGESTORF_MMSEQ_BESTHIT
${MMSEQ_DIR}/mmseqs search rnablongestorf_pep_seqDB sptr_plants_2022_seqDB rnablongestorf_pep-sptr_plants-resultDB tmp --start-sens 1 --sens-steps 3 -s 7
${MMSEQ_DIR}/mmseqs filterdb rnablongestorf_pep-sptr_plants-resultDB rnablongestorf_pep-sptr_plants-bestresultDB --extract-lines 1
${MMSEQ_DIR}/mmseqs convertalis rnablongestorf_pep_seqDB sptr_plants_2022_seqDB rnablongestorf_pep-sptr_plants-bestresultDB $LONGESTORF_MMSEQ_BESTHIT
Run TransDecoder ORF prediction, generating output ${TRANSCIPT_FASTA}.transdecoder.gff3
${TRANSDECODER_DIR}/TransDecoder.Predict -t $TRANSCIPT_FASTA --retain_blastp_hits $LONGESTORF_MMSEQ_BESTHIT
Generate ORF with genome coordinates
TRANSCRIPT_GENOME_GFF= ${TRANSCIPT_FASTA}.transdecoder.genome.gff3
perl ../../util/cdna_alignment_orf_to_genome_orf.pl ${TRANSCIPT_FASTA}.transdecoder.gff3 $GMAP_GFF $TRANSCIPT_FASTA > $TRANSCRIPT_GENOME_GFF
Cluster overlapping gene loci, and generate CDS, exon and protein sequences
Software
Value | Label |
---|---|
gffread | NAME |
https://github.com/gpertea/gffread | REPOSITORY |
Pertea and Pertea | DEVELOPER |
http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread | LINK |
0.11.7 | VERSION |
GENOME_FASTA=cs10.fna
FINDER_GTF=transcripts.fasta.renamed.fasta.transdecoder.genome.gff3 # FINDER_WORKDIR}/${FINDER_OUTDIR}/final_GTF_files/combined_with_CDS_high_conf.gtf
CLUSTERED_GFF=${TRANSCRIPT_GENOME_GFF}.clustered.gff
CLUSTERED_PEP=${TRANSCRIPT_GENOME_GFF}.clustered.pep.fasta
CLUSTERED_CDS=${TRANSCRIPT_GENOME_GFF}.clustered.cds.fasta
CLUSTERED_SPLICEDEXON=${TRANSCRIPT_GENOME_GFF}.clustered.exon.fasta
GFFREAD_DIR=/path/to/gffread
${GFFREAD_DIR}/gffread -g $GENOME_FASTA --merge -d ${TRANSCRIPT_GENOME_GFF}.dupinfo -K -Q -Y -x $CLUSTERED_CDS -y $CLUSTERED_PEP -w $CLUSTERED_SPLICEDEXON -o $CLUSTERED_GFF $TRANSCRIPT_GENOME_GFF
Gene expression
Gene expression quantification uses Salmon in mapping-based mode https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode
We quantify the expression of the cs10 RefSeq mRNA.
Software
Value | Label |
---|---|
Salmon | NAME |
https://combine-lab.github.io/salmon | REPOSITORY |
https://combine-lab.github.io/salmon | LINK |
1.4.0 | VERSION |
Software
Value | Label |
---|---|
Trinity RNA-Seq | NAME |
https://github.com/trinityrnaseq/trinityrnaseq/wiki | REPOSITORY |
https://github.com/trinityrnaseq/trinityrnaseq/wiki | LINK |
v2.11.0 | VERSION |
Create decoy-aware transcriptome file https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/. This is generated by appending the genome sequences to the transcripts sequences, and list the decoys IDs. The decoys are the genome sequences.
TRANSCRIPT_FASTA=GCF_900626175.2_cs10_rna.fna
GENTROME_FASTA=${TRANSCRIPT_FASTA}_${REF_FASTA}
SALMON_INDEX=${GENTROME_FASTA}_index
grep "^>" $REF_FASTA | cut -d " " -f 1 | sed -i 's/>//g' > ${REF_FASTA}.decoys.txt
cat $TRANSCRIPT_FASTA $REF_FASTA > $GENTROME_FASTA
Create salmon index file
salmon index -p 10 -t $GENTROME_FASTA -d ${REF_FASTA}.decoys.txt -i $SALMON_INDEX -k 31
Quantify each replicate
mkdir $SALMON_DIR
lines=$(cat 21trichomes_replicates.txt)
for line in $lines
do
a=($(echo "$line" | tr '\t' '\n'))
mkdir $SALMON_DIR/${a[0]}.1
mkdir $SALMON_DIR/${a[0]}.2
mkdir $SALMON_DIR/${a[0]}.3
salmon quant -p 5 -i ${SALMON_INDEX} -l IU -1 $FASTQTRIM_DIR/${a[1]}_1.fastq.gz -2 $FASTQTRIM_DIR/${a[1]}_2.fastq.gz --validateMappings -o $SALMON_DIR/${a[0]}.1
salmon quant -p 5 -i ${SALMON_INDEX} -l IU -1 $FASTQTRIM_DIR/${a[2]}_1.fastq.gz -2 $FASTQTRIM_DIR/${a[2]}_2.fastq.gz --validateMappings -o $SALMON_DIR/${a[0]}.2
salmon quant -p 5 -i ${SALMON_INDEX} -l IU -1 $FASTQTRIM_DIR/${a[3]}_1.fastq.gz -2 $FASTQTRIM_DIR/${a[3]}_2.fastq.gz --validateMappings -o $SALMON_DIR/${a[0]}.3
done
The quantified expression is in file $SALMON_DIR/*/quant.sf
Estimate abundance
QUANT_FILE=/path/to/quant_filelist
ls $SALMON_DIR/*/*quant.sf > ${QUANT_FILE}.txt
TRINITY_DIR=/path/to/trinity
${TRINITY_DIR}/util/abundance_estimates_to_matrix.pl --est_method salmon --name_sample_by_basedir --gene_trans_map none \
--out_prefix ${QUANT_FILE}/${QUANT_FILE} \
--quant_files ${QUANT_FILE}.txt
(Optional) Generate bigwig files to display abundance level as JBrowse track using https://www.encodeproject.org/software/bedgraphtobigwig plugin.
Software
Value | Label |
---|---|
Bedtools | NAME |
https://github.com/arq5x/bedtools2 | REPOSITORY |
http://quinlanlab.org | DEVELOPER |
https://bedtools.readthedocs.io/en/latest/ | LINK |
2.27.1 | VERSION |
Software
Value | Label |
---|---|
bedgraphtobigwig | NAME |
https://www.encodeproject.org/software/bedgraphtobigwig | REPOSITORY |
https://www.encodeproject.org/software/bedgraphtobigwig | LINK |
Edit the filenames in matrix2bigwig.py using the outputs from steps 5.3 and 4.3, and run the python script. These chromosome length files are also needed chromosome_lengths.zip
Genotype
SNP genotype dataset generation is described in details in separate protocols
Software
Value | Label |
---|---|
10.5281/zenodo.8351609 | NAME |
https://zenodo.org/record/8351609 | REPOSITORY |
https://zenodo.org/record/8351609 | LINK |
Biopathways
The biopathways module visualize the expression level or differential expression of transcripts/proteins over biological pathways. It is a web-based implemantation of MapMan
https://github.com/usadellab/usadellab.github.io/tree/master/MapManJS
Download MapMan mapping files, curated pathways, and protein sequences for Arabidopsis thaliana, Eucalyptus grandis (eucalypthus), and Solanum lycopersicum (tomato)
X4 Araport11
Eucalyptus grandis map
Egrandis_201
Solanum lycopersicum map
ITAG2.3
Software
Value | Label |
---|---|
MapManJS | NAME |
https://usadellab.github.io/MapManJS | REPOSITORY |
https://usadellab.github.io/MapManJS/ | LINK |
Align protein of cs10 vs the three plants' proteins using mmseq reciprocal best hit
PROTEINCS10_FASTA=/path/to/cs10_pep.fasta.gz
PROTEIN1_FASTA=/path/to/ath_pep.fasta.gz
PROTEIN2_FASTA=/path/to/egr_pep.fasta.gz
PROTEIN3_FASTA=/path/to/sly_pep.fasta.gz
PAIR1_RBH=cs10_ath
PAIR2_RBH=cs10_egr
PAIR3_RBH=cs10_sly
$MMSEQ_DIR/mmseqs easy-rbh $PROTEINCS10_FASTA $PROTEIN1_FASTA $PAIR1_RBH tmp
$MMSEQ_DIR/mmseqs easy-rbh $PROTEINCS10_FASTA $PROTEIN2_FASTA $PAIR2_RBH tmp
$MMSEQ_DIR/mmseqs easy-rbh $PROTEINCS10_FASTA $PROTEIN3_FASTA $PAIR3_RBH tmp
Revise MapMan maps to use the cannabis genes using the reciprocal best hit results.
X4_ENSEMBL39_ISOFORM_ArabidopsisThaliana-cs10pep.out.gz
Synteny
MCScanX was used to detect synteny blocks using the genes for cs10 and pkv5 https://github.com/wyp1125/MCScanX
Software
Value | Label |
---|---|
blast+ | NAME |
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+ | REPOSITORY |
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST | LINK |
2.2.29+ | VERSION |
Software
Value | Label |
---|---|
MCScanX | NAME |
https://github.com/wyp1125/MCScanX | LINK |
Align two reference gene models using blastp
BLAST_DIR=/path/to/blast
PROT_FASTA_1=/path/to/cs10_pep.gz
PROT_FASTA_2=/path/to/pkv5_pep.gz # from gene prediction results
PROT_DB1='cs10'
PROT_DB2='pkv5'
$BLAST_DIR/makeblastdb -in $PROT_FASTA_1 -dbtype prot -out ${PROT_DB1}_db
$BLAST_DIR/blastall -i $PROT_FASTA_2 -d ${PROT_DB1}_db -p blastp -e 1e-10 -b 5 -v 5 -m 8 -o ${PROT_DB2}-${PROT_DB1}_blastallp_b5v5m8.blast
From the GFF file, generate TSV file with columns contig,gene_id,start,stop then rename contigs into annotation code plus incremental number, ie. cs1, cs2 .. csN, pk1,pk2,..pkN. Also generate a TSV mapping file of the codes and original contig names. Concatenate the files for the pair.
Create a directory and move the input files. Run MCScanX
MCSCANX_DIR=/path/to/mcscanx
PAIR_NAME=${PROT_DB1}_${PROT_DB2}
mkdir $PAIR_NAME
cp genelist_${PAIR_NAME}.txt ${PAIR_NAME}/
cp contigmap_${PAIR_NAME}.txt ${PAIR_NAME}/
$MCSCANX_DIR/MCScanX ${PAIR_NAME}/${PAIR_NAME}
File
Tripal modules and data loaders
Install and enable the following Tripal modules, then load the different datasets corresponding to each module. The step-by-step guide to loading are described in each modules documentations in the provided links. The purpose of this section is to summarize the datafiles and format needed for each module, how to generate them using the results from the previous sections, or where to find them if publicly available.
Datasets generated from the previous sections and needed in the next steps are compiled in https://icgrc.info/downloads
Install Tripal v3 docker
Software
Value | Label |
---|---|
Tripal v3 | NAME |
docker pull statonlab/tripal3 | REPOSITORY |
docker pull statonlab/tripal3 | LINK |
v3 | VERSION |
Create and load the basic information for a new genome.
Create an organism https://tripal.readthedocs.io/en/latest/user_guide/example_genomics/organisms.html
In Chado, the organism_id is used to define a genome assembly. Multiple assemblies or versions for a species can be assigned as different "Organisms" in Tripal.
Create Genome Assembly (Analysis), and load reference genome FASTA file, associated to Analysis and Organism
Define the Gene Annotation (Analysis) used to generate the gene sequences.
Load and publish mRNA and protein sequences FASTA files generated from step 2.5 or downloaded from sources like NCBI RefSeq, associated to Analysis and Organism
https://tripal-devseed.readthedocs.io/en/latest/loading_FASTA.html
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_900626175.2/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_900626175.2.zip" -H "Accept: application/zip"
Steps 10.2 and 10.3 are interchangeable, provided the sequence IDs in the FASTA and GFF files should match.
Define the gene modelling analysis and load the GFF gene models generated in step 3.2, or downloaded from sources like NCBI RefSeq.
https://tripal-devseed.readthedocs.io/en/latest/loading_GFF.html
Load additional annotations generated from different analysis. In all analyses always generate the XML output format when available, since it is the most comprehensive and required by the Tripal loaders. As a rule, always read the loaders documentation first before running an analysis so that the accepted formats are generated.
Load the InterproScan result in step 3.3
https://tripal-devseed.readthedocs.io/en/latest/loading_IPS.html
and additional BLAST results performed
https://tripal-devseed.readthedocs.io/en/latest/loading_BLAST.html
Install Mainlab Chado Search module
https://gitlab.com/mainlabwsu/chado_search
This module uses the genome features and annotations analyses loaded in steps 2 and 3, to create the Materialize View chado_search_gene_search.
The default query definition is modified in the Materialized Views page to account for i) the nomenclature of the different annotations and references, ii) to include associated mRNA and protein annotations in gene accession and keyword search, and iii) include transitive closure when using Gene Ontology terms in keyword search. The modified SQL definition is in chado_search_gene_search.sql
Install Expression module
Create the biosample TSV file, or download them from NCBI BioSample as XML is available.
For TSV, the columns should be as shown in this example
https://github.com/tripal/tripal_analysis_expression/blob/master/example_files/exampleTSV.tsv
organism should match the organism name defined in 10.1
sample_name should match previously loaded Biomaterial names if available
temp , tissue , treatment should match previously loaded Ontology terms if available
Load and publish BioSample
Load expression data generated in step 5.3, or from public expression repository like NCBI GEO.
The expression data can take a TSV format with sample name as column header, feature name (gene, mRNA, protein name) as row header, and expression value at each matrix element. Samples (12.1) and features (10.2) should be loaded first as described, and the names should match.
A sample expression matrix available in
https://github.com/tripal/tripal_analysis_expression/blob/master/example_files/exampleMatrix.tsv
This module creates the expression_feature_all materialized view
Install Phenotype module
https://analyzedphenotypes.readthedocs.io/en/latest/index.html
This module heavily relies on ontology terms to define phenotypes. There are two ways of loading ontologies supported by this module:
-
use published Bioontologies
-
use customized/specialized ontology/controlled terms
In addition, there are already loaded default ontologies in Tripal for method NCIT and unit UO
Download OBO files from published sources
For popular crops, use crop-specific terms from http://www.cropontology.org
In addition, use generic plant and trait terms from TO https://obofoundry.org/ontology/to and PATO https://obofoundry.org/ontology/pato.
For metabolites, use ChEBI https://purl.obolibrary.org/obo/chebi.obo
Load all OBO files necessary
https://tripal.readthedocs.io/en/latest/user_guide/example_genomics/controlled_vocabs.html
Setup Trait Ontologies using any of the loaded Ontologies
https://analyzedphenotypes.readthedocs.io/en/latest/admin_guide/setup.html
Check Allow new terms to be added to the Controlled Vocabulary during upload for flexibility
Upload phenotype data in TSV format
https://analyzedphenotypes.readthedocs.io/en/latest/user_guide/loading.html
In the tsv, the Trait Name, Method Name, Unit columns should match the ontology terms when available
Germplasm Accession, Germplasm Name should match preloaded Stocks when available
Stocks may also be preloaded in bulk
https://tripal.readthedocs.io/en/latest/user_guide/bulk_loader.html
or Stock https://icgrc.info/bio_data/add/37,
or Germplasm https://icgrc.info/bio_data/add/21
This module creates the mview_phenotype materialized view
Map viewer
https://gitlab.com/mainlabwsu/tripal_map
Maps consists of a map with and set of features (genetic markers, QTLs, physical markers)
Create map
Load map features
Install Synteny module
Copy the files generated from steps 8.2 and 8.3 to the tripal_synview directory
cd tripal_synview
# organism_id 1
ORG_ID1=1
# organism_id 2
ORG_ID2=2
perl syntenyTool.pl -t mcscanx_block -c contigmap_${PAIR_NAME}.txt ${PAIR_NAME}.collinearity genelist_${PAIR_NAME} > ${PAIR_NAME}.block
perl syntenyTool.pl -t mcscanx_tripal -a $ANNOTCODE_1 -b $ANNOTCODE_2 -c $ORG_ID1 -d $ORG_ID2 contigmap_${PAIR_NAME}.txt genelist_${PAIR_NAME} ${PAIR_NAME}.collinearity ${PAIR_NAME}.block > ${PAIR}.block.4tripal.txt
Load the generated file ${PAIR}.block.4tripal.txt to Tripal and use in defining a new Synteny page
Multiomics queries
The multiomics data integration is implemented as a web-service API. The documentation in https://icgrc.info/api_doc describes the queries and applicable constraints. It is uses a Drupal module to query the various materialized views generated by the Tripal modules.
The URL queries are processed using Drupal simple_slim_api module https://www.drupal.org/project/simple_slim_api. To instantiate, install and enable the simple_slim_api module, then replace simple_slim_api.module with this customized script simple_slim_api.module , and utility script read_snps.php. Set the bcftools path and vcf files location in read_snps.php
In the module, the URL arguments are converted into SQL queries on the materialized views created by the modules in section
These materialize views were generated by the
CREATE MATERIALIZED VIEW expression_feature_all AS
SELECT F.feature_id AS feature_id,
F.name AS feature_uniquename,
B.biomaterial_id AS biomaterial_id,
B.name AS biomaterial_name,
AN.analysis_id AS analysis_id,
AN.program AS analysis_method,
ER.signal AS signal
FROM elementresult ER
INNER JOIN element E ON E.element_id = ER.element_id
INNER JOIN feature F ON F.feature_id = E.feature_id
INNER JOIN quantification Q ON Q.quantification_id = ER.quantification_id
INNER JOIN acquisition AQ ON AQ.acquisition_id = Q.acquisition_id
INNER JOIN assay A ON A.assay_id = AQ.assay_id
INNER JOIN assay_biomaterial AB ON AB.assay_id = A.assay_id
INNER JOIN biomaterial B ON B.biomaterial_id = AB.biomaterial_id
INNER JOIN analysis AN ON AN.analysis_id = Q.analysis_id;
CREATE MATERIALIZED VIEW mview_phenotype AS
SELECT
o.genus AS organism_genus,
trait.cvterm_id AS trait_id,
trait.name AS trait_name,
proj.project_id AS project_id,
proj.name AS project_name,
method.cvterm_id AS method_id,
method.name AS method_name,
unit.cvterm_id AS unit_id,
unit.name AS unit_name,
loc.value AS location,
yr.value AS year,
s.stock_id AS germplasm_id,
s.name AS germplasm_name,
array_to_json(array_agg(p.value)) AS values
FROM {phenotype} p
LEFT JOIN {cvterm} trait ON trait.cvterm_id=p.attr_id
LEFT JOIN {project} proj USING(project_id)
LEFT JOIN {cvterm} method ON method.cvterm_id=p.assay_id
LEFT JOIN {cvterm} unit ON unit.cvterm_id=p.unit_id
LEFT JOIN {stock} s USING(stock_id)
LEFT JOIN {organism} o ON o.organism_id=s.organism_id
LEFT JOIN {phenotypeprop} loc ON loc.phenotype_id=p.phenotype_id AND loc.type_id = 2944
LEFT JOIN {phenotypeprop} yr ON yr.phenotype_id=p.phenotype_id AND yr.type_id = 141
GROUP BY
trait.cvterm_id,
trait.name,
proj.project_id,
proj.name,
method.cvterm_id,
method.name,
unit.cvterm_id,
unit.name,
loc.value,
yr.value,
s.stock_id,
s.name,
o.genus
The general structure is to query triples of (DATATYPE-PROPERTY,BIOMATERIAL,VALUE). The queries as shown below don't include the WHERE clauses, which are dynamically generated from the arguments of the API call.
Query expression values, with datatype EXP
select feature_uniquename as property, biomaterial_id as biomaterial, signal::text as value from chado.expression_feature_all WHERE ...
```Query phenotype values, with datatype PHEN
select trait_name as property, stock_id as biomaterial , values as value from chado.mview_phenotype WHERE ...
select 'stock_name'as property, stock_id biomaterial , name as value from chado.stock
union
select db.name property, b.stock_id biomaterial , dx.accession as value from chado.stock b, chado.stock_dbxref bdx, chado.dbxref dx, chado.db
where db.db_id=dx.db_id and dx.dbxref_id = bdx.dbxref_id and bdx.stock_id = b.stock_id
union
select 'stock_description' as property, stock_id biomaterial , description as value from chado.stock
union
select db.name property, b.stock_id biomaterial , dx.accession as value from chado.stock b, chado.stock_dbxref bdx, chado.dbxref dx, chado.db
where db.db_id=dx.db_id and dx.dbxref_id = bdx.dbxref_id and bdx.stock_id = b.stock_id
union
select c.name property, b.stock_id biomaterial , bp.value from chado.stock b , chado.stockprop bp , chado.cvterm c
where bp.stock_id = b.stock_id and c.cvterm_id = bp.type_id
Query SNPs, with datatype SNP
Software
| Value | Label |
| --- | --- |
| bcftools | NAME |
| https://samtools.github.io/bcftools | REPOSITORY |
| https://samtools.github.io/bcftools/howtos/install.html | LINK |
| 1.10.2 | VERSION |
In the API module, call bcftools to generate the three files pos_nnnnnn.txt, samples_nnnnnn.txt and call_nnnnnn.txt using the $REGION and $SELECTED_DATASET parameters, and nnnnnn is randomly generated per request. $REGION are the genomic regions from the API argument, or from genomic regions of genes returned by keyword search in <gene_function_search>. $SELECTED_DATASET is the vcf file to use based on the reference and dataset from the API arguments.
select pt.pos as property, case when s2s.samn is null then st.sample else s2s.samn end as biomaterial, ct.gt as value from call_nnnnnn ct, pos_nnnnnn pt, col_nnnnnn st left join chado.mview_srr_prjn_samn_name s2s on s2s.srr=st.sample where ct.colno=st.colno and ct.lineno=pt.lineno
The queries from the different datasets are merged using UNION of (PROPERTY,BIOMATERIAL,VALUE). For sorting and grouping purposes, the DATATYPE column with possible values [ID, PROP, PHEN, EXP or SNP] is prepended for each PROPERTY to identify the row datatype.
First define the colpivot function (https://github.com/hnsl/colpivot) inside Postgres
Software
Value | Label |
---|---|
colpivot | NAME |
https://github.com/hnsl/colpivot | REPOSITORY |
https://github.com/hnsl/colpivot | LINK |
Pivot into table with row headers PROPERTY, column headers BIOMATERIAL, and element values VALUE
WITH_SQL=
UNION_SQL=
select colpivot('biomatexp_stocksample_pivoted_nnnnnn',
'$WITH_SQL $UNION_SQL ' ,
array['property'], array['accession'], '#.value', null)
/* get headers */
SELECT column_name FROM information_schema.columns WHERE table_name = 'biomatexp_stocksample_pivoted_nnnnnn' order by ordinal_position
/* export table */
select distinct * from biomatexp_stocksample_pivoted_nnnnnn order by property
Multiomics API Use-cases
These scripts demonstrate the use of the API to query various datasets and do some plotting and analysis. The web-service API documentation is available at https://icgrc.info/api_doc
The Jupyter notebook ipynb uses the client modules defined in omics_api.py, using functions defined in omics_api_utils.py and the scripts in api-modules.tgz. Static export pages of the notebook on different use cases are available at https://snp.icgrc.info/static/icgrc_omics_demo.html. Module updates are also announced/available at this page between protocol version updates.
icgrc_omics_demo.ipynb omics_api.py omics_api_utils.py api-modules.tgz
Embedded non-Tripal pages
These tools are not part of Tripal, but are instantiated and embedded as Tripal pages in the ICGRC web portal.
Instantiate JBrowse v1 following the manual at https://jbrowse.org/docs/tutorial_classic.html
Software
Value | Label |
---|---|
JBrowse | NAME |
https://jbrowse.org/jbrowse1.html | REPOSITORY |
https://jbrowse.org/jbrowse1.html | LINK |
1 | VERSION |
Load the genome assembly, GFF gene model tracks (from step 2, or 10.2), transcript BAM alignment and count bigwig tracks (step 4)
JBROWSE_DIR=/path/to/jbrowse
${JBROWSE_DIR}/bin/prepare-refseqs.pl --fasta $REF_FASTA
```Load the GFF gene models generated in step 3.2, or downloaded from NCBI RefSeq in step 10.2
Install and configure the Tripal-JBrowse Integration module from [https://tripal-jbrowse.readthedocs.io/](https://tripal-jbrowse.readthedocs.io)
The following modules use the Drupal Iframe module to embed non-Tripal modules into Tripal pages.
Download from https://www.drupal.org/node/297891 into the modules directory, extract and enable the module. Once each tools are setup in their respective directories, reference their URL address in the Iframe definition pages.
Hemp-Seek
Software
Value | Label |
---|---|
SNP-Seek | NAME |
https://bitbucket.org/irridev/iric_portal | REPOSITORY |
https://bitbucket.org/irridev/iric_portal | LINK |
Load VCF files generated in step 6
MapManJS
Software
Value | Label |
---|---|
MapManJS | NAME |
https://usadellab.github.io/MapManJS | REPOSITORY |
https://usadellab.github.io/MapManJS | LINK |
Place the mapping files from step 7.3 into the MapManJS MappingFiles directory, and reference them in ultramicro.html mappings[] options.
mappings['cs10-A.thaliana_RBH']= "./MappingFiles/X4_ENSEMBL39_ISOFORM_ArabidopsisThaliana-cs10pep.out";
mappings['cs10-E.grandis_RBH']="./MappingFiles/Egrandis_201.txt-cs10pep.out";
mappings['cs10-S.lycopersicum_RBH'] ="./MappingFiles/X4_ENSEMBL39_ISOFORM_SolanumLyc-cs10pep.out";
The customized ultramicro.html ultramicro.html