The annotation pipeline for the genome of a snake
Boyang Liu, Liangyu Cui, Yue Ma, Diancheng Yang, Yanan Gong, Yanchun Xu, Shuhui Yang, Song Huang, Zhangwen Deng
Abstract
Here are the detailed methods used for the annotation of various snake genomes.
Before start
Steps
Repeat annotation_de novo
1) Run RepeatModeler to build a de novo library based on the input assembled genome sequence.
2) Using the library constructed in the previous step as the database, run RepeatMasker (v. 3.3.0) to find and classify the repetitive sequences.
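As a rough illustration of the two steps above, the following Python sketch assembles the command lines without running them. The file names (genome.fa, snake_db, snake_repeats.fa) are hypothetical placeholders. Only the options BuildDatabase -name, RepeatModeler -database, and RepeatMasker -lib are used; which further options are available, and where RepeatModeler writes its library, depend on the installed versions, so the library path is passed in explicitly rather than assumed.

```python
from typing import List


def repeat_denovo_commands(
    genome_fasta: str, db_name: str, library_fasta: str
) -> List[List[str]]:
    """Return the de novo repeat annotation commands as argument lists.

    All three arguments are placeholders chosen for this sketch:
    genome_fasta  -- the assembled genome
    db_name       -- name of the RepeatModeler sequence database
    library_fasta -- the repeat library that RepeatModeler produced
    """
    return [
        # Build the sequence database that RepeatModeler works from.
        ["BuildDatabase", "-name", db_name, genome_fasta],
        # Construct the de novo repeat library from that database.
        ["RepeatModeler", "-database", db_name],
        # Search the genome against the custom library.
        ["RepeatMasker", "-lib", library_fasta, genome_fasta],
    ]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    for cmd in repeat_denovo_commands("genome.fa", "snake_db", "snake_repeats.fa"):
        print(" ".join(cmd))
```

Returning argument lists keeps the sketch free of side effects; each list could be passed to subprocess.run once the paths and options have been checked against the local installation.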
Repeat annotation_database
Run TRF (v. 4.09) to identify tandem repeats, and run RepeatMasker and RepeatProteinMask (v. 3.3.0) to identify repeats in the genome at the DNA and protein level, respectively, by aligning sequences against the Repbase library (v. 17.01).
Gene prediction_preparation
Mask the repetitive regions identified in the repeat annotation steps above with 'N's.
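The masking step can be sketched in Python as follows. This is a minimal illustration, not the script used in this protocol: it assumes the repeat coordinates from the different tools have already been collected as 0-based, half-open intervals (tool output with 1-based coordinates would need converting first), and the function names are made up for the example. Overlapping intervals are merged before the bases are replaced with 'N'.

```python
from typing import Iterable, List, Tuple

# Assumed convention for this sketch: 0-based start, exclusive end.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, non-redundant list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def hard_mask(sequence: str, intervals: Iterable[Interval]) -> str:
    """Replace every base inside the given intervals with 'N'."""
    bases = list(sequence)
    for start, end in merge_intervals(intervals):
        # Clip to the sequence bounds so stray coordinates cannot extend it.
        start = max(start, 0)
        end = min(end, len(bases))
        if end > start:
            bases[start:end] = ["N"] * (end - start)
    return "".join(bases)


if __name__ == "__main__":
    # Two overlapping repeat calls, e.g. from two different tools.
    print(hard_mask("ACGTACGTAC", [(2, 5), (4, 7)]))  # prints ACNNNNNTAC
```

Masking leaves the sequence length unchanged, so gene coordinates predicted on the masked genome remain valid on the original assembly.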
Gene prediction_de novo
Run Augustus (v3.0.3) to predict genes de novo in the repeat-masked genome sequences.
Gene prediction_homolog
Download the publicly available protein sequences of representative homologous snake species and align them against the masked genome sequences with BLAT. Then run GeneWise (v2.4.1) on the BLAT mapping results to predict the genes.
Gene prediction_transcriptome
Filter the RNA-seq data using Trimmomatic (v0.30). Assemble the filtered reads with Trinity (v2.13.2). Finally, use PASA (v2.0.2) to align the assembled transcripts against the snake genome of interest to obtain gene structures.
Final gene set_MAKER
Integrate the genes predicted by the de novo, homology-based, and transcriptome-based approaches above to obtain the consensus gene set using the MAKER pipeline (v3.01.03).
Functional annotation
Map the protein sequences of the final gene set to existing databases, such as SwissProt, TrEMBL, KEGG, and InterPro, to identify their functions and motifs.
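One common way to turn such database searches into an annotation is to keep the best-scoring hit for each protein. The Python sketch below illustrates that idea under stated assumptions: the protocol does not name the search tool or its output format, so the 12-column tab-separated layout (query, subject, ..., e-value in column 11, bit score in column 12), the e-value cutoff of 1e-5, and the function name are all choices made for this example.

```python
import io
from typing import Dict, TextIO, Tuple


def best_hits(
    tabular: TextIO, max_evalue: float = 1e-5
) -> Dict[str, Tuple[str, float]]:
    """Return {query: (subject, bit score)} for the top hit of each query.

    Assumes 12 tab-separated columns per line, with the e-value in
    column 11 and the bit score in column 12 (an assumption of this sketch).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in tabular:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit too weak to support an annotation
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


if __name__ == "__main__":
    # Made-up example hits, not real results.
    example = io.StringIO(
        "gene1\tsp|P1|A\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
        "gene1\tsp|P2|B\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250\n"
        "gene2\tsp|P3|C\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
    )
    print(best_hits(example))
```

In this made-up example, gene1 is assigned its higher-scoring hit and gene2 is left unannotated because its only hit fails the cutoff.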