Using EMBL-EBI Services via Web Interface and Programmatically via Web Services

Fábio Madeira, Nandana Madhusoodanan, Joonheung Lee, Alberto Eusebi, Ania Niewielska, Adrian R. N. Tivey, Stuart Meacham, Rodrigo Lopez, Sarah Butcher

Published: 2024-06-10 DOI: 10.1002/cpz1.1065

Abstract

The European Bioinformatics Institute (EMBL-EBI)’s Job Dispatcher framework provides access to a wide range of core databases and analysis tools that are of key importance in bioinformatics. As well as providing web interfaces to these resources, web services are available using REST and SOAP protocols that enable programmatic access and allow their integration into other applications and analytical workflows and pipelines. This article describes the various options available to researchers and bioinformaticians who would like to use our resources via the web interface employing RESTful web services clients provided in Perl, Python, and Java or who would like to use Docker containers to integrate the resources into analysis pipelines and workflows. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : Retrieving data from EMBL-EBI using Dbfetch via the web interface

Alternate Protocol 1 : Retrieving data from EMBL-EBI using WSDbfetch via the REST interface

Alternate Protocol 2 : Retrieving data from EMBL-EBI using Dbfetch via RESTful web services with Python client

Support Protocol 1 : Installing Python REST web services clients

Basic Protocol 2 : Sequence similarity search using FASTA search via the web interface

Alternate Protocol 3 : Sequence similarity search using FASTA via RESTful web services with Perl client

Support Protocol 2 : Installing Perl REST web services clients

Basic Protocol 3 : Sequence similarity search using NCBI BLAST+ RESTful web services with Python client

Basic Protocol 4 : Sequence similarity search using HMMER3 phmmer REST web services with Perl client and Docker

Support Protocol 3 : Installing Docker and running the EMBL-EBI client container

Basic Protocol 5 : Protein functional analysis using InterProScan 5 RESTful web services with the Python client and Docker

Alternate Protocol 4 : Protein functional analysis using InterProScan 5 RESTful web services with the Java client

Support Protocol 4 : Installing Java web services clients

Basic Protocol 6 : Multiple sequence alignment using Clustal Omega via web interface

Alternate Protocol 5 : Multiple sequence alignment using Clustal Omega with Perl client and Docker

Support Protocol 5 : Exploring the RESTful API with OpenAPI User Interface

INTRODUCTION

Since 1995, the European Bioinformatics Institute (EMBL-EBI) has provided access to a wide range of databases and analysis tools using web services technologies (Chojnacki et al., 2017; Li et al., 2015; Madeira et al., 2019; Madeira et al., 2022; Madeira et al., 2024; McWilliam et al., 2013). These comprise services to search, retrieve, and run analysis tools on the databases hosted at the institute and to explore the network of cross-references present in the data, e.g., EBI Search (Madeira et al., 2022; Park et al., 2017; Squizzato et al., 2015; Valentin et al., 2010). In this article, we introduce the reader to services used to retrieve entry data in various data formats and to access the data using specific fields [e.g., Dbfetch (Lopez et al., 2003)] and to analysis tool services, for example, sequence similarity search [SSS; e.g., FASTA (see Current Protocols article: Pearson, 2016; Pearson & Lipman, 1988) and NCBI BLAST+ (Altschul et al., 1997; Camacho et al., 2009; also see Current Protocols article: Ladunga, 2002)], multiple sequence alignment [MSA; e.g., Clustal Omega (see Current Protocols article: Sievers & Higgins, 2014; also see Sievers & Higgins, 2018; Sievers et al., 2011)], and pairwise sequence alignment and protein functional analysis [PFA; e.g., InterProScan (see Current Protocols article: Mulder & Apweiler, 2003; also see Jones et al., 2014)].

In addition to the web service technologies, Job Dispatcher now offers a new website (https://www.ebi.ac.uk/jdispatcher/) that serves as a central gateway for accessing several sequence analysis tool categories. This website simplifies the process of finding and selecting the appropriate tool for users. The homepage displays the status of submitted jobs and enables users to search for analysis results using the job identifier. Furthermore, the homepage provides the latest news, data-release updates, and the list of collaborators (Fig. 1).

Figure 1. Job Dispatcher homepage.

The “Help and Privacy” page offers comprehensive documentation, including links to webinars and other training materials. Additionally, it provides access to the Job Dispatcher's detailed documentation, available from https://www.ebi.ac.uk/jdispatcher/docs/.

To enhance user experience, all tool web forms have been redesigned. The new tool pages no longer offer email notifications. Instead, users can find all submitted jobs on the “Your Jobs” page (https://www.ebi.ac.uk/jdispatcher/recentJobs).

The Representational State Transfer (REST) and Simple Object Access Protocol (SOAP) web services (https://www.ebi.ac.uk/jdispatcher/docs/webservices/) interfaces to these databases and tools allow their integration into other tools, applications, web portals, analysis pipelines, and workflows. Sample clients covering a range of popular bioinformatics programming languages (https://github.com/ebi-jdispatcher/webservice-clients/), a Docker container image with pre-installed sample clients (https://hub.docker.com/r/ebiwp/webservice-clients/), Common Workflow Language (CWL) descriptions for the sample clients (https://github.com/ebi-jdispatcher/webservice-cwl), and examples of usage are provided to help users get started using the EMBL-EBI's Job Dispatcher Sequence Analysis web services.

The following protocols describe how you can retrieve data from EMBL-EBI using Dbfetch (Basic Protocol 1, Alternate Protocols 1 and 2, and Support Protocol 1). You will also learn how to perform sequence similarity search using FASTA, NCBI BLAST+, and HMMER3 phmmer (Basic Protocols 2 to 4, Alternate Protocol 3, and Support Protocols 2 and 3), and how to perform protein functional analysis using InterProScan 5 (Basic Protocol 5, Alternate Protocol 4, and Support Protocol 4), as well as how to perform multiple sequence alignment using Clustal Omega (Basic Protocol 6 and Alternate Protocol 5). Finally, you will learn how to explore and navigate the Job Dispatcher RESTful API (Support Protocol 5).

STRATEGIC PLANNING

The most significant planning issues around the decision to use the SOAP and RESTful web services of EMBL-EBI are detailed below.

Web services have several potential uses over and above normal web interface access to services: offering services behind or together with another service, systematic access to resources, and as a gateway to workflows. Although these needs can also be served by local installation of individual tools and databases, doing so comes with additional burdens on technical support and skills, for example, the requirement of keeping local software and databases up to date, as well as computational and storage burdens. Web services reduce these overheads by allowing a standardized interface to remotely managed servers (at EMBL-EBI in this instance) where the tools and database providers manage the software and database updating and also provide access to large compute resources and the management thereof.

Web services allow for programmatic access to services (for example, using scripts) and are thus suitable for mass/systematic analysis or for using the services as part of a wider workflow or as the backend to another service.

There are some situations where web services are not suitable:

  • Where the analysis is time-critical: the nature of remote services necessarily adds some latency to the process.
  • Where the data cannot leave the local computer/network for any reason: although web services use the secure HTTPS protocol, license restrictions on datasets that you own may prevent their transmission in any form over the Internet.

Although using web services reduces the burden of maintaining software and data, it is important to note that the user still needs to be familiar with the tools used as well as programmatic concepts, though using a graphical workflow tool that interfaces with web services can alleviate some of the programming knowledge required.

Basic Protocol 1: RETRIEVING DATA FROM EMBL-EBI USING Dbfetch VIA THE WEB INTERFACE

In this protocol, we introduce the reader to commonly used biological sequence databases and how to retrieve data from them using services at EMBL-EBI.

A large number of databases store biological data derived from experiments or computation. Many of these experiments aim to determine the order of nucleotides or amino acids (also known as the primary structure) and include methods such as Sanger sequencing (Sanger & Coulson, 1975), next-generation sequencing (NGS; Pettersson et al., 2009) for whole-genome and exome sequencing, peptide sequencing from C- and N-terminal analysis (Edman et al., 1950), Edman degradation (Roberts & Murray, 1976), enzyme digestion (Hernandez et al., 2006), mass spectrometry, and X-ray crystallography of biomolecular structures (Franklin, 1956).

Nucleotide Sequences

The most commonly used nucleotide sequence database is the product of a trilateral agreement between EMBL-EBI, the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ). These form the International Nucleotide Sequence Database Collaboration (INSDC). This collaborative database comprises the European Nucleotide Archive (Silvester et al., 2018; Yuan et al., 2024), GenBank (Benson et al., 2017; Sayers et al., 2024), and the DDBJ (Kodama et al., 2018; Tanizawa et al., 2023). These three centers collect and share data on a daily basis, forming perhaps the largest effort to exchange and share scientific data across the globe.

Genomes

NGS technology has evolved rapidly over recent years. The sequencing speeds afforded by traditional methods had been a limiting factor for obtaining whole genomes. Today, with NGS, it is possible to sequence a human genome in a single day and at a fraction of the cost of the older methods. This has led to an explosion in the number of genomes available for biomedical, agronomical, environmental, and computational research.

The largest collection of these genomes is spread across organism-specific databases, e.g., FlyBase (Gramates et al., 2017; Larkin et al., 2020), WormBase (Davis et al., 2022; Lee et al., 2018; also see Current Protocols article: Schwartz & Sternberg, 2004), SGD (Cherry et al., 2012; also see Current Protocols article: Skrzypek & Hirschman, 2011), Ensembl (Martin et al., 2023; Zerbino et al., 2018; also see Current Protocols article: Wolfsberg, 2007), and Ensembl Genomes (Kersey et al., 2018). Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute and is primarily focused on genomes from vertebrate and other eukaryotic organisms. Ensembl Genomes is based on the Ensembl infrastructure and is divided across five websites that respectively focus on the genomes of bacteria, protists, fungi, plants, and invertebrate metazoa.

Protein Sequences

Amino acid sequences date back to the late 1940s, when Edman and Sanger developed methods for retrieving sequence from purified protein using a combination of biochemical methods. Just as with nucleotide sequences later, collecting and distributing these sequences became a task that would enable researchers to share and de-duplicate effort. The first such database was established in the 1960s by the National Biomedical Research Foundation (NBRF) and was known as the Atlas of Protein Sequence and Structure, published by Margaret Dayhoff. Her group pioneered computational methods for the comparison of protein sequences. The NBRF established the Protein Information Resource (PIR) in 1984 to produce and distribute the PIR–Protein Sequence Database (PIR-PSD; Wu & Nebert, 2004), the first international database that grew out of Dayhoff's Atlas of Protein Sequence and Structure. PIR, EMBL, and the Swiss Institute of Bioinformatics joined efforts to produce a single (and the largest) protein sequence database by unifying the PIR-PSD, TrEMBL, and Swiss-Prot (Bairoch et al., 2004) databases. This is known today as the UniProt Knowledgebase (UniProtKB; UniProt Consortium, 2019, 2023). This service provides access to sequences from multiple sources, including nucleotide translations and protein sequences derived from structures in the Protein Data Bank (PDB), as well as those from the Structural Genomics Consortium (SGC) initiative.

Retrieving Sequences from EMBL-EBI Using Dbfetch

Dbfetch (database fetch; Lopez et al., 2003) is a retrieval system specifically designed to provide a single point of access for biological data spread across multiple resources. Dbfetch has been in operation since 1995 and currently provides unified access to 58 databases (https://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases). Among the databases recently added to Dbfetch are the AlphaFold Protein Structure Database, COVID-19 Data Portal, European Nucleotide Archive (ENA) ribosomal RNA (ENA rRNA) browser, non-human Immunoglobulin-like Receptors (IPD-NHKIR) nucleotide coding sequence (CDS) database, IPD-NHKIR nucleotide genomic database, IPD-NHKIR protein database, PDB in Europe–Knowledge Base (PDBe-KB), and Electron Microscopy Data Bank (EMDB). Dbfetch uses multiple data sources to provide a range of data formats wider than that available from a single source and to mitigate the effect of a single data source being unavailable.

Alternate Protocol 1: RETRIEVING DATA FROM EMBL-EBI USING WSDbfetch VIA THE REST INTERFACE

Dbfetch provides three modes of access to the user. As described above in Basic Protocol 1, one is using a web browser and the CGI interface. The two others make use of data access standards called web services. Web services consist of two protocols, REST and SOAP, that complement each other and can be used to perform various data-retrieval tasks. Like Dbfetch, WSDbfetch (McWilliam et al., 2009) allows the user to retrieve entries. For the developer, the advantage of the REST interface is that it allows the functionality of Dbfetch to be integrated into an application, workflow, or process pipeline. Because the web services technologies are language agnostic, the developer can use the programming language of choice. EMBL-EBI provides fully working example clients written in a variety of common programming languages, including Perl, Python, and Java. These command-line interface (CLI) clients can be downloaded from https://github.com/ebi-jdispatcher/webservice-clients and give full access to the Dbfetch service from the command line (also known as the terminal, shell, or command prompt). The REST clients provide an easier-to-use interface that reports standard HTTP status codes (https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). The REST interface can be operated using a web browser or common web retrieval utilities such as wget and curl. In the following examples, we will use RESTful URLs to demonstrate the WSDbfetch REST interface.

The fundamental syntax of the WSDbfetch REST interface is as follows:

where {db} is the database name (e.g., uniprotkb) and {id} is the identifier (e.g., WAP_MOUSE). The following shows how to fetch the mouse whey acidic protein (WAP) precursor from UniProtKB using the RESTful interface:

As described earlier, Dbfetch provides access to various formats and styles in which to download data. WSDbfetch provides the same functionality. To download WAP_MOUSE in the UniProtKB XML format (uniprotxml), the URL is as follows:

Likewise, to download WAP_MOUSE in UniProtKB flat-file format with HTML hyperlinks, the following URL would be used:
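As a hedged illustration, the three requests described above (default format, a named format, and an HTML style) can be assembled in Python. The base path is an assumption inferred from the Dbfetch database listing URL cited in this article, the placement of the format as a path segment and of the style as a query parameter is likewise an assumption to be checked against https://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp, and the helper name dbfetch_url is hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed base path, inferred from the database listing URL cited in this article.
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, fmt=None, style=None):
    """Build a URL of the assumed form {base}/{db}/{id}[/{format}][?style=...]."""
    parts = [DBFETCH_BASE, quote(db), quote(entry_id)]
    if fmt:
        # Optional data format, e.g. "uniprotxml" or "fasta".
        parts.append(quote(fmt))
    url = "/".join(parts)
    if style:
        # Optional result style, e.g. "html" for hyperlinked output.
        url += "?" + urlencode({"style": style})
    return url


# The three WAP_MOUSE requests discussed in the text.
print(dbfetch_url("uniprotkb", "WAP_MOUSE"))
print(dbfetch_url("uniprotkb", "WAP_MOUSE", fmt="uniprotxml"))
print(dbfetch_url("uniprotkb", "WAP_MOUSE", style="html"))
```

Keeping the URL construction in one small function makes it easy to adjust if the service syntax differs from what is assumed here.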

Dbfetch presently provides access to 58 databases. These are shown in Table 1 along with the acronyms used in Dbfetch and WSDbfetch as the database names.

Table 1. Database Names and Corresponding Database Identifiers Currently Provided in Dbfetch
Database name Database identifier

AlphaFold Database afdb
COVID-19 Data Portal cdp
ChEMBL Targets chembl

EDAM edam
ENA Coding ena_coding
ENA Geospatial ena_geospatial

ENA Non-coding ena_noncoding
ENA rRNA ena_rrna

ENA Sequence ena_sequence
ENA Sequence Constructed ena_sequence_con
ENA Sequence Constructed Expanded ena_sequence_conexp
ENA/SVA ena_sva
Ensembl Gene ensemblgene
Ensembl Genomes Gene ensemblgenomesgene
Ensembl Genomes Transcript ensemblgenomestranscript
Ensembl Transcript ensembltranscript
EPO Proteins epo_prt
HGNC hgnc
IMGT/HLA nucleotide cds imgthlacds
IMGT/HLA nucleotide genomic imgthlagen
IMGT/HLA protein imgthlapro
IMGT/LIGM-DB imgtligm
InterPro interpro
IPD-KIR nucleotide cds ipdkircds
IPD-KIR nucleotide genomic ipdkirgen
IPD-KIR protein ipdkirpro
IPD-MHC nucleotide cds ipdmhccds
IPD-MHC nucleotide genomic ipdmhcgen

IPD-MHC protein ipdmhcpro
IPD-NHKIR nucleotide cds ipdnhkircds
IPD-NHKIR nucleotide genomic ipdnhkirgen
IPD-NHKIR protein ipdnhkirpro

IPRMC iprmc
IPRMC UniParc iprmcuniparc
JPO Proteins jpo_prt
KIPO Proteins kipo_prt
MEDLINE medline
MEROPS-MP mp
MEROPS-MPEP mpep
MEROPS-MPRO mpro
Patent DNA Non Redundant L1 nrnl1
Patent DNA Non Redundant L2 nrnl2
Patent Protein Non Redundant L1 nrpl1
Patent Protein Non Redundant L2 nrpl2
Patent Equivalents patent_equivalents

PDB pdb
PDBe Knowledge Base pdbekb

Electron Microscopy Data Bank emdb
RefSeq nucleotide refseqn
RefSeq protein refseqp
Taxonomy taxonomy
UniParc uniparc
UniProtKB uniprotkb
UniRef100 uniref100
UniRef50 uniref50
UniRef90 uniref90
UniSave unisave
USPTO Proteins uspto_prt

A listing of the available databases with a description of each database, details of the various available data formats, and result styles and example entry identifiers can be found at https://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • The wget utility. On OS X, Linux, and UNIX systems, wget is commonly installed by default. If it is not, it can be installed from the system's package manager or downloaded from https://www.gnu.org/software/wget/. On MS Windows (versions 7SP1 and above, including 10), the iwr command (an alias for Invoke-WebRequest) is built into PowerShell; its syntax is slightly different, and one example is provided below. Alternatively, wget can be obtained through Cygwin (https://cygwin.com/).

Input

  • Database entry identifiers in the format database name:database identifier supported by Dbfetch

1.Retrieve entry into a file.

Note
Using the above URLs with a utility such as wget is quite simple, and building this into a shell or batch language script should be straightforward. The following describes typical command lines using wget and the RESTful interface of WSDbfetch.

Note
To get the nucleotide sequence of human putative G-Protein Coupled Receptor 40 (GPR40), type “Homo sapiens putative GPR40” in the EBI Search (Madeira et al., 2022; Park et al., 2017) search box available at https://www.ebi.ac.uk/ebisearch/. This search returns a list of matching nucleotide sequence entries and relevant cross-reference information. We could then retrieve AF024687, for example. To write it to a file, you would run the following:

Note
A file called AF024687 will be present in the file system after wget finishes. The iwr equivalent is as follows:

2.Retrieve entry into a console or terminal.

Note
Displaying the entry directly in the console (or terminal) is also possible. To do that, use the wget -qO- flag:

3.Retrieve entry annotation.

Note
Retrieving the annotations section of the nucleotide sequence in the above example is done using the following:

4.Retrieve entry FASTA-format sequence.

Note
Examining the above entry, the user will notice that cross-references to Ensembl and UniProtKB are present in the annotation. The identifiers here can be used to obtain these entries. To obtain the protein sequence in FASTA format, use the following:

5.Retrieve entry with cross-references and features.

Note
Retrieve the Ensembl Gene by typing the following:

Note
The default for Ensembl Gene in Dbfetch is to retrieve a sequence in FASTA format. To retrieve annotations with cross-references and features in EMBL format, you can use this:
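For readers who prefer a script to wget or iwr, the retrieval steps above can be sketched with the Python standard library. This is an illustration under stated assumptions: the base path is inferred from the Dbfetch listing URL cited in this article, the set of format names available differs per database (see the listing for each database), and fetch_entry, demo, and the injectable opener argument are hypothetical names introduced here so the function can be exercised without network access.

```python
import sys
import urllib.request

# Assumed base path, inferred from the database listing URL cited in this article.
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def fetch_entry(db, entry_id, fmt=None, out_path=None, opener=urllib.request.urlopen):
    """Fetch one entry; write it to out_path if given, otherwise return it as text."""
    # Skip the format segment when no format is requested (service default applies).
    url = "/".join(part for part in (DBFETCH_BASE, db, entry_id, fmt) if part)
    with opener(url) as response:
        data = response.read()
    if out_path is None:
        return data.decode("utf-8")
    with open(out_path, "wb") as handle:
        handle.write(data)
    return out_path


def demo():
    """Mirror steps 1, 2, and 4 for the example accession (needs network access)."""
    # Step 1: save the entry to a file named after the accession.
    fetch_entry("ena_sequence", "AF024687", out_path="AF024687")
    # Step 2: print the entry to the terminal instead of saving it.
    sys.stdout.write(fetch_entry("ena_sequence", "AF024687"))
    # Step 4: ask for a named format; availability depends on the database.
    sys.stdout.write(fetch_entry("ena_sequence", "AF024687", fmt="fasta"))
```

Calling demo() performs the live requests; nothing is fetched when the module is only imported.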

Alternate Protocol 2: RETRIEVING DATA FROM EMBL-EBI USING Dbfetch VIA RESTful WEB SERVICES WITH PYTHON CLIENT

Dbfetch provides fully working RESTful web services clients (i.e., command-line applications) written in the Perl and Python programming languages. For a full description of the Dbfetch RESTful web services, see https://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • Python (https://www.python.org/) with the xmltramp2 module installed
  • EMBL-EBI Python RESTful web services clients (see Support Protocol 1 for how to download and install)

Input

  • Database entry identifiers in the format database name:database identifier supported by Dbfetch

1.Display client usage. To do so, switch to the directory containing the downloaded Python client dbfetch.py. Run the script without specifying any parameters (or adding --help to the command shown below) to print a brief help message (Fig. 6).

  • python dbfetch.py

Figure 6. Dbfetch Python client displaying help text.

2.Display a list of the databases supported by the service:

  • python dbfetch.py getSupportedDBs

3.Display a list of the available formats associated with a particular database (e.g., uniprotkb):

  • python dbfetch.py getDbFormats uniprotkb

4.Retrieve an entry.

Note
For example, to obtain the protein structure of the hepatocyte-derived nuclear factor 4alpha from the PDB, which is described in the PDB entry 3CBB, enter the following command:

  • python dbfetch.py fetchData pdb:3cbb

5.Get the sequences of all the chains in the structure.

Note
To get the sequences of all the chains in the structure, in FASTA format, enter the following:

  • python dbfetch.py fetchData pdb:3cbb fasta

Note
This returns all four chains in the structure.

6.To get the sequence of a specific chain, instead of all the chains, use the chain identifier as a suffix for the entry identifier.

Note
For the example above, use the following:

  • python dbfetch.py fetchData pdb:3cbb_A fasta

Note
Note that although PDB entry identifiers are not case sensitive, the PDB chain identifiers are. Thus, 3cbb_a and 3cbb_A are not the same.

7.Retrieve a set of entries from a database.

Note
Using the fetchBatch method, a set of entries can be retrieved.

Note
For example, to fetch the sequences from the UniProtKB entries for the rat, mouse, and pig WAP precursors, in FASTA format, enter the following command:

  • python dbfetch.py fetchBatch uniprotkb wap_rat,wap_mouse,wap_pig fasta

Note
Although the UniProtKB entry names are used in this command, these are not stable over time, so it is better whenever possible to use the UniProtKB entry accessions instead, e.g.:

  • python dbfetch.py fetchBatch uniprotkb P01174,P01173,O46655 fasta
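When the client is driven from a pipeline rather than typed by hand, the command lines in the steps above can be assembled in code. The following is a minimal sketch, assuming dbfetch.py sits in the working directory and accepts the positional arguments shown in this protocol; the helper names are hypothetical. The chain suffix is passed through unchanged because chain identifiers are case sensitive.

```python
import subprocess
import sys


def fetch_data_args(db, entry_id, chain=None, fmt=None):
    """Arguments for: dbfetch.py fetchData db:id[_chain] [format]."""
    query = db + ":" + entry_id
    if chain:
        # Chain identifiers are case sensitive, so the value is used as given.
        query += "_" + chain
    args = ["dbfetch.py", "fetchData", query]
    if fmt:
        args.append(fmt)
    return args


def fetch_batch_args(db, entry_ids, fmt=None):
    """Arguments for: dbfetch.py fetchBatch db id1,id2,... [format]."""
    args = ["dbfetch.py", "fetchBatch", db, ",".join(entry_ids)]
    if fmt:
        args.append(fmt)
    return args


def run_client(args):
    """Run the client with the current Python interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable] + args,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if the client exits with an error
    )
    return completed.stdout
```

For example, run_client(fetch_batch_args("uniprotkb", ["P01174", "P01173", "O46655"], "fasta")) corresponds to the accession-based batch command in step 7.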

Support Protocol 1: INSTALLING PYTHON REST WEB SERVICES CLIENTS

Python is commonly used in bioinformatics and typically installed by default on UNIX and UNIX-like systems. Because many existing analytical pipelines are implemented in Python, the Python clients provide an option for integration of web services into existing pipelines.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

1.Check that Python is installed on the system. In the Command Prompt or terminal, enter the following:

  • python --version

If Python is not installed or the current version is not 3.5 or later, then download Python 3 and follow the instructions provided at https://www.python.org/downloads/.

Note
In MS Windows, open a Command Prompt. The procedure to do this varies according to different versions of Windows. In OS X, Linux, or UNIX, open a terminal.

Note
As an alternative, we recommend installing the Anaconda distribution of Python (https://www.anaconda.com/download), as it comes bundled with useful Python modules.

2.Check that the xmltramp2 Python module has been installed:

  • python -c "import xmltramp2"

a. If an error message is returned, install the xmltramp2 module using pip:

  • pip install xmltramp2

b. If pip is not installed, follow the instructions provided at https://pip.pypa.io/en/stable/installation/ on how to install it.

If Python version 2 is also installed on your operating system, pip may be linked to Python 2. After installing Python 3 and pip (for Python 3), replace all of the python and pip commands provided in this article with python3 and pip3. This ensures that you are running Python 3, which has the xmltramp2 dependency installed.

3.Open a web browser and go to the GitHub page for the EMBL-EBI web services clients at https://github.com/ebi-jdispatcher/webservice-clients.

Note
Clients are provided in a number of programming languages and using a variety of web services tool kits. For WSDbfetch, these include Perl and Python clients.

Note
Dependencies and requirements for running each client are detailed on the GitHub page.

4.Download the Python clients.

Note
Download the Python sample client for Dbfetch from https://github.com/ebi-jdispatcher/webservice-clients/tree/master/python:

5.Test and run the clients. Within the Command Prompt or terminal, change to the directory that contains the client program downloaded in step 4. To test the program (e.g., dbfetch.py, retrieving sequences from UniProtKB in FASTA format), enter the following:

  • python dbfetch.py fetchBatch uniprotkb P01174,P01173 fasta

Note
Help information will be displayed with further instructions on usage of the client. If any error message is displayed, read through it to try and identify the problem. If help is still needed, send your query to our help desk using the Feedback link at the top of the page or via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services.
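The prerequisite checks in steps 1 and 2 and the download location in step 4 can also be expressed as a short Python sketch. The function names are hypothetical, and the raw-file URL layout is an assumption derived from the GitHub tree URL given in step 4 (repository ebi-jdispatcher/webservice-clients, branch master, folder python); verify it against the repository page before relying on it.

```python
import importlib.util
import sys

# Host that GitHub uses for raw file downloads.
RAW_BASE = "https://raw.githubusercontent.com"


def python_ok(minimum=(3, 5)):
    """True if the running interpreter is at least the given version (step 1)."""
    return sys.version_info >= minimum


def module_available(name):
    """True if a top-level module such as xmltramp2 can be imported (step 2)."""
    return importlib.util.find_spec(name) is not None


def raw_client_url(script, repo="ebi-jdispatcher/webservice-clients",
                   branch="master", folder="python"):
    """Assumed raw-download URL for a client script listed in step 4."""
    return "/".join([RAW_BASE, repo, branch, folder, script])


if not python_ok():
    print("Python 3.5 or later is required.")
if not module_available("xmltramp2"):
    print("xmltramp2 is missing; install it with: pip install xmltramp2")
print("Client location (assumed):", raw_client_url("dbfetch.py"))
```

Running the checks from inside Python has the side benefit of testing the same interpreter that will later run the client, which avoids confusion when both Python 2 and Python 3 are installed.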

Basic Protocol 2: SEQUENCE SIMILARITY SEARCH USING FASTA SEARCH VIA THE WEB INTERFACE

EMBL-EBI provides and maintains a comprehensive range of freely available analysis tools through web interfaces and web services (Chojnacki et al., 2017; Li et al., 2015; Madeira et al., 2019; Madeira et al., 2022; Madeira et al., 2024; McWilliam et al., 2013). The analysis services include 53 tools, divided into 10 categories. In this protocol, we aim to demonstrate how to run analysis tools and interpret results through the web interface.

Table 2 shows the analysis tools along with the categories and the URLs of their web interfaces. The popular categories include SSS (e.g., NCBI BLAST+ and FASTA), MSA (e.g., Clustal Omega), and PFA (e.g., InterProScan, Phobius).

Table 2. Tools and Categories of the EMBL-EBI Bioinformatics Sequence Analysis Tools
Tool category Tools included Main URL
Sequence similarity search NCBI BLAST+, FASTA, FASTM/S/F, PSI-BLAST, PSI-Search, SSEARCH, GGSEARCH, GLSEARCH, PSI-Search2 https://www.ebi.ac.uk/jdispatcher/sss/
Multiple sequence alignment Clustal Omega, Kalign, MAFFT, MUSCLE, T-Coffee, WebPRANK, MView https://www.ebi.ac.uk/jdispatcher/msa/
Protein function analysis InterProScan 5, Phobius, Pratt, RADAR, HMMER3 hmmscan, HMMER3 phmmer, PfamScan https://www.ebi.ac.uk/jdispatcher/pfa/
Sequence format conversion Seqret, MView https://www.ebi.ac.uk/jdispatcher/sfc/
Phylogeny analysis Simple Phylogeny https://www.ebi.ac.uk/jdispatcher/phylogeny/
Pairwise sequence alignment Needle, Stretcher, Water, Matcher, LALIGN, GeneWise, GGSEARCH2SEQ, SSEARCH2SEQ https://www.ebi.ac.uk/jdispatcher/psa/
RNA analysis Infernal cmscan, MapMi, R2DT https://www.ebi.ac.uk/jdispatcher/rna/
Sequence operation SeqCksum https://www.ebi.ac.uk/jdispatcher/so/
Sequence translation Transeq, Sixpack, Backtranseq, Backtransmbig https://www.ebi.ac.uk/jdispatcher/st/
Sequence Statistics Pepinfo, Pepstats, Pepwindow, SAPS, Cpgplot, Newcpgreport, Isochore, Dotmatcher, Dotpath, Dottup, Polydot https://www.ebi.ac.uk/jdispatcher/seqstats/
EMBOSS tools Needle, Stretcher, Water, Matcher, Transeq, Sixpack, Backtranseq, Backtransmbig, Pepinfo, Pepstats, Pepwindow, Cpgplot, Newcpgreport, Isochore, Seqret, Dotmatcher, Dotpath, Dottup, Polydot https://www.ebi.ac.uk/jdispatcher/emboss/

In the following protocols, we will introduce the most commonly used sequence analysis tools using the web interface and REST web services client programs. EMBL-EBI provides freely available web services for analysis tools (https://www.ebi.ac.uk/jdispatcher/docs/webservices/), which mainly include SSS, MSA, PFA, Phylogeny Analysis, Pairwise Sequence Alignment (PSA), RNA Analysis, Sequence Format Conversion (SFC), Sequence Statistics, Sequence Translation, and Sequence Operations (SO). Basic Protocol 2 demonstrates examples using web services for SSS, PFA, and MSA.

SSS is a method of searching sequence databases by aligning their entries to a query sequence. By statistically assessing how well database and query sequences match, one can infer homology and transfer information to the query sequence. The EMBL-EBI SSS web services include the analysis tools NCBI BLAST+, FASTA, FASTM, PSI-BLAST, and PSI-Search.

We use the FASTA service web interface to run and interpret a FASTA search job. The FASTA package provides a comprehensive set of similarity/homology searching programs, similar to those provided by NCBI BLAST+, and some additional programs for searching with short peptides and oligonucleotides.

Necessary Resources

Hardware

  • Any Internet-connected computer

Software

  • A web browser, e.g., Google Chrome, Mozilla Firefox, Microsoft Edge, Safari, or Opera

Input

  • A plain text file containing a sequence in FASTA, EMBL, GenBank, GCG, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot format. If the file is not available, the entry identifier in the format database name:database identifier, e.g., UniProtKB:GSTM1_MOUSE, can be used as input, or a sequence in one of the formats mentioned above can be pasted into the form.
  • This example uses the mouse protein “Glutathione S-transferase Mu 1” from the UniProtKB database as the input sequence. The entry details can be found at https://www.uniprot.org/uniprotkb/P10649/entry, and the FASTA-format sequence can be downloaded at https://rest.uniprot.org/uniprotkb/P10649.fasta.
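The database name:database identifier input format and the UniProt FASTA download location mentioned above can be handled with a few lines of Python. This is a sketch only: parse_entry and uniprot_fasta_url are hypothetical helper names, and lowercasing the database name is a choice made here to match the lowercase identifiers listed in Table 1, not a documented requirement.

```python
def parse_entry(spec):
    """Split 'database:identifier' into a (database, identifier) pair."""
    db, sep, entry_id = spec.partition(":")
    db, entry_id = db.strip(), entry_id.strip()
    if not sep or not db or not entry_id:
        raise ValueError("expected 'database:identifier', got %r" % spec)
    # Lowercase only the database name; identifiers are left untouched.
    return db.lower(), entry_id


def uniprot_fasta_url(accession):
    """FASTA download URL in the form given in this protocol for P10649."""
    return "https://rest.uniprot.org/uniprotkb/%s.fasta" % accession


print(parse_entry("UniProtKB:GSTM1_MOUSE"))
print(uniprot_fasta_url("P10649"))
```

Rejecting malformed input early gives a clearer error than submitting a job that fails on the server side.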

1.Go to the SSS web page https://www.ebi.ac.uk/jdispatcher/sss/ using a web browser (Fig. 7).

Note
The SSS page allows a user to select different tools to search against databases of proteins and nucleotides.

Figure 7. Screenshot of SSS categories web page.

2.Click “Protein” search under the FASTA section or go directly to https://www.ebi.ac.uk/jdispatcher/sss/fasta/ (Fig. 8).

Note
Job submission on the input form for FASTA is organized into five sections: Databases, Input Sequence, Program, Parameters, and Submit.

Figure 8. FASTA input form.

3.Select the databases to search. From the “Databases” section, click + or - to expand or collapse the available databases under the main database categories. Check or uncheck the boxes of the databases to select the appropriate databases.

Note
Multiple databases can be chosen. In this example, we choose UniProtKB/Swiss-Prot.

4.Enter the input sequence by browsing and selecting the input sequence file. Alternatively, copy the sequence and paste it into the sequence box.

Note
The user can also input the entry accession with the database identifier, e.g., UniProtKB:P10649. Select the correct input sequence type just above the input sequence box. In this example, we paste a protein sequence in FASTA format and select the PROTEIN sequence type.

5.Set the parameters (Fig. 9). To do so, first, select the program to run (FASTA, FASTX, FASTY, SSEARCH, GLSEARCH, or GGSEARCH). Then, click on the “More options” button to expand the section for the advanced parameters (e.g., matrix, gap penalties, ktup, e-values, and output formats). Change the settings of the parameters according to need.

Note
Tooltips (info icons) provide details on each parameter.

Note
In this example, we choose the FASTA program and leave the other parameters as default.

Advanced parameters for FASTA search.

6.Submit the job. To do so, provide a job “Title” (optional) to briefly describe the job and click the “Submit” button.

Note
A pop-up window will confirm your job's submission with a job identifier (JobId) and status. The job will be queued and then run until completed. The new web forms do not have the “notify by email” option; instead, you can access your jobs that are submitted within 7 days from https://www.ebi.ac.uk/jdispatcher/recentJobs. If the information provided in the submission is not correct, the page will show an error message to indicate which parameters need correction.

7.View job result summary (Fig. 10).

Note
The result pages provide multiple views that vary between tools. For FASTA they are as follows: Summary Table, Tool Output, Visual Output, Functional Predictions, Result Files, and Submission Details. The default view is Summary Table. Click on the result tabs to switch between the views.

Note
The Summary Table view lists information about the resulting top hits, including alignment numbers, database and identifier, length, bit score, percentages of identities and positives, E-value, description, and cross-references to other relevant databases. The user can click on the links of identifiers or cross-references to enter external resource pages.

Note
The result page lists only 50 hits at a time, so users can move to the next page to see the next hits.

Note
On the left side, there is an option to filter the results based on the “Organism” facets.

Note
The user can check or uncheck the boxes of alignments in the first column of the table and then click the left-side buttons in this view to show or hide annotations and alignments and to download source data in different formats. The user can also pass the selected sequences on to other tools for further analysis, for example, MSA using Clustal Omega.

FASTA results summary table.

8.Display the tool raw output (Fig. 11) by clicking the “Tool Output” tab.

Note
This page also allows the user to download the raw output in XML format.

FASTA tool output tab.

9.Visualize the result (Fig. 12) by switching to the “Visual Output” view.

Note
The interactive visualization lines up the query sequence and the subject matches with lengths and colors, showing the significance levels of the alignments. You can download the PNG-format image from this page.

Visual output from FASTA search.

10.Display functional predictions (Fig. 13).

Note
A protein search job result will contain the Functional Predictions view, which visualizes functional predictions using InterPro matches. Check or uncheck the boxes for the protein features in order to include features for the visualization, and the image can be downloaded in PNG format.

Functional predictions tab from FASTA search.

11.Click the “Result Files” tab to display all the result files the tool produces.

Note
This includes the actual alignment, a list of alignment accessions and identifiers, the result in XML, and other visual outputs (Fig. 14).

Result files tab from FASTA search.

12.Display the submission details (Fig. 15) using the Submission Details view.

Note
This view shows information about the program and its version, database, job title, date and time for job launch, input and output files, command line executed, and input parameter settings. The user can review these details to decide if the submission is correct and whether a resubmission is needed.

Submission details tab from FASTA search.

Alternate Protocol 3: SEQUENCE SIMILARITY SEARCH USING FASTA VIA RESTful WEB SERVICES WITH PERL CLIENT

Fully working RESTful web services clients written in Perl, Python, and Java programming languages are provided for FASTA SSS.

For a full description of the SSS RESTful web services, see https://github.com/ebi-jdispatcher/webservice-clients.

This protocol uses the Perl CLI client to run a FASTA search via the RESTful web services client.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • Perl (https://www.perl.org/) with the LWP and XML::Simple modules installed (see Support Protocol 2 for how to download and install the EMBL-EBI Perl RESTful web services clients)

Input

  • A plain text file containing a sequence in FASTA, EMBL, GenBank, GCG, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot format, or a database entry supported by EMBL-EBI in the format database name:database identifier (e.g., UniProtKB:GSTM1_MOUSE)

1.Display client usage. To do so, switch to the directory containing the downloaded client program fasta.pl. Run the script without any parameters, or with --help as shown below, to print a brief help message:

  • perl fasta.pl --help

Note
The Perl clients (and the Python and Java clients, for that matter) require an e-mail address for job submission, passed as --email your@email.com. Table 3 lists the general command-line options that all the Perl, Python, and Java clients accept. Table 4 describes the required and optional command-line options that the FASTA web service accepts.

Table 3. Description of General Command-Line Options Found in All Perl, Python, and Java Clients

Option | Type | Description
-h, --help | | Show this help message and exit
--asyncjob | | Forces an asynchronous query
--title | Str | Title for job
--status | | Get job status
--resultTypes | | Get available result types for job
--polljob | | Poll for the status of a job
--pollFreq | Int | Poll frequency in seconds (default 3 s)
--jobid | Str | JobId that was returned when an asynchronous job was submitted
--outfile | Str | File name for results (default is jobid; “-” for STDOUT)
--outformat | Str | Result format(s) to retrieve. It accepts comma-separated values.
--params | | List input parameters
--paramDetail | Str | Display details for input parameter
--quiet | | Decrease output
--verbose | | Increase output
--baseUrl | Str | Base URL. Defaults to https://www.ebi.ac.uk/Tools/services/rest/<tool_name>.

Table 4. Description of Required and Optional Command-Line Options for the FASTA Perl Client

Option | Type | Description
Required (for job submission)
--email | str | E-mail address
--program | str | The FASTA program to be used for the sequence similarity search
--stype | str | Indicates whether the query sequence is protein, DNA, or RNA. Used to force FASTA to interpret the input sequence as the specified type of sequence (via the -p, -n, or -U options); this prevents issues when using nucleotide sequences that contain many ambiguous residues.
--sequence | str | The query sequence can be entered directly into this form. The sequence can be in GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot (Protein only) format. A partially formatted sequence is not accepted. Adding a return to the end of the sequence may help certain applications understand the input. Note that directly using data from word processors may yield unpredictable results, as hidden/control characters may be present.
--database | str | The databases to run the sequence similarity search against. Multiple databases can be used at the same time.
Optional
--matrix | str | (Protein searches) The substitution matrix used for scoring alignments when searching the database. Target identity is the average alignment identity the matrix would produce in the absence of homology and can be used to compare different matrix types. Alignment boundaries are more accurate when the alignment identity matches the target identity percentage.
--match_scores | str | (Nucleotide searches) The match score is the bonus to the alignment score when matching the same base. The mismatch is the penalty when failing to match.
--gapopen | int | Score for the first residue in a gap
--gapext | int | Score for each additional residue in a gap
--hsps | bool | Turn on/off the display of all significant alignments between the query and library sequence
--expupperlim | float | Limits the number of scores and alignments reported based on the expectation value. This is the maximum number of times the match is expected to occur by chance.
--explowlim | float | Limits the number of scores and alignments reported based on the expectation value. This is the minimum number of times the match is expected to occur by chance. This allows closely related matches to be excluded from the results in favor of more distant relationships.
--strand | str | For nucleotide sequences, specify the sequence strand to be used for the search. By default, both upper (provided) and lower (reverse complement of provided) strands are used. For single-stranded sequences, searching with only the upper or lower strand may provide better results.
--hist | bool | Turn on/off the histogram in the FASTA result. The histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by the program.
--scores | int | Maximum number of match score summaries reported in the result output
--alignments | int | Maximum number of match alignments reported in the result output
--scoreformat | str | Different score report formats
--stats | str | The statistical routines assume that the library contains a large sample of unrelated sequences. Options to select what method to use include regression, maximum likelihood estimates, shuffles, or combinations of these.
--annotfeats | bool | Turn on/off annotation features. Annotation features show features from UniProtKB, such as variants, active sites, phospho-sites, and binding sites, that have been found in the aligned region of the database hit. To see the annotation features in the results after this has been enabled, select sequences of interest and click “Show” alignments. This option also enables a new result tab (Domain Diagrams) that highlights domain regions.
--annotsym | str | Specify the annotation symbols
--dbrange | str | Specify the sizes of the sequences in a database to search against. For example, “100-250” will search all sequences in a database with length between 100 and 250 residues, inclusive.
--seqrange | str | Specify a range or section of the input sequence to use in the search. For example, specifying “34-89” in an input sequence of total length of 100 will tell FASTA to only use residues 34-89, inclusive.
--filter | str | Filter regions of low sequence complexity. This can avoid issues with low-complexity sequences where matches are found due to composition rather than meaningful sequence similarity. However, in some cases, filtering also masks regions of interest and so should be used with caution.
--transltable | int | Query genetic code to use in translation
--ktup | int | FASTA uses a rapid word-based lookup strategy to speed the initial phase of the similarity search. The KTUP is used to control the sensitivity of the search. Lower values lead to more sensitive but slower searches.
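
When the Perl client is called from a script, the options in Table 4 can be assembled into an argument list. The following is a minimal Python sketch that checks the options the command-line examples in this protocol always supply (the query itself is given as the final positional argument) and places the query last. build_fasta_command is a hypothetical helper introduced for illustration; it only handles options that take a value, and how on/off options such as --hsps are passed is not covered here.

```python
import shlex

# Options marked as required for job submission in Table 4
# (the query sequence is given as the final positional argument).
REQUIRED_OPTIONS = ("email", "program", "stype", "database")


def build_fasta_command(sequence, script="fasta.pl", **options):
    """Return the argument list for a fasta.pl call (hypothetical helper)."""
    missing = [name for name in REQUIRED_OPTIONS if name not in options]
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))
    command = ["perl", script]
    for name, value in options.items():
        # Every option handled here is written as '--name value'.
        command += ["--" + name, str(value)]
    command.append(sequence)
    return command


if __name__ == "__main__":
    cmd = build_fasta_command(
        "P10649.fasta",
        email="your@email.com",
        program="fasta",
        database="uniprotkb",
        stype="protein",
    )
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd) to execute it
```

Building a list rather than a single string avoids shell quoting problems when the list is handed to subprocess.run.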

2.Display parameter details. To display all parameters of the tool, run

  • perl fasta.pl --params

To see further details of a parameter, run the client with the argument --paramDetail followed by the parameter name. To see which FASTA programs are available, run

  • perl fasta.pl --paramDetail program

To see which FASTA databases are available, run

  • perl fasta.pl --paramDetail database
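
The same parameter information can in principle be requested from the REST service directly. The sketch below builds the request URLs from the default base URL given in Table 3. The resource names parameters and parameterdetails, and the tool name fasta, are assumptions made for illustration; consult the web services documentation before relying on them.

```python
from urllib.request import urlopen

# Default base URL as listed for --baseUrl in Table 3.
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest"


def parameters_url(tool):
    """URL listing the parameters of a tool (resource name is an assumption)."""
    return f"{BASE_URL}/{tool}/parameters"


def parameter_detail_url(tool, name):
    """URL describing one parameter (resource name is an assumption)."""
    return f"{BASE_URL}/{tool}/parameterdetails/{name}"


def fetch_text(url, timeout=30):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires network access; prints the raw service response.
    print(fetch_text(parameter_detail_url("fasta", "program")))
```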

3a. Run jobs in synchronous mode.

Note
The jobs can be run in synchronous mode to retrieve a result as soon as the job is finished or in asynchronous mode (step 3b) to retrieve a result later.

Note
To run a FASTA search, decide which FASTA program to run, the database to search, and the query sequence type. Either a full sequence file or just an entry identifier in the form database name:database identifier can be used as input. Additionally, specify an e-mail address for communication using the web services.

Note
For example, we can search for the mouse protein “Glutathione S-transferase Mu 1” against the UniProtKB database with the FASTA tool. This protein's details can be found at https://www.uniprot.org/uniprotkb/P10649/entry, and the FASTA-format sequence can be downloaded at https://rest.uniprot.org/uniprotkb/P10649.fasta using the instructions above. Save the results in the file P10649.fasta and run the following:

Note
perl fasta.pl --email your@email.com --program fasta --database uniprotkb --stype protein P10649.fasta

Note
Alternatively, if you know the entry identifier of your query sequence, you can run the search using this identifier as the input in the format database name:database identifier:

Note
perl fasta.pl --email your@email.com --program fasta --database uniprotkb --stype protein uniprotkb:gstm1_mouse

Note
In synchronous mode, the program will print out JobId and JobStatus (QUEUED/RUNNING/FINISHED) to the terminal/command prompt until result files are received. The results contain files of input sequence and output files in plain text, XML, JSON, PNG, JPG, and SVG formats.
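
The behavior described for synchronous mode (report the status repeatedly until the job leaves the QUEUED and RUNNING states) can be sketched as a small polling loop. This is an illustrative Python sketch, not the clients' actual implementation; wait_for_job and its get_status argument are hypothetical names, and the 3-s default mirrors the --pollFreq default in Table 3.

```python
import time

PENDING_STATES = ("QUEUED", "RUNNING")


def wait_for_job(get_status, job_id, poll_seconds=3, max_polls=200, sleep=time.sleep):
    """Poll get_status(job_id) until the job is no longer queued or running.

    get_status is any callable returning a status string; it is injected
    so the loop can be tried without contacting the service.
    """
    for _ in range(max_polls):
        status = get_status(job_id)
        print(job_id, status)
        if status not in PENDING_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")


if __name__ == "__main__":
    # Simulated status sequence; "example-job" is a made-up identifier.
    simulated = iter(["QUEUED", "RUNNING", "FINISHED"])
    wait_for_job(lambda job_id: next(simulated), "example-job", poll_seconds=0)
```

Capping the number of polls keeps a script from waiting indefinitely if a job never leaves the queue.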

3b. Run jobs in asynchronous mode.

Note
To retrieve a result later, run jobs in asynchronous mode using the --asyncjob argument:

Note
perl fasta.pl --asyncjob --email your@email.com --program fasta --database uniprotkb --stype protein uniprotkb:gstm1_mouse

Note
If the job submission is successful, the client will provide the JobId in the terminal/command prompt (STDOUT). The user has to use the JobId in the result retrieval. Please see the guidelines section for more information about the composition of the JobId.

Note
To check the job status before getting the results, run the following:

Note
perl fasta.pl --status --jobid <JobId>

Note
The client will print whether the job is QUEUED, RUNNING, ERROR, FAILURE, or FINISHED.

Note
If the job status is FINISHED, get the result types:

Note
perl fasta.pl --resultTypes --jobid <JobId>

Note
The FASTA web services provide result types of plain output (out), plain input (sequence), alignment identifiers (ids), XML result (xml), JSON (json), and other visualization images in SVG, JPG, and PNG formats.

Note
If the user wants to retrieve the result of a specific result type, for example, the plain text output (out), they can enter the following:

Note
perl fasta.pl --polljob --outformat out --jobid <JobId>

Note
One can also pass comma-separated values to --outformat in order to get selected results (e.g., plain output and XML file):

Note
perl fasta.pl --polljob --outformat out,xml --jobid <JobId>

Note
To retrieve all available results, use the following:

Note
perl fasta.pl --polljob --jobid <JobId>

Note
If the job status is RUNNING, please check it again later. In the case of ERROR or FAILURE, please resubmit the job. If you still experience the same issue, please send us a support request via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services, making sure to include the JobId and the error message. In the case of NOT_FOUND, please check the JobId; if the JobId is correct, the job results might have expired, so please resubmit the job.

Note
Please note that the results are stored for only 7 days, and it is recommended to download the results before the job expires.
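
The status handling described in this step can be summarized as a lookup from job status to the next action, together with the 7-day retention period. The following Python sketch is illustrative only; next_action and results_expire_on are hypothetical helpers, and counting the 7 days from the submission time is an assumption.

```python
from datetime import datetime, timedelta

# Results are kept for 7 days; counting from submission is an assumption.
RESULT_RETENTION = timedelta(days=7)

# Next step for each job status, following the notes in this protocol.
NEXT_ACTION = {
    "QUEUED": "check again later",
    "RUNNING": "check again later",
    "FINISHED": "retrieve results",
    "ERROR": "resubmit the job",
    "FAILURE": "resubmit the job",
    "NOT_FOUND": "check the JobId, then resubmit the job",
}


def next_action(status):
    """Return the suggested next step for a job status string."""
    return NEXT_ACTION.get(status.strip().upper(), "unknown status")


def results_expire_on(submitted):
    """Return the time after which results may no longer be available."""
    return submitted + RESULT_RETENTION


if __name__ == "__main__":
    print(next_action("FINISHED"))
    print(results_expire_on(datetime(2024, 6, 10, 9, 0)))
```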

Support Protocol 2: INSTALLING PERL REST WEB SERVICES CLIENTS

Perl is commonly used in bioinformatics and typically installed by default on UNIX and UNIX-like systems. Because many existing analytical pipelines are implemented in Perl, the Perl clients provide an option for integration of EMBL-EBI's web services into existing pipelines.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

1.Check that Perl is installed on the system. To do so, in the Command Prompt or terminal, enter the following:

  • perl --version

If Perl is not installed, download and follow the instructions provided at https://www.perl.org/get.html.

Note
In MS Windows, open a Command Prompt. The procedure to do this varies according to different versions of Windows. In OS X, Linux, or UNIX, open a terminal.

2.Check that the required LWP and XML::Simple Perl modules have been installed:

  • perl -MLWP -e 'print $LWP::VERSION;'

If a “Can't locate LWP.pm” error message is returned, install the LWP Perl module. The LWP Perl module can be installed via the operating system package manager on many Linux/UNIX systems. For example, on Debian-based Linux distributions (e.g., Bio-Linux, Linux Mint, and Ubuntu), the “libwww-perl” package should be installed. The LWP Perl module can also be installed from the Comprehensive Perl Archive Network (CPAN); see https://www.cpan.org/ for details.

  • perl -MXML::Simple -e 'print $XML::Simple::VERSION;'

If a “Can't locate XML/Simple.pm” error message is returned, install the XML::Simple Perl module. The XML::Simple Perl module can be installed via the operating system package manager on many Linux/UNIX systems. For example, on Debian-based Linux distributions (e.g., Bio-Linux, Linux Mint, and Ubuntu), the “libxml-simple-perl” package should be installed. The XML::Simple Perl module can also be installed from the CPAN; see https://www.cpan.org/ for details.

In the Windows Command Prompt, use double quotes in place of the single quotes in the two commands above.
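
The module checks in this step can also be scripted, which is convenient when preparing several machines. The following Python sketch runs the Perl one-liner for each required module; module_check_command and module_version are hypothetical helper names introduced for illustration.

```python
import shutil
import subprocess

REQUIRED_MODULES = ("LWP", "XML::Simple")


def module_check_command(module):
    """Argument list that asks Perl to print the version of a module."""
    return ["perl", f"-M{module}", "-e", f"print ${module}::VERSION;"]


def module_version(module):
    """Return the installed version string, or None if Perl or the module is missing."""
    if shutil.which("perl") is None:
        return None
    result = subprocess.run(
        module_check_command(module), capture_output=True, text=True
    )
    return result.stdout.strip() if result.returncode == 0 else None


if __name__ == "__main__":
    for name in REQUIRED_MODULES:
        print(name, module_version(name) or "not available")
```

Because the command is passed as a list, no shell is involved and the quoting differences between Windows and UNIX-like systems do not arise.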

3.Download the Perl clients (e.g., ncbiblast.pl) from https://github.com/ebi-jdispatcher/webservice-clients. Alternatively, download the NCBI BLAST+ client directly from GitHub with wget :

4.Test and run the client. Within the Command Prompt or terminal, change to the directory that contains the client program downloaded earlier. To test the program (e.g., ncbiblast.pl), enter the following:

  • perl ncbiblast.pl --help

Note
Help information will be displayed with further instructions on usage of the client. If any error message is displayed, read through it to try and identify the problem. If you still need help, send your query to our help desk using the Feedback link at the top of the page or via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services.

Basic Protocol 3: SEQUENCE SIMILARITY SEARCH USING NCBI BLAST+ RESTful WEB SERVICES WITH PYTHON CLIENT

NCBI BLAST+ (Altschul et al., 1997; Camacho et al., 2009; also see Current Protocols article: Ladunga, 2002) emphasizes finding regions of sequence similarity, which will yield functional and evolutionary clues about the structure and function of your novel sequence.

EMBL-EBI provides web services clients written in Perl, Python, and Java programming languages for the NCBI BLAST+ SSS.

For a full description of the RESTful web services, see https://www.ebi.ac.uk/jdispatcher/docs/webservices/.

This protocol uses a Python client program to run NCBI BLAST+ via the RESTful web service CLI client.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • Python (https://www.python.org/) with the xmltramp2 module installed (see Support Protocol 1 for how to download and install the EMBL-EBI Python RESTful web services clients)

Input

  • A plain text file containing a sequence in one of the formats of GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot or a database entry supported by EMBL-EBI in the format database name:database identifier (e.g., embl:x56957)

1.Display client usage by switching to the directory containing the downloaded Python client ncbiblast.py. For details of how to use the client, run the script with --help:

  • python ncbiblast.py --help

2.Display parameter details. To display all parameters of the tool, run

  • python ncbiblast.py --params

To see further details of a parameter, run the client with the argument --paramDetail followed by the parameter name. To see the available BLAST programs, run

  • python ncbiblast.py --paramDetail program

To see the available BLAST databases, run

  • python ncbiblast.py --paramDetail database

3a. Run jobs in synchronous mode.

Note
The jobs can be run in synchronous mode in order to retrieve a result as soon as the job is finished or in asynchronous mode (step 3b) to retrieve a result later. The user can run a job with a sequence file or entry identifier as input. The user also needs to specify an e-mail address for communication in using the web services.

Note
For example, run a BLASTP job against the UniProtKB database with a sequence input file:

Note
python ncbiblast.py --email your@email.com --program blastp --database uniprotkb --stype protein <SequenceFile.fasta>

Note
Alternatively, if you know the entry identifier of your query sequence, you can do the search using this identifier as the input:

Note
python ncbiblast.py --email your@email.com --program blastp --database uniprotkb --stype protein DB:Identifier

Note
The entry identifier should contain the database name and the entry accession, separated by a colon, e.g., UniProtKB:APOE_HUMAN, the human protein (Apolipoprotein E) entry APOE_HUMAN in UniProtKB:

Note
python ncbiblast.py --email your@email.com --program blastp --database uniprotkb --stype protein uniprotkb:apoe_human

Note
In synchronous mode, the program will print out JobId and JobStatus (QUEUED/RUNNING/FINISHED) to the terminal/command prompt until result files are received. The results contain files of input sequence and output files in text, XML, and SVG formats.
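
For readers who want to see what a job submission might look like at the HTTP level, the sketch below prepares (but does not send) a POST request carrying the same fields as the command-line options above. The run resource, the tool name ncbiblast, and the form field names are assumptions based on the client options and the default base URL in Table 3; the provided clients remain the supported route.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest"


def build_run_request(tool, email, program, database, stype, sequence, base_url=BASE_URL):
    """Prepare a job-submission request without sending it.

    The '/run' resource and the field names are assumptions made for
    illustration; they mirror the client's command-line options.
    """
    fields = {
        "email": email,
        "program": program,
        "database": database,
        "stype": stype,
        "sequence": sequence,
    }
    body = urlencode(fields).encode("ascii")
    return Request(f"{base_url}/{tool}/run", data=body, method="POST")


if __name__ == "__main__":
    request = build_run_request(
        "ncbiblast", "your@email.com", "blastp", "uniprotkb", "protein",
        "uniprotkb:apoe_human",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
    # urllib.request.urlopen(request) would submit the job.
```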

3b. Run jobs in asynchronous mode.

Note
If the user wants to retrieve a result later, jobs should be run in asynchronous mode using the --asyncjob argument:

Note
python ncbiblast.py --asyncjob --email your@email.com --program blastp --database uniprotkb --stype protein <SequenceFile.fasta>

Note
If the job submission is successful, the client will provide the JobId in the terminal/command prompt (STDOUT). The user has to use the JobId in the result retrieval. Please see Understanding Results for more information about the composition of the JobId.

Note
To check the job status before getting the results, run the following:

Note
python ncbiblast.py --status --jobid <JobId>

Note
The client will indicate whether the job is QUEUED, RUNNING, ERROR, FAILURE, or FINISHED. If the job status is FINISHED, get the result types:

Note
python ncbiblast.py --resultTypes --jobid <JobId>

Note
The NCBI BLAST+ web services provide result types of plain output (out), plain input (sequence), alignment identifiers (ids), XML result (xml), and other visualization images in SVG and PNG formats.

Note
If the user wants to retrieve the result of a specific result type, for example, the plain text output (out), they can run the following:

Note
python ncbiblast.py --polljob --outformat out --jobid <JobId>

Note
One can also pass comma-separated values to --outformat in order to get selected results (e.g., plain output and XML file):

Note
python ncbiblast.py --polljob --outformat out,xml --jobid <JobId>

Note
To retrieve all available results, use the following:

Note
python ncbiblast.py --polljob --jobid <JobId>

Note
If the job status is QUEUED or RUNNING, check it again later. In the case of ERROR or FAILURE, resubmit the job. If you still experience the same issue, please send us a support request via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services, making sure to include the JobId and the error message. In the case of NOT_FOUND, please check the JobId; if the JobId is correct, the job results might have expired, so please resubmit the job.
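
The comma-separated value accepted by --outformat maps naturally onto one download per result type. The Python sketch below splits such a value and builds one URL per type. The status and result resource names are assumptions made for illustration, and "example-job" is a made-up placeholder rather than a real JobId.

```python
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest"


def status_url(tool, job_id):
    """URL reporting the status of a job (resource name is an assumption)."""
    return f"{BASE_URL}/{tool}/status/{job_id}"


def result_urls(tool, job_id, outformat):
    """Map each requested result type to its download URL.

    outformat is a comma-separated string such as "out,xml", as accepted
    by the --outformat option; the 'result' resource is an assumption.
    """
    result_types = [item.strip() for item in outformat.split(",") if item.strip()]
    return {
        result_type: f"{BASE_URL}/{tool}/result/{job_id}/{result_type}"
        for result_type in result_types
    }


if __name__ == "__main__":
    for result_type, url in result_urls("ncbiblast", "example-job", "out,xml").items():
        print(result_type, url)
```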

Basic Protocol 4: SEQUENCE SIMILARITY SEARCH USING HMMER3 phmmer REST WEB SERVICES WITH PERL CLIENT AND DOCKER

SSS using profile hidden Markov models (HMMs) has become a common practice in biological sequence analysis. Profile HMMs are constructed from a set of related sequences, which can then be used to search large sequence databases. In addition to residue conservation, HMMs also incorporate rates of insertions and deletions. The sensitivity of profile HMMs is achieved by the position-specific probabilistic modeling of the MSA, which allows detection of even distantly related sequences (Eddy, 1998).

HMMER3 (Potter et al., 2018) is a popular software package for detecting sequence homology, comparing a profile HMM to either a single sequence or a database of sequences. HMMER3 phmmer is used to search a database of protein sequences with a protein sequence of interest.

For the full description of the REST web services, see https://www.ebi.ac.uk/jdispatcher/docs/webservices/.

This protocol uses Docker to run a pre-configured container that provides Perl, Python, and Java CLI clients. In this example, we use the Perl client to run HMMER3 phmmer via the REST web service interface.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation, running Docker

Software

  • See Support Protocol 3 for instructions on downloading and installing Docker as well as the ebiwp/webservice-clients image that provides access to pre-configured Perl, Python, and Java REST Web Services CLI Clients

Input

  • A plain text file containing a sequence in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot format or a database entry supported by EMBL-EBI in the format database name:database identifier

1.Display client usage. To do so, call Perl, Python, or Java clients in the ebiwp/webservice-clients Docker image with docker run --rm ebiwp/webservice-clients. To see details of how to use the client and a detailed list of major command-line options, call the client with --help as follows:

  • docker run --rm ebiwp/webservice-clients hmmer3_phmmer.pl --help

2.Display parameter details. To display all parameters of the tool, run

  • docker run --rm ebiwp/webservice-clients hmmer3_phmmer.pl --params

To see further details of a parameter, run the client with the argument --paramDetail followed by the parameter name. To see which databases are available, run

  • docker run --rm ebiwp/webservice-clients hmmer3_phmmer.pl --paramDetail database

3a. Run jobs in synchronous mode.

Note
The user can run jobs in synchronous mode to retrieve results as soon as the job is finished or in asynchronous mode (step 3b) to retrieve results later. Either a full sequence file or just an entry identifier in the form database name:database identifier can be used as input. Additionally, specify an e-mail address for communication using the web services.

Note
To use files stored on the local disk and to get access to results produced by the analysis in the container, the user needs to pass -w /results -v `pwd`:/results as command-line options to the Docker command. This defines /results as the path (in the container) to which the client writes result files. The -v (--volume) mapping gives the container access to the current working directory (`pwd`). See Support Protocol 3 for additional information about running EMBL-EBI clients with Docker.

Note
For example, run a HMMER3 phmmer job against the UniProtKB database with a sequence input file:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients hmmer3_phmmer.pl --email your@email.com --database uniprotkb <SequenceFile.fasta>

Note
Alternatively, if you know the entry identifier of your query sequence, you can run the search using this identifier as input:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients hmmer3_phmmer.pl --email your@email.com --database uniprotkb DB:Identifier

Note
The entry identifier should contain the database name and the identifier, separated by a colon, e.g., UniProtKB:GSTM1_MOUSE, the mouse protein entry GSTM1_MOUSE in UniProtKB.

Note
In synchronous mode, the program will print out JobId and JobStatus (QUEUED/RUNNING/FINISHED) to the screen until result files are received. The results contain files of input sequence and a plain text output file.
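
When the Docker-based client is launched from a script, the working-directory mapping described above can be filled in automatically. The following Python sketch assembles the docker run argument list with the -w and -v options used in this step; docker_client_command is a hypothetical helper introduced for illustration.

```python
import os

IMAGE = "ebiwp/webservice-clients"


def docker_client_command(client, args, workdir=None, image=IMAGE):
    """Build a 'docker run' argument list that maps a host directory to /results.

    workdir defaults to the current working directory, which plays the
    role of `pwd` in the shell commands of this protocol.
    """
    host_dir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run", "--rm",
        "-w", "/results",               # working directory inside the container
        "-v", f"{host_dir}:/results",   # host directory visible to the container
        image, client, *args,
    ]


if __name__ == "__main__":
    command = docker_client_command(
        "hmmer3_phmmer.pl",
        ["--email", "your@email.com", "--database", "uniprotkb", "uniprotkb:gstm1_mouse"],
    )
    print(" ".join(command))  # pass command to subprocess.run(command) to execute it
```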

3b. Run jobs in asynchronous mode.

Note
If you want to retrieve a result later, run jobs in asynchronous mode using the --asyncjob argument:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients hmmer3_phmmer.pl --asyncjob --email your@email.com --database uniprotkb <SequenceFile.fasta>

Note
If the job submission is successful, the client will print the JobId on the screen (STDOUT). The user has to use the JobId in the result retrieval.

Note
To check the job status before getting the results, run the following:

Note
docker run --rm ebiwp/webservice-clients hmmer3_phmmer.pl --status --jobid <JobId>

Note
The client will report whether the job is QUEUED, RUNNING, ERROR, FAILURE, or FINISHED.

Note
If the job status is FINISHED, you can view the possible result types with this command:

Note
docker run --rm ebiwp/webservice-clients hmmer3_phmmer.pl --resultTypes --jobid <JobId>

Note
The HMMER3 phmmer web service provides result types such as plain text output (out), plain input (sequence), alignment identifiers, and accessions.

Note
If the user wants to retrieve one specific result type, for example, the plain text output (out), they can enter the following:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients hmmer3_phmmer.pl --polljob --outformat out --jobid <JobId>

Note
To retrieve all available results, use the following:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients hmmer3_phmmer.pl --polljob --jobid <JobId>

Note
If the job status is RUNNING, check it again later. In the case of ERROR or FAILURE, resubmit the job. If you still experience the same issue, please submit a support request to https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services/ including the JobId and the error message. In the case of NOT_FOUND, check the JobId; if the JobId is correct, the job results might have expired, so please resubmit the job.

Note
Please note that the results are stored for only 7 days, and it is recommended to download the results before the job expires.

Support Protocol 3: INSTALLING DOCKER AND RUNNING THE EMBL-EBI CLIENT CONTAINER

Docker is a software platform based on an operating system–level virtualization technology known as “containerization” that allows users to run pre-configured Docker containers. A Docker container contains software components along with all their dependencies, binaries, libraries, configuration files, scripts, and so forth. Containers are pre-configured and deployed in such a way that the contained programs are run in isolation. This greatly helps with reproducibility because the user does not need to worry about installing and configuring specific versions of software and their dependencies.

A Docker image (ebiwp/webservice-clients) has been developed and is freely available to users. This provides users with pre-installed Perl, Python, and Java EMBL-EBI Web Service CLI Clients. The ebiwp/webservice-clients image can be pulled from the Docker Hub at https://hub.docker.com/r/ebiwp/webservice-clients/.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

1.Install Docker.

Note
Installation instructions are provided at https://docs.docker.com/get-docker/. Follow the instructions provided on how to install Docker and to get a Docker daemon running in your system.

2.List available Docker images:

  • docker image ls

Note
This assumes that Docker has been correctly installed and the Docker daemon is currently running.

Note
If Docker has just been installed, no images are expected to be listed.

3.Download the latest version of the ebiwp/webservice-clients image by pulling its latest tag:

  • docker pull ebiwp/webservice-clients:latest

Note
Please note that a Docker/Docker Hub user account is not required to run the ebiwp/webservice-clients image.

Note
Periodically, the administrators might update the web services clients and therefore the ebiwp/webservice-clients image. To keep your local Docker image up to date, simply run the previous command, which will take care of downloading the updated image layers.

4.Run clients with the ebiwp/webservice-clients container.

Note
As an example, for running the DBfetch Python client, type

Note
docker run --rm ebiwp/webservice-clients dbfetch.py --help

Note
The main Docker command is docker run, which takes as options --rm, responsible for cleaning up after the container has stopped running, followed by the name of the image (ebiwp/webservice-clients) and the name of the client.

Note
Generally speaking, to run CLI clients, one simply needs to run

Note
docker run --rm ebiwp/webservice-clients:latest <client.py|pl|jar> [options …]

Note
Note that the argument perl or python typically passed before the name of the client is not required when running Docker commands for the clients, as these have been set to be executable in the container.

5.Mount local directories to be accessible to the container.

Note
To use files stored on the local disk and to get access to results produced by the analysis in the container, the user needs to pass -w /results -v `pwd`:/results as command-line options to the Docker command. This defines /results as the path (in the container) where result files will be written by the client. The --volume (-v) mapping gives the container access to the current working directory, which the shell substitutes for `pwd`:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients:latest <client.py|pl|jar> [options …]

Note
Alternatively, to use a location other than the current working directory (e.g., /user/username/directory), run the following:

Note
docker run --rm -w /results -v /user/username/directory:/results ebiwp/webservice-clients:latest <client.py|pl|jar> [options …]
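
The docker run invocations described in steps 4 and 5 can also be assembled from a script. The following is a minimal sketch in Python, not part of the EMBL-EBI clients: the function name docker_client_command and its arguments are hypothetical, and only the docker options named in this protocol (--rm, -w, -v) are used. It builds the argument list and prints it rather than running Docker.

```python
import shlex
from pathlib import Path


def docker_client_command(client, client_args=(), workdir=None,
                          image="ebiwp/webservice-clients:latest"):
    """Return the argument list for running an EMBL-EBI client in Docker.

    client      -- client script name, e.g. "dbfetch.py" or "clustalo.pl"
    client_args -- options passed on to the client
    workdir     -- local directory to mount at /results (None = no mount)
    """
    cmd = ["docker", "run", "--rm"]
    if workdir is not None:
        # Use an absolute host path so that -v creates a bind mount.
        host_dir = Path(workdir).resolve()
        cmd += ["-w", "/results", "-v", f"{host_dir}:/results"]
    cmd.append(image)
    cmd.append(client)
    cmd.extend(client_args)
    return cmd


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    print(shlex.join(docker_client_command("dbfetch.py", ["--help"])))
    print(shlex.join(docker_client_command("iprscan5.py", ["--params"], workdir=".")))
```

The returned list can be handed to subprocess.run(cmd, check=True) once Docker is installed and the image has been pulled. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.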

Basic Protocol 5: PROTEIN FUNCTIONAL ANALYSIS USING InterProScan 5 RESTful WEB SERVICES WITH THE PYTHON CLIENT AND DOCKER

InterProScan 5 (Jones et al., 2014) combines different protein signature recognition methods into one resource and allows the user to scan sequences for matches against the InterPro collection of protein signature databases. This example uses Docker and the Python client program to run an InterProScan 5 search via the REST web service interface.

More information about the InterProScan service is available at https://www.ebi.ac.uk/interpro/search/sequence/.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation, running Docker

Software

  • See Support Protocol 3 for instructions on downloading and installing Docker as well as the ebiwp/webservice-clients image that provides access to pre-configured Perl, Python, and Java REST Web Services CLI Clients

Input

  • A plain text file containing a sequence in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot format or a database entry supported by EMBL-EBI in the format database name:database identifier

1.Display client usage. Python clients in the ebiwp/webservice-clients Docker image are called with docker run --rm ebiwp/webservice-clients. To see details of how to use the client, including a list of the major command-line options, run the client without any arguments or, alternatively, with the argument --help:

  • docker run --rm ebiwp/webservice-clients iprscan5.py

Note
Usage help will be shown on screen.

2.Display parameter details. To display all parameters of the tool, run

  • docker run --rm ebiwp/webservice-clients iprscan5.py --params
  • To see the details of a parameter, use the argument --paramDetail followed by the parameter name.
  • To see which applications are available, run
  • docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --paramDetail appl

3a. Run jobs in synchronous mode.

Note
The user can run jobs in synchronous mode to retrieve results as soon as the job is finished or asynchronous mode (step 3b) to retrieve results at a later time. Here, we describe how to run synchronous jobs.

Note
Either a full sequence file or just an entry identifier in the form database name:database identifier can be used as input. Additionally, specify an e-mail address for communication using the web services.

Note
To use files stored on the local disk and to get access to results produced by the analysis in the container, the user needs to pass -w /results -v `pwd`:/results as command-line options to the Docker command. This defines /results as the path (in the container) where result files will be written by the client. The -v (--volume) mapping gives the container access to the current working directory, which the shell substitutes for `pwd`. See Support Protocol 3 for additional information about running EMBL-EBI clients with Docker.

Note
For example, run an InterProScan 5 job using all InterPro applications with a sequence input file:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --email your@email.com <SequenceFile.fasta>

Note
If you know the entry identifier of your query sequence, you can run the search using this identifier as input:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --email your@email.com DB:Identifier

Note
The entry identifier should contain the database name and the entry identifier, separated by a colon, e.g., UniProt:GSTM1_MOUSE, the mouse protein entry GSTM1_MOUSE in UniProtKB.

Note
By default, all applications, GO terms, and pathways are included in the analysis. To exclude pathways and GO terms from the analysis, run

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --email your@email.com --goterms false --pathways false --sequence uniprot:gstm1_mouse

Note
To restrict the analysis to selected applications, give their names as a comma-separated list for the appl parameter (see step 2).

Note
In synchronous mode, the program will print out the JobId and JobStatus (QUEUED/RUNNING/FINISHED) in the terminal/command prompt until result files are received. The results comprise the input sequence file and output files in text, XML, TSV, JSON, GFF, and SVG formats.
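
Because an entry identifier must follow the database name:database identifier convention, it can be worth checking the value before submitting a job. The helper below is an illustrative sketch only; the function name split_entry_id is hypothetical, and the check covers just the colon-separated form described in this step, not whether the database or entry actually exists.

```python
def split_entry_id(entry):
    """Split 'database:identifier' into a (database, identifier) tuple.

    Raises ValueError if the colon or either part is missing.
    """
    database, sep, identifier = entry.strip().partition(":")
    if not sep or not database or not identifier:
        raise ValueError(f"expected 'database:identifier', got {entry!r}")
    return database, identifier


if __name__ == "__main__":
    print(split_entry_id("UniProt:GSTM1_MOUSE"))
```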

3b. Run jobs in asynchronous mode.

Note
If the user wants to retrieve a result later, run jobs in asynchronous mode using the --asyncjob argument:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --asyncjob --email your@email.com uniprot:gstm1_mouse

Note
If the job submission is successful, the client will provide the JobId in the terminal/command prompt (STDOUT). The user has to use the JobId in the result retrieval.

Note
To check the job status before getting the results, run

Note
docker run --rm ebiwp/webservice-clients iprscan5.py --status --jobid <JobId>

Note
The client will print whether the job is QUEUED, RUNNING, ERROR, FAILURE, or FINISHED.

Note
If the job status is FINISHED, get the result types:

Note
docker run --rm ebiwp/webservice-clients iprscan5.py --resultTypes --jobid <JobId>

Note
The InterProScan 5 web services provide result types of plain output (out), plain input (sequence), XML result (xml), GFF output (gff), TSV table (tsv), SVG image (svg), and the JSON output (json).

Note
If the user wants to retrieve a specific result type, for example, the plain text output (out), they can enter the following:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --polljob --outformat out --jobid <JobId>

Note
One can also pass comma-separated values to --outformat in order to get selected results (e.g., plain output and XML file):

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --polljob --outformat out,xml --jobid <JobId>

Note
To retrieve all available results, use the following:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients iprscan5.py --polljob --jobid <JobId>

Note
If the job status is RUNNING, please check it later. In the case of ERROR or FAILURE, resubmit your job. If you still experience issues, please send us a support request via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services including the JobId and the error message. In the case of NOT_FOUND, check the JobId; if the JobId is correct, the job results might have expired, so please resubmit the job.
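
The "check the status again later" advice for asynchronous jobs amounts to a polling loop. The sketch below shows one way to structure such a loop in Python. It is an assumption-laden illustration: wait_for_job is a hypothetical helper, and get_status stands for any callable that returns the status string for a JobId (for example, a wrapper around the client's --status option), which is not provided here.

```python
import time


def wait_for_job(get_status, job_id, interval=5.0, max_checks=120, sleep=time.sleep):
    """Poll get_status(job_id) until the job is no longer QUEUED or RUNNING.

    get_status -- hypothetical callable returning a status string for a JobId
    interval   -- seconds to wait between checks
    max_checks -- give up (TimeoutError) after this many checks
    Returns the final status (e.g. FINISHED, ERROR, FAILURE, NOT_FOUND).
    """
    for _ in range(max_checks):
        status = get_status(job_id).strip().upper()
        if status not in ("QUEUED", "RUNNING"):
            return status
        sleep(interval)
    raise TimeoutError(f"job {job_id} still pending after {max_checks} checks")


if __name__ == "__main__":
    # Simulated status sequence instead of a real service call.
    fake = iter(["QUEUED", "RUNNING", "FINISHED"])
    print(wait_for_job(lambda job_id: next(fake), "example-job", interval=0))
```

Bounding the number of checks keeps a script from waiting indefinitely, and a polling interval of several seconds avoids placing unnecessary load on the service.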

Alternate Protocol 4: PROTEIN FUNCTIONAL ANALYSIS USING InterProScan 5 RESTful WEB SERVICES WITH THE JAVA CLIENT

This example uses a Java client program to run InterProScan 5 search via the REST web service interface.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • Java 8 or later runtime environment (https://www.java.com/)
  • See Support Protocol 4 for instructions on downloading and installing the EMBL-EBI Java RESTful web services clients

Input

  • A plain text file containing a sequence in FASTA, EMBL, GenBank, GCG, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot format or a database entry supported by EMBL-EBI in the format database name:database identifier (e.g., UniProtKB:GSTM1_MOUSE)

1.Display client usage. To do so, switch to the directory containing the downloaded client program iprscan5.jar. For details of how to use the client, run it without any arguments:

  • java -jar iprscan5.jar

Usage help will be shown on the screen. Alternatively, run it with the argument --help:

  • java -jar iprscan5.jar --help

2.Display parameter details. To display all parameters of the tool, run

  • java -jar iprscan5.jar --params
  • To see the details of a parameter, use the argument --paramDetail followed by the parameter name.
  • To see which applications are available, run
  • java -jar iprscan5.jar --paramDetail appl

3a. Run jobs in synchronous mode.

Note
The user can run jobs in synchronous mode to retrieve results as soon as the job is finished or in asynchronous mode (see step 3b) to retrieve results at a later time. Here, we describe how to run synchronous jobs.

Note
To run an InterProScan 5 search, the user has to decide which applications to run. The job can be run with a sequence file or an entry identifier as input. The user also needs to specify an e-mail address for communication when using the web services.

Note
For example, run an InterProScan 5 job using all InterPro applications with a sequence input file:

Note
java -jar iprscan5.jar --email your@email.com <SequenceFile.fasta>

Note
If you know the entry identifier of your query sequence, you can run the search using this identifier as input:

Note
java -jar iprscan5.jar --email your@email.com DB:Identifier

Note
The entry identifier should contain the database name and the entry identifier, separated by a colon, e.g., UniProt:GSTM1_MOUSE, the mouse protein entry GSTM1_MOUSE in UniProtKB.

Note
By default, all applications, GO terms, and pathways are included in the analysis. To exclude pathways and GO terms from the analysis, run

Note
java -jar iprscan5.jar --email your@email.com --nogoterms --nopathways --sequence <SequenceFile.fasta>

Note
In synchronous mode, the program will print out JobId and JobStatus (QUEUED/RUNNING/FINISHED) to the screen until result files are received. The results contain files of input sequence and output in text, XML, SVG, TSV, JSON, and GFF formats.

3b. Run jobs in asynchronous mode.

Note
If the user wants to retrieve a result later, run jobs in asynchronous mode using the --asyncjob argument:

Note
java -jar iprscan5.jar --asyncjob --email your@email.com <SequenceFile.fasta>

Note
If the job submission is successful, the client will print the JobId to the terminal/command prompt (STDOUT). The user has to use the JobId in the result retrieval.

Note
To check the job status before getting the results, run

Note
java -jar iprscan5.jar --status --jobid <JobId>

Note
The client will print whether the job is QUEUED, RUNNING, ERROR, FAILURE, or FINISHED.

Note
If the job status is FINISHED, get the result types:

Note
java -jar iprscan5.jar --resultTypes --jobid <JobId>

Note
The InterProScan 5 web services provide result types of plain output (out), plain input (sequence), XML result (xml), GFF output (gff), TSV table (tsv), SVG image (svg), and JSON output (json).

Note
If the user wants to retrieve a specific result type, for example, the plain text output (out), they can run the following:

Note
java -jar iprscan5.jar --polljob --outformat out --jobid <JobId>

Note
One can also pass comma-separated values to --outformat in order to get selected results (e.g., plain output and XML file):

Note
java -jar iprscan5.jar --polljob --outformat out,xml --jobid <JobId>

Note
To retrieve all available results, run the following:

Note
java -jar iprscan5.jar --polljob --jobid <JobId>

Note
If the job status is RUNNING, please check it later. In the case of ERROR or FAILURE, please resubmit your job. If you still experience issues, please send us a support request via https://www.ebi.ac.uk/support/ including the JobId and the error message. In the case of NOT_FOUND, please check the JobId; if the JobId is correct, the job results might have expired, so please resubmit the job.
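
The recommended response to each job status can be captured in a small lookup table, which is convenient when the client is driven from a script. This Python sketch simply restates the guidance given in this protocol; the names NEXT_ACTION and next_action are hypothetical.

```python
# Recommended follow-up for each status reported by the clients.
NEXT_ACTION = {
    "QUEUED": "wait and check again later",
    "RUNNING": "wait and check again later",
    "FINISHED": "retrieve the results",
    "ERROR": "resubmit the job; contact support if the problem persists",
    "FAILURE": "resubmit the job; contact support if the problem persists",
    "NOT_FOUND": "check the JobId; if it is correct, the results may have expired, so resubmit",
}


def next_action(status):
    """Return the recommended follow-up for a job status string."""
    key = status.strip().upper()
    if key not in NEXT_ACTION:
        raise ValueError(f"unknown job status: {status!r}")
    return NEXT_ACTION[key]


if __name__ == "__main__":
    for name in NEXT_ACTION:
        print(f"{name}: {next_action(name)}")
```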

Support Protocol 4: INSTALLING JAVA WEB SERVICES CLIENTS

Java is commonly installed and provides a platform-independent option for developing and deploying software.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • A Java 1.8 or above runtime environment, see https://www.java.com/
  • A web browser, e.g., Google Chrome, Mozilla Firefox, Microsoft Edge, Safari, or Opera

1.Check that Java is installed in the system. In the Command Prompt or terminal, enter the following:

  • java -version

Note
In MS Windows, open a Command Prompt. The procedure to do this varies according to different versions of Windows. In OS X, Linux, or UNIX, open a terminal.

2.If Java is not installed, download and follow the instructions provided at https://www.java.com/en/download/. If Java is installed in the system but the “java” command is not found, add Java to the PATH to try to solve the issue:

  • a. For MS Windows, check the location used to install Java using Explorer. This will usually be something like C:\Program Files (x86)\Java\jre8. In the Command Prompt, add the location of the Java bin directory to the PATH by entering

set PATH=%PATH%;C:\Program Files (x86)\Java\jre8\bin

The “java” command should now be found.

  • b. On Linux, OS X, and UNIX systems, the method to add a directory to the PATH depends on the shell being used. First locate the Java installation and then add the Java bin directory to the PATH. For example, for a Java installation in /usr/lib/jvm/java-8-openjdk-amd64/, use one of the following commands:

For sh or bash shells:

export PATH=${PATH}:/usr/lib/jvm/java-8-openjdk-amd64/bin

For csh or tcsh shells:

setenv PATH ${PATH}:/usr/lib/jvm/java-8-openjdk-amd64/bin

3.Download the Java clients (e.g., clustalo.jar) from https://github.com/ebi-jdispatcher/webservice-clients. Alternatively, download the Clustal Omega client directly from GitHub with wget :

4.Test and run the clients. To do so, within the Command Prompt or terminal, change to the directory which contains the client program downloaded earlier. To test the program (e.g., clustalo.jar), enter

  • java -jar clustalo.jar --help

Note
Help information will be displayed with further instructions on usage of the client. If any error message is displayed, read through it to try identifying the problem. If you still need help, send your query to our help desk using the Feedback link at the top of the page or via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services.
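
Whether the “java” command can be found on the PATH (steps 1 and 2) can also be checked from a script before launching a client. The following Python sketch uses the standard library function shutil.which; the helper name find_java and its arguments are hypothetical, and the example directory is the one used in step 2.

```python
import os
import shutil


def find_java(extra_dirs=(), path=None):
    """Return the full path of the java executable, or None if not found.

    extra_dirs -- additional directories to search after those on PATH
    path       -- PATH-style string to search (default: the PATH variable)
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # Drop empty entries, then append the extra directories.
    dirs = [d for d in path.split(os.pathsep) if d] + list(extra_dirs)
    return shutil.which("java", path=os.pathsep.join(dirs))


if __name__ == "__main__":
    location = find_java(["/usr/lib/jvm/java-8-openjdk-amd64/bin"])
    print(location or "java not found; add the Java bin directory to the PATH")
```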

Basic Protocol 6: MULTIPLE SEQUENCE ALIGNMENT USING CLUSTAL OMEGA VIA WEB INTERFACE

MSA is generally the alignment of three or more biological sequences. From the output, homology can be inferred and the evolutionary relationships between the sequences studied.

Clustal Omega (Sievers & Higgins, 2018; Sievers et al., 2011; also see Current Protocols article: Sievers & Higgins, 2014) is a fast, large-scale MSA program that uses seeded guide trees and HMM profile-profile techniques to generate alignments.

Necessary Resources

Hardware

  • Any Internet-connected computer

Software

  • A web browser, e.g., Google Chrome, Mozilla Firefox, Microsoft Edge, Safari, or Opera

Input

1.Optional : To view the range of MSA tools available at EMBL-EBI, point the browser to the MSA web page https://www.ebi.ac.uk/jdispatcher/msa.

Note
The MSA page (shown in Fig. 16) allows a user to select between different MSA tools.

MSA tools page.

2.Click “Launch Clustal Omega” under the Clustal Omega section or directly go to https://www.ebi.ac.uk/jdispatcher/msa/clustalo.

Note
Job submission via this page (Fig. 17) is organized into three sections: Input sequence, Parameters and Submit.

Clustal Omega input form.

3.Enter the input sequences by browsing and selecting the input sequences file. Alternatively, copy the sequences and paste them into the sequence box. Select the correct input sequence type just above the input sequence box.

Note
In this example, we paste a set of protein sequences in FASTA format and select the sequence type of PROTEIN.

4.Set the parameters. To do so, first select the output format. To examine further options, click on the “More options” button to expand the section for the advanced parameters, which for Clustal Omega includes options to de-align input sequences, the number of iterations for the guide tree, and HMM stages, among others. Change the settings of the parameters according to needs.

Note
Tooltips (info icons) provide details on each parameter.

Note
In this example, we leave the parameters at their default settings.

5.Provide a job “Title” (optional) to briefly describe the job and click the “Submit” button.

Note
A pop-up window will confirm your job's submission with a JobId and status. The job will be queued and/or run until completed. The new web forms do not have the “notify by email” option; instead, jobs submitted within the previous 7 days can be accessed from https://www.ebi.ac.uk/jdispatcher/recentJobs. If the information provided in the submission is not correct, the page will show an error message indicating which parameters need correction.

6.View the results.

Note
The result pages provide multiple views: Alignments, Tool Output, Guide Tree, Phylogenetic Tree, Result Viewers, Result Files, and Submission Details. The default view is Alignments. Click on the result tabs to switch between views.

Note
The “Alignments” tab provides an interactive view, powered by Nightingale, allowing you to zoom in and out for detailed inspection (Fig. 18). You can also scroll horizontally to see the entire alignment.

Alignments tab from Clustal Omega results.

7.View the actual tool output.

Note
The “Tool Output” tab shows the actual alignment produced by the Clustal Omega program. There are options to download the alignment and to show the alignment in color based on the physicochemical properties of the residues (Fig. 19).

Tool output tab from Clustal Omega.

8.View the guide and phylogenetic tree.

Note
The “Guide Tree” tab shows the guide tree, which is a tree used to determine the order in which pairwise sequence alignments are performed. Switch to the “Phylogenetic Tree” view. This page shows a simple (by default, neighbor-joining) phylogenetic tree calculated from your alignment, and dendrogram visualizations powered by phylotree.js (Shank et al., 2018) are also available.

Note
The first part of the page (Fig. 20) contains the full tree data, which can be downloaded for use in third-party tree-viewer programs.

Note
The second part of the page contains a visualization of the tree data with zoom-in and zoom-out options. The visualization provides several interactive functions, including branch collapsing and re-rooting.

Phylogenetic tree visualization.

9.Display results viewers.

Note
This page allows users to launch Jalview Desktop (Waterhouse et al., 2009) interactively with the alignment, which provides further visualization options. This also allows users to send the alignment to Simple Phylogeny and MView (Fig. 21).

Result viewers tab from Clustal Omega.

10.Display result files.

Note
The “Result Files” tab shows all the result files produced by Clustal Omega, including the alignment file, Percent Identity Matrix (PIM), guide tree, and phylogenetic tree, among others (Fig. 22).

Result files tab from Clustal Omega.

11.Display submission details.

Note
The Submission Details view (Fig. 23) shows information about the program and its version, job title, date and time for job launch, input and output files, command line executed, and input parameter settings. The user can review these details to decide if the submission is correct and if a resubmission is needed.

Submission details tab for Clustal Omega.

Alternate Protocol 5: MULTIPLE SEQUENCE ALIGNMENT USING CLUSTAL OMEGA WITH PERL CLIENT AND DOCKER

This protocol demonstrates a Clustal Omega MSA via web services using a Perl Client with Docker.

For the full description of the Clustal Omega REST web services, see https://www.ebi.ac.uk/jdispatcher/docs/webservices/.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation, running Docker

Software

  • See Support Protocol 3 for instructions on downloading and installing Docker as well as the ebiwp/webservice-clients image that provides access to pre-configured Perl, Python, and Java REST Web Services CLI Clients

Input

  • A plain text file containing three or more sequences in FASTA, EMBL, GCG, PIR, NBRF, PHYLIP, GenBank, or UniProtKB/Swiss-Prot format or three or more database entries supported by EMBL-EBI in the format database name:database identifier
  • This example uses a FASTA-format multiple sequence file containing a collection of myosin sequences. The example file can be downloaded from https://www.ebi.ac.uk/Tools/examples/protein/sequence12.txt.

1.Display client usage by running Perl clients in the ebiwp/webservice-clients Docker image with docker run --rm ebiwp/webservice-clients.

Note
Details of how to use the client and a detailed list of major command-line options are displayed when calling the client as follows:

Note
docker run --rm ebiwp/webservice-clients clustalo.pl

Note
Usage help will be shown on the screen. Alternatively, run it with the argument --help:

Note
docker run --rm ebiwp/webservice-clients clustalo.pl --help

2.Display parameter details. To display all parameters of the tool, run

  • docker run --rm ebiwp/webservice-clients clustalo.pl --params
  • To see further details of a parameter, use the argument --paramDetail followed by the parameter name. For example, to see what input types are available, run
  • docker run --rm ebiwp/webservice-clients clustalo.pl --paramDetail stype

3a. Run jobs in synchronous mode.

Note
The user can run jobs in synchronous mode to retrieve results as soon as the job is finished or in asynchronous mode (see step 3b) to retrieve results later. Either a file containing full sequences or three or more entry identifiers in the form database name:database identifier can be used as input. Additionally, specify an e-mail address for communication when using the web services.

Note
To use files stored on the local disk and to get access to results produced by the analysis in the container, the user needs to pass -w /results -v `pwd`:/results as command-line options to the Docker command. This defines /results as the path (in the container) where result files will be written by the client, and the -v (--volume) mapping gives the container access to the current working directory, which the shell substitutes for `pwd`. See Support Protocol 3 for additional information about running EMBL-EBI clients with Docker.

Note
To run a Clustal Omega alignment, the user has to supply a minimum of an input file containing three or more sequences in the correct format and their e-mail address.

Note
For example,

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients clustalo.pl --email your@email.com sequence12.txt

Note
In synchronous mode, the program will output the JobId and JobStatus (QUEUED/RUNNING/FINISHED) to standard output until result files are received. The results contain files of input sequence and output in text and XML formats.
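
Since Clustal Omega needs an input file with three or more sequences, a quick count of the records in a FASTA file can catch an incomplete input before a job is submitted. The Python sketch below is illustrative only: the function names are hypothetical, and it handles FASTA-format input only, counting header lines that begin with ">".

```python
def count_fasta_records(text):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def has_enough_sequences(text, minimum=3):
    """Return True if the FASTA text holds at least `minimum` records."""
    return count_fasta_records(text) >= minimum


if __name__ == "__main__":
    # Made-up sequences, used purely to demonstrate the check.
    example = ">seq1\nMKVLAA\n>seq2\nMKVLSA\n>seq3\nMRVLAA\n"
    print(count_fasta_records(example), has_enough_sequences(example))
```

For the example file used in this protocol, the text would be read with open("sequence12.txt").read() before calling has_enough_sequences.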

3b. Run jobs in asynchronous mode.

Note
If the user wants to retrieve results at a later time, they should run jobs in asynchronous mode using the --asyncjob argument:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients clustalo.pl --asyncjob --email your@email.com sequence12.txt

Note
If the job submission is successful, the client will provide the JobId in the terminal/command prompt (STDOUT). The user has to use the JobId to retrieve the result.

Note
To check the job status before getting the results, run

Note
docker run --rm ebiwp/webservice-clients clustalo.pl --status --jobid <JobId>

Note
The client will print whether the job status is QUEUED, RUNNING, ERROR, FAILURE, or FINISHED.

Note
If the job status is FINISHED, you can view the available result types with the following:

Note
docker run --rm ebiwp/webservice-clients clustalo.pl --resultTypes --jobid <JobId>

Note
With default options, the Clustal Omega web services provide result types of plain output (out), plain input (sequence), alignment (aln-clustal_num), phylogenetic tree data (phylotree), and the PIM (pim).

Note
If the user wants to retrieve a specific result type, for example, the plain text output (out), they can run the following:

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients clustalo.pl --polljob --outformat out --jobid <JobId>

Note
One can also pass comma-separated values to --outformat in order to get selected results (e.g., plain output and XML file):

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients clustalo.pl --polljob --outformat out,xml --jobid <JobId>

Note
To retrieve all available results, run

Note
docker run --rm -w /results -v `pwd`:/results ebiwp/webservice-clients clustalo.pl --polljob --jobid <JobId>

Note
If the job status is RUNNING, please check it again later. In the case of ERROR or FAILURE, please resubmit your job. If you still experience issues, please contact us via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services and include the JobId and the error message. In the case of NOT_FOUND, check the JobId; if the JobId is correct, the job results might have expired (7 days after submission), so you will need to resubmit the job.

Support Protocol 5: EXPLORING THE RESTful API WITH OpenAPI USER INTERFACE

The EMBL-EBI RESTful web services can be explored with the aid of Swagger OpenAPI User Interface (UI). Swagger UI allows anyone to visualize, interact with, and explore the API's resources and endpoints. Documentation pages for all available bioinformatics web services are provided at https://www.ebi.ac.uk/jdispatcher/docs/webservices/. An accompanying Swagger UI is available for each tool at https://www.ebi.ac.uk/jdispatcher/docs/webservices/#openapi.

Necessary Resources

Hardware

  • An Internet-connected UNIX, Linux, Mac, or Windows workstation

Software

  • A web browser, e.g., Google Chrome, Mozilla Firefox, Microsoft Edge, Safari, or Opera

1.Choose a tool (e.g., FASTA). To do so, head to https://www.ebi.ac.uk/jdispatcher/docs/webservices/#openapi in a web browser. Then, in “STEP 1 - Choose a Tool,” click in the drop-down menu and select FASTA.

Note
As shown in Figure 24, six endpoints are listed: “List available parameters,” “Parameter details,” “Submit job,” “Status,” “Result types,” and “Result.”

Note
By clicking any of the endpoints, the window expands, showing more information about the endpoint.

Note
As an example, clicking “List available parameters” (Fig. 25) shows that the endpoint refers to /parameters, which is a GET method. Clicking on the “Try it out” button will display a curl command (which can be used in a UNIX/Linux terminal or Windows command prompt); a “Request URL”; a “Response Body,” which lists (in the XML body) the parameters available for FASTA; a “Response Code,” typically 200 (HTTP status code for success); and, finally, “Response Headers.”

Note
The “Parameter details” endpoint displays additional information for each of the parameters returned in the Parameters endpoint. This endpoint requires a parameterDetail to be passed.

Note
The “Submit job” endpoint corresponds to a POST method, which means that data are expected to be passed to the web service. As shown in Figure 26, multiple parameters (which are equivalent to command-line options in the Web Services CLI Clients) are displayed. Parameters such as email and sequence are required in order to submit a Job. Optional parameters use default values when not specified.

Note
The “Status” endpoint returns information about the status of a particular job. This endpoint therefore requires a JobId to be passed. Expected statuses are QUEUED, RUNNING, FINISHED, ERROR, and FAILURE.

Note
Once a particular job has finished, the “Result types” endpoint can be called to get a list of valid outputs. This list is provided as a formatted XML.

Note
The “Result” endpoint can then be used passing a valid resultType in order to download any of the result types listed by the previous endpoint.

Screenshot of Swagger UI (RESTful API) for FASTA.
Exploring “List available parameters” API endpoint for FASTA.
Exploring “Submit job” API endpoint for FASTA.
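
The six endpoints described in this protocol can be summarized as a table of HTTP methods and URLs. The Python sketch below builds such a table for a given tool. The base URL and all path segments other than /parameters are assumptions rather than values taken from this article, so they should be checked against the “Request URL” shown in the Swagger UI; the names BASE_URL and endpoints are hypothetical.

```python
# Assumed base URL for the REST services; verify it in the Swagger UI.
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest"


def endpoints(tool, job_id="<jobId>", parameter="<parameterId>",
              result_type="<resultType>"):
    """Return {endpoint name: (HTTP method, URL)} for one tool.

    The path segments below are assumptions to be verified in Swagger UI.
    """
    root = f"{BASE_URL}/{tool}"
    return {
        "List available parameters": ("GET", f"{root}/parameters"),
        "Parameter details": ("GET", f"{root}/parameterdetails/{parameter}"),
        "Submit job": ("POST", f"{root}/run"),
        "Status": ("GET", f"{root}/status/{job_id}"),
        "Result types": ("GET", f"{root}/resulttypes/{job_id}"),
        "Result": ("GET", f"{root}/result/{job_id}/{result_type}"),
    }


if __name__ == "__main__":
    for name, (method, url) in endpoints("fasta").items():
        print(f"{method:4} {url}  ({name})")
```

Only “Submit job” uses POST, because it is the only endpoint that sends data (such as email and sequence) to the service; the remaining endpoints read information about parameters or about an existing job.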

COMMENTARY

Understanding Results

The interpretation of the scientific results from the wide variety of tools that are available through the EMBL-EBI web interface and web services is beyond the scope of this article; however, in this section, we present some of the common outcomes from successful or unsuccessful uses of the services.

When a job is submitted through the web interface (Basic Protocol 2), a quick check on the input is carried out, and only after the data pass this validation check are the data submitted to the compute clusters where the actual request/analysis is executed. This check allows us to reduce the number of invalid submissions to the clusters and allows the user to quickly correct simple errors. If the input check is not passed, an error box appears on the web page with some detail about the error and what action the user can take to correct it (Fig. 27). If the check is passed, a temporary running page will be displayed with the JobId until the results are ready to be viewed (Fig. 28). The unique JobId currently consists of the name of the tool, the method of submission (I, E, R, or S, representing Interactive, Email, REST, and SOAP), the date and time of submission, and, finally, an identifier that is helpful to the administrators, internally relating to the running of jobs on our compute clusters.

Clustal Omega input page showing error message from failed input validation.
Clustal Omega successful submission/job running page.
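
The JobId structure described above (tool name, submission method, date and time, and an internal identifier) can be unpacked with a regular expression. The Python sketch below assumes a layout of tool, a hyphen, the method letter followed by an eight-digit date, a hyphen, a six-digit time, a hyphen, and the internal identifier. That layout and the example JobId are assumptions made for illustration, and the names parse_job_id and JOB_ID_PATTERN are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: <tool>-<method><YYYYMMDD>-<HHMMSS>-<internal identifier>
JOB_ID_PATTERN = re.compile(
    r"^(?P<tool>\w+)-(?P<method>[IERS])(?P<date>\d{8})-(?P<time>\d{6})-(?P<internal>.+)$"
)
METHODS = {"I": "Interactive", "E": "Email", "R": "REST", "S": "SOAP"}


def parse_job_id(job_id):
    """Split a JobId into tool, submission method, timestamp and internal part."""
    match = JOB_ID_PATTERN.match(job_id.strip())
    if match is None:
        raise ValueError(f"unrecognized JobId layout: {job_id!r}")
    submitted = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    return {
        "tool": match["tool"],
        "method": METHODS[match["method"]],
        "submitted": submitted,
        "internal": match["internal"],
    }


if __name__ == "__main__":
    # Made-up JobId, used only to demonstrate the parser.
    print(parse_job_id("clustalo-R20240610-101530-0123-45678901-p1m"))
```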

Causes for failing the validation check are usually simple user mistakes, such as failing to select a database to search against in the case of FASTA or accidentally hitting the “Submit” button before a set of sequences has been uploaded or entered into the input box for Clustal Omega. Errors are also returned when the data input is too large. For popular tools, there are FAQs in the Documentation pages that address common causes of validation check failure.

Unfortunately, passing the quick input validation check does not guarantee that the job will complete successfully, as the underlying tool can still produce an error once it is run. One example is where a user has accidentally truncated the input for an MSA such that sequence file header text now appears in the middle of the sequence data for a different entry (Fig. 29). When we detect that a tool has failed to provide the expected results (or has produced an error), we highlight this to the user in place of the normal results pages and present links that contain as much information as possible to help determine the cause of the error. In the example shown in Figure 30, the error file of EMBOSS Needle gives a message indicating that the input sequence is too large and suggests another tool, EMBOSS Stretcher. When encountering the error page, users should read any error messages from the tool and check their input carefully for errors. If help is still needed, the JobId should be sent to our help desk using the Feedback link at the top of the page or via https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services.

Figure 29. Example input mistake for an MSA. Note how the first two sequences have been merged so that the header information for the second sequence (sp|P01317|) appears as part of the sequence data for the first sequence.
Figure 30. EMBOSS Needle results page for large input sequences.

Attempting to view the results of a job a long time after it was submitted may not succeed, as results are not kept indefinitely; currently, they are deleted after 7 days. Requesting an expired job generates a “job not available” page, as seen in Figure 31. To generate the results again, the user will need to carry out a new job submission.

Figure 31. “Job not available” page for an expired job.

The situation when using web services is similar. Incorrect usage of a command-line client, for example, supplying an incorrect parameter, returns an error such as “Unknown option:”. The user should run the client without any parameters to display correct usage and available parameters. Omission of data required for a job (for example, failing to select a database or supplying an input file for MSA that contains only one sequence) results in an error being passed to the user in exactly the same terms as when the validation check fails on the website; behind the scenes, it is in fact the same check as for the web interface.

Successful web service requests result in a job status of “FINISHED”; this is analogous to the results page being displayed for web interface submissions. Problems with the running of the job (for example, due to server failure) result in a status of “ERROR” or “FAILURE”. Requests for an invalid JobId, either because the ID is incorrect or because the result has expired, return a status of “NOT FOUND” (Fig. 32).

Figure 32. Error message returned when attempting to retrieve an invalid JobId via web services.
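A script that polls for job status can branch on the values described above. The sketch below is illustrative rather than part of the provided clients: `wait_for_job` and its return labels are hypothetical names, the status is obtained through a caller-supplied function (so that no particular endpoint or client call is assumed), any status other than those named in the text is treated as "still in progress", and the underscore spelling `NOT_FOUND` is accepted as an assumption in case the service returns it in that form.

```python
import time
from typing import Callable

FINISHED = {"FINISHED"}
FAILED = {"ERROR", "FAILURE"}
# "NOT FOUND" is the status named in the text; the underscore variant is an assumption.
MISSING = {"NOT FOUND", "NOT_FOUND"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Poll get_status() until the job reaches a terminal state.

    Returns one of: "ready", "failed", "missing", "timeout".
    """
    for _ in range(max_polls):
        status = get_status().strip().upper()
        if status in FINISHED:
            return "ready"      # results can now be retrieved
        if status in FAILED:
            return "failed"     # inspect the error file, then contact the help desk
        if status in MISSING:
            return "missing"    # wrong JobId or expired results: resubmit
        time.sleep(poll_seconds)  # any other status is treated as still in progress
    return "timeout"
```

In practice, `get_status` would wrap whichever client call returns the status text for a given JobId; passing it in as a function keeps the decision logic separate from the transport and makes it easy to test without network access.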

If there is a problem and the tool generates an error, then error files are produced, together with the submitted input and any standard output from the tool (Fig. 33). Error files can be identified by their suffix of .error and contain information about the error. These error files are of particular value when requesting assistance from our help desk. Common causes of errors include incorrect or missing parameters, using input that is incorrectly formatted or unsuitable for the tool, and attempted retrieval of results beyond the period when they are available.

Figure 33. Files created from EMBOSS Needle web services job for large sequence input file from Figure 30. The .error.txt file contains the tool error details. The .out.txt file contains the standard output from the tool. The .sequence.txt file contains the input that was submitted for the job.
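When many jobs are run from a script, it can help to collect the files written for each job and surface the tool's error message automatically. The sketch below assumes that the files sit in one directory and are named `<JobId>.<suffix>`, following the suffixes shown in Figure 33; the function names are hypothetical and the naming scheme should be checked against the files the client actually writes.

```python
from pathlib import Path
from typing import Dict, Optional


def find_job_files(directory: str, job_id: str) -> Dict[str, Path]:
    """Map each file suffix (e.g. "error.txt") to the file written for job_id."""
    prefix = job_id + "."
    found = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            found[path.name[len(prefix):]] = path
    return found


def read_tool_error(directory: str, job_id: str) -> Optional[str]:
    """Return the text of the job's error file, or None if it is absent or empty."""
    error_file = find_job_files(directory, job_id).get("error.txt")
    if error_file is None:
        return None
    text = error_file.read_text(errors="replace").strip()
    return text or None
```

The returned message, together with the JobId, is the information most useful to include when contacting the help desk.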

Note that there are situations when an incorrect analysis has been requested yet the tool appears to run fine, for example, when a search is carried out against a protein database using DNA input. Correct usage would be to employ a tool such as FASTX to translate the DNA input; however, if the user incorrectly uses FASTA, the tool will still run and produce a result of sorts. This is because there are amino acids corresponding to the same single-letter characters used for DNA bases, so the program does not prevent the search. Another example might be the use of an MSA tool, such as Clustal Omega, for situations that it is not designed for, e.g., for pairwise alignment or to align short primers to a longer sequence. In general, if the standalone tool allows an analysis to be carried out, then we attempt to allow it at EMBL-EBI as well; it is up to the user to decide for what purposes they use the tools, and they should examine the results for the unexpected.
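Because the service will not stop a DNA query from being searched against a protein database, a pipeline may want its own guard before submission. The following is a rough heuristic sketch, not something the EMBL-EBI tools perform: the function name, the alphabet `ACGTUN`, and the 0.9 threshold are all assumptions chosen for illustration, and short or compositionally unusual protein sequences could be misclassified.

```python
def looks_like_nucleotide(fasta_text: str, threshold: float = 0.9) -> bool:
    """Guess whether FASTA-formatted input is nucleotide rather than protein.

    Counts the fraction of letters drawn from A, C, G, T, U, and N, ignoring
    header lines, and compares it with an arbitrary threshold.
    """
    letters = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        letters.extend(c for c in line.upper() if c.isalpha())
    if not letters:
        raise ValueError("no sequence residues found")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return nucleotide_like / len(letters) >= threshold


if __name__ == "__main__":
    query = ">example\nATGGCCCTGTGGATGCGCCTCCTGCCC\n"
    if looks_like_nucleotide(query):
        print("Input looks like DNA/RNA: consider a translating search such as FASTX.")
```

A check like this only flags likely mismatches between input type and database; the decision about which tool is appropriate still rests with the user.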

Please note that the results are stored for only 7 days, and it is recommended to download the results before the job expires.

We offer documentation and training courses (https://www.ebi.ac.uk/training/) to educate users on correct usage of the tools, and our help desk is available for further assistance at https://www.ebi.ac.uk/about/contact/support/job-dispatcher-services.

Acknowledgments

The EMBL-EBI services mentioned in this article are supported by core EMBL funding. EMBL-EBI is indebted to its funders, including the EMBL member states.

Author Contributions

Fábio Madeira: Writing—original draft; writing—review and editing. Nandana Madhusoodanan: Writing—original draft; writing—review and editing. Joonheung Lee: Writing—original draft; writing—review and editing. Alberto Eusebi: Writing—review and editing. Ania Niewielska: Writing—review and editing. Adrian R. N. Tivey: Writing—review and editing. Stuart Meacham: Writing—review and editing. Rodrigo Lopez: Writing—review and editing. Sarah Butcher: Writing—review and editing.

Conflict of Interest

The authors declare no conflict of interest.

Open Research

Data Availability Statement

The data are openly available in a public repository that does not issue DOIs.

Literature Cited

  • Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research , 25(17), 3389–3402. https://doi.org/10.1093/nar/25.17.3389
  • Bairoch, A., Boeckmann, B., Ferro, S., & Gasteiger, E. (2004). Swiss-Prot: Juggling between evolution and stability. Briefings in Bioinformatics , 5, 39–55. https://doi.org/10.1093/bib/5.1.39
  • Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2017). GenBank. Nucleic Acids Research , 45, D37–D42. https://doi.org/10.1093/nar/gkw1070
  • Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics , 10, 421. https://doi.org/10.1186/1471-2105-10-421
  • Cherry, J. M., Hong, E. L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E. T., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hirschman, J. E., Hitz, B. C., Karra, K., Krieger, C. J., Miyasato, S. R., Nash, R. S., Park, J., Skrzypek, M. S., … Wong, E. D. (2012). Saccharomyces genome database: The genomics resource of budding yeast. Nucleic Acids Research , 40, D700–D705. https://doi.org/10.1093/nar/gkr1029
  • Chojnacki, S., Cowley, A., Lee, J., Foix, A., & Lopez, R. (2017). Programmatic access to bioinformatics tools from EMBL-EBI update: 2017. Nucleic Acids Research , 45, W550–W553. https://doi.org/10.1093/nar/gkx273
  • Davis, P., Zarowiecki, M., Arnaboldi, V., Becerra, A., Cain, S., Chan, J., Chen, W. J., Cho, J., da Veiga Beltrame, E., Diamantakis, S., Gao, S., Grigoriadis, D., Grove, C. A., Harris, T. W., Kishore, R., Le, T., Lee, R. Y. N., Luypaert, M., Müller, H. M., … Sternberg, P. W. (2022). WormBase in 2022-data, processes, and tools for analyzing Caenorhabditis elegans. Genetics , 220(4), iyac003. https://doi.org/10.1093/genetics/iyac003
  • Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics , 14, 755–763. https://doi.org/10.1093/bioinformatics/14.9.755
  • Edman, P., Högfeldt, E., Sillén, L. G., & Kinell, P.-O. (1950). Method for determination of the amino acid sequence in peptides. Acta Chemica Scandinavica , 4, 283–293. https://doi.org/10.3891/acta.chem.scand.04-0283
  • Franklin, R. E. (1956). Structure of tobacco mosaic virus: Location of the ribonucleic acid in the tobacco mosaic virus particle. Nature , 177, 928–930. https://doi.org/10.1038/177928b0
  • Gramates, L. S., Marygold, S. J., Santos, G. D., Urbano, J. M., Antonazzo, G., Matthews, B. B., Rey, A. J., Tabone, C. J., Crosby, M. A., Emmert, D. B., Falls, K., Goodman, J. L., Hu, Y., Ponting, L., Schroeder, A. J., Strelets, V. B., Thurmond, J., Zhou, P., & the FlyBase Consortium. (2017). FlyBase at 25: Looking to the future. Nucleic Acids Research , 45, D663–D671. https://doi.org/10.1093/nar/gkw1016
  • Hernandez, P., Müller, M., & Appel, R. D. (2006). Automated protein identification by tandem mass spectrometry: Issues and strategies. Mass Spectrometry Reviews , 25, 235–254. https://doi.org/10.1002/mas.20068
  • Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A. F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S. Y., Lopez, R., & Hunter, S. (2014). InterProScan 5: Genome-scale protein function classification. Bioinformatics , 30, 1236–1240. https://doi.org/10.1093/bioinformatics/btu031
  • Kersey, P. J., Allen, J. E., Allot, A., Barba, M., Boddu, S., Bolt, B. J., Carvalho-Silva, D., Christensen, M., Davis, P., Grabmueller, C., Kumar, N., Liu, Z., Maurel, T., Moore, B., McDowall, M. D., Maheswari, U., Naamati, G., Newman, V., Ong, C. K., … Yates, A. (2018). Ensembl Genomes 2018: An integrated omics infrastructure for non-vertebrate species. Nucleic Acids Research , 46, D802–D808. https://doi.org/10.1093/nar/gkx1011
  • Kodama, Y., Mashima, J., Kosuge, T., Kaminuma, E., Ogasawara, O., Okubo, K., Nakamura, Y., & Takagi, T. (2018). DNA data bank of Japan: 30th anniversary. Nucleic Acids Research , 46, D30–D35. https://doi.org/10.1093/nar/gkx926
  • Ladunga, I. (2002). Finding homologs to nucleotide sequences using network BLAST searches. Current Protocols in Bioinformatics , 00, 3.3.1–3.3.25. https://doi.org/10.1002/0471250953.bi0303s00
  • Larkin, A., Marygold, S. J., Antonazzo, G., Attrill, H., Dos Santos, G., Garapati, P. V., Goodman, J. L., Gramates, L. S., Millburn, G., Strelets, V. B., Tabone, C. J., Thurmond, J., & FlyBase Consortium (2020). FlyBase: Updates to the Drosophila melanogaster knowledge base. Nucleic Acids Research , 49, D899–D907. https://doi.org/10.1093/nar/gkaa1026
  • Lee, R. Y. N., Howe, K. L., Harris, T. W., Arnaboldi, V., Cain, S., Chan, J., Chen, W. J., Davis, P., Gao, S., Grove, C., Kishore, R., Muller, H. M., Nakamura, C., Nuin, P., Paulini, M., Raciti, D., Rodgers, F., Russell, M., Schindelman, G., … Sternberg, P. W. (2018). WormBase 2017: Molting into a new stage. Nucleic Acids Research , 46, D869–D874. https://doi.org/10.1093/nar/gkx998
  • Li, W., Cowley, A., Uludag, M., Gur, T., McWilliam, H., Squizzato, S., Park, Y. M., Buso, N., & Lopez, R. (2015). The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Research , 43, W580–W584. https://doi.org/10.1093/nar/gkv279
  • Lopez, R., Duggan, K., Harte, N., & Kibria, A. (2003). Public services from the European Bioinformatics Institute. Briefings in Bioinformatics , 4, 332–340. https://doi.org/10.1093/bib/4.4.332
  • Madeira, F., Madhusoodanan, N., Lee, J., Eusebi, A., Niewielska, A., Tivey, A. R. N., Lopez, R., & Butcher, S. (2024). The EMBL-EBI Job Dispatcher sequence analysis tools framework in 2024. Nucleic Acids Research , gkae241. https://doi.org/10.1093/nar/gkae241
  • Madeira, F., Park, Y. M., Lee, J., Buso, N., Gur, T., Madhusoodanan, N., Basutkar, P., Tivey, A. R. N., Potter, S. C., Finn, R. D., & Lopez, R. (2019). The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Research , 47, W597–600. https://doi.org/10.1093/nar/gkz268
  • Madeira, F., Pearce, M., Tivey, A. R. N., Basutkar, P., Lee, J., Edbali, O., Madhusoodanan, N., Kolesnikov, A., & Lopez, R. (2022). Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Research , 50, W276–W279. https://doi.org/10.1093/nar/gkac240
  • Martin, F. J., Amode, M. R., Aneja, A., Austine-Orimoloye, O., Azov, A. G., Barnes, I., Becker, A., Bennett, R., Berry, A., Bhai, J., Bhurji, S. K., Bignell, A., Boddu, S., Branco Lins, P. R., Brooks, L., Ramaraju, S. B., Charkhchi, M., Cockburn, A., Da Rin Fiorretto, L., … Flicek, P. (2023). Ensembl 2023. Nucleic Acids Research , 51, D933–D941. https://doi.org/10.1093/nar/gkac958
  • McWilliam, H., Li, W., Uludag, M., Squizzato, S., Park, Y. M., Buso, N., Cowley, A. P., & Lopez, R. (2013). Analysis tool web services from the EMBL-EBI. Nucleic Acids Research , 41, W597–600. https://doi.org/10.1093/nar/gkt376
  • McWilliam, H., Valentin, F., Goujon, M., Li, W., Narayanasamy, M., Martin, J., Miyar, T., & Lopez, R. (2009). Web services at the European Bioinformatics Institute-2009. Nucleic Acids Research , 37, W6–W10. https://doi.org/10.1093/nar/gkp302
  • Mulder, N. J., & Apweiler, R. (2003). The InterPro database and tools for protein domain analysis. Current Protocols in Bioinformatics , 2, 2.7.1–2.7.19. https://doi.org/10.1002/0471250953.bi0207s02
  • Park, Y. M., Squizzato, S., Buso, N., Gur, T., & Lopez, R. (2017). The EBI search engine: EBI search as a service—Making biological data accessible for all. Nucleic Acids Research , 45, W545–W549. https://doi.org/10.1093/nar/gkx359
  • Pearson, W. R. (2016). Finding protein and nucleotide similarities with FASTA. Current Protocols in Bioinformatics , 53, 3.9.1–3.9.25. https://doi.org/10.1002/0471250953.bi0309s53
  • Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences , 85, 2444–2448. https://doi.org/10.1073/pnas.85.8.2444
  • Pettersson, E., Lundeberg, J., & Ahmadian, A. (2009). Generations of sequencing technologies. Genomics , 93, 105–111. https://doi.org/10.1016/j.ygeno.2008.10.003
  • Potter, S. C., Luciani, A., Eddy, S. R., Park, Y., Lopez, R., & Finn, R. D. (2018). HMMER web server: 2018 update. Nucleic Acids Research , 46, W200–W204. https://doi.org/10.1093/nar/gky448
  • Roberts, R. J., & Murray, K. (1976). Restriction endonuclease. Critical Reviews in Biochemistry and Molecular Biology , 4, 123–164. https://doi.org/10.3109/10409237609105456
  • Sanger, F., & Coulson, A. R. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology , 94, 441–448. https://doi.org/10.1016/0022-2836(75)90213-2
  • Sayers, E. W., Cavanaugh, M., Clark, K., Pruitt, K. D., Sherry, S. T., Yankie, L., & Karsch-Mizrachi, I. (2024). GenBank 2024 update. Nucleic Acids Research , 52(D1), D134–D137. https://doi.org/10.1093/nar/gkad903
  • Schwartz, E. M., & Sternberg, P. W. (2004). Searching WormBase for information about Caenorhabditis elegans. Current Protocols in Bioinformatics , 6, 1.8.1–1.8.44. https://doi.org/10.1002/0471250953.bi0108s6
  • Shank, S. D., Weaver, S., & Pond, S. L. K. (2018). phylotree.js—A JavaScript library for application development and interactive data visualization in phylogenetics. BMC Bioinformatics , 19, 276. https://doi.org/10.1186/s12859-018-2283-2
  • Sievers, F., & Higgins, D. G. (2014). Clustal Omega. Current Protocols in Bioinformatics , 48, 3.13.1–3.13.16. https://doi.org/10.1002/0471250953.bi0313s48
  • Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science , 27, 135–145. https://doi.org/10.1002/pro.3290
  • Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., & Higgins, D. G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology , 7, 539. https://doi.org/10.1038/msb.2011.75
  • Silvester, N., Alako, B., Amid, C., Cerdeño-Tarrága, A., Clarke, L., Cleland, I., Harrison, P. W., Jayathilaka, S., Kay, S., Keane, T., Leinonen, R., Liu, X., Martínez-Villacorta, J., Menchi, M., Reddy, K., Pakseresht, N., Rajan, J., Rossello, M., Smirnov, D., … Cochrane, G. (2018). The European nucleotide archive in 2017. Nucleic Acids Research , 46, D36–D40. https://doi.org/10.1093/nar/gkx1125
  • Skrzypek, M. S., & Hirschman, J. (2011). Using the Saccharomyces Genome Database (SGD) for analysis of genomic information. Current Protocols in Bioinformatics , 35, 1.20.1–1.20.23. https://doi.org/10.1002/0471250953.bi0120s35
  • Squizzato, S., Park, Y. M., Buso, N., Gur, T., Cowley, A., Li, W., Uludag, M., Pundir, S., Cham, J. A., McWilliam, H., & Lopez, R. (2015). The EBI Search engine: Providing search and retrieval functionality for biological data from EMBL-EBI. Nucleic Acids Research , 43, W585–W588. https://doi.org/10.1093/nar/gkv316
  • Tanizawa, Y., Fujisawa, T., Kodama, Y., Kosuge, T., Mashima, J., Tanjo, T., & Nakamura, Y. (2023). DNA Data Bank of Japan (DDBJ) update report 2022. Nucleic Acids Research , 51, D101–D105. https://doi.org/10.1093/nar/gkac1083
  • UniProt Consortium. (2019). UniProt: A worldwide hub of protein knowledge. Nucleic Acids Research , 47(D1), D506–D515. https://doi.org/10.1093/nar/gky1049
  • UniProt Consortium. (2023). UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Research , 51, D523–D531. https://doi.org/10.1093/nar/gkac1052
  • Valentin, F., Squizzato, S., Goujon, M., McWilliam, H., Paern, J., & Lopez, R. (2010). Fast and efficient searching of biological data resources-using EB-eye. Briefings in Bioinformatics , 11, 375–384. https://doi.org/10.1093/bib/bbp065
  • Waterhouse, A. M., Procter, J. B., Martin, D. M. A., Clamp, M., & Barton, G. J. (2009). Jalview version 2-A multiple sequence alignment editor and analysis workbench. Bioinformatics , 25, 1189–1191. https://doi.org/10.1093/bioinformatics/btp033
  • Wolfsberg, T. G. (2007). Using the NCBI map viewer to browse genomic sequence data. Current Protocols in Bioinformatics , 16, 1.5.1–1.5.22. https://doi.org/10.1002/0471250953.bi0105s16
  • Wu, C., & Nebert, D. W. (2004). Update on genome completion and annotations: Protein information resource. Human Genomics , 1, 229–233. https://doi.org/10.1186/1479-7364-1-3-229
  • Yuan, D., Ahamed, A., Burgin, J., Cummins, C., Devraj, R., Gueye, K., Gupta, D., Gupta, V., Haseeb, M., Ihsan, M., Ivanov, E., Jayathilaka, S., Kadhirvelu, V. B., Kumar, M., Lathi, A., Leinonen, R., McKinnon, J., Meszaros, L., O'Cathail, C., … Cochrane, G. (2024). The European nucleotide archive in 2023. Nucleic Acids Research , 52(D1), D92–D97. https://doi.org/10.1093/nar/gkad1067
  • Zerbino, D. R., Achuthan, P., Akanni, W., Amode, M. R., Barrell, D., Bhai, J., Billis, K., Cummins, C., Gall, A., Girón, C. G., Gil, L., Gordon, L., Haggerty, L., Haskell, E., Hourlier, T., Izuogu, O. G., Janacek, S. H., Juettemann, T., To, J. K., … Flicek, P. (2018). Ensembl 2018. Nucleic Acids Research , 46, D754–D761. https://doi.org/10.1093/nar/gkx1098
