GENESEQ, USGENE, PATGENE: Sequence Similarity Searching with BLAST or GETSIM


The BLAST® and GETSIM run packages are available for searching the GENESEQ, USGENE, and PATGENE databases for protein and nucleotide sequences by similarity (homology). BLAST (version 2.12.0) is provided with the permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). For further information, please refer to: https://blast.ncbi.nlm.nih.gov/doc/blast-help/. GETSIM uses the FASTA algorithm (version 36.3.8h). A detailed user guide is provided here.

Nucleotide and protein sequences can be subjected to a similarity search using the RUN BLAST or RUN GETSIM command. A sequence may be entered directly on the command line (maximum 278 characters), or it may be uploaded via the “Structures” page, which is explained here, and then referenced by its L-number. The L-number may also derive from a previous sequence search in another STN database with biosequence search capabilities, e.g., the CAS REGISTRY℠ file.

The minimum length of a sequence query depends on the BLAST parameters used (especially the word size). For BLAST default parameter values the minimum query length is 6 for /SQP and 9 for /SQN. Sequence queries uploaded from .txt files can be up to 30,000 characters in length. The uploaded sequence can be displayed with D LQUE.
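The length limits above can be checked before a query is submitted. The following Python sketch is illustrative only: the function name `check_query` and its return format are hypothetical, and the minimum lengths apply to the BLAST default parameters quoted above, not to other word sizes.

```python
# Limits quoted in the text above (BLAST default parameters only).
MIN_QUERY_LENGTH = {"/SQP": 6, "/SQN": 9}
MAX_COMMAND_LINE = 278   # characters when typed on the command line
MAX_UPLOAD = 30_000      # characters when uploaded from a .txt file


def check_query(sequence, search_code, uploaded=False):
    """Return a list of problems; an empty list means the length looks acceptable."""
    problems = []
    length = len(sequence)
    minimum = MIN_QUERY_LENGTH.get(search_code.upper())
    if minimum is not None and length < minimum:
        problems.append(f"too short for {search_code}: {length} < {minimum}")
    limit = MAX_UPLOAD if uploaded else MAX_COMMAND_LINE
    if length > limit:
        problems.append(f"too long: {length} > {limit}")
    return problems
```

For example, `check_query("ACGT", "/SQN")` reports that the query is shorter than the 9-character minimum.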

Search Options

To initiate a BLAST or GETSIM search with the command RUN BLAST or RUN GETSIM, one of the following search codes has to be specified:

  • /SQP for searching peptide sequences
  • /SQN for nucleotide sequences
  • /TSQN for searching peptide sequences translated from GENESEQ nucleotide sequences.

For the BLAST package four additional search codes are available:

  • /SQM (megaBLAST) for searching highly similar nucleotide sequences
  • /SQDM (discontiguous megaBLAST) for searching similar nucleotide sequences allowing more mismatches
  • /TSQP for searching nucleotide sequences translated from GENESEQ protein sequences
  • /TSQNX for searching translated nucleotides from GENESEQ protein sequences

It is recommended to use the search codes /SQM or /SQDM rather than /SQN when searching longer sequences, as the response time is much shorter. The search codes /TSQN, /TSQP, and /TSQNX are more time-consuming than the other search codes.

When using the /SQN, /SQM, /SQDM, or /TSQNX option, it is possible to specify whether the single (SIN) strand, the complementary (COM) strand, or BOTH strands should be searched. The option is specified after the search code, e.g., /SQN -S COM. If no strand option is given, BOTH is used by BLAST and GETSIM. Note that for the /TSQN option both strands are generally searched.
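To see what the complementary strand of a nucleotide query looks like, the standard reverse complement can be computed as in the Python sketch below. This is a generic illustration and not a description of how BLAST or GETSIM handle strands internally; the function name is hypothetical and only the DNA letters A, C, G, T, and N are handled.

```python
# Complement table for plain DNA letters (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```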

Search Types in BLAST and GETSIM

| Description | Search Code | Search Examples (1) |
|---|---|---|
| Peptide homology | /SQP | RUN BLAST L1 /SQP; RUN GETSIM L1/SQP |
| Nucleotide homology | /SQN | RUN BLAST L1 /SQN; RUN GETSIM L1/SQN |
| Nucleotide homology | /SQM (2) | RUN BLAST L1 /SQM |
| Nucleotide homology | /SQDM (2) | RUN BLAST L1 /SQDM |
| Translated peptide homology | /TSQN | RUN BLAST L1 /TSQN; RUN GETSIM L1 /TSQN |
| Translated peptide homology from translated peptide | /TSQNX (2) | RUN BLAST L1/TSQNX |
| Translated nucleotide homology | /TSQP (2) | RUN BLAST L1 /TSQP |

(1) Where L1 is a sequence query generated using the “Structure” page.
(2) BLAST only
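The rules above (which search codes are BLAST only, and which accept a strand option) can be captured in a small helper that assembles the command text. This Python sketch only builds a string; it does not communicate with STN, and the function name `build_run_command` is hypothetical.

```python
COMMON_CODES = {"/SQP", "/SQN", "/TSQN"}              # BLAST and GETSIM
BLAST_ONLY_CODES = {"/SQM", "/SQDM", "/TSQP", "/TSQNX"}
STRAND_CODES = {"/SQN", "/SQM", "/SQDM", "/TSQNX"}    # accept SIN, COM, BOTH
STRANDS = {"SIN", "COM", "BOTH"}


def build_run_command(package, l_number, search_code, strand=None):
    """Assemble a RUN BLAST / RUN GETSIM command line as plain text."""
    package = package.upper()
    search_code = search_code.upper()
    if package not in ("BLAST", "GETSIM"):
        raise ValueError(f"unknown package: {package}")
    if search_code not in COMMON_CODES | BLAST_ONLY_CODES:
        raise ValueError(f"unknown search code: {search_code}")
    if package == "GETSIM" and search_code in BLAST_ONLY_CODES:
        raise ValueError(f"{search_code} is available for BLAST only")
    command = f"RUN {package} {l_number} {search_code}"
    if strand is not None:
        strand = strand.upper()
        if search_code not in STRAND_CODES or strand not in STRANDS:
            raise ValueError(f"strand {strand} not valid with {search_code}")
        command += f" -S {strand}"
    return command


print(build_run_command("BLAST", "L1", "/SQN", "COM"))  # RUN BLAST L1 /SQN -S COM
```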

By default, the maximum number of hits is 15,000 records. The parameter -MAXSEQ can be used to raise this limit to 100,000 records, e.g., => RUN BLAST L1/SQN -F F -MAXSEQ 100000.

The number of additional results and their relevance in terms of high score and/or high identity values depend on the length of the query sequence and the number of subject sequences in the database. In general, searching a short sequence with -maxseq 100000 may retrieve additional documents with high score and high identity values while searching a longer sequence with -maxseq 100000 may retrieve only additional documents with high identity values.

After a search with BLAST or GETSIM, the number of retrieved sequences for each score value is displayed in two diagrams. The y-axis of these diagrams represents the number of answers (absolute values are displayed as bars, logarithmic values are shaded), and the x-axis represents the score as the specific degree of similarity for this search. The left diagram displays the score values; the right diagram displays them as percentages of the maximum score. In addition, two score values are given: the highest possible score, i.e., the score obtained when the query is aligned to itself, and the score of the best answer in the retrieved answer set. Both values are the same if the query and at least one retrieved sequence are identical.

STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-1.png
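The relation between the two diagrams is a simple ratio: the percentage score of a hit is its score divided by the query self-score. A minimal Python sketch, with a hypothetical function name and made-up example numbers:

```python
def percent_of_max_score(score, self_score):
    """Express a hit score as a percentage of the query self-score."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return round(100.0 * score / self_score, 2)


# A hit scoring 850 against a query whose self-alignment scores 1000.
print(percent_of_max_score(850, 1000))   # 85.0
# An identical sequence reaches the maximum score.
print(percent_of_max_score(1000, 1000))  # 100.0
```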

Creating Answer Sets

Multiple answer sets (L-numbers) can be created with different cut-off values for score and percentage identity. Five options are available:

  • Select a part of the answer set using the score value from the left histogram. The generated L-number contains all records with a score above the entered value.

    STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-2.png

  • Select a part of the answer set using the percentage score value from the right histogram, e.g., "85%" or "85% SCORE". The generated L-number contains all records with a percentage score above the entered value.

    STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-3.png

  • Select a part of the answer set using the percentage identity value, e.g., "100% IDENT". The generated L-number contains all records with a percentage identity above the entered value.

    STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-4.png

  • Select a part of the answer set combining the percentage score and the percentage identity value, e.g., "85% SCORE 100% IDENT". The generated L-number contains all records which have a percentage score and percentage identity above the entered value.

    STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-5.png

  • Keep the complete answer set with ALL.

    STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-6.png

The percentage score or identity cut-off value can be specified with up to two decimal places. Small differences in the cut-off value can have a large effect on the number of results, as in this example.

STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-7.png
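The percentage-based selections described above can be mimicked offline on a list of hits. The Python sketch below parses expressions such as "85%", "85% SCORE", "100% IDENT", or "85% SCORE 100% IDENT" and filters accordingly. Several points are assumptions: the hit fields `score_pct` and `ident_pct` are hypothetical, a hit equal to the cut-off is treated as included, and the absolute-score and ALL options are not handled.

```python
import re

# A number with up to two decimal places, "%", and an optional keyword.
_TERM = re.compile(r"(\d+(?:\.\d{1,2})?)%(?:\s*(SCORE|IDENT))?", re.IGNORECASE)


def parse_cutoff(text):
    """Parse a percentage cut-off expression into {'SCORE': x, 'IDENT': y}."""
    cutoffs = {}
    for value, kind in _TERM.findall(text):
        # A bare percentage such as "85%" is read as a score cut-off.
        cutoffs[(kind or "SCORE").upper()] = float(value)
    if not cutoffs:
        raise ValueError(f"no percentage cut-off found in {text!r}")
    return cutoffs


def select(hits, cutoffs):
    """Keep hits at or above every given cut-off (inclusive by assumption)."""
    return [
        hit for hit in hits
        if hit["score_pct"] >= cutoffs.get("SCORE", 0.0)
        and hit["ident_pct"] >= cutoffs.get("IDENT", 0.0)
    ]


hits = [
    {"id": "A", "score_pct": 90.0, "ident_pct": 100.0},
    {"id": "B", "score_pct": 85.0, "ident_pct": 99.0},
]
print(select(hits, parse_cutoff("85% SCORE 100% IDENT")))  # only hit A remains
```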

In order to complete the RUN BLAST or the RUN GETSIM command, END must be entered.

STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-8.png

An L-number is generated for each selection, which contains all answers of the specified subset. Each L-number can be used for further processing or may be combined with any search field of the database, for example => S L1 AND ARTIFICIAL SEQUENCE/ORGN where L1 represents the answer set from a RUN GETSEQ operation.

As the initial L-number is sorted by descending accession number, the selected L-number may be re-arranged by descending similarity score (SORT SCORE D L1) or descending percent identity (SORT IDENT D L1).
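The effect of SORT SCORE D or SORT IDENT D can be pictured as an ordinary descending sort. In the Python sketch below the record fields `an`, `score`, and `ident` are hypothetical stand-ins for accession number, similarity score, and percent identity.

```python
def sort_hits(hits, field="score"):
    """Return the hits ordered by the given field, highest value first."""
    return sorted(hits, key=lambda hit: hit[field], reverse=True)


hits = [
    {"an": "C", "score": 310, "ident": 97.5},
    {"an": "B", "score": 420, "ident": 88.0},
    {"an": "A", "score": 150, "ident": 100.0},
]
print([h["an"] for h in sort_hits(hits, "score")])  # ['B', 'C', 'A']
print([h["an"] for h in sort_hits(hits, "ident")])  # ['A', 'C', 'B']
```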

The alignment between the retrieved sequence and the query sequence can be displayed as text with the display format ALIGN or as an image with ALIGNG. The top line is the query sequence and the bottom line is the hit sequence. Above each alignment, the BLAST or GETSIM score as a percentage of the query self-score and the percentage of identity are given. Both values can also be displayed with D SCORE and D IDENT.

STNext-GENESEQ-USGENE-PATGENE-SequenceSimilaritySearching-BLASTorGETSIM-9.png

Advanced User Options for BLAST and GETSIM

For the experienced user of BLAST® and GETSIM, a variety of options are available via the STN command line. Altering these parameters will have a profound effect on the outcome of the search. It is strongly recommended that users be thoroughly familiar with the NCBI documentation before customizing any of these settings. For further information, see the NCBI website.

The advanced user options are specified with a single letter code preceded by a hyphen and followed by a blank and the required value, e.g., RUN BLAST L1 /SQN -F F or RUN BLAST L1/SQP -E 0.1 -M PAM30.

| Option | Switch | Purpose | Values |
|---|---|---|---|
| 1. Filter | -f | Specifies the filter used to mask the query sequence. | T (True) or F (False); default is T. If T is set, the SEG filter is used for peptides and the DUST filter for nucleotides. |
| 2. Expectation Value | -e | Controls the significance threshold for reported hits. To see less significant hits, increase the value. To restrict the results to more significant hits, lower the value. | Floating point number (default is 10). |
| 3. Word Size | -w | Specifies the length of the character string fragments of a sequence query which are used as the basis for a BLAST search. | Nucleotides: 11 (default) or 7-23. Peptides: 3 (default) or 2. |
| 4. Strand (nucleotides only) | -s | Specifies which nucleotide query strand to use in the search. | 1 (SIN), 2 (COM), or 3 (BOTH); default is 3. |
| 5. Matrix (peptides only) | -m | Specifies which protein scoring matrix to use. | BLAST: BLOSUM62 (default), BLOSUM80, BLOSUM45, PAM30, PAM70. GETSIM: BL50 (default), BL62, BL80, MD10, MD20, MD40, OPT5, P120, P250, VT160. |
| 6. Gap Penalty | -g | Specifies the gap opening cost. For nucleotides, multiple combinations of gap penalties are available for specific sets of -r (reward for match) and -q (penalty for mismatch) values. | Peptides (default): BLAST 11; GETSIM 12. Nucleotides (default): BLAST 5; GETSIM 12. |
| 7. Gap Extension | -x | Specifies the gap extension cost. For proteins, only a restricted set of gap penalties is valid. | Peptides: BLAST 1; GETSIM 2. Nucleotides: BLAST 2; GETSIM 4. |
| 8. Penalty for nucleotide mismatch | -q | Specifies the penalty for a nucleotide mismatch. Different -r/-q ratios are optimal for alignments with different percent identities: 1/-3 for 99%, 2/-5 for 98%, 1/-2 for 95%, 2/-3 for 90%, 4/-5 for 80-85%, 1/-1 for 75%, and 5/-4 for 65%. Only a limited set of -r/-q values is supported. | BLAST: -3 (default). GETSIM: -2 (default). |
| 9. Reward for nucleotide match | -r | Specifies the reward for a nucleotide match. The same -r/-q ratios as listed for option 8 apply. Only a limited set of -r/-q values is supported. | BLAST: 1 (default). GETSIM: 3 (default). |
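The -r/-q ratios quoted for options 8 and 9 can be turned into a small lookup. The Python sketch below picks the pair listed for the nearest percent identity. The nearest-value rule and the function name are assumptions made for illustration, and since only a limited set of -r/-q values is supported, the system may still reject a suggested pair.

```python
# (percent identity, reward -r, penalty -q) as quoted in the options table.
# The 80-85% entry is stored at both ends of its range.
REWARD_PENALTY = [
    (99, 1, -3), (98, 2, -5), (95, 1, -2), (90, 2, -3),
    (85, 4, -5), (80, 4, -5), (75, 1, -1), (65, 5, -4),
]


def suggest_reward_penalty(expected_identity):
    """Return (reward, penalty) listed for the closest percent identity.

    If two listed identities are equally close, the higher one is used.
    """
    _, reward, penalty = min(
        REWARD_PENALTY, key=lambda row: abs(row[0] - expected_identity)
    )
    return reward, penalty


reward, penalty = suggest_reward_penalty(92)
print(f"RUN BLAST L1 /SQN -R {reward} -Q {penalty}")  # RUN BLAST L1 /SQN -R 2 -Q -3
```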

BLAST and GETSIM Matrix Settings (Option 5)

The following table lists the recommended scoring matrices and corresponding gap penalties for BLAST searches depending on query length.

| Query Length | Substitution Matrix | Gap Cost (gap, gap extension) |
|---|---|---|
| <35 | PAM30 | (9,1) |
| 35-50 | PAM70 | (10,1) |
| 50-85 | BLOSUM80 | (10,1) |
| >85 | BLOSUM62 | (11,1) |
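The recommendation table above translates directly into a lookup. In the Python sketch below the function name is hypothetical, and the treatment of the boundary lengths 50 and 85 (which appear in two rows) is an assumption: each is assigned to the lower range.

```python
def recommend_blast_matrix(query_length):
    """Return (matrix, gap, gap_extension) recommended for a peptide query."""
    if query_length < 35:
        return ("PAM30", 9, 1)
    if query_length <= 50:
        return ("PAM70", 10, 1)
    if query_length <= 85:
        return ("BLOSUM80", 10, 1)
    return ("BLOSUM62", 11, 1)


matrix, gap, extension = recommend_blast_matrix(40)
print(f"RUN BLAST L1 /SQP -M {matrix} -G {gap} -X {extension}")
# RUN BLAST L1 /SQP -M PAM70 -G 10 -X 1
```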

For each matrix, only a restricted set of gap and gap extension values is allowed. The settings available for each matrix are summarized in the table below, with the default settings indicated. Any other combination will be rejected by the system and a warning message issued.

| Matrix | Gap | Gap Extension |
|---|---|---|
| BLOSUM62 | 9, 8, 7 | 2 |
| BLOSUM62 | 12, 11 (default), 10 | 1 |
| BLOSUM80 | 8, 7, 6 | 2 |
| BLOSUM80 | 11, 10 (default), 9 | 1 |
| BLOSUM45 | 13, 11, 12, 9 | 3 |
| BLOSUM45 | 15 (default), 14, 13, 12 | 2 |
| BLOSUM45 | 19, 18, 17, 16 | 1 |
| BLOSUM50 | 13, 12, 11, 10, 9 | 3 |
| BLOSUM50 | 16, 15, 14, 13 (default), 12 | 2 |
| BLOSUM50 | 19, 18, 17, 16, 15 | 1 |
| BLOSUM90 | 9, 8, 7 | 2 |
| BLOSUM90 | 6, 11, 10 (default) | 1 |
| PAM30 | 7, 6, 5 | 2 |
| PAM30 | 10, 8, 9 (default) | 1 |
| PAM70 | 8, 7, 6 | 2 |
| PAM70 | 11, 10 (default), 9 | 1 |
| PAM250 | 15, 14, 13, 12, 11 | 3 |
| PAM250 | 17, 16, 15, 14 (default), 13 | 2 |
| PAM250 | 21, 20, 19, 18, 17 | 1 |
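Because unsupported combinations are rejected, it can be convenient to check a planned gap and gap extension pair before running a search. The Python sketch below is illustrative: the names are hypothetical, and only two matrices (BLOSUM80 and PAM70) are transcribed from the table above, pairing each gap value with the gap extension in the same position.

```python
# Allowed (gap, gap extension) pairs, transcribed for two matrices only.
ALLOWED_GAP_COSTS = {
    "BLOSUM80": {(8, 2), (7, 2), (6, 2), (11, 1), (10, 1), (9, 1)},
    "PAM70": {(8, 2), (7, 2), (6, 2), (11, 1), (10, 1), (9, 1)},
}
DEFAULT_GAP_COSTS = {"BLOSUM80": (10, 1), "PAM70": (10, 1)}


def is_allowed(matrix, gap, extension):
    """Return True if the pair is listed for the matrix in the sketch's data."""
    return (gap, extension) in ALLOWED_GAP_COSTS.get(matrix.upper(), set())


print(is_allowed("BLOSUM80", 10, 1))  # True
print(is_allowed("PAM70", 12, 1))     # False
```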

The following table lists the GETSIM matrices and the query lengths at which each should be used preferentially.

| Query Length | Matrix Code | Matrix Name |
|---|---|---|
| <30 | BL50 | BLOSUM50 |
| 10-90 | BL62 | BLOSUM62 |
| >10 | BL80 | BLOSUM80 |
| <40 | MD10 | MutationData10 |
| <60 | MD20 | MutationData20 |
| <85 | MD40 | MutationData40 |
| 25-150 | OPT5 | OPTIMA5 |
| 10-90 | P120 | PAM120 |
| >100 | P250 | PAM250 |
| >20 | VT160 | Variable Time Maximum Likelihood |

The matrices MD10, MD20, MD40, and VT160 are versions of the PAM-matrix.
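Since the GETSIM length ranges overlap, several matrices usually qualify for a given query. The Python sketch below lists the candidates for a query length using the ranges from the GETSIM table above. The function name is hypothetical, and treating ranges such as 10-90 as inclusive at both ends is an assumption.

```python
# Matrix code and the query-length condition from the GETSIM table.
GETSIM_MATRICES = [
    ("BL50", lambda n: n < 30),
    ("BL62", lambda n: 10 <= n <= 90),
    ("BL80", lambda n: n > 10),
    ("MD10", lambda n: n < 40),
    ("MD20", lambda n: n < 60),
    ("MD40", lambda n: n < 85),
    ("OPT5", lambda n: 25 <= n <= 150),
    ("P120", lambda n: 10 <= n <= 90),
    ("P250", lambda n: n > 100),
    ("VT160", lambda n: n > 20),
]


def candidate_matrices(query_length):
    """Return the GETSIM matrix codes whose length range covers the query."""
    return [code for code, fits in GETSIM_MATRICES if fits(query_length)]


print(candidate_matrices(200))  # ['BL80', 'P250', 'VT160']
```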