GENESEQ, USGENE, PATGENE: Sequence Searching with GETSEQ

The GETSEQ run package searches the databases GENESEQ, USGENE, and PATGENE for direct sequence code matches of peptide and nucleic acid sequences. This method is ideal for short and/or highly conserved sequence queries where similarity (homology) searching is not required. The maximum number of hits is 250,000 records.

A nucleotide or protein sequence can be entered as a query directly on the command line with RUN GETSEQ. Alternatively, the query may be created with the QUERY command and then searched through the GETSEQ run package by specifying the query L-number (e.g., RUN GETSEQ L1, if L1 represents the sequence query).

=> RUN GETSEQ MCLHFLVLVICIL/SQSP
RUN GETSEQ AT 08:57:25 ON 2021-10-11
COPYRIGHT (C) 2021 FIZ KARLSRUHE on STN

GetSeq motif search by FIZ Karlsruhe; Version: 1.0.0

Query time: 115

L13 RUN STATEMENT CREATED
L13 30 MCLHFLVLVICIL/SQSP
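
When many queries have to be prepared, the command line shown above can be assembled programmatically. The following is a minimal sketch, not part of STN: the helper name `build_getseq_command` is hypothetical, and the set of accepted search codes is taken from the section "Types of Sequence Searches" in this document.

```python
# Search codes listed for the GETSEQ run package in this document.
PROTEIN_CODES = {"SQEP", "SQEFP", "SQSP", "SQSFP"}
NUCLEIC_CODES = {"SQEN", "SQSN"}


def build_getseq_command(query: str, code: str) -> str:
    """Return a RUN GETSEQ command line (hypothetical helper, not an STN API)."""
    code = code.upper().lstrip("/")
    if code not in PROTEIN_CODES | NUCLEIC_CODES:
        raise ValueError(f"unknown search code: {code}")
    query = query.strip()
    if not query:
        raise ValueError("empty query")
    return f"RUN GETSEQ {query}/{code}"


print(build_getseq_command("MCLHFLVLVICIL", "sqsp"))
# RUN GETSEQ MCLHFLVLVICIL/SQSP
```

The helper only formats and validates the text of the command; it does not connect to STN or run a search.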

Long sequences may be uploaded via the “Structures” page. The L-number may also derive from a previous sequence search in another STN database with bio sequence search capabilities, e.g., the CAS REGISTRY℠ file.

Any L-numbered sequence answer set from RUN GETSEQ may be combined with any search field in the GENESEQ file, for example => S L1 AND ARTIFICIAL SEQUENCE/ORGN where L1 represents the answer set from a RUN GETSEQ operation.

Hits can be displayed in the context of the whole sequence, either as text with the display format ALIGN or as an image with ALIGNG.

=> D ALIGN
L3 ANSWER 1 OF 30 GENESEQ COPYRIGHT 2021 CLARIVATE ANALYTICS on STN.
ALIGN
Sequence Length: 43;
Hits at: 8-20
1 MFTIRSRMCL HFLVLVICIL RECESVCVCV CVCVCLWHLG RVV
= ==========

The HIT display format shows only the part of the hit sequence that matches the query; the matching residues are highlighted with double underlining. In addition, the HITS AT: line gives the residue numbers of the start and end points of each matching part of the hit sequence.

=> D HIT
L5 ANSWER 50 OF 147 GENESEQ COPYRIGHT 2021 CLARIVATE ANALYTICS on STN.
SEQ
SGTTGKPKG
=========
Hits at: 413-420 3426-3433 4466-4473
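
If displayed records are saved as text, the "Hits at:" line can be turned into numeric ranges for further processing. The sketch below assumes that positions are 1-based and inclusive, which is consistent with the D ALIGN example above (hit at 8-20 for the 13-residue query MCLHFLVLVICIL). The function names are illustrative only.

```python
import re


def parse_hits(line: str) -> list[tuple[int, int]]:
    """Parse a 'Hits at:' line into (start, end) residue positions."""
    _, _, tail = line.partition(":")
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", tail)]


def hit_residues(sequence: str, start: int, end: int) -> str:
    """Return the residues of a hit, assuming 1-based inclusive positions."""
    return sequence[start - 1:end]


print(parse_hits("Hits at: 413-420 3426-3433 4466-4473"))
# [(413, 420), (3426, 3433), (4466, 4473)]

# Sequence from the D ALIGN example, with the display spacing removed.
seq = "MFTIRSRMCLHFLVLVICILRECESVCVCVCVCVCLWHLGRVV"
print(hit_residues(seq, 8, 20))
# MCLHFLVLVICIL
```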

Sequence Search Terms

Amino acid and nucleic acid sequences may be searched with one-letter codes; amino acid sequences may also be searched with the three-letter codes for common amino acids. Enter HELP AAC for a table of the one- and three-letter codes of the common amino acids and HELP NUC for a table of the codes for nucleic acids.

Uncommon amino acids are represented in the sequence by an 'X' (or 'Xaa'). Since July 2022, under standard ST.26, 'X' is also used for an unspecified amino acid. To search specifically for an 'X' in the sequence, place it in square brackets, e.g., => RUN GETSEQ TF[X]C[X]T/SQSP

| Terms | Search Examples |
| --- | --- |
| One-letter codes for common amino acids | LAGLL/SQSP |
| Three-letter codes for common amino acids. Enclose strings of codes in single quotes and use dashes to separate codes in strings. | 'HIS-LEU-TYR-LEU-GLN-TYR-ILE-ARG-LYS-LEU'/SQSFP, 'HIS-LEU-TYR-LEU-GLN-TYR-ILE-ARG-LYS-LEU'/SQEP |
| One-letter codes for nucleic acids | ATGAAN/SQEN, CATCTGTATT/SQSN |
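
The quoting rule for three-letter codes (single quotes around the string, dashes between codes) is easy to get wrong by hand. The following sketch converts a one-letter peptide into that form. The mapping covers only the 20 standard amino acids, the function name is hypothetical, and HELP AAC remains the authoritative table of the codes STN accepts.

```python
# Standard one-letter to three-letter amino acid codes (20 common residues).
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def to_three_letter_query(one_letter: str, code: str) -> str:
    """Format a one-letter peptide as a quoted, dash-separated query term."""
    codes = [ONE_TO_THREE[residue] for residue in one_letter.upper()]
    return "'" + "-".join(codes) + "'/" + code


print(to_three_letter_query("HLYLQYIRKL", "SQSFP"))
# 'HIS-LEU-TYR-LEU-GLN-TYR-ILE-ARG-LYS-LEU'/SQSFP
```

A residue outside the 20 common codes raises a KeyError, so unusual symbols are caught rather than silently passed through.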

Types of Sequence Searches

In the GETSEQ run package four options are available for searching polypeptide sequences using amino acid codes and two options for searching nucleic acid sequences.

Sequence data for nucleic acid and protein sequences are displayed with one-letter codes in the SEQ field. For proteins only, the SEQ3 field displays the sequence with three-letter codes.

| Type | Definition | Search Code | Query Examples |
| --- | --- | --- | --- |
| Sequence Exact, Protein | Search for sequences that match the query. | /SQEP | GAPGEK/SQEP, 'ASP-HIS-ALA-ILE-HIS'/SQEP |
| Sequence Exact Family, Protein | Search for sequences that match the query and those in which family-equivalent substitution of the query amino acids occurs. | /SQEFP | YGGFL/SQEFP, 'TYR-GLY-GLY-PHE-LEU'/SQEFP |
| Subsequence, Protein | Search for exact answers plus sequences in which the query sequence is embedded. | /SQSP | LAGLL/SQSP, 'ASP-HIS-ALA'/SQSP |
| Subsequence Family, Protein | Search for exact sequences, subsequences, and answers in which family-equivalent substitution of the query amino acids occurs. | /SQSFP | ATCXAWV/SQSFP, 'THR-ASP-SER-GLU-SER-SER-HIS'/SQSFP |
| Sequence Exact, Nucleic Acid | Search for sequences that match the query. Ambiguity codes for nucleic acids are allowed. | /SQEN | ATGAAN/SQEN |
| Subsequence, Nucleic Acid | Search for exact answers plus sequences in which the query sequence is embedded. Ambiguity codes for nucleic acids are allowed. | /SQSN | TGGAGAAGGC/SQSN |

The families of amino acid equivalents retrieved in the polypeptide family searches SQEFP and SQSFP are:

P, A, G, S, T (neutral, weakly hydrophobic)
Q, N, E, D, B, Z (hydrophilic, acid amine)
H, K, R (hydrophilic, basic)
F, Y, W (hydrophobic, aromatic)
L, I, V, M (hydrophobic)
C (cross-link forming)
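
To make the difference between the four protein search types concrete, the sketch below models them on plain one-letter strings using the families listed above. It assumes that a family search compares residues position by position and accepts any residue from the same family. This is an illustration of the definitions in this document, not a description of how GETSEQ is implemented, and it ignores variability symbols.

```python
# Families of equivalent amino acids as listed above (one-letter codes).
FAMILIES = ["PAGST", "QNEDBZ", "HKR", "FYW", "LIVM", "C"]
FAMILY_OF = {residue: family for family in FAMILIES for residue in family}


def equivalent(a: str, b: str, family: bool) -> bool:
    """Compare two residues, optionally allowing same-family substitution."""
    if a == b:
        return True
    return family and a in FAMILY_OF and FAMILY_OF[a] == FAMILY_OF.get(b)


def matches(query: str, sequence: str, subsequence: bool, family: bool) -> bool:
    """Model SQEP, SQEFP, SQSP, and SQSFP on plain one-letter strings."""
    n = len(query)
    if subsequence:
        starts = range(len(sequence) - n + 1)  # every window of length n
    else:
        starts = range(1) if len(sequence) == n else range(0)
    return any(
        all(equivalent(q, s, family) for q, s in zip(query, sequence[i:i + n]))
        for i in starts
    )


print(matches("YGGFL", "FGGYM", subsequence=False, family=True))     # True
print(matches("YGGFL", "FGGYM", subsequence=False, family=False))    # False
print(matches("LAGLL", "MKIAGVLT", subsequence=True, family=True))   # True
```

In the first call, Y and F share the aromatic family and L and M share the hydrophobic family, so the family-exact comparison succeeds where the plain exact comparison does not.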

Variability Symbols for Sequence Code Match Searches

Variability symbols are allowed in all GETSEQ search options. The caret character may be used at the beginning or at the end of a sequence to search for that sequence at the beginning or end of the sequence field.

| Symbol(s) | Function | Query Examples |
| --- | --- | --- |
| [ ] | specify alternate residues | NGSLLAGAYAIST[LV]I/SQSP, LGP['VAL-LEU-LYS']/SQSP |
| [ - ] | exclude a specific residue or alternate residues | LGP[-H]/SQSP, LGP[-'HIS']/SQSFP, LGP[-HL]/SQSP |
| {m} | repeat the preceding sequence m times | (FL){2}/SQSP, (CTGA){3}/SQSN, TAA(TAAA){2}/SQSN |
| {m,u} or {m-u} | repeat the preceding sequence m to u times | GG(FL){1,2}/SQSP, (CTGA){2,4}/SQSN |
| ? or {0,1} or {0-1} | repeat the preceding sequence zero or one time | FLRRI(RP)?K/SQSP, FLRRI(RP){0,1}K/SQSP, CATG(CGTA){0,1}GGAC/SQSN |
| * or {0,} or {0-} | repeat the preceding sequence zero or more times | KLK(WD){0,}N/SQSP, KLK(WD)*N/SQSP, CATAA(CTG){0,}TATT/SQSN |
| + or {1,} or {1-} | repeat the preceding sequence one or more times | KLK(DLE){1,}/SQSP, KLK(DLE)+/SQSP, CATA(CTG){1,}TATT/SQSN |
| ^ (caret) | search at the beginning or end of a sequence | ^MCGIL/SQSP, VCDS^/SQSP |

The vertical bar (|) specifies alternate residues, e.g., ACDS|KLMP/SQSP
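
Most of the variability symbols resemble regular-expression syntax, so a query can be tried against a few known sequences locally before it is run. The sketch below rewrites a one-letter GETSEQ pattern into Python `re` syntax. It assumes that GETSEQ semantics coincide with regular expressions for these symbols, which this document does not state. It also assumes repeated sequences are enclosed in parentheses, and it does not handle three-letter codes or family searches. The function names are hypothetical.

```python
import re


def getseq_to_regex(pattern: str) -> str:
    """Rewrite a one-letter GETSEQ pattern in Python regex syntax (sketch)."""
    rx = pattern.replace("[-", "[^")                 # [-H] excludes residues
    rx = re.sub(r"\{(\d+)-(\d*)\}", r"{\1,\2}", rx)  # {m-u} and {m-} ranges
    if rx.endswith("^"):                             # trailing caret: end anchor
        rx = rx[:-1] + "$"
    return rx


def local_match(pattern: str, sequence: str, exact: bool = False) -> bool:
    """Test a pattern locally as an exact or a subsequence match."""
    rx = getseq_to_regex(pattern)
    found = re.fullmatch(rx, sequence) if exact else re.search(rx, sequence)
    return found is not None


print(getseq_to_regex("GG(FL){1-2}"))      # GG(FL){1,2}
print(local_match("LGP[-H]", "MLGPAK"))    # True
print(local_match("LGP[-H]", "MLGPHK"))    # False
print(local_match("VCDS^", "AAVCDS"))      # True
```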

Specifying Gaps in GETSEQ Sequence Queries

A gap may be specified in a sequence expression using the period (.) for one residue, the colon (:) for zero or one residue or the period (.) followed by an appropriate repeat expression. The following table summarizes all the options for specifying gaps in GETSEQ sequence searches.

| Symbol(s) | Function | Query Examples |
| --- | --- | --- |
| . | a gap of one residue | SY.RPG/SQSP, SY..RPG/SQSP, AAG...TGC/SQSN |
| .{m} or [m.] | a gap of m residues | SY.{2}RPG/SQSP, SY[2.]RPG/SQSP |
| .{m,u} or .{m-u} | a gap of m to u residues | GFF.{2,10}LSS/SQSP, GFF.{2-10}LSS/SQSP, AAG.{2,5}TGC/SQSN |
| : or .? or .{0,1} or .{0-1} | a gap of zero or one residue | AGA:SRI/SQSFP, AGA.?SRI/SQSFP, AGA.{0,1}SRI/SQSFP, AGA.{0-1}SRI/SQSFP |
| .* or .{0,} or .{0-} | a gap of zero or more residues | HLC.*TYG/SQSP, HLC.{0,}TYG/SQSP, HLC.{0-}TYG/SQSP, AAGGCAGATG.*GCAA/SQSN |
| .+ or .{1,} or .{1-} | a gap of one or more residues | SY.+TH/SQSP, SY.{1,}TH/SQSP, SY.{1-}TH/SQSP, TCCTG.+GTGG/SQSN |
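
The equivalences in the gap table can be checked by normalizing each notation to a single form. The sketch below maps the GETSEQ-specific spellings ([m.], the colon, and dash ranges) onto the regular-expression forms given in the same table rows, then tests them with Python's `re` module. Treating the period as "any one residue" in the regex sense is an assumption, and the function name is illustrative.

```python
import re


def gaps_to_regex(pattern: str) -> str:
    """Normalize GETSEQ gap notation to regex-style notation (sketch)."""
    rx = re.sub(r"\[(\d+)\.\]", r".{\1}", pattern)   # [m.]  -> .{m}
    rx = rx.replace(":", ".?")                       # :     -> .?
    rx = re.sub(r"\{(\d+)-(\d*)\}", r"{\1,\2}", rx)  # {m-u} -> {m,u}
    return rx


for query in ["SY[2.]RPG", "AGA:SRI", "GFF.{2-10}LSS", "HLC.{0-}TYG"]:
    print(query, "->", gaps_to_regex(query))
# SY[2.]RPG -> SY.{2}RPG
# AGA:SRI -> AGA.?SRI
# GFF.{2-10}LSS -> GFF.{2,10}LSS
# HLC.{0-}TYG -> HLC.{0,}TYG

# A two-residue gap matches SY..RPG but a one-residue gap does not.
print(bool(re.search(gaps_to_regex("SY[2.]RPG"), "ASYKLRPGT")))  # True
print(bool(re.search(gaps_to_regex("SY.RPG"), "ASYKLRPGT")))     # False
```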

More than one symbol may be used to create complex sequence queries. If you do not use parentheses in sequence queries, the operations will be executed in the following order:

  1. repeat symbols ? or * or +
  2. repeat expressions using curly braces, e.g., {3,6}
  3. concatenation symbol &
  4. the vertical bar |

Concatenating Queries

In addition to the variability symbols, you may use the & symbol to join together sequences or L-numbered queries. The order in which the sequences are specified has no influence on the search result.

=> RUN GETSEQ MEVQLQESGP & SLVKPSQTLS & LTCSVTGDSI/SQSP

retrieves the same records as

=> RUN GETSEQ LTCSVTGDSI & SLVKPSQTLS & MEVQLQESGP/SQSP

=> S L1&L2&L3/SQSN

retrieves the sequence in L1 followed by the sequence in L2 followed by the sequence in L3.

The concatenation symbol may be used in subsequence searches within RUN GETSEQ (/SQSN, /SQSP, /SQSFP) and also in exact sequence searches of proteins or nucleic acids (/SQEP, /SQEFP, /SQEN).