The GETSEQ run package is a tool to search in the databases GENESEQ, USGENE, and PATGENE for a direct sequence code match of peptide and nucleic acid sequences. This method is ideal for short and/or highly conserved sequence queries where similarity (homology) searching is not required. The maximum number of hits is 250,000 records.
A nucleotide or protein sequence query can be entered directly on the command line with RUN GETSEQ. Alternatively, the query may be created with the QUERY command and then searched through the GETSEQ run package by specifying the query L-number (e.g., RUN GETSEQ L1, if L1 represents the sequence query).
=> RUN GETSEQ MCLHFLVLVICIL/SQSP
RUN GETSEQ AT 08:57:25 ON 2021-10-11
COPYRIGHT (C) 2021 FIZ KARLSRUHE on STN
GetSeq motif search by FIZ Karlsruhe; Version: 1.0.0
Query time: 115
L13 RUN STATEMENT CREATED
L13 30 MCLHFLVLVICIL/SQSP
Long sequences may be uploaded via the “Structures” page. The L-number may also derive from a previous sequence search in another STN database with biosequence search capabilities, e.g., the CAS REGISTRY℠ file.
Any L-numbered sequence answer set from RUN GETSEQ may be combined with any search field in the GENESEQ file, for example: => S L1 AND ARTIFICIAL SEQUENCE/ORGN, where L1 represents the answer set from a RUN GETSEQ operation.
Hits can be displayed in the context of the whole sequence, either as text with the display format ALIGN or as an image with ALIGNG.
=> D ALIGN
L3 ANSWER 1 OF 30 GENESEQ COPYRIGHT 2021 CLARIVATE ANALYTICS on STN.
ALIGN
Sequence Length: 43;
Hits at: 8-20
1 MFTIRSRMCL HFLVLVICIL RECESVCVCV CVCVCLWHLG RVV
= ==========
The HIT display format shows only the part of the hit sequence that contains the matching residues, which are highlighted with double underlining. In addition, the line HITS AT: gives the residue numbers of the start and end points of each matching part of the hit sequence.
=> D HIT
L5 ANSWER 50 OF 147 GENESEQ COPYRIGHT 2021 CLARIVATE ANALYTICS on STN.
SEQ
SGTTGKPKG
=========
Hits at: 413-420 3426-3433 4466-4473
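The "Hits at" positions can be reproduced for a plain query with a few lines of Python. This is only a sketch to illustrate the numbering, assuming 1-based, inclusive residue positions as in the ALIGN example above; the function name `hits_at` is hypothetical and is not part of STN or GETSEQ.

```python
def hits_at(sequence: str, query: str) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) positions of each occurrence of query."""
    hits = []
    start = sequence.find(query)
    while start != -1:
        # Convert the 0-based index from str.find to 1-based residue numbers.
        hits.append((start + 1, start + len(query)))
        # Continue one residue further so that overlapping hits are also found.
        start = sequence.find(query, start + 1)
    return hits


# The 43-residue sequence from the ALIGN example, with the display spacing removed.
seq = "".join("MFTIRSRMCL HFLVLVICIL RECESVCVCV CVCVCLWHLG RVV".split())
print(len(seq))                       # 43
print(hits_at(seq, "MCLHFLVLVICIL"))  # [(8, 20)]
```

The result agrees with the "Hits at: 8-20" line shown in the ALIGN display.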
Sequence Search Terms
Amino acid and nucleic acid sequences may be searched with one-letter codes; amino acid sequences may also be searched with the three-letter codes for common amino acids. Enter HELP AAC for a table of the one- and three-letter codes of the common amino acids and HELP NUC for a table of the codes for nucleic acids.
Uncommon amino acids are represented in the sequence by an 'X' (or 'Xaa'). Since July 2022, under standard ST.26, 'X' is also used for an unspecified amino acid. To search specifically for an 'X' in the sequence, place it in square brackets, e.g., => RUN GETSEQ TF[X]C[X]T/SQSP
Terms | Search Examples |
One-letter codes for common amino acids | LAGLL/SQSP |
Three-letter codes for common amino acids (enclose strings of codes in single quotes) | 'HIS-LEU-TYR-LEU-GLN-TYR-ILE-ARG-LYS-LEU'/SQSFP 'HIS-LEU-TYR-LEU-GLN-TYR-ILE-ARG-LYS-LEU'/SQEP |
One-letter codes for nucleic acids | ATGAAN/SQEN CATCTGTATT/SQSN |
Types of Sequence Searches
The GETSEQ run package offers four options for searching polypeptide sequences using amino acid codes and two options for searching nucleic acid sequences.
Sequence data for nucleic acid and protein sequences are displayed in the SEQ field with one-letter codes and the SEQ3 field with three-letter codes for proteins only.
Type | Definition | Search Code | Query Examples |
Sequence Exact Protein | Search for sequences that match the query. | /SQEP | GAPGEK/SQEP 'ASP-HIS-ALA-ILE-HIS' /SQEP |
Sequence Exact Family, Protein | Search for sequences that match the query and those in which family-equivalent substitution of the query amino acids occur. | /SQEFP | YGGFL/SQEFP 'TYR-GLY-GLY-PHE-LEU'/SQEFP |
Subsequence, Protein | Search for exact answers plus sequences in which the query sequence is embedded. | /SQSP | LAGLL/SQSP 'ASP-HIS-ALA'/SQSP |
Subsequence Family, Protein | Search for exact sequences, subsequences, and answers in which family-equivalent substitution of the query amino acids occurs. | /SQSFP | ATCXAWV/SQSFP 'THR-ASP-SER-GLU-SER-SER-HIS' /SQSFP |
Sequence Exact, Nucleic Acid | Search for sequences that match the query. Ambiguity codes for nucleic acids are allowed. | /SQEN | ATGAAN/SQEN |
Subsequence, Nucleic Acid | Search for exact answers, plus sequences in which the query sequence is embedded. Ambiguity codes for nucleic acids are allowed. | /SQSN | TGGAGAAGGC/SQSN |
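The difference between an exact search and a subsequence search can be illustrated with Python regular expressions. The sketch below is not how GETSEQ is implemented. It assumes that an ambiguity code in the query matches any of the bases it stands for in the standard IUPAC nucleotide table; how GETSEQ treats ambiguity codes stored in database sequences is not covered. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_regex(query: str) -> str:
    """Turn a nucleotide query into a regex with one character class per position."""
    return "".join(f"[{IUPAC[base]}]" for base in query.upper())


def exact_match(query: str, sequence: str) -> bool:
    """Exact search (like /SQEN): the query must cover the whole sequence."""
    return re.fullmatch(to_regex(query), sequence) is not None


def subsequence_match(query: str, sequence: str) -> bool:
    """Subsequence search (like /SQSN): the query may be embedded anywhere."""
    return re.search(to_regex(query), sequence) is not None


print(exact_match("ATGAAN", "ATGAAC"))                # True
print(exact_match("ATGAAN", "ATGAACG"))               # False
print(subsequence_match("ATGAAN", "CCATGAATGG"))      # True
```

An exact hit is always also a subsequence hit, which corresponds to the "exact answers plus embedded sequences" wording in the table.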
The families of amino acid equivalents retrieved in the polypeptide family searches SQEFP and SQSFP are:
P, A, G, S, T | (neutral, weakly hydrophobic) |
Q, N, E, D, B, Z | (hydrophilic, acid amine) |
H, K, R | (hydrophilic, basic) |
F, Y, W | (hydrophobic, aromatic) |
L, I, V, M | (hydrophobic) |
C | (cross-link forming) |
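The family table can be read as a lookup: two residues are family-equivalent when they are identical or belong to the same family. The following Python sketch illustrates that reading for one-letter codes only. The names `FAMILIES`, `family_equivalent`, and `family_subsequence_hits` are hypothetical, and the treatment of residues outside the table (matched only by identity) is an assumption.

```python
# The six families listed above, written as strings of one-letter codes.
FAMILIES = ["PAGST", "QNEDBZ", "HKR", "FYW", "LIVM", "C"]
FAMILY_OF = {residue: family for family in FAMILIES for residue in family}


def family_equivalent(a: str, b: str) -> bool:
    """True if the residues are identical or belong to the same family."""
    if a == b:
        return True
    # Assumption: residues not listed in any family match only themselves.
    return a in FAMILY_OF and FAMILY_OF.get(a) == FAMILY_OF.get(b)


def family_subsequence_hits(query: str, sequence: str) -> list[int]:
    """Return 1-based start positions where the query matches with family substitution."""
    n = len(query)
    return [
        i + 1
        for i in range(len(sequence) - n + 1)
        if all(family_equivalent(q, s) for q, s in zip(query, sequence[i:i + n]))
    ]


# YGGFL matches FGAYI: Y~F, G=G, G~A, F~Y, L~I.
print(family_subsequence_hits("YGGFL", "MKFGAYIQ"))  # [3]
```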
Variability Symbols for Sequence Code Match Searches
Variability symbols are allowed in all GETSEQ search options. The caret character may be used at the beginning or at the end of a sequence to search for that sequence at the beginning or end of the sequence field.
Symbol(s) | Function | Query Examples |
[ ] | specify alternate residues | NGSLLAGAYAIST[LV]I/SQSP LGP['VAL-LEU-LYS']/SQSP |
[ - ] | exclude a specific residue or alternate residues | LGP[-H]/SQSP LGP[-'HIS']/SQSFP LGP[-HL]/SQSP |
{m} | repeat the preceding sequence m times | (FL){2}/SQSP (CTGA){3}/SQSN TAA(TAAA){2}/SQSN |
{m,u} or {m-u} | repeat the preceding sequence m to u times | GG(FL){1,2}/SQSP (CTGA){2,4}/SQSN |
? or {0,1} or {0-1} | repeat the preceding sequence zero or one time | FLRRI(RP)?K/SQSP FLRRI(RP){0,1}K/SQSP CATG(CGTA){0,1}GGAC/SQSN |
* or {0,} or {0-} | repeat the preceding sequence zero or more times | KLK(WD){0,}N/SQSP KLK(WD)*N/SQSP CATAA(CTG){0,}TATT/SQSN |
+ or {1,} or {1-} | repeat the preceding sequence one or more times | KLK(DLE){1,}/SQSP KLK(DLE)+/SQSP CATA(CTG){1,}TATT/SQSN |
^ (caret) | search at the beginning or end of a sequence | ^MCGIL/SQSP VCDS^/SQSP |
| (vertical bar) | specify alternate residues | ACDS|KLMP/SQSP |
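Most of the variability symbols resemble regular-expression syntax, so their meaning can be tried out locally by translating a query into a Python regex. The sketch below is an analogy under stated assumptions, not a description of the GETSEQ engine: it handles one-letter codes only, omits the search code (such as /SQSP), and the function name `variability_to_regex` is hypothetical.

```python
import re


def variability_to_regex(query: str) -> str:
    """Translate GETSEQ-style variability symbols into Python regex syntax."""
    # Exclusion [-H] or [-HL] becomes a negated character class [^H] or [^HL].
    rx = re.sub(r"\[-", "[^", query)
    # Hyphenated repeat ranges {m-u} and {m-} become {m,u} and {m,}.
    rx = re.sub(r"\{(\d+)-(\d*)\}", r"{\1,\2}", rx)
    # A trailing caret anchors at the end of the sequence, which is $ in Python.
    if rx.endswith("^"):
        rx = rx[:-1] + "$"
    return rx


print(variability_to_regex("LGP[-HL]"))     # LGP[^HL]
print(variability_to_regex("GG(FL){1-2}"))  # GG(FL){1,2}
print(variability_to_regex("VCDS^"))        # VCDS$

# A subsequence search then corresponds to re.search on the stored sequence.
print(bool(re.search(variability_to_regex("FLRRI(RP)?K"), "AAFLRRIRPKAA")))  # True
```

Alternate residues in square brackets, parenthesized groups, and the symbols ?, *, +, {m}, and {m,u} need no translation because Python uses the same notation.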
Specifying Gaps in GETSEQ Sequence Queries
A gap may be specified in a sequence expression using the period (.) for one residue, the colon (:) for zero or one residue, or the period (.) followed by an appropriate repeat expression. The following table summarizes all the options for specifying gaps in GETSEQ sequence searches.
Symbol(s) | Function | Query Examples |
. | a gap of one residue | SY.RPG/SQSP SY..RPG/SQSP AAG...TGC/SQSN |
.{m} or [m.] | a gap of m residues | SY.{2}RPG/SQSP SY[2.]RPG/SQSP |
.{m,u} or .{m-u} | a gap of m to u residues | GFF.{2,10}LSS/SQSP GFF.{2-10}LSS/SQSP AAG.{2,5}TGC/SQSN |
: or .? or .{0,1} or .{0-1} | a gap of zero or one residue | AGA:SRI/SQSFP AGA.?SRI/SQSFP AGA.{0,1}SRI/SQSFP AGA.{0-1}SRI/SQSFP |
.* or .{0,} or .{0-} | a gap of zero or more residues | HLC.*TYG/SQSP HLC.{0,}TYG/SQSP HLC.{0-}TYG/SQSP AAGGCAGATG.*GCAA/SQSN |
.+ or .{1,} or .{1-} | a gap of one or more residues | SY.+TH/SQSP SY.{1,}TH/SQSP SY.{1-}TH/SQSP TCCTG.+GTGG/SQSN |
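The gap notations in the table can be checked against each other by rewriting them into a single regular-expression form. The Python sketch below assumes the equivalences stated in the table (for example, that [m.] means the same as .{m} and that : means the same as .?). The function name `gaps_to_regex` is hypothetical, and the search code (such as /SQSP) is left out.

```python
import re


def gaps_to_regex(query: str) -> str:
    """Rewrite GETSEQ-style gap symbols as Python regex syntax."""
    # The colon stands for a gap of zero or one residue.
    rx = query.replace(":", ".?")
    # [m.] is an alternative spelling of a gap of exactly m residues.
    rx = re.sub(r"\[(\d+)\.\]", r".{\1}", rx)
    # Hyphenated ranges {m-u} and {m-} become {m,u} and {m,}.
    rx = re.sub(r"\{(\d+)-(\d*)\}", r"{\1,\2}", rx)
    return rx


print(gaps_to_regex("SY[2.]RPG"))      # SY.{2}RPG
print(gaps_to_regex("AGA:SRI"))        # AGA.?SRI
print(gaps_to_regex("GFF.{2-10}LSS"))  # GFF.{2,10}LSS

# A gap of exactly two residues between SY and RPG:
print(bool(re.search(gaps_to_regex("SY[2.]RPG"), "ASYKLRPGA")))  # True
```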
More than one symbol may be used to create complex sequence queries. If you do not use parentheses in sequence queries, the operations will be executed in the following order:
- repeat symbols ? or * or +
- repeat expressions using curly braces, e.g., {3,6},
- concatenation symbol &,
- the vertical bar
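Python regular expressions apply a comparable order (repeats first, then concatenation, then the vertical bar), so the effect of leaving out parentheses can be explored locally. This is an analogy only: Python has no & symbol and concatenates adjacent elements implicitly, and whether GETSEQ resolves every unparenthesized query the same way is an assumption. The helper name `full_match` is hypothetical.

```python
import re


def full_match(pattern: str, sequence: str) -> bool:
    """True if the pattern matches the entire sequence."""
    return re.fullmatch(pattern, sequence) is not None


# Repeats bind tightest: without parentheses, {2} applies only to the L.
print(full_match("FL{2}", "FLL"))     # True
print(full_match("FL{2}", "FLFL"))    # False
print(full_match("(FL){2}", "FLFL"))  # True

# The vertical bar binds loosest: ACDS|KLMP means ACDS or KLMP as whole alternatives.
print(full_match("ACDS|KLMP", "KLMP"))     # True
print(full_match("ACDS|KLMP", "ACDKLMP"))  # False
```

When in doubt, parentheses make the intended grouping explicit.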
Concatenating Queries
In addition to the variability symbols, you may use the & symbol to join sequences or L-numbered queries. The order of the specified sequences has no influence on the search result.
=> RUN GETSEQ MEVQLQESGP & SLVKPSQTLS & LTCSVTGDSI/SQSP
retrieves the same records as
=> RUN GETSEQ LTCSVTGDSI & SLVKPSQTLS & MEVQLQESGP/SQSP
=> S L1&L2&L3/SQSN
retrieves the sequence in L1 followed by the sequence in L2 followed by the sequence in L3.
The concatenation symbol may be used in subsequence searches within RUN GETSEQ (/SQSN, /SQSP, /SQSFP) and also in exact sequence searches of proteins or nucleic acids (/SQEP, /SQEFP, /SQEN).