January 4, 2022

Reloaded GENESEQ Database Provides Updated BLAST and GETSIM Versions and New Searching Capabilities

The patent sequence database GENESEQ, produced by Clarivate and providing coverage of nucleic acid and protein sequences extracted from the original (basic) patent documents published by 57 patent offices worldwide, has been reloaded and enhanced. The database was previously known as DGENE on STNext.

Many of the enhancements described here were already implemented in PATGENE in 4Q2021. USGENE is expected to be updated similarly in 1Q2022.

Highlights of the new version of the GENESEQ database are:

  • New BLAST version and additional BLAST search options
  • New FASTA version
  • Improved usability of Motif searching (RUN GETSEQ) results
  • Better display of search results
  • New search fields for the composition of nucleic acid and protein sequences
  • Better compatibility with PATGENE and USGENE databases
  • Better compatibility with full text patent databases
  • Improved performance, and additional enhancements

New BLAST Version and Additional BLAST Search Options

GENESEQ now uses BLAST version 2.12.0. Four additional search options have been introduced:

  • /SQM - the "megaBLAST" algorithm, for searching highly similar nucleotide sequences
  • /SQDM - the "discontiguous megaBLAST" algorithm, for searching similar nucleotide sequences but allowing more mismatches
  • /TSQP - the BLASTx algorithm, for searching nucleotide sequences translated from PATGENE protein sequences
  • /TSQNX - the tBLASTx algorithm, for searching translated nucleotides from PATGENE protein sequences

Additional details on these new search options can be found by typing HELP BLAST or HELP TLATION at an arrow prompt while in GENESEQ.

New FASTA Version

The FASTA algorithm, invoked by RUN GETSIM, has been updated to version 36.3.8h. It now allows searching of sequences up to 30K characters in length. The available search options are the same as before: /SQN for searching nucleotide sequences, /SQP for searching amino acid sequences, and /TSQP for translating a nucleotide query in all six reading frames to amino acid sequences and searching the protein sequences. The display of the parameters, the overview diagram, and the alignments is now the same for GETSIM and BLAST searches. Updated HELP information is available in HELP GSIM.
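To make the phrase "all six reading frames" concrete, the following Python sketch translates a nucleotide query in the three forward frames and the three frames of the reverse complement. It is only an illustration of the general technique and assumes the standard genetic code; it does not show how GETSIM or FASTA implement /TSQP, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the six translations keyed by frame label (+1..+3, -1..-3)."""
    dna = dna.upper().replace("U", "T")
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames
```

For instance, six_frame_translations("ATGGCC") yields "MA" for frame +1 and "GH" for frame -1. Each of the six protein strings would then be compared against the protein sequences in the database.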

Improved Usability of Motif Searching (RUN GETSEQ) Results

To improve the usability of Motif searching results, the entire answer set is now always included within a single L-number. HELP GSEQ has been updated and includes additional information.

Better Display of Search Results

New displays of similarity results are now available. For each BLAST or GETSIM search, two diagrams are generated to provide an overview of the similarity between the retrieved sequences and the query:

  • the number of answers, and
  • a score for the specific degree of similarity for this search

For BLAST and GETSIM searches, L-numbers are each generated by entering ALL, a percentage or an absolute number. Each L-number can be used for further processing.

Alignments can be displayed for all three RUN options (BLAST, GETSIM, GETSEQ) as text with the display format ALIGN or as an image with ALIGNG.

New Search Fields for the Composition of Nucleic Acid and Protein Sequences

Need to find sequences with a particular type of content? New search fields that report the nucleotide and amino acid composition of each sequence make this possible.

The new fields are as follows:

  • /AA - retrieves amino acid codes expressed as single characters (see HELP AAC for the definitions of the amino acid codes)
  • /NA - retrieves the nucleotide codes (see HELP NUC)
  • /AA.CNT - retrieves the number of amino acids
  • /NA.CNT - retrieves the number of nucleotides
  • /AA.PER - retrieves the percentage of amino acids in the sequence
  • /NA.PER - retrieves the percentage of nucleotides in the sequence

Range searching is possible for the /AA.CNT, /NA.CNT, /AA.PER, and /NA.PER fields, and the use of (S) proximity provides precision searching capabilities. For example, nucleotides with high GC-content (Guanine, Cytosine) can be retrieved with: => S (G OR C)/NA (S) 60-100/NA.PER
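As a rough illustration of the values that count and percentage fields such as /NA.CNT and /NA.PER describe, the Python sketch below computes the per-code composition of a sequence and its combined GC percentage. It is not the database's indexing logic. The function names are hypothetical, and the text above does not spell out whether the (S) query evaluates G and C individually or together, so the sketch computes both.

```python
from collections import Counter


def composition(sequence):
    """Return {code: (count, percent)} for each residue code in the sequence."""
    seq = "".join(sequence.split()).upper()  # assumed normalization
    counts = Counter(seq)
    total = len(seq)
    return {code: (n, 100.0 * n / total) for code, n in counts.items()}


def gc_percent(sequence):
    """Combined percentage of G and C codes in a nucleotide sequence."""
    comp = composition(sequence)
    return sum(comp.get(code, (0, 0.0))[1] for code in "GC")


def in_range(value, low, high):
    """Inclusive range check, analogous to a range such as 60-100."""
    return low <= value <= high
```

With the sequence "GCGCATGCGC", composition reports G and C at 40 percent each, and gc_percent returns 80.0, which in_range(80.0, 60, 100) accepts.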

Better Compatibility with the PATGENE and USGENE Sequence Databases

The search fields Patent Sequence Location (/PSL) and Sequence Count (/SEQC), already available in PATGENE and USGENE, are now also available in GENESEQ. This means that the same sequence-specific searches can now be performed in all three databases.

For every sequence in GENESEQ, a key has been generated with the SHA-2 algorithm and indexed in the new field Sequence Key (/SEQK). The generated string (e.g., A0000030BD19782FC1774AF58E4CFFEE7F0E30588CBA14DCD38C) is specific to a sequence. Identical sequences receive the same string, regardless of the database of origin or the organism from which the sequence was isolated. The /SEQK field has already been added to PATGENE and will be added to USGENE in due course to enable efficient duplicate identification.
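The idea behind a hash-based sequence key can be sketched in Python as follows. This is an illustration only: the SHA-2 variant, the normalization applied before hashing, and the exact format of the /SEQK string are not stated here, so the use of SHA-256, the whitespace and case normalization, and the record identifiers are all assumptions, and the output will not match real /SEQK values.

```python
import hashlib


def sequence_key(sequence):
    """Hash a sequence so that identical sequences share one key."""
    # Assumed normalization: remove whitespace and ignore letter case.
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest().upper()


def group_duplicates(records):
    """Group (record_id, sequence) pairs by key; keep keys seen more than once."""
    groups = {}
    for record_id, sequence in records:
        groups.setdefault(sequence_key(sequence), []).append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Because the key depends only on the sequence itself, records from different databases can be matched by comparing keys rather than aligning full sequences.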

Compatibility with Full-Text Patent Databases

Search fields common to the patent full text databases are now also available in GENESEQ:

  • /APO Application Number, Original
  • /DED Data Entry Date
  • /DUPD Data Update Date
  • /PNO Publication Number, Original
  • /PRDF Priority Date, First
  • /PRYF Priority Year, First
  • /PRNO Priority Number, Original

These fields already appear in PATGENE and will also appear in USGENE in due course.

Improved Performance and Additional Enhancements

As a result of the new BLAST and FASTA versions, search performance is improved.

Although BATCH searches are not possible, L-numbers from sequence searches can be saved with the command SAVE and reactivated with ACTIVATE.

Alerts for sequences are not possible for the time being but can be set up for bibliographic fields.

The default maximum number of hits has been increased to 15,000. The new parameter "-maxseq" allows the maximum number of hits to be increased to 100,000, but larger maximums will mean longer processing time. Example: => RUN BLAST L1/SQN -F F -MAXSEQ 100000

The new Database Summary Sheet for GENESEQ is available at: https://www.cas.org/sites/default/files/documents/geneseq.pdf