February 21, 2022


USGENE Database Reload Provides Updated BLAST and GETSIM Versions and New Searching Capabilities

The patent sequence database USGENE, providing all available peptide and nucleic acid sequences from the published applications and issued patents of the United States Patent and Trademark Office (USPTO), has been reloaded and enhanced on STNext.

In addition to faster search processing, the highlights of the new version of USGENE are:

  • New BLAST version and additional BLAST search options
  • New FASTA version
  • Better display of search results and a new sorting option
  • New search fields for the composition of nucleic acid and protein sequences
  • Better compatibility with PATGENE and GENESEQ databases
  • Better compatibility with full text patent databases
  • Maximum number of hits increased

New BLAST Version and Additional BLAST Search Options

USGENE now uses BLAST version 2.12.0. Four additional search options have been introduced, allowing for more precision in search results:

  • /SQM - the "megaBLAST" algorithm, for searching highly similar nucleotide sequences
  • /SQDM - the "discontiguous megaBLAST" algorithm, for searching similar nucleotide sequences but allowing more mismatches
  • /TSQP - the BLASTx algorithm, for searching nucleotide sequences translated from PATGENE protein sequences
  • /TSQNX - the tBLASTx algorithm, for searching translated nucleotides from PATGENE protein sequences

Additional details on these new search options can be found by typing HELP BLAST or HELP TLATION at an arrow prompt while in USGENE.

New FASTA Version

The FASTA algorithm, invoked by RUN GETSIM, has been updated to version 36.3.8h. It now allows searching of sequences up to 30K characters in length. The available search options are the same as before: /SQN for searching nucleotide sequences, /SQP for searching amino acid sequences, and /TSQP for translating a nucleotide query in all six reading frames into amino acid sequences and searching them against the protein sequences. The display of the parameters, the overview diagram, and the alignments is now the same for GETSIM and BLAST searches. Updated HELP information is available in HELP GSIM.
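As an illustration of what translating a nucleotide query "in all six reading frames" means, here is a minimal Python sketch. It is not the GETSIM or FASTA implementation. It assumes the standard genetic code, and the function names are hypothetical. The three forward frames start at offsets 0, 1, and 2 of the query; the three reverse frames do the same on its reverse complement.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
# Using this table is an assumption; the source does not say which code is applied.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one reading frame; codons not in the table become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCATTGTAATGGGCCGC").items():
        print(label, protein)
```

Each of the six translated sequences would then be compared against the protein sequences, which is why a translated search does more work than a plain /SQP search of the same length.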

Improved Usability of Motif Searching (RUN GETSEQ) Results

To improve the usability of Motif searching results, the entire answer set is now always included within a single L number. HELP GSEQ has been updated and includes additional information.

Better Display of Search Results, New Sorting Option

New displays of similarity results are now available. For each BLAST or GETSIM search, two diagrams are now generated to provide an overview of the similarity between the retrieved sequences and the query:

  • the number of answers
  • a score for the specific degree of similarity for the search

For BLAST and GETSIM searches, an L-number is generated each time you enter ALL, a percentage, or an absolute number. Each L-number can be used for further processing. While the default search results display is sorted by descending Accession Number, the ability to sort by descending Similarity Score (SORT SCORE D L1) has been retained and the ability to sort by descending Percent Identity (SORT IDENT D L1) has been introduced in USGENE. The capability to sort by descending Percent Identity is now also being introduced in PATGENE and GENESEQ.

Alignments can be displayed for all three RUN options (BLAST, GETSIM, GETSEQ) as text with the display format ALIGN or as an image with ALIGNG.

New Search Fields for the Composition of Nucleic Acid and Protein

The introduction of new search fields reporting the nucleotide and amino acid composition of a specific sequence makes it possible to refine your searches to find sequences with a particular type of content. The new fields are as follows:

  • /AA - retrieves amino acid codes expressed as single characters (see HELP AAC for the definitions of the amino acid codes)
  • /NA - retrieves the nucleotide codes (see HELP NUC)
  • /AA.CNT - retrieves the number of occurrences of an amino acid in the sequence
  • /NA.CNT - retrieves the number of occurrences of a nucleotide in the sequence
  • /AA.PER - retrieves the percentage of the sequence made up by an amino acid
  • /NA.PER - retrieves the percentage of the sequence made up by a nucleotide

Range searching is possible for the /AA.CNT, /NA.CNT, /AA.PER, and /NA.PER fields. Use the (S) proximity operator for more precise search results. For example, nucleotide sequences with high GC content (guanine, cytosine) can be retrieved with: => S (G OR C)/NA (S) 60-100/NA.PER
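To make the count and percentage fields concrete, the following Python sketch computes per-code counts and percentages for a sequence, the kind of values that fields such as /NA.CNT and /NA.PER report. How STN derives the indexed values is not described here, so treat this as an assumption-laden illustration; the function names are hypothetical.

```python
from collections import Counter


def composition(seq):
    """Return {code: (count, percent)} for each code in the sequence."""
    seq = seq.upper()
    total = len(seq)
    return {
        code: (count, 100.0 * count / total)
        for code, count in Counter(seq).items()
    }


def gc_percent(seq):
    """Combined percentage of G and C, i.e. the usual GC content."""
    comp = composition(seq)
    return sum(comp.get(code, (0, 0.0))[1] for code in "GC")


if __name__ == "__main__":
    example = "GGCCGCATGC"
    for code, (count, percent) in sorted(composition(example).items()):
        print(f"{code}: count={count}, percent={percent:.1f}")
    print(f"GC content: {gc_percent(example):.1f}%")
```

The sketch reports both the per-code values and the combined GC figure. Whether a range such as 60-100 in the query is evaluated per code or for the combination is not stated in the text above.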

Better Compatibility with the PATGENE and GENESEQ Sequence Databases

While USGENE already had the Patent Sequence Location (/PSL) and Sequence Count (/SEQC) fields, their recent addition to PATGENE and GENESEQ means that the same sequence-specific searches can now be performed in all three databases.

For every sequence in USGENE, a key has been generated with the SHA-2 algorithm and indexed in the new Sequence Key (/SEQK) field. The generated string (e.g., A0000030BD19782FC1774AF58E4CFFEE7F0E30588CBA14DCD38C) is specific to a sequence. Identical sequences receive the same string, regardless of the database of origin or the organism from which the sequence was isolated. Further details on using the /SEQK field for efficient duplicate identification will be communicated in due course.
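The idea behind a hash-based sequence key can be sketched in Python as follows. This is an illustration only: the text names the SHA-2 family but not the variant, the input normalization, or the output format, so the use of SHA-256, the whitespace and case normalization, and the function name are all assumptions. The result will not reproduce actual /SEQK values.

```python
import hashlib


def sequence_key(seq):
    """Return a hex digest that depends only on the sequence content.

    Assumptions: SHA-256 as the SHA-2 variant, with whitespace removed
    and letters upper-cased before hashing.
    """
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest().upper()


if __name__ == "__main__":
    a = sequence_key("ATGGCCATTG")
    b = sequence_key("atggcc attg")  # same sequence, different formatting
    c = sequence_key("ATGGCCATTA")   # differs in the last base
    print(a == b)  # identical sequences share a key
    print(a == c)  # a single-base change yields a different key
```

Because the key depends only on the sequence itself, records holding the same sequence can be grouped by comparing keys instead of running a similarity search.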

Compatibility with Full-Text Patent Databases

Search fields common to the patent full text databases are now also available in USGENE:

  • /APO Application Number, Original
  • /DED Data Entry Date
  • /DUPD Data Update Date
  • /INA Inventor Address
  • /PAA Patent Assignee Address
  • /PNO Publication Number, Original
  • /PRDF Priority Date, First
  • /PRYF Priority Year, First
  • /PRNO Priority Number, Original
  • /RLPC Related Publication Country
  • /RLPD Related Publication Date
  • /RLPN Related Publication Number
  • /RLPY Related Publication Year
  • /RLT Related Application Type

Maximum Number of Hits Increased

The default maximum number of hits has been increased to 15,000.

The new parameter "-maxseq" allows the maximum number of hits to be increased to 100,000, but a larger maximum means a longer processing time. Example of setting maxseq to 100,000: => RUN BLAST L1/SQN -F F -MAXSEQ 100000

Additional Information

Although BATCH searches are not possible, L-numbers from sequence searches can be saved with the command SAVE and reactivated with ACTIVATE.

Alerts for sequences are not possible for the time being but can be set up for bibliographic fields.

The new Database Summary Sheet for USGENE is available at: https://www.cas.org/sites/default/files/documents/usgene.pdf

Manual Codes for Derwent World Patents Index Revised for 2022

The Derwent World Patents Index Manual Codes are revised each year to include new codes suggested by customers as well as the patent analysts at Clarivate.

For the 2022 revision, 79 new Manual Codes have been added, comprising:

  • 65 new CPI (Chemical Patents Index) codes
  • 14 new codes in the GMPI/EPI (General and Mechanical Patents Index / Electrical Engineering Patents Index) areas.

The new codes, in use since update 2022001, allow newly emerging technologies to be indexed in DWPI. Scope note changes have also been introduced to increase clarity.

Significant revisions for 2022 include:

  • PCR testing: New code hierarchies - B11-C08N*, C11-B08N*, D05-H18* for PCR testing methodologies and rapid/real-time testing
  • Geophysical muon imaging - using naturally occurring muons for imaging/mapping: New codes - S03-C02M
  • Mixed reality systems merging real-world and virtual world environments: New codes in T01-J40D
  • 6G mobile communication: New code - W02-C03C1M
  • Electric vehicle safety systems: New codes:
    • X21-A05A1 - Passenger and pedestrian protection
    • X21-A05A2 - External/Internal view cameras
    • X21-A05A3 - Horns/noise generators
    • X21-A05A5 - Anti-collision/parking systems

Full lists of the new and revised codes can be viewed at: https://clarivate.com/derwent/dwpi-reference-center/dwpi-manual-code/