PATGENE Database Provides Updated BLAST and GETSIM Versions, New Searching Capabilities and Data Quality Improvements
The patent sequence database PATGENE, providing rapid access to nucleotide and amino acid sequence data as submitted by patent applicants to the World Intellectual Property Organization (WIPO), has been reloaded and enhanced. The database was previously known as PCTGEN on STN.
Highlights of the new version of the database are:
- updates to the similarity searching packages BLAST and GETSIM (FASTA)
- availability of new BLAST algorithms
- better display of search results
- improvements in data quality in certain fields
- better compatibility with other sequence databases and with STN full text patent databases to make cross file searching easier
- increased processing speed for BLAST, GETSIM and GETSEQ
New BLAST and FASTA Versions
PATGENE now uses BLAST version 2.12.0, and four additional search options have been introduced:
- /SQM - the "megaBLAST" algorithm, for searching highly similar nucleotide sequences
- /SQDM - the "discontiguous megaBLAST" algorithm, for searching similar nucleotide sequences but allowing more mismatches
- /TSQP - the BLASTx algorithm, for searching a translated nucleotide query against PATGENE protein sequences
- /TSQNX - the tBLASTx algorithm, for searching a translated nucleotide query against translated PATGENE nucleotide sequences
Additional details on these new search options can be found by typing HELP BLAST or HELP TLATION at an arrow prompt while in PATGENE.
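BLASTx and tBLASTx compare sequences at the protein level, which relies on translating nucleotide sequences in all six reading frames. The sketch below illustrates that translation step only, assuming the standard genetic code. The function names are hypothetical, and the code is not how BLAST or STN implement it.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(nucleotides: str) -> str:
    """Translate one reading frame; unknown codons become 'X', stops '*'."""
    seq = nucleotides.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(nucleotides: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3 (illustrative labels)."""
    forward = nucleotides.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translations can then be compared against protein sequences, which is why translated searches can find similarities that are not apparent at the nucleotide level.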
The FASTA algorithm, invoked by RUN GETSIM, has been updated to version 36.3.8h and now allows searching of sequences up to 30K characters in length. The parameters, the overview diagram and the alignments are now displayed in the same way for GETSIM and BLAST searches.
Better Display of Search Results
New displays of similarity results are now available.
Alignments can be displayed for all three RUN options (BLAST, GETSIM, GETSEQ) as text with the display format ALIGN or as an image with ALIGNG.
New search fields for the composition of nucleic acid and protein sequences
Need to find sequences with a particular type of content? New search fields reporting the nucleotide and amino acid composition of each sequence make this possible. The new fields are as follows:
- /AA - retrieves amino acid codes expressed as single characters (see HELP AAC for the definitions of the amino acid codes)
- /NA - retrieves the nucleotide codes (see HELP NUC)
- /AA.CNT - retrieves the number of amino acids
- /NA.CNT - retrieves the number of nucleotides
- /AA.PER - retrieves the percentage of amino acids in the sequence
- /NA.PER - retrieves the percentage of nucleotides in the sequence
Range searching is possible for the /AA.CNT, /NA.CNT, /AA.PER, and /NA.PER fields, and the use of (S) proximity provides precision searching capabilities. For example, nucleotide sequences with a high GC content (guanine, cytosine) can be retrieved with: => S (G OR C)/NA (S) 60-100/NA.PER
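To make the count and percentage fields concrete, the following sketch computes a per-code composition for a single sequence. It is an illustration only: the function name is hypothetical, and how PATGENE itself counts ambiguous codes or rounds percentages is not stated here, so those details are assumptions.

```python
from collections import Counter


def composition(sequence: str) -> dict[str, tuple[int, float]]:
    """Return {code: (count, percent)} for each residue code in a sequence.

    Assumption: the percentage is the count divided by the total sequence
    length, with no special handling of ambiguity codes.
    """
    seq = "".join(sequence.split()).upper()
    if not seq:
        return {}
    counts = Counter(seq)
    return {code: (n, 100.0 * n / len(seq)) for code, n in counts.items()}


if __name__ == "__main__":
    for code, (count, percent) in sorted(composition("GGCCGA").items()):
        print(f"{code}: count={count}, percent={percent:.1f}")
```

The same function works for amino acid sequences written in single-character codes, which corresponds to the /AA, /AA.CNT and /AA.PER fields.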
Better Data Quality for Organism Names and Molecule Types
The quality of the data in the Organism Names (/ORGN) and Molecule Types (/MTY) fields has been improved for better search quality. Organism Names are indexed not only as strings but also as single words. Genus and species names can be searched using implied (S) proximity (e.g., => S DROSOPHILA MELANOGASTER/ORGN). Examples of Molecule Types include nucleic acid, mRNA, cDNA, and amino acid. EXPAND in the /MTY field to see all possibilities.
Compatibility with Other Sequence Databases
The search fields Patent Sequence Location (/PSL) and Sequence Count (/SEQC), already available in DGENE and USGENE, are now also available in PATGENE. This means that the same sequence-specific searches can now be performed in all three databases.
For every sequence in PATGENE, a hash value has been generated with the SHA-2 algorithm and indexed in the new field Sequence Key (/SEQK). The generated string (e.g., A0000030BD19782FC1774AF58E4CFFEE7F0E30588CBA14DCD38C) is specific to a sequence. Identical sequences receive the same string, regardless of the database of origin or the organism from which the sequence was isolated. The /SEQK field will be added to USGENE and GENESEQ (DGENE) in due course to enable efficient identification of duplicates.
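The idea behind a hash-based sequence key can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 is used as one member of the SHA-2 family, and the normalization (removing whitespace, uppercasing) and the function name are hypothetical. The variant, normalization and output formatting used for /SEQK are not specified here, and the sample key above is formatted differently from a plain hexadecimal digest, so this code will not reproduce actual /SEQK values.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Derive a deterministic key from a sequence (illustrative only).

    Assumptions: SHA-256 as the SHA-2 variant; whitespace is removed and
    letters are uppercased before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest().upper()


if __name__ == "__main__":
    a = sequence_key("atggcc taa")
    b = sequence_key("ATGGCCTAA")
    c = sequence_key("ATGGCCTAG")
    print(a == b)  # identical sequences share a key
    print(a == c)  # a single differing residue changes the key
```

Because the key depends only on the sequence itself, records holding the same sequence can be matched across databases by comparing keys rather than aligning sequences.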
Priority Information
Priority information is now available in the usual fields: /PRN, /PRC, /PRD and /PRY. The fields RLN, RLC and RLD/RLY, which previously included the priority numbers, are no longer available. With the improved standardization of application and priority numbers, cross file searching is more efficient.
Compatibility with Full Text Patent Databases
Search fields common to the patent full text databases are now also available in PATGENE:
- /APO Application Number, Original
- /DED Data Entry Date
- /DUPD Data Update Date
- /PNO Publication Number, Original
- /PRDF Priority Date, First
- /PRYF Priority Year, First
- /PRNO Priority Number, Original
These fields will also appear in USGENE and GENESEQ (DGENE) in due course.
Additional Changes/Enhancements
The default maximum number of hits has been increased to 15,000. The new parameter "-maxseq" allows this maximum to be raised to as many as 100,000 hits, but larger maximums mean longer processing times. Example: => RUN BLAST L1/SQN -F F -MAXSEQ 100000
Answer sets for motif searches (from RUN GETSEQ) will now be provided in a single L-number.
Batch searches are no longer available, but L-numbers from sequence searches in PATGENE can be saved with the Messenger command SAVE and reactivated with ACTIVATE.
For More Information
The new Database Summary Sheet for PATGENE is available at: https://www.stn-international.com/database-summary-sheets/patgene