April 8, 2021

Biosequence Query Validations

To simplify the input of biosequence queries, STNext now provides additional support and validation for the EMBL, GenBank, and FASTA file formats.

When the user clicks the Run Search button for a new or saved biosequence search, STNext performs the following validations against the query:

  1. Acceptable Format
  2. Multiple Sequences
  3. Line-Level Validation

1. Acceptable Format

The first validation checks whether the query is in one of the acceptable formats listed below. If it is not, the “Invalid format” error message displays.

_STNext-BiosequenceValidation-InvalidFormat.png

Notes:

  • BLAST supports all of the sequence formats listed below.
  • CDR supports only the FASTA format.
  • Motif supports only the Plain format.

Plain

MDIAIHHPW IRRPFFPFHS PSRLFDQF FGEHLLE SDLFPAS TSLSPFYLR

PPSFLRAPS WIDTGLSEMR LEKDRFSV NLDVKHF SPEELKV KVLGDVIEV

EMBL

Sequence data with trailing line numbers, preceded by metadata.

Example:

<meta data>

ID AMU73928

.....

SQ Sequence....

</meta data>

MDIAIHHPW IRRPFFPFHS PSRLFDQF FGEHLLE SDLFPAS TSLSPFYLR 60

PPSFLRAPS WIDTGLSEMR LEKDRFSV NLDVKHF SPEELKV KVLGDVIEV 120

HGKHEERQD EHGFISREFH RKYRIPAD VDPLAIT SSLSSDG VLTVNGPRK 180

//

Sequence data with trailing line numbers, without metadata.

Example:

MDIAIHHPW IRRPFFPFHS PSRLFDQF FGEHLLE SDLFPAS TSLSPFYLR 60

PPSFLRAPS WIDTGLSEMR LEKDRFSV NLDVKHF SPEELKV KVLGDVIEV 120

HGKHEERQD EHGFISREFH RKYRIPAD VDPLAIT SSLSSDG VLTVNGPRK 180

GENBANK

Sequence data with leading line numbers, preceded by metadata.

Example:

<meta data>

LOCUS SCU49845

....

ORIGIN

</meta data>

1 MDIAIHHPW IRRPFFPFHS PSRLFDQF FGEHLLE SDLFPAS TSLSPFYLR

61 PPSFLRAPS WIDTGLSEMR LEKDRFSV NLDVKHF SPEELKV KVLGDVIEV

121 HGKHEERQD EHGFISREFH RKYRIPAD VDPLAIT SSLSSDG VLTVNGPRK

//

Sequence data with leading line numbers, without metadata.

Example:

1 MDIAIHHPW IRRPFFPFHS PSRLFDQF FGEHLLE SDLFPAS TSLSPFYLR

61 PPSFLRAPS WIDTGLSEMR LEKDRFSV NLDVKHF SPEELKV KVLGDVIEV

121 HGKHEERQD EHGFISREFH RKYRIPAD VDPLAIT SSLSSDG VLTVNGPRK
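
The validation rules later in this note refer to the "parsed query", which suggests that metadata and line numbers are removed before the residues are checked. A minimal Python sketch of that step, assuming the EMBL and GenBank layouts shown above, is given here. The function name strip_numbering is hypothetical, and treating the SQ or ORIGIN line as the end of the metadata is an assumption; this is not STNext's actual parser.

```python
import re

def strip_numbering(query: str) -> str:
    """Return the bare residues of an EMBL- or GenBank-style query (hypothetical helper)."""
    lines = [ln.strip() for ln in query.splitlines() if ln.strip()]

    # Assumption: if a metadata header is present, it ends at the SQ (EMBL)
    # or ORIGIN (GenBank) line, and the sequence starts on the next line.
    for i, ln in enumerate(lines):
        if re.match(r"(SQ\s|ORIGIN\b)", ln):
            lines = lines[i + 1:]
            break

    residues = []
    for ln in lines:
        if ln == "//":                      # end-of-record marker
            break
        ln = re.sub(r"^\d+\s+", "", ln)     # GenBank: leading number
        ln = re.sub(r"\s+\d+$", "", ln)     # EMBL: trailing number
        residues.append(re.sub(r"\s+", "", ln))  # drop the spaces between groups
    return "".join(residues)
```

For example, both "MDIAIHHPW IRRPFFPFHS 60" and "1 MDIAIHHPW IRRPFFPFHS" reduce to the same string, MDIAIHHPWIRRPFFPFHS.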

FASTA

A single sequence, with the sequence data below a header line that begins with the > sign followed by title text.

Example:

>crab_mouse ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN) (P23).

tgcaccaaac atgtctaaag ctggaaccaaaa ttactttctttg aagacaaaaactttca

aggccgccac tatgacagcg attgcgactgtg cagatttccaca tgtacctgagccgctg

caactccatc agagtggaag gaggcacctggg ctgtgtatgaaaggcccaattttgctgg

gtacatgtacatcctaccccggggcgagtatcctgagtaccagcactggatgggcctcaa
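
The four formats above can be told apart by a few surface features: a leading > line (FASTA), leading numbers (GenBank), trailing numbers (EMBL), or none of these (Plain). The sketch below is one heuristic way to express that in Python, together with the per-search-type support listed in the notes above. The names detect_format, is_acceptable, and SUPPORTED are hypothetical, and the rule that a query with only some numbered lines is invalid is an assumption; the actual STNext check may differ.

```python
import re

LEADING = re.compile(r"^\d+\s+\S")    # e.g. "61 PPSFLRAPS ..."
TRAILING = re.compile(r"\S\s+\d+$")   # e.g. "... KVLGDVIEV 120"

# Format support per search type, as listed in the notes above.
SUPPORTED = {
    "BLAST": {"PLAIN", "EMBL", "GENBANK", "FASTA"},
    "CDR": {"FASTA"},
    "MOTIF": {"PLAIN"},
}

def detect_format(query: str) -> str:
    """Guess the query format from surface features (heuristic sketch)."""
    lines = [ln.strip() for ln in query.splitlines() if ln.strip()]
    if not lines:
        return "INVALID"
    first = lines[0]
    if first.startswith(">"):
        return "FASTA"
    if first.startswith("LOCUS"):       # GenBank metadata header
        return "GENBANK"
    if re.match(r"ID\s", first):        # EMBL metadata header
        return "EMBL"

    body = [ln for ln in lines if ln != "//"]
    if not body:
        return "INVALID"
    n_lead = sum(bool(LEADING.match(ln)) for ln in body)
    n_trail = sum(bool(TRAILING.search(ln)) for ln in body)
    if n_lead == len(body):
        return "GENBANK"
    if n_trail == len(body):
        return "EMBL"
    if n_lead == 0 and n_trail == 0:
        return "PLAIN"
    return "INVALID"                    # assumption: mixed numbering is rejected

def is_acceptable(query: str, search_type: str) -> bool:
    """True if the detected format is supported for the given search type."""
    return detect_format(query) in SUPPORTED[search_type]
```

Under these assumptions, a FASTA query passes for BLAST and CDR but would trigger the “Invalid format” message for Motif.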

2. Multiple Sequences

The second validation checks whether a single query field contains multiple sequences (identified by the > tag). If so, the “Multiple sequences” error message displays with the line numbers shown/highlighted.

_STNext-BiosequenceValidation-MultipleSequences.png
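
As an illustration of this check, the following Python sketch counts the lines that begin with the > tag and reports their line numbers when there is more than one. The function names are hypothetical, and reporting every header line (rather than, say, only the second and later ones) is an assumption about which lines STNext highlights.

```python
from typing import List, Optional

def find_header_lines(query: str) -> List[int]:
    """Return the 1-based numbers of all lines that start with the > tag."""
    return [
        number
        for number, line in enumerate(query.splitlines(), start=1)
        if line.lstrip().startswith(">")
    ]

def check_multiple_sequences(query: str) -> Optional[str]:
    """Return an error message if the query holds more than one sequence, else None."""
    headers = find_header_lines(query)
    if len(headers) > 1:
        # Assumption: every header line is reported.
        numbers = ", ".join(str(n) for n in headers)
        return f"Multiple sequences (lines {numbers})"
    return None
```

A query with > on lines 1 and 3 would be rejected here, while a query with a single header line, or none, passes.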

3. Line-Level Validation

If the query passes the first two validations, the following are checked for every line of the file:

  • Leading and trailing numbers
  • BLAST and CDR: If the parsed query contains a character other than the following, the “Invalid characters” error message displays with the line numbers shown/highlighted:
    • A-Z
    • a-z
    • 0-9
  • Motif:
    • If the parsed query contains a character other than the following, the “Invalid characters” error message displays with the line numbers shown/highlighted:
      • A-Z
      • a-z
      • 0-9
      • { }
      • [ ]
      • ^ $ (Note: Both the ^ and the $ sign are required in every expression; e.g., valid: ^AAAAAAAAAAAAAA$, invalid: ^AAAAAAAAAAAAAA)
      • ,
      • .

        _STNext-BiosequenceValidation-MultipleSequences-BLAST.png
    • If the parsed query contains mismatching numbers of opening and closing brackets for { } or [ ], the “Mismatched brackets” error message displays with the line numbers shown/highlighted.
    • Numeric characters may only occur as part of a quantifier (e.g., valid: {1,2}, {2}; invalid: 11CC22, AA2CC); otherwise, the "Invalid quantifier" error message displays with the line numbers shown/highlighted.
    • If the numbers 0-9 do not appear in the correct format within a quantifier, the “Invalid number pattern” error message displays with the line numbers shown/highlighted.
      • Invalid format examples:
        • Dash instead of a comma, or a missing number: {1-2}, {1,}
        • “From” value greater than “To” value: {5,1} (5 is greater than 1)
      • Valid format example: {1,5}
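
The line-level rules above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not STNext's implementation: the function names are hypothetical, whitespace between residue groups is assumed to be allowed, all applicable messages are collected for a line (the order in which STNext applies the checks is not stated), and the requirement that ^ and $ both be present is left out because the notes do not say which message it produces.

```python
import re
from typing import Dict, List

BLAST_CDR_ALLOWED = re.compile(r"[A-Za-z0-9\s]*")
MOTIF_ALLOWED = re.compile(r"[A-Za-z0-9{}\[\]^$,.\s]*")
BRACED = re.compile(r"\{([^{}]*)\}")               # text inside a { } pair
VALID_QUANTIFIER = re.compile(r"(\d+)(?:,(\d+))?")  # {n} or {n,m}

def validate_blast_cdr_line(line: str) -> List[str]:
    """BLAST and CDR: only letters and digits (plus whitespace, by assumption)."""
    return [] if BLAST_CDR_ALLOWED.fullmatch(line) else ["Invalid characters"]

def validate_motif_line(line: str) -> List[str]:
    """Motif: character set, bracket balance, and quantifier checks."""
    errors = []
    if not MOTIF_ALLOWED.fullmatch(line):
        errors.append("Invalid characters")
    if line.count("{") != line.count("}") or line.count("[") != line.count("]"):
        errors.append("Mismatched brackets")
    # Digits are only allowed inside a { } quantifier.
    outside = BRACED.sub("", line)
    if any(ch.isdigit() for ch in outside):
        errors.append("Invalid quantifier")
    # Inside a quantifier, digits must form {n} or {n,m} with n <= m.
    for body in BRACED.findall(line):
        if not any(ch.isdigit() for ch in body):
            continue
        m = VALID_QUANTIFIER.fullmatch(body)
        if m is None or (m.group(2) is not None and int(m.group(1)) > int(m.group(2))):
            errors.append("Invalid number pattern")
            break
    return errors

def validate_lines(query: str, search_type: str) -> Dict[int, List[str]]:
    """Map each failing 1-based line number to its error messages."""
    check = validate_motif_line if search_type == "MOTIF" else validate_blast_cdr_line
    failures = {}
    for number, line in enumerate(query.splitlines(), start=1):
        errors = check(line)
        if errors:
            failures[number] = errors
    return failures
```

With this sketch, ^AC{1,5}G$ passes the Motif checks, AA2CC yields “Invalid quantifier”, and {5,1} yields “Invalid number pattern”. Note that {1-2} also contains a dash, which is outside the allowed character set, so the sketch reports both “Invalid characters” and “Invalid number pattern” for it.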

Download Optimization

  • Report downloads up to 30% faster due to parallelized field detection.
  • Transcript downloads up to 40% faster due to optimized RTF structure image conversion.