Searching DWPI Chemical Fragmentation Codes


STNext can translate a structure into corresponding alphanumeric fragmentation codes and search them against the WPIX or WPIDS files via a script.

After drawing or importing a fragmentation code structure:

  1. On the Structures page, click the ellipsis (...) menu for the fragmentation code structure, and then select Generate FragCode Script.


    Note: Please see FCO Input Format: Supported and Unsupported Components for a complete list of supported and unsupported structure components.

  2. You may edit the system-generated Script Name as long as it is not the same as a pre-existing file on the Scripts page, and then click the Generate Script button.


  3. The system automatically converts the structure drawing into a fragmentation code search and prepares a search strategy in the form of an STN script, which appears in the Edit Script window. There, you may run the search in either the WPIX or WPIDS database (which also saves the script) or save the script for later use.

    Note: The validation check only covers STN commands and not the syntax of the script or the validity of fragmentation codes.
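Because the validation check does not cover script syntax, it can help to verify an edited script offline before running it. The following is a minimal Python sketch, not an STNext feature. It assumes the line shape seen in the examples on this page (`=>s <expression> \>_lineN`), and the function name `undefined_variables` is hypothetical. It reports every `_lineN` variable that is used before a previous line has assigned it.

```python
import re

# Assumed line shape, taken from the script examples on this page:
#   =>s <search expression> \>_lineN
ASSIGNMENT = re.compile(r"\\>\s*(_line\d+)\s*$")


def undefined_variables(script: str) -> list[tuple[int, str]]:
    """Return (line number, variable) pairs for variables used before assignment."""
    defined: set[str] = set()
    problems: list[tuple[int, str]] = []
    for number, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = ASSIGNMENT.search(line)
        # Only the part before the trailing assignment can contain references.
        body = line[: match.start()] if match else line
        for name in re.findall(r"_line\d+", body):
            if name not in defined:
                problems.append((number, name))
        if match:
            defined.add(match.group(1))
    return problems


# Hypothetical three-line script: the second line lacks its "\>_line2" assignment.
example = "\n".join([
    r"=>s (M210 OR M211)/M0,M2,M3,M4 \>_line1",
    r"=>s _line1(P)M900/M0",
    r"=>s _line2(NOTP)(K1 OR M1)/M2,M3,M4 \>_line3",
])
print(undefined_variables(example))
```

The sketch treats each physical line as one command, so a command wrapped over several lines would need to be joined first.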


In addition, the Edit Script window can be used to modify the search to allow for substitution by:

  • Removing negation codes
  • Adding codes with counts (e.g., J012-J014 to consider several COOH substituents)
  • Adding codes for derivatives (e.g., hydroxy to ether or ester)
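As an illustration of adding codes with counts, the Python sketch below replaces a single code in a script line with an OR group, in the same style as the `(M272 OR M270)` groups that appear in generated scripts. It is a sketch under assumptions: the helper name `widen_code` and the sample script line are hypothetical, and reading "J012-J014" as the three codes J012, J013, and J014 is an assumption that should be checked against the code definitions.

```python
import re


def widen_code(expression: str, code: str, alternatives: list[str]) -> str:
    """Replace `code` with an OR group made of `code` plus `alternatives`."""
    group = "(" + " OR ".join([code, *alternatives]) + ")"
    # Match the code only as a whole token, never inside a longer code.
    token = re.compile(rf"(?<![A-Za-z0-9_]){re.escape(code)}(?![A-Za-z0-9_])")
    return token.sub(lambda _m: group, expression)


# Hypothetical script line, written in the style of the generated scripts.
line = r"=>s (M413(P)J012(P)M510)/M0,M2,M3,M4 \>_line2"
widened = widen_code(line, "J012", ["J013", "J014"])
print(widened)
```

The edited line can then be pasted back into the Edit Script window.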

Below is an example of a search script run in WPIX.


In the display, the hit codes are highlighted in the CMC section (e.g., d kwic, d cmc).


To ensure that you receive correct results of your fragmentation code search, please see Critical Aspects for Your Search: Limitations and Best Practices.

FCO Input Format: Supported and Unsupported Components

Below are the supported and unsupported structure components for fragmentation code structures.

Bond Representation

Unsupported: unspecified

Bond Types

  • Supported: single, double, triple
  • Unsupported: ring (for chain bond)

Generic Nodes

  • Supported: CHK, CHE, CHY, ARY, HEA, HEF, HET, CYC, A35, AMX, TRM, LAN, ACT, HAL, X
  • Unsupported: Ak, Cb, Cy, Hy, Id, A, Q, ACY, DYE, POL, XX, PEG, PRT, UNK

Other Nodes Not Supported

Isotopes D and T

Ignored Features and Attributes

The following features and attributes are ignored. As a consequence, the respective default values are automatically used for script generation.

  • The following bond values and representations
    • exact/normalized; normalized
    • stereo bonds; E/Z bonds
  • saturated/unsaturated
  • branched/linear
  • number of carbon atoms: less than 7 / 7 or more
  • number of hetero atoms: exactly 1 / 2 or more
  • abnormal valency
  • abnormal/specific mass
  • atom lock
  • charges
  • match level
  • element count level
  • non-hydrogen count (NHC), hydrogen count (HC)
  • ring/chain (for chain bonds)

Unsupported Features and Attributes

  • Variable Point of Attachment (VPA)
  • Ring isolation
  • Repeating Groups
  • R-groups with more than 5 atoms in the atom list
  • R-groups with generic nodes
  • Two or more X nodes attached to the same atom

Atom Lists

  • The maximum number of atoms per list is 5.
  • Not-lists are not supported.

Supergroups (Sgroups)

  • Supported: SUP (superatom)
  • Unsupported: MUL = multiple group, SRU = structure repeating unit, MOM = monomer, etc. Consequently polymers and peptides cannot be processed.

Unsupported Chemotypes

For these cases, no specific error message can be generated (except for Two or More X Nodes at the Same Atom Generate Wrong Strategy). For best practice, please refer to Critical Aspects for Your Search: Limitations and Best Practices below.

Critical Aspects for Your Search: Limitations and Best Practices

-OH and -SH Substituents on Aromatic Rings

Issue: For aromatic rings with hydroxy or thiol substituents (i.e., the hydrogen is present in the structure), mandatory codes are not generated (e.g., phenol lacks codes (H401, H441), thiophenol lacks code (H494)).

Workaround: For these types of structures, neither draw the hydrogen nor use the respective -OH or -SH shortcut. In the case of the "naked" heteroatom (e.g., Ph-O, Ph-S), the correct strategy is generated.


Note: Saturated carbocycles are not affected (e.g., for cyclohexanol with explicitly present hydrogen, the correct strategy is generated).

-OH and -SH Substituents on Saturated Heterocycles

Issue: For saturated heterocycles with -O and -S substituents (i.e., the hydrogen is not present in the structure), wrong codes are generated for -O substituted structures (codes J521, J522, J523 (1, 2, >=3 Het-Oxo)) and -S substituted structures (code J592 (Het-thioxo)).

Workaround: Draw the hydrogen or use the respective –OH or –SH shortcut for these types of structures.


Tautomerism: Keto-enol and Imine/Enamine Tautomers

When searching for tautomeric structures, please make sure that your STNext structure query is in line with the coding rules.

For example, when keto-enol tautomerism is possible, the structure is coded in the keto form unless the -OH group of the enol form is bonded to a fully conjugated carbocyclic ring (e.g., benzene). For this reason, the search structures should be drawn as follows:


For detailed information, please refer to the “Tautomerism” section in Chapter 8: Functional Groups of the CPI Chemical Indexing User Guide.

Important: The codes generated for a tautomeric structure are not always 100% correct. We therefore highly recommend always checking the corresponding script for correctness.

Two or More X Nodes at the Same Atom Generate Wrong Strategy

Issue: If two or more X nodes (generic node for halogen) are attached on the same atom (e.g., benzene-CCl2 or benzene-CF3), a wrong fragcode strategy is generated.

Workaround: Use R-groups containing F, Cl, Br, I.

Carbohydrates Require Negation Code Revision

The indexing of carbohydrates includes the required code L8 as well as code K0.

Issue: In the fragmentation code strategy for carbohydrates, the code K0 is included in the negation codes. As a consequence, relevant records are not found since K0 is usually indexed for carbohydrates.

Workaround: For carbohydrates, the code K0 needs to be manually deleted from the negation codes. In addition, codes K1-K9, L1-L7, and L9 should be added to the negation codes. This leads to more accurate results by eliminating structures that contain other functional groups which are not part of the original structure.


Indexed fragmentation codes

M2 *01* F012 F013 F014 F015 F016 F123 H4 H404 H423 H481 H5 H521 H8 K0 L8

L815 L821 L831 M210 M211 M272 M281 M311 M321 M342 M373 M391 M413

M431 M510 M521 M530 M540 M782 P220 P420 P943 Q261 R032 M905


Query Structure in STNext

Note: To avoid the issue described in -OH and -SH Substituents on Saturated Heterocycles, the hydrogens of the hydroxy groups should be explicitly present.


Autogenerated Fragmentation Code Strategy from the Structure Editor

=>s (M210 OR M211)/M0,M2,M3,M4 \>_line1

=>s (M413(P)F123(P)H423(P)H481(P)H521)/M0,M2,M3,M4 \>_line2

=>s _line2(P)(M521(P)M510(P)M530(P)M540)/M0,M2,M3,M4 \>_line3

=>s _line3(P)((M272 OR M270)(P)M281(P)M311(P)M321(P)M342(P)(M373 OR M370)(P)M391)/M0,M2,M3,M4 \>_line4

=>s _line4(P)_line1 \>_line5

=>s _line5(P)(F012(P)F013(P)F014(P)F015(P)F016(P)H404)/M0,M2,M3,M4 \>_line6

=>s (_line2(P)M900/M0) OR (_line3(P)M901/M2,M3,M4) OR (_line5(P)M902/M2,M3,M4) OR _line6 \>_line7

=>s _line7(NOTP)(H1 OR H2 OR H3 OR H6 OR H7 OR H9 OR J0 OR J1 OR J2 OR J3 OR J4 OR J5 OR J9 OR K0 OR M1)/M2,M3,M4 \>_line8

Issue: The indexed code K0 is included in the negation codes and has to be deleted.

Manually Corrected Codes

=>s (M210 OR M211)/M0,M2,M3,M4 \>_line1

=>s (M413(P)F123(P)H423(P)H481(P)H521)/M0,M2,M3,M4 \>_line2

=>s _line2(P)(M521(P)M510(P)M530(P)M540)/M0,M2,M3,M4 \>_line3

=>s _line3(P)((M272 OR M270)(P)M281(P)M311(P)M321(P)M342(P)(M373 OR M370)(P)M391)/M0,M2,M3,M4 \>_line4

=>s _line4(P)_line1 \>_line5

=>s _line5(P)(F012(P)F013(P)F014(P)F015(P)F016(P)H404)/M0,M2,M3,M4 \>_line6

=>s (_line2(P)M900/M0) OR (_line3(P)M901/M2,M3,M4) OR (_line5(P)M902/M2,M3,M4) OR _line6 \>_line7


=>s _line7(NOTP)(H1 OR H2 OR H3 OR H6 OR H7 OR H9 OR J0 OR J1 OR J2 OR J3 OR J4 OR J5 OR J9 OR K1 OR K2 OR K3 OR K4 OR K5 OR K6 OR K7 OR K8 OR K9 OR L1 OR L2 OR L3 OR L4 OR L5 OR L6 OR L7 OR L9 OR M1)/M2,M3,M4 \>_line8

Solution: K0 is removed from the negation codes (mandatory). To enhance the accuracy of the results, all K and L codes other than the required K0 and L8 (i.e., K1-K9, L1-L7, and L9) are added to the negation codes.
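The manual correction of the negation line can also be expressed as a small text transformation. The Python sketch below is illustrative only and is not part of STNext; the helper name `revise_negation` is hypothetical, and it assumes that the negation codes form a single parenthesized OR list directly after `(NOTP)`, as in the example on this page.

```python
import re


def revise_negation(line: str, remove: set[str], add: list[str]) -> str:
    """Drop `remove` codes from the (NOTP) group and add `add` codes to it."""
    match = re.search(r"\(NOTP\)\(([^()]*)\)", line)
    if match is None:
        return line  # no negation group on this line
    codes = set(re.split(r"\s+OR\s+", match.group(1).strip()))
    revised = sorted((codes - remove) | set(add))
    return line[: match.start(1)] + " OR ".join(revised) + line[match.end(1):]


# Autogenerated negation line for the carbohydrate example on this page.
autogenerated = (
    r"=>s _line7(NOTP)(H1 OR H2 OR H3 OR H6 OR H7 OR H9 OR J0 OR J1 OR J2 "
    r"OR J3 OR J4 OR J5 OR J9 OR K0 OR M1)/M2,M3,M4 \>_line8"
)
extra = [f"K{i}" for i in range(1, 10)] + [f"L{i}" for i in range(1, 8)] + ["L9"]
corrected = revise_negation(autogenerated, remove={"K0"}, add=extra)
print(corrected)
```

Sorting the codes only keeps the output stable and readable; the order of terms inside the OR list should not affect the search.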

Code Generation for Steroids Not Supported

Issue: For the chemotype of steroids (e.g., cholesterol), a wrong fragcode strategy is generated. There is no workaround.


L9 Code Set Should Be Manually Edited

The DCR indexing of the L9 code set is not consistent due to the complexity of complete recognition of these structural elements within a chemical structure. Therefore, fragmentation code strategies which include such codes may miss relevant records.

The L9 code set includes:


Best Practice: Delete those codes from the fragmentation code strategy.

RIN Codes for Certain Non-Aromatic Versions of Polycyclic Carbocycles To Be Edited

There is RIN indexing for complete spiro systems (e.g., RIN 06706 for 9,9′-Spirobifluorene). In certain cases, there is additional RIN indexing for the individual ring systems that are joined by the spiro link. Usually, ring index numbers apply to a ring system irrespective of the degree of unsaturation. However, for a small number of polycyclic carbocyclic ring systems, there is no specific RIN code for the aromatic version of the system but a specific RIN code for the non-aromatic version (i.e., the same ring system with all of the benzene rings wholly or partially hydrogenated, so that no intact aromatic ring system or quinoid variant thereof is present).

The following chemotypes are affected:

  • Fluorene: aromatic version G310 (no RIN); non-aromatic version G720, RIN 03126
  • Anthracene: aromatic version G331/G332 (no RIN); non-aromatic version G730, RIN 03618
  • Phenanthrene: aromatic version G341/G342 (no RIN); non-aromatic version G730, RIN 03619
  • Chrysene: aromatic version G410 (no RIN); non-aromatic version G800, RIN 05254
  • Naphthacene: aromatic version G420 (no RIN); non-aromatic version G800, RIN 05252
  • Dibenzo(a,d)cycloheptene: aromatic version G360 (no RIN); non-aromatic version G750, RIN 03708
  • Dibenzo(a,c)cycloheptene: aromatic version G380 (no RIN); non-aromatic version G750, RIN 03714

Issue: In STNext, the RIN codes for the non-aromatic versions may be included even for aromatic systems of the type described above. For instance, for 9,9′-Spirobifluorene (CAS RN 159-66-0), a fully aromatic system, only the RIN code 06706 relating to the spiro system should be applied. In STNext, however, the wrong RIN code 03126 relating to non-aromatic systems (and triggered by the presence of code G720) is additionally included. This leads to a reduced answer set, and consequently, relevant hits are missed.

Best Practice: Check your fragmentation code script to ensure that, for instance, RIN code 03126 is not applied if only G310 and not G720 is present (and similarly check for the other systems listed above).
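This check can be sketched in Python as a simple scan of the script text. The sketch is not an STNext feature; the function name `suspicious_rins` is hypothetical, and the mapping below is copied from the list of affected chemotypes above. It assumes that RIN terms are 5-digit numbers on lines searched in the /RIN field and that ring codes have the form G followed by three digits. A RIN is reported when it appears in the script although its non-aromatic ring code does not.

```python
import re

# RIN of the non-aromatic version -> ring code of the non-aromatic version.
NON_AROMATIC_RINS = {
    "03126": "G720",  # fluorene
    "03618": "G730",  # anthracene
    "03619": "G730",  # phenanthrene
    "05254": "G800",  # chrysene
    "05252": "G800",  # naphthacene
    "03708": "G750",  # dibenzo(a,d)cycloheptene
    "03714": "G750",  # dibenzo(a,c)cycloheptene
}


def suspicious_rins(script: str) -> list[str]:
    """Return RINs whose non-aromatic ring code is absent from the script."""
    rins: set[str] = set()
    for line in script.splitlines():
        if "/RIN" in line:
            rins.update(re.findall(r"(?<!\d)\d{5}(?!\d)", line))
    ring_codes = set(re.findall(r"\bG\d{3}\b", script))
    return sorted(
        rin for rin in rins
        if rin in NON_AROMATIC_RINS and NON_AROMATIC_RINS[rin] not in ring_codes
    )


# Script excerpts in the style of the STR1 and STR2 strategies on this page.
str1 = "\n".join([
    r"=>s (M414(P)G041(P)G310(P)G399(P)M532)/M0,M2,M3,M4 \>_line1",
    r"=>s _line3(P)(03126(P)06706)/RIN \>_line4",
])
str2 = "\n".join([
    r"=>s (M414(P)G041(P)G052(P)G310(P)G720(P)M531)/M0,M2,M3,M4 \>_line1",
    r"=>s _line3(P)(03126(P)06706)/RIN \>_line4",
])
print(suspicious_rins(str1), suspicious_rins(str2))
```

A reported RIN is only a candidate for deletion; the script should still be reviewed manually before the term is removed.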

Example: Comparison of fragmentation code strategies on STNext for STR1, STR2, and STR3:


This specific example relates to the following codes:

  • Fluorene – G310 (This code covers only the ring system fluorene and hydrogenated versions where at least one benzene ring retains its 3 double bonds (or a quinoid variant thereof).) There is no asterisk as the code only describes one ring system.
  • Polyhydrofluorene – G720 (Neither of the 6-membered rings is aromatic or quinoid.) Note that G720 has an asterisk, indicating that a RIN is required because the code covers several possible ring systems.

Autogenerated Fragmentation Code Strategy for STR1: RIN 06706 is correct, RIN 03126 is wrong (to be deleted manually from the STNext fragmentation code script).

=>s (M414(P)G041(P)G310(P)G399(P)M532)/M0,M2,M3,M4 \>_line1

=>s _line1(P)(M610(P)M510(P)M520(P)M540)/M0,M2,M3,M4 \>_line2

=>s _line2(P)(M280(P)M320)/M0,M2,M3,M4 \>_line3

=>s _line3(P)(03126(P)06706)/RIN \>_line4

=>s _line4(P)(G031(P)G039)/M0,M2,M3,M4 \>_line5

=>s (_line1(P)M900/M0) OR (_line2(P)M901/M2,M3,M4) OR (_line4(P)M902/M2,M3,M4) OR _line5 \>_line6


=>s _line6(NOTP)(H1 OR H2 OR H3 OR H4 OR H5 OR H6 OR H7 OR H8 OR H9 OR J0 OR J1 OR J2 OR J3 OR J4 OR J5 OR J9 OR K0 OR M1)/M2,M3,M4 \>_line7

Autogenerated Fragmentation Code Strategy for STR2: RINs 03126 and 06706 are correct.

=>s (M414(P)G041(P)G052(P)G310(P)G720(P)M531)/M0,M2,M3,M4 \>_line1

=>s _line1(P)(M541(P)M610(P)M510(P)M520)/M0,M2,M3,M4 \>_line2

=>s _line2(P)(M280(P)M320)/M0,M2,M3,M4 \>_line3

=>s _line3(P)(03126(P)06706)/RIN \>_line4

=>s _line4(P)(G031(P)G039)/M0,M2,M3,M4 \>_line5

=>s (_line1(P)M900/M0) OR (_line2(P)M901/M2,M3,M4) OR (_line4(P)M902/M2,M3,M4) OR _line5 \>_line6

=>s _line6(NOTP)(H1 OR H2 OR H3 OR H4 OR H5 OR H6 OR H7 OR H8 OR H9 OR J0 OR J1 OR J2 OR J3 OR J4 OR J5 OR J9 OR K0 OR M1)/M2,M3,M4 \>_line7

Autogenerated Fragmentation Code Strategy for STR3: RINs 03126 and 06706 are correct.

=>s (M415(P)G052(P)G720(P)G799)/M0,M2,M3,M4 \>_line1

=>s _line1(P)(M542(P)M610(P)M510(P)M520(P)M530)/M0,M2,M3,M4 \>_line2

=>s _line2(P)(M280(P)M320)/M0,M2,M3,M4 \>_line3

=>s _line3(P)(03126(P)06706)/RIN \>_line4

=>s _line4(P)(G031(P)G039)/M0,M2,M3,M4 \>_line5

=>s (_line1(P)M900/M0) OR (_line2(P)M901/M2,M3,M4) OR (_line4(P)M902/M2,M3,M4) OR _line5 \>_line6

=>s _line6(NOTP)(H1 OR H2 OR H3 OR H4 OR H5 OR H6 OR H7 OR H8 OR H9 OR J0 OR J1 OR J2 OR J3 OR J4 OR J5 OR J9 OR K0 OR M1)/M2,M3,M4 \>_line7

Note: The STN Express strategy for STR2 and STR3 is incomplete as RIN code 03126 is missing. However, the omission of 03126 will probably have no effect on the retrieval (or at worst a marginal effect) since RIN 06706 relates to a spiro system linking 2 fluorene ring systems together. This means that when RIN 06706 is present along with G720, it implies that the ring system it applies to is a hydrogenated fluorene.

Missing J-Codes for Certain Types of Fused Heterocycles Leading to Loss of Precision

Rule: “If the atom bonded to the functional group is a non-angular C atom in a bridged ring system, and if the C atom could be seen as being a member of different sized rings, the smallest ring size is chosen, even if it is of lower priority than the larger ring.”

In the example below for STR1, the non-angular C atom that is substituted by a functional group (oxo) is part of two 6-membered rings (the bridged all-carbon ring and the nitrogen-containing ring). For STR2, the bridged all-carbon ring is 5-membered, and the nitrogen-containing heterocyclic ring is 6-membered. According to the rule, the system should generate J521 for STR1 and J561 for STR2.


Issue: For fused heterocycles of the chemotype of STR2, the respective J-code is missing in the fragmentation code strategy. For the example above, STNext fails to generate code J561 for STR2. The issue affects structures with a non-angular C atom substituted by oxo (J561 is missing) or thio (J596 is missing); the corresponding imino chemotype is not affected. Furthermore, it affects not only nitrogen heterocycles, but also heterocycles containing other heteroatoms. The consequence of this issue is loss of precision resulting in a larger answer set. For the example above, STR2 leads to 40 hits in WPIX with the automatically generated fragmentation code strategy (lacking J561), whereas the corrected strategy (including J561) leads to 7 hits.

Workaround: Add the respective J-codes for the affected chemotypes to enhance precision.