Jmol SMILES, Jmol SMARTS, Jmol bioSMILES, and Jmol bioSMARTS

Robert M. Hanson
Department of Chemistry
St. Olaf College
8/26/2015

This document describes a specification for an extension of SMARTS for use in 3D molecular atom search and selection as well as biomolecular sequence and cross-link searching. This specification was initially implemented in Jmol 12.0 and revised for Jmol 14.4. It is really a set of specifications:

Jmol SMILES A minor adaptation of SMILES, allowing comments and white space, and allowing more than 99 concurrently open connections.
Jmol bioSMILES An extension of Jmol SMILES that incorporates both biomolecular sequence/cross-linking information along with more standard molecular or ionic components, allowing for extensive searching of biomolecular frameworks. The coding basically substitutes residues for SMILES atoms and cross-linking and base pairing for SMILES "ring" connections.
Jmol SMARTS An extension of SMARTS substructure searching that allows several more features, including (among others) searching of molecular distance, angle, and torsion measurements, searching of both standard SMARTS substructure and Jmol bioSEQUENCE information within a 3D molecular structure, a standard SMILES string, Jmol SMILES string, or Jmol bioSMILES string.
Jmol bioSMARTS An extension of Jmol SMARTS that allows additional substructure search options relevant to biomolecules.

The org.jmol.smiles package provides extensive functionality for selecting atoms within a three-dimensional model based on SMILES and SMARTS strings.

Besides presenting general considerations, this document gives a detailed specification of the syntax and defines the term "aromatic".

Comparison to Daylight SMILES

All single-component aspects of Daylight SMILES are implemented, including aromaticity and atom- and bond-based stereochemistry ("chirality").

aromaticity

By default, Jmol SMILES and Jmol SMARTS define "aromatic" unambiguously and strictly geometrically. However, starting with Jmol 12.3.24, you can use the directive /aromaticStrict/ to add to that a 6-electron Hueckel model for specifying aromaticity (see below).
Note that "aromatic" is not restricted to any specific subset of elements.
For large biomolecule searches, the search for aromatic rings can be time consuming and unnecessary. Adding the directive "/noAromatic/" at the beginning of the search pattern will turn off all checks for aromaticity and may dramatically increase processing speed.
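As a rough illustration of how leading directives such as /noAromatic/ might be separated from the pattern itself, here is a minimal Python sketch. It is not Jmol's implementation. The function name `split_directives` is hypothetical, the flag names are those listed in this document, and the handling of unknown names is an assumption made for the example.

```python
import re

# Flag names taken from this document; matching is case-insensitive.
KNOWN_FLAGS = {"noaromatic", "aromaticstrict", "aromaticdouble",
               "aromaticdefined", "nostereo", "invertstereo"}

def split_directives(pattern):
    """Return (set of lower-cased flags, remaining pattern).

    Hypothetical helper: consumes leading /.../ groups only.
    """
    flags = set()
    rest = pattern.lstrip()
    while True:
        m = re.match(r"/([^/]*)/", rest)
        if m is None:
            break
        names = [n.lower() for n in re.split(r"[,\s]+", m.group(1)) if n]
        # Assumption: a group with an unknown name is not a directive group.
        if not names or any(n not in KNOWN_FLAGS for n in names):
            break
        flags.update(names)
        rest = rest[m.end():].lstrip()
    return flags, rest

flags, rest = split_directives("/noAromatic,noStereo/CCC")
print(sorted(flags), rest)
```

A caller could then skip aromaticity checks whenever "noaromatic" is in the returned set.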

Jmol SMILES adds the following two aspects to Daylight SMILES:

%(n)

Jmol SMILES removes the limit on the number of open connections. Daylight SMILES allows indication of "rings" using the digits 1-9, for example, C1CCCC1. Actually, these numbers do not necessarily indicate rings. Rather, in association with "." component notation, they may simply indicate connectivity. For example, ethane can be CC or C1.C1. The original SMILES notation allows up to 99 open connections using %nn, where nn is 10-99. In generalizing SMILES to Jmol SMILES and Jmol bioSMILES, since connections can represent hydrogen bonds between nucleic acid chains, it was necessary to allow more than 99 open connections. Adding parentheses, for example %(130), allows for an unlimited number of open connections. Note that despite this allowance, Jmol itself will not generate SMILES strings using this notation unless it is absolutely necessary.
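The three connection forms (single digit, %nn, and %(n)) can be recognized with one regular expression. The following Python sketch is only an illustration; `connection_numbers` and `unmatched` are hypothetical helper names, and bracketed atoms are simply masked out so that isotopes, charges, and hydrogen counts are not mistaken for connection numbers.

```python
import re
from collections import Counter

# %(n...) | %nn | single digit: the three forms described above
POINTER = re.compile(r"%\((\d+)\)|%(\d\d)|(\d)")

def connection_numbers(smiles):
    """List connection numbers in order of appearance."""
    # Mask bracketed atoms so digits inside [...] are ignored.
    masked = re.sub(r"\[[^\]]*\]", "*", smiles)
    return [int(a or b or c) for a, b, c in POINTER.findall(masked)]

def unmatched(smiles):
    """Connection numbers that appear an odd number of times."""
    counts = Counter(connection_numbers(smiles))
    return sorted(n for n, c in counts.items() if c % 2)

print(connection_numbers("C%(130)CC%(130)"))
print(unmatched("C1CC"))
```

This sketch does not handle reuse of a number after it has been closed beyond simple pairing, which is sufficient for the examples in this section.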

//*...*//

Jmol SMILES is free-formatted, allowing standard whitespace as well as general comments in the form //*....*// anywhere within the string. For example, note the difference when Jmol debugging is set ON for the show SMILES command:

$ show SMILES

[n](C)1c2=O.c23=c4[n](C)c1=O.[n](C)3c=[n]4

$ set debug; show SMILES

//* N1 #1 *//	[n](
//* C2 #2 *//	C)1
//* C13 #13 *//	c2=
//* O14 #14 *//	O.
//* C12 #12 *//	c23=
//* C7 #7 *//	c4
//* N5 #5 *//	[n](
//* C6 #6 *//	C)
//* C3 #3 *//	c1=
//* O4 #4 *//	O.
//* N10 #10 *//	[n](
//* C11 #11 *//	C)3
//* C9 #9 *//	c=
//* N8 #8 *//	[n]4

This allows a direct correlation between an actual atom in the 3D structure and its contribution to the SMILES string.
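Because comments and white space carry no structural meaning, the annotated form reduces to the compact form by removing both. A minimal Python sketch of that reduction follows; `strip_comments` is a hypothetical name, and the sketch assumes comments are not nested.

```python
import re

def strip_comments(text):
    """Remove //* ... *// comments, then all white space."""
    no_comments = re.sub(r"//\*.*?\*//", "", text, flags=re.DOTALL)
    return re.sub(r"\s+", "", no_comments)

# The debug output shown above, reproduced as a string.
debug_output = """
//* N1 #1 *//   [n](
//* C2 #2 *//   C)1
//* C13 #13 *// c2=
//* O14 #14 *// O.
//* C12 #12 *// c23=
//* C7 #7 *//   c4
//* N5 #5 *//   [n](
//* C6 #6 *//   C)
//* C3 #3 *//   c1=
//* O4 #4 *//   O.
//* N10 #10 *// [n](
//* C11 #11 *// C)3
//* C9 #9 *//   c=
//* N8 #8 *//   [n]4
"""

print(strip_comments(debug_output))
```

Applied to the debug listing, this yields the same string as the plain show SMILES output.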

Comments are used in Jmol bioSMILES representations to indicate the Jmol version used to create them, as well as chain and residue information:

$ load =1crn; print {*}.find("SEQUENCE")

//* Jmol bioSMILES 14.3.16_2015.08.25  2015-08-25 09:07 1 *//
//* chain A protein 1 *// ~p~TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN //* 46 *//

Jmol bioSMILES adds the following to Jmol SMILES:

~X~

Jmol bioSMILES splits all protein, nucleic, and carbohydrate polymers into individual SMILES components separated by ".". The Jmol bioSEQUENCE type, consisting of a character surrounded by two tildes, introduces each Jmol bioSEQUENCE component. The character X may be one of p, d, r, or c, indicating a protein, DNA, RNA, or carbohydrate sequence, respectively. Generally, the string will be a sequence of standard single-character group symbols appropriate for that sequence type. For example, the Jmol bioSEQUENCE string created using the commands load =1crn; print {*}.find("SEQ") is:

~p~TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

When a non-standard group, such as a modified amino acid, is present, it is indicated by its residue name in brackets. For example, the Jmol bioSEQUENCE returned from load =4zyg; print {:A and protein}.find("SEQ") is:

 ~p~MLVYGLYKSPLGYITVAKDDKGFIMLDFCDCVEGNSRDDSSFTEFFHK
  LDLYFEGKPINLREPINLKTYPFRLSVFKEVMKIPWGKVMTYKQIADSLGTSPRAVGMALSKNPILLIIP[SMC]HR
  VIAENGIGGYSRGVKLKRALLELEGVKIPE

where [SMC] indicates S-methylcysteine.
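To make the layout of a bioSEQUENCE component concrete, the following Python sketch splits one component into its type character and a list of residues, treating bracketed names as single residues. The function name `parse_biosequence` is hypothetical, and the sketch assumes that comments have already been removed and that cross-link markers can be ignored.

```python
import re

def parse_biosequence(component):
    """Split '~p~TTC[SMC]H' into ('p', ['T', 'T', 'C', 'SMC', 'H'])."""
    text = re.sub(r"\s+", "", component)      # white space is not significant
    m = re.fullmatch(r"~([a-z])~(.*)", text)
    if m is None:
        raise ValueError("not a bioSEQUENCE component")
    bio_type, body = m.groups()
    # A residue is either a bracketed name or a single-letter code.
    tokens = re.findall(r"\[([^\]]+)\]|([A-Za-z])", body)
    return bio_type, [name or code for name, code in tokens]

kind, residues = parse_biosequence(
    "~p~TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
print(kind, len(residues))
```

For the 1crn sequence this gives 46 residues, in agreement with the residue count recorded in the comment of the show SEQUENCE output.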

Components that are not bioSEQUENCE types will indicate connectivity to a bioSEQUENCE (if such connection exists) via a fully qualified bioSMILES designation for the connected atoms. For example, the magnesium atom component in PDB structure 1p9b is described by:

[Mg]123456.O3[C]N([O])[C][C]([O])[O].[IMO.O1#8]4.[GDP.O2B#8]2.[GLY.O#8]5.[ASP.OD1#8]6.[GDP.O2A#8]1.

(The element number, #8 here, allows searching these Jmol bioSMILES strings themselves, in the absence of the associated model.) Notice that connection 3 here is to an unidentified ligand. Jmol bioSMILES only abbreviates groups that are within polymers. Unconnected components such as water molecules will not be repeated; the Jmol bioSMILES representation of a protein with associated water molecules will only show one component in the form:

[O].
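The fully qualified designations in the magnesium example, such as [GDP.O2B#8], follow the general form residueName#resno^insCode.atomName#atomicNumber that is described later under the Jmol SMARTS primitives. A minimal Python sketch of splitting such a designation into its fields is shown here; `parse_qualified` is a hypothetical name, and the character classes allowed in each field are assumptions made for illustration.

```python
import re

# residueName #resno ^insCode . atomName #atomicNumber
# All fields are optional; only the period is required.
QUALIFIED = re.compile(
    r"(?P<residue>[A-Za-z0-9]*)"
    r"(?:#(?P<resno>\d+))?"
    r"(?:\^(?P<ins>[A-Za-z0-9]))?"
    r"\."
    r"(?P<atom>[A-Za-z0-9']*)"
    r"(?:#(?P<element>\d+))?"
)

def parse_qualified(token):
    """Return a dict of the fields present, or None if there is no period."""
    m = QUALIFIED.fullmatch(token.strip("[]"))
    if m is None:
        return None
    return {key: value for key, value in m.groupdict().items() if value}

print(parse_qualified("[GDP.O2B#8]"))
```

An ordinary bracketed atom such as [Mg] contains no period and is therefore not treated as a qualified designation.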

":"

Jmol bioSEQUENCES utilize the bond type ":" to indicate "cross-linked groups". Recognized cross-linking includes hydrogen bonding between the purine N1 and pyrimidine N3 in nucleic acids, hydrogen bonds created with create hbonds, cysteine-cysteine disulfide bonds in proteins, and ether linkages between carbohydrate residues. For instance, the commands load =3LL2; print {carbohydrate}.find("SEQ", true) reports for this branched carbohydrate:

~c~[MAN]:1[MAN]:2[MAN].~c~[MAN]:2[MAN].~c~[MAN]:1[MAN][MAN]

indicating a branched mannose hexamer. No indication is given for exact atom-atom connectivity in the Jmol bioSEQUENCE; all connectivity is at the level of residues.

Comparison to Daylight SMARTS

[H1] interpreted as [*H1] -- "an atom with one connected H atom".
Allows definition of [$XXX] variables:
```
      Var x = '$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]' // select aromatic atoms attached to CH3 or NH2  
      select within(SMARTS,@x)
```
Note that these variables are any string whatsoever, not just atom sets. The syntax is simply:
- Each variable definition takes the form $ [name] =" [definition] " [comments] ;
- [name] must start with a letter and can contain any characters other than '$', '=', '(', and ']'.
- [definition] can be any valid SMARTS characters.
- [comments] can be any characters other than ';'.
- The actual pattern starts after the last variable definition.
- Nested variables are allowed, but note that this may require using the recursion syntax, $(...):
```
      Var x = '$R1="[CH3,NH2]";$R2="[$($R1),OH]"; {a}[$R2]' // select aromatic atoms attached to CH3, NH2, or OH  
      select within(SMARTS,@x)
```
- For $xxx="yyyy", all occurrences of the string "[$xxx]" are replaced within the pattern prior to parsing.
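The variable rules above amount to a textual preprocessing step. The Python sketch below illustrates only the literal "[$xxx]" replacement described in the last rule; it does not attempt the nested case. The name `expand_variables` is hypothetical, and this is not Jmol's own code.

```python
import re

# $name="definition" comments ;
# name starts with a letter and excludes '$', '=', '(' and ']'
DEFINITION = re.compile(r'\s*\$([A-Za-z][^$=(\]]*)="([^"]*)"([^;]*);')

def expand_variables(text):
    """Read leading variable definitions, then substitute [$name] in the pattern."""
    variables = {}
    pos = 0
    while True:
        m = DEFINITION.match(text, pos)
        if m is None:
            break
        variables[m.group(1)] = m.group(2)
        pos = m.end()
    # The actual pattern starts after the last variable definition.
    pattern = text[pos:].strip()
    for name, definition in variables.items():
        pattern = pattern.replace("[$" + name + "]", definition)
    return pattern

print(expand_variables('$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]'))
```

A pattern that begins with the recursion syntax $(...) is left untouched, because a variable name must start with a letter.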

Implements nested ("recursive") SMARTS:

 
      Var x = '$R1="[CH3,NH2]";$R2="[OH]";  {a}[$([$R1]),$([$R2])]' // aromatic attached to CH3, NH2, or OH
      select within(SMARTS,@x)

Note that $(...) need not be within [...], and wherever it is, it always means "just the first atom".

All primitives that are not element names, *, A, or a must be enclosed in brackets. In addition, the following elements must be enclosed in brackets because their two-letter combination Xy implies the non-aromatic element X attached to the aromatic element y: Ac, Ba, Ca, Na, Pa, and Sc.
Allows any order of bracketed primitives: [H2C13] same as [13CH2].
All atom and bond logic implemented: [X,!X,X&X,X&X&X;X&X]-,=X
"&" is optional: [13CH2] same as [13&C&H2] except in cases of ambiguity with element symbols: [Rh] is rhodium, not [R&h].
Jmol SMARTS does NOT implement:
- "zero-level parentheses", since the match is always only within a given model (but note that you can still use "." to indicate that the two search sections are not connected).
- "?" in atom stereochemistry ("chirality"), because 3D structures are always defined stereochemically.
- "?" for bond stereochemistry, because 3D structures are always defined stereochemically.

primitives and in-line options

All Daylight SMARTS primitives are implemented. These include:

[Element]	capitalized - standard notation Na, Si, etc. -- specific non-aromatic atom
[element]	uncapitalized - specific aromatic atom (as for standard notation, no limitations)
*	any atom
A	any non-aromatic atom
a	any aromatic atom
#	atomic number
(integer)	mass number -- Note, however, that [H1] is [*H1], "any atom with one attached hydrogen", not unlabeled hydrogen, [1H].
D	degree - total number of connections
H	exact hydrogen count
h	"implicit" hydrogen count (atoms are not in structure)
R	in the specified number of rings
r	in ring of a given size (not restricted to SSSR set)
v	valence (total bond order)
X	calculated connectivity, including implicit hydrogens
x	number of ring bonds
@	stereochemistry

Jmol SMARTS adds the following primitives:

d	non-hydrogen degree -- number of non-hydrogen connections
=	Jmol atom index, for example: [=23]
"xxx"	atom type, in double quotes, for example: ["39"r5] (After calculate partialcharge this will be the MMFF94 atom type.) [Jmol 12.3.24]
$(select xxx)	external selection method. For Jmol, this is an atom expression. For example: [c$(select temperature>10)] [Jmol 12.3.26]
r500	a specifically aromatic 5-membered ring [Jmol 12.3.24]
r600	a specifically aromatic 6-membered ring [Jmol 12.3.24]
number?	mass number or undefined (so, for example, [C12?] means any carbon that isn't explicitly C13 or C14)
$n(pattern)	A specific number of occurrences of pattern. For example, C[$3(C=C)]C is synonymous with CC=CC=CC=CC.
$min-max(pattern)	A variable number of occurrences of pattern. For example: A[$0-2(C:G)]A is synonymous with AA or AC(:G)A or AC(:G)C(:G)A.
residueName#resno^insCode.atomName#atomicNumber	All five fields are optional; only the period itself is required. This primitive may appear with other primitives provided (a) it is first, and (b) it is followed by an operator (",", "&", or ";"). This allows searching a bioSMILES string using SMARTS patterns that only involve standard atom types. In the magnesium example given earlier, notice that atoms within the non-bioSEQUENCE component that connect to protein chains indicate those connections using this extended notation. Thus, both the actual 3D model and the bioSMILES string for 1d66 can be searched using the SMARTS search "[Cd][S]" as well as the more specific search "Cd[.SG]". Wild cards provide additional options: [#35.], [ALA.], [#^A.] [.], [.CA]; however, their presence is optional: [#35.], [ALA.], [^A.] [.], [.CA]. The special designation "0" for an atom name, as in [GLY.0], indicates the "lead atom" -- the alpha carbon for proteins, the phosphorus atom in nucleic acids, or the anomeric carbon in carbohydrates.
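The $n(pattern) and $min-max(pattern) primitives in the table above can be thought of as shorthand that is expanded into one or more plain patterns before matching. The Python sketch below shows that idea for simple patterns by plain concatenation. It is an assumption-laden illustration: `expand_multiple` is a hypothetical name, the "|" alternatives inside the parentheses are not handled, and the branch form that cross-linked patterns such as C:G take (AC(:G)A) is not reproduced.

```python
import re

# Matches the opening of [$n( or [$min-max(
MULTI = re.compile(r"\[\$(\d+)(?:-(\d+))?\(")

def expand_multiple(pattern):
    """Expand the first [$n(...)] or [$min-max(...)] and recurse on the result."""
    m = MULTI.search(pattern)
    if m is None:
        return [pattern]
    # Walk forward to the parenthesis that closes the repeated pattern.
    depth, i = 1, m.end()
    while depth:
        ch = pattern[i]
        depth += (ch == "(") - (ch == ")")
        i += 1
    inner = pattern[m.end():i - 1]
    if pattern[i:i + 1] != "]":
        raise ValueError("expected ']' after repeated pattern")
    low = int(m.group(1))
    high = int(m.group(2) or m.group(1))
    head, tail = pattern[:m.start()], pattern[i + 1:]
    results = []
    for n in range(low, high + 1):
        results.extend(expand_multiple(head + inner * n + tail))
    return results

print(expand_multiple("C[$3(C=C)]C"))
```

With a range, each allowed count produces one alternative, which corresponds to the "set" of patterns that would then be searched with "or" logic.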

Jmol SMARTS adds the following in-line options:

/..../	processing directives Jmol recognizes /..../ at the beginning of a pattern as processing directives. These directives can be introduced individually or as groups. They are not case-sensitive: /noaromatic/ /nostereo/ is read the same as /noAromatic,noStereo/.
- invertStereo reverses the sense of chirality (R-, S- stereochemistry). Double-bond stereochemistry is not reversed.
- noStereo turns off all stereochemical checking.
- aromaticDouble allows for using "=" between two aromatic atoms to indicate an explicitly double aromatic bond. To specify an explicitly single aromatic bond, use @&!=.
- aromaticStrict uses a 6-electron Hueckel model for specifying aromaticity.
- noAromatic turns off all aromaticity checks, so that all atoms are considered NOT aromatic. It may be desirable when no distinction between aromatic and nonaromatic atoms is desired. For large biomolecules, /noAromatic/ can dramatically improve processing speed when no check for aromaticity is necessary.
{...}	Jmol atom selection The general way within Jmol to select atoms based on SMARTS searches is to use select search("..."). To assign variables to the results of a search, one can use the find() command. However, to select one or more atoms within the found pattern, simply enclose the desired atoms in { }: select search("{C}C=O"), for example, returns all alpha carbons, and select search("~d~G{C}A") returns all DNA cytidines that are in GCA sequences. No valence calculation is done to add any additional hydrogens to unbracketed atoms; "CCC" is the same as "[C][C][C]". Only unbracketed or bracketed hydrogen atoms such as H[C]C or [H] or [2H] are selected; connected hydrogen atoms as in [CH3] are not selected.
(.measure)	The extension capitalizes on the fact that in a standard SMARTS string, a period "." can never appear immediately following an open parenthesis "(". Using this fact, the format is:

"(." [single character type - "d" (distance), "a" (angle), or "t" (torsion)] [optional numeric identifier] ":" [optional "!" (not)] [ranges] ")"

where [ranges] is one or more ranges in the form [minimum value], [maximum value] separated by commas. This extension must appear immediately following an element symbol or a bracketed atom expression. The separators "," and "-" between minimum and maximum values are equivalent.

For example, the following will find all aliphatic carbon-carbon bonds that are between 1.5 and 1.6 angstroms long:

select search("C(.d:1.5-1.6)C")

The following will select for all 1,2-trans-diaxial methyl groups on a cyclohexane ring, finding all torsions that are outside the range -160 to 160 degrees:

select search("{[CH3]}(.t:!-160,160)CC{[CH3]}")

The following will select for all 1,2-trans-diequatorial methyl groups on a cyclohexane ring by selecting for all adjacent methyl groups that are anti to a ring atom:

select search("{[$([CH3](.t:!-160,160)CC[Cr6])]}CC{[$([CH3](.t:!-160,160)CC[Cr6])]}")

The following will select for all gauche-dimethyl groups on a cyclohexane ring:

select search("{[CH3]}(.t:50,70,-50,-70)CC{[CH3]}")

and the following prints the number of gauche interactions. Division by two is necessary in this case because of the symmetry involved.

print compare({},{},"MAP","[CH3](.t:50,70,-70,-50)CC[CH3]").length/2

The default in terms of specifying which atoms are involved is simply "the next N-1 atoms," where N is 2, 3, or 4. For more complicated patterns, one can designate the specific atoms in the measurement using a numeric identifier after the measurement type. The following will target the bond angle across the carbonyl group in the backbone of a peptide:

select search("[.CA](.a1:105-110)C(.a1)(=O)N(.a1)")

Designations can overlap; one simply adds whatever (.xn) designation is wanted after the desired atoms:

select search("C(.a1:105,108)C(.a1)(.a2:110,130)C(.a1)(.a2)C(.a2)")

In Jmol, this capability is extended to the measure command for easy access to SMARTS-based measurements:

select *; measure search("C(.a1:110,130)C(.a1)(=O)C(.a1)")

Note that the atoms in no way have to be connected. The only restriction is that the three markers for an angle or the four markers for a torsion will be identified in order from left to right within the SMARTS string. The following, for example, will find all carbonyl oxygen atoms that are within 5 angstroms of each other:

select search("{O}(.d1:0,5)=C.{O}(.d1)=C")

The "." here indicates "not bonded." {O} specifies that although we want to find the entire set, we only want to select the oxygen atoms. The close of the selection brace may appear before or after the (.x) designation.
pattern1 || pattern2	"||" indicates "or" and allows searching for multiple patterns, which may overlap. For example: select search("c{O} || c{C}"). Note that the "||" syntax is an alternative to using "[,]", in this case being equivalent to (and slightly slower than) select search("c{[O,C]}").
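Returning to the (.measure) extension in the table above, its compact syntax can be picked apart with two regular expressions. The Python sketch below extracts the measurement type, the optional identifier, the "not" flag, and the list of ranges. It is an illustration only: `parse_measures` is a hypothetical name, and the rule used to tell a "-" separator from a minus sign (a "-" directly after a digit is a separator) is an assumption.

```python
import re

# "(." type [identifier] [":" ["!"] ranges] ")"
MEASURE = re.compile(r"\(\.([dat])(\d*)(?::(!?)([^)]*))?\)")
# A "-" that directly follows a digit is a separator, not a sign.
NUMBER = re.compile(r"(?<![\d.])-?\d+(?:\.\d+)?")

def parse_measures(pattern):
    """Return one dict per (.x...) marker found in a SMARTS pattern."""
    found = []
    for kind, ident, negate, body in MEASURE.findall(pattern):
        values = [float(v) for v in NUMBER.findall(body)]
        # Pair the values up as (minimum, maximum) ranges.
        ranges = [tuple(sorted(values[i:i + 2]))
                  for i in range(0, len(values) - 1, 2)]
        found.append({"type": kind, "id": ident,
                      "not": negate == "!", "ranges": ranges})
    return found

for item in parse_measures("[.CA](.a1:105-110)C(.a1)(=O)N(.a1)"):
    print(item)
```

Markers without a range, such as (.a1), come back with an empty range list and serve only to tag the additional atoms of measurement 1.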

Jmol bioSMARTS adds the following pattern options to Jmol SMARTS:

"~"	Any biopolymer.
"~n~"	DNA or RNA
"+"	Jmol bioSMARTS adds the "+" bond type to indicate standard sequence. The Jmol bioSMARTS pattern "~p~C+C+C" is the same as "~p~CCC". In conjunction with the cross-linking type ":", one can do searches for double-stranded nucleic acids quite easily. ~d~CCC:GGG would be three CG base-pairs (because the two strands are going in opposite directions). Note that specific atoms within residues can be designated using the bracketed residue.atom notation. For example, select search("[CYS.CA][PRO.CB]") would select just the alpha carbon of cysteine and the beta carbon of an adjacent proline.
branching	Branching (cross-linking) can also be indicated using the standard SMILES (...) notation. So, for example, ~d~C(G)C(G)C(G) indicates three CG base pairs. Ring notation can also be used: C:1CC(GGG:1) is the same three CG base pairs. An empty branch, ~C(), indicates "not cross-linked" -- in this case a cysteine without a disulphide bond or a cytidine that is not base-paired.

implicit hydrogen count

The primitives h (implicit hydrogen count) and X (total connections, including implicit hydrogens) require analysis of bonding around a model atom to determine the number of missing ("implicit") hydrogen atoms based on a "target valence." Models that specify only "aromatic" or "partial" bonding may produce ambiguous results, and for that reason, primitives X and h are not recommended for use. Other primitives, such as D, d, and v, should be more useful. The analysis is the same one Jmol uses to determine how many hydrogens to add for the calculate hydrogens command, and proceeds as follows:
1. Assign the target valence TV as follows:
  - For C and Si, TV = 4.
  - For B, N, and P, TV = 3.
  - For O and S, TV = 2.
  - For F, Cl, Br, and I, TV = 1.
  - For all other atoms, TV = 0.
2. Obtain the formal charge on the atom, C.
3. Group IV elements such as carbon are unique, in that their cations are valence-poor, not valence-rich. So for carbon and silicon, subtract the ABSOLUTE VALUE of C from the target valence. In all other cases, let TV = TV + C.
4. Determine the overall valence of the atom, OV. This is calculated by adding up all the bond orders to the atom.
5. Subtract OV from TV to get the number of implicit hydrogen atoms. If this number is less than zero, assign zero.
Thus, the implicit hydrogen count is:
- 0 for all atoms other than {B,C,N,O,P,Si,S}
- 0 for BR3
- 0 for CR4, 1 for CR3, 2 for CR2, 3 for CR
- 0 for CR3(+), 0 for CR3(-)
- 0 for R=CR2, 1 for R=CR, 2 for R=C, 1 for C#R (triple bond)
- 0 for NR3, 1 for NR2, 2 for NR
- 0 for RN=R, 1 for R=N
- 1 for NR3(+), 1 for R=NR(+), 1 for RN(-)
- 0 for OR2, 0 for O=R, 1 for OR
- 0 for RO(-), 2 for RO(+)
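The five steps above can be written out directly. The following Python sketch follows them as stated and reproduces the tabulated counts for B, C, N, and O. The function name `implicit_hydrogens` and its argument layout are hypothetical, and the early return for atoms with no target valence is an assumption added so that such atoms always give zero, as the first entry of the list states.

```python
# Step 1: target valences as listed above; all other elements get 0.
TARGET_VALENCE = {"C": 4, "Si": 4, "B": 3, "N": 3, "P": 3,
                  "O": 2, "S": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}

def implicit_hydrogens(element, bond_orders, charge=0):
    """Implicit hydrogen count from element, bond orders, and formal charge."""
    tv = TARGET_VALENCE.get(element, 0)
    if tv == 0:
        return 0  # assumption: no target valence means no implicit hydrogens
    # Steps 2 and 3: adjust the target valence for the formal charge.
    if element in ("C", "Si"):
        tv -= abs(charge)
    else:
        tv += charge
    # Step 4: overall valence is the sum of all bond orders.
    ov = sum(bond_orders)
    # Step 5: never report a negative count.
    return max(tv - ov, 0)

print(implicit_hydrogens("N", [1, 1, 1], charge=+1))  # NR3(+)
print(implicit_hydrogens("C", [2, 1]))                # R=CR
```

Passing bond orders as a list keeps the sketch independent of any particular molecular data structure.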

Detailed Jmol SMILES/bioSMILES Specification

 
      # note: prior to parsing, all white space is removed
       
   [smilesDef] == [preface] [smiles]
   [preface] == { [directiveDefs] | NULL } 
   [directiveDefs] == { [directiveDef] | [directiveDef] [directiveDefs] }
   [directiveDef] == "/" [processingDirectives] "/"
   [processingDirectives] == { [processingFlag] | [processingFlag] [processingDirectives] }
   [processingFlag] == { "noAromatic" | "aromaticDefined" | "aromaticStrict" | "noStereo" | "invertStereo"} (case-insensitive)
      # note: the noAromatic directive indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo directive turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [smiles] == { [entity] | [entity] "." [smiles] }
   [entity] == { [bioSequence] | [molecularSequence] }
   [molecularSequence] == [node][connections] 
   [node] == { [atomExpression] | [connectionPointer] }

   [atomExpression] == { [unbracketedAtomType] 
                             | [bracketedExpression] }
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      
   [atomType] == { [validElementSymbol] | [aromaticType] }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == { "[" [atomPrimitives] "]" } 
   
   [atomPrimitives] == { [atom] | [atom] [atomModifiers] }
   [atom] == { [isotope] [atomType] | [atomType] } 
   [isotope] == [digits]
       # note -- isotope mass must come before the element symbol. 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [atomModifiers] == { [atomModifier] | [atomModifier] [atomModifiers] }
   [atomModifier] == { [charge] | [stereochemistry] | [H_Prop] }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
   
   [connectionPointer] == { "%" [digit][digit] | [digit] | "%(" [digits] ")"}
      # note: all connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bond
      #       for the second occurrence
      # note: Jmol bioSMARTS extends the possible number of rings to > 100 by 
      #       allowing %(n)

   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bond] [node] } [connections]
   [branch] == { "(" { [smiles] | [bond] [smiles] } ")" | "()" }
      # note: empty parentheses "()" are ignored in SMILES and bioSMILES
   [bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | NULL }
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]. However, "." can be used to clarify a
      #       structure that has "ring" bond notation:
      #       CC1CCC.C1CC   is a valid structure.
      # note: bioSEQUENCE uses ":" to indicate "cross-linked", which is the default for branches

   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" | "c" }
      # note: protein, nucleic, RNA, DNA, carbohydrate
   [bioNode] == { "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "#" [atomicNumber] "]" 
                 | [bioResidueCode] } 
   [atomicNumber] == [digits]
   [bioResidueName] == { "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == {"C" | "CA" | "N" ... } (case-insensitive)
   [bioResidueCode] == { "A" | "R" | "G" ... } (case-sensitive)
      # note: In a BioSEQUENCE, residues are designated using standard 1-letter-code group names
      #       or bracketed residues [xxx] with optional atoms specified: [ARG], [CYS.SG].

Detailed Jmol SMARTS/Jmol bioSMARTS Specification

 

 ######## GENERAL ########

      # note: prior to parsing, all white space is removed

   [smartDef] == [preface] [smartsSet]
   [preface] == { [directiveDefs] [variableDefs] | [variableDefs] | NULL } 
   [directiveDefs] == { [directiveDef] | [directiveDef] [directiveDefs] }
   [directiveDef] == "/" [processingDirectives] "/"
   [processingDirectives] == { [processingFlag] | [processingFlag] [processingDirectives] }
   [processingFlag] == { "noAromatic" | "aromaticDefined" | "aromaticStrict" | "noStereo" | "invertStereo"} (case-insensitive)
      # note: the noAromatic directive indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo directive turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [variableDefs] == { [variableDef] | [variableDef] [variableDefs] }
   [variableDef] ==  "$" [label] "=" "\"" [smarts] "\"" [comments] ";"
   [label] == 'A-Z' [any characters other than "=", "(", or "$"]
   [comments] == [any characters other than ";"]
      # note: Variable definitions must be parsed first. 
      #       After that, all variable references [$XXXX] are replaced
      
   [smartsSet] == { [smarts] | [smarts] "||" [smartsSet] }
      # note: Jmol adds the "or" operation "||", for example: "C=O || C=N"
      #       which, in this case, could also be written as "C=[O,N]"
      #       Jmol preprocesses these sets, evaluates them independently, and then
      #       combines them.
      
   [smarts] == { [node3D] [connections] | [bioSequence] } 
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bondExpression] [node3D] } [connections]
   [branch] == { "(" { [smarts] | [bondExpression] [smarts] } ")" | "()" }
      # note: Default bonding for a branch is single for SMARTS or cross-linked (:) for bioSEQUENCE
      # note: "()" is ignored in SMARTS and indicates "not cross-linked" in bioSEQUENCE
   
 ######## ATOMS ########
    
   [node3D] == { [atomExpression] | [atomExpression] "(." [measure] ")" | [connectionPointer] }
   [atomExpression] == { [unbracketedAtomType]
                             | [bracketedExpression] 
                             | [multipleExpression]
                             | [nestedExpression] }
   
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      # note: in a bioSEQUENCE, all atom types are 1-letter code group names
      
   [atomType] == { [validElementSymbol] | "A" | [aromaticType] | "*" }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { "a" | [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == "[" { [atomOrSet] | [atomOrSet] ";" [atomAndSet] } "]" 
   
   [atomOrSet] == { [atomAndSet] | [atomAndSet] "," [atomAndSet] }
   [atomAndSet] == { [atomPrimitives] | [atomPrimitives] "&" [atomAndSet]
                              | "!" [atomPrimitive] 
                              | "!" [atomPrimitive] "&" [atomAndSet] }
                              
 ######## ATOM PRIMITIVES ########

   [atomPrimitives] == { [atomPrimitive] | [atomPrimitive] [atomPrimitives] }
       # note -- if & is not used, certain combinations of primitive descriptors
       #         are not allowed. Specifically, combinations that together
       #         form the symbol for an element will be read as the element (Ar, Rh, etc.)
       #         when NOT followed by a digit and no element has already been defined. 
       #         So, for example, [Ar] is argon, [Ar3] is [A&r3], [ORh] is [O&R&h],  
       #         but [Ard2] is [Ar&d2] -- "argon with two non-hydrogen connections"
       #         Also, "!" may not be used with implied "&". 
       #         Thus, [!a], [!a&!h2], and [h2&!a] are all valid, but [!ah2] is invalid.             
   [atomPrimitive] == { [isotope] | [atomType] | [charge] | [stereochemistry]
                              | [a_Prop] | [A_Prop] | [D_Prop] | [d_Prop] | [H_Prop] | [h_Prop] 
                              | [R_Prop] | [r_Prop] | [v_Prop] | [X_Prop]
                              | [x_Prop] | [at_Prop] | [nestedExpression] }
   [isotope] == [digits] | [digits] "?"
       # note -- isotope mass may come before or after element symbol, 
       #         EXCEPT "H1" which must be parsed as "an atom with a single H" 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
       # note -- "?" here (unspecified) is not relevant in Jmol SMARTS, only Jmol bioSMARTS 
   
   [A_Prop] == "#" [digits]           # elemental atomic number
   [a_Prop] == "=" [digits]           # Jmol atom index (starts with 0)
   [D_Prop] == { "D" [digits] | "D" } # degree -- total number of connections 
                                      #   excludes implicit H atoms; default 1
   [d_Prop] == { "d" [digits] | "d" } # degree -- non-hydrogen connections
                                      #   default 1 
   [H_Prop] == { "H" [digits] | "H" } # exact hydrogen count 
                                      #   excludes implicit H atoms
   [h_Prop] == { "h" [digits] | "h" } # implicit hydrogens -- "h" indicates "at least one"
                                      #   (see note below)
   [R_Prop] == { "R" [digits] | "R" } # ring membership; e.g. "R2" indicates "in two rings"
                                      #   "R" indicates "in a ring" 
                                      #   "!R" or "R0" indicates "not in any ring"
   [r_Prop] == { "r" [digits] | "r" } # in ring of size [digits]; "r" indicates "in a ring"
                                      #   r500 and r600 match specifically aromatic 
                                      #   5- and 6-membered rings, respectively [Jmol 12.3.24]
   [v_Prop] == { "v" [digits] | "v" } # valence -- total bond order (counting double as 2, e.g.)
   [X_Prop] == { "X" [digits] | "X" } # connectivity -- total number of connections
                                      #   includes implicit H atoms
   [x_Prop] == { "x" [digits] | "x" } # ring connectivity -- total ring connections
   [at_Prop] == "\"" [charsExceptDoubleQuote] "\""   # atom type (in double quotes) [Jmol 12.3.24]
   
 ######## Nested and Multiple Expressions ########
 
   [nestedExpression] == "$(" [atomExpression] ")" | "$(select" [contextualSearchPhrase] ")" 
      # note: nestedExpressions return only the first atom as a match when an atom expression 
      #       is involved, not all atoms in the expression.
      
   [contextualSearchPhrase] == [any characters with well-matched "(" and ")"]
      # note: the contextual search phrase is to be processed by the context implementing
      #       the SMARTS. In the case of Jmol, [contextualSearchPhrase] is any Jmol
      #       atom expression that can be in a standard Jmol SELECT command. 

   [multipleExpression] == { "[$" [nTimes] "(" [orExpression] ")]" 
                             | "[$" [nMinimum] "-" [nMaximum] "(" [orExpression] ")]" }  
   [orExpression] == { [atomExpression] 
                       | [atomExpression] "|" [orExpression] 
                       | [atomExpression] "||" [orExpression] }
      # note: "|" and "||" are synonymous in this inner context; "|" is preferred simply
      #       for readability (whereas "||" is required for the [smartsSet] context). 
      # note: This syntax is carefully written to exclude $(xxx) by itself, which
      #       is a nestedExpression, not a multipleExpression. The difference is that
      #       the nestedExpression only returns the first atom, while the multipleExpression
      #       returns all atoms. To return only the first atom within this context 
      #       it is necessary to use a nested expression within the multiple expression.
      #       For example: "CC[$2( $(C=O) | $(C=N) )]"
      #       is the same as "CC$(C=[O,N])$(C=[O,N])", although Jmol preprocesses it as
      #          "CC$(C=O)$(C=O)||CC$(C=O)$(C=N)||CC$(C=N)$(C=O)||CC$(C=N)$(C=N)"
      
   [nTimes] == [digits]
   [nMinimum] == [digits]
   [nMaximum] == [digits]
      # note: multipleExpressions allow for searching a given number of expressions or 
      #       a variable number of expressions (including 0, perhaps)
      #       Jmol pre-processes these expressions and turns them into a set:
      #       pattern1 || pattern2 || pattern3....

 ######## BioSEQUENCE ########

   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] [A_Prop] "]" 
                 | [bioResidueCode] } 
   [bioResidueName] == { "*" | "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == { "*" | "0" | "C" | "CA" | "N" ... } (case-insensitive)
      # note: "0" indicates the "lead atom":
      #   nucleic: P if present, or H5T if present, or O5'/O5*
      #   protein: CA
      #   carbohydrate: the first atom of the group listed in the model file
   [bioResidueCode] == { "*" | "A" | "R" | "G" ... } (case-sensitive)
      # note: wildcard or standard group 1-letter-code
      #       or, in the case of RNA or DNA:
      #         "N" (any residue; same as "*"), 
      #         "R" (any purine -- A or G)
      #         "Y" (any pyrimidine -- C or T or U)

 ######## CONNECTIONS (aka "rings") ########

   [connectionPointer] == { [digit] | "%" [digit][digit] | "%(" [digits] ")" }
      # note: All connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bondExpression
      #       for the second occurrence. The matching connectionPointers may be
      #       in different "components" (separated by "."), in which case they
      #       represent general connections and not necessarily rings.

 ######## BONDS ########

   [bondExpression] == { [bondOrSet] | [bondOrSet] ";" [bondAndSet] } 
   
   [bondOrSet] == { [bondAndSet] | [bondAndSet] "," [bondAndSet] }
   [bondAndSet] == { [bondPrimitives] | [bondPrimitives] "&" [bondAndSet]
                              | "!" [bondPrimitive] 
                              | "!" [bondPrimitive] "&" [bondAndSet] }
                                              
 ######## BOND PRIMITIVES ########
                              
   [bondPrimitives] == { [bondPrimitive] | [bondPrimitive] [bondPrimitives] }       
   [bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | "~" | "@" | "+" | "^" | NULL }
      # note: Not all bondExpressions are valid. Stereochemistry should not 
      #       be mixed with the others, as it represents a single bond always.
      #       In addition, "." ("no bond") cannot be mixed with any bond type.
      #       Nothing would be retrieved by "-&=", as a bond cannot be both single
      #       and double. However, "-@" is potentially very useful -- "ring single-bonds"
      #       or "=&!@" -- "doubly-bonded atoms where the double bond is not in a ring"
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]
      # note: "+" indicates "adjacent biomolecular groups in a chain"
      # note: a bioSEQUENCE ends with "." or the end of the string. A new bioSEQUENCE
      #       can continue with "~" immediately following this "." 
      # note: For a SMARTS search, "." indicates the start of a new subset, not necessarily a
      #       new component.
      # note: "^" indicates atropisomer bond with positive dihedral angle
      
 ######## MEASURES ########
   
   [measure] == { [measureId] | [measureId] ":" [ranges] | [measureId] ":!" [range] }
   [measureId] == { [measureCode] | [measureCode] [digits] }
   [measureCode] == { "d" | "a" | "t" }
   [ranges] == {[range] | [ranges] { "," | "-" } [range]}
   [range] == [minimumValue] { "," | "-" } [maximumValue]
   [minimumValue] == [decimalNumber]
   [maximumValue] == [decimalNumber]
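
The pre-processing of a multipleExpression into a "||"-separated set, as described in the notes above, can be illustrated with a short Python sketch. This is not Jmol's code: the function name expand_multiple_expression is hypothetical, the sketch assumes at most one multipleExpression per pattern with no "|" inside the individual alternatives, and the order in which a min-max range is expanded is an assumption.

```python
import itertools
import re

# Hypothetical helper (not Jmol's API). Matches one "[$n(...)]" or
# "[$min-max(...)]"; "$(" with no leading digits is a nestedExpression
# and is deliberately not matched.
_MULTI = re.compile(r"\[\$(\d+)(?:-(\d+))?\((.*)\)\]")


def expand_multiple_expression(pattern):
    """Expand a single multipleExpression into a "||"-separated set."""
    m = _MULTI.search(pattern)
    if m is None:
        return pattern  # nothing to expand
    n_min = int(m.group(1))
    n_max = int(m.group(2)) if m.group(2) else n_min
    # "|" and "||" are synonymous inside the multipleExpression.
    alternatives = [a.strip() for a in re.split(r"\|\|?", m.group(3))]
    prefix, suffix = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for n in range(n_min, n_max + 1):
        # Every ordered choice of n alternatives becomes one pattern.
        for combo in itertools.product(alternatives, repeat=n):
            expanded.append(prefix + "".join(combo) + suffix)
    return "||".join(expanded)


print(expand_multiple_expression("CC[$2( $(C=O) | $(C=N) )]"))
```

Applied to the example from the note, this reproduces the four-pattern set listed there; a range such as [$0-1(...)] yields one pattern for each allowed count, including the pattern with the expression omitted.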

Jmol SMILES and Jmol SMARTS Definition of "aromatic"

By default, Jmol SMILES and Jmol SMARTS define "aromatic" unambiguously and strictly geometrically. However, you can use the directive /aromaticStrict/ to add to that a 6-electron Hueckel model for specifying aromaticity. This discussion relates to the 3D definition. We define "aromatic" here strictly in terms of geometry: a flat ring with trigonal planar geometry for all atoms in the ring. Bond order is not considered, because many of the model formats that can be loaded into Jmol (PDB, GAUSSIAN, etc.) do not assume a bonding scheme.

Given a ring of N atoms...

                  1
                /   \
               2     6 -- 6a
               |     |
         5a -- 5     4
                \   /
                  3

with arbitrary order and up to N substituents...

Check to see if all ring atoms have no more than 3 connections. Note: An alternative definition might add the condition "and no substituent is explicitly double-bonded to its ring atom" (as is the case in quinone). Here we opt to allow the atoms of quinone to be called "aromatic."
Select a cutoff value close to zero. We use 0.01 here (increased to 0.1 for /aromaticStrict/).
Generate a set of normals as follows:
1. For each ring atom, construct the normal associated with the plane formed by that ring atom and its two nearest ring-atom neighbors.
2. For each ring atom with a substituent, construct a normal associated with the plane formed by its connecting substituent atom and the two nearest ring-atom neighbors.
3. If this is the first normal, assign vMean to it.
4. If this is not the first normal, check vNorm.dot.vMean. If this value is less than zero, scale vNorm by -1.
5. Add vNorm to vMean.
Calculate the standard deviation of the dot products of the individual vNorms with the normalized vMean.
The ring is deemed flat if this standard deviation is less than the selected cutoff value.
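
The flatness test above can be sketched in Python as follows. This is an illustrative reading of the steps, not Jmol's implementation: the names ring_is_flat and _unit_normal are hypothetical, the "two nearest ring-atom neighbors" are assumed to be the two atoms adjacent in ring order, the first normal is assumed to seed the running sum vMean, and the population standard deviation is assumed.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _unit_normal(p, q, r):
    """Unit normal of the plane through the three points p, q, r."""
    u, v = _sub(q, p), _sub(r, p)
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(_dot(n, n))
    return (n[0] / length, n[1] / length, n[2] / length)


def ring_is_flat(ring, substituents=None, cutoff=0.01):
    """ring: list of (x, y, z) in ring order.
    substituents: dict of ring index -> list of substituent coordinates."""
    substituents = substituents or {}
    # No more than 3 connections: 2 ring bonds plus at most 1 substituent.
    if any(len(s) > 1 for s in substituents.values()):
        return False
    normals = []
    v_mean = (0.0, 0.0, 0.0)
    n_atoms = len(ring)
    for i in range(n_atoms):
        before, after = ring[i - 1], ring[(i + 1) % n_atoms]
        # One plane for the ring atom, one more for its substituent if any.
        for center in [ring[i]] + list(substituents.get(i, [])):
            v_norm = _unit_normal(before, center, after)
            if normals and _dot(v_norm, v_mean) < 0:
                v_norm = (-v_norm[0], -v_norm[1], -v_norm[2])
            normals.append(v_norm)
            v_mean = (v_mean[0] + v_norm[0],
                      v_mean[1] + v_norm[1],
                      v_mean[2] + v_norm[2])
    length = math.sqrt(_dot(v_mean, v_mean))
    v_mean = (v_mean[0] / length, v_mean[1] / length, v_mean[2] / length)
    dots = [_dot(v, v_mean) for v in normals]
    mean_dot = sum(dots) / len(dots)
    # Population standard deviation (an assumption of this sketch).
    std_dev = math.sqrt(sum((d - mean_dot) ** 2 for d in dots) / len(dots))
    return std_dev < cutoff
```

For a regular planar hexagon with in-plane substituents every normal is parallel, so the standard deviation is zero and the ring is deemed flat; lifting a single ring atom well out of the plane spreads the normals and the test fails at the default cutoff.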

-- Bob Hanson
updated 8/26/2015: switch to HTML5; added measure option for multiple ranges
updated 5/21/2012: added $(select...)
updated 5/13/2012: added /aromaticStrict/ and /aromaticDouble/
updated 5/13/2012: added [<atomType>]
updated 4/10/2012: fix for [$(...)n] and [$(...)min-max]
original 5/19/2010