[apologies to those who are offended by HTML mail, there's no way to do
this kind of explanation effectively without using pictures]

This is an entertaining one.[1]

The short form is that it's a bug (https://github.com/rdkit/rdkit/issues/523)
in the way aromaticity is handled, not in the SMILES generation or
canonicalization; the RDKit views the two input molecules as being
different.

To understand what's going on you need to understand how the RDKit's
aromaticity perception works. There's an explanation here:
http://rdkit.org/docs/RDKit_Book.html#aromaticity
The key bit is the sentence:
"An aromatic bond must be between aromatic atoms, but a bond between
aromatic atoms does not need to be aromatic."
This is illustrated by the biphenylene example in the docs, where the two
bonds of the four-membered ring that join the two aromatic six-membered
rings are not considered aromatic:
[image: _images/picture_9.png]
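
For anyone who wants to check the biphenylene case without the picture,
here is a minimal sketch. It uses the Kekule SMILES for biphenylene; the
helper name nonaromatic_bonds_between_aromatic_atoms is made up for this
illustration and is not part of the RDKit. It lists the bonds that join two
aromatic atoms but are not themselves flagged as aromatic.

```python
from rdkit import Chem

# Biphenylene written in Kekule form. The four-membered ring is formed by
# atoms 2, 3, 6 and 11 in this atom ordering.
mol = Chem.MolFromSmiles('C1=CC2=C(C=C1)C1=CC=CC=C21')


def nonaromatic_bonds_between_aromatic_atoms(mol):
    """Return sorted (begin, end) atom-index pairs for bonds that connect
    two aromatic atoms but do not carry the aromatic flag themselves.

    Hypothetical helper written for this example only.
    """
    pairs = []
    for bond in mol.GetBonds():
        a = bond.GetBeginAtom()
        b = bond.GetEndAtom()
        if a.GetIsAromatic() and b.GetIsAromatic() and not bond.GetIsAromatic():
            pairs.append(tuple(sorted((a.GetIdx(), b.GetIdx()))))
    return sorted(pairs)


# Both ends of the 3-6 bond are aromatic atoms, yet the bond is not aromatic.
print(mol.GetAtomWithIdx(3).GetIsAromatic(),
      mol.GetAtomWithIdx(6).GetIsAromatic(),
      mol.GetBondBetweenAtoms(3, 6).GetIsAromatic())
print(nonaromatic_bonds_between_aromatic_atoms(mol))
```

The atom indices here follow the order of atoms in the SMILES string, so
they will change if the input is written differently.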

The problematic part of your two molecules can be reduced to:
[image: Inline image 3]
and
[image: Inline image 4]
That second one shows the kekulized form that the RDKit ends up using.

These produce the following canonical SMILES:

In [31]: Chem.CanonSmiles('C1=CC2=CC=C12')
Out[31]: 'c1cc2ccc1-2'

In [32]: Chem.CanonSmiles('C1=CC2=C1C=C2')
Out[32]: 'c1cc2ccc1=2'
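
To see where the two results diverge, it can help to dump the bond types
that the RDKit assigns after sanitization. The following is a rough sketch:
describe_bonds is a hypothetical helper written for this message, and the
printed output is deliberately not shown because it depends on the RDKit
version in use (in particular on whether the linked issue has been
addressed).

```python
from rdkit import Chem

# The two reduced fragments: the same ring system written as two
# different Kekule structures.
FRAGMENTS = ['C1=CC2=CC=C12', 'C1=CC2=C1C=C2']


def describe_bonds(smiles):
    """Parse a SMILES and return one tuple per bond:
    (begin atom index, end atom index, bond type name, aromatic flag).

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError('could not parse %r' % smiles)
    return [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(),
             str(b.GetBondType()), b.GetIsAromatic())
            for b in mol.GetBonds()]


for smi in FRAGMENTS:
    # Canonical SMILES first, then the per-bond view of the same input.
    print(smi, '->', Chem.CanonSmiles(smi))
    for row in describe_bonds(smi):
        print('   ', row)
```

If the two canonical SMILES differ, the per-bond listing shows which bond
was left with a non-aromatic type in each case.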



The aromatic system in each case is the envelope ring made up of atoms
[1,3,5,6,4,2]. The bond between atoms 3 and 4 is not part of the aromatic
system, so the RDKit does not consider it to have an aromatic bond type.
That is clearly problematic here. I'm going to have to think for a bit
about what should happen. The obvious thing would be, in this special case,
to mark the bond as aromatic, but I'm not sure that's the right answer.
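
As a rough way of locating that central bond programmatically, the sketch
below uses the ring information the RDKit computes during sanitization:
the smallest rings are the two four-membered rings, and the fusion bond is
the only bond that belongs to both of them. The helper name fusion_bonds is
an assumption made for this example. The atom numbers in the pictures above
are not necessarily the RDKit atom indices, which follow the order of atoms
in the input SMILES.

```python
from rdkit import Chem


def fusion_bonds(smiles):
    """Return (ring_sizes, pairs) for a SMILES string.

    ring_sizes: sorted sizes of the rings found by RDKit ring perception.
    pairs: sorted (begin, end) atom-index pairs of bonds that sit in more
    than one of those rings, i.e. the ring-fusion bonds.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError('could not parse %r' % smiles)
    ring_info = mol.GetRingInfo()
    ring_sizes = sorted(len(ring) for ring in ring_info.AtomRings())
    pairs = []
    for bond in mol.GetBonds():
        if ring_info.NumBondRings(bond.GetIdx()) > 1:
            pair = sorted((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
            pairs.append(tuple(pair))
    return ring_sizes, sorted(pairs)


for smi in ('C1=CC2=CC=C12', 'C1=CC2=C1C=C2'):
    print(smi, fusion_bonds(smi))
```

The six-membered envelope ring does not appear in the ring list, because
only the smallest rings are stored there. The fusion bond is the one whose
treatment is in question.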


-greg
[1] entertaining in that "oh, look, there's an edge case I missed!" way.



On Tue, Jun 16, 2015 at 10:20 PM, Peter Shenkin <[email protected]> wrote:

> [N-]=[N+]=NC(=O)N1C(=O)N([N+]([O-])=O)C2(C13C4=C56)C4=C5C2=C36
> [N-]=[N+]=NC(=O)N(C(=O)N1[N+]([O-])=O)C(c23)(c4c56)C16c3c5c24
>
> rdkit canonicalizes the two to the following, respectively:
>
> [N-]=[N+]=NC(=O)N1C(=O)N([N+](=O)[O-])C23c4c5c2c2c-5c4C213
> [N-]=[N+]=NC(=O)N1C(=O)N([N+](=O)[O-])C23c4c5c6c(c2c4=6)C513
>
> I believe these represent the same structure, with the following caveat:
>
> It is not impossible that the two SMILES actually code for different
> structures in some subtle way. I've tried visualizing them in several
> packages, however, and I've not been able to find a difference. Some
> packages canonicalize them to the same structure and others do not.
> The actual structure is chiral, but I've been looking at this from the
> point of view of SMILES without stereochemical information.
>
> The two original SMILES come from a different package. That package
> puts them out as SMILES which are dependent on the atom numbering in
> the input structure file. The originating package does canonicalize
> these to the same structure, however.
>
> I don't think it is correct to consider the double-bonded atoms
> aromatic, which the originating package does in one case. However,
> FWIW, RDKit canonicalizes them as aromatic in both cases. But the main
> issue is that RDKit canonicalizes them differently.
>
> It's kind of a grotty molecule, so it's possible I'm missing
> something. If so, I'd appreciate being set right.
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>