I've looked into the stereo examples and fixed 1 and 2 in SVN,
bringing the number of failures down to 554.

Number 6 was by design, but I've queried whether this is the right
thing to do on the Blue Obelisk exchange
(http://blueobelisk.shapado.com/questions/how-should-unspecified-tetrahedral-stereochemistry-be-represented-in-a-2d-molfile).
I'd appreciate your input here also.

Number 7 must be a bug - I'll wait to fix this until I get a clear
answer on what to do about 6.

Number 8 is a bug, but it looks like an easy fix.

- Noel

On 15 August 2011 14:04, Noel O'Boyle <baoille...@gmail.com> wrote:
> Thanks for this - it is really useful and I'll go through these
> by-and-by. It's also really nice to have an independent verification
> of the error rate of this sort of conversion.
>
> Regarding the specifics, the stereo examples may or may not be bugs
> (I'll have to check each individually), but the implicit valences ones
> probably are. Chris already has some new code for implicit valence,
> and I keep meaning to put together a test set to try it out.
>
> Do you have any idea of the rough breakdown of the 878 failures into
> the 8 cases you list?
>
> - Noel
>
> On 9 August 2011 13:39, Róbert Kiss <rk...@mcule.com> wrote:
>> Dear OpenBabel Developers,
>>
>> We recently did some testing with different cheminformatics tools,
>> including OpenBabel, to see how accurately they can read and write SD
>> files. For this study we collected molecules from PubChem. We selected
>> molecules with at least one tetrahedral and one cis/trans
>> stereocentre and a molecular weight between 350 and 700. This resulted
>> in 477860 PubChem molecules in SDF format. We fed OpenBabel with these
>> SD files and asked it to output another SD file. To test whether the
>> input and output SD files contain the same information, we generated
>> InChIs from both SD files and compared them. Since our database is
>> primarily based on InChI, from a molecule registration point of view
>> we have a problem if the two InChIs are different. This assumes,
>> however, that InChI's SDF parser works correctly.
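
A comparison like the one described above can report which InChI layers differ, not just whether the two strings match. Here is a minimal sketch, assuming standard InChIs. The helper names split_layers and diff_layers are hypothetical (they are not part of OpenBabel or the InChI software), and the example pair is only an illustration, not one of the 878 cases:

```python
def split_layers(inchi):
    """Split an InChI into (version, formula, list of remaining layers)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    version = parts[0]
    formula = parts[1] if len(parts) > 1 else ""
    return version, formula, parts[2:]


def diff_layers(inchi_in, inchi_out):
    """Return (layers only in the input, layers only in the output)."""
    _, formula_in, layers_in = split_layers(inchi_in)
    _, formula_out, layers_out = split_layers(inchi_out)
    # Treat the formula as one more layer so hydrogen-count changes show up.
    all_in = [formula_in] + layers_in
    all_out = [formula_out] + layers_out
    only_in = [layer for layer in all_in if layer not in all_out]
    only_out = [layer for layer in all_out if layer not in all_in]
    return only_in, only_out


# Illustrative pair: the second string has lost its tetrahedral stereo layers.
before = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
after = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
lost, gained = diff_layers(before, after)
print("lost:", lost, "gained:", gained)
```

Grouping the mismatches by the layers that were lost or gained would give a rough breakdown of the failures by type.
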
>>
>> We excluded 356 of the 477860 entries, for which an InChI -> Structure
>> -> InChI round trip using the InChI software alone already gave
>> different input and output InChIs (we previously reported these cases
>> on the InChI-discuss forum).
>>
>> We found 878 entries out of 477504 where InChIs generated from the
>> input PubChem SDF and the output OpenBabel SDF were different. This is
>> about 0.2% of the test database, which is not a bad performance I
>> think. The attached SD and InChI files contain all these cases, first
>> the PubChem input, then the OpenBabel output. Most of these
>> inconsistencies fall into one of the cases below:
>>
>> 1. There seems to be a loss of stereochemical information for
>> quaternary amines (e.g.: 49765543).
>> input InChI: 
>> InChI=1S/C23H28NO4/c1-24(13-8-11-22(25)28-4)14-12-18-15-20(26-2)21(27-3)16-19(18)23(24)17-9-6-5-7-10-17/h5-11,15-16,23H,12-14H2,1-4H3/q+1/b11-8+/t23-,24-/m1/s1
>> output InChI: 
>> InChI=1S/C23H28NO4/c1-24(13-8-11-22(25)28-4)14-12-18-15-20(26-2)21(27-3)16-19(18)23(24)17-9-6-5-7-10-17/h5-11,15-16,23H,12-14H2,1-4H3/q+1/b11-8+/t23-,24?/m1/s1
>>
>> input SDF: 5 11  1  1  0  0  0
>> output SDF: 5 11  1  0  0  0  0
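
To see which centre lost its parity in a pair like this, the /t layers can be compared item by item. This is a rough sketch, assuming single-component /t layers (no ";" separators). The function names are hypothetical helpers, not an existing API:

```python
def parse_tetrahedral_layer(layer):
    """Map atom number -> parity mark (last character of each item).

    Assumes a single-component layer such as "t23-,24-".
    """
    if not layer.startswith("t"):
        raise ValueError("expected a /t layer, got %r" % layer)
    parities = {}
    for item in layer[1:].split(","):
        if item:
            parities[int(item[:-1])] = item[-1]
    return parities


def changed_centres(layer_in, layer_out):
    """List (atom, parity_in, parity_out) for centres whose parity differs."""
    p_in = parse_tetrahedral_layer(layer_in)
    p_out = parse_tetrahedral_layer(layer_out)
    return [(atom, p_in[atom], p_out.get(atom))
            for atom in sorted(p_in) if p_in[atom] != p_out.get(atom)]


# /t layers taken from the two InChIs quoted for 49765543.
print(changed_centres("t23-,24-", "t23-,24?"))
```

For this example only centre 24 is reported, changing from "-" to "?".
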
>>
>> 2. There seems to be a loss of stereochemical information for
>> phosphorus (e.g.: 46936393).
>> input InChI: 
>> InChI=1S/C21H36FO2P/c1-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-25(22,23)24-2/h7-8,10-11,13-14,16-17H,3-6,9,12,15,18-21H2,1-2H3/b8-7+,11-10+,14-13+,17-16+/t25-/m1/s1
>> output InChI: 
>> InChI=1S/C21H36FO2P/c1-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-25(22,23)24-2/h7-8,10-11,13-14,16-17H,3-6,9,12,15,18-21H2,1-2H3/b8-7+,11-10+,14-13+,17-16+
>>
>> input SDF: 1  3  1  1  0  0  0
>> output SDF: 1  3  1  0  0  0  0
>>
>> 3. For atoms with unusual valence state it seems that OpenBabel
>> automatically sets the typical valence, while InChI accepts the
>> valence as described in the SDF (e.g. 44420892). This difference in
>> the valence state results in a difference in the number of implicit
>> hydrogens (connected to Si in this example), and thus in different
>> InChIs:
>>
>> input InChI: 
>> InChI=1S/C27H35NO8Si/c1-27(2,3)37-36-21-14-18(28-29)24(17-12-22(32-6)26(34-8)23(13-17)33-7)25(21)16-9-10-19(31-5)20(11-16)35-15-30-4/h9-13,21,29H,14-15H2,1-8H3/b28-18-
>> (InChI warning: value="Accepted unusual valence(s): Si(2)")
>> output InChI: 
>> InChI=1S/C27H37NO8Si/c1-27(2,3)37-36-21-14-18(28-29)24(17-12-22(32-6)26(34-8)23(13-17)33-7)25(21)16-9-10-19(31-5)20(11-16)35-15-30-4/h9-13,21,29H,14-15,37H2,1-8H3/b28-18-
>>
>> input SDF: 5.2791    3.0717    0.0000 Si  0  0  0  0  0  2  0  0  0  0  0  0
>>   (valence is 2)
>> output SDF: 5.2791    3.0717    0.0000 Si  0  0  0  0  0  0  0  0  0  0  0  0
>>   (valence is set to default; in case of Si this means 4)
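
The valence field discussed here can be read straight from the atom line. Below is a small sketch under the assumption that the fields are whitespace-separated (a strict reader would slice the fixed-width columns instead). The function names are hypothetical. The interpretation follows the MOL file convention that 0 means no marking, 1 to 14 give the valence, and 15 means zero valence:

```python
def atom_valence_field(atom_line):
    """Return (symbol, raw valence field) from a V2000 atom line.

    Assumes whitespace-separated fields in the order
    x y z symbol dd ccc sss hhh bbb vvv ...
    """
    tokens = atom_line.split()
    return tokens[3], int(tokens[9])


def describe_valence(code):
    """Interpret the field: 0 = no marking, 1-14 = that valence, 15 = zero."""
    if code == 0:
        return "default"
    if code == 15:
        return "0"
    if 1 <= code <= 14:
        return str(code)
    raise ValueError("valence field out of range: %d" % code)


line_in = "5.2791    3.0717    0.0000 Si  0  0  0  0  0  2  0  0  0  0  0  0"
line_out = "5.2791    3.0717    0.0000 Si  0  0  0  0  0  0  0  0  0  0  0  0"
for line in (line_in, line_out):
    symbol, code = atom_valence_field(line)
    print(symbol, describe_valence(code))
```
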
>>
>> 4. For atoms with an unusual valence state, the valence in the atom
>> block sometimes disappears and an extra "M  RAD" line appears in the
>> output SDF (e.g.: 19350442). AFAIK the valence field in the atom block
>> and the "M  RAD" line are two different things (though not totally
>> independent), so the valence information cannot be converted directly
>> into radical state information. Also, according to the MOL file
>> specification the last number in the "M  RAD" line can only be 0, 1,
>> 2 or 3, while we found the values 4 and 5 in some cases.
>>
>> input InChI: 
>> InChI=1S/C9H13.C8H11.2CH3.2ClH.Si.Zr/c1-6-5-7(2)9(4)8(6)3;1-2-4-6-8-7-5-3-1;;;;;;/h6H,1-4H3;1-3H,4,6-8H2;2*1H3;2*1H;;/q4*-1;;;;+4/p-2/b;2-1-;;;;;;
>> output InChI: 
>> InChI=1S/C9H13.C8H11.2CH3.2ClH.H4Si.Zr/c1-6-5-7(2)9(4)8(6)3;1-2-4-6-8-7-5-3-1;;;;;;/h6H,1-4H3;1-3H,4,6-8H2;2*1H3;2*1H;1H4;/q4*-1;;;;+4/p-2/b;2-1-;;;;;;
>>
>> input SDF atom block: 9.9774    5.0246    0.0000 Si  0  0  0  0  0 15  0  0  0  0  0  0
>>   (15 means valence: 0)
>> output SDF atom block: 9.9774    5.0246    0.0000 Si  0  0  0  0  0  0  0  0  0  0  0  0
>>   (0 means valence is default: 4)
>>
>> input SDF: no "M  RAD" line
>> output SDF: M  RAD  1   4   5
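
A quick check for out-of-range radical values could look like the sketch below. It assumes whitespace-separated "M  RAD" lines, and the function names are hypothetical. The allowed values (0 = no radical, 1 = singlet, 2 = doublet, 3 = triplet) are the ones referred to in case 4:

```python
RADICAL_NAMES = {0: "no radical", 1: "singlet", 2: "doublet", 3: "triplet"}


def parse_rad_line(line):
    """Return [(atom_number, value), ...] from an 'M  RAD' property line.

    Assumes whitespace-separated fields:
    M RAD count atom value [atom value ...]
    """
    tokens = line.split()
    if tokens[:2] != ["M", "RAD"]:
        raise ValueError("not an M  RAD line: %r" % line)
    count = int(tokens[2])
    numbers = [int(t) for t in tokens[3:3 + 2 * count]]
    return list(zip(numbers[0::2], numbers[1::2]))


def invalid_radicals(line):
    """Entries whose value falls outside the allowed 0-3 range."""
    return [(atom, value) for atom, value in parse_rad_line(line)
            if value not in RADICAL_NAMES]


print(invalid_radicals("M  RAD  1   4   5"))   # line quoted for 19350442
print(invalid_radicals("M  RAD  1  31   2"))   # line quoted for 23569471
```

The first line is flagged because its value is 5. The second is syntactically valid (a doublet on atom 31), so any problem with it is chemical rather than a format violation.
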
>>
>> 5. This is a quite extreme molecule (23569471). It contains a carbon
>> atom connected to another carbon and two hydrogens. It has a positive
>> charge according to the input SDF. OpenBabel preserves the charge
>> information, but adds an additional "M  RAD" line, which is (together
>> with the positive charge) not correct, I think. This difference in the
>> SDF results in different InChIs because InChI can only remove the
>> positive charge from the PubChem input SDF.
>>
>> input InChI: 
>> InChI=1S/C29H42O2/c1-20(2)12-9-13-21(3)14-10-15-22(4)16-11-18-29(8)19-17-26-25(7)27(30)23(5)24(6)28(26)31-29/h12,14,16H,1,9-11,13,15,17-19H2,2-8H3/p+1/b20-12-,21-14+,22-16+
>> output InChI: 
>> InChI=1S/C29H43O2/c1-20(2)12-9-13-21(3)14-10-15-22(4)16-11-18-29(8)19-17-26-25(7)27(30)23(5)24(6)28(26)31-29/h12,14,16,30H,1,9-11,13,15,17-19H2,2-8H3/q+1/b21-14+,22-16+
>>
>> input SDF: no "M  RAD" line
>> output SDF: M  RAD  1  31   2
>>
>> 6. OpenBabel changes the stereo field of several bonds in the bond
>> block from "0" (not stereo) to "4" (either) when they are connected to
>> a tetrahedral stereo centre (e.g.: 45489540). This is not an issue when
>> standard InChIs are generated, but it is a problem for InChIs generated
>> with the -SLUUD option, where undefined and unknown centres are
>> distinguished.
>>
>> input SDF:
>>  12 38  1  0  0  0  0
>>  15 42  1  0  0  0  0
>>  19 47  1  0  0  0  0
>>  20 48  1  0  0  0  0
>>
>> output SDF:
>>  12 38  1  4  0  0  0
>>  15 42  1  4  0  0  0
>>  19 47  1  4  0  0  0
>>  20 48  1  4  0  0  0
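
Changes like these can be listed automatically by comparing the bond blocks of the two files. This is only a sketch: it assumes whitespace-separated bond lines and that both files number the atoms identically. The function names are hypothetical:

```python
def parse_bond_line(line):
    """Return (atom1, atom2, bond_type, stereo) from a V2000 bond line.

    Assumes whitespace-separated fields, which holds for the lines quoted
    here but not for files where three-digit atom numbers run together.
    """
    a1, a2, bond_type, stereo = (int(t) for t in line.split()[:4])
    return a1, a2, bond_type, stereo


def stereo_changes(bonds_in, bonds_out):
    """List (atom1, atom2, stereo_in, stereo_out) where the stereo field changed."""
    before = {}
    for line in bonds_in:
        a1, a2, _, stereo = parse_bond_line(line)
        before[(a1, a2)] = stereo
    changes = []
    for line in bonds_out:
        a1, a2, _, stereo = parse_bond_line(line)
        old = before.get((a1, a2))
        if old is not None and old != stereo:
            changes.append((a1, a2, old, stereo))
    return changes


bonds_in = [" 12 38  1  0  0  0  0", " 15 42  1  0  0  0  0"]
bonds_out = [" 12 38  1  4  0  0  0", " 15 42  1  4  0  0  0"]
print(stereo_changes(bonds_in, bonds_out))
```
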
>>
>> 7. We did, however, find cases where OpenBabel changed a double bond
>> stereo from "0" to "4", which seems to be an error, as the bond stereo
>> for double bonds can only be "0" or "3" (e.g.: 20725075).
>>
>> input SDF:  2  5  2  0  0  0  0
>> output SDF:  2  5  2  4  0  0  0
>>
>> In this example (20725075) there is another interesting difference
>> between the input and output SDFs:
>>
>> input SDF: 3  7  1  0  0  0  0
>> output SDF: 3  7  1  4  0  0  0
>>
>> OpenBabel changed the above bond stereo type from "0" to "4". Atom
>> "3" is, however, not a tetrahedral stereo centre. Not sure why this is
>> happening, but it results in a "wavy" bond connected to a double bond
>> in this case, which usually indicates unknown double bond stereo and
>> so gives different InChIs:
>>
>> input InChI: 
>> InChI=1S/C26H58N2O4Si4/c1-23(2)20-31-34(22-32-36(8,9)10)19-15-14-18-28-26(29)27-17-13-11-12-16-25(35(5,6)7)24(3)21-33(4)30/h22-25H,11-21H2,1-10H3,(H2,27,28,29)/b34-22+
>> output InChI: 
>> InChI=1S/C26H58N2O4Si4/c1-23(2)20-31-34(22-32-36(8,9)10)19-15-14-18-28-26(29)27-17-13-11-12-16-25(35(5,6)7)24(3)21-33(4)30/h22-25H,11-21H2,1-10H3,(H2,27,28,29)
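
The rule in case 7 (double bonds may only carry stereo "0" or "3") can be checked mechanically. Here is a minimal sketch, assuming whitespace-separated bond lines. The function name is hypothetical, and treating every other bond type as allowing only "0" is my assumption:

```python
# Stereo codes allowed per bond type in a V2000 bond block:
#   single (1): 0 = not stereo, 1 = up, 4 = either, 6 = down
#   double (2): 0 = use coordinates for cis/trans, 3 = cis or trans (either)
ALLOWED_STEREO = {1: {0, 1, 4, 6}, 2: {0, 3}}


def invalid_bond_stereo(bond_lines):
    """Return (atom1, atom2, bond_type, stereo) for disallowed combinations."""
    bad = []
    for line in bond_lines:
        a1, a2, bond_type, stereo = (int(t) for t in line.split()[:4])
        # Assumption: bond types other than single/double carry no stereo.
        allowed = ALLOWED_STEREO.get(bond_type, {0})
        if stereo not in allowed:
            bad.append((a1, a2, bond_type, stereo))
    return bad


# The two output lines quoted for 20725075.
print(invalid_bond_stereo(["  2  5  2  4  0  0  0", "  3  7  1  4  0  0  0"]))
```

Only the 2-5 double bond is flagged. The "4" on the 3-7 single bond is legal as far as the format goes, so that second problem needs a chemical check (is atom 3 really a stereo centre?) rather than a syntactic one.
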
>>
>> 8. It seems that isotopes are not considered as different substituents
>> of stereocentres (e.g.: 25240904)
>>
>> input InChI: 
>> InChI=1S/C20H31NO.C4H4O4/c1-18(2)17-11-12-19(18,3)20(15-17,22-14-13-21(4)5)16-9-7-6-8-10-16;5-3(6)1-2-4(7)8/h6-10,17H,11-15H2,1-5H3;1-2H,(H,5,6)(H,7,8)/b;2-1+/t17-,19-,20+;/m1./s1/i15T;/t15-,17+,19+,20-;/m0.
>> output InChI: 
>> InChI=1S/C20H31NO.C4H4O4/c1-18(2)17-11-12-19(18,3)20(15-17,22-14-13-21(4)5)16-9-7-6-8-10-16;5-3(6)1-2-4(7)8/h6-10,17H,11-15H2,1-5H3;1-2H,(H,5,6)(H,7,8)/b;2-1+/t17-,19-,20+;/m1./s1/i15T;/t15?,17-,19-,20+;
>>
>> input SDF: 11 32  1  6  0  0  0
>> output SDF: 11 32  1  0  0  0  0
>>
>> It looks like the wedge bond between the tetrahedral stereocentre and
>> the connected tritium disappeared.
>>
>> --
>>
>> I would greatly appreciate it if you could comment on whether the
>> examples above show expected behaviour or whether we have identified
>> some bugs. If the latter, I hope they can help you to further improve
>> OpenBabel.
>>
>> BTW, we used version 1.03 of the InChI software and the subversion
>> trunk of OpenBabel for this study.
>>
>> Best Regards,
>> Robert
>>
>> sorry for the long letter :)
>>
>> --
>>
>> Robert Kiss
>> http://.mcule.com
>>
>> ------------------------------------------------------------------------------
>> uberSVN's rich system and user administration capabilities and model
>> configuration take the hassle out of deploying and managing Subversion and
>> the tools developers use with it. Learn more about uberSVN and get a free
>> download at:  http://p.sf.net/sfu/wandisco-dev2dev
>>
>> _______________________________________________
>> OpenBabel-discuss mailing list
>> OpenBabel-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
>>
>>
>
