On 23/08/2011 11:29, Noel O'Boyle wrote:
> I note that Chris has fixed some problems with metals, and we're down
> to 146 failures.
Some notes on this are below. (I was waiting to make sure the changes
didn't cause mass test failure.)
On 09/08/2011 13:39, Róbert Kiss wrote:
> We recently did some testing with different cheminformatic tools
> including OpenBabel to see how accurately they can read and write SD
> files.
Noel has already addressed the stereochemical discrepancies. Some of
the others, mainly valence-related, are now fixed by the changes
discussed in more detail below.
> 3. For atoms with unusual valence state it seems that OpenBabel
> automatically sets the typical valence, while InChI accepts the
> valence as described in the SDF (e.g. 44420892). This difference in
> the valence state results in a difference in the number of implicit
> hydrogens (connected to Si in this example), and thus in different
> InChIs:
>
> input InChI:
InChI=1S/C27H35NO8Si/c1-27(2,3)37-36-21-14-18(28-29)24(17-12-22(32-6)26(34-8)23(13-17)33-7)25(21)16-9-10-19(31-5)20(11-16)35-15-30-4/h9-13,21,29H,14-15H2,1-8H3/b28-18-
> (InChI warning: value="Accepted unusual valence(s): Si(2)")
> output InChI:
InChI=1S/C27H37NO8Si/c1-27(2,3)37-36-21-14-18(28-29)24(17-12-22(32-6)26(34-8)23(13-17)33-7)25(21)16-9-10-19(31-5)20(11-16)35-15-30-4/h9-13,21,29H,14-15,37H2,1-8H3/b28-18-
>
> input SDF: 5.2791 3.0717 0. Si 0 0 0 0 0 2 0 0 0 0
> 0 0 (valence is 2)
> output SDF: 5.2791 3.0717 0. Si 0 0 0 0 0 0 0 0 0
> 0 0 0 (valence is set to default; in case of Si this means 4)
Until now OpenBabel interpreted only the value 15 (= zero valence) in
this field, because of a technical difficulty. I have now changed it so
that any non-zero value suppresses implicit hydrogens. I expect that
this is nearly always why the feature is used, but there could be cases
within the spec (such as it is) that would not be interpreted
correctly. The SD file in this example is now read correctly. On
output, however, OpenBabel writes an M RAD line rather than the valence
value, which IMO gives a better chemical description of the molecule.
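
For reference, here is a minimal Python sketch of how the valence field
in the atom block can be decoded. It assumes the fixed-width V2000
layout, in which the field occupies columns 49-51, and the usual
encoding (0 = no marking, 1-14 = that valence, 15 = zero valence). All
function names are hypothetical and this is not OpenBabel's code;
suppresses_implicit_h only mirrors the behaviour described above.

```python
def format_atom_line(x, y, z, symbol, valence=0):
    """Build a V2000 atom line up to and including the valence field.

    Hypothetical helper, used here only to produce test input.
    """
    return "%10.4f%10.4f%10.4f %-3s%2d%3d%3d%3d%3d%3d" % (
        x, y, z, symbol, 0, 0, 0, 0, 0, valence)


def read_valence_field(line):
    """Return the raw integer in columns 49-51 (0 if blank or absent)."""
    field = line[48:51].strip()
    return int(field) if field else 0


def decode_valence_field(vvv):
    """Translate the raw field into a valence; None means 'use the default'."""
    if vvv == 0:
        return None      # no marking: the reader applies its default valence
    if vvv == 15:
        return 0         # 15 is the special code for zero valence
    if 1 <= vvv <= 14:
        return vvv
    raise ValueError("valence field out of range: %r" % vvv)


def suppresses_implicit_h(vvv):
    """Simplified rule from this thread: any non-zero value means no implicit H."""
    return vvv != 0


line = format_atom_line(5.2791, 3.0717, 0.0, "Si", valence=2)
print(read_valence_field(line), suppresses_implicit_h(read_valence_field(line)))
```

A stricter reader could instead compute the implicit hydrogen count as
the stated valence minus the sum of the explicit bond orders; the
simplified rule gives up that case.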
> 4. For atoms with unusual valence state sometimes the valence state in
> the atom block disappears and an extra "M RAD" line appears in the
> output SDF (e.g.: 19350442). AFAIK the valence count in the atom block
> and the "M RAD" line are two different things (not totally
> independent though) so the valence information cannot be converted to
> a radical state information directly. Also the last number in the "M
> RAD" line can only be 0,1,2 or 3 according to the MOL file
> specification, while we found numbers 4 and 5 in some cases.
> input InChI:
InChI=1S/C9H13.C8H11.2CH3.2ClH.Si.Zr/c1-6-5-7(2)9(4)8(6)3;1-2-4-6-8-7-5-3-1;;/h6H,1-4H3;1-3H,4,6-8H2;2*1H3;2*1H;;/q4*-1+4/p-2/b;2-1-;;
> output InChI:
InChI=1S/C9H13.C8H11.2CH3.2ClH.H4Si.Zr/c1-6-5-7(2)9(4)8(6)3;1-2-4-6-8-7-5-3-1;;/h6H,1-4H3;1-3H,4,6-8H2;2*1H3;2*1H;1H4;/q4*-1+4/p-2/b;2-1-;;
>
> input SDF atom block: 9.9774 5.0246 0. Si 0 0 0 0 0 15
> 0 0 0 0 0 0 (15 means valence: 0)
> output SDF atom block: 9.9774 5.0246 0. Si 0 0 0 0 0 0
> 0 0 0 0 0 0 (0 means valence is default: 4)
>
> input SDF: no "M RAD" line
> output SDF: M RAD 1 4 5
OpenBabel uses the equivalent of the RAD value to represent hydrogen
deficiency in the organic subset of elements and in silicon, so it
needs the values 4 and 5 internally. An isolated C or Si atom would
have the value 5, even if its spin multiplicity was smaller. The
molecule 19350442 has such an unbonded Si atom (which does not seem
very realistic to me, and illustrates the inadequacy of SDF for
organometallic molecules). When MDL files are written, RAD values of 4
and 5 are now replaced by a valence value, as is any RAD value on a
metal atom.
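
The output rule can be sketched roughly as follows. This is only an
illustration in Python, not the actual OpenBabel code. The function
names are hypothetical, and I am assuming here that the valence value
written is the sum of the explicit bond orders, with 15 standing in for
an atom that has no bonds at all.

```python
def mdl_atom_output(spin, is_metal, explicit_bond_sum):
    """Return (rad_value, valence_field) to write for one atom.

    spin is the internal RAD-like value (0 = none, 1-3 as in the MOL
    spec, 4 and 5 internal only). A returned rad_value of 0 means the
    atom gets no entry on the M RAD line.
    """
    needs_valence = spin in (4, 5) or (is_metal and spin != 0)
    if not needs_valence:
        return spin, 0
    # Assumption: write the explicit bond order sum, or 15 for zero valence.
    valence_field = explicit_bond_sum if explicit_bond_sum > 0 else 15
    return 0, valence_field


def format_m_rad(entries):
    """Format one radical property line from (atom_index, rad_value) pairs."""
    if not 1 <= len(entries) <= 8:
        raise ValueError("a single M RAD line holds 1 to 8 entries")
    pairs = "".join(" %3d %3d" % (atom, value) for atom, value in entries)
    return "M  RAD%3d%s" % (len(entries), pairs)


# The isolated Si atom from 19350442: internal value 5, no bonds.
print(mdl_atom_output(5, False, 0))
# An ordinary doublet radical is still written on the M RAD line.
print(format_m_rad([(31, 2)]))
```

With this rule only the values 1, 2 and 3 can ever reach the M RAD
line, which keeps the output within the MOL file specification.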
> 5. This is a quite extreme molecule (23569471). It contains a carbon
> atom connected to another carbon and two hydrogens. It has a positive
> charge according to the input SDF. OpenBabel preserves the charge
> information, but adds an additional "M RAD" line, which is (together
> with the positive charge) not correct, I think. This difference in the
> SDF results in different InChIs because InChI can only remove the
> positive charge from the PubChem input SDF.
>
> input InChI:
InChI=1S/C29H42O2/c1-20(2)12-9-13-21(3)14-10-15-22(4)16-11-18-29(8)19-17-26-25(7)27(30)23(5)24(6)28(26)31-29/h12,14,16H,1,9-11,13,15,17-19H2,2-8H3/p+1/b20-12-,21-14+,22-16+
> output InChI:
InChI=1S/C29H43O2/c1-20(2)12-9-13-21(3)14-10-15-22(4)16-11-18-29(8)19-17-26-25(7)27(30)23(5)24(6)28(26)31-29/h12,14,16,30H,1,9-11,13,15,17-19H2,2-8H3/q+1/b21-14+,22-16+
>
> input SDF: no "M RAD" line
> output SDF: M RAD 1 31 2
A valence value for such carbocations was missing from a data file;
this is now corrected.
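
To illustrate how a missing table entry can produce a spurious radical,
here is a small Python sketch. The table, the fallback to the
neutral-atom valence and the function name are all assumptions made for
the example; the real data file and lookup logic in OpenBabel may
differ.

```python
# Hypothetical lookup: (element, formal charge) -> expected valence.
COMPLETE_TABLE = {
    ("C", 0): 4,
    ("C", 1): 3,   # positively charged carbon with three bonds
}

# The same table without the charged-carbon entry.
INCOMPLETE_TABLE = {
    ("C", 0): 4,
}


def hydrogen_deficiency(element, charge, bond_order_sum, table):
    """Return how many bonds the atom lacks relative to its expected valence.

    Assumption: when the charged entry is absent, the lookup falls back
    to the neutral-atom valence.
    """
    expected = table.get((element, charge), table[(element, 0)])
    return max(expected - bond_order_sum, 0)


# Atom 31 in 23569471: C+ bonded to one carbon and two hydrogens.
print(hydrogen_deficiency("C", 1, 3, INCOMPLETE_TABLE))  # one bond short
print(hydrogen_deficiency("C", 1, 3, COMPLETE_TABLE))    # fully satisfied
```

A deficiency of one on an atom whose hydrogens are all explicit would
be reported as a doublet, which is consistent with the "M RAD 1 31 2"
line seen in the output.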
---