Hi JW,

On Thu, Oct 22, 2015 at 12:47 AM, JW Feng <[email protected]> wrote:

>
> I read a post (link below) about SD tag reordering by Matthew and replied
> by Greg and I have a follow up question. I would like to preserve the
> ordering of SD tags as they appear in the input SD file. I tried getting
> the list of SD tags by mol.GetPropNames() and setting the order with
> sd_writer.SetProps() but that didn't work. Turns out mol.GetPropNames()
> returns a list in alphabetical order instead of order of appearance.
>

I would say instead that they appear in an unspecified,
implementation-dependent order. This may be alphabetical, but it's certainly
not guaranteed to be so.


> Is there a way to preserve SD tag orders?
>

There is currently no way to do this automatically. I have always thought
of those properties as being unordered, so the RDKit doesn't maintain
any record of the order in which properties are added to a molecule.

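As long as you have the original SDMolSupplier, you can pretty easily get
the ordered list of property names from the text of the record itself (the
session below assumes "import re" and "from rdkit import Chem" have
already been run):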

In [22]: suppl = Chem.SDMolSupplier('tmp.sdf')

In [23]: m = suppl[0]

In [25]: list(m.GetPropNames())   # <- here's the non-ordered list
Out[25]:
['PUBCHEM_ATOM_DEF_STEREO_COUNT',
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
 'PUBCHEM_BONDANNOTATIONS',
 'PUBCHEM_BOND_DEF_STEREO_COUNT',
 'PUBCHEM_BOND_UDEF_STEREO_COUNT',
 'PUBCHEM_CACTVS_COMPLEXITY',
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
 'PUBCHEM_CACTVS_HBOND_DONOR',
 'PUBCHEM_CACTVS_ROTATABLE_BOND',
 'PUBCHEM_CACTVS_SUBSKEYS',
 'PUBCHEM_CACTVS_TAUTO_COUNT',
 'PUBCHEM_CACTVS_TPSA',
 'PUBCHEM_COMPONENT_COUNT',
 'PUBCHEM_COMPOUND_CANONICALIZED',
 'PUBCHEM_COMPOUND_CID',
 'PUBCHEM_COORDINATE_TYPE',
 'PUBCHEM_EXACT_MASS',
 'PUBCHEM_HEAVY_ATOM_COUNT',
 'PUBCHEM_ISOTOPIC_ATOM_COUNT',
 'PUBCHEM_IUPAC_CAS_NAME',
 'PUBCHEM_IUPAC_INCHI',
 'PUBCHEM_IUPAC_INCHIKEY',
 'PUBCHEM_IUPAC_NAME',
 'PUBCHEM_IUPAC_OPENEYE_NAME',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME',
 'PUBCHEM_MOLECULAR_FORMULA',
 'PUBCHEM_MOLECULAR_WEIGHT',
 'PUBCHEM_MONOISOTOPIC_WEIGHT',
 'PUBCHEM_OPENEYE_CAN_SMILES',
 'PUBCHEM_OPENEYE_ISO_SMILES',
 'PUBCHEM_TOTAL_CHARGE',
 'PUBCHEM_XLOGP3_AA']

In [26]: txt = suppl.GetItemText(0)

In [27]: pns = re.findall(r'> *<(\w+)>',txt)    # <- this gives you the list in order

In [28]: pns
Out[28]:
['PUBCHEM_COMPOUND_CID',
 'PUBCHEM_COMPOUND_CANONICALIZED',
 'PUBCHEM_CACTVS_COMPLEXITY',
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
 'PUBCHEM_CACTVS_HBOND_DONOR',
 'PUBCHEM_CACTVS_ROTATABLE_BOND',
 'PUBCHEM_CACTVS_SUBSKEYS',
 'PUBCHEM_IUPAC_OPENEYE_NAME',
 'PUBCHEM_IUPAC_CAS_NAME',
 'PUBCHEM_IUPAC_NAME',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME',
 'PUBCHEM_IUPAC_INCHI',
 'PUBCHEM_IUPAC_INCHIKEY',
 'PUBCHEM_XLOGP3_AA',
 'PUBCHEM_EXACT_MASS',
 'PUBCHEM_MOLECULAR_FORMULA',
 'PUBCHEM_MOLECULAR_WEIGHT',
 'PUBCHEM_OPENEYE_CAN_SMILES',
 'PUBCHEM_OPENEYE_ISO_SMILES',
 'PUBCHEM_CACTVS_TPSA',
 'PUBCHEM_MONOISOTOPIC_WEIGHT',
 'PUBCHEM_TOTAL_CHARGE',
 'PUBCHEM_HEAVY_ATOM_COUNT',
 'PUBCHEM_ATOM_DEF_STEREO_COUNT',
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
 'PUBCHEM_BOND_DEF_STEREO_COUNT',
 'PUBCHEM_BOND_UDEF_STEREO_COUNT',
 'PUBCHEM_ISOTOPIC_ATOM_COUNT',
 'PUBCHEM_COMPONENT_COUNT',
 'PUBCHEM_CACTVS_TAUTO_COUNT',
 'PUBCHEM_COORDINATE_TYPE',
 'PUBCHEM_BONDANNOTATIONS']

If you pass that list of property names to the SDWriter's SetProps()
method, it will write things out in the input order.
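
Putting the pieces together, here is a minimal sketch of the whole round
trip. The helper names sd_tag_order() and copy_sdf_preserving_tag_order()
are made up for this illustration. The regular expression is a slightly
more permissive variant of the one above (it also accepts tag names that
contain spaces or hyphens), so treat it as a starting point, not as the
canonical way to do this. The sketch assumes that every record in the file
uses the same tags as the first record, that the file is not empty, and
that SDWriter.SetProps() accepts a plain Python list of names.

```python
import re

# SD data header lines start with '>' and carry the tag name in angle
# brackets, e.g. "> <NAME>" or ">  <NAME> (1)". Anchoring at the start of
# the line avoids picking up '<...>' text that appears inside data values.
_TAG_RE = re.compile(r'^>.*?<([^>]+)>', re.MULTILINE)


def sd_tag_order(record_text):
    """Return the SD tag names of one record in order of first appearance."""
    ordered = []
    for name in _TAG_RE.findall(record_text):
        if name not in ordered:
            ordered.append(name)
    return ordered


def copy_sdf_preserving_tag_order(in_path, out_path):
    """Hypothetical helper: rewrite an SD file, keeping the input tag order.

    Assumption: the tag order of the first record applies to the whole file.
    """
    from rdkit import Chem  # imported here so sd_tag_order() works without RDKit

    suppl = Chem.SDMolSupplier(in_path)
    # Raw text of the first record, as in the session above.
    order = sd_tag_order(suppl.GetItemText(0))

    writer = Chem.SDWriter(out_path)
    # Set the property list once, before any molecule is written.
    writer.SetProps(order)
    for mol in suppl:
        if mol is None:  # skip records that RDKit could not parse
            continue
        writer.write(mol)
    writer.close()
```

Something like copy_sdf_preserving_tag_order('tmp.sdf', 'out.sdf') would
then write the molecules back out with their tags in the input order. If
different records carry different tags, you would need to build the
combined list yourself before calling SetProps().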

I hope this helps,
-greg