On Feb 11, 2025, at 10:06, Chris Swain via OpenBabel-discuss 
<openbabel-discuss@lists.sourceforge.net> wrote:
> I want to convert some very large sdf files to SMILES.  I know that some 
> structures will fail to convert.
> 
> Is it possible to do the conversion creating two files, one containing the 
> valid SMILES and another file containing the records that failed to convert? 
> Either as a sdf file or simply a text file containing the title of molecules.

Since Geoff commented that it wasn't possible directly in Open Babel, I'll 
suggest an alternative.

Chemfp supports multiple cheminformatics toolkits. To make things easier for me 
and for chemfp users, I've implemented a "toolkit" wrapper API for consistent 
molecule I/O across the supported toolkits.

This includes a "text" toolkit which knows just enough about SMILES and SD 
files to read the records as text blocks, with no need for a chemistry toolkit. 
This was designed to get access to the original record in order to, for 
example, preserve the exact input atom order and aromaticity, or to add a note 
like "could not process file" to an SD data item.
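
As a rough illustration of the idea (this is not chemfp's actual text toolkit), 
SD records can be identified without a chemistry toolkit because each record 
ends with a line containing only "$$$$". The function name iter_sdf_records is 
made up for this sketch:

```python
import io


def iter_sdf_records(infile):
    """Yield each SD record as one text block, including its "$$$$" line.

    Hypothetical helper for illustration; not part of chemfp or Open Babel.
    """
    lines = []
    for line in infile:
        lines.append(line)
        # A line containing only "$$$$" terminates an SD record.
        if line.rstrip() == "$$$$":
            yield "".join(lines)
            lines = []
    # Keep a trailing record which lacks the final "$$$$" terminator.
    if any(line.strip() for line in lines):
        yield "".join(lines)


example = io.StringIO("mol1\n  demo\n\nM  END\n$$$$\nmol2\n  demo\n\nM  END\n$$$$\n")
for record in iter_sdf_records(example):
    print(record.splitlines()[0])
```

Because each record is returned unmodified, it can be written back out exactly 
as it appeared in the input.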

For chemfp's Open Babel SDF reader wrapper, I added an option to choose between 
having Open Babel read the molecules directly (this is the default) and using 
chemfp's text-based parser to identify the records, then having Open Babel 
parse each record to get the molecule.

Chemfp's wrapper API also has a way to pass in the error handler to use when 
Open Babel fails to parse a record, e.g., to ignore the problem and keep 
processing, or to stop processing immediately.
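
The general pattern can be sketched in plain Python. Everything here 
(IgnoreErrors, StrictErrors, parse_all) is a hypothetical stand-in to show the 
two policies, not chemfp's API:

```python
class IgnoreErrors:
    """Policy: skip the bad record and keep processing."""
    def error(self, msg, location=None):
        pass


class StrictErrors:
    """Policy: stop processing at the first bad record."""
    def error(self, msg, location=None):
        raise ValueError(msg)


def parse_all(records, parse, errors):
    """Yield parse(record) for each record, reporting failures to 'errors'."""
    for recno, record in enumerate(records, 1):
        try:
            result = parse(record)
        except ValueError as err:
            # The handler decides whether this returns (skip) or raises (stop).
            errors.error(f"record {recno}: {err}", location=recno)
            continue
        yield result


# int() stands in for a real record parser.
print(list(parse_all(["1", "x", "3"], int, IgnoreErrors())))
```

The reader does not need to know which policy is in use; it only calls the 
handler's error() method and continues if that call returns.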

I'm pretty sure you can use the following, with some filename changes, to 
extract the records you mentioned.

import sys

from chemfp.io import ErrorHandler
from chemfp import openbabel_toolkit as T

# Create a user-defined error handler which writes
# the failing record to the specified file object.
class SaveErrors(ErrorHandler):
    def __init__(self, outfile):
        self.outfile = outfile
    
    # Called whenever there is an error.
    # The "location" object (always present for file I/O)
    # when used with the "chemfp" implementation stores the
    # current record in ".record", as a byte string.
    def error(self, msg, location=None, extra=None):
        assert location is not None
        sys.stderr.write(f"ERROR: {msg}\n")
        self.outfile.write(location.record)

with open("errors.sdf", "wb") as err_file:
    with T.read_molecules(
            "chembl_33.sdf.gz",
            # Have the chemfp wrapper use chemfp to tokenize SDF
            # records instead of letting Open Babel parse everything.
            reader_args={"implementation": "chemfp"},
            # Specify a user-defined error handler.
            errors=SaveErrors(err_file),
            ) as reader:
        with T.open_molecule_writer("dest.smi") as writer:
            with reader.location.progress_bar() as progress_bar:
                writer.write_molecules(progress_bar(reader))

This is a bit slower than using Open Babel's native reader, though I don't 
recall how much.
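
If only the titles of the failing molecules are wanted, note that the title is 
the first line of an SD record. A minimal sketch, assuming the record is a 
byte string as described for ".record" above; record_title is a made-up helper 
name, not a chemfp function:

```python
def record_title(record):
    """Return the title (first line) of an SD record given as bytes."""
    first_line, _, _ = record.partition(b"\n")
    # Drop the carriage return left over from Windows-style line endings.
    return first_line.rstrip(b"\r")


# In the error handler, the write could then become something like:
#     self.outfile.write(record_title(location.record) + b"\n")
print(record_title(b"CHEMBL1\r\n  demo\r\n\r\nM  END\r\n$$$$\r\n"))
```

The output file would still need to be opened in binary mode, since the titles 
are byte strings.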

If the input is gzip compressed, then you can get a bit of extra performance by 
off-loading the decompression to gzip, rather than using chemfp's own gzip 
reader, by setting the environment variable CHEMFP_GZIP to "gzip", like this:

  env CHEMFP_GZIP=gzip python chemfp_converter.py

If you do that, the progress bar will switch from showing the current read 
position / total input file size to showing the number of records processed 
per second.
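
For illustration, off-loading decompression to an external gzip process can be 
sketched with the standard library. This is not how chemfp implements it; 
open_gzip_reader is a hypothetical name, and the fallback to Python's gzip 
module is an assumption added so the sketch runs where no gzip program is 
installed. A reader on a pipe like this sees only decompressed bytes, which is 
presumably why a position within the compressed file is no longer available.

```python
import gzip
import shutil
import subprocess
from contextlib import contextmanager


@contextmanager
def open_gzip_reader(filename):
    """Yield a binary file object giving the decompressed content."""
    if shutil.which("gzip") is None:
        # No external gzip available: decompress in-process instead.
        with gzip.open(filename, "rb") as infile:
            yield infile
        return
    # "gzip -dc" decompresses to stdout in a separate process, so the
    # decompression can overlap with the parsing done in Python.
    proc = subprocess.Popen(
        ["gzip", "-dc", "--", filename], stdout=subprocess.PIPE)
    try:
        yield proc.stdout
    finally:
        proc.stdout.close()
        proc.wait()
```

Usage would look like "with open_gzip_reader(path) as infile:", followed by 
reading lines or records from infile as bytes.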

Best regards,

                                        Andrew
                                        da...@dalkescientific.com





