On Feb 11, 2025, at 10:06, Chris Swain via OpenBabel-discuss <openbabel-discuss@lists.sourceforge.net> wrote:
> I want to convert some very large sdf files to SMILES. I know that some
> structures will fail to convert.
>
> Is it possible to do the conversion creating two files, one containing the
> valid SMILES and another file containing the records that failed to convert?
> Either as a sdf file or simply a text file containing the title of molecules.

Since Geoff commented that it wasn't possible directly in Open Babel, I'll suggest an alternative.

Chemfp supports multiple cheminformatics toolkits. To make things easier for me and for chemfp users, I've implemented a "toolkit" wrapper API for consistent molecule I/O across the supported toolkits.

This includes a "text" toolkit which knows just enough about SMILES and SD files to read the records as text blocks, with no need for a chemistry toolkit. This was designed to get access to the original record in order to, for example, preserve the exact input atom order and aromaticity, or to add a note like "could not process file" to an SD data item.

For chemfp's Open Babel SDF reader wrapper, I added an option to choose either to use Open Babel to read the molecules (this is the default), or to use chemfp's text-based parser to identify the records and then have Open Babel parse each record to get the molecule.

Chemfp's wrapper API also has a way to pass in the error handler to use when Open Babel fails to parse a record, e.g., to ignore the problem and keep processing, or to stop processing immediately.

I'm pretty sure you can use the following, with some filename changes, to extract the records you mentioned.

import sys
from chemfp.io import ErrorHandler
from chemfp import openbabel_toolkit as T

# Create a user-defined error handler which writes
# the failing record to the specified file object.
class SaveErrors(ErrorHandler):
    def __init__(self, outfile):
        self.outfile = outfile

    # Called whenever there is an error.
    # The "location" object (always present for file I/O)
    # when used with the "chemfp" implementation stores the
    # current record in ".record", as a byte string.
    def error(self, msg, location=None, extra=None):
        assert location is not None
        sys.stderr.write(f"ERROR: {msg}\n")
        self.outfile.write(location.record)

with open("errors.sdf", "wb") as err_file:
    with T.read_molecules(
            "chembl_33.sdf.gz",
            # have the chemfp wrapper use chemfp to tokenize SDF
            # records instead of letting Open Babel parse everything.
            reader_args = {"implementation": "chemfp"},
            # Specify a user-defined error handler
            errors = SaveErrors(err_file),
            ) as reader:
        with T.open_molecule_writer("dest.smi") as writer:
            with reader.location.progress_bar() as progress_bar:
                writer.write_molecules(progress_bar(reader))

This is a bit slower than using Open Babel's native reader, though I don't recall how much.

If the input is gzip compressed, then you can get a bit of extra performance by off-loading the decompression to gzip, rather than using chemfp's own gzip reader, by setting the environment variable CHEMFP_GZIP to "gzip", like this:

  env CHEMFP_GZIP=gzip python chemfp_converter.py

If you do that, the progress bar will switch from showing the current read position / total input file size to showing the number of records processed per second.

Best regards,

Andrew
da...@dalkescientific.com

_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
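
As a rough illustration of what "reading the records as text blocks" can mean for an SD file, here is a minimal standard-library sketch. It is not chemfp's implementation: the function names iter_sdf_records and record_title are hypothetical, and it assumes only that each record ends with a "$$$$" line and that the first line of a record is its title.

```python
import io

def iter_sdf_records(infile):
    """Yield each SD record from a binary file object as one bytes block.

    A record ends at a line containing only "$$$$"; the terminator line
    is kept so each block can be written back out unchanged.
    """
    lines = []
    for line in infile:
        lines.append(line)
        if line.rstrip(b"\r\n") == b"$$$$":
            yield b"".join(lines)
            lines = []
    # Yield trailing text that lacks a terminator, unless it is only whitespace.
    if any(line.strip() for line in lines):
        yield b"".join(lines)

def record_title(record):
    """Return the title, i.e. the first line of the record, as text."""
    first_line = record.split(b"\n", 1)[0]
    return first_line.rstrip(b"\r").decode("utf-8", errors="replace")

# Small demonstration on an in-memory file holding two placeholder records.
demo = io.BytesIO(b"mol1\n  x\n\nM  END\n$$$$\nmol2\n  y\n\nM  END\n$$$$\n")
for record in iter_sdf_records(demo):
    print(record_title(record))
```

Because each block is the untouched input text, a failing record can be copied verbatim into an errors file, or just its title written to a text file, without any chemistry toolkit involved.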