Thanks everyone, lots for me to explore.

Cheers,
Chris

> On 13 Feb 2025, at 12:39, openbabel-discuss-requ...@lists.sourceforge.net wrote:
>
> Today's Topics:
>
>    1. Re: Errors in file conversion (Andrew Dalke)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 12 Feb 2025 22:34:50 +0100
> From: Andrew Dalke <da...@dalkescientific.com>
> To: Chris Swain <sw...@mac.com>
> Cc: Open Babel <openbabel-discuss@lists.sourceforge.net>
> Subject: Re: [Open Babel] Errors in file conversion
> Message-ID: <296b53a0-eed5-4b9e-9217-845651d9f...@dalkescientific.com>
> Content-Type: text/plain; charset=us-ascii
>
> On Feb 11, 2025, at 10:06, Chris Swain via OpenBabel-discuss
> <openbabel-discuss@lists.sourceforge.net> wrote:
>> I want to convert some very large sdf files to SMILES. I know that some
>> structures will fail to convert.
>>
>> Is it possible to do the conversion creating two files, one containing the
>> valid SMILES and another file containing the records that failed to convert?
>> Either as a sdf file or simply a text file containing the title of molecules.
>
> Since Geoff commented that it wasn't possible directly in Open Babel, I'll
> suggest an alternative.
>
> Chemfp supports multiple cheminformatics toolkits. To make things easier for
> me and for chemfp users, I've implemented a "toolkit" wrapper API for
> consistent molecule I/O across the supported toolkits.
>
> This includes a "text" toolkit which knows just enough about SMILES and SD
> files to read the records as text blocks, with no need for a chemistry
> toolkit. This was designed to get access to the original record in order to,
> for example, preserve the exact input atom order and aromaticity, or to add a
> note like "could not process file" to an SD data item.
>
> For chemfp's Open Babel SDF reader wrapper, I added an option to choose to
> use Open Babel to read the molecules (this is the default), or to use
> chemfp's text-based parser to identify the records, and then have Open Babel
> parse the record to get the molecule.
>
> Chemfp's wrapper API also has a way to pass in the error handler to use when
> Open Babel fails to parse a record, eg, to ignore the problem and keep
> processing, or to stop processing immediately.
>
> I'm pretty sure you can use the following, with some filename changes, to
> extract the records you mentioned.
>
>   import sys
>
>   from chemfp.io import ErrorHandler
>   from chemfp import openbabel_toolkit as T
>
>   # Create a user-defined error handler which writes
>   # the failing record to the specified file object.
>   class SaveErrors(ErrorHandler):
>       def __init__(self, outfile):
>           self.outfile = outfile
>
>       # Called whenever there is an error.
>       # The "location" object (always present for file I/O)
>       # when used with the "chemfp" implementation stores the
>       # current record in ".record", as a byte string.
>       def error(self, msg, location=None, extra=None):
>           assert location is not None
>           sys.stderr.write(f"ERROR: {msg}\n")
>           self.outfile.write(location.record)
>
>   with open("errors.sdf", "wb") as err_file:
>       with T.read_molecules(
>               "chembl_33.sdf.gz",
>               # have the chemfp wrapper use chemfp to tokenize SDF
>               # records instead of letting Open Babel parse everything.
>               reader_args = {"implementation": "chemfp"},
>               # Specify a user-defined error handler
>               errors = SaveErrors(err_file),
>               ) as reader:
>           with T.open_molecule_writer("dest.smi") as writer:
>               with reader.location.progress_bar() as progress_bar:
>                   writer.write_molecules(progress_bar(reader))
>
> This is a bit slower than using Open Babel's native reader, though I don't
> recall how much.
>
> If the input is gzip compressed, then you can get a bit extra performance by
> off-loading the decompression to gzip, rather than use chemfp's own gzip
> reader, by setting the environment variable CHEMFP_GZIP to "gzip", like this:
>
>   env CHEMFP_GZIP=gzip python chemfp_converter.py
>
> If you do that, the progress bar will switch from the current read position /
> total input file size, to showing the number of records processed per second.
>
> Best regards,
>
> Andrew
> da...@dalkescientific.com
>
> ------------------------------
>
> End of OpenBabel-discuss Digest, Vol 217, Issue 4
> *************************************************

_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss