Thanks, everyone. Lots for me to explore.

Cheers

Chris

> On 13 Feb 2025, at 12:39, openbabel-discuss-requ...@lists.sourceforge.net 
> wrote:
> 
> Today's Topics:
> 
>   1. Re: Errors in file conversion (Andrew Dalke)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Wed, 12 Feb 2025 22:34:50 +0100
> From: Andrew Dalke <da...@dalkescientific.com>
> To: Chris Swain <sw...@mac.com>
> Cc: Open Babel <openbabel-discuss@lists.sourceforge.net>
> Subject: Re: [Open Babel] Errors in file conversion
> Message-ID: <296b53a0-eed5-4b9e-9217-845651d9f...@dalkescientific.com>
> Content-Type: text/plain;     charset=us-ascii
> 
> On Feb 11, 2025, at 10:06, Chris Swain via OpenBabel-discuss 
> <openbabel-discuss@lists.sourceforge.net> wrote:
>> I want to convert some very large SDF files to SMILES.  I know that some 
>> structures will fail to convert.
>> 
>> Is it possible to do the conversion creating two files, one containing the 
>> valid SMILES and another file containing the records that failed to convert? 
>> Either as an SDF file or simply a text file containing the titles of the molecules.
> 
> Since Geoff commented that it wasn't possible directly in Open Babel, I'll 
> suggest an alternative.
> 
> Chemfp supports multiple cheminformatics toolkits. To make things easier for 
> me and for chemfp users, I've implemented a "toolkit" wrapper API for 
> consistent molecule I/O across the supported toolkits.
> 
> This includes a "text" toolkit which knows just enough about SMILES and SD 
> files to read the records as text blocks, with no need for a chemistry 
> toolkit. This was designed to get access to the original record in order to, 
> for example, preserve the exact input atom order and aromaticity, or to add a 
> note like "could not process file" to an SD data item.
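To illustrate the idea of reading SD records as plain text blocks, here is a minimal sketch in standard-library Python. It is not chemfp's "text" toolkit; the function names `iter_sdf_records` and `record_title` are made up for this example, and it assumes every record ends with a line containing only "$$$$".

```python
import io

def iter_sdf_records(lines):
    """Yield each SD record as one text block, including its "$$$$" line."""
    record = []
    for line in lines:
        record.append(line)
        # Assumption: a record ends with a line containing only "$$$$".
        if line.rstrip("\r\n") == "$$$$":
            yield "".join(record)
            record = []
    # Keep trailing text that lacks a terminator, unless it is only whitespace.
    leftover = "".join(record)
    if leftover.strip():
        yield leftover

def record_title(record):
    """Return the first line of the record, which holds the molecule title."""
    return record.split("\n", 1)[0].rstrip("\r")

# Two tiny, atom-free records, used only to exercise the splitter.
sample = (
    "mol1\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
    "mol2\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
)
records = list(iter_sdf_records(io.StringIO(sample)))
print([record_title(r) for r in records])
```

Because each record is kept as the original text, a record that later fails to parse can be written out unchanged, or only its title can be logged.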
> 
> For chemfp's Open Babel SDF reader wrapper, I added an option to choose 
> between using Open Babel to read the molecules (this is the default) and 
> using chemfp's text-based parser to identify the records, then having Open 
> Babel parse each record to get the molecule.
> 
> Chemfp's wrapper API also has a way to pass in the error handler to use when 
> Open Babel fails to parse a record, e.g., to ignore the problem and keep 
> processing, or to stop processing immediately.
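The two policies mentioned above (skip the bad record and continue, or stop at the first failure) can be sketched without any toolkit. This is only an illustration of the pattern, not chemfp's ErrorHandler API; `split_by_outcome`, `stop_on_error`, and `toy_convert` are hypothetical names, and `toy_convert` stands in for a real record parser.

```python
def split_by_outcome(records, convert, stop_on_error=False):
    """Apply convert() to each record and return (results, failures).

    failures is a list of (record, error message) pairs. With
    stop_on_error=True the first failure is re-raised instead.
    """
    results, failures = [], []
    for record in records:
        try:
            results.append(convert(record))
        except Exception as err:
            if stop_on_error:
                raise
            failures.append((record, str(err)))
    return results, failures

def toy_convert(record):
    # Stand-in for a real parser: reject empty input, otherwise normalize it.
    if not record.strip():
        raise ValueError("empty record")
    return record.strip().upper()

ok, bad = split_by_outcome(["cco", "", "c1ccccc1"], toy_convert)
print(ok)   # converted records
print(bad)  # records that failed, with the reason
```

Writing the `failures` list to a second file gives the "two output files" arrangement asked about in the original question.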
> 
> I'm pretty sure you can use the following, with some filename changes, to 
> extract the records you mentioned.
> 
> import sys
> 
> from chemfp.io import ErrorHandler
> from chemfp import openbabel_toolkit as T
> 
> # Create a user-defined error handler which writes
> # the failing record to the specified file object.
> class SaveErrors(ErrorHandler):
>    def __init__(self, outfile):
>        self.outfile = outfile
> 
>    # Called whenever there is an error.
>    # With the "chemfp" implementation, the "location" object
>    # (always present for file I/O) stores the current record
>    # in ".record" as a byte string.
>    def error(self, msg, location=None, extra=None):
>        assert location is not None
>        sys.stderr.write(f"ERROR: {msg}\n")
>        self.outfile.write(location.record)
> 
> with open("errors.sdf", "wb") as err_file:
>    with T.read_molecules(
>            "chembl_33.sdf.gz",
>            # have the chemfp wrapper use chemfp to tokenize SDF
>            # records instead of letting Open Babel parse everything.
>            reader_args = {"implementation": "chemfp"},
>            # Specify a user-defined error handler
>            errors = SaveErrors(err_file),
>            ) as reader:
>        with T.open_molecule_writer("dest.smi") as writer:
>            with reader.location.progress_bar() as progress_bar:
>                writer.write_molecules(progress_bar(reader))
> 
> This is a bit slower than using Open Babel's native reader, though I don't 
> recall how much.
> 
> If the input is gzip compressed, then you can get a bit of extra performance 
> by off-loading the decompression to gzip, rather than using chemfp's own gzip 
> reader, by setting the environment variable CHEMFP_GZIP to "gzip", like this:
> 
>  env CHEMFP_GZIP=gzip python chemfp_converter.py
> 
> If you do that, the progress bar will switch from showing the current read 
> position / total input file size to showing the number of records processed 
> per second.
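As a general illustration of off-loading decompression to an external gzip process, here is a small standard-library sketch. It is not how chemfp implements CHEMFP_GZIP; `read_gzip_lines` and its `external` parameter are hypothetical, and the code falls back to Python's gzip module when no gzip executable is on the PATH.

```python
import gzip
import os
import shutil
import subprocess
import tempfile

def read_gzip_lines(path, external=None):
    """Yield decoded text lines from a gzip-compressed file.

    If external is None, use the gzip executable when one is on the PATH.
    """
    if external is None:
        external = shutil.which("gzip") is not None
    if external:
        # Decompress in a child process; "-dc" writes the data to stdout.
        with subprocess.Popen(["gzip", "-dc", path],
                              stdout=subprocess.PIPE) as proc:
            for raw in proc.stdout:
                yield raw.decode("utf-8")
    else:
        # In-process decompression with Python's own gzip module.
        with gzip.open(path, "rt", encoding="utf-8") as infile:
            yield from infile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.sdf.gz")
    with gzip.open(path, "wb") as out:
        out.write(b"mol1\n$$$$\nmol2\n$$$$\n")
    lines_default = list(read_gzip_lines(path))
    lines_python = list(read_gzip_lines(path, external=False))

print(lines_default == lines_python)
```

With the external process the reader only sees a pipe, so it cannot tell how far it is through the compressed file, which is consistent with a progress display falling back to records per second.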
> 
> Best regards,
> 
>                                       Andrew
>                                       da...@dalkescientific.com
> 
> ------------------------------
> 
> Subject: Digest Footer
> 
> _______________________________________________
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
> 
> 
> ------------------------------
> 
> End of OpenBabel-discuss Digest, Vol 217, Issue 4
> *************************************************


