Re: [Open Babel] changes to fingerprint generation, and FPS output

Andrew Dalke Wed, 09 Jan 2019 12:23:40 -0800

On Jan 9, 2019, at 15:45, Noel O'Boyle <baoille...@gmail.com> wrote:
> Making such an API addition would commit us to correctly 
> maintaining the underlying information, and that's a maintenance task I'm 
> not willing to take on.


Certainly, and I appreciate that maintenance is an important factor.

Let me try this again.

Hi! Open Babel supports the FPS output format. What can I do to help maintain 
that plugin?

When Open Babel decided to support the FPS format, I realized they took on a 
maintenance task. For example, the FPS format at that time, at 
https://code.google.com/archive/p/chem-fingerprints/wikis/FPS.wiki , says "The 
version MUST change whenever the underlying fingerprint algorithm changes."

I noticed that the implementations in version control have changed. I want to 
help update the version numbers so they will be ready for the next release.

I have a lot of Python experience but haven't really used C++ in 20 years - and 
even then without shared library experience - so I'm afraid I don't really know 
what I'm doing with that language. But I'll try!

My proposals to simply update the version number to "/3", and to change the 
plugin API to allow version numbers, don't seem like they were acceptable. What 
about changing the plugin so it is structured like this:

  const char *plugin_version = "1";
  const char *plugin_id = _pFP->GetID();

  if (!strcmp(plugin_id, "MACCS")) {
    plugin_version = "3";
  }
  ...
        << "#type=OpenBabel-" << plugin_id << "/" << plugin_version << '\n'

so that it is possible to have different version numbers for each fingerprint 
type, embedded as logic inside of the plugin?

If this third alternative seems like it might be acceptable, then I can work on 
it further and submit a pull request for a more detailed examination.
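
To make the idea a bit more concrete, here is a minimal sketch of how that 
logic might be pulled into a small lookup table. The names FingerprintVersion, 
kVersions, GetTypeVersion and MakeTypeLine are made up for illustration and are 
not part of Open Babel. Only the MACCS entry comes from the fragment above; any 
other entries would need to be checked against the version control history.

```cpp
#include <cstring>
#include <string>

// Hypothetical table mapping a fingerprint id to its type version.
// Only "MACCS" -> "3" is taken from the proposal; nothing else is verified.
struct FingerprintVersion {
  const char *id;
  const char *version;
};

static const FingerprintVersion kVersions[] = {
  {"MACCS", "3"},
};

// Return the type version for a fingerprint id, falling back to "1"
// for any id that is not listed in the table.
const char *GetTypeVersion(const char *plugin_id) {
  for (const FingerprintVersion &entry : kVersions) {
    if (std::strcmp(plugin_id, entry.id) == 0) {
      return entry.version;
    }
  }
  return "1";
}

// Build the "#type" header line (without the trailing newline).
std::string MakeTypeLine(const char *plugin_id) {
  return std::string("#type=OpenBabel-") + plugin_id + "/" +
         GetTypeVersion(plugin_id);
}
```

With a table like this, a change to one fingerprint algorithm would mean 
editing one entry, and the other fingerprint types would keep their versions.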

> In addition, I don't personally think that this information adds anything 
> beyond reporting the version number.

And I personally think it does. I know I've liked it when chemfp reported 
mismatched versions when I accidentally used an outdated file.

But that's okay. Open Babel doesn't need to follow my feelings. I can totally 
see that if the overall fingerprint generation process in Open Babel 
effectively changes for every release, then the type version adds nothing.

In that case, here's a fourth proposal. Change the FPS plugin so the output 
"#type" includes the Babel version, that is:

        << "#type=OpenBabel-" << _pFP->GetID() << "/" <<  BABEL_VERSION << '\n'

The output would look something like "OpenBabel-MACCS/2.4.1", which is a 
valid FPS fingerprint type string.

This type string provides a hint to all downstream users that they need to 
re-build their fingerprints after every Open Babel upgrade.

If this seems like it might be acceptable, then I can submit a pull request for 
it.


Some background: the FPS format has both a "#type" line with a fingerprint 
type version and a "#software" line with a version of the software because 
other tools have different needs than Open Babel. I might have my own tool 
built on Open Babel which generates fingerprints. The releases are:

  1.0 - initial release, v1 of the fingerprint

  1.1 - performance improvements, fingerprints don't change

  1.2 - fix a bug in my implementation, call the new version v2,
     but allow v1 to be selected as a command-line option

  1.3 - port to MS Windows, otherwise unchanged

  1.4 - change to require C++14 instead of C++03, otherwise unchanged

  1.5 - switch to faster PRNG as the default, call it v3,
     and allow v1 and v2 to be selected as a command-line option.

That is, version 1.5 of my program can generate versions 1, 2, and 3 of the 
fingerprint type I created, selectable at run-time. These fingerprint types 
might be available for reproducibility or contractual reasons, among others.

In this case, the program version cannot be used to distinguish between the 
fingerprint types, which is why there is a distinct type. 

As another example, RDKit's and OpenEye's fingerprints have not changed for 
many years, though they both have had multiple releases during that time. 
Releases for those toolkits typically add features elsewhere. There is no 
reason for users of those toolkits to re-build their fingerprint data sets 
after each release.

My hope is to simplify and standardize the mechanism used to inform users that 
there may be a problem. For example, a web server might use a component, 
installed with 'pip', which has a hard-coded set of fingerprints. Users might 
not even be aware that that FPS file exists. To help reduce silent errors, that 
component might generate a log message about a potential problem if it sees a 
version skew.

That's why chemfp includes a function to compare two FPS headers and generate a 
list of errors (like comparing 1024 vs. 2048 bit fingerprints), warnings (e.g., 
different fingerprint types) and "info"-level messages (e.g., different program 
versions) for things that typically aren't important - and lets the calling 
software decide what to do with each severity level and message type.
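
As a rough illustration of that kind of check, here is a minimal C++ sketch. 
It is not chemfp's actual API (chemfp is written in Python); the names 
Severity, Issue, FpsHeader and CompareHeaders are hypothetical. It assumes the 
header has already been parsed into key/value pairs with the leading '#' 
removed, and that the fingerprint size is stored under a "num_bits" key.

```cpp
#include <map>
#include <string>
#include <vector>

enum class Severity { Info, Warning, Error };

struct Issue {
  Severity severity;
  std::string message;
};

// Hypothetical in-memory form of a parsed FPS header: key -> value,
// for example {"type", "OpenBabel-MACCS/2"}.
using FpsHeader = std::map<std::string, std::string>;

// Look up a header value, returning an empty string if the key is absent.
static std::string Get(const FpsHeader &header, const std::string &key) {
  auto it = header.find(key);
  return it == header.end() ? std::string() : it->second;
}

// Compare two headers and report every mismatch with a severity level.
// The caller decides what to do with each level.
std::vector<Issue> CompareHeaders(const FpsHeader &a, const FpsHeader &b) {
  std::vector<Issue> issues;
  if (Get(a, "num_bits") != Get(b, "num_bits")) {
    issues.push_back({Severity::Error, "different number of bits"});
  }
  if (Get(a, "type") != Get(b, "type")) {
    issues.push_back({Severity::Warning, "different fingerprint types"});
  }
  if (Get(a, "software") != Get(b, "software")) {
    issues.push_back({Severity::Info, "different software versions"});
  }
  return issues;
}
```

A web server component could, for instance, refuse to start on any Error, log 
each Warning, and ignore Info messages entirely.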

Best regards,


                                Andrew
                                da...@dalkescientific.com




