I'm trying to understand how to use the ECFP fingerprints added in Open Babel 
2.4.

They generate fingerprints where the length is a function of the number of 
heavy atoms. I want a fixed length binary fingerprint so I can compare two 
fingerprints using the normal binary Tanimoto. (This is for chemfp.) It doesn't 
seem possible. It seems like what's missing is something like a hash method to 
turn these per-atom values into something that fits in with how the other 
fingerprints work.


Here's an example which shows how the length varies:

>>> import openbabel as ob
>>> import pybel
>>>
>>> mol1 = pybel.readstring("smi", "C").OBMol
>>> mol3 = pybel.readstring("smi", "CCC").OBMol
>>> mol9 = pybel.readstring("smi", "C"*9).OBMol
>>> mol_info = [("mol1", mol1), ("mol3", mol3), ("mol9", mol9)]
>>>
>>> fptype = ob.OBFingerprint.FindFingerprint("ECFP0")
>>>
>>> # Show that the fingerprint length depends on the number of heavy atoms
... for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp)
...     print("%s: %r" % (name, list(fp)))
...
True
mol1: [1526808443]
True
mol3: [3405580958, 3405580958, 3756301279]
True
mol9: [3405580958, 3405580958, 3756301279, 3756301279, 3756301279, 3756301279, 
3756301279, 3756301279, 3756301279]

I understand what it's showing. Each heavy atom has its own value in the list, 
and the list is sorted to give a canonical ordering.

More specifically, the [CH4] generates the characteristic value 1526808443, 
each of the two "[CH3]-" groups generates the characteristic value 3405580958, 
and each "-[CH2]-" generates the characteristic value 3756301279.


However, the non-ECFP fingerprints all generate a constant-size result, and 
parts of Open Babel will break when given a variable-length fingerprint.

For example, the fast search indexing in FastSearchIndex::Add() 
(fingerprint.cpp:322) assumes that the fingerprint vector returned from 
GetFingerprint() always has the same size, where headwords = vectors.size(). 
If you generate a .fs file using "-ofs -xfECFP0" then index creation works, 
but the similarity search fails with "Difficulty reading from index".


Is there an Open Babel function to compare two of these variable-length 
fingerprints? It looks like a count-based Tanimoto is needed, so mol3 and mol9 
have a similarity of (2+1)/(2+7) = 3/9 = 1/3.
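For reference, here is a minimal pure-Python sketch of the count-based 
Tanimoto I have in mind. The name count_tanimoto is made up for illustration; 
it is not an Open Babel function. It treats each fingerprint as a multiset of 
per-atom values and divides the sum of the minimum counts by the sum of the 
maximum counts.

```python
from collections import Counter

def count_tanimoto(fp_a, fp_b):
    """Count-based (multiset) Tanimoto: sum of min counts / sum of max counts."""
    counts_a = Counter(fp_a)
    counts_b = Counter(fp_b)
    keys = set(counts_a) | set(counts_b)
    if not keys:
        return 0.0  # convention chosen here for two empty fingerprints
    # A Counter returns 0 for a missing key, so absent features count as 0.
    shared = sum(min(counts_a[k], counts_b[k]) for k in keys)
    total = sum(max(counts_a[k], counts_b[k]) for k in keys)
    return float(shared) / total

# ECFP0 values copied from the mol3 and mol9 output above
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7
print(count_tanimoto(mol3_fp, mol9_fp))
```

For mol3 and mol9 this evaluates (2+1)/(2+7), the 1/3 worked out by hand 
above.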

Is there any way to turn this into a useful fixed-length fingerprint? I tried 
to generate FPS output using "-ofps -xfECFP0" but the fingerprint content was 
empty.


I could zero-pad small fingerprints, but that doesn't make them comparable. 
For example, "C", "O", and "CO" would give [X, 0], [Y, 0], and either [X, Y] 
or [Y, X], and a position-by-position comparison of those won't give the 
right scores.

The current folding method also isn't really useful for larger fingerprints. 
There's an nBits parameter of GetFingerprint():


>>> for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp, 128)
...     print("%s (fold 128): %r" % (name, list(fp)))
...
True
mol1 (fold 128): [1526808443]
True
mol3 (fold 128): [3405580958, 3405580958, 3756301279]
True
mol9 (fold 128): [3757939679, 3757939679, 3756301279, 3756301279]

The underlying code in fingerecfp.cpp implements this by calling Fold(). 
However, if you want to be able to compare the post-folded fingerprints, then 
this only works if the initial positions are globally invariant for the given 
characteristic. But in the ECFP case the initial position depends on the other 
features in the molecule, because the fingerprints are sorted.
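To make the position problem concrete, here is a rough pure-Python model of 
the folding step. This is my assumption about what Fold() does (OR the upper 
half of the words onto the lower half while the list is at least twice 
nBits long), not the actual Open Babel code, and fold_words is a made-up 
name. It does reproduce the folded mol9 output shown above.

```python
def fold_words(words, nbits, bits_per_word=32):
    """OR the upper half of the word list onto the lower half, repeatedly.

    Assumption: a model of the observed Fold() behavior, not the real code.
    """
    if nbits <= 0:
        raise ValueError("nbits must be positive")
    words = list(words)
    while len(words) * bits_per_word // 2 >= nbits:
        half = len(words) // 2
        # Word i is merged with word i + half. In this sketch an odd
        # trailing word is simply dropped.
        words = [a | b for a, b in zip(words[:half], words[half:2 * half])]
    return words

CH3 = 3405580958  # value for "[CH3]-" from the output above
CH2 = 3756301279  # value for "-[CH2]-" from the output above
mol9_fp = [CH3, CH3] + [CH2] * 7
print(fold_words(mol9_fp, 128))
```

In this model the CH3 value at position 0 is ORed with whatever sits at 
position 4. Which features get merged therefore depends on how many atoms 
the molecule has and how they sort, not only on the features themselves.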



(Also, there's a bug where nBits has no effect until the unfolded fingerprint 
is at least twice nBits long. Here, folding to 128 bits only kicks in at 8 
atoms, that is, 8 32-bit words = 256 bits:

>>> for i in range(4, 10):
...   mol = pybel.readstring("smi", "C"*i).OBMol
...   fp = ob.vectorUnsignedInt()
...   fptype.GetFingerprint(mol, fp, 128)
...   print("%s: %r" % (i, list(fp)))
...
True
4: [4293393407, 3757939679, 4026359518, 4286507510]
True
5: [4293393407, 4294433791, 4026359518, 3942468319, 4293835775]
True
6: [4293393407, 4294433791, 4294433791, 3942472702, 4026359518, 4293835775]
True
7: [4293393407, 4294433791, 4294433791, 4294433791, 4026359518, 3942468319, 
4293835775]
True
8: [4294441983, 4294959103, 4294967295, 4294966271]
True
9: [4294441983, 4294433791, 4294959103, 4294958079]
)

To make a long email short, it feels like there should be an entirely different 
function than folding to turn these lists of per-atom ECFP values into the type 
of fingerprint that the rest of Open Babel (and of chemfp) can use.
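As a sketch of the kind of function I mean: hash each per-atom value to a bit 
position that depends only on the value itself, and set that bit in a 
fixed-length fingerprint. Everything here is hypothetical and pure Python. 
The names hashed_fingerprint and binary_tanimoto, the default of 1024 bits, 
and the use of a plain modulo as the "hash" are my assumptions, not anything 
in Open Babel.

```python
def hashed_fingerprint(values, nbits=1024):
    """Set bit (value mod nbits) for each per-atom value; returns a Python int."""
    bits = 0
    for value in values:
        # The bit position depends only on the feature value, never on
        # where the value appears in the sorted list.
        bits |= 1 << (value % nbits)
    return bits

def binary_tanimoto(fp_a, fp_b):
    """Binary Tanimoto on two fingerprints stored as Python ints."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return float(bin(fp_a & fp_b).count("1")) / union

# ECFP0 values copied from the output earlier in this email
mol1_fp = hashed_fingerprint([1526808443])
mol3_fp = hashed_fingerprint([3405580958, 3405580958, 3756301279])
mol9_fp = hashed_fingerprint([3405580958, 3405580958] + [3756301279] * 7)
print(binary_tanimoto(mol1_fp, mol3_fp))
print(binary_tanimoto(mol3_fp, mol9_fp))
```

The trade-off is that the counts are lost. At radius 0, mol3 and mol9 contain 
the same two features, so their binary fingerprints come out identical, and 
distinct values can collide on the same bit.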


