Re: [Open Babel] how to use the new ECFP fingerprints?

Noel O'Boyle Wed, 29 Mar 2017 01:32:15 -0700

The ECFP is new to Open Babel and hasn't been sorted out properly.
Geoff's been on to me to look into it, but it's way down my list at
the moment. So in short, I agree, and encourage a prospective user to
step up and look into it.


On 29 March 2017 at 04:53, Andrew Dalke <da...@dalkescientific.com> wrote:
> I'm trying to understand how to use the ECFP fingerprints added in Open Babel 
> 2.4.
>
> They generate fingerprints where the length is a function of the number of 
> heavy atoms. I want a fixed length binary fingerprint so I can compare two 
> fingerprints using the normal binary Tanimoto. (This is for chemfp.) It 
> doesn't seem possible. It seems like what's missing is something like a hash 
> method to turn these per-atom values into something that fits in with how the 
> other fingerprints work.
>
>
> Here's an example which shows how the length varies:
>
> >>> import openbabel as ob
> >>> import pybel
> >>>
> >>> mol1 = pybel.readstring("smi", "C").OBMol
> >>> mol3 = pybel.readstring("smi", "CCC").OBMol
> >>> mol9 = pybel.readstring("smi", "C"*9).OBMol
> >>> mol_info = [("mol1", mol1), ("mol3", mol3), ("mol9", mol9)]
> >>>
> >>> fptype = ob.OBFingerprint.FindFingerprint("ECFP0")
> >>>
> >>> # Show that the size is a function of length
> ... for name, mol in mol_info:
> ...     fp = ob.vectorUnsignedInt()
> ...     fptype.GetFingerprint(mol, fp)
> ...     print("%s: %r" % (name, list(fp)))
> ...
> True
> mol1: [1526808443]
> True
> mol3: [3405580958, 3405580958, 3756301279]
> True
> mol9: [3405580958, 3405580958, 3756301279, 3756301279, 3756301279, 
> 3756301279, 3756301279, 3756301279, 3756301279]
>
> I understand what it's showing. Each heavy atom has its own value in the 
> list, and the list is sorted to give a canonical ordering.
>
> More specifically, the [CH4] generates the characteristic value 1526808443,
> each of the two "[CH3]-" groups generates the characteristic value 3405580958,
> and each "-[CH2]-" generates the characteristic value 3756301279.
>
>
> However, the non-ECFP fingerprints all generate a constant-size fingerprint,
> and parts of Open Babel will break with a variable-length one.
>
> For example, the fast search indexing in fingerprint.cpp:322
> FastSearchIndex::Add() assumes the fingerprint vector returned from
> GetFingerprint() has a constant length, where headwords = vectors.size(). If
> you try to generate a .fs file using "-ofs -xfECFP0" then it will work, but the
> similarity search will fail with "Difficulty reading from index".
>
>
> Is there an Open Babel function to compare two of these variable-length 
> fingerprints? It looks like a count-based Tanimoto is needed, so mol3 and 
> mol9 have a similarity of (2+1)/(2+7) = 3/9 = 1/3.
>
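A minimal sketch of the count-based Tanimoto described above, in plain Python. It is not an Open Babel function: `count_tanimoto` is a hypothetical helper, and it assumes each fingerprint is simply the list of per-atom values that GetFingerprint() filled in, treated as a multiset.

```python
from collections import Counter

def count_tanimoto(fp_a, fp_b):
    """Count-based Tanimoto: sum of per-value minimum counts over sum of maximum counts."""
    counts_a = Counter(fp_a)
    counts_b = Counter(fp_b)
    # For Counters, & keeps the minimum count per key and | keeps the maximum.
    shared = sum((counts_a & counts_b).values())
    total = sum((counts_a | counts_b).values())
    return shared / total if total else 0.0

# Per-atom values copied from the ECFP0 output shown earlier in this message.
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7

print(count_tanimoto(mol3_fp, mol9_fp))
```

For mol3 and mol9 this evaluates (2+1)/(2+7), the 1/3 worked out above. It assumes Python 3 division.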
> Is there any way to turn this into a useful fixed-length fingerprint? I tried 
> to generate FPS output using "-ofps -xfECFP0" but the fingerprint content was 
> empty.
>
>
> I could zero-pad small fingerprints, but it's not really possible to compare, 
> say, "C", "O", and "CO" as the corresponding values of [X, 0], [Y, 0], and 
> either [X, Y] or [Y, X] won't give the right comparison scores.
>
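To make the zero-padding problem concrete, here is a small sketch using a word-by-word binary Tanimoto. The values X and Y are made-up placeholders, not real ECFP values for "C" or "O", and `binary_tanimoto` is a hypothetical helper rather than anything in Open Babel.

```python
def binary_tanimoto(words_a, words_b):
    """Positional binary Tanimoto over two equal-length lists of integer words."""
    both = sum(bin(a & b).count("1") for a, b in zip(words_a, words_b))
    either = sum(bin(a | b).count("1") for a, b in zip(words_a, words_b))
    return both / either if either else 0.0

# Hypothetical per-atom values with no bits in common.
X = 0xF0
Y = 0x0F

fp_C = [X, 0]       # "C", zero-padded to two words
fp_O = [Y, 0]       # "O", zero-padded to two words
fp_CO_xy = [X, Y]   # "CO" if X sorts first
fp_CO_yx = [Y, X]   # "CO" if Y sorts first

for name, fp_CO in (("[X, Y]", fp_CO_xy), ("[Y, X]", fp_CO_yx)):
    print(name, binary_tanimoto(fp_C, fp_CO), binary_tanimoto(fp_O, fp_CO))
```

Whichever order "CO" ends up in, one of the two single-atom molecules is compared against the wrong word, so its score reflects accidental bit overlap between X and Y and not the atom the two molecules actually share.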
> The current folding method also isn't really useful for larger fingerprints. 
> There's an nBits parameter of GetFingerprint():
>
>
> >>> for name, mol in mol_info:
> ...     fp = ob.vectorUnsignedInt()
> ...     fptype.GetFingerprint(mol, fp, 128)
> ...     print("%s (fold 128): %r" % (name, list(fp)))
> ...
> True
> mol1 (fold 128): [1526808443]
> True
> mol3 (fold 128): [3405580958, 3405580958, 3756301279]
> True
> mol9 (fold 128): [3757939679, 3757939679, 3756301279, 3756301279]
>
> The underlying code in fingerecfp.cpp implements this by calling Fold(). 
> However, if you want to be able to compare the post-folded fingerprints, then 
> this only works if the initial positions are globally invariant for the given 
> characteristic. But in the ECFP case the initial position depends on the 
> other features in the molecule, because the fingerprints are sorted.
>
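The folding behaviour can be sketched in plain Python. This is inferred from the outputs quoted in this message and is not copied from fingerprint.cpp: it assumes 32-bit words, and assumes that Fold() repeatedly ORs the upper half of the vector onto the lower half for as long as half the current bit length is still at least nBits. `fold_sketch` is a hypothetical name.

```python
BITS_PER_WORD = 32  # assumption: each vector element is a 32-bit word

def fold_sketch(words, nbits):
    """OR the upper half of the word list onto the lower half until it is short enough."""
    if nbits < BITS_PER_WORD:
        raise ValueError("nbits must be at least one word")
    words = list(words)
    while len(words) * BITS_PER_WORD // 2 >= nbits:
        half = len(words) // 2
        # With an odd number of words, the last word is dropped by this step.
        words = [words[i] | words[half + i] for i in range(half)]
    return words

# Per-atom values copied from the unfolded ECFP0 output for mol3 and mol9.
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7

print(fold_sketch(mol3_fp, 128))
print(fold_sketch(mol9_fp, 128))
```

Under those assumptions the sketch leaves mol3 unchanged and reduces mol9 to the same four values as the "fold 128" output above. It also shows the problem: which values get ORed together depends only on where they land after sorting, so the same feature can end up in different words in different molecules.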
>
>
> (Also, there's a bug where nBits has no effect until the fingerprint is at
> least twice as long as that value:
>
> >>> for i in range(4, 10):
> ...   mol = pybel.readstring("smi", "C"*i).OBMol
> ...   fp = ob.vectorUnsignedInt()
> ...   fptype.GetFingerprint(mol, fp, 128)
> ...   print("%s: %r" % (i, list(fp)))
> ...
> True
> 4: [4293393407, 3757939679, 4026359518, 4286507510]
> True
> 5: [4293393407, 4294433791, 4026359518, 3942468319, 4293835775]
> True
> 6: [4293393407, 4294433791, 4294433791, 3942472702, 4026359518, 4293835775]
> True
> 7: [4293393407, 4294433791, 4294433791, 4294433791, 4026359518, 3942468319, 
> 4293835775]
> True
> 8: [4294441983, 4294959103, 4294967295, 4294966271]
> True
> 9: [4294441983, 4294433791, 4294959103, 4294958079]
> )
>
> To make a long email short, it feels like there should be an entirely
> different function than folding to turn these lists of per-atom ECFP values
> into the type of fingerprint that the rest of Open Babel (and of chemfp) can
> use.
>
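One possible shape for such a function, as a sketch only: map each per-atom value to a bit position that depends on the value alone, so the same feature always sets the same bit regardless of what else is in the molecule. The name `to_fixed_length`, the default of 1024 bits, and the use of a plain modulo as the "hash" are all assumptions for illustration; nothing like this exists in Open Babel 2.4 as far as this thread establishes.

```python
def to_fixed_length(identifiers, nbits=1024):
    """Set one bit per distinct identifier in a fixed-length list of 32-bit words."""
    if nbits <= 0 or nbits % 32 != 0:
        raise ValueError("nbits must be a positive multiple of 32")
    words = [0] * (nbits // 32)
    for ident in set(identifiers):
        bit = ident % nbits          # position depends only on the value itself
        words[bit // 32] |= 1 << (bit % 32)
    return words

# Per-atom values copied from the ECFP0 output shown earlier in this message.
mol1_fp = [1526808443]
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7

for name, fp in (("mol1", mol1_fp), ("mol3", mol3_fp), ("mol9", mol9_fp)):
    print(name, len(to_fixed_length(fp)))
```

Every molecule now yields the same number of words, so the normal binary Tanimoto applies. The trade-off is that counts are lost: mol3 and mol9 contain the same two distinct values and so get identical bit vectors, and unrelated values can collide on the same bit.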
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss

