I'm trying to understand how to use the ECFP fingerprints added in Open Babel 2.4.
They generate fingerprints where the length is a function of the number of heavy atoms. I want a fixed-length binary fingerprint so I can compare two fingerprints using the normal binary Tanimoto. (This is for chemfp.) It doesn't seem possible. It seems like what's missing is something like a hash method to turn these per-atom values into something that fits in with how the other fingerprints work.

Here's an example which shows how the length varies:

>>> import openbabel as ob
>>> import pybel
>>>
>>> mol1 = pybel.readstring("smi", "C").OBMol
>>> mol3 = pybel.readstring("smi", "CCC").OBMol
>>> mol9 = pybel.readstring("smi", "C"*9).OBMol
>>> mol_info = [("mol1", mol1), ("mol3", mol3), ("mol9", mol9)]
>>>
>>> fptype = ob.OBFingerprint.FindFingerprint("ECFP0")
>>>
>>> # Show that the size is a function of length
... for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp)
...     print("%s: %r" % (name, list(fp)))
...
True
mol1: [1526808443]
True
mol3: [3405580958, 3405580958, 3756301279]
True
mol9: [3405580958, 3405580958, 3756301279, 3756301279, 3756301279, 3756301279, 3756301279, 3756301279, 3756301279]

I understand what it's showing. Each heavy atom has its own value in the list, and the list is sorted to give a canonical ordering. More specifically, the [CH4] generates the characteristic value 1526808443, the two "[CH3]-" each generate the characteristic value 3405580958, and each "-[CH2]-" generates the characteristic value 3756301279.

However, the non-ECFP fingerprints all generate a constant size, and parts of Open Babel will break with a variable-length fingerprint. For example, the fast search indexing in fingerprint.cpp:322 FastSearchIndex::Add() assumes the size of the fingerprint vector returned from GetFingerprint() will be constant, where headwords = vectors.size(). If you try to generate a .fs file using "-ofs -xfECFP0" then it will work, but the similarity search will fail with "Difficulty reading from index".
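As far as I can tell, the natural way to read these lists is as counts of per-atom identifiers rather than as bit positions. Here is a minimal sketch in plain Python, using the values printed above, of that count view and of a count-based Tanimoto (sum of the minimum counts over sum of the maximum counts). The helper name count_tanimoto is my own and is not an Open Babel function.

```python
from collections import Counter

# Per-atom ECFP0 values copied from the output above.
mol1 = [1526808443]
mol3 = [3405580958, 3405580958, 3756301279]
mol9 = [3405580958, 3405580958] + [3756301279] * 7


def count_tanimoto(values1, values2):
    """Count-based Tanimoto: sum(min counts) / sum(max counts).

    Hypothetical helper for illustration; not part of Open Babel.
    """
    c1, c2 = Counter(values1), Counter(values2)
    keys = set(c1) | set(c2)
    # A Counter returns 0 for a missing key, so absent features count as 0.
    shared = sum(min(c1[k], c2[k]) for k in keys)
    total = sum(max(c1[k], c2[k]) for k in keys)
    return shared / total if total else 0.0


print(Counter(mol3))               # two [CH3]- values, one -[CH2]- value
print(count_tanimoto(mol3, mol9))  # (2+1)/(2+7)
print(count_tanimoto(mol1, mol3))  # no identifiers in common
```

This works on lists of any length, which is exactly why it does not fit the fixed-size assumption made elsewhere.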
Is there an Open Babel function to compare two of these variable-length fingerprints? It looks like a count-based Tanimoto is needed, so mol3 and mol9 have a similarity of (2+1)/(2+7) = 3/9 = 1/3.

Is there any way to turn this into a useful fixed-length fingerprint? I tried to generate FPS output using "-ofps -xfECFP0" but the fingerprint content was empty.

I could zero-pad small fingerprints, but it's not really possible to compare, say, "C", "O", and "CO", as the corresponding values of [X, 0], [Y, 0], and either [X, Y] or [Y, X] won't give the right comparison scores.

The current folding method also isn't really useful for larger fingerprints. There's an nBits parameter of GetFingerprint():

>>> for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp, 128)
...     print("%s (fold 128): %r" % (name, list(fp)))
...
True
mol1 (fold 128): [1526808443]
True
mol3 (fold 128): [3405580958, 3405580958, 3756301279]
True
mol9 (fold 128): [3757939679, 3757939679, 3756301279, 3756301279]

The underlying code in fingerecfp.cpp implements this by calling Fold(). However, if you want to be able to compare the post-folded fingerprints, then this only works if the initial positions are globally invariant for the given characteristic. But in the ECFP case the initial position depends on the other features in the molecule, because the fingerprints are sorted.

(Also, there's a bug where nBits has no effect until the fingerprint has at least twice that many bits:

>>> for i in range(4, 10):
...     mol = pybel.readstring("smi", "C"*i).OBMol
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp, 128)
...     print("%s: %r" % (i, list(fp)))
...
True
4: [4293393407, 3757939679, 4026359518, 4286507510]
True
5: [4293393407, 4294433791, 4026359518, 3942468319, 4293835775]
True
6: [4293393407, 4294433791, 4294433791, 3942472702, 4026359518, 4293835775]
True
7: [4293393407, 4294433791, 4294433791, 4294433791, 4026359518, 3942468319, 4293835775]
True
8: [4294441983, 4294959103, 4294967295, 4294966271]
True
9: [4294441983, 4294433791, 4294959103, 4294958079]

)

To make a long email short, it feels like there should be an entirely different function than folding to turn these lists of per-atom ECFP values into the type of fingerprint that the rest of Open Babel (and of chemfp) can use.
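To make concrete what I mean by a different function, here is a rough sketch in plain Python (Python 3). For each per-atom value it sets bit (value mod nbits) in a fixed-length bit vector, so a feature's position depends only on its own value and not on the other atoms in the molecule. The names hash_fold and binary_tanimoto, the use of modulo as the "hash", and the 1024-bit default are all my own assumptions for illustration. None of this is existing Open Babel API, and a real version would probably want a better-mixing hash.

```python
# Per-atom ECFP0 values copied from the first example in this email.
mol1 = [1526808443]
mol3 = [3405580958, 3405580958, 3756301279]
mol9 = [3405580958, 3405580958] + [3756301279] * 7


def hash_fold(values, nbits=1024):
    """Map per-atom values to a fixed-length bit vector.

    Assumption: bit position = value % nbits. Hypothetical helper,
    not an Open Babel function. Duplicate values set the same bit,
    so counts are lost, and distinct values may collide.
    """
    fp = bytearray(nbits // 8)
    for value in values:
        bit = value % nbits
        fp[bit // 8] |= 1 << (bit % 8)
    return bytes(fp)


def binary_tanimoto(fp1, fp2):
    """Ordinary binary Tanimoto on two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


fp1, fp3, fp9 = hash_fold(mol1), hash_fold(mol3), hash_fold(mol9)
print(len(fp1), len(fp3), len(fp9))  # same length for every molecule
print(binary_tanimoto(fp1, fp3))
print(binary_tanimoto(fp3, fp9))
```

Every molecule now gives the same number of bytes, so the normal binary Tanimoto applies. The cost is that counts are discarded: "CCC" and "C"*9 set the same two bits, so they come out identical here even though the count-based score is 1/3.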