On Jan 11, 2018, at 12:04, Wandré <[email protected]> wrote:
> Thanks for the link. It is very interesting. I will read very carefully.
> So, as input on ChemFP, I have to put a file with all molecules in 1 SDF?
Chemfp works with fingerprint files; in your case, that means chemfp's text-based
"FPS" format. You'll need to use 'rdkit2fps' to convert your InChI structures
into fingerprints.
Here's an example file, where I follow the Open Babel convention of allowing an
identifier after the InChI string:
% cat examples.inchi
InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H phenol
InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H benzene
InChI=1S/CH4/h1H4/i1D4 deuterated methane
You could also use an SDF or SMILES file.
Next, I generate AtomPair fingerprints. The output goes to "examples.fps",
which I'll then display.
% rdkit2fps --pairs examples.inchi -o examples.fps
% cat examples.fps
#FPS1
#num_bits=2048
#type=RDKit-AtomPair/2 fpSize=2048 minLength=1 maxLength=30
#software=RDKit/2016.09.3 chemfp/3.1
#source=examples.inchi
#date=2018-01-11T14:38:57
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000001000000000000000000000000000000000000310000000003000000000000000000000000000000000000000000007003000000000000000000000300000000000000000000000000000000000000073000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
phenol
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000007000000000000000000000000000000000000000000000000000000000000000070000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
benzene
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000000000000000000000000000000000000000000000000000000000000000070000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
deuterated methane
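In the FPS file itself, each record is a single line: the hex-encoded fingerprint, a tab, then the identifier. Lines starting with "#" are header lines. As a rough illustration of what a tool does with those records, here is a minimal pure-Python sketch that reads FPS text and computes a Tanimoto similarity. It is not chemfp's implementation. The function names (read_fps, popcount, tanimoto) and the 16-bit toy records are made up for this example, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
import io

def read_fps(lines):
    """Yield (identifier, fingerprint_bytes) pairs from FPS-formatted text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines such as "#num_bits=2048"
        hex_fp, _, identifier = line.partition("\t")
        yield identifier, bytes.fromhex(hex_fp)

def popcount(fp):
    """Count the bits set in a fingerprint stored as bytes."""
    return sum(bin(byte).count("1") for byte in fp)

def tanimoto(fp1, fp2):
    """Tanimoto similarity: bits set in both / bits set in either."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = popcount(fp1) + popcount(fp2) - both
    return both / either if either else 0.0  # assumed convention for two empty fingerprints

# Toy 16-bit records, invented for illustration (not the 2048-bit ones above).
toy_fps = "#FPS1\n#num_bits=16\n0703\tA\n0700\tB\n0070\tC\n"
records = list(read_fps(io.StringIO(toy_fps)))
```

With these toy records, A and B share 3 of the 5 bits set in either one, giving a similarity of 0.6, while C shares no bits with A or B.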
Finally, I run the clustering program, with a low threshold so it does
something other than the trivial output of three clusters.
% python taylor_butina.py -t 0.3 examples.fps
0 true singletons
=>
1 false singletons
=> deuterated methane
1 clusters
phenol has 1 other members
=> benzene
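For anyone unfamiliar with the terms in that output, here is a small sketch of the general Taylor-Butina idea as I understand it. It is not the taylor_butina.py script itself. It assumes the neighbor lists (everything within the similarity threshold of each item) have already been computed. The function name, the tie-breaking by name, and the toy data are all made up for illustration. A "true singleton" has no neighbors at the threshold; a "false singleton" has neighbors, but all of them were already claimed by earlier clusters.

```python
def taylor_butina(neighbors):
    """Cluster items given a dict mapping item -> set of neighbors within the threshold.

    Returns (true_singletons, false_singletons, clusters), where clusters
    is a list of (centroid, members) pairs.
    """
    # Visit candidates from most to fewest neighbors; ties are broken by
    # name, which is an arbitrary choice made for this sketch.
    order = sorted(neighbors, key=lambda item: (-len(neighbors[item]), item))
    assigned = set()
    true_singletons, false_singletons, clusters = [], [], []
    for item in order:
        if item in assigned:
            continue  # already a member of an earlier cluster
        assigned.add(item)
        if not neighbors[item]:
            true_singletons.append(item)   # nothing is within the threshold
            continue
        free = sorted(neighbors[item] - assigned)
        if not free:
            false_singletons.append(item)  # had neighbors, but all were taken
            continue
        clusters.append((item, free))      # item becomes a cluster centroid
        assigned.update(free)
    return true_singletons, false_singletons, clusters

# Hypothetical neighbor lists, not derived from the fingerprints above.
toy_neighbors = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
    "e": set(),
}
result = taylor_butina(toy_neighbors)
```

On this toy input, "a" becomes a centroid with members "b" and "c", "d" ends up a false singleton because its only neighbor "c" was already taken, and "e" is a true singleton.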
This output format is rather ad hoc. I need to figure out what format people
want from a clustering tool, preferably one that other tools can import without
further conversion.
I'll be glad to hear any suggestions.
Cheers,
Andrew
[email protected]
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss