Hi Gonzalo,

On Apr 20, 2012, at 10:13 AM, Gonzalo Colmenarejo-Sanchez wrote:
> I have performed a similarity matrix calculation of 4176 X 4016 molecules
> with a program using the RDKit and it took 401 seconds. The same program with
> the same sets of molecules and using the Daylight toolkit took 19 seconds.

You might consider looking at my chemfp package, from
http://code.google.com/p/chem-fingerprints/ . It should be a lot faster than
RDKit's or Daylight's code when doing clustering because I use a combination
of optimized data structures, sub-linear algorithms, processor-dependent
popcount instructions, and multithreading.

For example, using the 881-bit PubChem fingerprints with a 0.8 similarity
cutoff, I can find the NxN similarity of 100,000 fingerprints in 46 seconds.
It takes about twice that time if the queries and targets are different.
Your data set is roughly 17 million comparisons, against 10 billion for the
100,000 x 100,000 case, so it should take a fraction of a second.

I've also heard from another Daylight user that fingerprint generation in
RDKit is markedly slower by comparison. On the plus side, it only needs to be
done once.

Cheers,

				Andrew
				[email protected]

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
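
As a rough check on the timing estimate above, here is a minimal Python sketch. It is not chemfp's code or API: the function names are hypothetical, fingerprints are stored as plain Python integers, and the estimate assumes that run time scales linearly with the number of query/target pairs and that an NxM search costs twice the NxN case, as the figures above suggest. The 46 second reference was measured with 881-bit fingerprints and a 0.8 cutoff, so a different fingerprint type or a full matrix without a cutoff may behave differently.

```python
def popcount(bits: int) -> int:
    """Count the set bits in a non-negative integer used as a bit vector."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = popcount(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return popcount(a & b) / union


def estimate_nxm_seconds(n_queries: int, n_targets: int,
                         ref_n: int = 100_000,
                         ref_nxn_seconds: float = 46.0) -> float:
    """Scale the quoted NxN timing to an NxM search (assumed linear in pairs)."""
    # Assumption from the text: NxM costs twice the symmetric NxN case.
    seconds_per_pair = 2.0 * ref_nxn_seconds / (ref_n * ref_n)
    return n_queries * n_targets * seconds_per_pair


if __name__ == "__main__":
    # Two toy 4-bit fingerprints sharing one bit out of three set overall.
    print(tanimoto(0b1100, 0b0110))
    # The 4176 x 4016 case from the question.
    print(estimate_nxm_seconds(4176, 4016))
```

Under those assumptions the 4176 x 4016 search comes out at about 0.15 seconds, which is consistent with "a fraction of a second". The pure-Python popcount here is only for illustration and is far slower than a hardware popcount instruction.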

