Re: [Rdkit-discuss] speed of Tanimoto similarity calculations

Andrew Dalke Fri, 20 Apr 2012 04:08:10 -0700

Hi Gonzalo,

On Apr 20, 2012, at 10:13 AM, Gonzalo Colmenarejo-Sanchez wrote:
> I have performed a similarity matrix calculation of 4176 X 4016 molecules 
> with a program using the RDKit and it took 401 seconds. The same program with 
> the same sets of molecules and using the Daylight toolkit took 19 seconds.


You might consider looking at my chemfp package, from 
http://code.google.com/p/chem-fingerprints/ .

It should be a lot faster than RDKit's or Daylight's code when doing clustering 
because I use a combination of optimized data structures, sub-linear 
algorithms, processor-dependent popcount instructions, and multithreading.
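
To make the popcount and sub-linear points concrete, here is a minimal 
pure-Python sketch. It is not chemfp's actual code or API. It assumes 
fingerprints are stored as Python integers, and all function names are 
hypothetical. The pruning step relies on the standard popcount bound: a target 
with b bits set can only reach a Tanimoto score of at least t against a query 
with a bits set if a*t <= b <= a/t, so targets sorted by popcount can be 
skipped in bulk.

```python
import math
from bisect import bisect_left, bisect_right

def popcount(fp):
    """Number of set bits in an integer fingerprint."""
    return bin(fp).count("1")

def tanimoto(fp1, fp2):
    """Tanimoto = popcount(A & B) / popcount(A | B)."""
    union = popcount(fp1 | fp2)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(fp1 & fp2) / union

def build_index(fps):
    """Sort targets by popcount so a popcount range can be found by bisection."""
    entries = sorted((popcount(fp), fp) for fp in fps)
    counts = [count for count, _ in entries]
    return counts, entries

def threshold_search(query, index, threshold):
    """Return (fingerprint, score) hits with score >= threshold (threshold > 0)."""
    counts, entries = index
    a = popcount(query)
    # The small epsilon keeps the bounds conservative under float rounding.
    min_count = math.ceil(a * threshold - 1e-9)
    max_count = math.floor(a / threshold + 1e-9)
    lo = bisect_left(counts, min_count)
    hi = bisect_right(counts, max_count)
    hits = []
    for _, fp in entries[lo:hi]:  # only targets inside the popcount window
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((fp, score))
    return hits
```

With a high cutoff such as 0.8 the popcount window is narrow, so most targets 
are never compared at all. A compiled implementation would replace 
bin(fp).count("1") with a hardware popcount instruction.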

For example, using the 881-bit PubChem fingerprints with a 0.8 similarity 
cutoff, I can compute the NxN similarity of 100,000 fingerprints in 46 seconds. 
Expect roughly twice that time when the queries and targets are different sets, 
which means your data set should take a fraction of a second.
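
As a rough back-of-envelope check of that estimate, the sketch below scales 
the 46-second figure by the number of fingerprint pairs. It assumes that time 
grows linearly with the pair count and that the same fingerprint type and 
cutoff are used, which is a simplification: sub-linear pruning and fixed 
startup costs break strict proportionality. The function name and parameters 
are hypothetical.

```python
def estimate_seconds(n_queries, n_targets,
                     ref_n=100_000, ref_seconds=46.0,
                     asymmetric_factor=2.0):
    """Scale the NxN reference timing to an n_queries x n_targets search."""
    ref_pairs = ref_n * ref_n          # pairs in the 100,000 x 100,000 case
    pairs = n_queries * n_targets      # 4176 x 4016 = 16,770,816 pairs
    # Double the reference time because queries and targets differ.
    return ref_seconds * asymmetric_factor * pairs / ref_pairs

print(round(estimate_seconds(4176, 4016), 2))  # about 0.15 seconds
```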

I've also heard from another Daylight user that fingerprint generation in RDKit 
is markedly slower than in Daylight. On the plus side, fingerprints only need 
to be generated once.

Cheers,


                                Andrew
                                [email protected]


