Dear all,

At some point over the last couple of years I managed to convince
myself that the RDKit clustering code had problems when used with more
than a couple of thousand data points. Yesterday I took a bit of time
to go back and test this theory.

I used a set of molecules from PubChem, generated fingerprints, used
those to create the distance matrix, and then did clustering using
Wards minimum variance approach. The code is at the end of this
message.

Here's a table with the timing results run on a linux box with 8GB of
RAM and a 2.67 GHz Xeon processor. A single thread/process is run (the
distance matrix calculation would be easy to run in parallel, but I
don't do it here).
It's only going to look reasonable if you use a fixed-width font.

| NumPts | distance calc time | clustering time | max memory |
|--------+--------------------+-----------------+------------|
|   1000 |               0.34 |            0.14 |            |
|   2000 |               1.33 |            0.87 |            |
|   4000 |               5.44 |            4.88 |            |
|   8000 |               22.9 |            21.9 | 600MB      |
|  12000 |               49.2 |            58.5 | 950MB      |
|  16000 |               94.0 |           125.0 | 1.4GB      |
|  20000 |              144.3 |           235.9 | 1.9GB      |
|  24000 |              214.0 |           379.0 | 2.6GB      |
|  28000 |              293.9 |           605.4 | 3.4GB      |

"distance calc time" is the time to generate the distance matrix.
"clustering time" is the time in the clustering routine. "max memory"
is a rough measure of the amount of memory required; this is really
determined by the size of the distance matrix.

My concerns were ungrounded: the code works fine with >20K points and
clustering 20K molecules only takes about 6 minutes and requires 1.9GB
of RAM. There is no doubt higher-performance clustering code out
there, but this really isn't too bad.

-greg

Here's the code:

logger.info("generating fps")
fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in ms]
ms = None
nPts = len(fps)
logger.info("generating distances for %d points"%nPts)
dm=numpy.zeros(((nPts*(nPts-1)/2),),numpy.double)
t1=time.time()
offset=0
for i,fpi in enumerate(fps):
    if i>0:
        ds = DataStructs.BulkTanimotoSimilarity(fpi,fps[:i])
        dm[offset:offset+len(ds)]=ds
        offset+=len(ds)
        if not i%500: logger.info("   done %d"%i)
t2=time.time()
logger.info("  done in %.2f seconds"%(t2-t1))
fps=None

logger.info("clustering %d points"%(nPts))
t1=time.time()
t=Murtagh.MurtaghDistCluster(dm,nPts,Murtagh.WARDS)
t2=time.time()
logger.info("  done in %.2f seconds"%(t2-t1))

------------------------------------------------------------------------------
The Go Parallel Website, sponsored by Intel - in partnership with Geeknet, 
is your hub for all things parallel software development, from weekly thought 
leadership blogs to news, videos, case studies, tutorials, tech docs, 
whitepapers, evaluation guides, and opinion stories. Check out the most 
recent posts - join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to