Re: [Rdkit-discuss] substructure search with fingerprints

Greg Landrum Tue, 28 May 2013 20:42:47 -0700

Hi Gonzalo,

On Tue, May 28, 2013 at 5:00 PM, Gonzalo Colmenarejo-Sanchez <
[email protected]> wrote:


> What’s the best way of doing fast (approximate) substructure searches in
> RDKit using fingerprints? I’m a bit confused about this topic. Any advice
> would be really appreciated.

The answer depends on what you want to do.

If you have one or more molecules and a single query and you want to know
if the query matches any of the molecules, the fastest approach is just to
do the substructure search directly (generating the fingerprints takes
longer than doing the individual searches).

If you have a set of molecules you would like to search through using
multiple queries or a set that is relatively static that you'd be searching
through more than once, you have a variety of options. I'm going to run
through some of the options from Python. If you want to do the same thing
in C++ or Java, I can provide a separate answer for that.

-----------------------------
1) Install PostgreSQL and the RDKit PostgreSQL cartridge and use that to do
the searches. This is heavyweight, but gets you something that's flexible,
relatively easy to use, and well suited to dealing with millions of
molecules.

-----------------------------
2) Give Riccardo's Chemicallite a try:
http://www.mail-archive.com/[email protected]/msg03077.html
This "cartridge" for sqlite is still in development, but the early results
that Riccardo shows look quite promising.

-----------------------------
3) Using the pandas integration in the new version of the RDKit, you can
easily work with sets of molecules and do efficient substructure searches:
In [46]: from rdkit import Chem

In [47]: from rdkit.Chem import PandasTools

In [48]: df = PandasTools.LoadSDF('lopac_pubchem_28March07.sdf',includeFingerprints=True)

In [49]: len(df)
Out[49]: 1232

In [50]: q = Chem.MolFromSmiles('c1nnccc1')

In [51]: subset = df[df['ROMol']>=q]

In [52]: len(subset)
Out[52]: 6

If you want to use this set of molecules in later Python sessions, you can
save the dataframe using Python's pickle module.

Needless to say, you'll need to have pandas installed (but it's great to
have installed anyway).


-----------------------------
4) If you want to avoid installing anything extra, you can do the
book-keeping and fingerprint tracking yourself with something like this:

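In [62]: from rdkit import Chem, DataStructs

In [63]: ms = [x for x in Chem.SDMolSupplier('lopac_pubchem_28March07.sdf') if x is not None]

In [64]: fps = [Chem.PatternFingerprint(x) for x in ms]

In [65]: def sss(ms,fps,q):
   ....:     res=[]
   ....:     # fingerprint of the query, computed once
   ....:     qfp = Chem.PatternFingerprint(q)
   ....:     for i,fp in enumerate(fps):
   ....:         # cheap screen: every bit set in the query must be set in the molecule
   ....:         if DataStructs.AllProbeBitsMatch(qfp,fp):
   ....:             # the screen can give false positives, so confirm with a real match
   ....:             if ms[i].HasSubstructMatch(q):
   ....:                 res.append(ms[i])
   ....:     return res
   ....: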

In [66]: subset=sss(ms,fps,Chem.MolFromSmiles('c1nnccc1'))

In [67]: len(subset)
Out[67]: 6
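
To make the screening logic in sss() concrete, here is a minimal pure-Python
sketch. It is not RDKit code: it assumes fingerprints can be modelled as plain
Python integers used as bit sets, and the names all_probe_bits_match, screen
and exact_match are hypothetical stand-ins for illustration. The point it
shows is that the bit test can only rule candidates out, so every survivor
still needs the real substructure match.

```python
def all_probe_bits_match(probe, target):
    """True if every bit set in probe is also set in target."""
    return (probe & target) == probe


def screen(items, fps, query_fp, exact_match):
    """Return the items that pass the bit screen and then the exact test."""
    hits = []
    for item, fp in zip(items, fps):
        # cheap test first: it can exclude a candidate but never confirm one
        if not all_probe_bits_match(query_fp, fp):
            continue
        # the expensive test runs only on the survivors
        if exact_match(item):
            hits.append(item)
    return hits


# toy data: small bit patterns stand in for pattern fingerprints
items = ["A", "B", "C"]
fps = [0b1110, 0b0110, 0b1011]
query_fp = 0b0110

checked = []  # records which items reached the exact test


def exact_match(item):
    # toy stand-in for HasSubstructMatch: only "B" really matches
    checked.append(item)
    return item == "B"


hits = screen(items, fps, query_fp, exact_match)
```

In this toy run "C" is rejected by the bit screen alone, while "A" passes the
screen but fails the exact test, which is the false-positive case the second
check in sss() is there to catch.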

You can pickle the lists ms and fps together to use them in later Python
sessions.
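
A minimal sketch of that pickling step, assuming ms and fps are the two
parallel lists built above. The helper names save_index and load_index and the
file name are hypothetical; the demo uses placeholder strings and integers so
that it runs without RDKit, and with real data you would pass the molecule and
fingerprint lists instead.

```python
import os
import pickle
import tempfile


def save_index(path, ms, fps):
    """Write the two parallel lists to one file so they stay in sync."""
    if len(ms) != len(fps):
        raise ValueError("ms and fps must have the same length")
    with open(path, "wb") as fh:
        pickle.dump((ms, fps), fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Read back the (ms, fps) pair written by save_index."""
    with open(path, "rb") as fh:
        ms, fps = pickle.load(fh)
    return ms, fps


# placeholders standing in for the molecule and fingerprint lists
demo_ms = ["mol-0", "mol-1"]
demo_fps = [0b1110, 0b0110]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lopac_index.pkl")  # hypothetical file name
    save_index(path, demo_ms, demo_fps)
    loaded_ms, loaded_fps = load_index(path)
```

Storing both lists in a single file keeps each fingerprint aligned with its
molecule. Only load pickle files you created yourself, since unpickling
untrusted data can execute arbitrary code.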


Note that solutions 3) and 4) need to have all the molecules and
fingerprints in memory at the same time, so dealing with large numbers of
molecules this way will not be particularly efficient unless you have a
*lot* of memory.


Does that help?
-greg

