Dear Nils,

I'm CC'ing the rdkit-discuss mailing list on my answer, since it's
probably of interest to more people. I'd appreciate it if, at least
for the duration of this conversation, you would subscribe to the list
and include it on your replies. Subscription information is here:
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

On Thu, Feb 10, 2011 at 4:20 PM, Nils Kriege
<[email protected]> wrote:
>
> thanks for providing your software RDKit. I have a question regarding
> the topological fingerprint defined in
> /GraphMol/Fingerprints/Fingerprints.h and the parameter
> 'branchedPaths'.
>
> What exactly is a branched path? Does in differ from a subgraph? Does
> it contain cycles?

A branched path is a subgraph. It can contain cycles. I used the
"branchedPaths" name in the fingerprinting code so that I could avoid
repeating the explanation of the difference between a path and
subgraph that one finds in $RDBASE/Code/GraphMol/Subgraphs/Subgraphs.h

> I have looked at the code. Paths and branched path are both identified
> by a set of bonds and are hashed in a similar fashion. The only
> difference for paths and branched paths seems to be 'bondNbrs', since
> branched paths contain bonds with more than two adjacent bonds.

Both paths and subgraphs (branched paths) are hashed the same, the
difference in the fingerprinting code is how they are identified.

The hash for each bond in a subgraph (or path) is determined by
combining: the number of neighbors that bond has in the subgraph, the
bond order, and the atomic numbers of the two atoms involved. The hash
for the entire subgraph is determined by sorting the list of bond
hashes, adding the count of the total number of atoms in the subgraph,
and then hashing that list.

> However, I don't think that the number of neighbors  stored for each
> bond uniquely defines the structure of a branched path.

It certainly doesn't, and it's not that hard to find counter-examples
where the hashing algorithm breaks down and you get collisions (I give
an example below). The simple approach used is certainly not a true
canonicalization of the subgraphs. If you have ideas for improvements,
I'd be more than happy to hear them; the primary constraint is that
the subgraph hashing algorithm should not be too computationally
intensive.

The current algorithm seems to work in practice, but that's more of a
qualitative feeling than something concrete

>
> Furthermore, does the hashing procedure distinguish, e.g., a path
> CCN=CC from CN=CCC? Both have the same bonds but are different
> paths...

These kinds of questions are, luckily, quite easy to answer:
In [13]: m1 = Chem.MolFromSmiles('CCN=CC')

In [14]: m2=Chem.MolFromSmiles('CN=CCC')

In [15]: list(Chem.RDKFingerprint(m1,minPath=4,nBitsPerHash=1).GetOnBits())
Out[15]: [65]

In [16]: list(Chem.RDKFingerprint(m2,minPath=4,nBitsPerHash=1).GetOnBits())
Out[16]: [1491]

These are different because the CN bond has two neighbors in the first
molecule and only one in the second (there's a CC bond that changes
too).

Here's a real collision:
In [23]: 
list(Chem.RDKFingerprint(Chem.MolFromSmiles('CC=CCCC'),minPath=5,nBitsPerHash=1).GetOnBits())
Out[23]: [1349]

In [24]: 
list(Chem.RDKFingerprint(Chem.MolFromSmiles('CCC=CCC'),minPath=5,nBitsPerHash=1).GetOnBits())
Out[24]: [1349]

>
> I would have expected a more sophisticated graph/tree canonization
> algorithm for branched paths. Maybe I misunderstood the code. Can you
> give me a hint on the hashing technique or is there a publication the
> procedure is based on?

The details are above.

For this application -- generating a fingerprint for molecule
similarity and filtering substructures -- it's not essential that the
subgraphs be exactly canonicalized. Obviously the closer one gets to a
true canonicalization the better, but it's definitely not critical.
That said: if you have computationally efficient ideas for improving
the canonicalization, it would be great to hear them; improved hashing
of the subgraphs would definitely improve the fingerprints.

Best Regards,
-greg

------------------------------------------------------------------------------
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
Pinpoint memory and threading errors before they happen.
Find and fix more than 250 security defects in the development cycle.
Locate bottlenecks in serial and parallel code that limit performance.
http://p.sf.net/sfu/intel-dev2devfeb
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to