Re: [Rdkit-discuss] RDKit topological fingerprints

Greg Landrum Sun, 20 Feb 2011 20:32:26 -0800

Dear Nils,

This weekend I realized that I had forgotten a point relevant to this
discussion:


On Sun, Feb 13, 2011 at 1:37 AM,  <[email protected]> wrote:
>
>> Both paths and subgraphs (branched paths) are hashed the same, the
>> difference in the fingerprinting code is how they are identified.
>>
>> The hash for each bond in a subgraph (or path) is determined by
>> combining: the number of neighbors that bond has in the subgraph,
>> the bond order, and the atomic numbers of the two atoms involved.
>> The hash for the entire subgraph is determined by sorting the list
>> of bond hashes, adding the count of the total number of atoms in
>> the subgraph, and then hashing that list.
>>
>> > However, I don't think that the number of neighbors  stored for
>> > each bond uniquely defines the structure of a branched path.
>>
>> It certainly doesn't, and it's not that hard to find
>> counter-examples where the hashing algorithm breaks down and you
>> get collisions (I give an example below). The simple approach used
>> is certainly not a true canonicalization of the subgraphs. If you
>> have ideas for improvements, I'd be more than happy to hear them;
>> the primary constraint is that the subgraph hashing algorithm
>> should not be too computationally intensive.
>>
>> The current algorithm seems to work in practice, but that's more of
>> a qualitative feeling than something concrete
>
> I think it's important to distinguish two types of "collisions": Those
> that are due to insufficient canonicalization and those that are due to
> hashing. The latter are unavoidable when using exhaustive enumeration
> of paths/subgraphs but their influence on the performance can be
> reduced, e.g. by using a good hash function, adjusting the fingerprint
> size etc.
> Collisions caused by insufficient canonicalization mean that some
> different structures can not be distinguished and thus not be fully
> exploited by the fingerprint, regardless of the fingerprint size used.
> Nevertheless, I also believe that the algorithm used in RDKit works
> well.

I forgot that there is already code to do a better job of
canonicalizing subgraphs in the RDKit:
$RDBASE/Code/GraphMol/Subgraphs/SubgraphUtils.cpp contains the
function calcPathDiscriminators(). This returns a three-tuple for the
path that can be hashed to give a better key than the current
fingerprinting code. The fact that this isn't used in the
fingerprinter is something of an accident of history plus some
premature optimization (this algorithm is slower than what is
currently used in the fingerprinting code, but I guess it's a small
delta compared to the time required to find the subgraphs themselves).

Best Regards,
-greg

------------------------------------------------------------------------------
The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE:
Pinpoint memory and threading errors before they happen.
Find and fix more than 250 security defects in the development cycle.
Locate bottlenecks in serial and parallel code that limit performance.
http://p.sf.net/sfu/intel-dev2devfeb
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] RDKit topological fingerprints

Reply via email to