Dear Nils, I'm CC'ing the rdkit-discuss mailing list on my answer, since it's probably of interest to more people. I'd appreciate it if, at least for the duration of this conversation, you would subscribe to the list and include it on your replies. Subscription information is here: https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
On Thu, Feb 10, 2011 at 4:20 PM, Nils Kriege <[email protected]> wrote: > > thanks for providing your software RDKit. I have a question regarding > the topological fingerprint defined in > /GraphMol/Fingerprints/Fingerprints.h and the parameter > 'branchedPaths'. > > What exactly is a branched path? Does in differ from a subgraph? Does > it contain cycles? A branched path is a subgraph. It can contain cycles. I used the "branchedPaths" name in the fingerprinting code so that I could avoid repeating the explanation of the difference between a path and subgraph that one finds in $RDBASE/Code/GraphMol/Subgraphs/Subgraphs.h > I have looked at the code. Paths and branched path are both identified > by a set of bonds and are hashed in a similar fashion. The only > difference for paths and branched paths seems to be 'bondNbrs', since > branched paths contain bonds with more than two adjacent bonds. Both paths and subgraphs (branched paths) are hashed the same, the difference in the fingerprinting code is how they are identified. The hash for each bond in a subgraph (or path) is determined by combining: the number of neighbors that bond has in the subgraph, the bond order, and the atomic numbers of the two atoms involved. The hash for the entire subgraph is determined by sorting the list of bond hashes, adding the count of the total number of atoms in the subgraph, and then hashing that list. > However, I don't think that the number of neighbors stored for each > bond uniquely defines the structure of a branched path. It certainly doesn't, and it's not that hard to find counter-examples where the hashing algorithm breaks down and you get collisions (I give an example below). The simple approach used is certainly not a true canonicalization of the subgraphs. If you have ideas for improvements, I'd be more than happy to hear them; the primary constraint is that the subgraph hashing algorithm should not be too computationally intensive. The current algorithm seems to work in practice, but that's more of a qualitative feeling than something concrete > > Furthermore, does the hashing procedure distinguish, e.g., a path > CCN=CC from CN=CCC? Both have the same bonds but are different > paths... These kinds of questions are, luckily, quite easy to answer: In [13]: m1 = Chem.MolFromSmiles('CCN=CC') In [14]: m2=Chem.MolFromSmiles('CN=CCC') In [15]: list(Chem.RDKFingerprint(m1,minPath=4,nBitsPerHash=1).GetOnBits()) Out[15]: [65] In [16]: list(Chem.RDKFingerprint(m2,minPath=4,nBitsPerHash=1).GetOnBits()) Out[16]: [1491] These are different because the CN bond has two neighbors in the first molecule and only one in the second (there's a CC bond that changes too). Here's a real collision: In [23]: list(Chem.RDKFingerprint(Chem.MolFromSmiles('CC=CCCC'),minPath=5,nBitsPerHash=1).GetOnBits()) Out[23]: [1349] In [24]: list(Chem.RDKFingerprint(Chem.MolFromSmiles('CCC=CCC'),minPath=5,nBitsPerHash=1).GetOnBits()) Out[24]: [1349] > > I would have expected a more sophisticated graph/tree canonization > algorithm for branched paths. Maybe I misunderstood the code. Can you > give me a hint on the hashing technique or is there a publication the > procedure is based on? The details are above. For this application -- generating a fingerprint for molecule similarity and filtering substructures -- it's not essential that the subgraphs be exactly canonicalized. Obviously the closer one gets to a true canonicalization the better, but it's definitely not critical. That said: if you have computationally efficient ideas for improving the canonicalization, it would be great to hear them; improved hashing of the subgraphs would definitely improve the fingerprints. Best Regards, -greg ------------------------------------------------------------------------------ The ultimate all-in-one performance toolkit: Intel(R) Parallel Studio XE: Pinpoint memory and threading errors before they happen. Find and fix more than 250 security defects in the development cycle. Locate bottlenecks in serial and parallel code that limit performance. http://p.sf.net/sfu/intel-dev2devfeb _______________________________________________ Rdkit-discuss mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

