Sorry: typo "...quickly eliminate some duplicates using non-canonical SMILES..."
On Wed, 7 Nov 2018 at 09:32, Noel O'Boyle <baoille...@gmail.com> wrote: > I don't believe that there is a better way to eliminate duplicates apart > from using canonical forms (e.g. InChI or canonical SMILES, though InChI > also changes the structure so use with care). You can quickly eliminate > some duplicates using canonical SMILES or a matrix comparison as you > suggest, but to actually be sure you will still have to use a canonical > form. > > I'm not sure exactly what you're doing but another idea would be to avoid > generating some structures in the first place using graph symmetry > (available in OB). For example, don't apply the same reaction to symmetry > equivalent atoms. > > - Noel > > On Fri, 2 Nov 2018 at 18:08, Xianghai Sheng <xsh...@ucmerced.edu> wrote: > >> Hi Andrew and Noel, >> >> Thank you for the inspiring answers! >> >> Andrew: Sorry I didn't make it clear, but 1/2 of the time is only for a >> random job I tested and I don't know the scaling of it. Its cost may grow >> faster than the searching algorithm. I think it is beneficial to know the >> formal time complexity in this sense. But I agree that it is mostly useful >> when thinking about large N. I may have to just measure the actual scaling >> and deal with it. >> >> Noel: That's a brilliant idea! I do think most of the time is wasted on >> canonicalizing the same graph in my code. I will give it a try. On another >> note, the purpose of canonicalization in my code is to eliminate duplicates >> in my reaction path network. Is there a better way to do it? Molecules in >> my code are represented by bond-electron matrix. A naive approach would be >> to sort rows in both matrices and compare them row by row. (I think this >> takes O(N^2*logN)) >> >> Best, >> Xianghai >> >> On Fri, Nov 2, 2018 at 3:31 AM Noel O'Boyle <baoille...@gmail.com> wrote: >> >>> First of all, I'd recommend using the development code rather than the >>> latest release. Also, ensure that hydrogens are suppressed. You may also >>> consider trying to avoid multiple canonicalisations of the same graph by >>> just writing out a non-canonical SMILES, and only at the end collating them >>> and canonicalising. >>> >>> While I agree with Andrew, it is also true that Open Babel's >>> canonicalisation implementation is slow. This is partly due to the fact >>> that we calculate all of the graph invariants up-front rather than lazily >>> to split ties. In particular, we calculate the pairwise distance of every >>> atom from every other atom (as far as I remember). This is O(N**2). >>> >>> >>> Regards, >>> - Noel >>> >>> On Fri, 2 Nov 2018 at 08:02, Andrew Dalke <da...@dalkescientific.com> >>> wrote: >>> >>>> On Nov 2, 2018, at 01:33, Xianghai Sheng <xsh...@ucmerced.edu> wrote: >>>> > I am trying to figure out the time complexity (Big O) of converting >>>> from OBMol to canonical smiles. >>>> >>>> I don't think that's a useful way to think about your problem. >>>> >>>> Graph canonicalization is "at least as computationally hard as the >>>> graph isomorphism problem." >>>> https://en.wikipedia.org/wiki/Graph_canonization , and graph >>>> isomorphism is believed to be quasi-polynomial. >>>> >>>> That said, I believe the actual algorithm Open Babel uses may have >>>> exponential time for some graphs. >>>> >>>> However, most molecular graphs don't reach the worst-time behavior. >>>> >>>> Furthermore, time complexity is mostly useful when thinking about large >>>> N. Molecules typically have small N, and there may be lower-order terms >>>> which dominate the times for small N. >>>> >>>> If you are interested in the topic of graph canonicalization algorithms >>>> in general, you might start with the nauty papers; see >>>> http://pallini.di.uniroma1.it/ >>>> >>>> > I want to know this because getting canonical smiles takes up half of >>>> the running time of my reaction path searching program. >>>> >>>> How would knowing the time complexity help you? >>>> >>>> While this may be obvious, if canonicalization takes 1/2 the time, then >>>> you'll only get 2x speedup even if that time goes to 0. How much work are >>>> you willing to do for a 2x speedup? >>>> >>>> Might it be possible to do about the same amount of work to parallelize >>>> your algorithm? For example, if your reaction path search generates a set >>>> of nodes which must all be canonicalized in order to evaluate them, then >>>> you can do all of those canonicalizations in parallel. >>>> >>>> Cheers, >>>> >>>> Andrew >>>> da...@dalkescientific.com >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> OpenBabel-discuss mailing list >>>> OpenBabel-discuss@lists.sourceforge.net >>>> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss >>>> >>>
_______________________________________________ OpenBabel-discuss mailing list OpenBabel-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/openbabel-discuss