On 24/06/2010 23:37, Rob Soe wrote:
> Hi all,
> I have two lists of molecules. The first list contains testing molecules
> (test.sdf) and the second contains training molecules (train.sdf).
> I would like to compare each test molecule to all the training molecule
> and calculate a corresponding Tanimoto similarity score. (I implemented
> it in my code and it is super slow as it is O(n^2) ). I use
> GetFingerprint() and Tanimoto() functions for such purpose. After the
> comparison, I picked up the kth most similar molecules to the test
> molecule and predict something for the test molecule. I am trying to
> make things a bit faster.
> So I tried the following babel command:
>
> babel test_mol.smi train_mols.sdf -ofpt
>
> But the problem is that I have to manually replace each testing molecule
> smile information for every time I run the command. In other words, the
> script will only compare the first molecule in the 'test_mol.smi' to all
> the molecules in 'train_mols.sdf'. So even if I have a list of testing
> molecules in the 'test_mol.smi', the script will not compare them to the
> training molecules except the first one in the list.
> Is there a way or trick I can use so that I can compare all my testing
> molecules to the training molecules and get a list of Tc scores?
> Thanks so much for your help!

I think you are going to have to use some kind of programming (Python
scripting or even C++) to do things that the command line doesn't
provide. Working with two sets of molecules is something it does not
currently do.

If you are interested only in the closest matches to a test molecule,
the following may be a cleaner way of doing it.

Make a fast search index from the training set:

    babel train.sdf -ofs

For a given test molecule, find, say, the 10 molecules with the largest
Tanimotos with it:

    babel train.fs -S test_mol.sdf -aat10 result.smi

The second 'a' adds the Tanimoto to the result molecule's title.

You will still have to replace the test molecule or its file name on
each run. If there are not too many test molecules, you can split
test.sdf:

    babel test.sdf test_mol.sdf -m

This puts each molecule into a different file. Maybe you could then
iterate over those files using shell or batch scripting.

This isn't any help to you, but it happens that I am currently writing
code to prepare an N by N matrix of Tanimoto coefficients (on the way to
selecting a diverse set of molecules). As others have said, in making
this matrix you can't beat O(N^2), but you can make each Tanimoto
calculation faster. Normally each Tanimoto requires two bit counts (the
bits in common and the bits in either fingerprint), but only the first
is needed if the bit count of each fingerprint is known, because the
second is then the sum of the two individual counts minus the bits in
common. So the individual counts are pre-calculated, which is O(N). I'm
also using a 16 bit look-up table for these bit counts, which is faster
than the way currently in OB.

Chris
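
To make the bit-count idea above concrete, here is a minimal sketch in pure Python. It is not the code referred to in the message and it does not use Open Babel: fingerprints are modelled as plain Python integers used as bit sets, and the names POPCOUNT16, popcount, tanimoto and top_k are hypothetical, chosen only for illustration. Returning 0.0 when both fingerprints are empty is likewise an assumption of this sketch.

```python
# Assumption: a fingerprint is a non-negative Python int used as a bit set.
# Real fingerprints would first have to be converted into this form.

# 16-bit look-up table: POPCOUNT16[w] is the number of set bits in w.
POPCOUNT16 = [0] * 65536
for w in range(1, 65536):
    POPCOUNT16[w] = POPCOUNT16[w >> 1] + (w & 1)


def popcount(fp):
    """Count the set bits of fp, 16 bits at a time, via the table."""
    count = 0
    while fp:
        count += POPCOUNT16[fp & 0xFFFF]
        fp >>= 16
    return count


def tanimoto(fp_a, count_a, fp_b, count_b):
    """Tanimoto coefficient using a single bit count per pair.

    count_a and count_b are the pre-calculated bit counts of fp_a and
    fp_b, so only the bits in common have to be counted here.
    """
    common = popcount(fp_a & fp_b)
    union = count_a + count_b - common
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return common / union if union else 0.0


def top_k(test_fps, train_fps, k):
    """For each test fingerprint, return the k best (score, train_index)
    pairs, highest score first (ties broken by lower index)."""
    # Pre-calculate the training bit counts once: O(N).
    train_counts = [popcount(fp) for fp in train_fps]
    results = []
    for test_fp in test_fps:
        test_count = popcount(test_fp)
        scored = [
            (tanimoto(test_fp, test_count, fp, cnt), idx)
            for idx, (fp, cnt) in enumerate(zip(train_fps, train_counts))
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        results.append(scored[:k])
    return results
```

The overall work is still one Tanimoto per test/training pair; the sketch only shows how pre-calculating the per-fingerprint counts removes one of the two bit counts from each pair.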